Kernel Buffer Cache in Linux: Complete Guide

 

 

🗄️ Kernel Buffer Cache

Why your write() doesn’t go to disk immediately — and why that’s actually a good thing.

Every time you call write(), do you think the data goes straight to disk? It doesn’t. Linux puts the data in a memory region called the kernel buffer cache first, then writes to disk later in the background. This page explains exactly how that works, why it exists, and what it means for your programs.

1. What Is the Kernel Buffer Cache?

The kernel buffer cache (also called the page cache in Linux 2.4+) is a region of RAM managed by the kernel. It acts as a middle layer between your program and the disk.

When you call write(fd, "hello", 5), the kernel copies those 5 bytes from your process memory into a kernel buffer. write() then returns immediately — it does NOT wait for the disk. The kernel schedules the actual disk write to happen later, in the background.

Similarly, when you call read(), the kernel checks: “Is this data already in the buffer cache?” If yes, it copies from cache to your buffer — no disk access needed. This is very fast. If not, it reads from disk into the cache, then copies to your buffer.

How write() Works Internally

💻 Your Process: write("hello", 5)
   ↓ syscall
📦 Kernel Buffer Cache: data sits here in RAM; write() returns HERE ✅
   ↓ background flush
💽 Disk: data arrives later

write() returns as soon as data is in the kernel buffer. The disk write happens asynchronously.

Code Example: Basic write() and the Buffer Cache

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    int fd = open("test.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) { perror("open"); return 1; }

    /* This write() returns almost immediately.
       Data goes to the kernel buffer cache, NOT to disk yet. */
    const char *msg = "Hello, buffer cache!\n";
    ssize_t bytes = write(fd, msg, strlen(msg));
    if (bytes == -1) { perror("write"); close(fd); return 1; }

    /* %zd is the correct printf conversion for ssize_t */
    printf("write() returned: %zd bytes written (but not on disk yet!)\n", bytes);

    close(fd);  /* close() does NOT force the data to disk */

    /* To guarantee data is on disk, you must call fsync() (see Part 3) */
    return 0;
}

Code Example: How read() Benefits from the Cache

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int fd = open("bigfile.bin", O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    char buf[4096];
    ssize_t n;

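    /* First read: if the data is not cached yet, the kernel reads it
       from disk into the buffer cache before copying it to buf */
    n = read(fd, buf, sizeof(buf));
    if (n == -1) { perror("read"); close(fd); return 1; }
    printf("First read: got %zd bytes (from disk, unless already cached)\n", n);

    /* If we read the same data again (seek back), the kernel
       serves it from cache: much faster, no disk I/O */
    lseek(fd, 0, SEEK_SET);
    n = read(fd, buf, sizeof(buf));
    if (n == -1) { perror("read"); close(fd); return 1; }
    printf("Second read: got %zd bytes (from cache, no disk!)\n", n);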

    /* For sequential access, Linux also does read-ahead:
       it prefetches the NEXT blocks into cache automatically */

    close(fd);
    return 0;
}

2. Why Buffer Size Matters Enormously

Every read() or write() is a system call — your program switches from user mode to kernel mode. This switch takes time, even though it’s faster than a disk access.

If you write 1 byte at a time, you make 1 million system calls to write 1 MB. If you write 4096 bytes at a time, you make only 256 system calls for the same 1 MB. The total data is the same, but the overhead is completely different.

Elapsed Time: Copying 100 MB with Different Buffer Sizes

Time to copy a 100 MB file (lower = better). Real data from a Linux 2.6.30 ext2 system:

1 byte buffer: 107 sec
64 bytes: 2.19 sec
512 bytes: 2.06 sec
4096 bytes (optimal): 2.05 sec
65536 bytes: 2.06 sec
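📌 Key insight: Going from 1-byte to 4096-byte buffers gives a roughly 50× speedup (107 sec down to 2.05 sec). Elapsed time is already nearly flat by 512 bytes because disk I/O has become the bottleneck; beyond 4096 bytes (one filesystem block), larger buffers only trim a little more CPU time.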

Full Performance Data Table (100 MB file copy)

Buffer Size (bytes) | Elapsed (sec) | Total CPU (sec) | User CPU (sec) | System CPU (sec) | Approx. System Calls
1                   | 107.43        | 107.32          | 8.20           | 99.12            | ~100 million
16                  | 7.50          | 7.14            | 0.51           | 6.63             | ~6.25 million
128                 | 2.16          | 1.59            | 0.11           | 1.48             | ~781 thousand
512                 | 2.06          | 1.03            | 0.05           | 0.98             | ~195 thousand
4096                | 2.05          | 0.38            | 0.01           | 0.38             | ~24 thousand
65536               | 2.06          | 0.32            | 0.00           | 0.32             | ~1.5 thousand
💡 Practical Rule: For file I/O, use a buffer size that is a multiple of the filesystem block size — typically 4096 bytes (4 KB). This gives near-optimal performance. Going larger helps slightly but the gain diminishes quickly.

Code Example: Measuring the Impact of Buffer Size

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Write 100 MB using a given buffer size and measure time */
void benchmark_write(const char *filename, size_t buf_size, size_t total_bytes) {
    int fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) { perror("open"); return; }

    char *buf = malloc(buf_size);
    if (!buf) { perror("malloc"); close(fd); return; }
    memset(buf, 'A', buf_size);   /* Fill buffer with dummy data */

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    size_t written = 0;
    while (written < total_bytes) {
        size_t chunk = (total_bytes - written < buf_size) ?
                       (total_bytes - written) : buf_size;
        ssize_t n = write(fd, buf, chunk);
        if (n == -1) { perror("write"); break; }
        written += n;
    }

    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec) +
                     (end.tv_nsec - start.tv_nsec) / 1e9;

    printf("Buffer size: %6zu bytes | Elapsed: %.3f sec | Calls: ~%zu\n",
           buf_size, elapsed, total_bytes / buf_size);

    free(buf);
    close(fd);
    unlink(filename);  /* Clean up */
}

int main(void) {
    size_t total = 10 * 1024 * 1024;  /* 10 MB for quick demo */
    printf("Writing %zu MB with different buffer sizes:\n\n", total / (1024*1024));

    size_t sizes[] = {1, 64, 512, 4096, 65536};
    for (int i = 0; i < 5; i++) {
        benchmark_write("/tmp/bench_test.bin", sizes[i], total);
    }
    return 0;
}

/* Compile: gcc -O2 -o benchmark benchmark.c
   Run:     ./benchmark
   Expected output (roughly):
   Buffer size:      1 bytes | Elapsed: 8.700 sec | Calls: ~10485760
   Buffer size:     64 bytes | Elapsed: 0.120 sec | Calls: ~163840
   Buffer size:    512 bytes | Elapsed: 0.040 sec | Calls: ~20480
   Buffer size:   4096 bytes | Elapsed: 0.020 sec | Calls: ~2560
   Buffer size:  65536 bytes | Elapsed: 0.018 sec | Calls: ~160
*/

3. Kernel Read-Ahead

For sequential file reads, the Linux kernel doesn’t wait for you to ask for the next block — it reads it in advance into the cache. This is called read-ahead.

By the time you call read() for the next chunk, the data is already in the page cache. Your read() returns instantly from cache. This is why sequential reads of large files can be very fast even if the file wasn’t cached before.

You can influence this behavior using posix_fadvise() — covered in Part 4.

Code Example: File Copy with Efficient Buffer

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE 4096   /* One filesystem block = optimal */

int copy_file(const char *src, const char *dst) {
    int in_fd  = open(src, O_RDONLY);
    if (in_fd == -1) { perror("open src"); return -1; }

    int out_fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out_fd == -1) { perror("open dst"); close(in_fd); return -1; }

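    char buf[BUF_SIZE];
    ssize_t nr, nw;
    long total = 0;
    int status = 0;   /* becomes -1 on any read, write, or close error */

    while ((nr = read(in_fd, buf, BUF_SIZE)) > 0) {
        char *ptr = buf;
        ssize_t remaining = nr;
        /* Handle partial writes (important for robustness) */
        while (remaining > 0) {
            nw = write(out_fd, ptr, remaining);
            if (nw == -1) { perror("write"); status = -1; goto done; }
            ptr += nw;
            remaining -= nw;
            total += nw;
        }
    }
    if (nr == -1) { perror("read"); status = -1; }

done:
    printf("Copied %ld bytes\n", total);
    close(in_fd);
    /* close() can report a deferred write error, so check it */
    if (close(out_fd) == -1) { perror("close dst"); status = -1; }
    return status;   /* a failed write no longer reports success */
}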

int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "Usage: %s source dest\n", argv[0]);
        return 1;
    }
    return copy_file(argv[1], argv[2]);
}

/* Compile: gcc -o copy_file copy_file.c
   Run: ./copy_file /etc/passwd /tmp/passwd_copy */
📝 Important: Notice the inner loop for partial writes. A write() may write fewer bytes than you asked for. This is rare for regular files but can happen with sockets, pipes, and when writing large chunks. Always handle it.

4. Buffer Cache vs Page Cache

In old UNIX systems, there was a separate buffer cache just for disk blocks. Linux 2.4+ unified this into the page cache, which handles all file data — disk file content, memory-mapped files, etc. — in one place.

The kernel can use as much otherwise-unused RAM as is available for the page cache. If memory is tight, it writes dirty pages to disk and reclaims them. Kernel flusher threads (pdflush before Linux 2.6.32, per-device writeback threads since) write back pages that have been dirty for more than 30 seconds (configurable via /proc/sys/vm/dirty_expire_centisecs).

Code Example: Check and Trigger Cache Flush (Shell)

# From the shell: check page cache usage

# Show memory split between free, buffers, and cache
free -h

# Show dirty data in the page cache (/proc/meminfo reports it in kB)
grep -i dirty /proc/meminfo

# Check the dirty expiry threshold (in centiseconds, default = 3000 = 30 sec)
cat /proc/sys/vm/dirty_expire_centisecs

# Ask the kernel to write all dirty buffers to disk (no root needed)
sync

# Drop clean cached pages to force fresh reads from disk (useful for benchmarking).
# Run sync first: dirty pages are not dropped.
echo 1 | sudo tee /proc/sys/vm/drop_caches   # drop page cache only
echo 3 | sudo tee /proc/sys/vm/drop_caches   # drop page cache + dentries and inodes

# After drop_caches, your next file read will come from disk (much slower)

🎯 Interview Questions – Kernel Buffer Cache

Q1. Does write() guarantee data is on disk when it returns? Explain.
No. write() copies data from your process memory to the kernel buffer cache (page cache) and returns. The kernel flushes to disk asynchronously. To guarantee disk write, you need fsync() or the O_SYNC flag.
Q2. Why does using a 1-byte buffer for file I/O perform so much worse than a 4096-byte buffer?
Each call to read() or write() is a system call, which requires a switch from user mode to kernel mode and back. Writing 1 MB with a 1-byte buffer means ~1 million system calls. With a 4096-byte buffer it’s only 256 calls. The total disk I/O is the same, but the system call overhead is enormous at small buffer sizes.
Q3. What is the optimal buffer size for file I/O and why?
Typically 4096 bytes (4 KB) — the filesystem block size. At this size, the overhead of system calls becomes negligible compared to the data transfer cost. Going larger improves things only slightly since disk I/O is now the bottleneck, not system call overhead.
Q4. Two processes write to the same file — one via direct I/O and one normally. What happens?
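This is a dangerous scenario. The normal-I/O process goes through the page cache, while the direct I/O process bypasses it. Linux makes an effort to keep the two views consistent, but the open(2) manual page advises against mixing O_DIRECT and normal I/O on the same file: coherency should not be relied on, a process may see stale data, and throughput usually suffers. Avoid this situation in production code.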
Q5. What is read-ahead and how does the kernel implement it?
Read-ahead is when the kernel proactively loads upcoming file blocks into the page cache before your process asks for them, based on the assumption that you’ll read sequentially. The default read-ahead window is 128 KB. You can influence this with posix_fadvise() using POSIX_FADV_SEQUENTIAL (doubles window) or POSIX_FADV_RANDOM (disables it).
Q6. What is a “dirty page” in the kernel page cache?
A dirty page is a page in the page cache that has been modified (for example by write()) but has not yet been written back to disk. The kernel tracks a dirty flag for each page. Kernel flusher threads (pdflush in kernels before 2.6.32) wake up periodically and write dirty pages back; pages that have been dirty for longer than dirty_expire_centisecs are written out on the next wakeup.
Q7. What is the difference between the buffer cache and the page cache?
Historically, Unix had two separate caches: a buffer cache for disk blocks and a page cache for file pages. Linux 2.4 unified them into one page cache that handles everything. The term “buffer cache” is still used informally for historical reasons.

✅ Summary of Part 1

  • write() copies data to the kernel page cache and returns. Disk write happens later.
  • read() fetches from the page cache if available; otherwise reads disk into cache first.
  • The kernel does read-ahead automatically for sequential reads.
  • Buffer size drastically affects performance: use 4 KB or larger for file I/O.
  • The page cache is sized by available RAM — no fixed limit.
  • Dirty pages are written back by kernel flusher threads (pdflush in older kernels) once they are ~30 seconds old, or sooner when memory is tight.
