Kernel Buffer Cache in Linux: Complete Guide
🗄️ Kernel Buffer Cache
Why your write() doesn’t go to disk immediately — and why that’s actually a good thing.
Every time you call write(), do you think the data goes straight to disk? It doesn’t. Linux puts the data in a memory region called the kernel buffer cache first, then writes to disk later in the background. This page explains exactly how that works, why it exists, and what it means for your programs.
1. What Is the Kernel Buffer Cache?
The kernel buffer cache (also called the page cache in Linux 2.4+) is a region of RAM managed by the kernel. It acts as a middle layer between your program and the disk.
When you call write(fd, "hello", 5), the kernel copies those 5 bytes from your process memory into a kernel buffer. write() then returns immediately — it does NOT wait for the disk. The kernel schedules the actual disk write to happen later, in the background.
Similarly, when you call read(), the kernel checks: “Is this data already in the buffer cache?” If yes, it copies from cache to your buffer — no disk access needed. This is very fast. If not, it reads from disk into the cache, then copies to your buffer.
How write() Works Internally
write() returns as soon as data is in the kernel buffer. The disk write happens asynchronously.
Code Example: Basic write() and the Buffer Cache
```c
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    int fd = open("test.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) { perror("open"); return 1; }

    /* This write() returns almost immediately.
       Data goes to kernel buffer cache, NOT disk yet. */
    const char *msg = "Hello, buffer cache!\n";
    ssize_t bytes = write(fd, msg, strlen(msg));
    if (bytes == -1) { perror("write"); close(fd); return 1; }
    printf("write() returned: %zd bytes written (but not on disk yet!)\n", bytes);

    close(fd); /* close() does NOT flush the data to disk */

    /* To guarantee data is on disk, you must call fsync() before close(): see Part 3 */
    return 0;
}
```
Code Example: How read() Benefits from the Cache
```c
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int fd = open("bigfile.bin", O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    char buf[4096];
    ssize_t n;

    /* First read: if the data is not cached yet, the kernel reads it
       from disk into the buffer cache */
    n = read(fd, buf, sizeof(buf));
    if (n == -1) { perror("read"); close(fd); return 1; }
    printf("First read: got %zd bytes (from disk unless already cached)\n", n);

    /* If we read the same data again (seek back), the kernel
       serves it from cache: much faster, no disk I/O */
    lseek(fd, 0, SEEK_SET);
    n = read(fd, buf, sizeof(buf));
    if (n == -1) { perror("read"); close(fd); return 1; }
    printf("Second read: got %zd bytes (from cache, no disk!)\n", n);

    /* For sequential access, Linux also does read-ahead:
       it prefetches the NEXT blocks into cache automatically */
    close(fd);
    return 0;
}
```
2. Why Buffer Size Matters Enormously
Every read() or write() is a system call — your program switches from user mode to kernel mode. This switch takes time, even though it’s faster than a disk access.
If you write 1 byte at a time, you make 1 million system calls to write 1 MB. If you write 4096 bytes at a time, you make only 256 system calls for the same 1 MB. The total data is the same, but the overhead is completely different.
Elapsed Time: Copying 100 MB with Different Buffer Sizes
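Time to copy a 100 MB file (lower = better). Real data from a Linux 2.6.30 ext2 system. Elapsed time levels off at about 2 seconds from 512 bytes upward, while CPU time keeps falling as the number of system calls drops: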
Full Performance Data Table (100 MB file copy)
| Buffer Size (bytes) | Elapsed (sec) | Total CPU (sec) | User CPU (sec) | System CPU (sec) | Approx. read() calls (plus as many write() calls) |
|---|---|---|---|---|---|
| 1 | 107.43 | 107.32 | 8.20 | 99.12 | ~100 million |
| 16 | 7.50 | 7.14 | 0.51 | 6.63 | ~6.25 million |
| 128 | 2.16 | 1.59 | 0.11 | 1.48 | ~781 thousand |
| 512 | 2.06 | 1.03 | 0.05 | 0.98 | ~195 thousand |
| 4096 | 2.05 | 0.38 | 0.01 | 0.38 | ~24 thousand |
| 65536 | 2.06 | 0.32 | 0.00 | 0.32 | ~1.5 thousand |
Code Example: Measuring the Impact of Buffer Size
```c
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Write total_bytes using a given buffer size and measure time */
void benchmark_write(const char *filename, size_t buf_size, size_t total_bytes) {
    int fd = open(filename, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) { perror("open"); return; }

    char *buf = malloc(buf_size);
    if (!buf) { perror("malloc"); close(fd); return; }
    memset(buf, 'A', buf_size); /* Fill buffer with dummy data */

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    size_t written = 0;
    while (written < total_bytes) {
        size_t chunk = (total_bytes - written < buf_size) ?
                       (total_bytes - written) : buf_size;
        ssize_t n = write(fd, buf, chunk);
        if (n == -1) { perror("write"); break; }
        written += n;
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed = (end.tv_sec - start.tv_sec) +
                     (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("Buffer size: %6zu bytes | Elapsed: %.3f sec | Calls: ~%zu\n",
           buf_size, elapsed, total_bytes / buf_size);

    free(buf);
    close(fd);
    unlink(filename); /* Clean up */
}

int main(void) {
    size_t total = 10 * 1024 * 1024; /* 10 MB for quick demo */
    printf("Writing %zu MB with different buffer sizes:\n\n", total / (1024*1024));

    size_t sizes[] = {1, 64, 512, 4096, 65536};
    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        benchmark_write("/tmp/bench_test.bin", sizes[i], total);
    }
    return 0;
}

/* Compile: gcc -O2 -o benchmark benchmark.c
   Run: ./benchmark
   Expected output (roughly; timings vary by machine):
   Buffer size:      1 bytes | Elapsed: 8.700 sec | Calls: ~10485760
   Buffer size:     64 bytes | Elapsed: 0.120 sec | Calls: ~163840
   Buffer size:    512 bytes | Elapsed: 0.040 sec | Calls: ~20480
   Buffer size:   4096 bytes | Elapsed: 0.020 sec | Calls: ~2560
   Buffer size:  65536 bytes | Elapsed: 0.018 sec | Calls: ~160
*/
```
3. Kernel Read-Ahead
For sequential file reads, the Linux kernel doesn’t wait for you to ask for the next block — it reads it in advance into the cache. This is called read-ahead.
By the time you call read() for the next chunk, the data is already in the page cache. Your read() returns instantly from cache. This is why sequential reads of large files can be very fast even if the file wasn’t cached before.
You can influence this behavior using posix_fadvise() — covered in Part 4.
Code Example: File Copy with Efficient Buffer
```c
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE 4096 /* Matches the common 4 KB block size; a good default */

int copy_file(const char *src, const char *dst) {
    int in_fd = open(src, O_RDONLY);
    if (in_fd == -1) { perror("open src"); return -1; }

    int out_fd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out_fd == -1) { perror("open dst"); close(in_fd); return -1; }

    char buf[BUF_SIZE];
    ssize_t nr, nw;
    long total = 0;
    int status = 0; /* becomes -1 on any read, write, or close error */

    while ((nr = read(in_fd, buf, BUF_SIZE)) > 0) {
        char *ptr = buf;
        ssize_t remaining = nr;

        /* Handle partial writes (important for robustness) */
        while (remaining > 0) {
            nw = write(out_fd, ptr, remaining);
            if (nw == -1) { perror("write"); status = -1; goto done; }
            ptr += nw;
            remaining -= nw;
            total += nw;
        }
    }
    if (nr == -1) { perror("read"); status = -1; }

done:
    printf("Copied %ld bytes\n", total);
    close(in_fd);
    if (close(out_fd) == -1) { perror("close dst"); status = -1; }
    return status;
}

int main(int argc, char *argv[]) {
    if (argc != 3) {
        fprintf(stderr, "Usage: %s source dest\n", argv[0]);
        return 1;
    }
    return copy_file(argv[1], argv[2]) == 0 ? 0 : 1;
}

/* Compile: gcc -o copy_file copy_file.c
   Run: ./copy_file /etc/passwd /tmp/passwd_copy */
```
write() may write fewer bytes than you asked for. This is rare for regular files but can happen with sockets, pipes, and when writing large chunks. Always handle it.
4. Buffer Cache vs Page Cache
In old UNIX systems, there was a separate buffer cache just for disk blocks. Linux 2.4+ unified this into the page cache, which handles all file data — disk file content, memory-mapped files, etc. — in one place.
The kernel can use as much otherwise unused RAM as it has for the page cache. If memory is tight, it writes dirty pages to disk and reclaims them. Kernel flusher threads (the pdflush thread on kernels before 2.6.32, per-device writeback threads since then) write out pages that have been dirty for more than 30 seconds (configurable via /proc/sys/vm/dirty_expire_centisecs).
Code Example: Check and Trigger Cache Flush (Shell)
```bash
# From the shell: check page cache usage

# Show memory split between free, buffers, and cache
free -h

# Show dirty data in the page cache (reported in kB, not in pages)
grep -i dirty /proc/meminfo

# Check the dirty expiry threshold (in centiseconds, default = 3000 = 30 sec)
cat /proc/sys/vm/dirty_expire_centisecs

# Flush all dirty buffers to disk (does not require root)
sync

# Drop clean cached data to force fresh reads from disk (useful for benchmarking).
# Run sync first: drop_caches does not discard dirty pages.
echo 1 | sudo tee /proc/sys/vm/drop_caches   # drop page cache only
echo 3 | sudo tee /proc/sys/vm/drop_caches   # drop page cache + dentries and inodes

# After drop_caches, your next file read will be from disk: much slower
```
✅ Summary of Part 1
- write() copies data to the kernel page cache and returns. Disk write happens later.
- read() fetches from the page cache if available; otherwise reads disk into cache first.
- The kernel does read-ahead automatically for sequential reads.
- Buffer size drastically affects performance: use 4 KB or larger for file I/O.
- The page cache is sized by available RAM, with no fixed limit.
- Dirty pages are flushed by kernel flusher threads (pdflush on kernels before 2.6.32) once they have been dirty for ~30 seconds (or when memory is tight).
