Controlling Kernel Buffering in Linux: Complete Guide
fsync, fdatasync, sync, and O_SYNC — making sure data actually hits the disk.
We've seen that write() puts data in the kernel buffer cache, not on disk. If only your program crashes, the kernel still holds that data and writes it out later. But if the whole system crashes or the power goes out before the flush, the data in the kernel buffer is lost. For critical applications (databases, journaling systems, financial data) you must guarantee data is on disk. This part covers the system calls and flags that let you do exactly that.
1. Two Types of Synchronized I/O
POSIX defines two levels of guaranteed I/O. The difference is in the metadata (information about a file: timestamps, size, permissions, etc.).
📋 Data Integrity Completion
For a write: the data itself + only the metadata needed to read it back (e.g., file size if file grew) are flushed to disk. Timestamps and other non-essential metadata are not required.
Implemented by: fdatasync() · O_DSYNC
📋 File Integrity Completion
For a write: the data + all updated file metadata (including timestamps, permissions, etc.) are flushed to disk. This is a superset of data integrity.
Implemented by: fsync() · O_SYNC
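As a rough illustration of the two levels, the sketch below hides the choice behind one helper. The names `flush_file` and `append_record`, and the `need_all_metadata` parameter, are hypothetical and exist only for this example.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper (not a standard API): pick the integrity level.
   need_all_metadata != 0 -> file integrity completion (fsync)
   need_all_metadata == 0 -> data integrity completion (fdatasync) */
int flush_file(int fd, int need_all_metadata) {
    return need_all_metadata ? fsync(fd) : fdatasync(fd);
}

/* Hypothetical usage: append one record and flush it at the requested level.
   Returns 0 on success, -1 on any failure. */
int append_record(const char *path, const char *buf, size_t len,
                  int need_all_metadata) {
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1) return -1;
    int ok = write(fd, buf, len) == (ssize_t)len &&
             flush_file(fd, need_all_metadata) == 0;
    if (close(fd) == -1) ok = 0;
    return ok ? 0 : -1;
}
```

A caller that does not care about timestamps would pass 0 for `need_all_metadata`; one that needs every attribute on disk would pass 1.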
In many cases fdatasync() is enough and faster. Use fsync() when you also need metadata consistency (e.g., a file's modification time must be accurate for recovery).
2. fsync() — Flush Data + All Metadata
#include <unistd.h>
int fsync(int fd);
/* Returns: 0 on success, -1 on error */
/* BLOCKS until data is physically on disk (or disk cache) */
fsync(fd) forces all data and all metadata for the file (or device) pointed to by fd to be written to the underlying disk. The call does not return until the disk confirms the write.
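After fsync() returns successfully, the kernel has handed the data to the storage device and the device has reported completion. This is the guarantee applications rely on to survive a crash or power failure, provided the drive honours cache-flush requests. A drive that acknowledges writes still sitting in a volatile cache can lose them anyway.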
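fsync() can also fail, and the reason matters. The sketch below separates "this descriptor cannot be synced at all" from "the sync was attempted and failed". The enum and the function name `checked_fsync` are hypothetical. The errno values are the ones documented for fsync() on Linux.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch */
enum sync_result { SYNC_OK = 0, SYNC_UNSUPPORTED = 1, SYNC_FAILED = 2 };

enum sync_result checked_fsync(int fd) {
    if (fsync(fd) == 0)
        return SYNC_OK;
    /* EINVAL / EROFS: fd refers to an object that does not support
       synchronization, such as a pipe, FIFO, or socket */
    if (errno == EINVAL || errno == EROFS)
        return SYNC_UNSUPPORTED;
    /* EIO, ENOSPC, EBADF, ...: assume the data may not be durable */
    return SYNC_FAILED;
}
```

Treating `SYNC_FAILED` as "the record is not safely stored" is the conservative choice. Whether a retry is meaningful depends on the filesystem and kernel version.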
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
/* Safely write a critical record to disk */
int write_critical_record(const char *filename, const char *data, size_t len) {
    int fd = open(filename, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1) { perror("open"); return -1; }

    /* Step 1: Write data to kernel buffer cache */
    ssize_t n = write(fd, data, len);
    if (n == -1) { perror("write"); close(fd); return -1; }
    if ((size_t)n != len) {
        fprintf(stderr, "Partial write: only %zd of %zu bytes\n", n, len);
        close(fd); return -1;
    }

    /* Step 2: Force kernel buffer → physical disk
       fsync() blocks until the disk controller acknowledges the write.
       This is the guarantee that data survives a crash. */
    if (fsync(fd) == -1) {
        perror("fsync");
        close(fd);
        return -1;
    }
    close(fd);
    printf("Record written and synced to disk\n");
    return 0;
}

int main(void) {
    const char record[] = "TXN:001 AMOUNT:5000 STATUS:COMMITTED\n";
    return write_critical_record("/var/log/transactions.log", record, strlen(record));
}
/* Compile: gcc -o write_critical write_critical.c
   Note: fsync() on a typical spinning disk may take 5-15ms per call.
   On SSD it's faster, but still much slower than a plain write(). */
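The example above treats a short write as a fatal error. A common alternative is to keep writing until every byte has been accepted, and only then call fsync(). The helper name `write_all` below is hypothetical. This is a minimal sketch, not the only reasonable policy.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: call write() repeatedly until all len bytes are
   accepted by the kernel. Returns 0 on success, -1 on error (errno is set
   by write()). */
int write_all(int fd, const char *buf, size_t len) {
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n == -1) {
            if (errno == EINTR)
                continue;          /* interrupted before any byte: retry */
            return -1;
        }
        buf += n;                  /* skip the bytes already written */
        len -= (size_t)n;
    }
    return 0;
}
```

A caller would replace the single write() in Step 1 with `write_all(fd, data, len)` and keep the fsync() in Step 2 unchanged.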
3. fdatasync() — Flush Data Only (Faster)
#include <unistd.h>
int fdatasync(int fd);
/* Returns: 0 on success, -1 on error */
fdatasync() is like fsync() but more selective: it only flushes the file metadata that is strictly necessary to read the data back. The file modification timestamp, for example, need not be flushed.
This can save an extra disk operation (data and metadata often live in different disk locations), which can make fdatasync() noticeably faster than fsync() when doing many updates to a file. The saving is largest when the file size does not change, because a changed size is metadata that must still be written.
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
/* Database-style log writer: uses fdatasync for performance */
typedef struct {
    int fd;
    long record_count;
} LogDB;

int logdb_open(LogDB *db, const char *path) {
    db->fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    db->record_count = 0;
    return db->fd;
}

int logdb_write(LogDB *db, const char *key, const char *value) {
    char entry[256];
    int len = snprintf(entry, sizeof(entry),
                       "%ld|%s|%s\n", (long)time(NULL), key, value);
    /* snprintf() returns the length it wanted; reject entries that did not fit,
       otherwise write() below would read past the end of the buffer */
    if (len < 0 || (size_t)len >= sizeof(entry)) {
        fprintf(stderr, "entry too long\n"); return -1;
    }
    /* Write to kernel buffer */
    if (write(db->fd, entry, (size_t)len) != (ssize_t)len) {
        perror("write"); return -1;
    }
    db->record_count++;
    /* Use fdatasync() instead of fsync():
       - Does NOT sync the file's mtime (we don't care about that)
       - Does sync the file data and file size (we DO care about those)
       - Saves ~1 disk seek compared to fsync() */
    if (fdatasync(db->fd) == -1) {
        perror("fdatasync"); return -1;
    }
    return 0;
}

int main(void) {
    LogDB db;
    if (logdb_open(&db, "/tmp/mydb.log") == -1) { perror("open"); return 1; }
    if (logdb_write(&db, "sensor_temp", "36.7") == -1 ||
        logdb_write(&db, "sensor_press", "1013") == -1 ||
        logdb_write(&db, "ble_rssi", "-72") == -1) {
        close(db.fd);
        return 1;
    }
    close(db.fd);
    printf("Wrote %ld records\n", db.record_count);
    return 0;
}
4. sync() — Flush Everything (System-Wide)
#include <unistd.h>
void sync(void);
/* No return value. Flushes ALL dirty buffers system-wide. */
sync() schedules all dirty kernel buffers in the entire system to be written to disk. Unlike fsync() and fdatasync(), it is not file-specific and does not take a file descriptor.
In Linux, sync() waits until all data has been sent to the disk. In some other UNIX systems, it may return before the writes complete (it just schedules them).
Common uses: before system shutdown, before unmounting a filesystem, in the sync shell command.
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>   /* open() and the O_* flags */
int main(void) {
    /* Write something first so there is dirty data to flush */
    int fd = open("/tmp/test.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) { perror("open"); return 1; }
    if (write(fd, "test data\n", 10) != 10) {
        perror("write"); close(fd); return 1;
    }
    close(fd);
    /* sync() flushes ALL dirty buffers in the system to disk.
       This is a heavy operation; do not call it in a tight loop. */
    printf("Calling sync()...\n");
    sync();
    printf("sync() completed: all dirty pages flushed\n");
    /* In a shell: the 'sync' command does the same thing */
    /* $ sync */
    return 0;
}
/* When to use sync() vs fsync():
sync() — system administrator use, shutdown scripts, unmounting
fsync(fd) — application use, when you care about a specific file
fdatasync() — application use, faster alternative when metadata timing is unimportant */
5. O_SYNC Flag — Make Every Write Synchronous
Instead of manually calling fsync() after every write, you can open a file with O_SYNC. This makes every write() call automatically wait for the data to be on disk before returning. This is equivalent to calling fsync() after every single write().
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
int main(void) {
    /* Open with O_SYNC: every write() will block until disk confirms */
    int fd = open("/tmp/sync_test.txt",
                  O_WRONLY | O_CREAT | O_TRUNC | O_SYNC,
                  0644);
    if (fd == -1) { perror("open"); return 1; }
    const char *line = "Critical record\n";
    /* This write() will NOT return until data is on disk.
       This may take 5-15ms on a spinning disk.
       No need to call fsync(): O_SYNC handles it automatically. */
    if (write(fd, line, strlen(line)) == -1) {
        perror("write"); close(fd); return 1;
    }
    printf("write() returned: data is on disk\n");
    close(fd);
    return 0;
}
/* Combine O_SYNC with O_APPEND for an atomic append-only log:
int fd = open("audit.log", O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
Each write() is atomic (won't interleave with other processes) AND synchronous */
Performance Impact: With vs Without O_SYNC
Writing 1 MB (1,000,000 bytes) to a new file on ext2 (real benchmark data from the book):
| Buffer size | Elapsed (without O_SYNC) | CPU (without O_SYNC) | Elapsed (with O_SYNC) | CPU (with O_SYNC) | Slowdown factor (elapsed) |
|---|---|---|---|---|---|
| 1 byte | 0.73s | 0.73s | 1030s | 98.8s | ~1400× |
| 16 bytes | 0.05s | 0.05s | 65.0s | 0.40s | ~1300× |
| 256 bytes | 0.02s | 0.02s | 4.07s | 0.03s | ~200× |
| 4096 bytes | 0.01s | 0.01s | 0.34s | 0.03s | ~34× |
6. O_DSYNC and O_RSYNC
| Flag | What it does | Equivalent to | Availability on Linux |
|---|---|---|---|
| O_SYNC | Every write() blocks until data + ALL metadata is on disk | fsync() after each write() | Always |
| O_DSYNC | Every write() blocks until data + essential metadata is on disk | fdatasync() after each write() | Linux 2.6.33+ |
| O_RSYNC | Used with O_SYNC or O_DSYNC; applies sync behavior to reads too | n/a | Not implemented (glibc defines it with the same value as O_SYNC) |
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
int main(void) {
    /* O_DSYNC: faster than O_SYNC because it skips non-essential metadata.
       Best for database-style writes where you don't care about timestamps. */
    int fd = open("/tmp/dsync_test.txt",
                  O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC,
                  0644);
    if (fd == -1) { perror("open O_DSYNC"); return 1; }
    const char *data = "Record: fdatasync-level guarantee\n";
    if (write(fd, data, strlen(data)) == -1) {
        perror("write"); close(fd); return 1;
    }
    /* write() returns only after: data + file size are on disk
       (mtime is NOT guaranteed; that's the O_DSYNC vs O_SYNC difference) */
    close(fd);
    printf("O_DSYNC write complete\n");
    return 0;
}
/* Performance recommendation:
Need durability? Use O_DSYNC (faster) for most cases.
Need full metadata? Use O_SYNC or manually call fsync().
Better approach: Use large buffers + occasional fsync() — more control. */
7. Best Practices for Durable Writes
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
/* PATTERN 1: Write many records, fsync() at logical boundaries (best performance) */
int pattern_batch_fsync(int fd, const char **records, int count) {
    for (int i = 0; i < count; i++) {
        size_t len = strlen(records[i]);
        if (write(fd, records[i], len) != (ssize_t)len) {
            perror("write"); return -1;
        }
    }
    /* Single fsync() for all the records above: much more efficient
       than one fsync() per record */
    if (fsync(fd) == -1) { perror("fsync"); return -1; }
    printf("Batch of %d records committed to disk\n", count);
    return 0;
}

/* PATTERN 2: Atomic update: write new data to temp file, then rename */
/* This ensures the file is never in a corrupt state */
int atomic_file_update(const char *target, const char *data, size_t len) {
    char tmp[256];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp.%d", target, (int)getpid());
    if (n < 0 || (size_t)n >= sizeof(tmp)) {
        fprintf(stderr, "path too long\n"); return -1;
    }
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) { perror("open tmp"); return -1; }
    /* Write new content to temp file */
    if (write(fd, data, len) != (ssize_t)len) {
        perror("write"); close(fd); unlink(tmp); return -1;
    }
    /* Sync the new data to disk */
    if (fsync(fd) == -1) {
        perror("fsync"); close(fd); unlink(tmp); return -1;
    }
    close(fd);
    /* Atomic rename: either old file exists or new file exists, never corrupt */
    if (rename(tmp, target) == -1) {
        perror("rename"); unlink(tmp); return -1;
    }
    /* Also sync the directory that contains target (not the current working
       directory) to make the rename persistent */
    char dir[256];
    const char *slash = strrchr(target, '/');
    if (slash == NULL)        snprintf(dir, sizeof(dir), ".");
    else if (slash == target) snprintf(dir, sizeof(dir), "/");
    else snprintf(dir, sizeof(dir), "%.*s", (int)(slash - target), target);
    int dir_fd = open(dir, O_RDONLY);
    if (dir_fd != -1) { fsync(dir_fd); close(dir_fd); }
    return 0;
}

int main(void) {
    /* Pattern 2 demo */
    const char *new_config = "[server]\nport=8080\nhost=0.0.0.0\n";
    if (atomic_file_update("/etc/myapp/config.ini", new_config, strlen(new_config)) == -1)
        return 1;
    return 0;
}
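Pattern 1 above syncs once per call. A long-running writer can apply the same idea continuously by syncing after every N records. The sketch below shows one way to do that. The `BatchWriter` type and the `batch_*` function names are hypothetical, and the batch size is a tuning assumption.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical batching writer: one fdatasync() per `batch_size` records */
typedef struct {
    int fd;
    int batch_size;
    int pending;    /* records written since the last sync */
    long syncs;     /* number of fdatasync() calls issued so far */
} BatchWriter;

int batch_open(BatchWriter *w, const char *path, int batch_size) {
    w->fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    w->batch_size = batch_size > 0 ? batch_size : 1;
    w->pending = 0;
    w->syncs = 0;
    return w->fd == -1 ? -1 : 0;
}

/* Flush whatever is pending; a no-op when nothing was written */
int batch_commit(BatchWriter *w) {
    if (w->pending == 0) return 0;
    if (fdatasync(w->fd) == -1) return -1;
    w->pending = 0;
    w->syncs++;
    return 0;
}

/* Append one record; sync automatically when the batch is full */
int batch_append(BatchWriter *w, const char *record) {
    size_t len = strlen(record);
    if (write(w->fd, record, len) != (ssize_t)len) return -1;
    if (++w->pending >= w->batch_size) return batch_commit(w);
    return 0;
}
```

The trade-off is explicit: up to `batch_size - 1` records written since the last commit can be lost in a crash, so callers should invoke `batch_commit()` at every point where durability is actually required.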
Databases typically call fdatasync() for durability, then do batch writes to data files. They do not call fsync() after every single record; that would be impossibly slow.
✅ Summary of Part 3
- fsync(fd): flush data + all metadata. Guarantees disk write. Slower.
- fdatasync(fd): flush data + essential metadata only. Faster. Prefer this for databases.
- sync(): flush everything system-wide. Use for shutdown scripts, not application code.
- O_SYNC: every write() blocks until full disk write. Very slow with small buffers.
- O_DSYNC: every write() blocks until data + essential metadata are on disk. Faster than O_SYNC.
- Best pattern: batch writes + periodic fdatasync() or fsync().
- The atomic rename pattern (temp file + rename) prevents file corruption on crash.
