Controlling Kernel Buffering in Linux: Complete Guide

 

fsync, fdatasync, sync, and O_SYNC — making sure data actually hits the disk.

We’ve seen that write() puts data in the kernel buffer cache, not on disk. But what if the system crashes, or the power goes out? Data still sitting in the kernel buffer is lost. (A crash of your program alone does not lose it: once write() has returned, the kernel holds the data and will flush it eventually.) For critical applications such as databases, journaling systems, and financial data, you must guarantee data is on disk. This part covers the system calls and flags that let you do exactly that.

1. Two Types of Synchronized I/O

POSIX defines two levels of guaranteed I/O. The difference is in the metadata (information about a file: timestamps, size, permissions, etc.).

📋 Data Integrity Completion

For a write: the data itself + only the metadata needed to read it back (e.g., file size if file grew) are flushed to disk. Timestamps and other non-essential metadata are not required.

Implemented by: fdatasync() · O_DSYNC

📋 File Integrity Completion

For a write: the data + all updated file metadata (including timestamps, permissions, etc.) are flushed to disk. This is a superset of data integrity.

Implemented by: fsync() · O_SYNC

📌 Which to use? For most cases (e.g., a database writing records), fdatasync() is enough and faster. Use fsync() when you also need metadata consistency (e.g., a file’s modification time must be accurate for recovery).

2. fsync() — Flush Data + All Metadata

#include <unistd.h>

int fsync(int fd);
/* Returns: 0 on success, -1 on error */
/* Blocks until the device reports the data written (it may still sit in the drive's own write cache) */

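fsync(fd) forces all data and all metadata for the file (or device) referred to by fd to be written to the underlying disk. The call does not return until the device reports that the transfer is complete.

After fsync() returns successfully, the kernel has handed the data to the storage device, so it survives a crash of the program or of the operating system. Surviving a power failure also depends on the hardware: a drive with a volatile write cache may acknowledge a write before the data reaches stable media, unless that cache is flushed as well.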

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>

/* Safely write a critical record to disk */
int write_critical_record(const char *filename, const char *data, size_t len) {
    int fd = open(filename, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1) { perror("open"); return -1; }

    /* Step 1: Write data to kernel buffer cache */
    ssize_t n = write(fd, data, len);
    if (n == -1) { perror("write"); close(fd); return -1; }
    if ((size_t)n != len) {
        fprintf(stderr, "Partial write: only %zd of %zu bytes\n", n, len);
        close(fd); return -1;
    }

    /* Step 2: Force kernel buffer → physical disk
       fsync() blocks until the disk controller acknowledges the write.
       This is the guarantee that data survives a crash. */
    if (fsync(fd) == -1) {
        perror("fsync");
        close(fd);
        return -1;
    }

    close(fd);
    printf("Record written and synced to disk\n");
    return 0;
}

int main(void) {
    const char record[] = "TXN:001 AMOUNT:5000 STATUS:COMMITTED\n";
    return write_critical_record("/var/log/transactions.log", record, strlen(record));
}

/* Compile: gcc -o write_critical write_critical.c
   Note: fsync() on a typical spinning disk may take 5-15ms per call.
   On SSD it's faster, but still much slower than a plain write(). */
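The example above treats a short write as a failure. A common alternative is to keep writing until the whole record has been accepted, and only then call fsync(). The following is a minimal sketch of that idea; write_all() and append_durably() are hypothetical helper names, not part of any library.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all len bytes have been
   accepted by the kernel, retrying after short writes and EINTR. */
int write_all(int fd, const char *buf, size_t len) {
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n == -1) {
            if (errno == EINTR) continue;   /* interrupted before writing: retry */
            return -1;                      /* real error, errno is set */
        }
        done += (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: append a record, then force it to the device.
   Returns 0 on success, -1 on error (errno describes the first failure). */
int append_durably(const char *path, const char *buf, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1) return -1;
    if (write_all(fd, buf, len) == -1 || fsync(fd) == -1) {
        int saved = errno;      /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return close(fd);
}
```

A caller would use it much like write_critical_record(), for example append_durably("/tmp/records.log", line, strlen(line)), and check for a return value of -1.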

3. fdatasync() — Flush Data Only (Faster)

#include <unistd.h>

int fdatasync(int fd);
/* Returns: 0 on success, -1 on error */

fdatasync() is like fsync() but smarter: it only flushes the file metadata that is strictly necessary to read the data back. The file modification timestamp, for example, is not flushed.

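This can save an extra disk operation (data and metadata often live in different disk locations), which makes fdatasync() noticeably faster than fsync() when a file is updated many times. If a write changes the file size, the new size is still flushed, because it is needed to read the data back.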

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Database-style log writer — uses fdatasync for performance */
typedef struct {
    int fd;
    long record_count;
} LogDB;

int logdb_open(LogDB *db, const char *path) {
    db->fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    db->record_count = 0;
    return db->fd;
}

int logdb_write(LogDB *db, const char *key, const char *value) {
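    char entry[256];
    int len = snprintf(entry, sizeof(entry),
                       "%ld|%s|%s\n", (long)time(NULL), key, value);
    if (len < 0 || (size_t)len >= sizeof(entry)) {
        /* snprintf() returns the length it WANTED to write; reject entries
           that did not fit instead of reading past the end of the buffer */
        fprintf(stderr, "logdb_write: entry too long\n"); return -1;
    }

    /* Write to kernel buffer */
    if (write(db->fd, entry, (size_t)len) != (ssize_t)len) {
        perror("write"); return -1;
    }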
    db->record_count++;

    /* Use fdatasync() instead of fsync():
       - Does NOT sync the file's mtime (we don't care about that)
       - Does sync the file data and file size (we DO care about those)
       - Saves ~1 disk seek compared to fsync() */
    if (fdatasync(db->fd) == -1) {
        perror("fdatasync"); return -1;
    }
    return 0;
}

int main(void) {
    LogDB db;
    if (logdb_open(&db, "/tmp/mydb.log") == -1) return 1;

    logdb_write(&db, "sensor_temp",   "36.7");
    logdb_write(&db, "sensor_press",  "1013");
    logdb_write(&db, "ble_rssi",      "-72");

    close(db.fd);
    printf("Wrote %ld records\n", db.record_count);
    return 0;
}

4. sync() — Flush Everything (System-Wide)

#include <unistd.h>

void sync(void);
/* No return value. Flushes ALL dirty buffers system-wide. */

sync() schedules all dirty kernel buffers in the entire system to be written to disk. Unlike fsync() and fdatasync(), it is not file-specific and does not take a file descriptor.

In Linux, sync() waits until all data has been sent to the disk. In some other UNIX systems, it may return before the writes complete (it just schedules them).

Common uses: before system shutdown, before unmounting a filesystem, in the sync shell command.

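#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void) {
    /* Write something to the kernel buffer cache first */
    int fd = open("/tmp/test.txt", O_WRONLY|O_CREAT|O_TRUNC, 0644);
    if (fd == -1) { perror("open"); return 1; }
    if (write(fd, "test data\n", 10) != 10) { perror("write"); close(fd); return 1; }
    close(fd);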

    /* sync() flushes ALL dirty buffers in the system to disk.
       This is a heavy operation — do not call in a tight loop. */
    printf("Calling sync()...\n");
    sync();
    printf("sync() completed — all dirty pages flushed\n");

    /* In a shell: the 'sync' command does the same thing */
    /* $ sync */
    return 0;
}

/* When to use sync() vs fsync():
   sync()      — system administrator use, shutdown scripts, unmounting
   fsync(fd)   — application use, when you care about a specific file
   fdatasync() — application use, faster alternative when metadata timing is unimportant */

5. O_SYNC Flag — Make Every Write Synchronous

Instead of manually calling fsync() after every write, you can open a file with O_SYNC. This makes every write() call automatically wait for the data to be on disk before returning. This is equivalent to calling fsync() after every single write().

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* Open with O_SYNC — every write() will block until disk confirms */
    int fd = open("/tmp/sync_test.txt",
                  O_WRONLY | O_CREAT | O_TRUNC | O_SYNC,
                  0644);
    if (fd == -1) { perror("open"); return 1; }

    const char *line = "Critical record\n";

    /* This write() will NOT return until data is on disk.
       This may take 5-15ms on a spinning disk.
       No need to call fsync(): O_SYNC handles it automatically. */
    if (write(fd, line, strlen(line)) != (ssize_t)strlen(line)) {
        perror("write"); close(fd); return 1;
    }
    printf("write() returned: data is on disk\n");

    close(fd);
    return 0;
}

/* Combine O_SYNC with O_APPEND for an atomic append-only log:
   int fd = open("audit.log", O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
   Each write() is atomic (won't interleave with other processes) AND synchronous */
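The append-only log idea in the comment above can be sketched as a pair of small helpers. audit_open() and audit_log() are hypothetical names chosen for illustration, and the entry format is an assumption. The point is that each entry is formatted into a buffer first and then emitted with a single write() call, so the whole line is appended in one step.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: open an append-only, synchronous audit log.
   Returns a file descriptor, or -1 on error. */
int audit_open(const char *path) {
    return open(path, O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0600);
}

/* Hypothetical helper: format one entry and emit it with a single write().
   Returns 0 on success, -1 on error or if the entry does not fit. */
int audit_log(int fd, const char *user, const char *action) {
    char line[256];
    int len = snprintf(line, sizeof(line), "user=%s action=%s\n", user, action);
    if (len < 0 || (size_t)len >= sizeof(line)) return -1;  /* would be truncated */

    /* O_APPEND: the kernel positions the write at end of file.
       O_SYNC:   write() returns only after the entry has been flushed. */
    return write(fd, line, (size_t)len) == (ssize_t)len ? 0 : -1;
}
```

Because every call pays for a synchronous flush, this suits low-volume logs such as audit trails better than high-throughput logging.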

Performance Impact: With vs Without O_SYNC

Writing 1 MB (1,000,000 bytes) to a new file on ext2 (real benchmark data from the book):

Buffer size | Without O_SYNC          | With O_SYNC             | Slowdown factor
            | Elapsed     | CPU      | Elapsed     | CPU       | (elapsed time)
1 byte      | 0.73s       | 0.73s    | 1030s       | 98.8s     | ~1400×
16 bytes    | 0.05s       | 0.05s    | 65.0s       | 0.40s     | ~1300×
256 bytes   | 0.02s       | 0.02s    | 4.07s       | 0.03s     | ~200×
4096 bytes  | 0.01s       | 0.01s    | 0.34s       | 0.03s     | ~34×

🚨 Key takeaway: O_SYNC with a 1-byte buffer is 1400× slower than plain write(). Even with a 4096-byte buffer, it’s 34× slower. This is because every write() must wait for disk hardware. Never use O_SYNC with tiny buffers.

6. O_DSYNC and O_RSYNC

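Flag    | What it does                                                          | Equivalent to                 | Availability on Linux
O_SYNC  | Every write() blocks until data + ALL metadata is on disk             | fsync() after each write()    | Flag always accepted; full file-integrity behavior since Linux 2.6.33 (earlier kernels mostly gave O_DSYNC behavior)
O_DSYNC | Every write() blocks until data + essential metadata is on disk       | fdatasync() after each write() | Implemented as a distinct flag since Linux 2.6.33
O_RSYNC | Used with O_SYNC or O_DSYNC: applies the same sync behavior to reads  | (none)                        | Not implemented on Linux; glibc defines it with the same value as O_SYNC
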
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* O_DSYNC: faster than O_SYNC because it skips non-essential metadata.
       Best for database-style writes where you don't care about timestamps. */
    int fd = open("/tmp/dsync_test.txt",
                  O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC,
                  0644);
    if (fd == -1) { perror("open O_DSYNC"); return 1; }

    const char *data = "Record: fdatasync-level guarantee\n";
    if (write(fd, data, strlen(data)) == -1) { perror("write"); close(fd); return 1; }
    /* write() returns only after: data + file size are on disk
       (mtime is NOT guaranteed — that's the O_DSYNC vs O_SYNC difference) */

    close(fd);
    printf("O_DSYNC write complete\n");
    return 0;
}

/* Performance recommendation:
   Need durability?   Use O_DSYNC (faster) for most cases.
   Need full metadata? Use O_SYNC or manually call fsync().
   Better approach:   Use large buffers + occasional fsync() — more control. */

7. Best Practices for Durable Writes

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>

/* PATTERN 1: Write many records, fsync() at logical boundaries (best performance) */
int pattern_batch_fsync(int fd, const char **records, int count) {
    for (int i = 0; i < count; i++) {
        size_t len = strlen(records[i]);
        if (write(fd, records[i], len) != (ssize_t)len) { perror("write"); return -1; }
    }
    /* Single fsync() for all the records above: much more efficient
       than one fsync() per record */
    if (fsync(fd) == -1) { perror("fsync"); return -1; }
    printf("Batch of %d records committed to disk\n", count);
    return 0;
}

/* PATTERN 2: Atomic update — write new data to temp file, then rename */
/* This ensures the file is never in a corrupt state */
int atomic_file_update(const char *target, const char *data, size_t len) {
    char tmp[256];
    snprintf(tmp, sizeof(tmp), "%s.tmp.%d", target, (int)getpid());

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) { perror("open tmp"); return -1; }

    /* Write new content to temp file */
    if (write(fd, data, len) != (ssize_t)len) {
        perror("write"); close(fd); unlink(tmp); return -1;
    }

    /* Sync the new data to disk */
    if (fsync(fd) == -1) {
        perror("fsync"); close(fd); unlink(tmp); return -1;
    }
    close(fd);

    /* Atomic rename: either old file exists or new file exists, never corrupt */
    if (rename(tmp, target) == -1) {
        perror("rename"); unlink(tmp); return -1;
    }

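    /* Also sync the directory that CONTAINS the target, so the rename
       itself is persistent (syncing "." would be the wrong directory
       whenever target lives somewhere else) */
    char dir[256];
    snprintf(dir, sizeof(dir), "%s", target);
    char *slash = strrchr(dir, '/');
    if (slash == NULL)     snprintf(dir, sizeof(dir), ".");  /* no slash: current directory */
    else if (slash == dir) dir[1] = '\0';                    /* file directly under "/" */
    else                   *slash = '\0';                    /* strip the file name */

    int dir_fd = open(dir, O_RDONLY);
    if (dir_fd != -1) { fsync(dir_fd); close(dir_fd); }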

    return 0;
}

int main(void) {
    /* Pattern 2 demo */
    const char *new_config = "[server]\nport=8080\nhost=0.0.0.0\n";
    if (atomic_file_update("/etc/myapp/config.ini", new_config, strlen(new_config)) == -1)
        return 1;
    return 0;
}
💡 Database Pattern: Databases commonly write to a Write-Ahead Log (WAL) and make it durable with fdatasync() or fsync() at commit time, then update the data files later in batches. To keep throughput up, they avoid one flush per record where they can, for example by covering several records or transactions with a single flush.
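The write-ahead log pattern can be sketched in a few lines. This is a simplified illustration under stated assumptions, not how any particular database implements its log: the Wal struct and the wal_open(), wal_append(), and wal_commit() functions are hypothetical names. Records are appended with plain write() calls, and a single fdatasync() at commit time makes all of them durable together.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical write-ahead log handle */
typedef struct {
    int  fd;
    int  pending;      /* records written since the last fdatasync() */
    long sync_calls;   /* how many flushes were actually issued */
} Wal;

int wal_open(Wal *w, const char *path) {
    w->fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    w->pending = 0;
    w->sync_calls = 0;
    return w->fd == -1 ? -1 : 0;
}

/* Append one record to the kernel buffer cache; nothing is durable yet. */
int wal_append(Wal *w, const char *rec) {
    size_t len = strlen(rec);
    if (write(w->fd, rec, len) != (ssize_t)len) return -1;
    w->pending++;
    return 0;
}

/* Commit: one fdatasync() covers every record appended since the last commit. */
int wal_commit(Wal *w) {
    if (w->pending == 0) return 0;          /* nothing to flush */
    if (fdatasync(w->fd) == -1) return -1;
    w->sync_calls++;
    w->pending = 0;
    return 0;
}
```

A transaction would call wal_append() once per record and wal_commit() once at the end, and report success to the client only after wal_commit() returns 0.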

🎯 Interview Questions – Kernel Sync

Q1. What is the difference between fsync() and fdatasync()?
Both force data to disk, but fsync() flushes all file metadata too (timestamps, permissions, etc.), while fdatasync() only flushes the metadata that is strictly necessary to retrieve the data (like file size if it grew). fdatasync() can be significantly faster because it avoids a disk seek between the data area and the metadata/inode area of the disk.
Q2. Does close() guarantee that data is on disk?
No. close() only releases the file descriptor. It does not call fsync(), and it does not flush stdio buffers either (that is the job of fflush() or fclose()). The data may still be in the kernel buffer cache when close() returns. If the system crashes or loses power after close() but before the kernel flushes, the data is lost. You must call fsync() before close() if you need disk durability.
Q3. What is the difference between O_SYNC and O_DSYNC?
O_SYNC gives file integrity completion — every write() syncs data + all metadata to disk (equivalent to fsync() after each write). O_DSYNC gives data integrity completion — every write() syncs data + only essential metadata (equivalent to fdatasync() after each write). O_DSYNC is faster because it skips syncing non-essential metadata like timestamps.
Q4. A database developer says “we never use O_SYNC, we use fdatasync()” — why is this better?
O_SYNC makes every write() synchronous, which forces a disk flush after every write() call regardless of write size. fdatasync() gives you control — you can batch many write() calls and then call fdatasync() once per transaction. This way you get durability guarantees with far fewer disk flushes, dramatically improving throughput.
Q5. What is the difference between sync() and fsync()?
sync() flushes ALL dirty kernel buffers for all files in the entire system — it’s a system-wide operation with no file descriptor argument. fsync(fd) flushes only the specific file identified by fd. sync() is used by system administrators and shutdown scripts. fsync() is used by applications that need to protect specific files.
Q6. Why is the atomic rename pattern (write to temp + rename) better than overwriting directly?
If you overwrite a file in place and crash halfway, the file is left corrupted: half old data, half new. With the temp + rename pattern, you write the complete new data to a temp file, call fsync() on it, then rename it over the target. rename() replaces the target atomically, so a reader sees either the complete old file or the complete new file, never an intermediate corrupt state. To make the rename itself survive a crash, also call fsync() on the directory that contains the file.
Q7. From the benchmark: O_SYNC with 1-byte buffer takes 1030 seconds vs 0.73 without. Why is there such a huge CPU time difference too (98.8s vs 0.73s)?
Without O_SYNC, the kernel buffers writes and returns immediately. CPU time is low because write() is cheap. With O_SYNC and 1-byte buffer, there are ~1 million write() calls each requiring a full disk flush. Each flush involves disk controller communication. The high system CPU (98.8s) comes from 1 million system calls + 1 million disk I/O operations being processed by the kernel.
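
Q2 matters even more for programs that use stdio (fopen, fprintf), because the data then passes through two buffers: the stdio buffer in user space and the kernel buffer cache. A minimal sketch of closing such a stream durably follows; fclose_durably() is a hypothetical helper name.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: make everything written to a stdio stream durable,
   then close it. Returns 0 on success, -1 on error. */
int fclose_durably(FILE *fp) {
    /* 1. stdio buffer (user space) -> kernel buffer cache */
    if (fflush(fp) == EOF) { fclose(fp); return -1; }

    /* 2. kernel buffer cache -> storage device */
    if (fsync(fileno(fp)) == -1) { fclose(fp); return -1; }

    /* 3. release the stream and its underlying file descriptor */
    return fclose(fp) == EOF ? -1 : 0;
}
```

The order matters: calling fsync() before fflush() would flush the kernel's copy while the newest data is still sitting in the stdio buffer.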

✅ Summary of Part 3

  • fsync(fd) — flush data + all metadata. Guarantees disk write. Slower.
  • fdatasync(fd) — flush data + essential metadata only. Faster. Prefer this for databases.
  • sync() — flush everything system-wide. Use for shutdown scripts, not application code.
  • O_SYNC — every write() blocks until full disk write. Very slow with small buffers.
  • O_DSYNC — every write() blocks until data + essential metadata are on disk. Faster than O_SYNC.
  • Best pattern: batch writes + periodic fdatasync() or fsync().
  • The atomic rename pattern (temp file + rename) prevents file corruption on crash.
