posix_fadvise() and Direct I/O in Linux Explained

 


Hint to the kernel how you’ll read a file — or bypass its cache entirely.

The kernel makes assumptions about how you’ll use a file and tunes the buffer cache accordingly. But sometimes you know better — for example, you’re reading a file once and will never need it again. posix_fadvise() lets you tell the kernel your access pattern so it can optimize accordingly.

And for very specialized applications (like database engines), there’s an even more radical option: bypass the kernel buffer cache entirely using O_DIRECT. This is called Direct I/O.

1. posix_fadvise() — Advise the Kernel

#define _XOPEN_SOURCE 600
#include <fcntl.h>

int posix_fadvise(int fd, off_t offset, off_t len, int advice);
/* Returns: 0 on success, or a positive error number */
/* offset and len define the file region. len = 0 means "to end of file" */

posix_fadvise() is a hint to the kernel — not a command. The kernel can ignore it. But when it listens, it can dramatically improve performance by:

  • Pre-loading data into the buffer cache before you ask for it
  • Adjusting the read-ahead window size
  • Freeing cache pages for data you’ve already used

Calling it has no effect on the semantics of your program — correctness is never affected.

The Six Advice Values

POSIX_FADV_NORMAL

Default. No special access pattern. Kernel uses the default read-ahead window (typically 128 KB).

POSIX_FADV_SEQUENTIAL

You’ll read from low to high offset. Kernel doubles the read-ahead window (to 256 KB with the typical 128 KB default). Great for streaming large files.

POSIX_FADV_RANDOM

You’ll jump around the file randomly. Kernel disables read-ahead entirely — no point predicting the next block.

POSIX_FADV_WILLNEED

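You’ll access this region soon. Kernel starts reading it into the cache right away, without blocking the caller. Later read() calls can be served from cache, provided the prefetch has finished and the pages have not been evicted in the meantime.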

POSIX_FADV_DONTNEED

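You won’t need this region again. Kernel tries to free the cached pages for it, which reduces cache pollution. Dirty pages that have not been written back yet are not guaranteed to be dropped, so call fsync() or fdatasync() first if you need them released.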

POSIX_FADV_NOREUSE

You’ll use this data once, never again. For a long time this had no effect in Linux (the hint was accepted but not acted on); newer kernels may act on it, so check the man page for your kernel version.
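On Linux, POSIX_FADV_DONTNEED is not guaranteed to release pages that are still dirty, so a writer that wants its pages dropped usually syncs first. The following is a minimal sketch of that ordering; write_then_drop() is a hypothetical helper, and dropping the whole file (offset 0, len 0) is an assumption made to keep the example short.

```c
#define _XOPEN_SOURCE 600
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes at the current file offset, force
   them to storage, then ask the kernel to drop the file's cached pages.
   Returns 0 on success or a positive error number. */
static int write_then_drop(int fd, const void *data, size_t len) {
    const char *p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) {
            if (errno == EINTR) continue;  /* interrupted, retry */
            return errno;
        }
        p += n;
        len -= (size_t)n;
    }
    /* Clean the pages first: dirty pages may survive DONTNEED. */
    if (fdatasync(fd) == -1) return errno;
    /* offset 0, len 0 means the whole file */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}
```

The sync makes the call more expensive, so this pattern fits bulk writers (backups, log archivers) better than latency-sensitive code.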

Code Example: posix_fadvise() in Practice

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

/* Example 1: Sequential large file processing (e.g., log parser) */
int process_log_file(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return -1; }

    /* Tell kernel: I'll read this file from start to end, sequentially */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    /* Kernel doubles read-ahead window → more data pre-loaded → fewer wait times */

    char buf[65536];
    ssize_t n;
    long total = 0;
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* Process buf here (e.g., count log lines, parse records) */
        total += n;
    }
    printf("Processed %ld bytes sequentially\n", total);
    close(fd);
    return 0;
}

/* Example 2: Random access — disable read-ahead */
int random_access_file(const char *path, off_t *offsets, int count) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return -1; }

    /* Tell kernel: I'll access random offsets — don't bother pre-loading */
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

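    char buf[512];
    for (int i = 0; i < count; i++) {
        /* pread() reads at an explicit offset, so no separate lseek() is needed */
        ssize_t n = pread(fd, buf, sizeof(buf), offsets[i]);
        if (n == -1) { perror("pread"); close(fd); return -1; }
        /* Process the n bytes in buf */
    }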
    close(fd);
    return 0;
}

/* Example 3: Prefetch data for later use */
int prefetch_then_read(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return -1; }

    /* Tell kernel: load the first 1 MB into cache RIGHT NOW.
       This is asynchronous — posix_fadvise() returns immediately.
       The kernel starts the disk reads in the background. */
    posix_fadvise(fd, 0, 1024*1024, POSIX_FADV_WILLNEED);

    /* Do other work while the kernel prefetches... */
    printf("Doing other work while kernel prefetches...\n");
    sleep(1);  /* Simulate other work */

    /* Now read — data should be in cache, no disk wait */
    char buf[65536];
    ssize_t n = read(fd, buf, sizeof(buf));
    printf("Read %zd bytes (probably from cache)\n", n);

    close(fd);
    return 0;
}

/* Example 4: Single-pass media streaming (free cache after reading) */
int stream_media(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return -1; }

    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[131072]; /* 128 KB chunks */
    off_t offset = 0;
    ssize_t n;

    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* Stream buf to network/audio/video output */
        /* ...send(socket, buf, n, 0) etc... */

        /* After sending, we'll never need this data again.
           Tell kernel to free the cache pages to save RAM.
           Very useful for media servers — prevents single stream
           from evicting other important cached data. */
        posix_fadvise(fd, offset, n, POSIX_FADV_DONTNEED);
        offset += n;
    }
    close(fd);
    return 0;
}

int main(void) {
    process_log_file("/var/log/syslog");
    prefetch_then_read("/etc/passwd");
    return 0;
}
/* Compile: gcc -D_XOPEN_SOURCE=600 -o fadvise_demo fadvise_demo.c */

2. Direct I/O — Bypassing the Buffer Cache

Normally, all file I/O goes through the kernel buffer cache. When you open a file with O_DIRECT, however, reads and writes transfer data directly between your user-space buffer and the disk, bypassing the kernel cache.

This is called Direct I/O or Raw I/O.

Normal I/O vs Direct I/O

✅ Normal I/O (Default)

Your Process
↕ copy
Kernel Buffer Cache
↕ DMA
Disk

Caching, read-ahead, buffer sharing. Best for most apps.

⚡ Direct I/O (O_DIRECT)

Your Process
Kernel Cache BYPASSED
Disk (directly)

No caching. Must manage your own cache. For databases only.

⚠️ Warning: Direct I/O is Slower for Most Applications

Direct I/O loses all kernel optimizations: no read-ahead, no buffer sharing between processes, no write coalescing. For most applications, using O_DIRECT makes things significantly slower, not faster.

Direct I/O makes sense only when your application implements its own caching (like a database engine), and you want to:

  • Avoid double-buffering (your own cache + kernel cache = wasted RAM)
  • Get predictable I/O latency (no kernel cache eviction surprises)
  • Reduce kernel CPU overhead for your specialized I/O patterns

3. Alignment Rules for Direct I/O

🚨 Direct I/O has strict alignment requirements. Violating them gives EINVAL.

All Three Must Be Multiples of the Block Size (usually 512 bytes, 4096 on some drives):

  • Buffer address: must be aligned to the block size. Use posix_memalign(), or the older memalign().
  • File offset: the position in the file where the read or write starts must be aligned.
  • Transfer length: the number of bytes to read or write must be a multiple of the block size.
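The three rules reduce to a little integer arithmetic. The helpers below are a sketch, assuming the block size is a power of two; the names align_down(), align_up(), and dio_request_ok() are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Round an offset down to the start of its block.
   Assumes block is a power of two. */
static off_t align_down(off_t offset, size_t block) {
    return offset & ~((off_t)block - 1);
}

/* Round a length (or end offset) up to a whole number of blocks. */
static size_t align_up(size_t len, size_t block) {
    return (len + block - 1) & ~(block - 1);
}

/* Nonzero if buffer address, file offset, and transfer length
   are all multiples of block. */
static int dio_request_ok(const void *buf, off_t offset, size_t len,
                          size_t block) {
    return ((uintptr_t)buf & (block - 1)) == 0 &&
           (offset & ((off_t)block - 1)) == 0 &&
           (len & (block - 1)) == 0;
}
```

For example, to fetch bytes 1000 through 1299 with a 512-byte block size, align_down(1000, 512) gives a start offset of 512 and align_up(1300, 512) gives an end of 1536, so the program reads 1024 bytes at offset 512 and then picks out the 300 bytes it wanted.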

Code Example: O_DIRECT with Proper Alignment

#define _GNU_SOURCE  /* Required to get O_DIRECT from fcntl.h */
#include <fcntl.h>
#include <unistd.h>
#include <malloc.h>   /* for memalign() */
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>   /* for memset() */

#define BLOCK_SIZE  512     /* Assumed logical block size of the device */
#define BUF_SIZE    4096    /* Must be a multiple of BLOCK_SIZE */
#define FILE_OFFSET 0       /* Must be a multiple of BLOCK_SIZE */

int main(void) {
    /* Open with O_DIRECT — bypasses kernel buffer cache */
    int fd = open("/tmp/test_direct.bin", 
                  O_RDWR | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd == -1) {
        if (errno == EINVAL) {
            fprintf(stderr, "O_DIRECT not supported on this filesystem\n");
        } else {
            perror("open");
        }
        return 1;
    }

    /* RULE 1: Buffer must be aligned to BLOCK_SIZE.
       malloc() does NOT guarantee this alignment.
       Use memalign() or posix_memalign() instead. */
    void *buf = memalign(BLOCK_SIZE, BUF_SIZE);
    if (!buf) { perror("memalign"); close(fd); return 1; }

    /* Fill buffer with test data */
    memset(buf, 'X', BUF_SIZE);

    /* RULE 2: File offset must be a multiple of BLOCK_SIZE.
       We're writing at offset 0, which is fine. */

    /* RULE 3: Transfer length must be a multiple of BLOCK_SIZE.
       BUF_SIZE = 4096 = 8 × 512, so this is fine. */

    /* Write: goes directly to disk, bypassing kernel cache */
    ssize_t n = write(fd, buf, BUF_SIZE);
    if (n == -1) {
        perror("write O_DIRECT");  /* EINVAL if alignment is wrong */
        free(buf); close(fd); return 1;
    }
    printf("Direct write: %zd bytes written to disk\n", n);

    /* Seek back to start for reading */
    if (lseek(fd, FILE_OFFSET, SEEK_SET) == -1) {
        perror("lseek"); free(buf); close(fd); return 1;
    }

    /* Clear buffer and read back */
    memset(buf, 0, BUF_SIZE);
    n = read(fd, buf, BUF_SIZE);
    if (n == -1) {
        perror("read O_DIRECT");
        free(buf); close(fd); return 1;
    }
    printf("Direct read: %zd bytes read from disk\n", n);
    printf("First char: '%c' (expected 'X')\n", ((char*)buf)[0]);

    free(buf);
    close(fd);
    unlink("/tmp/test_direct.bin");
    return 0;
}
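/* Compile: gcc -o direct_io direct_io.c
   (_GNU_SOURCE is already defined at the top of the file; passing
   -D_GNU_SOURCE as well triggers a macro redefinition warning.)
   Note: O_DIRECT may not work on tmpfs (RAM-based filesystems), and /tmp
   is a tmpfs mount on many systems. If open() fails with EINVAL, point the
   path at a disk-backed filesystem instead. */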
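The example above gives up when open() fails with EINVAL. A program that can tolerate buffered I/O might instead fall back to a normal open. The sketch below shows that idea; open_direct_or_buffered() is a hypothetical helper, and whether a silent fallback is acceptable depends on why the application wanted O_DIRECT in the first place.

```c
#define _GNU_SOURCE  /* for O_DIRECT */
#include <errno.h>
#include <fcntl.h>

/* Hypothetical helper: try O_DIRECT first; if the filesystem rejects the
   flag with EINVAL, reopen without it. *used_direct is set to 1 or 0 to
   report which mode was obtained. flags must not include O_CREAT, since
   no mode argument is passed. Returns the fd, or -1 with errno set. */
static int open_direct_or_buffered(const char *path, int flags,
                                   int *used_direct) {
    int fd = open(path, flags | O_DIRECT);
    if (fd != -1) {
        *used_direct = 1;
        return fd;
    }
    if (errno != EINVAL) {
        return -1;  /* a real error, unrelated to O_DIRECT support */
    }
    *used_direct = 0;
    return open(path, flags);
}
```

The caller can check used_direct to decide whether the alignment rules apply to the reads and writes that follow.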

Code Example: posix_memalign() (POSIX-compliant alignment)

#include <stdlib.h>
#include <stdio.h>

/* posix_memalign() is the POSIX-standard way to get aligned memory.
   memalign() is older and not in POSIX but widely available. */

int main(void) {
    void *buf = NULL;
    size_t alignment = 4096;   /* Page alignment covers typical O_DIRECT requirements */
    size_t size      = 4096;

    /* posix_memalign: alignment must be a power of 2 and >= sizeof(void*) */
    int ret = posix_memalign(&buf, alignment, size);
    if (ret != 0) {
        fprintf(stderr, "posix_memalign failed: %d\n", ret);
        return 1;
    }
    printf("Buffer address: %p (aligned to %zu bytes)\n", buf, alignment);
    printf("Is aligned: %s\n",
           ((unsigned long)buf % alignment == 0) ? "YES ✅" : "NO ❌");

    free(buf);   /* posix_memalign memory is freed with regular free() */
    return 0;
}

/* Quick check: verify your buffer is aligned */
static inline int is_aligned(const void *ptr, size_t alignment) {
    return ((unsigned long)ptr % alignment) == 0;
}
/* Use: if (!is_aligned(buf, 512)) { handle_error(); } */
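Rather than hard-coding 4096, a program can ask the system for its page size and align to that. This is a sketch under the assumption that page alignment is at least as strict as the device's block size, which holds for the common 512-byte and 4096-byte cases but is not a guarantee; alloc_page_aligned() is a hypothetical helper name.

```c
#define _XOPEN_SOURCE 600
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: allocate a buffer aligned to the system page size,
   with the size rounded up to a whole number of pages. The rounded size
   is stored in *actual_size if that pointer is non-NULL.
   Returns NULL on failure; release the buffer with free(). */
static void *alloc_page_aligned(size_t size, size_t *actual_size) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return NULL;

    size_t psz = (size_t)page;
    size_t rounded = (size + psz - 1) / psz * psz;
    if (rounded == 0) rounded = psz;  /* size 0 still gets one page */

    void *buf = NULL;
    if (posix_memalign(&buf, psz, rounded) != 0) return NULL;

    if (actual_size) *actual_size = rounded;
    return buf;
}
```

Rounding the size up as well as aligning the address means the same buffer can be passed to read() or write() with a length that already satisfies the transfer-length rule.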

🎯 Interview Questions – posix_fadvise & Direct I/O

Q1. What does posix_fadvise() actually do? Can it make a program incorrect?
posix_fadvise() gives the kernel a hint about how your process intends to access a file region. The kernel may use this to optimize buffer cache usage — for example, prefetching data or freeing unneeded cache. It never affects program correctness: data returned by read() is always accurate regardless of advice. The kernel is free to ignore the hint.
Q2. When would POSIX_FADV_DONTNEED be useful?
It’s useful for media streaming or single-pass file processing where you read each block once and never look at it again. After reading each chunk, you call fadvise with DONTNEED to tell the kernel to free those cache pages. This prevents a single streaming process from evicting other important data from the page cache, which is critical for server workloads.
Q3. Why doesn’t Direct I/O always improve performance?
The kernel buffer cache provides many optimizations for free: sequential read-ahead, write coalescing, buffer sharing between processes reading the same file. Direct I/O bypasses all of these. For a typical application, these lost optimizations more than outweigh any benefit. Direct I/O only helps when your app implements its own, better-suited cache (like a database engine), avoiding redundant double-buffering of data in both your cache and the kernel’s.
Q4. What are the three alignment requirements for O_DIRECT?
All three must be multiples of the device’s physical block size (typically 512 bytes, sometimes 4096 for modern drives): (1) the memory buffer address must be block-aligned — use memalign() or posix_memalign(); (2) the file/device offset at which the transfer starts must be block-aligned; (3) the length of the transfer must be a multiple of the block size. Violating any of these causes read()/write() to fail with EINVAL.
Q5. Why can’t you use malloc() to allocate a buffer for O_DIRECT?
malloc() only guarantees alignment suitable for any built-in type, which is typically 16 bytes on 64-bit Linux. O_DIRECT requires alignment to the disk block size: at least 512 bytes, often 4096 bytes. A malloc() buffer may happen to be aligned, but nothing guarantees it. Use posix_memalign(&ptr, block_size, buf_size) or memalign(block_size, buf_size), which guarantee the specific alignment required.
Q6. What happens if Process A opens a file with O_DIRECT and Process B opens the same file normally?
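This is risky. Process B’s writes go through the page cache, while Process A’s reads and writes bypass it. Depending on the filesystem, A may read data from disk that does not yet reflect B’s recent writes (still sitting in the kernel’s page cache, not yet written back), and after A writes directly to disk, B may keep seeing older cached pages. Coherency between the two modes is not guaranteed, and even where the filesystem handles it, throughput tends to suffer. Avoid mixing O_DIRECT and buffered I/O on the same file in production, especially on overlapping regions.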
Q7. What is the difference between POSIX_FADV_WILLNEED and POSIX_FADV_SEQUENTIAL?
POSIX_FADV_SEQUENTIAL tells the kernel to double the read-ahead window for automatic prefetching during sequential reads — it adjusts the ongoing read behavior. POSIX_FADV_WILLNEED is a one-time explicit request: “load this specific region into cache right now, before I ask for it.” WILLNEED is like a manual prefetch; SEQUENTIAL is like telling the kernel “my access pattern is sequential, adjust your heuristics.”

✅ Summary of Part 4

  • posix_fadvise(fd, offset, len, advice) hints the kernel about access patterns. Never affects correctness.
  • POSIX_FADV_SEQUENTIAL — doubles read-ahead. Use for large sequential scans.
  • POSIX_FADV_RANDOM — disables read-ahead. Use for random access workloads.
  • POSIX_FADV_WILLNEED — prefetch a region now. Async — returns immediately.
  • POSIX_FADV_DONTNEED — free cache pages. Use after single-pass reads to save RAM.
  • O_DIRECT — bypasses kernel buffer cache. Slower for most apps. Only for databases that manage their own cache.
  • O_DIRECT requires buffer, offset, and length all aligned to block size (512+ bytes). Use posix_memalign().
