posix_fadvise() and Direct I/O in Linux Explained
Hint to the kernel how you’ll read a file — or bypass its cache entirely.
The kernel makes assumptions about how you’ll use a file and tunes the buffer cache to match. But sometimes you know better: for example, you’re reading a file once and will never need it again. posix_fadvise() lets you tell the kernel your access pattern so it can adjust read-ahead and caching for that file.
And for very specialized applications (like database engines), there’s an even more radical option: bypass the kernel buffer cache entirely using O_DIRECT. This is called Direct I/O.
1. posix_fadvise() — Advise the Kernel
#include <fcntl.h>
int posix_fadvise(int fd, off_t offset, off_t len, int advice);
/* Returns: 0 on success, or a positive error number */
/* offset + len define the file region. len=0 means “to end of file” */
posix_fadvise() is a hint to the kernel — not a command. The kernel can ignore it. But when it listens, it can dramatically improve performance by:
- Pre-loading data into the buffer cache before you ask for it
- Adjusting the read-ahead window size
- Freeing cache pages for data you’ve already used
Calling it has no effect on the semantics of your program — correctness is never affected.
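One detail of the prototype above is easy to miss: posix_fadvise() reports failure through its return value and does not set errno, so perror() is the wrong tool. A minimal sketch of a wrapper that checks the result follows; the name advise_or_warn is hypothetical, chosen only for this illustration.

```c
#define _XOPEN_SOURCE 600
#include <errno.h>   /* EBADF, ESPIPE, EINVAL for callers that inspect the result */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: apply a hint and report (but tolerate) failure.
   Returns 0 on success, or the positive error number from posix_fadvise(). */
int advise_or_warn(int fd, off_t offset, off_t len, int advice)
{
    int err = posix_fadvise(fd, offset, len, advice);
    if (err != 0) {
        /* errno is not set; the error number is the return value itself */
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));
    }
    return err;
}
```

Because the call is only a hint, a caller would normally log a failure like this and carry on with the I/O rather than abort.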
The Six Advice Values
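- POSIX_FADV_NORMAL: Default. No special access pattern. Kernel uses the standard read-ahead window (typically 128 KB).
- POSIX_FADV_SEQUENTIAL: You’ll read from low to high offset. Kernel doubles the read-ahead window (typically to 256 KB). Great for streaming large files.
- POSIX_FADV_RANDOM: You’ll jump around the file randomly. Kernel disables read-ahead entirely, since there is no point predicting the next block.
- POSIX_FADV_WILLNEED: You’ll access this region soon. Kernel starts loading it into the cache now, so later read() calls can be served from cache without waiting for the disk.
- POSIX_FADV_DONTNEED: You won’t need this region again. Kernel tries to free the cached pages for it, which reduces cache pollution. Pages that have not yet been written to disk are not freed, so call fsync() or fdatasync() first if the file was modified.
- POSIX_FADV_NOREUSE: You’ll use this data once, never again. For most of Linux’s history (since kernel 2.6.18) this has been a no-op: the hint is accepted but not acted on.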
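According to the Linux man page, POSIX_FADV_DONTNEED leaves pages that have not yet been written out untouched, so after writing a file the usual pattern is to sync first and then drop. The sketch below illustrates that order; the helper name write_and_drop_cache is an assumption made for this example, not part of any API.

```c
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes to `path`, then ask the kernel
   to drop the file's cached pages. Returns 0 on success, -1 on I/O error. */
int write_and_drop_cache(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    const char *p = data;
    size_t left = len;
    while (left > 0) {                 /* handle short writes */
        ssize_t n = write(fd, p, left);
        if (n == -1) { close(fd); return -1; }
        p += n;
        left -= (size_t)n;
    }

    /* Dirty pages cannot be dropped, so push them to disk first. */
    if (fdatasync(fd) == -1) { close(fd); return -1; }

    /* offset 0, len 0 = whole file. Still only a hint, so failure is ignored. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    return close(fd);
}
```

A backup or log-rotation tool could call this for each file it produces so that its output does not push hotter data out of the cache.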
Code Example: posix_fadvise() in Practice
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
/* Example 1: Sequential large file processing (e.g., log parser) */
int process_log_file(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return -1; }
    /* Tell kernel: I'll read this file from start to end, sequentially */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    /* Kernel doubles read-ahead window → more data pre-loaded → fewer waits */
    char buf[65536];
    ssize_t n;
    long total = 0;
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* Process buf here (e.g., count log lines, parse records) */
        total += n;
    }
    printf("Processed %ld bytes sequentially\n", total);
    close(fd);
    return 0;
}
/* Example 2: Random access with read-ahead disabled */
int random_access_file(const char *path, off_t *offsets, int count) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return -1; }
    /* Tell kernel: I'll access random offsets, so don't bother pre-loading */
    posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    char buf[512];
    for (int i = 0; i < count; i++) {
        if (lseek(fd, offsets[i], SEEK_SET) == -1) {
            perror("lseek"); close(fd); return -1;
        }
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n == -1) { perror("read"); close(fd); return -1; }
        /* Process the n bytes now in buf (n may be less than sizeof(buf)) */
    }
    close(fd);
    return 0;
}
/* Example 3: Prefetch data for later use */
int prefetch_then_read(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return -1; }
    /* Tell kernel: load the first 1 MB into cache RIGHT NOW.
       This is asynchronous: posix_fadvise() returns immediately.
       The kernel starts the disk reads in the background. */
    posix_fadvise(fd, 0, 1024*1024, POSIX_FADV_WILLNEED);
    /* Do other work while the kernel prefetches... */
    printf("Doing other work while kernel prefetches...\n");
    sleep(1); /* Simulate other work */
    /* Now read: data should be in cache, no disk wait */
    char buf[65536];
    ssize_t n = read(fd, buf, sizeof(buf));
    printf("Read %zd bytes (probably from cache)\n", n);
    close(fd);
    return 0;
}
/* Example 4: Single-pass media streaming (free cache after reading) */
int stream_media(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return -1; }
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    char buf[131072]; /* 128 KB chunks */
    off_t offset = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0) {
        /* Stream buf to network/audio/video output */
        /* ...send(socket, buf, n, 0) etc... */
        /* After sending, we'll never need this data again.
           Tell kernel to free the cache pages to save RAM.
           Very useful for media servers: prevents a single stream
           from evicting other important cached data. */
        posix_fadvise(fd, offset, n, POSIX_FADV_DONTNEED);
        offset += n;
    }
    close(fd);
    return 0;
}
int main(void) {
    process_log_file("/var/log/syslog");
    prefetch_then_read("/etc/passwd");
    return 0;
}
/* Compile: gcc -D_XOPEN_SOURCE=600 -o fadvise_demo fadvise_demo.c */
2. Direct I/O — Bypassing the Buffer Cache
Normally, all file I/O goes through the kernel buffer cache. But if you open a file with O_DIRECT, your reads and writes transfer data directly between your user-space buffer and the disk, bypassing the kernel cache entirely.
This is called Direct I/O or Raw I/O.
Normal I/O vs Direct I/O
✅ Normal I/O (Default)
Caching, read-ahead, buffer sharing. Best for most apps.
⚡ Direct I/O (O_DIRECT)
No caching. Must manage your own cache. For databases only.
⚠️ Warning: Direct I/O is Slower for Most Applications
Direct I/O loses all kernel optimizations: no read-ahead, no buffer sharing between processes, no write coalescing. For most applications, using O_DIRECT makes things significantly slower, not faster.
Direct I/O makes sense only when your application implements its own caching (like a database engine), and you want to:
- Avoid double-buffering (your own cache + kernel cache = wasted RAM)
- Get predictable I/O latency (no kernel cache eviction surprises)
- Reduce kernel CPU overhead for your specialized I/O patterns
3. Alignment Rules for Direct I/O
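All three must be multiples of the block size (usually 512 bytes):
- The start address of the data buffer in memory
- The file offset at which the transfer begins
- The length of the data to be transferred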
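The alignment rules come down to simple arithmetic on the buffer address, offset, and length. As a rough sketch, the helpers below round values to a block boundary and check all three conditions at once. The function names are hypothetical, and the code assumes the alignment is a power of two (512, 4096, and so on).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical helpers; `align` must be a power of two (512, 4096, ...). */

/* Round a non-negative file offset down to the previous block boundary. */
off_t align_down(off_t off, off_t align)
{
    assert(align > 0 && (align & (align - 1)) == 0);
    return off & ~(align - 1);
}

/* Round a transfer length up to the next block boundary. */
size_t align_up(size_t len, size_t align)
{
    assert(align > 0 && (align & (align - 1)) == 0);
    return (len + align - 1) & ~(align - 1);
}

/* Returns 1 when buffer address, file offset, and length all satisfy `align`. */
int direct_io_ok(const void *buf, off_t off, size_t len, size_t align)
{
    return (uintptr_t)buf % align == 0
        && off >= 0 && (uintmax_t)off % align == 0
        && len % align == 0;
}
```

For example, to read 1000 bytes starting at offset 1000 with 512-byte blocks, you would read from align_down(1000, 512) = 512 and request align_up(1488, 512) = 1536 bytes (the 488 bytes skipped at the front plus the 1000 wanted), then pick the wanted bytes out of the buffer.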
Code Example: O_DIRECT with Proper Alignment
#define _GNU_SOURCE /* Required to get O_DIRECT from fcntl.h */
#include <fcntl.h>
#include <unistd.h>
#include <malloc.h> /* for memalign() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* for memset() */
#include <errno.h>
#define BLOCK_SIZE 512 /* Assumed block size; check your device */
#define BUF_SIZE 4096 /* Must be a multiple of BLOCK_SIZE */
#define FILE_OFFSET 0 /* Must be a multiple of BLOCK_SIZE */
int main(void) {
    /* Open with O_DIRECT: bypasses kernel buffer cache */
    int fd = open("/tmp/test_direct.bin",
                  O_RDWR | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd == -1) {
        if (errno == EINVAL) {
            fprintf(stderr, "O_DIRECT not supported on this filesystem\n");
        } else {
            perror("open");
        }
        return 1;
    }
    /* RULE 1: Buffer must be aligned to BLOCK_SIZE.
       malloc() does NOT guarantee this alignment.
       Use memalign() or posix_memalign() instead. */
    void *buf = memalign(BLOCK_SIZE, BUF_SIZE);
    if (!buf) { perror("memalign"); close(fd); return 1; }
    /* Fill buffer with test data */
    memset(buf, 'X', BUF_SIZE);
    /* RULE 2: File offset must be a multiple of BLOCK_SIZE.
       We're writing at offset 0, which is fine. */
    /* RULE 3: Transfer length must be a multiple of BLOCK_SIZE.
       BUF_SIZE = 4096 = 8 × 512, so this is fine. */
    /* Write: goes directly to disk, bypassing kernel cache */
    ssize_t n = write(fd, buf, BUF_SIZE);
    if (n == -1) {
        perror("write O_DIRECT"); /* EINVAL if alignment is wrong */
        free(buf); close(fd); return 1;
    }
    printf("Direct write: %zd bytes written to disk\n", n);
    /* Seek back to start for reading */
    if (lseek(fd, FILE_OFFSET, SEEK_SET) == -1) {
        perror("lseek"); free(buf); close(fd); return 1;
    }
    /* Clear buffer and read back */
    memset(buf, 0, BUF_SIZE);
    n = read(fd, buf, BUF_SIZE);
    if (n == -1) {
        perror("read O_DIRECT");
        free(buf); close(fd); return 1;
    }
    printf("Direct read: %zd bytes read from disk\n", n);
    printf("First char: '%c' (expected 'X')\n", ((char*)buf)[0]);
    free(buf);
    close(fd);
    unlink("/tmp/test_direct.bin");
    return 0;
}
/* Compile: gcc -o direct_io direct_io.c
   (_GNU_SOURCE is already defined at the top of the file)
   Note: O_DIRECT may not work on tmpfs (RAM-based filesystems).
   If /tmp is tmpfs on your system, use a path on a disk-backed filesystem. */
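Since O_DIRECT support depends on the filesystem, a program that merely prefers direct I/O can retry without the flag when open() fails with EINVAL. The following is one possible sketch of that fallback; the function name open_direct_or_buffered and the is_direct out-parameter are inventions for this example.

```c
#define _GNU_SOURCE /* for O_DIRECT */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: try O_DIRECT first, fall back to buffered I/O if the
   filesystem rejects the flag. Sets *is_direct to 1 or 0 on success.
   Returns the file descriptor, or -1 with errno set on failure. */
int open_direct_or_buffered(const char *path, int flags, mode_t mode,
                            int *is_direct)
{
    assert(is_direct != NULL);

    int fd = open(path, flags | O_DIRECT, mode);
    if (fd != -1) {
        *is_direct = 1;
        return fd;
    }
    if (errno != EINVAL)
        return -1;      /* a real error, not "O_DIRECT unsupported" */

    /* Retry through the normal buffer cache. */
    fd = open(path, flags, mode);
    if (fd != -1)
        *is_direct = 0;
    return fd;
}
```

The caller can then use *is_direct to decide whether the alignment rules apply to its subsequent read() and write() calls. Keeping buffers aligned in both cases is harmless and keeps the two code paths identical.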
Code Example: posix_memalign() (POSIX-compliant alignment)
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h> /* for uintptr_t */
/* posix_memalign() is the POSIX-standard way to get aligned memory.
   memalign() is older and not in POSIX but widely available. */
int main(void) {
    void *buf = NULL;
    size_t alignment = 4096; /* Page alignment satisfies typical block sizes */
    size_t size = 4096;
    /* posix_memalign: alignment must be a power of 2 and a multiple of sizeof(void*) */
    int ret = posix_memalign(&buf, alignment, size);
    if (ret != 0) {
        fprintf(stderr, "posix_memalign failed: %d\n", ret);
        return 1;
    }
    printf("Buffer address: %p (aligned to %zu bytes)\n", buf, alignment);
    printf("Is aligned: %s\n",
           ((uintptr_t)buf % alignment == 0) ? "YES ✅" : "NO ❌");
    free(buf); /* posix_memalign memory is freed with regular free() */
    return 0;
}
/* Quick check: verify your buffer is aligned */
static inline int is_aligned(const void *ptr, size_t alignment) {
    return ((uintptr_t)ptr % alignment) == 0;
}
/* Use: if (!is_aligned(buf, 512)) { handle_error(); } */
✅ Summary of Part 4
- posix_fadvise(fd, offset, len, advice) hints the kernel about access patterns. Never affects correctness.
- POSIX_FADV_SEQUENTIAL: doubles read-ahead. Use for large sequential scans.
- POSIX_FADV_RANDOM: disables read-ahead. Use for random access workloads.
- POSIX_FADV_WILLNEED: prefetch a region now. Async, returns immediately.
- POSIX_FADV_DONTNEED: free cache pages. Use after single-pass reads to save RAM.
- O_DIRECT: bypasses kernel buffer cache. Slower for most apps. Only for databases that manage their own cache.
- O_DIRECT requires buffer, offset, and length all aligned to block size (512+ bytes). Use posix_memalign().
