What is madvise()?
The Linux kernel’s page cache and read-ahead engine try to predict your memory access patterns and pre-load pages before you need them. But the kernel is guessing: it does not know whether you will scan a large file sequentially, jump around randomly, or access a region just once and never again.
madvise() lets you tell the kernel how you intend to access memory. With this information, the kernel can optimize read-ahead (loading more pages in advance for sequential access), minimize I/O (loading less for random access), or free pages earlier (for one-time access patterns). The access-pattern advice (MADV_NORMAL, MADV_RANDOM, MADV_SEQUENTIAL, MADV_WILLNEED) is purely a performance hint: it never changes the correctness of your program, and the kernel is free to ignore it. MADV_DONTNEED is the exception on Linux, because it can discard data, as described below.
Available on Linux since kernel version 2.4.
Function Signature
```c
#define _DEFAULT_SOURCE   /* glibc 2.20 and later; older glibc used _BSD_SOURCE */
#include <sys/mman.h>

int madvise(void *addr, size_t length, int advice);
/* Returns: 0 on success, -1 on error (errno is set) */
```
| Parameter | Description |
|---|---|
| addr | Start of region; must be page-aligned |
| length | Length in bytes (rounded up to page boundary) |
| advice | One of the MADV_* constants describing your access pattern |
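Because addr must be page-aligned, advice for an arbitrary pointer range has to be widened to page boundaries first. The sketch below shows one way to do that; `advise_range` is a hypothetical helper name, not part of any library, and it assumes the whole range lies inside an existing mapping.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a library function): apply `advice` to the pages
 * that cover [p, p + n). The start is rounded down to a page boundary and
 * the length grows by the same amount; the kernel rounds the end up itself.
 * Returns 0 on success, -1 on error (errno set by madvise).
 *
 * Caution: widening the range is harmless for pure hints such as
 * MADV_RANDOM or MADV_WILLNEED, but with MADV_DONTNEED it would also
 * discard neighbouring data that shares the first or last page.
 */
int advise_range(void *p, size_t n, int advice)
{
    if (n == 0)
        return 0;

    uintptr_t page  = (uintptr_t)sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)p & ~(page - 1);      /* round down */
    size_t    len   = (size_t)((uintptr_t)p - start) + n;

    return madvise((void *)start, len, advice);
}
```

A call such as `advise_range(buf + 100, 5000, MADV_RANDOM)` would then succeed where a direct `madvise(buf + 100, ...)` fails with EINVAL.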
Advice Flags – What Each One Does
MADV_NORMAL
The default behavior. The kernel uses its normal read-ahead heuristic — pages are fetched in clusters (a few pages at a time), giving some read-behind and read-ahead. Use this to reset a region to default behavior after applying another hint.
MADV_RANDOM
You will access pages in random order. Read-ahead is useless because the next page you need is not adjacent to the current one. The kernel should fetch the minimum amount of data on each page fault (one page at a time). Use this for hash tables, B-tree indexes, or any data structure with pointer-chasing access.
MADV_SEQUENTIAL
You will access pages once, from start to end. The kernel can aggressively read ahead (pre-load many pages ahead of current position) and free pages quickly after they are accessed (since you won’t need them again). Use this for streaming file reads, log processing, or image loading.
MADV_WILLNEED
“I will need these pages soon — please pre-load them now.” The kernel initiates asynchronous read-ahead immediately without waiting for a page fault. Similar in effect to the Linux-specific readahead() system call and posix_fadvise(POSIX_FADV_WILLNEED). Use before a section of code that will heavily access a mapped region, to hide I/O latency.
MADV_DONTNEED ⚠ Destructive on Linux!
“I no longer need these pages in memory.” But the behavior differs importantly between mapping types and between Linux and other Unix systems:
- MAP_PRIVATE region: Pages are explicitly discarded. Any modifications to those pages are lost. The virtual address range remains accessible, but the next access causes a page fault that reinitializes the page (either from the underlying file, or zeros for anonymous mapping). This can be used to explicitly reinitialize a MAP_PRIVATE region back to its original state.
- MAP_SHARED region: On x86, the kernel does not discard modified shared pages. On other architectures, it might.
- Other Unix systems: On some systems, MADV_DONTNEED is merely a hint that pages can be swapped out if needed; it is NOT destructive. Do not rely on Linux’s destructive semantics for portable code.
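The MAP_PRIVATE case for a file-backed mapping can be checked with a small experiment. The following is a minimal sketch, assuming Linux semantics and a writable /tmp; `private_file_reset_demo` is a hypothetical function name chosen for illustration.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical demo: returns 1 if MADV_DONTNEED restored the file's original
 * contents in a MAP_PRIVATE mapping, 0 if the private modification survived,
 * -1 on a setup error. Relies on Linux-specific behavior.
 */
int private_file_reset_demo(void)
{
    static const char original[] = "original";
    char path[] = "/tmp/madv_demoXXXXXX";

    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);                        /* file disappears once fd is closed */

    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    if (ftruncate(fd, (off_t)len) == -1 ||
        write(fd, original, sizeof original) != (ssize_t)sizeof original) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);
    if (map == MAP_FAILED)
        return -1;

    memcpy(map, "modified", 8);          /* copy-on-write: the file is untouched */

    int result = -1;
    if (madvise(map, len, MADV_DONTNEED) == 0)
        /* The next access faults the page back in from the file. */
        result = (memcmp(map, original, sizeof original) == 0);

    munmap(map, len);
    return result;
}
```

On Linux the function is expected to return 1: the private copy holding "modified" is dropped and the page is re-read from the file.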
Advice Flags Comparison
| Flag | Read-ahead | Effect | Best For |
|---|---|---|---|
| MADV_NORMAL | Moderate | Default cluster transfer | General purpose |
| MADV_RANDOM | None (min 1 page) | Minimal I/O per fault | Hash tables, B-trees, random DB lookups |
| MADV_SEQUENTIAL | Aggressive | Aggressively pre-loads; free after use | Log streaming, linear file scan |
| MADV_WILLNEED | Immediate async | Loads pages now, before access | Pre-loading data before RT section |
| MADV_DONTNEED | N/A | ⚠ Pages discarded on Linux (MAP_PRIVATE) | Reinitialize/discard private mapped region |
Example 1: Sequential File Processing
You are processing a large log file from start to end. Tell the kernel to read ahead aggressively and free pages as you go:
```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "Usage: %s <logfile>\n", argv[0]);
        exit(1);
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd == -1) { perror("open"); exit(1); }

    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); exit(1); }
    size_t file_size = (size_t)st.st_size;
    if (file_size == 0) {                /* mmap() rejects a zero length */
        printf("File: %s is empty\n", argv[1]);
        close(fd);
        return 0;
    }

    /* Map the file */
    char *data = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); exit(1); }
    close(fd);   /* The mapping stays valid after the descriptor is closed */

    /*
     * Tell the kernel: "I will read this sequentially, once."
     * The kernel may read ahead aggressively and free pages soon after access.
     * This usually suits large sequential scans better than the default.
     */
    if (madvise(data, file_size, MADV_SEQUENTIAL) == -1)
        perror("madvise SEQUENTIAL (non-fatal, continuing)");

    /* Process the file sequentially */
    size_t newlines = 0;
    for (size_t i = 0; i < file_size; i++)
        if (data[i] == '\n') newlines++;

    printf("File: %s\n", argv[1]);
    printf("Size: %zu bytes\n", file_size);
    printf("Lines: %zu\n", newlines);

    munmap(data, file_size);
    return 0;
}
```
Example 2: Random Access – Disable Read-ahead
You are implementing a hash table backed by a memory-mapped file. Accesses jump around randomly. Tell the kernel not to waste time reading ahead:
```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    int num_pages = 100;
    size_t len = (size_t)num_pages * (size_t)pagesize;

    /*
     * Anonymous memory stands in for the mapped file here so that the
     * demo is self-contained; the madvise() call is the same either way.
     */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); exit(1); }

    /* Initialize all pages */
    for (int i = 0; i < num_pages; i++)
        buf[i * pagesize] = (char)i;

    /*
     * Tell the kernel: "Access pattern is random."
     * For a file-backed mapping the kernel then fetches only 1 page
     * per fault, not a cluster.
     * Good for hash tables where adjacent pages are unrelated.
     */
    if (madvise(buf, len, MADV_RANDOM) == -1)
        perror("madvise RANDOM");

    /* Simulate random access */
    srand(42);
    long sum = 0;
    for (int i = 0; i < 1000; i++) {
        int page = rand() % num_pages;
        sum += buf[page * pagesize];   /* Random page access */
    }

    printf("Random access sum (checksum): %ld\n", sum);
    printf("Read-ahead was disabled for this region.\n");

    munmap(buf, len);
    return 0;
}
```
Example 3: MADV_WILLNEED – Pre-loading Pages Before Use
You have a two-phase program: in phase 1 you do some setup work, and in phase 2 you need fast access to a large data region. Use MADV_WILLNEED during phase 1 to pre-load the pages asynchronously while phase 1 runs, so they are in RAM when phase 2 starts:
```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>
#include <time.h>

/* Simulate some setup work */
static void do_setup_work(void)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 50000000 };   /* 50ms */
    nanosleep(&ts, NULL);
    printf("Setup work complete\n");
}

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    size_t data_size = 16 * 1024 * 1024;   /* 16 MB data region */

    char *data = mmap(NULL, data_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (data == MAP_FAILED) { perror("mmap"); exit(1); }

    /*
     * Initialize data (simulating reading a DB index into memory).
     * Note: this leaves the anonymous pages resident, so in this demo the
     * hint below only matters for pages that were swapped out meanwhile.
     * The same pattern pays off most with file-backed mappings.
     */
    memset(data, 0xAB, data_size);

    /*
     * Phase 1: Do setup work.
     * Meanwhile, tell the kernel to start loading our data region
     * asynchronously, so that the data is likely in RAM when phase 1 ends.
     */
    printf("Phase 1: Giving kernel WILLNEED hint for %zu MB...\n",
           data_size / (1024 * 1024));
    if (madvise(data, data_size, MADV_WILLNEED) == -1)
        perror("madvise WILLNEED");

    do_setup_work();   /* This runs while the kernel pre-loads pages */

    /*
     * Phase 2: Access the data, which should be mostly in RAM already.
     * Few page faults expected.
     */
    printf("Phase 2: Accessing pre-loaded data...\n");
    long checksum = 0;
    for (size_t i = 0; i < data_size; i += (size_t)pagesize)
        checksum += data[i];

    printf("Checksum: %ld\n", checksum);
    printf("Pre-loading aims to reduce page fault latency during Phase 2.\n");

    munmap(data, data_size);
    return 0;
}
```
Example 4: MADV_DONTNEED – Reinitializing a MAP_PRIVATE Region
This is an unusual but legitimate use of MADV_DONTNEED on Linux. For MAP_PRIVATE anonymous mappings, it discards the current page content and the next access returns a freshly zeroed page. For large regions this can be cheaper than calling memset(), because the kernel just drops the pages and zero-fills lazily, only for the pages you touch again:
```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>

int main(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    size_t len = 4 * (size_t)pagesize;

    /*
     * MAP_PRIVATE anonymous mapping: pages are initially zeroed on access.
     */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); exit(1); }

    /* Write data into the pages */
    memset(buf, 0xFF, len);
    printf("After write: buf[0] = 0x%X\n", (unsigned char)buf[0]);

    /*
     * MADV_DONTNEED on Linux (MAP_PRIVATE):
     * Discards the pages. The virtual address range remains valid,
     * but the next access will page-fault and get ZEROED pages.
     * Modifications are LOST.
     *
     * WARNING: This behavior is Linux-specific.
     * On other Unix systems, MADV_DONTNEED is just a hint.
     */
    if (madvise(buf, len, MADV_DONTNEED) == -1) {
        perror("madvise DONTNEED");
        exit(1);
    }

    /*
     * Access the region again: pages are freshly zeroed.
     * This is because the kernel discarded the 0xFF pages.
     */
    printf("After MADV_DONTNEED: buf[0] = 0x%X (expect 0x0)\n",
           (unsigned char)buf[0]);

    /*
     * Use case: Efficiently "reset" a scratch buffer to zero
     * without actually writing zeros to every byte.
     * The kernel marks the pages as not-present; next access
     * gets zero-filled pages from the zero page.
     */
    munmap(buf, len);
    return 0;
}
```
Expected output on Linux:

```
After write: buf[0] = 0xFF
After MADV_DONTNEED: buf[0] = 0x0 (expect 0x0)
```
posix_madvise() – The Portable POSIX Version
SUSv3 (Single Unix Specification v3) standardizes memory advice under the name posix_madvise() with POSIX-prefixed constants. This is the portable version for applications that must run on multiple Unix systems:
```c
#include <sys/mman.h>

int posix_madvise(void *addr, size_t len, int advice);

/* advice values:
 *   POSIX_MADV_NORMAL      -- same as MADV_NORMAL
 *   POSIX_MADV_RANDOM      -- same as MADV_RANDOM
 *   POSIX_MADV_SEQUENTIAL  -- same as MADV_SEQUENTIAL
 *   POSIX_MADV_WILLNEED    -- same as MADV_WILLNEED
 *   POSIX_MADV_DONTNEED    -- DIFFERENT behavior! (see note below)
 */
```
SUSv3 says posix_madvise() must NOT affect program semantics — it is purely a performance hint. But in glibc versions before 2.7, POSIX_MADV_DONTNEED was implemented using MADV_DONTNEED, which does affect semantics (discards pages on MAP_PRIVATE).
Since glibc 2.7, POSIX_MADV_DONTNEED does nothing (to comply with the POSIX requirement). So if you need the destructive discard behavior, use madvise(MADV_DONTNEED) directly.
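The difference between the two DONTNEED variants can be observed directly. The sketch below assumes Linux with a current glibc, where POSIX_MADV_DONTNEED is a no-op; `data_survives_dontneed` is a hypothetical name. Note that posix_madvise() reports failure by returning an error number rather than by returning -1 and setting errno.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical demo: fill one private anonymous page with 0xFF, apply a
 * DONTNEED variant, then look at the first byte again.
 *   use_posix != 0 -> posix_madvise(POSIX_MADV_DONTNEED)
 *   use_posix == 0 -> madvise(MADV_DONTNEED)
 * Returns 1 if the data survived, 0 if the page came back zeroed,
 * -1 on error.
 */
int data_survives_dontneed(int use_posix)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memset(buf, 0xFF, len);

    /* Both calls return 0 on success; nonzero means failure for either. */
    int rc = use_posix ? posix_madvise(buf, len, POSIX_MADV_DONTNEED)
                       : madvise(buf, len, MADV_DONTNEED);

    int result = (rc != 0) ? -1 : (buf[0] == 0xFF);
    munmap(buf, len);
    return result;
}
```

Under those assumptions, `data_survives_dontneed(1)` should return 1 (data kept) and `data_survives_dontneed(0)` should return 0 (page discarded and zero-filled).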
Example 5: posix_madvise() for Portable Code
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Portable memory advice using POSIX interface.
 * Works on Linux (glibc), macOS, FreeBSD.
 */
void advise_file_access(void *map, size_t size, int is_sequential)
{
    int advice = is_sequential ? POSIX_MADV_SEQUENTIAL : POSIX_MADV_RANDOM;
    const char *advice_str = is_sequential ? "SEQUENTIAL" : "RANDOM";

    /* posix_madvise() returns an error number; it does not set errno */
    int err = posix_madvise(map, size, advice);
    if (err != 0) {
        /* Non-fatal: program still works */
        fprintf(stderr, "posix_madvise: %s\n", strerror(err));
        fprintf(stderr, "Note: kernel may not honor %s hint\n", advice_str);
    } else {
        printf("Advised kernel of %s access pattern for %zu bytes\n",
               advice_str, size);
    }
}

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "Usage: %s <file> <seq|rand>\n", argv[0]);
        exit(1);
    }
    int is_seq = (argv[2][0] == 's');

    int fd = open(argv[1], O_RDONLY);
    if (fd == -1) { perror("open"); exit(1); }

    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); exit(1); }
    size_t sz = (size_t)st.st_size;
    if (sz == 0) {                       /* mmap() rejects a zero length */
        fprintf(stderr, "%s is empty\n", argv[1]);
        exit(1);
    }

    void *map = mmap(NULL, sz, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); exit(1); }
    close(fd);

    /* Give the kernel our access pattern hint */
    advise_file_access(map, sz, is_seq);

    /* Process the file (same code regardless of hint) */
    long sum = 0;
    unsigned char *data = map;
    for (size_t i = 0; i < sz; i++)
        sum += data[i];
    printf("Checksum: %ld\n", sum);

    munmap(map, sz);
    return 0;
}

/*
 * Compile:    gcc -o advise advise.c
 * Sequential: ./advise /var/log/syslog seq
 * Random:     ./advise /var/log/syslog rand
 */
```
Additional Linux-Specific Advice Flags
Linux added more nonstandard MADV_* flags in later kernel versions for special use cases:
| Flag | Kernel | Purpose |
|---|---|---|
| MADV_DONTFORK | 2.6.16 | Don’t include region in child after fork() |
| MADV_DOFORK | 2.6.16 | Undo MADV_DONTFORK |
| MADV_REMOVE | 2.6.16 | Free storage for a given range (tmpfs/shmem) |
| MADV_MERGEABLE | 2.6.32 | Enable KSM (Kernel Same-page Merging) for this region |
| MADV_UNMERGEABLE | 2.6.32 | Undo MADV_MERGEABLE |
| MADV_HWPOISON | 2.6.32 | Simulate hardware memory corruption (testing only) |
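As an illustration of one of these flags, the sketch below shows the effect of MADV_DONTFORK: a region marked this way is simply absent from the child's address space, so touching it there raises SIGSEGV. The function name `child_loses_dontfork_region` is hypothetical, and the demo assumes a Linux system where fork() is permitted.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo: returns 1 if a forked child was killed by SIGSEGV when
 * it touched a MADV_DONTFORK region, 0 if the child could read the region,
 * -1 on error.
 */
int child_loses_dontfork_region(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    volatile char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    buf[0] = 'x';

    if (madvise((void *)buf, len, MADV_DONTFORK) == -1) {
        munmap((void *)buf, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid == -1) {
        munmap((void *)buf, len);
        return -1;
    }
    if (pid == 0) {
        /* Child: make sure a fault terminates us with the default action. */
        signal(SIGSEGV, SIG_DFL);
        char c = buf[0];              /* region is not mapped in the child */
        _exit(c == 'x' ? 0 : 1);
    }

    /* Parent: the region is still mapped here; inspect how the child ended. */
    int status = 0;
    int waited = (waitpid(pid, &status, 0) == pid);
    munmap((void *)buf, len);
    if (!waited)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGSEGV;
}
```

A typical reason to want this is a buffer registered for DMA or holding secrets that a child process should never see; MADV_DOFORK restores the normal inheritance.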
Example 6: Real-World Combined Approach – Database Buffer Pool
This shows a realistic usage combining multiple advice flags for a database-style workload:
```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <string.h>

#define DB_FILE_SIZE (64 * 1024 * 1024)   /* 64 MB database file */
#define INDEX_SIZE   ( 4 * 1024 * 1024)   /* 4 MB index region */
#define LOG_SIZE     ( 8 * 1024 * 1024)   /* 8 MB WAL log region */

int main(void)
{
    /* Anonymous memory stands in for the mapped database file */
    char *db = mmap(NULL, DB_FILE_SIZE,
                    PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (db == MAP_FAILED) { perror("mmap"); exit(1); }

    /* All three regions start on page boundaries (multiples of 1 MB) */
    char *index = db;                              /* First 4 MB = index */
    char *log   = db + INDEX_SIZE;                 /* Next 8 MB = write-ahead log */
    char *data  = db + INDEX_SIZE + LOG_SIZE;      /* Rest = data pages */
    size_t data_size = DB_FILE_SIZE - INDEX_SIZE - LOG_SIZE;

    /* Initialize */
    memset(db, 0, DB_FILE_SIZE);

    /*
     * Index region: Random access (B-tree lookups)
     * -> Disable read-ahead, fetch one page per fault
     */
    if (madvise(index, INDEX_SIZE, MADV_RANDOM) == -1)
        perror("madvise RANDOM");
    printf("Index (%d MB): MADV_RANDOM (no read-ahead)\n",
           INDEX_SIZE / (1024 * 1024));

    /*
     * WAL log region: Sequential write, will be replayed linearly
     * -> Aggressive read-ahead when replaying
     */
    if (madvise(log, LOG_SIZE, MADV_SEQUENTIAL) == -1)
        perror("madvise SEQUENTIAL");
    printf("WAL log (%d MB): MADV_SEQUENTIAL (aggressive read-ahead)\n",
           LOG_SIZE / (1024 * 1024));

    /*
     * Data region we're about to query: Pre-load now
     * -> Read pages asynchronously in background
     */
    if (madvise(data, data_size, MADV_WILLNEED) == -1)
        perror("madvise WILLNEED");
    printf("Data (%zu MB): MADV_WILLNEED (async pre-load started)\n",
           data_size / (1024 * 1024));

    printf("\nAll memory advice applied. Access patterns optimized.\n");

    munmap(db, DB_FILE_SIZE);
    return 0;
}
```
Interview Questions & Answers
Q: What is madvise() and what are its main use cases?
madvise() is a system call that lets a process give the kernel hints about how it plans to access a region of virtual memory. The kernel can use these hints to optimize read-ahead, page-out behavior, and I/O patterns. Main use cases: speeding up sequential file processing with MADV_SEQUENTIAL; improving random-access performance with MADV_RANDOM by disabling wasteful read-ahead; pre-loading data into memory asynchronously with MADV_WILLNEED; and discarding unneeded pages with MADV_DONTNEED.
Q: What does MADV_DONTNEED do on Linux, and why is it dangerous?
On Linux, MADV_DONTNEED on a MAP_PRIVATE region immediately discards the pages, so any modifications are lost. The virtual address range remains accessible, but the next access triggers a page fault that reinitializes the page from the original file or with zeros (for anonymous mappings). This is dangerous because it is destructive, unlike on many other Unix systems where it is merely a swap-out hint. Portable applications should not rely on this behavior. Use it intentionally to “reset” a private anonymous mapping without writing zeros.
Q: What is the difference between MADV_WILLNEED and mlock()?
MADV_WILLNEED tells the kernel to start loading pages asynchronously, but the pages can still be swapped out later under memory pressure. There is no guarantee the pages will be in RAM when you actually access them; it is just a hint. mlock() provides a hard guarantee: once locked, pages will remain in RAM until explicitly unlocked, regardless of memory pressure. Use MADV_WILLNEED to hide latency for non-critical pre-loading; use mlock() when you need a deterministic guarantee that no page fault will occur.
Q: When should you use MADV_RANDOM versus MADV_SEQUENTIAL?
Use MADV_RANDOM when pages will be accessed in arbitrary order with no locality, for example hash table lookups, pointer-chasing through a tree, or random queries into a large dataset. Read-ahead would waste I/O bandwidth on pages you’ll never need. Use MADV_SEQUENTIAL when you will scan a region linearly from start to end exactly once, for example processing a log file, a video stream, or a backup. The kernel will aggressively pre-load pages ahead of your position and free already-read pages quickly to save RAM.
Q: What is the difference between madvise() and posix_madvise()?
Both functions give the kernel access pattern hints. madvise() is Linux/BSD-specific and uses MADV_* constants. posix_madvise() is the POSIX-standardized version with POSIX_MADV_* constants; in glibc it is implemented by calling madvise(). The key difference is POSIX_MADV_DONTNEED: POSIX requires that it must NOT affect program semantics, so since glibc 2.7, it does nothing. If you need the destructive discard behavior, you must use madvise(MADV_DONTNEED) directly.
Q: Does a successful madvise() call guarantee that the kernel acted on the hint?
No. The access-pattern advice flags are hints only, and the kernel is free to ignore them. The function returning 0 (success) means only that the system call was accepted without error; it does not mean the kernel acted on the hint or that pages are actually in memory. The only way to guarantee pages are in RAM is mlock(). That said, the Linux kernel generally does honor madvise() hints on real workloads.
Q: How can MADV_DONTNEED be used to zero a large region efficiently?
For a large MAP_PRIVATE | MAP_ANONYMOUS mapping, calling madvise(addr, len, MADV_DONTNEED) discards the pages immediately. The next access to any page in the range triggers a page fault that maps a fresh zero-filled page. This achieves the effect of zeroing the region without writing zeros up front: the kernel drops the page table entries and frees the physical pages, and zero-filling happens lazily, only for pages that are touched again. For large regions this can be much cheaper than memset(buf, 0, large_size), especially when only part of the region is reused. This is used in some allocator implementations to efficiently “free” large memory pools.
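Since a successful madvise() call says nothing about residency, it can help to measure residency directly. The following minimal sketch uses the Linux mincore() system call for that; `resident_pages` is a hypothetical helper name, and the result is only a snapshot that may be stale by the time it is read.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: count how many pages of [addr, addr + len) are
 * resident in RAM according to mincore(). addr must be page-aligned.
 * Returns the page count, or -1 on error.
 */
long resident_pages(void *addr, size_t len)
{
    if (len == 0)
        return 0;

    size_t page   = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;

    unsigned char *vec = malloc(npages);   /* one status byte per page */
    if (vec == NULL)
        return -1;

    if (mincore(addr, len, vec) == -1) {
        free(vec);
        return -1;
    }

    long count = 0;
    for (size_t i = 0; i < npages; i++)
        count += vec[i] & 1;               /* lowest bit set = resident */

    free(vec);
    return count;
}
```

Calling `resident_pages(buf, len)` before and after `madvise(buf, len, MADV_DONTNEED)` on a touched private anonymous mapping would typically show the count dropping from every page to zero on Linux, which makes the lazy zero-fill described above visible.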
Topic Summary
- madvise(addr, len, advice) gives the kernel access-pattern hints.
- MADV_NORMAL: default; MADV_RANDOM: disable read-ahead; MADV_SEQUENTIAL: aggressive read-ahead.
- MADV_WILLNEED: pre-load now; MADV_DONTNEED: discard pages (Linux MAP_PRIVATE is destructive).
- Hints only: the kernel can ignore them; they do not guarantee residency (use mlock() for that).
- posix_madvise() is the portable POSIX version; POSIX_MADV_DONTNEED is safe (does nothing since glibc 2.7).
- Available since Linux kernel 2.4; requires _BSD_SOURCE or _GNU_SOURCE (_DEFAULT_SOURCE replaces _BSD_SOURCE on glibc 2.20 and later).
