msync() – Synchronizing Mapped Regions

 

Chapter 49 – Topic 3 | MS_SYNC, MS_ASYNC, MS_INVALIDATE, Unified VM


The Problem msync() Solves

When you write to a MAP_SHARED mapping, the kernel eventually writes those changes to the underlying file on disk. But “eventually” can mean seconds or even minutes later — the kernel batches writes for efficiency. If the system crashes before flushing, your data is lost.

msync() gives you explicit control over when the flush happens. Databases, journaling systems, and any application requiring crash safety use it to ensure data is durable on disk before proceeding.

The msync() API
#include <sys/mman.h>

int msync(void   *addr,    /* Start of the mapped region to sync          */
          size_t  length,  /* Number of bytes to sync                     */
          int     flags);  /* MS_SYNC | MS_ASYNC | MS_INVALIDATE          */

/* Returns: 0 on success, -1 on error (errno set) */
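Important: The addr argument must be page-aligned (SUSv3); Linux fails with EINVAL if it is not. The length is internally rounded up to the next page boundary. Pass the address returned by mmap(), or another page-aligned address inside the mapping. In flags, specify exactly one of MS_SYNC or MS_ASYNC; MS_INVALIDATE may be OR'd with either.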

msync() Flags Explained

Flag: MS_SYNC
  • Behaviour: Waits until all modified pages have been physically written to the disk storage device.
  • When to use: Database commits, safety-critical writes, before shutdown.
  • Blocks? Yes – blocks until done

Flag: MS_ASYNC
  • Behaviour: Schedules a write of modified pages to disk. Returns immediately. Pages become visible to other processes reading the file via read() right away.
  • When to use: Best-effort flushing without stalling the process.
  • Blocks? No – returns immediately

Flag: MS_INVALIDATE
  • Behaviour: After flushing, marks cached pages as stale. On next access, pages are re-read from the file. Makes changes by other processes (that wrote via write()) visible in the mapping.
  • When to use: When another process has updated the file via write() and you need to see the new data.
  • Blocks? No additional blocking
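The flags combine in a restricted way: portable code passes exactly one of MS_SYNC or MS_ASYNC, optionally OR'd with MS_INVALIDATE. A minimal sketch of that rule as a check, where msync_flags_portable() is a hypothetical helper name rather than a library function:

```c
#include <stdbool.h>
#include <sys/mman.h>

/* Hypothetical helper: true if flags is a portable msync() combination,
 * i.e. exactly one of MS_SYNC / MS_ASYNC, optionally with MS_INVALIDATE. */
bool msync_flags_portable(int flags)
{
    /* Reject any bit other than the three documented flags. */
    if (flags & ~(MS_SYNC | MS_ASYNC | MS_INVALIDATE))
        return false;

    bool sync  = (flags & MS_SYNC) != 0;
    bool async = (flags & MS_ASYNC) != 0;

    /* Exactly one of the two must be set. */
    return sync != async;
}
```

Linux happens to accept a call with neither MS_SYNC nor MS_ASYNC, but other systems may reject it, which is why the check treats MS_INVALIDATE on its own as non-portable.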

MS_SYNC vs MS_ASYNC – Timeline Comparison
MS_SYNC: your code calls msync() → kernel writes dirty pages to disk → msync() returns (guaranteed on disk) → your code continues
MS_ASYNC: your code calls msync() → msync() returns immediately → your code continues → kernel writes later (pdflush/writeback)

Code Example 1 – msync() with MS_SYNC (blocking, guaranteed flush)
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <string.h>
#include <unistd.h>

#define MAP_SIZE 4096

int main(void)
{
    int fd;
    char *addr;

    fd = open("journal.dat", O_RDWR | O_CREAT, 0600);
    if (fd == -1) { perror("open"); exit(EXIT_FAILURE); }

    /* Ensure file is at least MAP_SIZE bytes */
    if (ftruncate(fd, MAP_SIZE) == -1) {
        perror("ftruncate"); exit(EXIT_FAILURE);
    }

    addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); }
    close(fd);

    /* Write data to the mapping – this modifies kernel page cache */
    strcpy(addr, "TRANSACTION: debit account=1234 amount=500");
    printf("Written to mapping: %s\n", addr);

    /* MS_SYNC: blocks until data is physically on disk.
     * Safe to power-cycle the machine after this returns.
     * Use this for database commits, journaling, critical checkpoints.
     */
    if (msync(addr, MAP_SIZE, MS_SYNC) == -1) {
        perror("msync MS_SYNC");
        exit(EXIT_FAILURE);
    }
    printf("MS_SYNC done: data is guaranteed on disk.\n");

    munmap(addr, MAP_SIZE);
    return 0;
}

Code Example 2 – msync() with MS_ASYNC (non-blocking, schedule a flush)
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <string.h>
#include <unistd.h>

#define MAP_SIZE 4096

int main(void)
{
    int fd;
    char *addr;

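    fd = open("log.dat", O_RDWR | O_CREAT, 0600);
    if (fd == -1) { perror("open"); exit(EXIT_FAILURE); }

    /* Ensure file is at least MAP_SIZE bytes */
    if (ftruncate(fd, MAP_SIZE) == -1) {
        perror("ftruncate"); exit(EXIT_FAILURE);
    }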

    addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); }
    close(fd);

    /* Write a log entry */
    snprintf(addr, MAP_SIZE, "LOG: process started pid=%d", (int)getpid());
    printf("Written: %s\n", addr);

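    /* MS_ASYNC: returns immediately.
     * Kernel will schedule the write via pdflush/writeback kernel threads.
     * (Since Linux 2.6.19 the call is effectively a no-op, because the
     * kernel already tracks dirty pages and writes them back as needed.)
     * The changes are visible to other processes doing read() on the file
     * right away, but may not yet be on physical disk.
     */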
    if (msync(addr, MAP_SIZE, MS_ASYNC) == -1) {
        perror("msync MS_ASYNC");
        exit(EXIT_FAILURE);
    }
    printf("MS_ASYNC: write scheduled, we did not block.\n");

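    /* If you also want a hard disk flush later, follow up with fsync()
     * on a file descriptor referring to the same file. It is opened
     * O_RDWR because some systems reject fsync() on a read-only descriptor.
     */
    int fd2 = open("log.dat", O_RDWR);
    if (fd2 == -1) { perror("open"); exit(EXIT_FAILURE); }
    if (fsync(fd2) == -1) { perror("fsync"); exit(EXIT_FAILURE); }
    printf("fsync() after MS_ASYNC: now on disk.\n");
    close(fd2);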

    munmap(addr, MAP_SIZE);
    return 0;
}

Code Example 3 – MS_INVALIDATE: refresh mapping after another process wrote to the file
/*
 * Scenario: Process A has a MAP_SHARED mapping of shared.dat.
 * Process B uses write() to update the same file.
 * Without MS_INVALIDATE, Process A might still see the old cached data.
 * MS_INVALIDATE tells the kernel: "discard my cached pages, re-read from file".
 *
 * On Linux with a unified VM system this is usually not needed because
 * mmap() and read()/write() share the same page cache. But for
 * portability to non-unified systems (older UNIX), you should use it.
 */

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <string.h>
#include <unistd.h>

#define MAP_SIZE 256

/* Simulates Process A (reader via mmap) */
int main(void)
{
    int fd;
    char *addr;

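    fd = open("shared.dat", O_RDWR | O_CREAT, 0600);
    if (fd == -1) { perror("open"); exit(EXIT_FAILURE); }
    if (ftruncate(fd, MAP_SIZE) == -1) {
        perror("ftruncate"); exit(EXIT_FAILURE);
    }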

    addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); }

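    /* Initial content, flushed straight away so the mapping holds no dirty
     * page that a later msync() could write back over the external update.
     */
    memcpy(addr, "INITIAL DATA", 13);
    if (msync(addr, MAP_SIZE, MS_SYNC) == -1) {
        perror("msync MS_SYNC");
        exit(EXIT_FAILURE);
    }
    printf("Before external write: '%s'\n", addr);

    /* Simulate another process updating the file using write():
     * In real code, this would be a separate process.
     */
    if (lseek(fd, 0, SEEK_SET) == -1) { perror("lseek"); exit(EXIT_FAILURE); }
    if (write(fd, "UPDATED BY WRITER", 18) != 18) {
        perror("write"); exit(EXIT_FAILURE);
    }

    /* Without MS_INVALIDATE, we might still see "INITIAL DATA"
     * in the mapping on a non-unified VM system.
     * MS_INVALIDATE discards stale cached pages and re-reads from file.
     * POSIX requires MS_SYNC or MS_ASYNC alongside it; Linux alone
     * accepts MS_INVALIDATE by itself.
     */
    if (msync(addr, MAP_SIZE, MS_SYNC | MS_INVALIDATE) == -1) {
        perror("msync MS_INVALIDATE");
        exit(EXIT_FAILURE);
    }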

    printf("After MS_INVALIDATE: '%s'\n", addr);
    /* Should now show "UPDATED BY WRITER" */

    close(fd);
    munmap(addr, MAP_SIZE);
    return 0;
}

Code Example 4 – Combining MS_SYNC and MS_INVALIDATE for two-way sync
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <string.h>
#include <unistd.h>

#define MAP_SIZE 4096

/*
 * Full bidirectional sync pattern:
 * 1. Flush our dirty pages to disk (MS_SYNC).
 * 2. Invalidate our cached pages so we pick up changes made
 *    by other writers (MS_INVALIDATE).
 * Both flags can be OR'd together in one call.
 */
int main(void)
{
    int fd;
    char *addr;

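    fd = open("db_page.dat", O_RDWR | O_CREAT, 0600);
    if (fd == -1) { perror("open"); exit(EXIT_FAILURE); }
    if (ftruncate(fd, MAP_SIZE) == -1) {
        perror("ftruncate"); exit(EXIT_FAILURE);
    }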

    addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); }
    close(fd);

    /* Write a database record into the page */
    snprintf(addr, MAP_SIZE,
             "DB_RECORD: id=42 value=99 checksum=0xDEAD");

    /* Sync: flush our writes AND invalidate stale cache in one call.
     * After this:
     *   - Our data is on disk (MS_SYNC).
     *   - Any external updates to the file are visible to us (MS_INVALIDATE).
     */
    if (msync(addr, MAP_SIZE, MS_SYNC | MS_INVALIDATE) == -1) {
        perror("msync");
        exit(EXIT_FAILURE);
    }

    printf("Record synced and cache refreshed: %s\n", addr);

    munmap(addr, MAP_SIZE);
    return 0;
}

Linux Unified Virtual Memory System

On Linux, mmap() and read()/write() share the same page cache. There is only one copy of a file’s data in memory, whether you accessed it via a mapping or via system calls. This is called a unified virtual memory (VM) system.

Linux Unified VM – One page cache, two access paths
Process A: uses mmap() (direct page access) ↔ Kernel Page Cache (single copy of file data) ↔ Process B: uses read()/write() (via system call)
mmap and read/write see the same pages. On Linux: consistent views guaranteed. msync() only needed to push data from kernel cache → disk.

The practical consequence for Linux programmers:

  • You do not need MS_INVALIDATE to see writes made by other processes via write() — Linux already keeps the mapping and page cache in sync.
  • You do not need msync() at all for visibility between processes — only for disk durability.
  • However, for portability to non-Linux UNIX systems (which may not have a unified VM), you should still call msync() and use MS_INVALIDATE appropriately.
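The visibility claims in the list above can be checked with a short experiment. This is a sketch under the assumption of a unified page cache, as on Linux; views_are_coherent() is a hypothetical name, and on a non-unified system it could legitimately return 0.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical check: returns 1 if, with no msync() call at all,
 *   (a) data written with pwrite() shows up in a MAP_SHARED mapping, and
 *   (b) data stored through the mapping shows up in pread().
 * Returns 0 if either view is stale, -1 on error. */
int views_are_coherent(const char *path)
{
    char buf[4];
    int seen_in_map, seen_by_read;
    int result = -1;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;
    if (ftruncate(fd, 4096) == -1) {
        close(fd);
        return -1;
    }

    char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* Direction (a): system call write, then read through the mapping. */
    if (pwrite(fd, "AAAA", 4, 0) != 4)
        goto out;
    seen_in_map = (memcmp(map, "AAAA", 4) == 0);

    /* Direction (b): store through the mapping, then read via system call. */
    memcpy(map + 4, "BBBB", 4);
    if (pread(fd, buf, 4, 4) != 4)
        goto out;
    seen_by_read = (memcmp(buf, "BBBB", 4) == 0);

    result = seen_in_map && seen_by_read;
out:
    munmap(map, 4096);
    close(fd);
    return result;
}
```

Note that this says nothing about durability: both views agree because they share the page cache, not because anything has reached the disk.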

Code Example 5 – Three ways to ensure mapped data is on disk (comparison)
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <string.h>
#include <unistd.h>

#define MAP_SIZE 4096

int main(void)
{
    int fd;
    char *addr;

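    fd = open("sync_test.dat", O_RDWR | O_CREAT, 0600);
    if (fd == -1) { perror("open"); exit(EXIT_FAILURE); }
    if (ftruncate(fd, MAP_SIZE) == -1) {
        perror("ftruncate"); exit(EXIT_FAILURE);
    }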

    addr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); }

    strcpy(addr, "Test data for sync comparison");

    /* Method 1: msync(MS_SYNC)
     * Synchronous flush. Blocks until disk write is complete.
     * Works directly on the memory address.
     */
    if (msync(addr, MAP_SIZE, MS_SYNC) == 0)
        printf("Method 1: msync(MS_SYNC) – blocking flush – done.\n");

    /* Method 2: msync(MS_ASYNC) followed by fsync()
     * First schedules the write (non-blocking),
     * then fsync() blocks until the file descriptor's data hits disk.
     * (fsync() itself is standard; that it also flushes pages modified
     * through a mapping is Linux behaviour the standard does not promise.)
     */
    msync(addr, MAP_SIZE, MS_ASYNC);
    if (fsync(fd) == 0)
        printf("Method 2: msync(MS_ASYNC) + fsync() – done.\n");

    /* Method 3: msync(MS_ASYNC) followed by fdatasync()
     * fdatasync() is like fsync() but skips metadata (timestamps, etc.)
     * Faster for pure data durability.
     */
    msync(addr, MAP_SIZE, MS_ASYNC);
    if (fdatasync(fd) == 0)
        printf("Method 3: msync(MS_ASYNC) + fdatasync() – done (faster).\n");

    close(fd);
    munmap(addr, MAP_SIZE);
    return 0;
}

Code Example 6 – File-backed shared memory IPC with msync() (producer side)
/*
 * Producer process: writes records to a shared file via mmap().
 * Calls msync(MS_SYNC) after each record to ensure durability.
 * Consumer process (not shown) maps the same file and reads records.
 *
 * This pattern is used in file-backed IPC, embedded databases,
 * and any shared-memory-like design that survives process restart.
 */

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <string.h>
#include <unistd.h>
#include <time.h>

#define MAX_RECORDS  10
#define RECORD_SIZE  64
#define MAP_SIZE     (MAX_RECORDS * RECORD_SIZE)

/* Simple record structure */
typedef struct {
    int  id;
    char message[RECORD_SIZE - sizeof(int)];  /* keeps sizeof(Record) == RECORD_SIZE */
} Record;

int main(void)
{
    int fd;
    Record *records;
    int i;

    fd = open("ipc_file.dat", O_RDWR | O_CREAT, 0600);
    if (fd == -1) { perror("open"); exit(EXIT_FAILURE); }
    if (ftruncate(fd, MAP_SIZE) == -1) {
        perror("ftruncate"); exit(EXIT_FAILURE);
    }

    records = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (records == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); }
    close(fd);

    /* Write records one by one, syncing each to disk */
    for (i = 0; i < MAX_RECORDS; i++) {
        records[i].id = i + 1;
        snprintf(records[i].message, sizeof(records[i].message),
                 "Record %d from pid %d", i + 1, (int)getpid());

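        /* Sync the page(s) holding this record before writing the next.
         * msync() needs a page-aligned address, so round the record's
         * offset down to a page boundary and extend the length to match.
         */
        size_t page  = (size_t)sysconf(_SC_PAGESIZE);
        size_t off   = (size_t)i * sizeof(Record);
        size_t start = off & ~(page - 1);
        if (msync((char *)records + start, off - start + sizeof(Record),
                  MS_SYNC) == -1) {
            perror("msync");
            break;
        }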
        printf("Synced record %d: '%s'\n", records[i].id, records[i].message);
        usleep(100000);  /* 100ms delay to simulate real work */
    }

    printf("All records written and synced.\n");
    printf("Consumer can now read ipc_file.dat via mmap().\n");

    munmap(records, MAP_SIZE);
    return 0;
}

Code Example 7 – Syncing only a specific sub-region (partial msync)
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <string.h>
#include <unistd.h>

/*
 * msync() does not have to cover the entire mapping.
 * You can sync just the pages that were modified.
 * This is important for performance on large mappings (e.g., 1GB database file).
 */

#define TOTAL_SIZE  (1024 * 1024)   /* 1 MB mapping */
/* The page size is queried at run time with sysconf(), not hard-coded. */

int main(void)
{
    int fd;
    char *base;
    long page_size;

    page_size = sysconf(_SC_PAGESIZE);

    fd = open("large_file.dat", O_RDWR | O_CREAT, 0600);
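    if (fd == -1) { perror("open"); exit(EXIT_FAILURE); }
    if (ftruncate(fd, TOTAL_SIZE) == -1) {
        perror("ftruncate"); exit(EXIT_FAILURE);
    }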

    base = mmap(NULL, TOTAL_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); }
    close(fd);

    /* Modify only one page at offset 512KB */
    char *dirty_page = base + 512 * 1024;
    memcpy(dirty_page, "Modified page at 512KB", 23);

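    /* Sync ONLY the dirty page, not the entire 1MB.
     * addr must be page-aligned, so round the offset down to a page boundary
     * (base itself is page-aligned because it came from mmap()).
     */
    size_t offset = (size_t)(dirty_page - base);
    void *aligned_addr = base + (offset & ~((size_t)page_size - 1));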

    if (msync(aligned_addr, page_size, MS_SYNC) == -1) {
        perror("partial msync");
        exit(EXIT_FAILURE);
    }

    printf("Synced only 1 page (%ld bytes) instead of the full 1MB mapping.\n",
           page_size);
    printf("This is much faster for large files with sparse writes.\n");

    munmap(base, TOTAL_SIZE);
    return 0;
}

msync() Best Practices
  • Use MS_SYNC when you need a guarantee that data survived to disk (databases, journals, checkpoints).
  • Use MS_ASYNC when you want to hint the kernel to flush without stalling your thread.
  • Use MS_INVALIDATE when you want to pick up writes made to the file by other processes via write() — especially important for portability to non-Linux systems.
  • On Linux you can pair MS_ASYNC with fsync() or fdatasync() for a controlled flush.
  • The addr argument must be page-aligned. Pass the original mmap() return value, or round down to a page boundary yourself.
  • For large mappings with sparse writes, sync only the dirty pages — not the entire mapping.
  • Even though Linux’s unified VM makes msync() unnecessary for inter-process visibility, always use it in portable code.

Interview Questions – msync() and Mapped Region Synchronization
Q1. Why do you need msync() when the kernel already flushes dirty pages automatically?
The kernel flushes dirty pages on its own schedule (driven by dirty-ratio thresholds, pdflush/writeback threads, or memory pressure). This can take seconds. If the system crashes before the flush, data written to the mapping is lost. msync(MS_SYNC) gives you an explicit, synchronous guarantee: when it returns, the data is on the physical storage device. This is essential for any application requiring crash safety.
Q2. What is the difference between MS_SYNC and MS_ASYNC?
MS_SYNC is a blocking synchronous write. The call does not return until all modified pages have been physically written to disk. After MS_SYNC, the memory and the disk are guaranteed to be in sync.

MS_ASYNC is non-blocking. It schedules the write and returns immediately. The kernel’s writeback mechanism will flush the pages “soon.” After MS_ASYNC, the memory and the kernel buffer cache are in sync (other processes doing read() will see updated data), but the disk may not yet reflect the changes.

Q3. What does MS_INVALIDATE do and when would you use it?
MS_INVALIDATE marks the pages in the mapped region as invalid (stale). On the next access, the kernel re-fetches those pages from the underlying file. This makes changes written to the file by another process (via write()) visible inside the current process’s mapping. You use it when two or more processes interact with the same file — one via mapping (mmap) and another via I/O calls (write) — and you need the mapping to reflect the latest file content.
Q4. Does Linux need MS_INVALIDATE for mmap() and write() to see each other’s changes?
No, not on Linux. Linux uses a unified virtual memory system where mmap() and read()/write() share the same page cache. If Process B does write(fd, ...), Process A’s mapping of the same file will see the change immediately without needing MS_INVALIDATE. However, SUSv3 does not require this, and older or non-Linux UNIX systems may not have a unified VM. For portable code, use MS_INVALIDATE.
Q5. Can you pass a non-page-aligned address to msync()?
SUSv3 says addr must be page-aligned. Passing a non-aligned address may return EINVAL on many implementations, and Linux is one of them. SUSv4 only says that an implementation may require alignment, so behaviour for an unaligned address differs between systems; always pass a page-aligned address for portability. In practice, pass the original mmap() return value (which is always page-aligned) or round down with ((uintptr_t)addr & ~(pagesize-1)).
Q6. What happens after MS_ASYNC if you want the data on disk sooner?
You have two options on Linux after msync(MS_ASYNC):
1. Call fsync(fd) on the file descriptor. This blocks until the kernel buffer cache is written to disk (data + metadata).
2. Call fdatasync(fd). It is like fsync() but skips metadata that is not needed to read the data back (such as timestamps), making it faster when only data durability matters.
Both calls are standard, but relying on them to flush pages modified through a mapping goes beyond what SUSv3 guarantees, so treat the technique as non-standard.
Q7. A database uses mmap() for its page cache. Where should it call msync()?
At transaction commit time. The typical pattern:
1. Write the redo log page using mmap().
2. Call msync(log_page, page_size, MS_SYNC) to ensure the log is durable before modifying data pages.
3. Write data pages to the mapping.
4. Call msync(data_page, len, MS_SYNC) to commit data to disk.
Only MS_SYNC (not MS_ASYNC) provides the write-ahead logging guarantee that prevents data corruption on crash.
