VFS and Journaling File Systems in Linux Explained

 

How Linux unifies all filesystems with VFS — and how journaling survives crashes
Topics 5 & 6: VFS + Journaling
FS Types: ext3, ext4, XFS, Btrfs
Concept: Abstraction Layer

Key Terms:

VFS, Virtual File System, Abstraction Layer, Journaling, Transaction, fsck, ext3/ext4, XFS, Btrfs, FUSE

Part 1: The Virtual File System (VFS)

Linux supports many different filesystem types: ext4 on your hard disk, FAT32 on a USB, NFS over the network, tmpfs in RAM. Each has completely different internal code. How can a program use the same open(), read(), write() calls on all of them?

Answer: The Virtual File System (VFS) — a kernel abstraction layer that sits between user programs and the actual filesystem implementations.

📊 VFS Architecture

User Space Program
open("/mnt/usb/file.txt", O_RDONLY)
⬇ System Call Interface

🔀 Virtual File System (VFS)
Generic Interface: open, read, write, mkdir, unlink, stat, mmap …
⬇ Dispatches to correct driver

ext4: Local disk
XFS: Local disk
FAT32/NTFS: USB/Windows
NFS: Network
tmpfs: RAM
FUSE: Userspace

Hardware: HDD / SSD / Network / RAM / USB
🔧 How VFS Works — Key Principle

VFS defines a standard set of function pointers (operations) that every filesystem must implement. The kernel calls these generic functions; each filesystem provides its own version.

VFS defines:

  • file_operations struct
    (read, write, open, mmap…)
  • inode_operations struct
    (create, link, unlink, mkdir…)
  • super_operations struct
    (alloc_inode, write_inode, put_super, statfs…)

Each filesystem implements:

  • Its own versions of those functions
  • Registers them with VFS at init time
  • VFS calls the right implementation via function pointer
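📌 If a filesystem doesn't support an operation, it leaves the corresponding function pointer unset (or its handler returns an error), and the VFS passes an error code back to the application. For example, VFAT has no symlink operation, so symlink() on a VFAT mount fails with EPERM.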
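The dispatch idea can be sketched in ordinary userspace C. This is a toy model under stated assumptions, not kernel code: `toy_ops`, `toy_ext4`, `toy_vfat`, `toy_vfs_read` and `toy_vfs_symlink` are hypothetical names invented for illustration, and the real kernel structures carry many more operations. The error code returned for a missing operation is this sketch's choice.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Toy operation table, loosely modeled on the idea behind file_operations.
   All names are hypothetical; the real kernel structs are much larger. */
struct toy_ops {
    const char *fs_name;
    long (*read)(char *buf, size_t size);   /* returns bytes copied */
    /* Optional: NULL means this filesystem has no symlink support. */
    int (*symlink)(const char *target, const char *linkpath);
};

/* Helper: copy at most size bytes of src into buf. */
static long copy_out(const char *src, char *buf, size_t size) {
    size_t n = strlen(src);
    if (n > size) n = size;
    memcpy(buf, src, n);
    return (long)n;
}

/* Each "filesystem" supplies its own implementations. */
static long toy_ext4_read(char *buf, size_t size) {
    return copy_out("data from toy ext4", buf, size);
}
static long toy_vfat_read(char *buf, size_t size) {
    return copy_out("data from toy vfat", buf, size);
}
static int toy_ext4_symlink(const char *target, const char *linkpath) {
    (void)target; (void)linkpath;   /* pretend the link was created */
    return 0;
}

static const struct toy_ops toy_ext4 = { "toy-ext4", toy_ext4_read, toy_ext4_symlink };
static const struct toy_ops toy_vfat = { "toy-vfat", toy_vfat_read, NULL };

/* Generic layer: callers never name a filesystem-specific function. */
long toy_vfs_read(const struct toy_ops *fs, char *buf, size_t size) {
    assert(fs != NULL && fs->read != NULL);
    return fs->read(buf, size);              /* dispatch via pointer */
}

int toy_vfs_symlink(const struct toy_ops *fs, const char *target,
                    const char *linkpath) {
    assert(fs != NULL);
    if (fs->symlink == NULL)
        return -EPERM;    /* missing operation is reported by the generic layer */
    return fs->symlink(target, linkpath);
}
```

Calling `toy_vfs_read(&toy_ext4, buf, sizeof(buf))` and `toy_vfs_read(&toy_vfat, buf, sizeof(buf))` runs two different functions through the same call site, while `toy_vfs_symlink(&toy_vfat, "a", "b")` fails in the generic layer without ever reaching filesystem code.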

💻 Code Example 1: VFS in Action — Same Code, Different Filesystems
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/vfs.h>   /* statfs() and struct statfs */

/* This exact same code works on ext4, XFS, FAT32, NFS, tmpfs —
   VFS routes each call to the right driver automatically */

int copy_file(const char *src, const char *dst) {
    int fd_in, fd_out;
    char buf[4096];
    ssize_t nread;

    /* open() — VFS calls the right fs driver's open() */
    fd_in = open(src, O_RDONLY);
    if (fd_in == -1) { perror("open src"); return -1; }

    fd_out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd_out == -1) {
        perror("open dst");
        close(fd_in);
        return -1;
    }

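    /* read()/write(): VFS handles the routing.
       write() may accept fewer bytes than requested, so loop until done. */
    int rc = 0;
    while (rc == 0 && (nread = read(fd_in, buf, sizeof(buf))) > 0) {
        ssize_t off = 0;
        while (off < nread) {
            ssize_t nw = write(fd_out, buf + off, (size_t)(nread - off));
            if (nw == -1) { perror("write"); rc = -1; break; }
            off += nw;
        }
    }
    if (nread == -1) { perror("read"); rc = -1; }

    close(fd_in);
    close(fd_out);
    return rc;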
}

int main(void) {
    /* Copy across filesystems (for example ext4 to tmpfs);
       VFS handles both transparently */
    if (copy_file("/etc/hostname", "/tmp/hostname_copy") == -1)
        return 1;

    /* Check what filesystem /tmp is on (it is not tmpfs on every system) */
    struct statfs sf;
    if (statfs("/tmp", &sf) == 0 && sf.f_type == 0x01021994)  /* tmpfs magic */
        printf("/tmp is on tmpfs (RAM-based)\n");

    printf("File copied successfully via VFS\n");
    return 0;
}

/* The key point: you never write filesystem-specific code.
   VFS gives you one API for everything. */

💻 Code Example 2: FUSE — Implementing Your Own Filesystem in Userspace

FUSE (Filesystem in Userspace, added in Linux 2.6.14) lets you write a complete filesystem in a regular program — no kernel module needed.

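/* Minimal FUSE "hello world" filesystem (libfuse 3 API).
   Install libfuse3-dev (package name varies by distro), compile with:
     gcc -o hello hello_fs.c $(pkg-config fuse3 --cflags --libs)
   Mount with:   ./hello /mnt/myfs
   Unmount with: fusermount3 -u /mnt/myfs
*/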

#define FUSE_USE_VERSION 31
#include <fuse3/fuse.h>
#include <string.h>
#include <errno.h>

/* Our filesystem has one file: /hello.txt */

/* Called when ls runs — return file/dir names */
static int my_readdir(const char *path, void *buf,
                      fuse_fill_dir_t filler, off_t offset,
                      struct fuse_file_info *fi,
                      enum fuse_readdir_flags flags) {
    if (strcmp(path, "/") != 0)
        return -ENOENT;

    filler(buf, ".",        NULL, 0, 0);  /* current dir */
    filler(buf, "..",       NULL, 0, 0);  /* parent dir */
    filler(buf, "hello.txt", NULL, 0, 0); /* our file */
    return 0;
}

/* Called when stat() / ls -l runs on a file */
static int my_getattr(const char *path, struct stat *st,
                      struct fuse_file_info *fi) {
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, "/hello.txt") == 0) {
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = 13;  /* "Hello, FUSE!\n" */
    } else {
        return -ENOENT;
    }
    return 0;
}

/* Called when a program reads the file */
static int my_read(const char *path, char *buf, size_t size,
                   off_t offset, struct fuse_file_info *fi) {
    const char *content = "Hello, FUSE!\n";
    size_t len = strlen(content);

    if (strcmp(path, "/hello.txt") != 0) return -ENOENT;
    if (offset >= (off_t)len) return 0;

    if (offset + size > len) size = len - offset;
    memcpy(buf, content + offset, size);
    return size;
}

/* Register our functions with FUSE */
static const struct fuse_operations ops = {
    .getattr = my_getattr,
    .readdir = my_readdir,
    .read    = my_read,
};

int main(int argc, char *argv[]) {
    return fuse_main(argc, argv, &ops, NULL);
}

/* After mounting:
   cat /mnt/myfs/hello.txt  -> "Hello, FUSE!"
   ls /mnt/myfs             -> hello.txt
   All routing done by VFS/FUSE kernel hooks!
*/

Part 2: Journaling File Systems

ext2 has a classic problem: a system crash mid-write can leave the filesystem in an inconsistent state (e.g., a file with an inode but no data blocks, or a directory pointing to a deleted file). Journaling solves this.

⚠️ The Problem with Non-Journaling FS

What happens when you write a file:

1. Update i-node (file size, timestamps)
2. Write data to data blocks
3. Update free block bitmap
4. Update directory entry
💥 If the system crashes after step 1 but before step 4 — the filesystem is inconsistent. Data exists but isn’t findable, or vice versa.

The fsck fix (slow):

At next boot, fsck must scan the entire filesystem to find and fix inconsistencies.

  • Small FS: a few seconds
  • Multi-terabyte FS: can take hours, depending on disk speed and file count
  • Unacceptable for servers needing high availability
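To see why a full check scales with filesystem size, here is a toy model. It is a sketch under simplifying assumptions, not how a real fsck is implemented: `toy_fs`, `toy_inode`, `toy_fsck` and `toy_write_file` are hypothetical names, and the "filesystem" is just an in-memory inode array plus a block bitmap. The checker has to visit every inode to rebuild the expected bitmap before it can spot a mismatch left by an interrupted write.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_NBLOCKS 16
#define TOY_NINODES 4
#define TOY_MAXPTRS 4

struct toy_inode {
    bool in_use;
    int  nblocks;                 /* valid entries in blocks[] */
    int  blocks[TOY_MAXPTRS];     /* block numbers owned by this inode */
};

struct toy_fs {
    struct toy_inode inodes[TOY_NINODES];
    bool bitmap[TOY_NBLOCKS];     /* true = block marked allocated */
};

/* Full-scan check: walk EVERY inode, rebuild the bitmap that the inodes
   imply, and compare it with the stored one. Returns the number of
   mismatched blocks (0 means consistent). Cost grows with FS size. */
int toy_fsck(const struct toy_fs *fs) {
    bool expected[TOY_NBLOCKS] = { false };
    int errors = 0;

    assert(fs != NULL);
    for (int i = 0; i < TOY_NINODES; i++) {
        const struct toy_inode *ino = &fs->inodes[i];
        if (!ino->in_use)
            continue;
        for (int j = 0; j < ino->nblocks; j++)
            expected[ino->blocks[j]] = true;
    }
    for (int b = 0; b < TOY_NBLOCKS; b++)
        if (expected[b] != fs->bitmap[b])
            errors++;
    return errors;
}

/* Simulate a two-step metadata update that "crashes" after steps_done
   steps: 1 = inode updated only, 2 = inode and bitmap both updated. */
void toy_write_file(struct toy_fs *fs, int ino, int block, int steps_done) {
    assert(fs != NULL);
    assert(ino >= 0 && ino < TOY_NINODES);
    assert(block >= 0 && block < TOY_NBLOCKS);
    assert(fs->inodes[ino].nblocks < TOY_MAXPTRS);

    if (steps_done >= 1) {                 /* inode now points at the block */
        fs->inodes[ino].in_use = true;
        fs->inodes[ino].blocks[fs->inodes[ino].nblocks++] = block;
    }
    if (steps_done >= 2)                   /* bitmap marks the block as used */
        fs->bitmap[block] = true;
}
```

A completed write (`steps_done` of 2) leaves `toy_fsck()` at 0, while an interrupted one (`steps_done` of 1) makes it report one mismatched block. The only way to find that block is the scan over all inodes.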

📔 How Journaling Works
Journaling Write Flow

Step 1: Begin Transaction
Step 2: Write changes to Journal (log file on disk)
Step 3: Commit Transaction in Journal
Step 4: Apply changes to actual FS
Step 5: Mark transaction complete, clear journal

✅ Crash after Step 2: Transaction not yet committed → journal ignored. FS unchanged, consistent.
✅ Crash during Step 4: On reboot, journal has committed transaction → kernel replays it quickly. Seconds to recover.
✅ With journaling, after a crash, the kernel just replays the journal, so recovery takes seconds, not hours.
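The recovery rule (ignore uncommitted transactions, replay committed ones) can be sketched as a small function. This is an illustrative model only: `toy_record`, `toy_replay` and the integer "metadata slots" are hypothetical stand-ins, and a real journal stores block images on disk rather than an in-memory array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NSLOTS 8   /* pretend metadata: 8 integer slots */

enum toy_rec_type { TOY_UPDATE, TOY_COMMIT };

struct toy_record {
    enum toy_rec_type type;
    int txid;    /* transaction this record belongs to */
    int slot;    /* slot to change (TOY_UPDATE only) */
    int value;   /* new value      (TOY_UPDATE only) */
};

/* A transaction counts as committed only if its commit record made it
   into the log before the crash. */
static bool toy_is_committed(const struct toy_record *log, size_t n, int txid) {
    for (size_t i = 0; i < n; i++)
        if (log[i].type == TOY_COMMIT && log[i].txid == txid)
            return true;
    return false;
}

/* Recovery: apply updates of committed transactions in log order and
   skip everything else. Returns the number of updates replayed. */
int toy_replay(const struct toy_record *log, size_t n, int slots[TOY_NSLOTS]) {
    int applied = 0;

    assert(log != NULL || n == 0);
    assert(slots != NULL);
    for (size_t i = 0; i < n; i++) {
        if (log[i].type != TOY_UPDATE)
            continue;
        if (!toy_is_committed(log, n, log[i].txid))
            continue;                      /* crash hit before commit */
        assert(log[i].slot >= 0 && log[i].slot < TOY_NSLOTS);
        slots[log[i].slot] = log[i].value;
        applied++;
    }
    return applied;
}
```

Because each update simply overwrites a slot with its logged value, running `toy_replay()` twice gives the same result as running it once. That property is what makes it safe for recovery itself to be interrupted and restarted.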

📊 Linux Journaling Filesystems Comparison
Filesystem | Added to Kernel | Journal Type | Notable Features
Reiserfs | 2.4.1 | Metadata only | Tail packing: packs small files into metadata blocks, saves space
ext3 | 2.4.15 | Metadata or Data | Easy migration from ext2; same on-disk format with journal added
XFS | 2.4.24 | Metadata | From SGI/Irix; excellent large file performance; wide-scale deployments
ext4 | 2.6.19+ | Metadata or Data | Extents, online defrag, nsec timestamps, larger max file/FS size; default on most distros
Btrfs | 2.6.29 | COW (always) | Writable snapshots, checksums, online defrag, RAID built-in, subvolumes

💻 Code Example 3: Check Filesystem Type
#include <stdio.h>
#include <sys/vfs.h>

/* Filesystem magic numbers (see <linux/magic.h>) */
#define EXT2_MAGIC     0xef53
#define EXT4_MAGIC     0xef53  /* same magic! ext3/ext4 extend ext2 */
#define XFS_MAGIC      0x58465342
#define BTRFS_MAGIC    0x9123683e
#define TMPFS_MAGIC    0x01021994
#define NFS_MAGIC      0x6969
#define FAT_MAGIC      0x4d44
#define REISERFS_MAGIC 0x52654973
#define PROC_MAGIC     0x9fa0
#define SYSFS_MAGIC    0x62656572

/* Map a statfs() f_type value to a human-readable name */
const char *fs_name(unsigned long magic) {
    switch (magic) {
        case EXT2_MAGIC:     return "ext2/ext3/ext4";
        case XFS_MAGIC:      return "XFS";
        case BTRFS_MAGIC:    return "Btrfs";
        case TMPFS_MAGIC:    return "tmpfs";
        case NFS_MAGIC:      return "NFS";
        case FAT_MAGIC:      return "FAT";
        case REISERFS_MAGIC: return "ReiserFS";
        case PROC_MAGIC:     return "proc";
        case SYSFS_MAGIC:    return "sysfs";
        default:             return "unknown";
    }
}

int main(void) {
    struct statfs sf;
    const char *paths[] = {"/", "/tmp", "/proc", "/sys", NULL};
    const char **p;

    for (p = paths; *p; p++) {
        if (statfs(*p, &sf) == 0) {
            /* f_type's exact integer type varies by architecture */
            unsigned long magic = (unsigned long)sf.f_type;
            printf("%-10s -> %s (magic=0x%lx)\n", *p, fs_name(magic), magic);
        } else {
            perror(*p);
        }
    }
    return 0;
}

/* Typical output (varies by system; /tmp is not always tmpfs):
/          -> ext2/ext3/ext4 (magic=0xef53)
/tmp       -> tmpfs (magic=0x1021994)
/proc      -> proc (magic=0x9fa0)
/sys       -> sysfs (magic=0x62656572)
*/
💻 Code Example 4: Simulating a Journal — Write-Ahead Logging Concept
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

/* Simplified demo of write-ahead logging concept.
   Real journaling is in the kernel, but this shows the IDEA. */

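typedef struct {
    int  transaction_id;
    char operation[32];  /* e.g., "write_inode" */
    char data[64];       /* simplified data */
    int  committed;      /* 0=pending change, 1=commit record */
} JournalEntry;

/* Append one record to the journal and force it to disk */
int write_to_journal(const char *journal_path,
                     const JournalEntry *entry) {
    int fd = open(journal_path, O_WRONLY | O_APPEND | O_CREAT, 0644);
    if (fd == -1) return -1;

    /* Step 1: Write the entry to journal */
    if (write(fd, entry, sizeof(*entry)) != (ssize_t)sizeof(*entry)) {
        close(fd);
        return -1;
    }

    /* Step 2: Flush to disk BEFORE applying to real FS */
    if (fsync(fd) == -1) {  /* Force kernel to flush to disk NOW */
        close(fd);
        return -1;
    }

    close(fd);
    printf("[JOURNAL] Wrote tx#%d: %s\n",
           entry->transaction_id, entry->operation);
    return 0;
}

/* A commit only counts once it is in the journal on disk, so the
   commit marker is appended as its own record */
int commit_transaction(const char *journal_path, int transaction_id) {
    JournalEntry commit = {0};

    commit.transaction_id = transaction_id;
    strncpy(commit.operation, "commit", sizeof(commit.operation) - 1);
    commit.committed = 1;
    if (write_to_journal(journal_path, &commit) == -1) return -1;

    printf("[JOURNAL] Committed tx#%d\n", transaction_id);
    /* Now safe to apply to actual filesystem */
    return 0;
}

int main(void) {
    const char *journal = "/tmp/demo_journal.log";
    /* Zero-initialized so the strings stay NUL-terminated and no
       uninitialized bytes are written to the log */
    JournalEntry upd1 = {0}, upd2 = {0};

    /* Simulate creating a file: 2 metadata updates in ONE transaction */
    upd1.transaction_id = 1;
    strncpy(upd1.operation, "update_inode_size", sizeof(upd1.operation) - 1);
    strncpy(upd1.data,      "size=1024",         sizeof(upd1.data) - 1);
    if (write_to_journal(journal, &upd1) == -1) { perror("journal"); return 1; }

    upd2.transaction_id = 1;
    strncpy(upd2.operation, "update_dir_entry", sizeof(upd2.operation) - 1);
    strncpy(upd2.data,      "name=newfile.txt", sizeof(upd2.data) - 1);
    if (write_to_journal(journal, &upd2) == -1) { perror("journal"); return 1; }

    /* One commit record covers both updates, so they take effect together */
    if (commit_transaction(journal, 1) == -1) { perror("commit"); return 1; }

    printf("\nIf crash occurs BEFORE commit: journal ignored (FS safe)\n");
    printf("If crash occurs AFTER commit:  journal replayed (FS recovered)\n");

    return 0;
}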

/* Key: fsync() is critical — it forces the journal write to
   physical disk, not just kernel buffer cache.
   Without fsync, the journal itself might be lost on crash. */

🎯 Interview Questions — VFS and Journaling

Q1. What is the Virtual File System (VFS) and why does Linux need it?

VFS is a kernel abstraction layer that provides a single, uniform interface for all filesystem operations. Without it, every program would need to know the internal details of each filesystem. VFS defines standard operation tables (function pointers) that each filesystem implements. Programs call VFS functions; VFS dispatches to the correct driver.

Q2. What are the key data structures in VFS?

The four main VFS objects are: superblock (represents a mounted filesystem), inode (represents a file or directory), dentry (directory entry, maps a name to an inode), and file (represents an open file, created by open()). Each has an associated operations structure (super_operations, inode_operations, dentry_operations, file_operations) filled with function pointers by the specific filesystem driver.
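The split between names (dentries) and inodes is visible from userspace through hard links: two different paths can resolve to one inode. The sketch below uses only standard POSIX calls (`open`, `link`, `stat`); the helper names `same_inode` and `make_hard_link_pair` are made up for this example, and it assumes both paths sit on a filesystem that supports hard links.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if both paths resolve to the same inode on the same device,
   0 if they do not, -1 if either stat() fails. */
int same_inode(const char *a, const char *b) {
    struct stat sa, sb;

    assert(a != NULL && b != NULL);
    if (stat(a, &sa) == -1 || stat(b, &sb) == -1)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/* Creates an empty file at path plus a second name (hard link) for it.
   Returns the inode's link count afterwards, or -1 on error. */
long make_hard_link_pair(const char *path, const char *linkpath) {
    struct stat st;
    int fd;

    assert(path != NULL && linkpath != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return -1;
    close(fd);

    if (link(path, linkpath) == -1) {   /* adds a directory entry, not a copy */
        unlink(path);
        return -1;
    }
    if (stat(path, &st) == -1)
        return -1;
    return (long)st.st_nlink;           /* names now pointing at the inode */
}
```

After `make_hard_link_pair("/tmp/a", "/tmp/b")` succeeds, `same_inode("/tmp/a", "/tmp/b")` returns 1 and the link count is 2: there are two directory entries but only one inode, and the file's data is freed only when the last name is removed.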

Q3. What is FUSE and why is it useful?

FUSE (Filesystem in Userspace) is a kernel mechanism that lets you implement a complete filesystem as a userspace program. VFS hooks route FS calls to your program via the FUSE kernel module. Useful for: filesystems based on archives (squashfuse), encrypted filesystems (encfs), network filesystems (sshfs), testing, and development without writing kernel code.

Q4. What problem does journaling solve?

A file write involves multiple separate disk operations (update inode, write data, update bitmap, update directory). A system crash mid-write leaves the filesystem inconsistent. Non-journaling filesystems (like ext2) must run fsck to scan and repair the entire filesystem — which can take hours for large disks. Journaling records operations to a log first, allowing quick replay/rollback on reboot (seconds).

Q5. Explain write-ahead logging in the context of journaling filesystems.

Write-ahead logging (WAL) means: before making any change to the actual filesystem, first write a record of the intended change to the journal (log file). Flush the journal to disk. Then apply the change. If a crash happens: if the journal entry isn't committed, ignore it (filesystem unchanged). If committed, replay it (recover quickly). The journal entry must hit physical disk before the actual change. Inside the kernel this ordering is enforced with disk cache flushes; in a userspace log the equivalent is fsync().

Q6. What is the difference between metadata journaling and data journaling?

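Metadata journaling logs only filesystem metadata (inode changes, directory entries, block allocation); file data itself is not journaled. This is faster, but recently written file data may be lost on crash. ext3/ext4 offer two metadata-only modes: data=ordered (the usual default) writes data blocks to disk before committing the metadata that refers to them, so a file never points at garbage after a crash, while data=writeback drops that ordering and can expose stale block contents. Data journaling (data=journal) logs both metadata AND file data. Safer but slower due to double-write (data goes to journal, then to filesystem). The mode is chosen with a mount option.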
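To check which mode a mounted filesystem is using, one option is to read the mount table with the glibc `getmntent()` family and look for a `data=` option. The helper below is a sketch: `journal_mode_of` is a hypothetical name, and it assumes the table is in the usual fstab-style format (as `/proc/mounts` is). Filesystems often omit default options from the table, so a missing `data=` entry usually means the default mode applies rather than that journaling is off.

```c
#include <assert.h>
#include <mntent.h>
#include <stdio.h>
#include <string.h>

/* Look up mountpoint in a mount table (for example "/proc/mounts") and
   copy the value of its data= option into mode.
   Returns 1 if an explicit data= option was found,
           0 if the mount exists but lists no data= option,
          -1 if the table cannot be opened or the mount is not listed. */
int journal_mode_of(const char *table, const char *mountpoint,
                    char *mode, size_t modelen) {
    FILE *fp;
    struct mntent *m;
    int result = -1;

    assert(table != NULL && mountpoint != NULL);
    assert(mode != NULL && modelen > 0);
    mode[0] = '\0';

    fp = setmntent(table, "r");
    if (fp == NULL)
        return -1;

    while ((m = getmntent(fp)) != NULL) {
        const char *p;

        if (strcmp(m->mnt_dir, mountpoint) != 0)
            continue;

        /* A later entry for the same mount point replaces an earlier one. */
        result = 0;
        mode[0] = '\0';

        /* Walk the comma-separated option string by hand. */
        p = m->mnt_opts;
        while (*p != '\0') {
            size_t n = strcspn(p, ",");          /* length of this option */
            if (n > 5 && strncmp(p, "data=", 5) == 0) {
                size_t vlen = n - 5;
                if (vlen >= modelen)
                    vlen = modelen - 1;          /* truncate to fit */
                memcpy(mode, p + 5, vlen);
                mode[vlen] = '\0';
                result = 1;
            }
            p += n;
            if (*p == ',')
                p++;
        }
    }
    endmntent(fp);
    return result;
}
```

For a table line such as `/dev/sdb1 /data ext4 rw,data=journal,noatime 0 0`, calling `journal_mode_of(table, "/data", mode, sizeof(mode))` returns 1 and fills `mode` with `journal`.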
