Linux File I/O: Advanced Concepts – Part 2

dup/dup2 · pread/pwrite · Scatter-Gather I/O · Non-blocking · Large Files · Temp Files

Picking Up from Part 1

In Part 1 we covered atomicity, fcntl(), and the three-layer kernel model (fd → OFD → i-node). Now we go deeper into practical system calls that every Linux systems programmer — and interviewee — needs to know cold.

🔑 Key Terms Covered

dup(), dup2(), dup3(), pread(), pwrite(), readv(), writev(), O_NONBLOCK, _FILE_OFFSET_BITS, off64_t, /dev/fd, mkstemp(), tmpfile()

📋 1. Duplicating File Descriptors: dup(), dup2(), dup3()

What Does “Duplicate” Mean?

When you duplicate a file descriptor, you create a second fd number that points to the exact same Open File Description as the original. They share the file offset and status flags. Closing one does not close the other.

The classic use case: shell I/O redirection. When you run ./prog > out.txt 2>&1, the shell makes fd 2 (stderr) point to the same OFD as fd 1 (stdout) — which already points to out.txt. That is literally a dup2(1, 2) call.
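The sharing described above can be observed directly. The sketch below is a minimal illustration, not a canonical test: the function name `offset_seen_by_duplicate` and the scratch path passed to it are hypothetical. It writes through one descriptor, closes it, and then asks the duplicate for the current offset.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Writes 5 bytes through fd, closes fd, then reports the file offset as
 * seen through a duplicate. Returns that offset, or -1 on any error.
 * `path` is a scratch file chosen by the caller; it is removed again. */
off_t offset_seen_by_duplicate(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;

    int copy = dup(fd);                 /* second fd number, same OFD */
    if (copy == -1) {
        close(fd);
        unlink(path);
        return -1;
    }

    ssize_t n = write(fd, "hello", 5);  /* advances the shared offset */
    close(fd);                          /* the OFD stays open via copy */

    off_t seen = -1;
    if (n == 5)
        seen = lseek(copy, 0, SEEK_CUR);  /* query through the duplicate */

    close(copy);
    unlink(path);
    return seen;
}
```

Calling `offset_seen_by_duplicate("/tmp/dup_demo.tmp")` should return 5: the write went through the original descriptor, yet the duplicate sees the advanced offset and remains usable after the original is closed.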

#include <unistd.h>

int dup (int oldfd);
// Returns: lowest available new fd, or -1 on error

int dup2(int oldfd, int newfd);
// Returns: newfd, or -1 on error. Closes newfd first if it was open.

/* Linux-specific: dup3 lets you set O_CLOEXEC in one step */
int dup3(int oldfd, int newfd, int flags);

dup() vs dup2() — Key Differences
| Feature | dup(oldfd) | dup2(oldfd, newfd) |
|---|---|---|
| Which new fd? | Kernel picks lowest available | You specify exact number |
| If newfd already open? | N/A | Silently closes it first |
| oldfd == newfd? | N/A | No-op; returns newfd unchanged |
| close-on-exec flag? | Always OFF on new fd | Always OFF (use dup3 for ON) |

Practical Example: Redirect stdout to a File

The example below shows how to redirect a process’s standard output to a file programmatically — the same thing the shell does for > file.txt redirection.

/*
 * redirect_stdout.c
 * Redirect stdout (fd 1) to "output.txt" using dup2().
 * After dup2(), any printf() goes to the file, not the terminal.
 * Compile: gcc redirect_stdout.c -o redirect_stdout
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    /* Open file that will receive stdout */
    int filefd = open("output.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (filefd == -1) { perror("open"); exit(1); }

    /* Make fd 1 (stdout) point to the same OFD as filefd */
    if (dup2(filefd, STDOUT_FILENO) == -1) { perror("dup2"); exit(1); }

    /* filefd is no longer needed — the OFD stays open via fd 1 */
    close(filefd);

    /* This printf now writes to output.txt, not the terminal */
    printf("Hello from redirected stdout!\n");
    printf("Line 2 also goes to the file.\n");

    return 0;
}
/* Run: ./redirect_stdout && cat output.txt */
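The program above redirects stdout for the rest of its lifetime. When the redirection should only be temporary, one common approach is to save the original target with dup() first and put it back with dup2() afterwards. The following is a sketch under that assumption; `redirect_stdout` and `restore_stdout` are hypothetical helper names.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Points fd 1 at `path`. Returns a saved duplicate of the old stdout
 * (to be passed to restore_stdout), or -1 on error. */
int redirect_stdout(const char *path)
{
    /* Drain stdio's buffer so earlier output reaches the old target. */
    if (fflush(stdout) == EOF)
        return -1;

    int saved = dup(STDOUT_FILENO);
    if (saved == -1)
        return -1;

    int filefd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (filefd == -1) {
        close(saved);
        return -1;
    }

    if (dup2(filefd, STDOUT_FILENO) == -1) {
        close(filefd);
        close(saved);
        return -1;
    }
    close(filefd);          /* fd 1 keeps the OFD open */
    return saved;
}

/* Puts the saved stdout back on fd 1. Returns 0 on success, -1 on error. */
int restore_stdout(int saved)
{
    /* Push buffered output into the file before switching back. */
    int rc = (fflush(stdout) == EOF) ? -1 : 0;

    if (dup2(saved, STDOUT_FILENO) == -1)
        rc = -1;
    close(saved);
    return rc;
}
```

Typical use is `int saved = redirect_stdout("log.txt"); printf(...); restore_stdout(saved);`. The fflush() calls matter because stdio buffers output in user space; without them, text printed before the switch could end up in the wrong place.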

Redirect Both stdout and stderr to the Same File

/*
 * redirect_both.c
 * Equivalent of shell: ./prog > out.txt 2>&1
 * Both stdout and stderr go to out.txt.
 * Compile: gcc redirect_both.c -o redirect_both
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) { perror("open"); exit(1); }

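    /* stdout → out.txt */
    if (dup2(fd, STDOUT_FILENO) == -1) { perror("dup2 stdout"); exit(1); }
    /* stderr → same OFD → out.txt */
    if (dup2(fd, STDERR_FILENO) == -1) { perror("dup2 stderr"); exit(1); }
    close(fd);

    fprintf(stdout, "This is stdout\n");
    fprintf(stderr, "This is stderr\n");

    /* Both lines end up in out.txt and neither overwrites the other,
       because fd 1 and fd 2 share one OFD (one file offset).
       The order may surprise you: stdout is fully buffered when it
       refers to a file, so the unbuffered stderr line is usually
       written first. */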
    return 0;
}

After dup2(fd,1) and dup2(fd,2): fd 1 and fd 2 share one OFD

fd 1 (stdout) ──┐
                ├──→ OFD (offset, flags) ──→ inode (out.txt)
fd 2 (stderr) ──┘

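⚠️ Safety Tip: If newfd is already open, dup2(oldfd, newfd) closes it silently and discards any error that close() would have reported (for example, a deferred write error). In a single-threaded program you can call close(newfd) yourself first and check the result. In a multi-threaded program that is racy, because another thread may be handed the freed descriptor number before dup2() runs; the dup2(2) man page instead suggests duplicating newfd to a temporary descriptor, calling dup2(), and then checking close() on the temporary.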
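One way to keep the close error without opening a race window follows the pattern described in the dup2(2) man page: duplicate newfd to a temporary descriptor, let dup2() replace newfd atomically, then close the temporary and check the result. The wrapper name `dup2_checked` is hypothetical; treat this as a sketch of that pattern rather than a drop-in replacement.

```c
#include <errno.h>
#include <unistd.h>

/* Makes newfd refer to the same OFD as oldfd, and reports errors from
 * closing whatever newfd referred to before.
 * Returns 0 on success, -1 on error (errno is set). */
int dup2_checked(int oldfd, int newfd)
{
    /* Keep the old target of newfd alive so it can be closed explicitly.
     * EBADF just means newfd was not open, which is fine. */
    int tmp = dup(newfd);
    if (tmp == -1 && errno != EBADF)
        return -1;

    /* Atomically replace newfd; there is no moment at which it is free. */
    if (dup2(oldfd, newfd) == -1) {
        int saved_errno = errno;
        if (tmp != -1)
            close(tmp);
        errno = saved_errno;
        return -1;
    }

    /* Last reference to the old file: this close() can report the error
     * that dup2() would have discarded. */
    if (tmp != -1 && close(tmp) == -1)
        return -1;

    return 0;
}
```

The extra dup() costs one syscall but avoids the gap between close(newfd) and dup2() during which another thread could be given the same descriptor number.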

📍 2. pread() and pwrite() – I/O at a Specific Offset Without Moving the File Pointer

The Problem They Solve

In a multi-threaded program, all threads share the same fd table — including the same file offset in every OFD. If thread A does lseek() to position 100 and then thread B also does lseek() to position 200 before thread A calls read(), thread A will read from the wrong position. This is a race condition.

pread() and pwrite() solve this by making the offset a parameter, not shared state. They read/write at the given offset and leave the file’s OFD offset completely unchanged.

#include <unistd.h>

ssize_t pread (int fd, void *buf, size_t count, off_t offset);
ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);

/* File offset in OFD is NOT changed by these calls */

lseek+read vs pread in a Multi-threaded Program
❌ Unsafe: lseek + read (race)

lseek(fd, 100, SEEK_SET);
/* ← Thread B may lseek here! */
read(fd, buf, 50);           /* reads wrong offset */

✅ Safe: pread (atomic)

pread(fd, buf, 50, 100);     /* atomic: no race possible */
                             /* fd offset unchanged */

/*
 * pread_demo.c
 * Read different sections of a binary file simultaneously
 * without disturbing the shared file offset.
 * Compile: gcc pread_demo.c -o pread_demo
 *
 * Imagine a binary config file laid out as:
 *   bytes 0-3  : magic number
 *   bytes 4-7  : version
 *   bytes 8-11 : data length
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>

int main(void)
{
    /* Create a simple binary file */
    int fd = open("config.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) { perror("open"); exit(1); }

    uint32_t data[] = { 0xDEADBEEF, 2, 1024 }; /* magic, version, length */
    if (write(fd, data, sizeof(data)) != (ssize_t)sizeof(data)) {
        perror("write"); exit(1);
    }

    /* Now read specific fields using pread; no lseek needed */
    uint32_t magic, version, datalen;

    if (pread(fd, &magic,   sizeof(magic),   0) != (ssize_t)sizeof(magic)   ||  /* bytes 0-3  */
        pread(fd, &version, sizeof(version), 4) != (ssize_t)sizeof(version) ||  /* bytes 4-7  */
        pread(fd, &datalen, sizeof(datalen), 8) != (ssize_t)sizeof(datalen)) {  /* bytes 8-11 */
        perror("pread"); exit(1);
    }

    printf("Magic  : 0x%X\n", magic);
    printf("Version: %u\n",   version);
    printf("Length : %u\n",   datalen);

    /* File offset is still 12 (from the initial write).
       pread did NOT move it. */
    off_t pos = lseek(fd, 0, SEEK_CUR);
    printf("Current fd offset after pread calls: %ld\n", (long)pos);

    close(fd);
    return 0;
}

📦 3. Scatter-Gather I/O: readv() and writev()

The Concept in Plain English

Normally, read() fills one buffer and write() sends one buffer. But real programs often have data split across multiple variables or structs. Without scatter-gather, you would either:

  • Allocate one big temporary buffer, copy everything into it, then call write() — wasteful
  • Call write() multiple times — non-atomic, more syscall overhead

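writev() takes an array of buffers and writes them as one contiguous sequence with a single system call, so the data is not interleaved with output from other writers. readv() does the reverse: it fills multiple buffers, in order, from one read. These are called gather output (writev) and scatter input (readv). Like write(), writev() can transfer fewer bytes than requested, so check the return value.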

iovec Array → Scatter Input / Gather Output

iov[0]: iov_base → buf_A, iov_len = 4
iov[1]: iov_base → buf_B, iov_len = 8
iov[2]: iov_base → buf_C, iov_len = 16

writev(): writes 4+8+16 = 28 bytes atomically
readv(): fills buf_A (4), buf_B (8), buf_C (16) from file

#include <sys/uio.h>

ssize_t readv (int fd, const struct iovec *iov, int iovcnt);
ssize_t writev(int fd, const struct iovec *iov, int iovcnt);

/* struct iovec { void *iov_base; size_t iov_len; }; */

/*
 * writev_demo.c
 * Write a simple protocol packet (header + payload) atomically
 * without copying into a single buffer first.
 * Compile: gcc writev_demo.c -o writev_demo
 */
#include <stdio.h>
#include <stdint.h>   /* uint8_t, uint16_t, uint32_t */
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>

/* Simple custom protocol header */
struct pkt_header {
    uint8_t  type;
    uint16_t length;
    uint32_t seq_num;
} __attribute__((packed));

int main(void)
{
    struct pkt_header hdr = { .type = 1, .length = 13, .seq_num = 42 };
    char payload[] = "Hello, World!";

    struct iovec iov[2];
    iov[0].iov_base = &hdr;         /* first chunk: header */
    iov[0].iov_len  = sizeof(hdr);

    iov[1].iov_base = payload;      /* second chunk: payload */
    iov[1].iov_len  = strlen(payload);

    int fd = open("packet.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) { perror("open"); exit(1); }

    /* ONE system call writes both header and payload.
       No temp buffer. No two separate write() calls. */
    ssize_t written = writev(fd, iov, 2);
    if (written == -1) { perror("writev"); exit(1); }
    printf("Wrote %zd bytes (header %zu + payload %zu)\n",
           written, sizeof(hdr), strlen(payload));

    close(fd);
    return 0;
}

/*
 * readv_demo.c
 * Read the packet written above back into separate structs.
 * Compile: gcc readv_demo.c -o readv_demo
 */
#include <stdio.h>
#include <stdint.h>   /* uint8_t, uint16_t, uint32_t */
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>

struct pkt_header {
    uint8_t  type;
    uint16_t length;
    uint32_t seq_num;
} __attribute__((packed));

int main(void)
{
    struct pkt_header hdr;
    char payload[64];

    struct iovec iov[2];
    iov[0].iov_base = &hdr;
    iov[0].iov_len  = sizeof(hdr);
    iov[1].iov_base = payload;
    iov[1].iov_len  = sizeof(payload) - 1;

    int fd = open("packet.bin", O_RDONLY);
    if (fd == -1) { perror("open"); exit(1); }

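    /* ONE call scatters data into hdr AND payload */
    ssize_t nread = readv(fd, iov, 2);
    if (nread < (ssize_t)sizeof(hdr)) {
        fprintf(stderr, "readv: error or short read\n");
        exit(1);
    }

    /* Terminate the payload based on the bytes actually read, so a bad
       hdr.length in the file cannot index past the buffer */
    size_t paylen = (size_t)nread - sizeof(hdr);
    if (paylen > hdr.length) paylen = hdr.length;
    payload[paylen] = '\0';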

    printf("type=%u  length=%u  seq=%u\n", hdr.type, hdr.length, hdr.seq_num);
    printf("payload: %s\n", payload);

    close(fd);
    return 0;
}
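🧠 Interview Tip: “Is writev() truly atomic for regular files?” For regular files, the data from one writev() call is written as a single contiguous block that is not intermingled with output from other processes. That does not rule out a short write (for example, when the disk fills up), and for sockets and pipes partial writes are possible, so always check the return value.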
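Checking the return value usually means looping until every buffer has been sent. The sketch below shows one way to resume after a short writev() by advancing through the iovec array; the function name `writev_full` is hypothetical, and the function modifies the caller's array in place.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Writes every byte described by iov[0..iovcnt-1], retrying after short
 * writes and EINTR. The iov array is modified in place.
 * Returns 0 on success, -1 on error (errno is set by writev). */
int writev_full(int fd, struct iovec *iov, int iovcnt)
{
    while (iovcnt > 0) {
        ssize_t n = writev(fd, iov, iovcnt);
        if (n == -1) {
            if (errno == EINTR)
                continue;           /* interrupted before writing: retry */
            return -1;
        }

        size_t done = (size_t)n;

        /* Skip over buffers that were written completely. */
        while (iovcnt > 0 && done >= iov->iov_len) {
            done -= iov->iov_len;
            iov++;
            iovcnt--;
        }

        /* Step into a buffer that was written only partially. */
        if (iovcnt > 0) {
            iov->iov_base = (char *)iov->iov_base + done;
            iov->iov_len -= done;
        }
    }
    return 0;
}
```

On a non-blocking descriptor this returns -1 with errno set to EAGAIN when the kernel cannot accept more data; a caller would then wait for writability and call it again with the already-advanced array.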

preadv() / pwritev() – Best of Both Worlds

Linux 2.6.30+ adds preadv() and pwritev(): scatter-gather I/O plus a specified offset (like pread/pwrite). Useful for multi-threaded programs that need both features simultaneously.

#define _DEFAULT_SOURCE   /* glibc 2.19 and earlier: _BSD_SOURCE */
#include <sys/uio.h>

ssize_t preadv (int fd, const struct iovec *iov, int iovcnt, off_t offset);
ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset);

🚦 4. Non-blocking I/O – Don’t Wait, Come Back Later

The Blocking Problem

By default, reading from a pipe or FIFO with no data will make your process sleep (block) indefinitely. In a server handling many connections, one slow client could freeze the entire process.

O_NONBLOCK changes this: if the I/O cannot complete immediately, the system call returns right away with an error code (EAGAIN or EWOULDBLOCK — they are the same on Linux). Your program can then do other work and retry later.

Blocking vs Non-blocking read() on an Empty Pipe

Blocking (default):
  1. Process calls read()
  2. No data in pipe
  3. Process SLEEPS (blocked), waiting indefinitely
  4. Writer writes data
  5. read() returns

Non-blocking (O_NONBLOCK):
  1. Process calls read()
  2. No data in pipe
  3. read() returns -1, errno=EAGAIN (process continues!)
  4. Do other work…
  5. Retry read() later

/*
 * nonblock_pipe.c
 * Demonstrates non-blocking read on a pipe.
 * Compile: gcc nonblock_pipe.c -o nonblock_pipe
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>

int main(void)
{
    int pipefd[2];
    /* pipefd[0]=read end, pipefd[1]=write end */
    if (pipe(pipefd) == -1) { perror("pipe"); exit(1); }

    /* Enable O_NONBLOCK on the read end using fcntl */
    int flags = fcntl(pipefd[0], F_GETFL);
    if (flags == -1 || fcntl(pipefd[0], F_SETFL, flags | O_NONBLOCK) == -1) {
        perror("fcntl"); exit(1);
    }

    char buf[64];
    ssize_t n;

    /* First attempt: pipe is empty */
    n = read(pipefd[0], buf, sizeof(buf));
    if (n == -1 && errno == EAGAIN)
        printf("Attempt 1: No data yet (EAGAIN) — doing other work...\n");

    /* Write something to the pipe */
    write(pipefd[1], "ping", 4);

    /* Second attempt: data is now available.
       Read at most sizeof(buf) - 1 bytes so buf[n] = '\0' stays in bounds. */
    n = read(pipefd[0], buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("Attempt 2: Got data: '%s'\n", buf);
    }

    close(pipefd[0]);
    close(pipefd[1]);
    return 0;
}
📌 Note: O_NONBLOCK on regular disk files is generally ignored by the kernel because the buffer cache makes disk I/O appear instant. It matters for pipes, FIFOs, sockets, terminals, and device files.
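In real code, a non-blocking read() has four outcomes worth telling apart: data, "try again later", end of file, and a genuine error. The sketch below wraps that classification in two small helpers; the names `set_nonblocking`, `try_read`, and the `read_status` enum are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

enum read_status { READ_DATA, READ_WOULD_BLOCK, READ_EOF, READ_ERROR };

/* Adds O_NONBLOCK to the existing status flags of fd. Returns 0 or -1. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Attempts one read without waiting. On READ_DATA, *got holds the byte
 * count. Callers should pass len > 0, since a zero-length read also
 * returns 0 and would be reported as READ_EOF. */
enum read_status try_read(int fd, void *buf, size_t len, size_t *got)
{
    ssize_t n = read(fd, buf, len);
    if (n > 0) {
        *got = (size_t)n;
        return READ_DATA;
    }
    if (n == 0)
        return READ_EOF;             /* all writers have closed */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return READ_WOULD_BLOCK;     /* nothing available right now */
    return READ_ERROR;
}
```

Because O_NONBLOCK is a status flag stored in the OFD, setting it through one descriptor also affects every duplicate of that descriptor.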

🗃️ 5. Large File Support (LFS) – Files Bigger than 2 GB on 32-bit Systems

The Problem

On a 32-bit system, off_t (the file offset type) is a signed 32-bit integer. That limits offsets to 2 GB (2³¹ − 1 bytes). Modern log files, databases, and video files routinely exceed this. The Large File Summit (LFS) extensions solve this.

32-bit vs 64-bit File Offset Support
| Type / Function | 32-bit default | LFS 64-bit version |
|---|---|---|
| File offset type | off_t (32-bit) | off64_t (64-bit) |
| Max file size | 2 GB | 8 Exabytes (theoretical) |
| open() | open() | open64() or use macro |
| lseek() | lseek() | lseek64() |
| stat() | stat() | stat64() |

The Modern Way: _FILE_OFFSET_BITS=64

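The cleanest approach is to define _FILE_OFFSET_BITS=64 at compile time, either with -D_FILE_OFFSET_BITS=64 on the compiler command line or with a #define placed before the first #include. On 32-bit systems this makes off_t 64 bits wide and maps the ordinary file functions (open(), lseek(), stat(), …) to their 64-bit counterparts, so no source code changes are needed.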

/*
 * largefile_demo.c
 * Demonstrates writing at a very large offset (beyond 2 GB).
 * Compile: gcc -D_FILE_OFFSET_BITS=64 largefile_demo.c -o largefile_demo
 *
 * NOTE: This creates a sparse file. The actual disk usage is tiny
 *       because the kernel does not allocate blocks for the hole.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("bigfile.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) { perror("open"); exit(1); }

    /* Seek to 3 GB offset — impossible without LFS on 32-bit! */
    off_t big_offset = (off_t)3 * 1024 * 1024 * 1024;  /* 3 GB */

    if (lseek(fd, big_offset, SEEK_SET) == -1) {
        perror("lseek: maybe 32-bit without LFS?");
        exit(1);
    }

    if (write(fd, "END\n", 4) != 4) { perror("write"); exit(1); }

    printf("Wrote 4 bytes at offset %lld (3 GB)\n", (long long)big_offset);

    /* Verify: ls -lh bigfile.dat → shows ~3.0G but du -sh shows ~4K */
    close(fd);
    return 0;
}
/* Compile command:
   gcc -D_FILE_OFFSET_BITS=64 largefile_demo.c -o largefile_demo */
⚠️ printf with large offsets: When printing off_t values with LFS, cast to long long and use %lld. On some 32-bit platforms, off_t may be wider than long, so %ld will give wrong results.
printf("offset = %lld\n", (long long)offset);
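The ls versus du comparison mentioned in the demo can also be done programmatically with fstat(). The sketch below creates a file with a hole and reports both its apparent size and the space actually allocated. The function name `make_sparse` is hypothetical, and the conversion assumes st_blocks counts 512-byte units, which is the case on Linux.

```c
#ifndef _FILE_OFFSET_BITS
#define _FILE_OFFSET_BITS 64     /* must come before the first #include */
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Creates `path` containing a hole of `hole` bytes followed by 4 data
 * bytes, then reports the apparent size and the allocated byte count.
 * Returns 0 on success, -1 on error. */
int make_sparse(const char *path, off_t hole, off_t *size, off_t *allocated)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return -1;

    int rc = -1;
    struct stat st;

    /* Writing past end-of-file leaves a hole; no seek is required. */
    if (pwrite(fd, "END\n", 4, hole) == 4 && fstat(fd, &st) == 0) {
        *size = st.st_size;                       /* what ls -l shows  */
        *allocated = (off_t)st.st_blocks * 512;   /* roughly what du shows */
        rc = 0;
    }

    close(fd);
    return rc;
}
```

On a filesystem that supports holes, `allocated` should come out far smaller than `size`; the exact figure depends on the filesystem's block size.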

📁 6. /dev/fd – Accessing Your Own Open Files by Number

What is /dev/fd?

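The kernel provides a virtual directory /dev/fd that contains entries like /dev/fd/0, /dev/fd/1, /dev/fd/2, etc. Opening /dev/fd/N gives you a new file descriptor for the file behind fd N, which on many Unix systems is equivalent to calling dup(N). One Linux caveat: because the entries are symbolic links into /proc, opening /dev/fd/N for a regular file performs a fresh open() and yields a new OFD with its own offset and flags, rather than sharing the OFD of fd N.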

/dev/fd Equivalences
| Path | Meaning | Equivalent to |
|---|---|---|
| /dev/fd/0 = /dev/stdin | Standard input | dup(0) |
| /dev/fd/1 = /dev/stdout | Standard output | dup(1) |
| /dev/fd/2 = /dev/stderr | Standard error | dup(2) |
| /dev/fd/N | Any open fd | dup(N) |

Shell Use Case: Pass stdin as a Filename

Some command-line tools only accept filenames, not stdin. /dev/stdin bridges that gap:

## Shell examples using /dev/fd

# diff a live sorted output against a saved snapshot
# (diff expects two filenames — but one "file" is a pipe)
sort current_list.txt | diff /dev/fd/0 saved_snapshot.txt

# Use /dev/stdin where a tool only accepts filenames
echo "hello world" | wc -l /dev/stdin

# In C: open /dev/fd/1 to get another fd for stdout
# (similar in effect to dup(1))
#   int fd = open("/dev/fd/1", O_WRONLY);
📌 Internals: On Linux, /dev/fd is a symlink to /proc/self/fd. You can verify with ls -la /dev/fd. The /proc/self/fd/ directory contains one symlink per open file descriptor in your process.

🗑️ 7. Temporary Files – mkstemp() and tmpfile()

Why Not Just Use a Fixed Filename?

If two instances of your program both try to create /tmp/myapp.tmp, they collide. Worse, a malicious process could create that file first and trick your program into writing sensitive data there. Proper temp file APIs avoid both problems with a unique, random name and exclusive open.

mkstemp() vs tmpfile() Comparison
| Feature | mkstemp() | tmpfile() |
|---|---|---|
| Returns | int (file descriptor) | FILE* (stdio stream) |
| You know the filename? | Yes (template is filled in) | No (hidden) |
| Auto-delete on close? | No (you call unlink manually) | Yes, automatically deleted |
| Interface | POSIX library function, fd-level | ISO C library function, stdio-level |
| Permissions | 0600 (owner rw only) | Implementation-defined |

/*
 * temp_files.c
 * Demonstrates both mkstemp() and tmpfile() patterns.
 * Compile: gcc temp_files.c -o temp_files
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

/* ── mkstemp: use when you need the filename (e.g. pass to another process) ── */
void demo_mkstemp(void)
{
    /* Last 6 chars MUST be XXXXXX — replaced with unique random chars */
    char template[] = "/tmp/myapp-XXXXXX";

    int fd = mkstemp(template);
    if (fd == -1) { perror("mkstemp"); return; }

    printf("mkstemp created: %s\n", template);

    /* Unlink immediately so file is removed when fd is closed,
       even if the process crashes */
    unlink(template);

    /* Now use fd for I/O; the file is still accessible via fd */
    if (write(fd, "temporary data\n", 15) != 15) {
        perror("write"); close(fd); return;
    }

    /* Seek back and read */
    lseek(fd, 0, SEEK_SET);
    char buf[32] = {0};
    if (read(fd, buf, 15) != 15) {
        perror("read"); close(fd); return;
    }
    printf("mkstemp content: %s", buf);

    close(fd);
    /* File is now gone from filesystem (was unlinked earlier) */
}

/* ── tmpfile: use when you just need a throwaway scratch area ── */
void demo_tmpfile(void)
{
    FILE *fp = tmpfile();
    if (!fp) { perror("tmpfile"); return; }

    fprintf(fp, "scratch data line 1\n");
    fprintf(fp, "scratch data line 2\n");

    /* Rewind and read back */
    rewind(fp);
    char line[64];
    printf("\ntmpfile contents:\n");
    while (fgets(line, sizeof(line), fp))
        printf("  %s", line);

    fclose(fp);
    /* File is automatically deleted on fclose() */
}

int main(void)
{
    demo_mkstemp();
    demo_tmpfile();
    return 0;
}
📌 Security note: Avoid old functions like tmpnam(), tempnam(), and mktemp(). They return a name but do not open the file atomically, leaving a window for a symlink attack. Always use mkstemp() or tmpfile().

📋 8. Complete Quick Reference – All System Calls in This Series
System Calls & Flags — Linux Advanced File I/O
| Call / Flag | Signature | Key Behaviour |
|---|---|---|
| dup() | int dup(int oldfd) | Lowest free fd pointing to same OFD |
| dup2() | int dup2(int old, int new) | Exact fd number; closes newfd first |
| dup3() | int dup3(int old, int new, int flags) | dup2 + can set O_CLOEXEC atomically |
| pread() | ssize_t pread(fd, buf, n, off) | Read at offset; OFD position unchanged |
| pwrite() | ssize_t pwrite(fd, buf, n, off) | Write at offset; OFD position unchanged |
| readv() | ssize_t readv(fd, iov, iovcnt) | Scatter read into multiple buffers |
| writev() | ssize_t writev(fd, iov, iovcnt) | Gather write from multiple buffers, atomic |
| preadv() | ssize_t preadv(fd, iov, n, off) | readv + specified offset |
| pwritev() | ssize_t pwritev(fd, iov, n, off) | writev + specified offset |
| truncate() | int truncate(path, off_t len) | Set file size; pads with zeroes if growing |
| ftruncate() | int ftruncate(fd, off_t len) | Same as truncate but uses fd |
| mkstemp() | int mkstemp(char *template) | Unique temp file; returns fd; 0600 perms |
| tmpfile() | FILE* tmpfile(void) | Unique temp file; auto-deleted on close |

🎯 9. Top Interview Questions — File I/O Internals
Q: What is the difference between a file descriptor and an open file description?
A: A file descriptor is a small integer in your process’s fd table. It points to an Open File Description (OFD) in the kernel’s system-wide table. The OFD holds the actual state: offset, flags, and a pointer to the i-node. Multiple fds (even from different processes) can point to the same OFD.
Q: Why does O_APPEND need to be atomic?
A: Without atomicity, two processes could both lseek to end-of-file and then write, causing one to overwrite the other. O_APPEND makes seek-to-end + write a single uninterruptible operation so each write always lands after the previous one.
Q: How does 2>&1 shell redirection work at the syscall level?
A: The shell calls dup2(1, 2), which makes fd 2 (stderr) point to the same OFD as fd 1 (stdout). Because they share the same OFD, they share the same file offset — so writes from both are correctly interleaved without overwriting each other.
Q: Why use pread() instead of lseek() + read() in threads?
A: All threads share the same file descriptors and therefore the same OFD offsets. Between lseek() and read(), another thread can change the offset — a race condition. pread() takes the offset as a parameter and does not touch the OFD offset at all.
Q: What advantage does writev() have over multiple write() calls?
A: writev() is atomic (all buffers written as one unit), avoids the need for a temporary buffer, and incurs only one syscall overhead instead of N. Multiple write() calls could be interleaved with writes from other processes, and each carries its own syscall cost.

🎓 Continue Your Linux Journey

Next up: Process Management, Signals, and Pipes — the building blocks of every Linux shell and daemon.
