dup/dup2 · pread/pwrite · Scatter-Gather I/O · Non-blocking · Large Files · Temp Files
Picking Up from Part 1
In Part 1 we covered atomicity, fcntl(), and the three-layer kernel model (fd → OFD → i-node). Now we go deeper into practical system calls that every Linux systems programmer — and interviewee — needs to know cold.
🔑 Key Terms Covered
What Does “Duplicate” Mean?
When you duplicate a file descriptor, you create a second fd number that points to the exact same Open File Description as the original. They share the file offset and status flags. Closing one does not close the other.
The classic use case: shell I/O redirection. When you run ./prog > out.txt 2>&1, the shell makes fd 2 (stderr) point to the same OFD as fd 1 (stdout) — which already points to out.txt. That is literally a dup2(1, 2) call.
int dup (int oldfd);
// Returns: lowest available new fd, or -1 on error
int dup2(int oldfd, int newfd);
// Returns: newfd, or -1 on error. Closes newfd first if it was open.
/* Linux-specific: dup3 lets you set O_CLOEXEC in one step */
int dup3(int oldfd, int newfd, int flags);
| Feature | dup(oldfd) | dup2(oldfd, newfd) |
|---|---|---|
| Which new fd? | Kernel picks lowest available | You specify exact number |
| If newfd already open? | N/A | Silently closes it first |
| oldfd == newfd? | N/A | No-op; returns newfd unchanged |
| close-on-exec flag? | Always OFF on new fd | Always OFF (use dup3 for ON) |
Practical Example: Redirect stdout to a File
The example below shows how to redirect a process’s standard output to a file programmatically — the same thing the shell does for > file.txt redirection.
/*
* redirect_stdout.c
* Redirect stdout (fd 1) to "output.txt" using dup2().
* After dup2(), any printf() goes to the file, not the terminal.
* Compile: gcc redirect_stdout.c -o redirect_stdout
*/
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
int main(void)
{
/* Open file that will receive stdout */
int filefd = open("output.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
if (filefd == -1) { perror("open"); exit(1); }
/* Make fd 1 (stdout) point to the same OFD as filefd */
if (dup2(filefd, STDOUT_FILENO) == -1) { perror("dup2"); exit(1); }
/* filefd is no longer needed — the OFD stays open via fd 1 */
close(filefd);
/* This printf now writes to output.txt, not the terminal */
printf("Hello from redirected stdout!\n");
printf("Line 2 also goes to the file.\n");
return 0;
}
/* Run: ./redirect_stdout && cat output.txt */
Redirect Both stdout and stderr to the Same File
/*
* redirect_both.c
* Equivalent of shell: ./prog > out.txt 2>&1
* Both stdout and stderr go to out.txt.
* Compile: gcc redirect_both.c -o redirect_both
*/
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
int main(void)
{
int fd = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
if (fd == -1) { perror("open"); exit(1); }
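if (dup2(fd, STDOUT_FILENO) == -1) { perror("dup2"); exit(1); } /* stdout → out.txt */
if (dup2(fd, STDERR_FILENO) == -1) { perror("dup2"); exit(1); } /* stderr → same OFD → out.txt */
close(fd);
fprintf(stdout, "This is stdout\n");
fprintf(stderr, "This is stderr\n");
/* Both lines land in out.txt and neither overwrites the other,
because the two fds share one OFD (one file offset).
Their order may differ from the call order: stdout is fully
buffered when it refers to a file, while stderr is unbuffered. */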
return 0;
}
| File descriptor | Open File Description | i-node |
|---|---|---|
| fd 1 (stdout) | shared OFD (offset, flags) | out.txt |
| fd 2 (stderr) | same shared OFD | out.txt |
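
A redirection made with dup2() can also be undone if the original target is saved with dup() first. The sketch below is one possible way to do that, under stated assumptions: `with_stdout_to_file` and `say_hello` are hypothetical names introduced here, and the stdio buffer is flushed by hand around each switch so buffered text does not end up on the wrong side.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Runs fn() with stdout temporarily pointed at `path`, then restores it.
 * Returns 0 on success, -1 on error. */
int with_stdout_to_file(const char *path, void (*fn)(void))
{
    fflush(stdout);                       /* drain stdio buffer to the old target */
    int saved = dup(STDOUT_FILENO);       /* remember where fd 1 pointed */
    if (saved == -1) return -1;

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) { close(saved); return -1; }
    if (dup2(fd, STDOUT_FILENO) == -1) { close(fd); close(saved); return -1; }
    close(fd);                            /* fd 1 keeps the OFD open */

    fn();
    fflush(stdout);                       /* push buffered output into the file */

    int rc = (dup2(saved, STDOUT_FILENO) == -1) ? -1 : 0;   /* fd 1 back to original */
    close(saved);
    return rc;
}

/* Example callback (hypothetical) */
void say_hello(void) { printf("hello from fn\n"); }
```

Calling `with_stdout_to_file("out.txt", say_hello)` should leave the greeting in out.txt while later printf() calls go to the original stdout again.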
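Before calling dup2(oldfd, newfd), explicitly close(newfd) yourself if newfd might be open. The silent close inside dup2 swallows any close error (e.g., a deferred write error on a partially written file). Catching it manually is safer in a single-threaded program; with multiple threads, another thread could claim newfd between your close() and the dup2().
The Problem They Solve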
In a multi-threaded program, all threads share the same fd table, and therefore the same file offset in every OFD. If thread A calls lseek() to move to position 100 and thread B then calls lseek() to move to position 200 before thread A calls read(), thread A reads from the wrong position. This is a race condition.
pread() and pwrite() solve this by making the offset a parameter, not shared state. They read/write at the given offset and leave the file’s OFD offset completely unchanged.
ssize_t pread (int fd, void *buf, size_t count, off_t offset);
ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);
/* File offset in OFD is NOT changed by these calls */
| ❌ Unsafe: lseek + read (race) | ✅ Safe: pread (atomic) |
|---|---|
| lseek(fd, 100, SEEK_SET); /* ← Thread B may lseek here! */ read(fd, buf, 50); /* reads wrong offset */ | pread(fd, buf, 50, 100); /* atomic: no race possible */ /* fd offset unchanged */ |
/*
* pread_demo.c
* Read different sections of a binary file simultaneously
* without disturbing the shared file offset.
* Compile: gcc pread_demo.c -o pread_demo
*
* Imagine a binary config file laid out as:
* bytes 0-3 : magic number
* bytes 4-7 : version
* bytes 8-11 : data length
*/
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdint.h>
int main(void)
{
/* Create a simple binary file */
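int fd = open("config.bin", O_RDWR | O_CREAT | O_TRUNC, 0644);
if (fd == -1) { perror("open"); exit(1); }
uint32_t data[] = { 0xDEADBEEF, 2, 1024 }; /* magic, version, length */
if (write(fd, data, sizeof(data)) != (ssize_t)sizeof(data)) { perror("write"); exit(1); }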
/* Now read specific fields using pread — no lseek needed */
uint32_t magic, version, datalen;
pread(fd, &magic, sizeof(magic), 0); /* read bytes 0-3 */
pread(fd, &version, sizeof(version), 4); /* read bytes 4-7 */
pread(fd, &datalen, sizeof(datalen), 8); /* read bytes 8-11 */
printf("Magic : 0x%X\n", magic);
printf("Version: %u\n", version);
printf("Length : %u\n", datalen);
/* File offset is still 12 (from the initial write).
pread did NOT move it. */
off_t pos = lseek(fd, 0, SEEK_CUR);
printf("Current fd offset after pread calls: %ld\n", (long)pos);
close(fd);
return 0;
}
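Like read(), pread() may return fewer bytes than requested, which the demo above does not check. A small wrapper can retry until the request is satisfied. This is a sketch only; the name `pread_full` is hypothetical and not part of any library.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Reads up to `count` bytes starting at `offset`, retrying on short reads
 * and EINTR. Returns the number of bytes read (less than count only at
 * end of file), or -1 on error. The descriptor's own offset is never touched. */
ssize_t pread_full(int fd, void *buf, size_t count, off_t offset)
{
    char *p = buf;
    size_t done = 0;
    while (done < count) {
        ssize_t n = pread(fd, p + done, count - done, offset + (off_t)done);
        if (n == 0) break;                 /* end of file */
        if (n == -1) {
            if (errno == EINTR) continue;  /* interrupted by a signal: retry */
            return -1;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

Because the offset is recomputed from the argument on every iteration, several threads can call this on the same fd without coordinating.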
The Concept in Plain English
Normally, read() fills one buffer and write() sends one buffer. But real programs often have data split across multiple variables or structs. Without scatter-gather, you would either:
- Allocate one big temporary buffer, copy everything into it, then call write() — wasteful
- Call write() multiple times — non-atomic, more syscall overhead
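writev() takes an array of buffers and writes them all as one contiguous, atomic unit. readv() does the reverse: it fills multiple buffers, in order, from one read. This is called gather output and scatter input. Like write(), writev() can still return fewer bytes than requested, so the return value must be checked.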
| iovec entry | iov_base | iov_len |
|---|---|---|
| iov[0] | buf_A | 4 |
| iov[1] | buf_B | 8 |
| iov[2] | buf_C | 16 |

writev(): writes 4 + 8 + 16 = 28 bytes atomically.
readv(): fills buf_A (4), buf_B (8), buf_C (16) from the file, in that order.
ssize_t readv (int fd, const struct iovec *iov, int iovcnt);
ssize_t writev(int fd, const struct iovec *iov, int iovcnt);
/* struct iovec { void *iov_base; size_t iov_len; }; */
/*
* writev_demo.c
* Write a simple protocol packet (header + payload) atomically
* without copying into a single buffer first.
* Compile: gcc writev_demo.c -o writev_demo
*/
#include <stdint.h> /* uint8_t, uint16_t, uint32_t */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>
/* Simple custom protocol header */
struct pkt_header {
uint8_t type;
uint16_t length;
uint32_t seq_num;
} __attribute__((packed));
int main(void)
{
struct pkt_header hdr = { .type = 1, .length = 13, .seq_num = 42 };
char payload[] = "Hello, World!";
struct iovec iov[2];
iov[0].iov_base = &hdr; /* first chunk: header */
iov[0].iov_len = sizeof(hdr);
iov[1].iov_base = payload; /* second chunk: payload */
iov[1].iov_len = strlen(payload);
int fd = open("packet.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
if (fd == -1) { perror("open"); exit(1); }
/* ONE system call writes both header and payload atomically.
No temp buffer. No two separate write() calls. */
ssize_t written = writev(fd, iov, 2);
if (written == -1) { perror("writev"); exit(1); }
printf("Wrote %zd bytes (header %zu + payload %zu)\n",
written, sizeof(hdr), strlen(payload));
close(fd);
return 0;
}
/*
* readv_demo.c
* Read the packet written above back into separate structs.
* Compile: gcc readv_demo.c -o readv_demo
*/
#include <stdint.h> /* uint8_t, uint16_t, uint32_t */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>
struct pkt_header {
uint8_t type;
uint16_t length;
uint32_t seq_num;
} __attribute__((packed));
int main(void)
{
struct pkt_header hdr;
char payload[64];
struct iovec iov[2];
iov[0].iov_base = &hdr;
iov[0].iov_len = sizeof(hdr);
iov[1].iov_base = payload;
iov[1].iov_len = sizeof(payload) - 1;
int fd = open("packet.bin", O_RDONLY);
if (fd == -1) { perror("open"); exit(1); }
/* ONE call scatters data into hdr AND payload */
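ssize_t nread = readv(fd, iov, 2);
if (nread < (ssize_t)sizeof(hdr)) { fprintf(stderr, "readv: failed or short read\n"); exit(1); }
/* Terminate after the payload bytes actually read, never past the buffer */
size_t got = (size_t)nread - sizeof(hdr);
if (hdr.length < got) got = hdr.length;
payload[got] = '\0';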
printf("type=%u length=%u seq=%u\n", hdr.type, hdr.length, hdr.seq_num);
printf("payload: %s\n", payload);
close(fd);
return 0;
}
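The demos above assume writev() transfers everything in one call. On pipes, sockets, or a nearly full disk it can return a short count, and the caller then has to resume in the middle of the iovec array. The following is a sketch of one way to handle that; `writev_all` is a hypothetical helper name, and note that it modifies the caller's array as it advances.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Writes every byte described by iov[0..iovcnt-1], resuming after short
 * writes. Modifies the iov array while advancing. Returns 0 or -1. */
int writev_all(int fd, struct iovec *iov, int iovcnt)
{
    while (iovcnt > 0) {
        ssize_t n = writev(fd, iov, iovcnt);
        if (n == -1) {
            if (errno == EINTR) continue;   /* interrupted: try again */
            return -1;
        }
        size_t left = (size_t)n;
        /* Skip buffers that were written completely */
        while (iovcnt > 0 && left >= iov->iov_len) {
            left -= iov->iov_len;
            iov++;
            iovcnt--;
        }
        /* No progress although data remains: give up instead of spinning */
        if (n == 0 && iovcnt > 0) { errno = EIO; return -1; }
        /* Trim the partially written buffer, if any */
        if (iovcnt > 0 && left > 0) {
            iov->iov_base = (char *)iov->iov_base + left;
            iov->iov_len -= left;
        }
    }
    return 0;
}
```

Once a short write has happened, the remaining bytes go out in a later call, so the output as a whole is no longer a single atomic unit.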
preadv() / pwritev() – Best of Both Worlds
Linux 2.6.30+ adds preadv() and pwritev(): scatter-gather I/O plus a specified offset (like pread/pwrite). Useful for multi-threaded programs that need both features simultaneously.
#include <sys/uio.h>
ssize_t preadv (int fd, const struct iovec *iov, int iovcnt, off_t offset);
ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt, off_t offset);
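As a rough illustration of the two prototypes, the sketch below stores a 4-byte tag and a payload at a chosen offset and reads them back into separate buffers. The helper name `roundtrip_at` and the record layout are assumptions made for this example. With glibc, these functions are declared when `_DEFAULT_SOURCE` is in effect, which is the case for gcc's default GNU dialect.

```c
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Stores a 4-byte tag followed by a payload at `offset`, then reads both
 * back into separate buffers. Returns 0 on success, -1 on error or short I/O.
 * The descriptor's own file offset is left untouched. */
int roundtrip_at(int fd, off_t offset,
                 const char tag[4], const char *payload, size_t len,
                 char tag_out[4], char *payload_out)
{
    struct iovec out[2] = {
        { .iov_base = (void *)tag,     .iov_len = 4   },
        { .iov_base = (void *)payload, .iov_len = len },
    };
    if (pwritev(fd, out, 2, offset) != (ssize_t)(4 + len)) return -1;

    struct iovec in[2] = {
        { .iov_base = tag_out,     .iov_len = 4   },
        { .iov_base = payload_out, .iov_len = len },
    };
    if (preadv(fd, in, 2, offset) != (ssize_t)(4 + len)) return -1;
    return 0;
}
```

Since neither call moves the OFD offset, several threads can use the same fd for different records at the same time.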
The Blocking Problem
By default, reading from a pipe or FIFO with no data will make your process sleep (block) indefinitely. In a server handling many connections, one slow client could freeze the entire process.
O_NONBLOCK changes this: if the I/O cannot complete immediately, the system call returns right away with an error code (EAGAIN or EWOULDBLOCK — they are the same on Linux). Your program can then do other work and retry later.
| Blocking (default) | Non-blocking (O_NONBLOCK) |
|---|---|
| Process calls read() → No data in pipe → Process SLEEPS (blocked) → (waits indefinitely …) → Writer writes data → read() returns | Process calls read() → No data in pipe → read() returns -1, errno=EAGAIN → (process continues!) → Do other work… → Retry read() later |
/*
* nonblock_pipe.c
* Demonstrates non-blocking read on a pipe.
* Compile: gcc nonblock_pipe.c -o nonblock_pipe
*/
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <string.h>
int main(void)
{
int pipefd[2];
if (pipe(pipefd) == -1) { perror("pipe"); exit(1); } /* pipefd[0]=read end, pipefd[1]=write end */
/* Enable O_NONBLOCK on the read end using fcntl */
int flags = fcntl(pipefd[0], F_GETFL);
if (flags == -1 || fcntl(pipefd[0], F_SETFL, flags | O_NONBLOCK) == -1) { perror("fcntl"); exit(1); }
char buf[64];
ssize_t n;
/* First attempt: pipe is empty */
n = read(pipefd[0], buf, sizeof(buf));
if (n == -1 && errno == EAGAIN)
printf("Attempt 1: No data yet (EAGAIN) — doing other work...\n");
/* Write something to the pipe */
write(pipefd[1], "ping", 4);
/* Second attempt: data is now available */
n = read(pipefd[0], buf, sizeof(buf));
if (n > 0) {
buf[n] = '\0';
printf("Attempt 2: Got data: '%s'\n", buf);
}
close(pipefd[0]);
close(pipefd[1]);
return 0;
}
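"Retry later" usually means waiting until the kernel reports the descriptor readable instead of spinning on EAGAIN. One common way to do that is poll(). The sketch below combines it with a non-blocking fd; `set_nonblocking` and `read_with_timeout` are hypothetical helper names, and the -2 return code for a timeout is a convention invented for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Adds O_NONBLOCK to fd's status flags. Returns 0 or -1. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Waits up to timeout_ms for fd to become readable, then reads.
 * Returns bytes read, 0 on end of file, -1 on error,
 * or -2 if nothing arrived before the timeout. */
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    for (;;) {
        int ready = poll(&pfd, 1, timeout_ms);
        if (ready == -1) {
            if (errno == EINTR) continue;      /* signal arrived: wait again */
            return -1;
        }
        if (ready == 0) return -2;             /* timed out, no data */
        ssize_t n = read(fd, buf, len);
        if (n == -1 && (errno == EAGAIN || errno == EINTR))
            continue;                          /* data vanished or interrupted: wait again */
        return n;
    }
}
```

The timeout restarts on every retry, which keeps the sketch short; code that needs a hard deadline would track the remaining time itself.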
O_NONBLOCK on regular disk files is generally ignored by the kernel because the buffer cache makes disk I/O appear instant. It matters for pipes, FIFOs, sockets, terminals, and device files.
The Problem
On a 32-bit system, off_t (the file offset type) is a signed 32-bit integer. That limits offsets to 2 GB (2³¹ − 1 bytes). Modern log files, databases, and video files routinely exceed this. The Large File Summit (LFS) extensions solve this.
| Type / Function | 32-bit default | LFS 64-bit version |
|---|---|---|
| File offset type | off_t (32-bit) | off64_t (64-bit) |
| Max file size | 2 GB | 8 Exabytes (theoretical) |
| open() | open() | open64() or use macro |
| lseek() | lseek() | lseek64() |
| stat() | stat() | stat64() |
The Modern Way: _FILE_OFFSET_BITS=64
The cleanest approach is to define _FILE_OFFSET_BITS=64 at compile time. This automatically maps all 32-bit file functions to their 64-bit counterparts — no source code changes needed.
/*
* largefile_demo.c
* Demonstrates writing at a very large offset (beyond 2 GB).
* Compile: gcc -D_FILE_OFFSET_BITS=64 largefile_demo.c -o largefile_demo
*
* NOTE: This creates a sparse file. The actual disk usage is tiny
* because the kernel does not allocate blocks for the hole.
*/
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
int main(void)
{
int fd = open("bigfile.dat", O_RDWR | O_CREAT | O_TRUNC, 0644);
if (fd == -1) { perror("open"); exit(1); }
/* Seek to 3 GB offset — impossible without LFS on 32-bit! */
off_t big_offset = (off_t)3 * 1024 * 1024 * 1024; /* 3 GB */
if (lseek(fd, big_offset, SEEK_SET) == -1) {
perror("lseek: maybe 32-bit without LFS?");
exit(1);
}
if (write(fd, "END\n", 4) != 4) { perror("write"); exit(1); }
printf("Wrote 4 bytes at offset %lld (3 GB)\n", (long long)big_offset);
/* Verify: ls -lh bigfile.dat → shows ~3.0G but du -sh shows ~4K */
close(fd);
return 0;
}
/* Compile command:
gcc -D_FILE_OFFSET_BITS=64 largefile_demo.c -o largefile_demo */
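The difference between what ls and du report for a sparse file can also be observed from C with fstat(). The sketch below writes 4 bytes at a caller-chosen offset and reports both numbers. `sparse_probe` is a hypothetical helper, and whether the hole really stays unallocated depends on the filesystem.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Writes 4 bytes at `offset` in a fresh file and reports its apparent
 * size (what ls -l shows) and its allocated bytes (what du shows).
 * Returns 0 on success, -1 on error. */
int sparse_probe(const char *path, off_t offset,
                 long long *apparent, long long *allocated)
{
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) return -1;

    struct stat st;
    int ok = pwrite(fd, "END\n", 4, offset) == 4 && fstat(fd, &st) == 0;
    close(fd);
    if (!ok) return -1;

    *apparent  = (long long)st.st_size;
    *allocated = (long long)st.st_blocks * 512;   /* st_blocks counts 512-byte units */
    return 0;
}
```

With a 3 GB offset and 64-bit off_t, `apparent` should come out as the offset plus 4, while `allocated` is typically a few kilobytes on filesystems that support holes.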
When printing off_t values with LFS, cast to long long and use %lld. On some 32-bit platforms, off_t may be wider than long, so %ld will give wrong results.
printf("offset = %lld\n", (long long)offset);
What is /dev/fd?
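The kernel provides a virtual directory /dev/fd that contains entries like /dev/fd/0, /dev/fd/1, /dev/fd/2, etc. On most UNIX systems, opening /dev/fd/N is equivalent to calling dup(N): you get a new file descriptor pointing to the same OFD as fd N. Linux is an exception: its entries are symlinks to the underlying files, so open() creates a new OFD with its own offset and status flags.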
| Path | Meaning | Equivalent to |
|---|---|---|
| /dev/fd/0 = /dev/stdin | Standard input | dup(0) |
| /dev/fd/1 = /dev/stdout | Standard output | dup(1) |
| /dev/fd/2 = /dev/stderr | Standard error | dup(2) |
| /dev/fd/N | Any open fd | dup(N) |
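
Whether a descriptor obtained through /dev/fd/N really shares the offset of fd N, as dup(N) would, differs between platforms. The sketch below probes the behaviour at run time and does not assume either answer. `devfd_shares_offset` is a hypothetical helper, and it assumes that /dev/fd exists and /tmp is writable.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns 1 if a descriptor obtained by opening /dev/fd/N shares the file
 * offset of fd N, 0 if it has its own offset, -1 on error. */
int devfd_shares_offset(void)
{
    char path[] = "/tmp/devfd-demo-XXXXXX";   /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd == -1) return -1;

    int result = -1;
    if (write(fd, "abcde", 5) == 5) {         /* original offset is now 5 */
        char name[32];
        snprintf(name, sizeof name, "/dev/fd/%d", fd);
        int again = open(name, O_RDONLY);
        if (again != -1) {
            /* A shared OFD reports 5; a fresh OFD starts at 0 */
            result = (lseek(again, 0, SEEK_CUR) == 5);
            close(again);
        }
    }
    close(fd);
    unlink(path);
    return result;
}
```

A result of 1 indicates dup()-like semantics; a result of 0 indicates that the open went through the underlying file and produced an independent offset.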
Shell Use Case: Pass stdin as a Filename
Some command-line tools only accept filenames, not stdin. /dev/stdin bridges that gap:
## Shell examples using /dev/fd
# diff a live sorted output against a saved snapshot
# (diff expects two filenames — but one "file" is a pipe)
sort current_list.txt | diff /dev/fd/0 saved_snapshot.txt
# Use /dev/stdin where a tool only accepts filenames
echo "hello world" | wc -l /dev/stdin
# In C: open /dev/fd/1 to get another fd pointing to stdout
# (same as dup(1))
fd = open("/dev/fd/1", O_WRONLY);
On Linux, /dev/fd is a symlink to /proc/self/fd. You can verify with ls -la /dev/fd. The /proc/self/fd/ directory contains one symlink per open file descriptor in your process.
Why Not Just Use a Fixed Filename?
If two instances of your program both try to create /tmp/myapp.tmp, they collide. Worse, a malicious process could create that file first and trick your program into writing sensitive data there. Proper temp file APIs avoid both problems with a unique, random name and exclusive open.
| Feature | mkstemp() | tmpfile() |
|---|---|---|
| Returns | int (file descriptor) | FILE* (stdio stream) |
| You know the filename? | Yes (template is filled in) | No (hidden) |
| Auto-delete on close? | No (you call unlink manually) | Yes — automatically deleted |
| Interface | POSIX, fd-level | ISO C, stdio-level |
| Permissions | 0600 (owner rw only) | Implementation-defined |
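
The "exclusive open" mentioned above corresponds to the O_CREAT | O_EXCL flag pair, which POSIX specifies mkstemp() to behave as if it used. The sketch below isolates just that step with a fixed name, to show that a second creator is refused. `create_exclusive` is a hypothetical helper, and a fixed name is used only for demonstration.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Creates `path` only if it does not exist yet. Returns an fd, or -1 with
 * errno == EEXIST if the name is already taken. With O_EXCL, a symlink
 * sitting at `path` also causes failure instead of being followed. */
int create_exclusive(const char *path)
{
    return open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
}
```

mkstemp() adds the missing half: it generates random candidate names and repeats the exclusive open until one succeeds.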
/*
* temp_files.c
* Demonstrates both mkstemp() and tmpfile() patterns.
* Compile: gcc temp_files.c -o temp_files
*/
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
/* ── mkstemp: use when you need the filename (e.g. pass to another process) ── */
void demo_mkstemp(void)
{
/* Last 6 chars MUST be XXXXXX — replaced with unique random chars */
char template[] = "/tmp/myapp-XXXXXX";
int fd = mkstemp(template);
if (fd == -1) { perror("mkstemp"); return; }
printf("mkstemp created: %s\n", template);
/* Unlink immediately so file is removed when fd is closed,
even if the process crashes */
unlink(template);
/* Now use fd for I/O — file still accessible via fd */
write(fd, "temporary data\n", 15);
/* Seek back and read */
lseek(fd, 0, SEEK_SET);
char buf[32] = {0};
read(fd, buf, 15);
printf("mkstemp content: %s", buf);
close(fd);
/* File is now gone from filesystem (was unlinked earlier) */
}
/* ── tmpfile: use when you just need a throwaway scratch area ── */
void demo_tmpfile(void)
{
FILE *fp = tmpfile();
if (!fp) { perror("tmpfile"); return; }
fprintf(fp, "scratch data line 1\n");
fprintf(fp, "scratch data line 2\n");
/* Rewind and read back */
rewind(fp);
char line[64];
printf("\ntmpfile contents:\n");
while (fgets(line, sizeof(line), fp))
printf(" %s", line);
fclose(fp);
/* File is automatically deleted on fclose() */
}
int main(void)
{
demo_mkstemp();
demo_tmpfile();
return 0;
}
Avoid tmpnam(), tempnam(), and mktemp(). They return a name but do not open the file atomically, leaving a window for a symlink attack. Always use mkstemp() or tmpfile().
| Call / Flag | Signature | Key Behaviour |
|---|---|---|
| dup() | int dup(int oldfd) | Lowest free fd pointing to same OFD |
| dup2() | int dup2(int old, int new) | Exact fd number; closes newfd first |
| dup3() | int dup3(int old, int new, int flags) | dup2 + can set O_CLOEXEC atomically |
| pread() | ssize_t pread(fd, buf, n, off) | Read at offset; OFD position unchanged |
| pwrite() | ssize_t pwrite(fd, buf, n, off) | Write at offset; OFD position unchanged |
| readv() | ssize_t readv(fd, iov, iovcnt) | Scatter read into multiple buffers |
| writev() | ssize_t writev(fd, iov, iovcnt) | Gather write from multiple buffers, atomic |
| preadv() | ssize_t preadv(fd, iov, n, off) | readv + specified offset |
| pwritev() | ssize_t pwritev(fd, iov, n, off) | writev + specified offset |
| truncate() | int truncate(path, off_t len) | Set file size; pads with zeroes if growing |
| ftruncate() | int ftruncate(fd, off_t len) | Same as truncate but uses fd |
| mkstemp() | int mkstemp(char *template) | Unique temp file; returns fd; 0600 perms |
| tmpfile() | FILE* tmpfile(void) | Unique temp file; auto-deleted on close |
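
The table lists ftruncate() as padding with zeroes when a file grows. A minimal sketch of that behaviour follows; the helper name `grow_with_ftruncate` and the scratch path are made up for illustration.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Writes "abc" to a scratch file, grows it to 8 bytes with ftruncate(),
 * and copies the resulting bytes into out[8].
 * Returns the final file size, or -1 on error. */
long grow_with_ftruncate(unsigned char out[8])
{
    char path[] = "/tmp/trunc-demo-XXXXXX";   /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd == -1) return -1;
    unlink(path);

    long size = -1;
    if (write(fd, "abc", 3) == 3 &&
        ftruncate(fd, 8) == 0 &&              /* grow: bytes 3..7 read back as zero */
        pread(fd, out, 8, 0) == 8)
        size = (long)lseek(fd, 0, SEEK_END);
    close(fd);
    return size;
}
```

After the call, `out` should hold 'a', 'b', 'c' followed by five zero bytes, and the returned size should be 8.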
A: A file descriptor is a small integer in your process’s fd table. It points to an Open File Description (OFD) in the kernel’s system-wide table. The OFD holds the actual state: offset, flags, and a pointer to the i-node. Multiple fds (even from different processes) can point to the same OFD.
A: Without atomicity, two processes could both lseek to end-of-file and then write, causing one to overwrite the other. O_APPEND makes seek-to-end + write a single uninterruptible operation so each write always lands after the previous one.
A: The shell calls dup2(1, 2), which makes fd 2 (stderr) point to the same OFD as fd 1 (stdout). Because they share the same OFD, they share the same file offset — so writes from both are correctly interleaved without overwriting each other.
A: All threads share the same file descriptors and therefore the same OFD offsets. Between lseek() and read(), another thread can change the offset — a race condition. pread() takes the offset as a parameter and does not touch the OFD offset at all.
A: writev() is atomic (all buffers written as one unit), avoids the need for a temporary buffer, and incurs only one syscall overhead instead of N. Multiple write() calls could be interleaved with writes from other processes, and each carries its own syscall cost.
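The O_APPEND answer above can be checked with two independent open() calls on one file, which gives two separate OFDs, each with its own offset. The sketch below is illustrative only; `append_from_two_opens` is a hypothetical helper that takes the extra open flags as a parameter so both cases can be compared.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Opens the same file twice (two separate OFDs), writes 4 bytes through
 * each, and returns the final file size, or -1 on error.
 * Pass O_APPEND or 0 as extra_flags to compare the two behaviours. */
long append_from_two_opens(int extra_flags)
{
    char path[] = "/tmp/append-demo-XXXXXX";   /* hypothetical scratch file */
    int a = mkstemp(path);
    if (a == -1) return -1;
    close(a);

    a = open(path, O_WRONLY | extra_flags);
    int b = open(path, O_WRONLY | extra_flags);
    long size = -1;
    if (a != -1 && b != -1 &&
        write(a, "AAAA", 4) == 4 &&            /* lands at offset 0 either way */
        write(b, "BBBB", 4) == 4)              /* O_APPEND: offset 4; otherwise: offset 0 */
        size = (long)lseek(a, 0, SEEK_END);
    if (a != -1) close(a);
    if (b != -1) close(b);
    unlink(path);
    return size;
}
```

With O_APPEND the file ends up 8 bytes long; with 0 the second write starts at its own offset 0 and overwrites the first, leaving 4 bytes.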
🎓 Continue Your Linux Journey
Next up: Process Management, Signals, and Pipes — the building blocks of every Linux shell and daemon.
