Why Does Speed Matter?
Web servers, shells, and container runtimes can create thousands of processes per second. Knowing what fork(), vfork(), and clone() actually cost, and where that cost comes from, helps you choose the right primitive for performance-sensitive code.
Benchmark Results (kernel 2.6.27, x86-32, 100,000 operations)
| Method | 1.70 MB process: time (s) / rate | 2.70 MB process: time (s) / rate | 11.70 MB process: time (s) / rate |
|---|---|---|---|
| fork() | 22.27s / 4,544/s | 26.38s / 4,135/s | 126.93s / 1,276/s |
| vfork() | 3.52s / 28,955/s | 3.55s / 28,621/s | 3.53s / 28,810/s |
| clone() [thread-like] | 2.97s / 34,333/s | 2.98s / 34,217/s | 2.93s / 34,688/s |
| fork() + exec() | 135.72s / 764/s | 146.15s / 719/s | 260.34s / 435/s |
| vfork() + exec() | 107.36s / 969/s | 107.81s / 964/s | 107.97s / 960/s |
Visual: Relative Speed (100% = fastest = clone())
Why Is fork() Slower Than vfork() and clone()?
| Operation | fork() | vfork() | clone() (shared) |
|---|---|---|---|
| Duplicate page tables | ✗ YES — grows with process size | ✓ No (shared) | ✓ No (CLONE_VM) |
| Mark pages read-only (CoW) | ✗ YES — all data+heap+stack pages | ✓ No | ✓ No |
| Copy fd table | ✗ YES | ✗ YES | ✓ No (CLONE_FILES) |
| Copy signal dispositions | ✗ YES | ✗ YES | ✓ No (CLONE_SIGHAND) |
| Cost scales with process size? | ✗ YES — 10x larger = much slower | ✓ No (constant time) | ✓ No (constant time) |
Copy-on-Write (CoW) — fork() Is Lazy, Not Zero-Cost
fork() does not copy the actual memory pages immediately; it uses Copy-on-Write. Parent and child share the same physical pages, which are marked read-only. When either process writes to such a page, a page fault occurs and the kernel copies that page for the writer. The up-front cost of fork() is therefore duplicating the page tables and marking their entries read-only, which grows roughly linearly with process size.
fork(): Copy-on-Write Flow
| 1. After fork() | 2. Child writes to page | 3. Kernel copies page |
|---|---|---|
| Parent and child share the same pages (read-only) | The write triggers a page fault in the kernel | The child gets a private copy; the parent's page is unchanged |

Cost of fork() = time to walk ALL page table entries and mark them read-only. Proportional to virtual memory size.
Example 1 — Benchmark fork() vs vfork() Speed
/* bench_fork_vfork.c: measure 10,000 fork() and vfork() calls */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <unistd.h>
#define ITERATIONS 10000
#define EXTRA_BYTES (10 * 1024 * 1024) /* 10 MB */
static double elapsed_ms(struct timeval *start, struct timeval *end)
{
    return (end->tv_sec - start->tv_sec) * 1000.0 +
           (end->tv_usec - start->tv_usec) / 1000.0;
}
static void bench_fork(void)
{
    struct timeval t1, t2;
    gettimeofday(&t1, NULL);
    for (int i = 0; i < ITERATIONS; i++) {
        pid_t pid = fork();
        if (pid == -1) { perror("fork"); exit(1); }
        if (pid == 0) _exit(0);  /* Child exits immediately */
        waitpid(pid, NULL, 0);   /* Parent reaps the child */
    }
    gettimeofday(&t2, NULL);
    double ms = elapsed_ms(&t1, &t2);
    printf("fork():  %6.1f ms for %d iterations (%5.0f/sec)\n",
           ms, ITERATIONS, ITERATIONS / (ms / 1000.0));
}
static void bench_vfork(void)
{
    struct timeval t1, t2;
    gettimeofday(&t1, NULL);
    for (int i = 0; i < ITERATIONS; i++) {
        pid_t pid = vfork();
        if (pid == -1) { perror("vfork"); exit(1); }
        if (pid == 0) _exit(0);  /* Child may only call _exit() or exec*() */
        waitpid(pid, NULL, 0);
    }
    gettimeofday(&t2, NULL);
    double ms = elapsed_ms(&t1, &t2);
    printf("vfork(): %6.1f ms for %d iterations (%5.0f/sec)\n",
           ms, ITERATIONS, ITERATIONS / (ms / 1000.0));
}
int main(void)
{
    printf("Benchmarking process creation (%d iterations each)...\n\n",
           ITERATIONS);
    bench_fork();
    bench_vfork();
    /* Allocate extra heap to simulate a larger process. Every page must be
       written: untouched malloc() memory has no page-table entries yet, so
       fork() would have nothing extra to copy. The volatile writes keep the
       compiler from optimizing the allocation away. */
    volatile char *extra = malloc(EXTRA_BYTES);
    if (!extra) { perror("malloc"); return 1; }
    for (size_t off = 0; off < EXTRA_BYTES; off += 4096)
        extra[off] = 1;  /* 4096-byte stride hits every page of 4 KiB or more */
    printf("\n--- After allocating 10 MB extra heap ---\n\n");
    bench_fork();
    bench_vfork();
    free((void *)extra);
    return 0;
}
/* Compile: gcc -O2 -o bench bench_fork_vfork.c
   Run:     ./bench
   Expected: vfork() roughly constant; fork() slower after the heap grows */
Example 2 — fork+exec vs vfork+exec (The Real Story)
When you add exec() after fork/vfork, the difference shrinks dramatically — exec() itself is expensive (disk I/O, dynamic linking). This example measures the full fork-exec lifecycle.
/* bench_exec.c: compare fork+exec vs vfork+exec */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <unistd.h>
#define ITERATIONS 1000
/* /bin/true: a program that does nothing and exits */
#define PROG "/bin/true"
static double elapsed_ms(struct timeval *a, struct timeval *b)
{
    return (b->tv_sec - a->tv_sec) * 1000.0
         + (b->tv_usec - a->tv_usec) / 1000.0;
}
int main(void)
{
    struct timeval t1, t2;
    char *args[] = { PROG, NULL };
    /* --- fork() + exec() --- */
    gettimeofday(&t1, NULL);
    for (int i = 0; i < ITERATIONS; i++) {
        pid_t pid = fork();
        if (pid == -1) { perror("fork"); return 1; }
        if (pid == 0) {
            execv(PROG, args);
            _exit(127);  /* reached only if execv() failed */
        }
        waitpid(pid, NULL, 0);
    }
    gettimeofday(&t2, NULL);
    double fork_exec_ms = elapsed_ms(&t1, &t2);
    /* --- vfork() + exec() --- */
    gettimeofday(&t1, NULL);
    for (int i = 0; i < ITERATIONS; i++) {
        pid_t pid = vfork();
        if (pid == -1) { perror("vfork"); return 1; }
        if (pid == 0) {
            execv(PROG, args);
            _exit(127);  /* reached only if execv() failed */
        }
        waitpid(pid, NULL, 0);
    }
    gettimeofday(&t2, NULL);
    double vfork_exec_ms = elapsed_ms(&t1, &t2);
    printf("fork() + exec():  %7.1f ms (%5.0f/sec)\n",
           fork_exec_ms, ITERATIONS / (fork_exec_ms / 1000.0));
    printf("vfork() + exec(): %7.1f ms (%5.0f/sec)\n",
           vfork_exec_ms, ITERATIONS / (vfork_exec_ms / 1000.0));
    printf("Ratio: %.1fx\n", fork_exec_ms / vfork_exec_ms);
    printf("(Expect a much smaller ratio than without exec: exec dominates)\n");
    return 0;
}
/* Compile: gcc -O2 -o bench_exec bench_exec.c
   Run:     ./bench_exec */
Why exec() Reduces the fork() vs vfork() Difference
- exec() loads the program binary, reading it from disk on a cold start; this is usually the dominant cost
- Starting the new program also involves dynamic linking and loading shared libraries
- These costs are the same whether you used fork() or vfork()
- The page-table copying cost of fork() is a smaller fraction of the total when exec() is involved
- For this reason some programs prefer vfork() + exec(), or posix_spawn(), when launching children from a large process, although common shells such as bash use plain fork()
- Repeated exec() of the same binary is artificially fast in benchmarks because the file stays cached in memory (the kernel's page cache)
What clone() Flags Were Used in the Benchmark?
/* The clone() benchmark used these flags (sharing most resources): */
#define _GNU_SOURCE   /* needed for the CLONE_* constants in <sched.h> */
#include <sched.h>
int flags = CLONE_VM      | /* share virtual memory: no page table copy */
            CLONE_VFORK   | /* suspend parent like vfork() */
            CLONE_FS      | /* share filesystem attributes */
            CLONE_SIGHAND | /* share signal dispositions */
            CLONE_FILES;    /* share file descriptor table */
/* This is why clone() was slightly faster than vfork():
   vfork() still copies the fd table and signal dispositions.
   clone() with CLONE_FILES + CLONE_SIGHAND skips those copies. */
