Speed of Process Creation - embeddedpathashala.com

Chapter 28.3 — TLPI

Speed of Process Creation

Benchmarking fork(), vfork(), and clone() — understanding the real costs

Key figures: vfork() is about 36x faster than fork() for the 11.70 MB process; clone() reaches about 34,000 creations per second; fork() relies on copy-on-write (CoW).


Why Does Speed Matter?

Web servers, shells, and container runtimes can create thousands of processes per second. Knowing what fork(), vfork(), and clone() actually cost, and where that cost comes from, helps you choose the right primitive for high-performance code.

Benchmark Results (kernel 2.6.27, x86-32, 100,000 operations)

Method	1.70 MB process Time (s) / Rate	2.70 MB process Time (s) / Rate	11.70 MB process Time (s) / Rate
fork()	22.27s / 4,544/s	26.38s / 4,135/s	126.93s / 1,276/s
vfork()	3.52s / 28,955/s	3.55s / 28,621/s	3.53s / 28,810/s
clone() [thread-like]	2.97s / 34,333/s	2.98s / 34,217/s	2.93s / 34,688/s
fork() + exec()	135.72s / 764/s	146.15s / 719/s	260.34s / 435/s
vfork() + exec()	107.36s / 969/s	107.81s / 964/s	107.97s / 960/s

Visual: Relative Speed (100% = fastest = clone(), rates for the 1.70 MB process)

Method	Relative speed	Rate
clone() (no exec)	100%	34,333/s
vfork() (no exec)	84%	28,955/s
fork() (no exec)	13%	4,544/s
vfork() + exec()	about 3%	969/s
fork() + exec()	about 2%	764/s

Why is fork() Slower Than vfork() and clone()?

Operation	fork()	vfork()	clone() (shared)
Duplicate page tables	✗ YES — grows with process size	✓ No (shared)	✓ No (CLONE_VM)
Mark pages read-only (CoW)	✗ YES — all data+heap+stack pages	✓ No	✓ No
Copy fd table	✗ YES	✗ YES	✓ No (CLONE_FILES)
Copy signal dispositions	✗ YES	✗ YES	✓ No (CLONE_SIGHAND)
Cost scales with process size?	✗ YES (the 11.70 MB process took about 5.7x as long as the 1.70 MB one)	✓ No (constant time)	✓ No (constant time)

Copy-on-Write (CoW) — fork() Is Lazy, Not Zero-Cost

fork() does not copy memory pages immediately; it uses copy-on-write. Parent and child share the same physical pages, and the writable ones are marked read-only. When either process writes to such a page, a page fault occurs and the kernel copies that page for the writer. The up-front cost of fork() is duplicating the page tables and marking those entries read-only, which grows roughly linearly with process size.

fork() — Copy-on-Write Flow
After fork(): parent and child share the same pages (read-only)	→	Child writes to a page: the write triggers a page fault, handled by the kernel	→	Kernel copies the page: the child gets a private copy; the parent's page is unchanged
Cost of fork() = time to walk ALL page table entries and mark them read-only. Proportional to virtual memory size.

Example 1 — Benchmark fork() vs vfork() Speed

/* bench_fork_vfork.c — Measure 10,000 fork() and vfork() calls */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <unistd.h>

#define ITERATIONS 10000

static double elapsed_ms(struct timeval *start, struct timeval *end)
{
    return (end->tv_sec  - start->tv_sec)  * 1000.0 +
           (end->tv_usec - start->tv_usec) / 1000.0;
}

static void bench_fork(void)
{
    struct timeval t1, t2;
    gettimeofday(&t1, NULL);

    for (int i = 0; i < ITERATIONS; i++) {
        pid_t pid = fork();
        if (pid == -1) { perror("fork"); exit(EXIT_FAILURE); }
        if (pid == 0) _exit(0);         /* Child exits immediately */
        waitpid(pid, NULL, 0);          /* Parent reaps the child */
    }

    gettimeofday(&t2, NULL);
    printf("fork():  %6.1f ms for %d iterations  (%5.0f/sec)\n",
           elapsed_ms(&t1, &t2), ITERATIONS,
           ITERATIONS / (elapsed_ms(&t1, &t2) / 1000.0));
}

static void bench_vfork(void)
{
    struct timeval t1, t2;
    gettimeofday(&t1, NULL);

    for (int i = 0; i < ITERATIONS; i++) {
        pid_t pid = vfork();
        if (pid == -1) { perror("vfork"); exit(EXIT_FAILURE); }
        if (pid == 0) _exit(0);         /* Child exits immediately */
        waitpid(pid, NULL, 0);          /* Parent reaps the child */
    }

    gettimeofday(&t2, NULL);
    printf("vfork(): %6.1f ms for %d iterations  (%5.0f/sec)\n",
           elapsed_ms(&t1, &t2), ITERATIONS,
           ITERATIONS / (elapsed_ms(&t1, &t2) / 1000.0));
}

int main(void)
{
    printf("Benchmarking process creation (%d iterations each)...\n\n",
           ITERATIONS);
    bench_fork();
    bench_vfork();

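    /* Allocate extra heap to simulate a larger process. Touch every page:
       untouched malloc() memory typically has no page table entries yet,
       so fork() would have nothing extra to copy. */
    size_t extra_size = 10 * 1024 * 1024;     /* 10 MB */
    char *extra = malloc(extra_size);
    if (!extra) { perror("malloc"); return 1; }
    long page = sysconf(_SC_PAGESIZE);
    for (size_t off = 0; off < extra_size; off += (size_t)page)
        ((volatile char *)extra)[off] = 1;    /* volatile keeps the stores */
    printf("\n--- After allocating 10 MB extra heap ---\n\n");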
    bench_fork();
    bench_vfork();

    free(extra);
    return 0;
}
/* Compile: gcc -O2 -o bench bench_fork_vfork.c
   Run:     ./bench
   Expected: vfork() ~constant; fork() much slower after heap grows */

Example 2 — fork+exec vs vfork+exec (The Real Story)

When you add exec() after fork/vfork, the difference shrinks dramatically — exec() itself is expensive (disk I/O, dynamic linking). This example measures the full fork-exec lifecycle.

/* bench_exec.c — Compare fork+exec vs vfork+exec */
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <unistd.h>

#define ITERATIONS 1000
/* /bin/true: a program that does nothing and exits */
#define PROG "/bin/true"

static double elapsed_ms(struct timeval *a, struct timeval *b)
{
    return (b->tv_sec - a->tv_sec) * 1000.0
         + (b->tv_usec - a->tv_usec) / 1000.0;
}

int main(void)
{
    struct timeval t1, t2;
    char *args[] = { PROG, NULL };

    /* --- fork() + exec() --- */
    gettimeofday(&t1, NULL);
    for (int i = 0; i < ITERATIONS; i++) {
        pid_t pid = fork();
        if (pid == -1) { perror("fork"); return 1; }
        if (pid == 0) {
            execv(PROG, args);
            _exit(127);                 /* Reached only if execv() fails */
        }
        waitpid(pid, NULL, 0);
    }
    gettimeofday(&t2, NULL);
    double fork_exec_ms = elapsed_ms(&t1, &t2);

    /* --- vfork() + exec() --- */
    gettimeofday(&t1, NULL);
    for (int i = 0; i < ITERATIONS; i++) {
        pid_t pid = vfork();
        if (pid == -1) { perror("vfork"); return 1; }
        if (pid == 0) {
            execv(PROG, args);
            _exit(127);                 /* Reached only if execv() fails */
        }
        waitpid(pid, NULL, 0);
    }
    gettimeofday(&t2, NULL);
    double vfork_exec_ms = elapsed_ms(&t1, &t2);

    printf("fork()  + exec(): %7.1f ms  (%5.0f/sec)\n",
           fork_exec_ms,  ITERATIONS / (fork_exec_ms  / 1000.0));
    printf("vfork() + exec(): %7.1f ms  (%5.0f/sec)\n",
           vfork_exec_ms, ITERATIONS / (vfork_exec_ms / 1000.0));
    printf("Ratio: %.1fx\n", fork_exec_ms / vfork_exec_ms);
    printf("(Ratio is much smaller than without exec — exec dominates)\n");

    return 0;
}
/* Compile: gcc -O2 -o bench_exec bench_exec.c
   Run:     ./bench_exec  */

Why exec() Reduces the fork() vs vfork() Difference

exec() must load the program binary, from disk or from the buffer cache; this is often the dominant cost
exec() of a dynamically linked program also runs the dynamic linker, which loads shared libraries
These costs are the same whether you used fork() or vfork()
The page-table copying cost of fork() is a smaller fraction of the total when exec() is involved
Some shells and other programs that launch many children use vfork() + exec() for efficiency
Repeated exec() of the same binary is artificially fast in benchmarks because the file stays in the kernel's buffer cache

What clone() Flags Were Used in the Benchmark?

/* The clone() benchmark used these flags (sharing most resources).
   The CLONE_* constants come from <sched.h> with _GNU_SOURCE defined. */
#define _GNU_SOURCE
#include <sched.h>

int flags = CLONE_VM     |   /* share virtual memory: no page table copy */
            CLONE_VFORK  |   /* suspend parent like vfork() */
            CLONE_FS     |   /* share filesystem attributes */
            CLONE_SIGHAND|   /* share signal dispositions */
            CLONE_FILES;     /* share file descriptor table */

/* This is why clone() was slightly faster than vfork():
   vfork() still copies the filesystem attributes, the fd table,
   and the signal dispositions.
   clone() with CLONE_FS + CLONE_FILES + CLONE_SIGHAND skips those copies. */

Interview Questions

Q1. Why does fork() become slower as the process gets larger, but vfork() stays constant?

fork() must duplicate the parent’s page tables — marking every data, heap, and stack page as copy-on-write read-only. The time for this is proportional to the number of pages, which grows with process size. vfork() does not copy page tables at all — the child shares the parent’s virtual memory directly. The parent is suspended until the child execs or exits, so no concurrent access problem exists.

Q2. If vfork() is faster than fork(), why not always use vfork()?

vfork() has severe restrictions. Because the child runs in the parent’s address space, it must not modify any data other than the variable that holds vfork()’s return value, must not return from the function that called vfork(), and should call nothing except exec() or _exit(). Ordinary library functions are unsafe because they may change global state that the parent then sees. Violating these rules corrupts the parent’s state. fork() with copy-on-write is safe for any use, and for small processes its extra cost is modest (about 6x without exec() and about 1.3x with exec() for the 1.70 MB process in the benchmark above), so the danger of vfork() is rarely worth it.

Q3. Why does exec() reduce the relative speed difference between fork() and vfork()?

exec() performs expensive operations: reading the program binary from disk (or buffer cache), setting up the new address space, loading shared libraries, and initializing the runtime. These costs are the same regardless of whether fork() or vfork() was used. When exec() cost dominates (e.g., a large binary not cached), the page-table copying overhead of fork() is a small fraction of the total time, so the ratio fork/vfork shrinks significantly.

Q4. Why is clone() slightly faster than vfork() in the benchmark?

vfork() shares virtual memory but still copies the file descriptor table and signal disposition table from parent to child. The benchmark’s clone() call also included CLONE_FILES and CLONE_SIGHAND, which share these structures rather than copying them. Avoiding those copies gives clone() a small edge. The magnitude of the difference depends on how many file descriptors are open — opening 100 extra fds raised vfork() time but left clone() time unchanged.

Q5. What does copy-on-write mean and how does fork() implement it?

Copy-on-write means the kernel does not immediately copy memory pages at fork() time. Instead, parent and child share the same physical pages. Their page table entries are marked read-only. When either process writes to a shared page, a page fault occurs. The kernel handles it by allocating a new physical page, copying the content, updating the faulting process’s page table to point to the new page (writable), and resuming. Pages that are never written are never copied — this is the efficiency of CoW.