Chapter 28.2 — TLPI

The clone() System Call

The low-level engine behind every Linux thread and process — understanding the real fork()

Key terms: clone() (core API), do_fork() (kernel function), KSE (scheduling entity)


What is clone()?

clone() is a Linux-specific system call that creates a new process. Unlike fork(), which always gives the child its own copy of the parent’s memory, signal dispositions, and file descriptor table, clone() lets the caller choose which of these resources are shared with the parent and which are copied.

It is the real primitive: inside the kernel, fork(), vfork(), and clone() all end up in the same kernel function, do_fork() (renamed kernel_clone() in Linux 5.10), and differ only in the flags they pass. The threading libraries (NPTL, LinuxThreads) use clone() directly to create threads.

fork() vs vfork() vs clone() — At a Glance

System Call	Memory	Starts Executing	Stack	Sharing Control
`fork()`	Copy-on-Write copy	After the fork() call	Copy of parent’s stack	No control — fixed behaviour
`vfork()`	Shares parent memory	After the vfork() call	Shares parent’s stack!	No control — fixed behaviour
`clone()`	Caller decides via flags	At a new function (func)	Caller provides separate stack	Full control via flags
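The Memory column can be checked directly. The sketch below is a minimal illustration rather than code from the book: the helper name counter_after_child() and the global counter are made up for this example. It clones a child that writes 42 to a global variable and reports what the parent sees afterwards, with or without CLONE_VM.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>

#define STACK_SIZE (64 * 1024)

static volatile int counter;    /* written by the child, read by the parent */

static int bump_counter(void *arg)
{
    (void) arg;
    counter = 42;               /* lands in shared or private memory */
    return 0;
}

/* Hypothetical helper: returns the value of `counter` that the parent
 * sees after the child has exited, or -1 on error.
 * share_vm != 0 adds CLONE_VM to the flags. */
int counter_after_child(int share_vm)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    counter = 0;
    int flags = SIGCHLD | (share_vm ? CLONE_VM : 0);

    pid_t pid = clone(bump_counter, stack + STACK_SIZE, flags, NULL);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    /* Only release the stack once the child is known to be gone */
    if (waitpid(pid, NULL, 0) == -1)
        return -1;

    free(stack);
    return counter;
}
```

Called from a main(), counter_after_child(0) should return 0, because the child wrote to its private copy, while counter_after_child(1) should return 42, because the write landed in memory shared with the parent.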

clone() Prototype

#define _GNU_SOURCE
#include <sched.h>

int clone(
    int  (*func)(void *),   /* Child starts executing here */
    void  *child_stack,     /* Top of stack for child (grows downward) */
    int    flags,           /* Sharing flags + termination signal */
    void  *func_arg,        /* Argument passed to func */
    /* Optional: */
    pid_t *ptid,            /* Store child TID here (before clone() returns) */
    struct user_desc *tls,  /* Thread-local storage descriptor */
    pid_t *ctid             /* Store child TID here; cleared on exit */
);
/* Returns: child PID on success in parent, -1 on error */

Parameters Explained

Parameter	Role
`func`	Function where the child begins execution. Its return value becomes the child’s exit status.
`child_stack`	Pointer to the top of a memory block to use as the child’s stack. Stack grows downward on x86, so pass the high end of a malloc’d block.
`flags`	Bitmask of CLONE_* flags (what to share) ORed with the child’s termination signal (lower byte). E.g. `CLONE_VM | SIGCHLD`.
`func_arg`	Argument passed to `func`. Cast to/from `void*` to pass any type.
`ptid`	If CLONE_PARENT_SETTID is set: kernel writes child TID here before returning (race-free thread ID capture).
`tls`	Thread-local storage descriptor. Used by NPTL for per-thread data.
`ctid`	If CLONE_CHILD_CLEARTID is set: kernel zeroes this location and wakes futex waiters when child exits — how pthread_join() works.
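To make the roles of func, func_arg, and ptid concrete, here is a small sketch under stated assumptions: clone_and_report() and exit_with_arg() are names invented for this example. The child returns the integer it was handed, and the parent compares the PID returned by clone() with the TID the kernel wrote through ptid.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (64 * 1024)

static int exit_with_arg(void *arg)
{
    return *(int *) arg;        /* becomes the child's exit status */
}

/* Hypothetical helper: clones a child that exits with `code`.
 * On success returns 0 and fills in the PID returned by clone(),
 * the TID the kernel stored via ptid, and the child's exit status.
 * Returns -1 on error. */
int clone_and_report(int code, pid_t *ret_pid, pid_t *ptid_value,
                     int *exit_status)
{
    char *stack = malloc(STACK_SIZE);
    pid_t ptid = 0;
    int status;

    if (stack == NULL)
        return -1;

    /* No CLONE_VM: the child reads `code` from its own copy of memory.
     * tls and ctid are unused here, so NULL is passed for both. */
    pid_t pid = clone(exit_with_arg, stack + STACK_SIZE,
                      CLONE_PARENT_SETTID | SIGCHLD, &code,
                      &ptid, NULL, NULL);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    if (waitpid(pid, &status, 0) == -1) {
        free(stack);
        return -1;
    }
    free(stack);

    if (!WIFEXITED(status))
        return -1;

    *ret_pid = pid;
    *ptid_value = ptid;
    *exit_status = WEXITSTATUS(status);
    return 0;
}
```

For a single-threaded child like this one, the two IDs should be equal, and the exit status should match the value passed as func_arg.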

Inside the Kernel — How clone() Works

User calls clone(func, stack, flags, arg) → glibc wrapper sets up registers and calls sys_clone() → kernel do_fork() creates a new KSE → child starts executing func(arg)
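The split between wrapper and system call shows up if the wrapper is bypassed. The following sketch makes the raw system call directly; it assumes the x86-64 argument order (flags first, then the stack pointer), which differs on a few architectures, and raw_clone_like_fork() is a name invented for this example. With only SIGCHLD in the flags and a null stack pointer, the raw call returns in both parent and child, as fork() does, and no function is called on the child’s behalf.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the child's exit status as seen by the
 * parent (7 in this sketch), or -1 on error. */
int raw_clone_like_fork(void)
{
    int status;

    /* flags = SIGCHLD only; stack = NULL means the child keeps running
     * on its copy of the parent's stack; the other arguments are unused. */
    long ret = syscall(SYS_clone, (unsigned long) SIGCHLD,
                       NULL, NULL, NULL, 0UL);
    if (ret == -1)
        return -1;

    if (ret == 0)
        _exit(7);               /* child: the call returned 0 */

    /* parent: the call returned the child's PID */
    if (waitpid((pid_t) ret, &status, 0) == -1)
        return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Starting the child in func on a caller-supplied stack is therefore work done by the glibc wrapper on top of this system call.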

Child Stack Setup — Why Pass the Top?

On x86 and most architectures, the stack grows downward in memory. So you allocate a block and pass a pointer to its high address end:

stackTop = stack + STACK_SIZE ← pass this to clone()

stack[65535]

stack[65534] ← stack grows downward as child runs

…

stack[0] ← malloc’d block starts here

stack = malloc(STACK_SIZE) ← low address
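The layout above can be verified at run time. This is a sketch, not the book’s code: it allocates the block with mmap() and MAP_STACK instead of malloc(), and the names child_runs_on_supplied_stack() and record_stack_address() are invented here. The child records the address of one of its local variables, and the parent checks that the address falls inside the block it supplied.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define STACK_SIZE (64 * 1024)

static volatile uintptr_t child_local_addr;   /* set by the child */

static int record_stack_address(void *arg)
{
    int local = 0;
    (void) arg;
    child_local_addr = (uintptr_t) &local;    /* an address on the child's stack */
    return local;
}

/* Hypothetical helper: returns 1 if the child's local variable lay inside
 * the supplied block, 0 if it did not, -1 on error. */
int child_runs_on_supplied_stack(void)
{
    char *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    char *stack_top = stack + STACK_SIZE;     /* high end, passed to clone() */
    child_local_addr = 0;

    /* CLONE_VM so that the parent can read what the child recorded */
    pid_t pid = clone(record_stack_address, stack_top,
                      CLONE_VM | SIGCHLD, NULL);
    if (pid == -1) {
        munmap(stack, STACK_SIZE);
        return -1;
    }
    if (waitpid(pid, NULL, 0) == -1)
        return -1;

    uintptr_t lo = (uintptr_t) stack;
    uintptr_t hi = (uintptr_t) stack_top;
    int inside = (child_local_addr >= lo && child_local_addr < hi);

    munmap(stack, STACK_SIZE);
    return inside;
}
```

On a machine whose stack grows downward, the recorded address should sit just below stack_top, which is why the function is expected to return 1.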

Example 1 — Basic clone() Usage

A minimal example that creates a child process using clone(). The child prints a message and exits; the parent waits for it.

/* basic_clone.c — Minimal clone() example */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>      /* getpid() */

#define STACK_SIZE (64 * 1024)   /* 64 KB stack for child */

/* This function runs in the child process */
static int child_func(void *arg)
{
    char *message = (char *) arg;
    printf("[Child]  PID=%d, message='%s'\n", getpid(), message);
    return 0;   /* exit status of child */
}

int main(void)
{
    char *stack;
    char *stack_top;
    pid_t child_pid;

    /* Allocate stack for the child */
    stack = malloc(STACK_SIZE);
    if (!stack) { perror("malloc"); exit(1); }

    /* Stack grows downward: pass the HIGH end */
    stack_top = stack + STACK_SIZE;

    printf("[Parent] PID=%d, creating child with clone()...\n", getpid());

    /* Create the child:
     *   - child_func: where child starts
     *   - stack_top: top of child's stack
     *   - SIGCHLD: signal sent to parent when child exits
     *   - "hello": argument to child_func
     */
    child_pid = clone(child_func, stack_top, SIGCHLD, "hello from clone");
    if (child_pid == -1) { perror("clone"); exit(1); }

    printf("[Parent] Child PID = %d\n", child_pid);

    /* Wait for child to finish */
    if (waitpid(child_pid, NULL, 0) == -1) { perror("waitpid"); exit(1); }

    printf("[Parent] Child has terminated.\n");

    free(stack);
    return 0;
}
/* Compile: gcc -o basic_clone basic_clone.c
   Run:     ./basic_clone  */

Example 2 — Sharing File Descriptors with CLONE_FILES

This example shows the difference between cloning with and without CLONE_FILES. When CLONE_FILES is set, the child and parent share the same file descriptor table — closing an fd in the child also closes it in the parent.

/* clone_files.c — Demonstrate CLONE_FILES sharing */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>
#include <errno.h>

#define STACK_SIZE (64 * 1024)

/* Receive the fd number as argument; close it and exit */
static int child_func(void *arg)
{
    int fd = *((int *) arg);
    printf("[Child] closing fd %d\n", fd);
    close(fd);
    return 0;
}

int run_test(int use_clone_files)
{
    char *stack, *stack_top;
    int   fd, flags;
    pid_t child_pid;

    stack = malloc(STACK_SIZE);
    if (!stack) { perror("malloc"); return -1; }
    stack_top = stack + STACK_SIZE;

    /* Open a file; the child will close it */
    fd = open("/dev/null", O_RDWR);
    if (fd == -1) { perror("open"); free(stack); return -1; }

    /* Choose whether to share fd table */
    flags = SIGCHLD;
    if (use_clone_files)
        flags |= CLONE_FILES;

    child_pid = clone(child_func, stack_top, flags, &fd);
    if (child_pid == -1) {
        perror("clone"); close(fd); free(stack); return -1;
    }

    if (waitpid(child_pid, NULL, 0) == -1) {
        perror("waitpid"); close(fd); free(stack); return -1;
    }

    /* Try writing to the fd — did child's close() affect us? */
    ssize_t n = write(fd, "x", 1);
    if (n == -1 && errno == EBADF)
        printf("[Parent] fd %d CLOSED — child's close() affected parent "
               "(CLONE_FILES=%s)\n", fd, use_clone_files ? "ON" : "OFF");
    else
        printf("[Parent] fd %d OPEN — child's close() did NOT affect parent "
               "(CLONE_FILES=%s)\n", fd, use_clone_files ? "ON" : "OFF");

    close(fd);   /* close even if already closed (EBADF is ok) */
    free(stack);
    return 0;
}

int main(void)
{
    printf("=== Without CLONE_FILES ===\n");
    run_test(0);

    printf("\n=== With CLONE_FILES ===\n");
    run_test(1);

    return 0;
}
/* Expected output (parent's lines only; assumes fd 3 is the first free descriptor):
   [Parent] fd 3 OPEN  — child's close() did NOT affect parent (CLONE_FILES=OFF)
   [Parent] fd 3 CLOSED — child's close() affected parent (CLONE_FILES=ON)  */

Example 3 — Custom Termination Signal with clone()

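The low byte of the flags argument specifies which signal the parent receives when the child terminates. If it is not SIGCHLD, the parent must pass __WCLONE (or __WALL) to waitpid(), because by default waitpid() only waits for children whose termination signal is SIGCHLD.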

/* clone_signal.c — Use SIGUSR1 as child termination signal */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)
#define CHILD_SIG  SIGUSR1   /* Non-standard termination signal */

static int child_func(void *arg)
{
    printf("[Child] running, PID=%d\n", getpid());
    sleep(1);
    printf("[Child] exiting\n");
    return 42;   /* exit code */
}

int main(void)
{
    char  *stack;
    char  *stack_top;
    pid_t  child_pid;
    int    status;

    stack = malloc(STACK_SIZE);
    if (!stack) { perror("malloc"); exit(1); }
    stack_top = stack + STACK_SIZE;

    /* Ignore CHILD_SIG so it doesn't terminate the parent */
    if (signal(CHILD_SIG, SIG_IGN) == SIG_ERR) {
        perror("signal"); exit(1);
    }

    /* Lower byte of flags = termination signal (SIGUSR1 here) */
    child_pid = clone(child_func, stack_top, CHILD_SIG, NULL);
    if (child_pid == -1) { perror("clone"); exit(1); }

    printf("[Parent] waiting for child %d (using __WCLONE)...\n", child_pid);

    /*
     * __WCLONE: wait for children that deliver a signal != SIGCHLD.
     * This is required when the termination signal is not SIGCHLD.
     */
    if (waitpid(-1, &status, __WCLONE) == -1) {
        perror("waitpid"); exit(1);
    }

    if (WIFEXITED(status))
        printf("[Parent] child exited with status %d\n", WEXITSTATUS(status));

    free(stack);
    return 0;
}
/* Compile: gcc -o clone_signal clone_signal.c
   Run:     ./clone_signal  */

Kernel Scheduling Entities (KSE) — Threads vs Processes

A useful way to think about Linux: both threads and processes are Kernel Scheduling Entities (KSEs). They differ only in how many attributes they share with other KSEs.

KSE Type	Shares with siblings	Created by
POSIX Process (fork)	Almost nothing — independent memory, fds, signals	`fork()` → clone(SIGCHLD)
POSIX Thread (pthread)	Memory, file descriptors, signal dispositions, process ID	`pthread_create()` → clone(CLONE_VM | CLONE_FILES | …)
vfork() child	Shares memory temporarily (until exec/exit)	`vfork()` → clone(CLONE_VM | CLONE_VFORK | SIGCHLD)

NPTL vs LinuxThreads — clone() Flag Differences

Threading Library	clone() Flags Used	Threads share PID?
LinuxThreads (old)	CLONE_VM | CLONE_FILES | CLONE_FS | CLONE_SIGHAND	❌ No: each thread has a unique PID
NPTL (modern)	All of the LinuxThreads flags, plus CLONE_THREAD | CLONE_SETTLS | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID | CLONE_SYSVSEM	✅ Yes: all threads share the TGID

/* How NPTL uses clone() internally (simplified) */
/* When you call pthread_create(), NPTL does roughly this: */

pid_t new_tid;    /* thread ID will be written here */
pid_t ctid_loc;   /* cleared when thread exits — pthread_join uses this */

clone(
    thread_start_fn,     /* pthread start routine wrapper */
    new_stack_top,
    CLONE_VM          |  /* share memory */
    CLONE_FILES       |  /* share file descriptors */
    CLONE_FS          |  /* share filesystem attributes */
    CLONE_SIGHAND     |  /* share signal dispositions */
    CLONE_THREAD      |  /* place in same thread group (share PID) */
    CLONE_SETTLS      |  /* set up thread-local storage */
    CLONE_PARENT_SETTID| /* write TID to &new_tid before return */
    CLONE_CHILD_CLEARTID|/* clear &ctid_loc on exit → wakes pthread_join */
    CLONE_SYSVSEM,       /* share SysV semaphore undo values */
    thread_arg,
    &new_tid,            /* ptid */
    tls_descriptor,      /* tls */
    &ctid_loc            /* ctid */
);
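A cut-down, runnable variant of the call above can show two of its effects: the shared PID and the futex-based join. This is a sketch under assumptions, not NPTL’s code. It omits CLONE_SETTLS and CLONE_SYSVSEM, points ptid and ctid at one word (tid_word), and uses invented names throughout. The new thread avoids library calls other than syscall(), since it has no thread-local storage of its own, and ends itself with the raw exit system call so that only this thread terminates.

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sched.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)

static pid_t tid_word;              /* ptid and ctid both point here */
static volatile pid_t seen_pid;     /* getpid() as seen by the new thread */
static volatile pid_t seen_tid;     /* gettid() as seen by the new thread */

static int thread_body(void *arg)
{
    (void) arg;
    seen_pid = (pid_t) syscall(SYS_getpid);
    seen_tid = (pid_t) syscall(SYS_gettid);
    syscall(SYS_exit, 0);           /* terminate this thread only */
    return 0;                       /* not reached */
}

/* Hypothetical helper: returns 1 if the new thread shared our PID but
 * had its own TID, 0 if not, -1 on error. */
int thread_shares_pid(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    pid_t my_pid = (pid_t) syscall(SYS_getpid);
    tid_word = 0;

    pid_t tid = clone(thread_body, stack + STACK_SIZE,
                      CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                      CLONE_THREAD | CLONE_PARENT_SETTID |
                      CLONE_CHILD_CLEARTID,
                      NULL, &tid_word, NULL, &tid_word);
    if (tid == -1) {
        free(stack);
        return -1;
    }

    /* "Join": sleep on the futex until the kernel zeroes tid_word
     * when the thread exits. The loop rechecks after every wakeup. */
    pid_t cur;
    while ((cur = __atomic_load_n(&tid_word, __ATOMIC_ACQUIRE)) != 0)
        syscall(SYS_futex, &tid_word, FUTEX_WAIT, cur, NULL, NULL, 0);

    free(stack);                    /* the thread has left user space */
    return seen_pid == my_pid && seen_tid == tid && seen_tid != seen_pid;
}
```

A thread created with CLONE_THREAD cannot be collected with waitpid(), which is why the join here goes through the futex word instead.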

⚠️ When to Use clone() Directly

In application code: almost never. clone() is not portable and requires careful stack management. Use fork() for processes and pthread_create() for threads, and reserve clone() for code such as a threading library or a container runtime. It is still worth understanding, both for interview preparation and because it shows what the kernel actually does when a process or thread is created.

Interview Questions

Q1. How does clone() differ from fork()?

fork() always creates an independent copy of the process, with copy-on-write memory, a copy of the file descriptor table, and inherited signal dispositions. clone() lets the caller choose exactly which resources are shared via its flags argument. The child starts at a supplied function rather than returning from the call, and uses a separately allocated stack. Inside the kernel, both ultimately call do_fork().

Q2. Why must you allocate a separate stack for the clone() child?

The glibc clone() wrapper starts the child in a new function (func) instead of returning from the call, so the child needs a fresh stack to run that function on. The separate stack is essential when CLONE_VM is set: parent and child then share one address space, and two KSEs running on the same stack would overwrite each other’s stack frames. Without CLONE_VM the child gets a private copy-on-write copy of the parent’s memory, so its stack writes never reach the parent, but the wrapper still requires a stack pointer for the child.

Q3. What is the purpose of the lower byte of the flags argument?

The lower 8 bits of the flags argument specify the signal sent to the parent when the child terminates. Using SIGCHLD here makes the child behave like a normal fork() child. With any other signal (such as SIGUSR1), a plain waitpid() does not see the child, so the parent must pass __WCLONE or __WALL. Setting the byte to 0 means no signal is sent at all; such a child also needs __WCLONE or __WALL.

Q4. What is a Kernel Scheduling Entity (KSE)?

A KSE is the generic term for the objects managed by the Linux kernel scheduler. Both threads and processes are KSEs. They differ in the degree of attribute sharing: a thread shares memory, file descriptors, and signal dispositions with other threads in its group, while a process has independent copies. clone() is the tool that creates KSEs with any desired degree of sharing.

Q5. How does NPTL use CLONE_CHILD_CLEARTID to implement pthread_join()?

When creating a thread, NPTL passes clone() the address of a variable as ctid with CLONE_CHILD_CLEARTID set, and passes the same address as ptid with CLONE_PARENT_SETTID so that the kernel stores the new thread’s TID there. The kernel treats that location as a futex: when the thread exits, it zeroes the variable and wakes any thread blocked in a futex wait on that address. pthread_join() performs that futex wait, so it unblocks when the target thread exits.

Q6. Why pass the HIGH address of the stack buffer to clone(), not the LOW address?

On x86 and most common architectures, the stack grows downward in memory — each function call pushes data to a lower address. If you passed the low end of the buffer, the first stack push would go to an address below the allocated buffer, corrupting unrelated memory. Passing the high end means the stack has the full buffer to grow into.