Chapter 28.2.1 — TLPI

The clone() Flags Argument
Every CLONE_* flag explained — from shared memory to containers and namespaces

The flags Argument — The Power of clone()

The flags bitmask is what makes clone() powerful. Each bit controls whether a specific resource is shared between parent and child, or whether the child gets its own private copy. By combining flags you can create anything from a fully isolated process to a thread that shares everything.

Complete Flags Reference Table

Flag | Effect When Set | Used By
--- | --- | ---
CLONE_FILES | Share open file descriptor table | POSIX threads
CLONE_FS | Share filesystem info (umask, cwd, root) | POSIX threads
CLONE_SIGHAND | Share signal disposition table | POSIX threads
CLONE_VM | Share virtual memory (same page tables) | POSIX threads, vfork()
CLONE_THREAD | Place child in parent’s thread group (share PID) | NPTL
CLONE_SYSVSEM | Share System V semaphore undo values | NPTL (kernel 2.6+)
CLONE_SETTLS | Set up thread-local storage from tls arg | NPTL (kernel 2.6+)
CLONE_PARENT_SETTID | Write child TID to ptid before returning | NPTL (kernel 2.6+)
CLONE_CHILD_SETTID | Write child TID to ctid in child’s memory | Kernel 2.6+
CLONE_CHILD_CLEARTID | Zero ctid on exit, wake futex (pthread_join) | NPTL (kernel 2.6+)
CLONE_NEWNS | Child gets copy of parent’s mount namespace | Containers (kernel 2.4.19+)
CLONE_NEWIPC | New System V IPC namespace | Containers (kernel 2.6.19+)
CLONE_NEWNET | New network namespace | Containers (kernel 2.6.24+)
CLONE_NEWPID | New PID namespace (child appears as PID 1) | Containers (kernel 2.6.24+)
CLONE_NEWUSER | New user-ID/group-ID namespace | Containers (kernel 2.6.23+)
CLONE_NEWUTS | New UTS namespace (hostname, domainname) | Containers (kernel 2.6.19+)
CLONE_PARENT | Child’s parent = caller’s parent (same PPID) | Kernel 2.4+
CLONE_VFORK | Suspend parent until child execs or exits | vfork() emulation
CLONE_PTRACE | Trace child if parent is being traced | Debuggers
CLONE_UNTRACED | Prevent CLONE_PTRACE being forced on child | Kernel threads (kernel 2.6+)
CLONE_IO | Share I/O context with parent | Kernel 2.6.25+

Section A: Flags for POSIX Thread Creation

CLONE_VM — Share Virtual Memory

When set, parent and child share the same page tables. Any memory write by one is immediately visible to the other. This is the defining characteristic of a thread. Without it, the child gets a copy-on-write copy (like fork()).

Dependency: CLONE_SIGHAND requires CLONE_VM (kernel 2.6+). CLONE_THREAD requires CLONE_SIGHAND, which requires CLONE_VM.

CLONE_FILES — Share File Descriptor Table

Parent and child share one file descriptor table. open(), close(), dup() in either process affects both. POSIX requires all threads in a process to share file descriptors. Without this flag, the child gets its own copy of the fd table (referencing the same underlying open file descriptions as fork() does).

CLONE_FS — Share Filesystem Attributes

Shared: umask, root directory (chroot()), current working directory (chdir()). A chdir() or chroot() by either process affects both. POSIX threads share these attributes. Cannot be combined with CLONE_NEWNS.

CLONE_SIGHAND — Share Signal Dispositions

The signal disposition table (what to do for each signal: ignore, default, or handler) is shared. Changing a signal handler in one process via sigaction() changes it for both. Signal masks and pending signals are always separate even when sharing dispositions.

Example 1 — Simulate a Thread (CLONE_VM + CLONE_FILES + CLONE_SIGHAND)

/* thread_like_clone.c — Create a thread-like child sharing memory */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)

int shared_counter = 0;   /* In shared memory — both parent and child see changes */

static int child_func(void *arg)
{
    printf("[Child]  shared_counter before = %d\n", shared_counter);
    shared_counter += 100;    /* This modifies PARENT's memory too (CLONE_VM) */
    printf("[Child]  shared_counter after  = %d\n", shared_counter);
    return 0;
}

int main(void)
{
    char *stack     = malloc(STACK_SIZE);
    char *stack_top = stack + STACK_SIZE;

    shared_counter = 5;
    printf("[Parent] shared_counter = %d (before clone)\n", shared_counter);

    pid_t pid = clone(
        child_func,
        stack_top,
        CLONE_VM | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
        NULL
    );
    if (pid == -1) { perror("clone"); exit(1); }

    waitpid(pid, NULL, 0);

    /* Because CLONE_VM was set, the child's write is visible here */
    printf("[Parent] shared_counter = %d (after child modified it)\n",
           shared_counter);

    free(stack);
    return 0;
}
/*
 * Output:
 * [Parent] shared_counter = 5 (before clone)
 * [Child]  shared_counter before = 5
 * [Child]  shared_counter after  = 105
 * [Parent] shared_counter = 105 (after child modified it)
 *
 * Without CLONE_VM, parent would still see 5.
 */

Section B: Thread Groups — CLONE_THREAD, TID, TGID

CLONE_THREAD — Place Child in Parent’s Thread Group

POSIX requires all threads in a process to share the same process ID. Linux achieves this via thread groups. When CLONE_THREAD is set, the child is placed in the same thread group as the parent, meaning getpid() returns the same value in all threads (the TGID).

Requires: CLONE_SIGHAND (which requires CLONE_VM).

Thread Group — TID vs TGID Diagram

Thread Group: TGID = 2001 (the process ID seen by getpid())

Thread | TID | TGID | Role
--- | --- | --- | ---
Thread A | 2001 | 2001 | Group leader
Thread B | 2002 | 2001 |
Thread C | 2003 | 2001 |
Thread D | 2004 | 2001 |

All threads: getpid() returns 2001 | gettid() returns individual TID | PPID = 1900 for all

Example 2 — CLONE_THREAD: Shared PID (TGID)

/* clone_thread.c — Show TID vs TGID with CLONE_THREAD */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <sys/syscall.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)

/* glibc before 2.30 has no gettid() wrapper, so invoke the system call directly */
static pid_t gettid_wrapper(void)
{
    return (pid_t) syscall(SYS_gettid);
}

static int thread_func(void *arg)
{
    printf("[Thread] PID (TGID) via getpid()  = %d\n", getpid());
    printf("[Thread] TID        via gettid()  = %d\n", gettid_wrapper());
    sleep(1);  /* Give parent time to print its info */
    return 0;
}

int main(void)
{
    char *stack     = malloc(STACK_SIZE);
    char *stack_top = stack + STACK_SIZE;

    printf("[Main]   PID (TGID) via getpid()  = %d\n", getpid());
    printf("[Main]   TID        via gettid()  = %d\n", gettid_wrapper());

    pid_t tid = clone(
        thread_func,
        stack_top,
        /* CLONE_THREAD requires CLONE_SIGHAND which requires CLONE_VM */
        CLONE_VM | CLONE_SIGHAND | CLONE_THREAD,
        NULL
    );
    if (tid == -1) { perror("clone"); exit(1); }

    printf("[Main]   clone() returned TID      = %d\n", tid);

    /*
     * Cannot use waitpid() for CLONE_THREAD children directly.
     * The thread must be joined via futex or we just sleep here.
     */
    sleep(2);

    free(stack);
    return 0;
}
/*
 * Output (example):
 * [Main]   PID (TGID) via getpid()  = 12345
 * [Main]   TID        via gettid()  = 12345
 * [Main]   clone() returned TID      = 12346
 * [Thread] PID (TGID) via getpid()  = 12345   ← SAME as parent!
 * [Thread] TID        via gettid()  = 12346   ← Different
 */

Section C: Namespace Flags — Building Containers

What Are Linux Namespaces?

Namespaces wrap global system resources (PIDs, network, mounts, users) so each namespace has its own private view. Processes in different namespaces cannot see each other’s resources. This is the foundation of Docker, LXC, and other container technologies.

Flag | Isolates | Container Use
--- | --- | ---
CLONE_NEWPID | Process IDs | First process in container gets PID 1
CLONE_NEWNET | Network stack | Container gets own IP, interfaces, firewall
CLONE_NEWNS | Mount points | Container has its own filesystem view
CLONE_NEWIPC | System V IPC (queues, semaphores) | Containers can’t share IPC objects
CLONE_NEWUTS | Hostname and NIS domain name | Each container has its own hostname
CLONE_NEWUSER | User and group IDs | Unprivileged containers (UID 0 inside ≠ root outside)

Container Isolation Model

Linux Kernel (example values)

Resource | Host (Default Namespace) | Container (New Namespace)
--- | --- | ---
PID namespace | 1..65535 | 1 = container init
Network | eth0, 192.168.1.1 | veth0, 172.17.0.2
Hostname | myserver | webapp-container
Mounts | /proc, /sys, /home | container root fs

Isolated via: CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWNS | CLONE_NEWIPC | CLONE_NEWUSER

Example 3 — New UTS Namespace (Change Hostname in Child)

/* clone_newuts.c — Child gets its own hostname via CLONE_NEWUTS */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* strlen() */
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)

static int child_func(void *arg)
{
    char *new_hostname = (char *) arg;

    /* Set a new hostname — only affects this UTS namespace */
    if (sethostname(new_hostname, strlen(new_hostname)) == -1) {
        perror("sethostname"); return 1;
    }

    char buf[256];
    gethostname(buf, sizeof(buf));
    printf("[Child]  hostname = '%s'\n", buf);
    return 0;
}

int main(void)
{
    char *stack     = malloc(STACK_SIZE);
    char *stack_top = stack + STACK_SIZE;
    char  host_before[256];

    gethostname(host_before, sizeof(host_before));
    printf("[Parent] hostname before = '%s'\n", host_before);

    /* CLONE_NEWUTS: child gets its own UTS namespace */
    /* Requires CAP_SYS_ADMIN (root) */
    pid_t pid = clone(child_func, stack_top,
                      CLONE_NEWUTS | SIGCHLD,
                      "my-container-host");
    if (pid == -1) { perror("clone"); exit(1); }

    waitpid(pid, NULL, 0);

    /* Parent's hostname is UNCHANGED */
    char host_after[256];
    gethostname(host_after, sizeof(host_after));
    printf("[Parent] hostname after  = '%s' (unchanged)\n", host_after);

    free(stack);
    return 0;
}
/* Run as root: sudo ./clone_newuts
 * Output:
 * [Parent] hostname before = 'myserver'
 * [Child]  hostname = 'my-container-host'
 * [Parent] hostname after  = 'myserver'    ← Parent unaffected
 */

Section D: Threading Support Flags — SETTID / CLEARTID

Flag | When is TID written/cleared? | Why needed?
--- | --- | ---
CLONE_PARENT_SETTID | Before clone() returns, in parent’s memory at ptid | Race-free TID capture. Return value of clone() can race with thread exit handler.
CLONE_CHILD_SETTID | After clone, in child’s memory at ctid | Flexibility for other threading implementations.
CLONE_CHILD_CLEARTID | When child exits: zeros ctid and wakes futex | How pthread_join() detects thread termination without polling.

⚠️ Why CLONE_PARENT_SETTID Solves a Race Condition

/* Without CLONE_PARENT_SETTID — RACE CONDITION */
tid = clone(...);
/* If child exits BEFORE this assignment completes,
   the signal handler fires with tid still = 0.
   Handler cannot identify the thread. BROKEN! */

/* With CLONE_PARENT_SETTID — SAFE */
/* Kernel writes TID to &new_tid BEFORE clone() returns,
   so no matter when the signal fires, new_tid has the right value. */
clone(..., CLONE_PARENT_SETTID, ..., &new_tid, ...);

Section E: Mount Namespaces — CLONE_NEWNS

CLONE_NEWNS gives the child its own copy of the parent’s mount namespace. Calls to mount() and umount() in the child do not affect the parent’s view of the filesystem. This is the basis for container isolation.

Important: CLONE_NEWNS and CLONE_FS cannot be used together — they are contradictory (NEWNS gives a private namespace copy; CLONE_FS would then share attributes back with the parent).

Interview Questions

Q1. What is the difference between TID and TGID in Linux?
TID (Thread ID) is a unique identifier for each kernel scheduling entity (thread). TGID (Thread Group ID) is the ID shared by all threads in the same process. getpid() returns the TGID; gettid() returns the TID. The thread group leader has TID == TGID. This is how Linux satisfies the POSIX requirement that getpid() returns the same value in all threads of a process.
Q2. Why does CLONE_SIGHAND require CLONE_VM in Linux 2.6+?
Signal handlers are pointers to functions in the process’s address space. If two kernel scheduling entities (KSEs) shared signal handlers but had different virtual memory, a handler pointer installed by one could point to the wrong location in the other. By requiring CLONE_VM, the kernel ensures both KSEs have the same address space, so handler pointers are valid in both.
Q3. What clone() flags does the NPTL implementation use to create a thread?
NPTL uses: CLONE_VM (shared memory), CLONE_FILES (shared fds), CLONE_FS (shared filesystem attrs), CLONE_SIGHAND (shared signal handlers), CLONE_THREAD (same thread group / shared PID), CLONE_SETTLS (thread-local storage), CLONE_PARENT_SETTID (race-free TID capture), CLONE_CHILD_CLEARTID (futex notification for pthread_join), and CLONE_SYSVSEM (shared semaphore undo values).
Q4. How do Linux namespaces enable container isolation?
Each namespace type wraps a global resource (PIDs, network, mounts, IPC, UTS, users) so processes in the namespace see their own private instance. For example, CLONE_NEWPID creates a new PID namespace where the first process appears as PID 1, even though it has a different PID in the host namespace. CLONE_NEWNET gives an isolated network stack. Combining all namespace flags gives near-complete isolation — the basis of Docker containers.
Q5. How does CLONE_CHILD_CLEARTID enable pthread_join()?
When a thread is created with CLONE_CHILD_CLEARTID, the kernel records the ctid address for that thread. (NPTL points ptid and ctid at the same location, so CLONE_PARENT_SETTID has already stored the thread’s TID there.) When the thread exits, the kernel writes 0 to ctid and performs a futex wake on that address. pthread_join() does a futex wait on the same address, so it blocks until the thread has exited and then returns, with no polling.
Q6. What does the unshare() system call do?
unshare() (added in kernel 2.6.16) lets a process undo some sharing that was established at clone/fork time. For example, a process created with CLONE_FILES can later call unshare(CLONE_FILES) to get its own private copy of the file descriptor table. This is used by the unshare(1) command-line tool to run processes in new namespaces without creating a new child process.
