30.1.3 — Performance of Mutexes

How much does a mutex really cost? Futexes, atomics, and the fast path explained

Key Terms in This File

futex fast user-space mutex atomic operations lock contention system call cost NPTL fcntl lock semaphore

Is Using a Mutex Expensive?

A common concern for beginners: “If I lock and unlock a mutex in every loop iteration, won’t my program be much slower?”

The short answer is: yes, there is a cost, but it is much smaller than you might think, and for most programs, the overhead is negligible.

The TLPI book measured real numbers on an x86-32 system running Linux 2.6.31 with NPTL (Native POSIX Thread Library):

Mechanism	Time for 20 million operations	Requires system call?
No mutex (raw increment, wrong result)	~0.35 seconds (10M loops × 2 threads)	No
pthreads mutex (uncontended)	~3.1 seconds (10M loops × 2 threads)	No (uses atomic ops in userspace)
fcntl() file region locks	~44 seconds for 20M operations	Yes (every lock/unlock)
System V semaphore (semop)	~28 seconds for 20M operations	Yes (every operation)

Key insight: Pthreads mutexes are roughly 9× slower than no synchronization, but roughly 9× faster than file locks or semaphores. In real applications where threads do actual work (not just increment a counter), the mutex overhead becomes a tiny fraction of total execution time.

Why Are Mutexes So Fast? — Futexes

On Linux, pthreads mutexes are implemented using futexes (Fast User-space muTEXes). Understanding futexes explains why mutex performance is so good.

What is a futex?
A futex is a mechanism that allows locking to happen entirely in user-space using atomic CPU instructions when there is no contention (no other thread is waiting). Only when there IS contention (a thread needs to sleep/wake) does it make a system call.

The Futex Fast Path (No System Call):

Thread calls
pthread_mutex_lock()

→

Atomic CAS instruction
(compare-and-swap in userspace)

→

Lock acquired
(NO kernel involvement)

The Futex Slow Path (System Call needed):

Mutex is locked
by another thread

→

futex() system call
(thread goes to sleep)

→

Woken on unlock
via futex() again

This means: when there is no contention (which is the common case in well-designed programs), locking and unlocking a mutex costs only a few CPU instructions — no kernel involvement at all. The system call overhead only occurs when threads are actually competing for the same mutex simultaneously.

Why Are File Locks and Semaphores Slower?

Unlike mutexes, fcntl() file locks and System V semaphores always require a system call for every lock and unlock operation. Each system call has overhead: it involves switching from user mode to kernel mode, context save/restore, and switching back. Even though each system call is fast (microseconds), doing it 20 million times adds up to many seconds.

Remember: Mutexes are only for synchronizing threads within the same process. If you need to synchronize between different processes, you must use file locks, POSIX semaphores, or shared memory with mutexes with the PTHREAD_PROCESS_SHARED attribute.

What is “Lock Contention”?

Contention occurs when multiple threads try to acquire the same mutex at the same time. High contention means threads are frequently blocked waiting — this is the scenario where the slow path (system call) is taken and performance suffers.

To keep performance high:

Keep critical sections short — do as little work as possible while holding the mutex.
Don’t call slow operations (I/O, network, malloc) while holding a mutex.
Use separate mutexes for separate data structures — this reduces contention by allowing more parallel access.

Code Example 1: Measuring Mutex Overhead

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <pthread.h>

#define LOOPS 1000000

static int glob = 0;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

/* Version 1: no mutex (fast but wrong) */
static void *no_mutex_thread(void *arg)
{
    int loc;
    for (int i = 0; i < LOOPS; i++) {
        loc = glob; loc++; glob = loc;
    }
    return NULL;
}

/* Version 2: with mutex (correct) */
static void *with_mutex_thread(void *arg)
{
    int loc;
    for (int i = 0; i < LOOPS; i++) {
        pthread_mutex_lock(&mtx);
        loc = glob; loc++; glob = loc;
        pthread_mutex_unlock(&mtx);
    }
    return NULL;
}

double measure_time(void *(*fn)(void *))
{
    pthread_t t1, t2;
    struct timespec start, end;
    glob = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    pthread_create(&t1, NULL, fn, NULL);
    pthread_create(&t2, NULL, fn, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec) +
                     (end.tv_nsec - start.tv_nsec) / 1e9;
    return elapsed;
}

int main(void)
{
    double t_no_mtx  = measure_time(no_mutex_thread);
    double t_with_mtx = measure_time(with_mutex_thread);

    printf("Without mutex: %.3f sec, glob = %d (WRONG if != %d)\n",
           t_no_mtx, glob, 2 * LOOPS);

    glob = 0;
    t_with_mtx = measure_time(with_mutex_thread);
    printf("With mutex:    %.3f sec, glob = %d (always correct)\n",
           t_with_mtx, glob);

    printf("Mutex overhead factor: %.1fx slower\n",
           t_with_mtx / t_no_mtx);
    return 0;
}

/*
 * Compile: gcc -O2 -o perf_test perf_test.c -lpthread
 *
 * Typical output on modern x86_64:
 *   Without mutex: 0.02 sec, glob = 1234567 (WRONG if != 2000000)
 *   With mutex:    0.18 sec, glob = 2000000 (always correct)
 *   Mutex overhead factor: 9.0x slower
 *
 * But note: these threads ONLY increment a counter.
 * Real threads do much more work per lock, so overhead is tiny.
 */

Code Example 2: Keeping Critical Sections Short (Best Practice)

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

static int result_list[1000];
static int result_count = 0;
static pthread_mutex_t list_mtx = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    int id = *((int *) arg);

    for (int i = 0; i < 100; i++) {
        /* ===== SLOW COMPUTATION (outside the lock) ===== */
        /* Do heavy work BEFORE acquiring the mutex */
        int computed_value = id * 1000 + i;
        /* Simulate some work: sleep(0) = yield */
        /* In real code this might be a calculation, I/O result, etc. */

        /* ===== SHORT CRITICAL SECTION (inside the lock) ===== */
        /* Only hold the mutex for the quick update */
        pthread_mutex_lock(&list_mtx);
        result_list[result_count++] = computed_value;
        pthread_mutex_unlock(&list_mtx);
        /* ===== END CRITICAL SECTION ===== */

        /* More work after releasing the lock */
    }
    return NULL;
}

int main(void)
{
    pthread_t threads[5];
    int ids[5];

    for (int i = 0; i < 5; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < 5; i++)
        pthread_join(threads[i], NULL);

    printf("Collected %d results\n", result_count);
    return 0;
}

/*
 * GOOD PATTERN: Heavy computation happens OUTSIDE the lock.
 * Only the minimal write to the shared list is inside the lock.
 * This keeps contention low and threads running in parallel.
 */

Interview Questions — Mutex Performance

Q1. What is a futex and why does it make mutexes fast? A futex (Fast User-space muTEX) is a Linux kernel mechanism. It allows mutex lock/unlock to be performed entirely in user-space using atomic CPU instructions when there is no contention — no kernel mode switch is needed. Only when a thread needs to sleep (because the mutex is locked) does it invoke a kernel system call. Since uncontended locks are the common case, mutexes avoid the kernel overhead most of the time.

Q2. How do pthreads mutexes compare in performance to fcntl() file locks? Mutexes are roughly 14× faster than fcntl() file locks for the same number of operations. File locks always require a system call for every lock and unlock, whereas mutex locks on the uncontended path use only atomic hardware instructions with no system call.

Q3. What is “lock contention” and how does it affect performance? Lock contention occurs when multiple threads frequently try to acquire the same mutex simultaneously. When a thread cannot get the lock, it must call the kernel to sleep, and this involves an expensive context switch. High contention degrades performance significantly. It is reduced by keeping critical sections short and using fine-grained locking (separate mutexes for separate resources).

Q4. What is the golden rule for keeping mutex overhead low? Keep the critical section as short as possible. Do all computation outside the lock; only perform the actual shared-state update inside the lock. Never do I/O, sleep, or other slow operations while holding a mutex.

Q5. Can pthreads mutexes be used to synchronize threads in different processes? By default, no — pthreads mutexes work only within a single process. To share a mutex between processes, you must place it in shared memory and initialize it with the PTHREAD_PROCESS_SHARED process-shared attribute using pthread_mutexattr_setpshared().

Q6. What does NPTL stand for and what is its significance? NPTL stands for Native POSIX Thread Library. It is the Linux pthreads implementation that replaced the older LinuxThreads. NPTL uses futexes for mutex implementation, which dramatically improved mutex performance compared to the older implementation that required system calls for every mutex operation.

← 30.1.2: Lock/Unlock Next: Deadlocks →

embeddedpathashala.com

30.1.3 — Performance of Mutexes

30.1.3 — Performance of Mutexes

Is Using a Mutex Expensive?

Why Are Mutexes So Fast? — Futexes

Why Are File Locks and Semaphores Slower?

What is “Lock Contention”?

Code Example 1: Measuring Mutex Overhead

Code Example 2: Keeping Critical Sections Short (Best Practice)

Interview Questions — Mutex Performance

Leave a Reply Cancel reply