30.1.3 — Performance of Mutexes
How much does a mutex really cost? Futexes, atomics, and the fast path explained
Is Using a Mutex Expensive?
A common concern for beginners: “If I lock and unlock a mutex in every loop iteration, won’t my program be much slower?”
The short answer is: yes, there is a cost, but it is much smaller than you might think, and for most programs, the overhead is negligible.
The TLPI book measured real numbers on an x86-32 system running Linux 2.6.31 with NPTL (Native POSIX Thread Library):
| Mechanism | Time for 20 million operations | Requires system call? |
|---|---|---|
| No mutex (raw increment, wrong result) | ~0.35 seconds (10M loops × 2 threads) | No |
| pthreads mutex (uncontended) | ~3.1 seconds (10M loops × 2 threads) | No (uses atomic ops in userspace) |
| fcntl() file region locks | ~44 seconds for 20M operations | Yes (every lock/unlock) |
| System V semaphore (semop) | ~28 seconds for 20M operations | Yes (every operation) |
Why Are Mutexes So Fast? — Futexes
On Linux, pthreads mutexes are implemented using futexes (Fast User-space muTEXes). Understanding futexes explains why mutex performance is so good.
A futex is a mechanism that allows locking to happen entirely in user-space using atomic CPU instructions when there is no contention (no other thread is waiting). Only when there IS contention (a thread needs to sleep/wake) does it make a system call.
The Futex Fast Path (No System Call):
pthread_mutex_lock()
(compare-and-swap in userspace)
(NO kernel involvement)
The Futex Slow Path (System Call needed):
by another thread
(thread goes to sleep)
via futex() again
This means: when there is no contention (which is the common case in well-designed programs), locking and unlocking a mutex costs only a few CPU instructions — no kernel involvement at all. The system call overhead only occurs when threads are actually competing for the same mutex simultaneously.
Why Are File Locks and Semaphores Slower?
Unlike mutexes, fcntl() file locks and System V semaphores always require a system call for every lock and unlock operation. Each system call has overhead: it involves switching from user mode to kernel mode, context save/restore, and switching back. Even though each system call is fast (microseconds), doing it 20 million times adds up to many seconds.
PTHREAD_PROCESS_SHARED attribute.What is “Lock Contention”?
Contention occurs when multiple threads try to acquire the same mutex at the same time. High contention means threads are frequently blocked waiting — this is the scenario where the slow path (system call) is taken and performance suffers.
To keep performance high:
- Keep critical sections short — do as little work as possible while holding the mutex.
- Don’t call slow operations (I/O, network, malloc) while holding a mutex.
- Use separate mutexes for separate data structures — this reduces contention by allowing more parallel access.
Code Example 1: Measuring Mutex Overhead
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <pthread.h>
#define LOOPS 1000000
static int glob = 0;
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
/* Version 1: no mutex (fast but wrong) */
static void *no_mutex_thread(void *arg)
{
int loc;
for (int i = 0; i < LOOPS; i++) {
loc = glob; loc++; glob = loc;
}
return NULL;
}
/* Version 2: with mutex (correct) */
static void *with_mutex_thread(void *arg)
{
int loc;
for (int i = 0; i < LOOPS; i++) {
pthread_mutex_lock(&mtx);
loc = glob; loc++; glob = loc;
pthread_mutex_unlock(&mtx);
}
return NULL;
}
double measure_time(void *(*fn)(void *))
{
pthread_t t1, t2;
struct timespec start, end;
glob = 0;
clock_gettime(CLOCK_MONOTONIC, &start);
pthread_create(&t1, NULL, fn, NULL);
pthread_create(&t2, NULL, fn, NULL);
pthread_join(t1, NULL);
pthread_join(t2, NULL);
clock_gettime(CLOCK_MONOTONIC, &end);
double elapsed = (end.tv_sec - start.tv_sec) +
(end.tv_nsec - start.tv_nsec) / 1e9;
return elapsed;
}
int main(void)
{
double t_no_mtx = measure_time(no_mutex_thread);
double t_with_mtx = measure_time(with_mutex_thread);
printf("Without mutex: %.3f sec, glob = %d (WRONG if != %d)\n",
t_no_mtx, glob, 2 * LOOPS);
glob = 0;
t_with_mtx = measure_time(with_mutex_thread);
printf("With mutex: %.3f sec, glob = %d (always correct)\n",
t_with_mtx, glob);
printf("Mutex overhead factor: %.1fx slower\n",
t_with_mtx / t_no_mtx);
return 0;
}
/*
* Compile: gcc -O2 -o perf_test perf_test.c -lpthread
*
* Typical output on modern x86_64:
* Without mutex: 0.02 sec, glob = 1234567 (WRONG if != 2000000)
* With mutex: 0.18 sec, glob = 2000000 (always correct)
* Mutex overhead factor: 9.0x slower
*
* But note: these threads ONLY increment a counter.
* Real threads do much more work per lock, so overhead is tiny.
*/
Code Example 2: Keeping Critical Sections Short (Best Practice)
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>
static int result_list[1000];
static int result_count = 0;
static pthread_mutex_t list_mtx = PTHREAD_MUTEX_INITIALIZER;
static void *worker(void *arg)
{
int id = *((int *) arg);
for (int i = 0; i < 100; i++) {
/* ===== SLOW COMPUTATION (outside the lock) ===== */
/* Do heavy work BEFORE acquiring the mutex */
int computed_value = id * 1000 + i;
/* Simulate some work: sleep(0) = yield */
/* In real code this might be a calculation, I/O result, etc. */
/* ===== SHORT CRITICAL SECTION (inside the lock) ===== */
/* Only hold the mutex for the quick update */
pthread_mutex_lock(&list_mtx);
result_list[result_count++] = computed_value;
pthread_mutex_unlock(&list_mtx);
/* ===== END CRITICAL SECTION ===== */
/* More work after releasing the lock */
}
return NULL;
}
int main(void)
{
pthread_t threads[5];
int ids[5];
for (int i = 0; i < 5; i++) {
ids[i] = i;
pthread_create(&threads[i], NULL, worker, &ids[i]);
}
for (int i = 0; i < 5; i++)
pthread_join(threads[i], NULL);
printf("Collected %d results\n", result_count);
return 0;
}
/*
* GOOD PATTERN: Heavy computation happens OUTSIDE the lock.
* Only the minimal write to the shared list is inside the lock.
* This keeps contention low and threads running in parallel.
*/
Interview Questions — Mutex Performance
PTHREAD_PROCESS_SHARED process-shared attribute using pthread_mutexattr_setpshared().