When you are writing a server that handles hundreds or thousands of clients at the same time, the choice of I/O monitoring API matters a lot. Linux gives you three main choices: select(), poll(), and epoll. On the surface they all do the same thing — tell you which file descriptors are ready for reading or writing. But under the hood they work very differently, and that difference becomes critical when N (the number of file descriptors you monitor) is large.
This tutorial explains exactly why epoll outperforms the older two, backed by real benchmark numbers from the Linux kernel source book.
Imagine you are writing a chat server. You have 10,000 clients connected. At any given moment, maybe only 5 of them are sending a message. The other 9,995 are just sitting idle.
With poll() or select(), every time you call these functions, the kernel has to scan through all 10,000 file descriptors to check which ones are ready. It doesn’t matter that only 5 are active — it still checks all 10,000 every single time.
This is called O(N) scaling — the work done grows linearly with the number of file descriptors N. When N is 10000, that’s a lot of wasted work.
There is also a second problem with select() specifically: it uses a bitmask called fd_set to represent which file descriptors to monitor. The size of this bitmask is limited by FD_SETSIZE, which is typically 1024. To test with large numbers of file descriptors, the benchmark had to recompile glibc headers with FD_SETSIZE = 16384. This is a real-world limitation.
The following table shows CPU time (in seconds) for 100,000 monitoring operations on Linux 2.6.25. In each operation, exactly one randomly selected file descriptor becomes ready.
| File Descriptors (N) | poll() CPU (sec) | select() CPU (sec) | epoll CPU (sec) |
|---|---|---|---|
| 10 | 0.61 | 0.73 | 0.41 |
| 100 | 2.9 | 3.0 | 0.42 |
| 1,000 | 35 | 35 | 0.53 |
| 10,000 | 990 | 930 | 0.66 |
At N=10, epoll is already a bit faster but not dramatically so. At N=10,000, poll() takes 990 seconds vs epoll’s 0.66 seconds — a difference of about 1500x. And worse, the scaling for poll/select at N=10,000 is worse than linear — it degrades faster than expected.
With poll()/select(), the kernel knows nothing about your file descriptors between calls. Every call is stateless — you hand it the full list, it scans everything, and returns.
With epoll, when you call epoll_ctl() to register a file descriptor, the kernel attaches a note to that file descriptor’s internal data structure (called the open file description). When I/O activity happens on that fd — say, data arrives — the kernel automatically adds that fd to a ready list. When you call epoll_wait(), it just fetches items from that ready list. No full scan needed.
fd registered
to open file description
(data arrives on fd)
Ready List
only ready fds
With poll()/select(), you copy the entire list of file descriptors into the kernel on every call, and the kernel copies a result back. For N=10,000, that is a lot of data being transferred back and forth on every single call.
With epoll, you build the interest list once using epoll_ctl(). After that, epoll_wait() passes nothing to the kernel about file descriptors — it only gets back the ones that are actually ready. This eliminates the per-call overhead that kills performance at large N.
Here is a minimal but complete example showing how to set up epoll to monitor a server socket and client connections. Compare this structure to a poll()-based server to understand the difference.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <errno.h>
#define MAX_EVENTS 64
#define PORT 8080
int main(void)
{
int server_fd, epfd, nfds;
struct epoll_event ev, events[MAX_EVENTS];
struct sockaddr_in addr;
/* Step 1: Create the epoll instance */
epfd = epoll_create1(0);
if (epfd == -1) {
perror("epoll_create1");
exit(1);
}
/* Step 2: Create server socket */
server_fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
memset(&addr, 0, sizeof(addr));
addr.sin_family = AF_INET;
addr.sin_addr.s_addr = INADDR_ANY;
addr.sin_port = htons(PORT);
bind(server_fd, (struct sockaddr *)&addr, sizeof(addr));
listen(server_fd, 128);
/* Step 3: Register server_fd with epoll (done ONCE) */
ev.events = EPOLLIN;
ev.data.fd = server_fd;
if (epoll_ctl(epfd, EPOLL_CTL_ADD, server_fd, &ev) == -1) {
perror("epoll_ctl: server_fd");
exit(1);
}
printf("Listening on port %d...\n", PORT);
/* Step 4: Event loop — epoll_wait only returns READY fds */
for (;;) {
/*
* No need to pass all N fds here.
* Kernel already knows what to watch (registered via epoll_ctl).
* It returns only the descriptors that are actually ready.
*/
nfds = epoll_wait(epfd, events, MAX_EVENTS, -1);
if (nfds == -1) {
perror("epoll_wait");
break;
}
for (int i = 0; i < nfds; i++) {
if (events[i].data.fd == server_fd) {
/* New connection */
int client_fd = accept(server_fd, NULL, NULL);
if (client_fd == -1) continue;
/* Register new client with epoll */
ev.events = EPOLLIN;
ev.data.fd = client_fd;
epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);
printf("New client connected: fd=%d\n", client_fd);
} else {
/* Existing client sent data */
char buf[256];
int n = read(events[i].data.fd, buf, sizeof(buf));
if (n <= 0) {
/* Client disconnected — remove from epoll */
epoll_ctl(epfd, EPOLL_CTL_DEL, events[i].data.fd, NULL);
close(events[i].data.fd);
} else {
buf[n] = '\0';
printf("fd=%d says: %s\n", events[i].data.fd, buf);
}
}
}
}
close(epfd);
close(server_fd);
return 0;
}
epoll_ctl() is called once per fd (when you add it). epoll_wait() is called in a loop but it never receives a list of all fds — the kernel already has that stored internally. When 5 out of 10,000 clients are ready, epoll_wait() returns exactly 5 events, not 10,000.| Property | poll() | select() | epoll |
|---|---|---|---|
| Scales with | O(N) fds monitored | O(N) fds monitored | O(events) ready |
| fd limit | ulimit | FD_SETSIZE (1024) | ulimit (no bitmask) |
| Per-call fd copy | Yes (all N) | Yes (all N) | No |
| Kernel fd scan | Every call | Every call | Never (event-driven) |
| Best for | Small N | Small N, portability | Large N, servers |
The ideal use case for epoll is exactly the classic server scenario: many file descriptors monitored, but only a few are active at any given moment. The sparser the activity, the more epoll wins.
Answer: On every call to poll(), the kernel must check all file descriptors you passed in, even if only one is ready. This gives O(N) behavior per call. Additionally, the full list of N descriptors is copied from user space to kernel space on every call, adding transfer overhead. As N grows large, both the scanning and the data transfer time become the bottleneck.
Answer: epoll uses two key mechanisms. First, when you register a file descriptor with epoll_ctl(), the kernel attaches a callback to that fd’s open file description. When an I/O event occurs, the kernel automatically adds the fd to a ready list. Second, epoll_wait() simply fetches from that ready list — no scan of all N fds happens. The time taken is proportional to the number of events that occurred, not the number of fds being monitored.
Answer: FD_SETSIZE is the maximum number of file descriptors that can be stored in an fd_set bitmask used by select(). It defaults to 1024 on most Linux systems. Any file descriptor with a number equal to or greater than FD_SETSIZE cannot be monitored with select() at all. To monitor more fds, you have to recompile glibc with a larger value — which is impractical in production. epoll has no such hard limit.
Answer:
- epoll_create1(flags) — Creates an epoll instance and returns a file descriptor referring to it.
- epoll_ctl(epfd, op, fd, event) — Adds, modifies, or removes a file descriptor from the epoll interest list. Operations are EPOLL_CTL_ADD, EPOLL_CTL_MOD, EPOLL_CTL_DEL.
- epoll_wait(epfd, events, maxevents, timeout) — Waits for events on the registered fds. Returns only the fds that are currently ready, up to maxevents.
Answer: With poll(), you pass an array of N pollfd structs to the kernel every call, and the kernel fills in the revents field for every fd and returns. With epoll_wait(), you pass no fd information at all — the kernel already has the interest list from the earlier epoll_ctl() calls. epoll_wait() only returns data for the fds that are actually ready (typically very few). This dramatically reduces the user-kernel data transfer overhead at large N.
Answer: When N is small (say, fewer than 10–20 file descriptors) and all or most of them are frequently active, epoll’s overhead of maintaining the kernel data structure may actually be slightly higher than poll(). From the benchmark table, at N=10, epoll is only marginally faster than poll() (0.41s vs 0.61s). For small embedded systems or simple local sockets, poll() is perfectly fine and involves less complexity.
Now that you understand why epoll is fast, learn about the two notification modes epoll supports — and why choosing the wrong one can cause silent bugs in your server.
