epoll vs poll vs select — Performance Deep Dive

 

epoll vs poll vs select — Performance Deep Dive
Chapter 63 · Linux Systems Programming · EmbeddedPathashala
3
I/O APIs compared
2400x
epoll faster at N=10000
O(1)
epoll scaling

When you are writing a server that handles hundreds or thousands of clients at the same time, the choice of I/O monitoring API matters a lot. Linux gives you three main choices: select(), poll(), and epoll. On the surface they all do the same thing — tell you which file descriptors are ready for reading or writing. But under the hood they work very differently, and that difference becomes critical when N (the number of file descriptors you monitor) is large.

This tutorial explains exactly why epoll outperforms the older two, backed by real benchmark numbers from the Linux kernel source book.

Key Concepts in This Tutorial
epoll poll() select() FD_SETSIZE epoll_ctl() epoll_wait() Ready List Kernel Space O(N) vs O(1) Scalability

Why poll() and select() Are Slow at Scale

Imagine you are writing a chat server. You have 10,000 clients connected. At any given moment, maybe only 5 of them are sending a message. The other 9,995 are just sitting idle.

With poll() or select(), every time you call these functions, the kernel has to scan through all 10,000 file descriptors to check which ones are ready. It doesn’t matter that only 5 are active — it still checks all 10,000 every single time.

This is called O(N) scaling — the work done grows linearly with the number of file descriptors N. When N is 10000, that’s a lot of wasted work.

How poll() / select() Works Internally
Your App calls poll()
Kernel receives ALL N fd descriptors
Kernel checks fd[0], fd[1], fd[2] … fd[N-1]
Returns readiness info for ALL N fds
Problem: Even if only 1 fd is ready, the kernel scanned all N. Next call — same scan again. Every call is O(N).

There is also a second problem with select() specifically: it uses a bitmask called fd_set to represent which file descriptors to monitor. The size of this bitmask is limited by FD_SETSIZE, which is typically 1024. To test with large numbers of file descriptors, the benchmark had to recompile glibc headers with FD_SETSIZE = 16384. This is a real-world limitation.

Benchmark Numbers — The Proof

The following table shows CPU time (in seconds) for 100,000 monitoring operations on Linux 2.6.25. In each operation, exactly one randomly selected file descriptor becomes ready.

File Descriptors (N) poll() CPU (sec) select() CPU (sec) epoll CPU (sec)
10 0.61 0.73 0.41
100 2.9 3.0 0.42
1,000 35 35 0.53
10,000 990 930 0.66

Relative Cost at N=10,000 (normalized)
poll() 990s
select() 930s
epoll 0.66s ✓

At N=10, epoll is already a bit faster but not dramatically so. At N=10,000, poll() takes 990 seconds vs epoll’s 0.66 seconds — a difference of about 1500x. And worse, the scaling for poll/select at N=10,000 is worse than linear — it degrades faster than expected.

Why epoll Is Fast — The Internal Design
Reason 1: Kernel Maintains a Ready List

With poll()/select(), the kernel knows nothing about your file descriptors between calls. Every call is stateless — you hand it the full list, it scans everything, and returns.

With epoll, when you call epoll_ctl() to register a file descriptor, the kernel attaches a note to that file descriptor’s internal data structure (called the open file description). When I/O activity happens on that fd — say, data arrives — the kernel automatically adds that fd to a ready list. When you call epoll_wait(), it just fetches items from that ready list. No full scan needed.

epoll Internal Mechanism
epoll_ctl() called
fd registered
Kernel attaches hook
to open file description
I/O event occurs
(data arrives on fd)
Kernel adds fd to
Ready List
epoll_wait() returns
only ready fds
Reason 2: Kernel Data Structure Persists Between Calls

With poll()/select(), you copy the entire list of file descriptors into the kernel on every call, and the kernel copies a result back. For N=10,000, that is a lot of data being transferred back and forth on every single call.

With epoll, you build the interest list once using epoll_ctl(). After that, epoll_wait() passes nothing to the kernel about file descriptors — it only gets back the ones that are actually ready. This eliminates the per-call overhead that kills performance at large N.

poll() / select() — Every Call
① App → Kernel: send ALL N fds
② Kernel: scan ALL N fds
③ Kernel → App: return status of ALL N fds
Repeated every call. O(N) always.
epoll — Setup + Calls
① epoll_ctl(): register fds once in kernel
② I/O event → kernel auto-updates ready list
③ epoll_wait(): returns only ready fds (few)
O(events), not O(N). Scales beautifully.

Code Example — epoll Basic Setup

Here is a minimal but complete example showing how to set up epoll to monitor a server socket and client connections. Compare this structure to a poll()-based server to understand the difference.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <errno.h>

#define MAX_EVENTS  64
#define PORT        8080

int main(void)
{
    int server_fd, epfd, nfds;
    struct epoll_event ev, events[MAX_EVENTS];
    struct sockaddr_in addr;

    /* Step 1: Create the epoll instance */
    epfd = epoll_create1(0);
    if (epfd == -1) {
        perror("epoll_create1");
        exit(1);
    }

    /* Step 2: Create server socket */
    server_fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port        = htons(PORT);
    bind(server_fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(server_fd, 128);

    /* Step 3: Register server_fd with epoll (done ONCE) */
    ev.events  = EPOLLIN;
    ev.data.fd = server_fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, server_fd, &ev) == -1) {
        perror("epoll_ctl: server_fd");
        exit(1);
    }

    printf("Listening on port %d...\n", PORT);

    /* Step 4: Event loop — epoll_wait only returns READY fds */
    for (;;) {
        /*
         * No need to pass all N fds here.
         * Kernel already knows what to watch (registered via epoll_ctl).
         * It returns only the descriptors that are actually ready.
         */
        nfds = epoll_wait(epfd, events, MAX_EVENTS, -1);
        if (nfds == -1) {
            perror("epoll_wait");
            break;
        }

        for (int i = 0; i < nfds; i++) {
            if (events[i].data.fd == server_fd) {
                /* New connection */
                int client_fd = accept(server_fd, NULL, NULL);
                if (client_fd == -1) continue;

                /* Register new client with epoll */
                ev.events  = EPOLLIN;
                ev.data.fd = client_fd;
                epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);
                printf("New client connected: fd=%d\n", client_fd);
            } else {
                /* Existing client sent data */
                char buf[256];
                int n = read(events[i].data.fd, buf, sizeof(buf));
                if (n <= 0) {
                    /* Client disconnected — remove from epoll */
                    epoll_ctl(epfd, EPOLL_CTL_DEL, events[i].data.fd, NULL);
                    close(events[i].data.fd);
                } else {
                    buf[n] = '\0';
                    printf("fd=%d says: %s\n", events[i].data.fd, buf);
                }
            }
        }
    }

    close(epfd);
    close(server_fd);
    return 0;
}
Key observation: epoll_ctl() is called once per fd (when you add it). epoll_wait() is called in a loop but it never receives a list of all fds — the kernel already has that stored internally. When 5 out of 10,000 clients are ready, epoll_wait() returns exactly 5 events, not 10,000.

Scaling Summary

Property poll() select() epoll
Scales with O(N) fds monitored O(N) fds monitored O(events) ready
fd limit ulimit FD_SETSIZE (1024) ulimit (no bitmask)
Per-call fd copy Yes (all N) Yes (all N) No
Kernel fd scan Every call Every call Never (event-driven)
Best for Small N Small N, portability Large N, servers

The ideal use case for epoll is exactly the classic server scenario: many file descriptors monitored, but only a few are active at any given moment. The sparser the activity, the more epoll wins.

Interview Questions
Q1: Why does poll() perform poorly when monitoring a large number of file descriptors?

Answer: On every call to poll(), the kernel must check all file descriptors you passed in, even if only one is ready. This gives O(N) behavior per call. Additionally, the full list of N descriptors is copied from user space to kernel space on every call, adding transfer overhead. As N grows large, both the scanning and the data transfer time become the bottleneck.

Q2: How does epoll maintain performance even when monitoring 10,000 file descriptors?

Answer: epoll uses two key mechanisms. First, when you register a file descriptor with epoll_ctl(), the kernel attaches a callback to that fd’s open file description. When an I/O event occurs, the kernel automatically adds the fd to a ready list. Second, epoll_wait() simply fetches from that ready list — no scan of all N fds happens. The time taken is proportional to the number of events that occurred, not the number of fds being monitored.

Q3: What is FD_SETSIZE and why is it a problem for select()?

Answer: FD_SETSIZE is the maximum number of file descriptors that can be stored in an fd_set bitmask used by select(). It defaults to 1024 on most Linux systems. Any file descriptor with a number equal to or greater than FD_SETSIZE cannot be monitored with select() at all. To monitor more fds, you have to recompile glibc with a larger value — which is impractical in production. epoll has no such hard limit.

Q4: What are the three epoll system calls and what does each do?

Answer:

  • epoll_create1(flags) — Creates an epoll instance and returns a file descriptor referring to it.
  • epoll_ctl(epfd, op, fd, event) — Adds, modifies, or removes a file descriptor from the epoll interest list. Operations are EPOLL_CTL_ADD, EPOLL_CTL_MOD, EPOLL_CTL_DEL.
  • epoll_wait(epfd, events, maxevents, timeout) — Waits for events on the registered fds. Returns only the fds that are currently ready, up to maxevents.
Q5: What is the difference in per-call data transfer between poll() and epoll_wait()?

Answer: With poll(), you pass an array of N pollfd structs to the kernel every call, and the kernel fills in the revents field for every fd and returns. With epoll_wait(), you pass no fd information at all — the kernel already has the interest list from the earlier epoll_ctl() calls. epoll_wait() only returns data for the fds that are actually ready (typically very few). This dramatically reduces the user-kernel data transfer overhead at large N.

Q6: In what scenario does epoll NOT have a significant advantage over poll()?

Answer: When N is small (say, fewer than 10–20 file descriptors) and all or most of them are frequently active, epoll’s overhead of maintaining the kernel data structure may actually be slightly higher than poll(). From the benchmark table, at N=10, epoll is only marginally faster than poll() (0.41s vs 0.61s). For small embedded systems or simple local sockets, poll() is perfectly fine and involves less complexity.

Next: Level-Triggered vs Edge-Triggered Notification

Now that you understand why epoll is fast, learn about the two notification modes epoll supports — and why choosing the wrong one can cause silent bugs in your server.

Next Tutorial → Back to Home

Leave a Reply

Your email address will not be published. Required fields are marked *