Knowing that edge-triggered epoll requires you to drain the buffer until EAGAIN is one thing. But what happens when one of your file descriptors has a massive or never-ending stream of data? If you drain it greedily, all your other connected clients starve — they never get a chance to be processed.
This tutorial covers the correct general framework for writing edge-triggered epoll servers, and the specific technique for preventing file descriptor starvation under heavy single-fd load.
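There is a well-defined pattern for building a server on edge-triggered epoll. These three steps are the baseline for correctness:
1. Make every fd nonblocking before adding it to epoll. Read the current flags with fcntl(fd, F_GETFL) and write them back with fcntl(fd, F_SETFL, flags | O_NONBLOCK), so that existing flags are preserved. This is mandatory: without it, the final read() of your drain loop blocks until more data arrives instead of returning EAGAIN.
2. Register each fd with EPOLLIN | EPOLLET (or EPOLLOUT | EPOLLET for write readiness). One EPOLL_CTL_ADD per fd is enough. Register again only if the fd is closed and replaced by a new one, or use EPOLL_CTL_MOD if the events you are interested in change.
3. Call epoll_wait() to get ready fds. For each ready fd, call read()/write()/accept() in a loop until it fails with EAGAIN or EWOULDBLOCK. That error means there is nothing more to do right now (no unread data, no free buffer space for a write, or no pending connections), so go back to epoll_wait().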
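The three steps above can be sketched as three small helpers. This is a minimal sketch, not a definitive implementation: the names make_nonblocking(), register_et(), and drain_fd() are hypothetical, the 4096-byte buffer is an arbitrary choice, and the data read is simply discarded where a real server would process it.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Step 1: add O_NONBLOCK while preserving the fd's existing status flags. */
int make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Step 2: register the fd once for edge-triggered read readiness. */
int register_et(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = fd };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/*
 * Step 3: read until EAGAIN/EWOULDBLOCK.
 * Returns the total number of bytes read, or -1 on a real error.
 * *eof is set to 1 if the peer closed the connection.
 */
long drain_fd(int fd, int *eof)
{
    char buf[4096];              /* arbitrary size for this sketch */
    long total = 0;

    assert(eof != NULL);
    *eof = 0;
    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));
        if (n > 0) {
            total += n;          /* a real server would process buf here */
            continue;
        }
        if (n == 0) {
            *eof = 1;
            return total;
        }
        if (errno == EINTR)
            continue;            /* interrupted by a signal: retry */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return total;        /* nothing left to read right now */
        return -1;
    }
}
```

Note that drain_fd() is exactly the greedy loop that the rest of this tutorial warns about: it is correct, but it gives one busy fd the power to monopolize the event loop.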
Here is a scenario that breaks naive ET implementations. Imagine your server has 3 clients connected:
Client A on fd=5 (endless stream), Client B on fd=6, and Client C on fd=7.
epoll_wait() returns fd=5 as ready → your code starts draining fd=5 → reads 512 bytes, 512 more, 512 more… fd=5 never returns EAGAIN because new data keeps arriving → fd=6 and fd=7 never get serviced → Clients B and C are starved
This problem is specific to edge-triggered notification. With LT mode, you can safely read just a bit from each fd and go back to epoll_wait(), because the kernel keeps reminding you about the remaining data. ET mode removes that safety net.
Note: starvation can also occur with signal-driven I/O for the same reason — it is also an edge-triggered mechanism.
The fix is to take control of scheduling yourself. Instead of fully draining each fd as soon as epoll_wait() reports it, you maintain your own application-level ready list and service each fd for a limited amount of work per round.
[Diagram: round-robin servicing, with one 512-byte read per ready fd per turn.]
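One way to express "a limited amount of work per round" is a per-turn byte budget. The sketch below is an illustration under assumptions: service_turn() and the turn_result values are hypothetical names, and the data is discarded rather than processed. The caller keeps the fd on its ready list when the result is TURN_MORE and removes it otherwise.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

enum turn_result {
    TURN_MORE,      /* budget spent; fd may still have data */
    TURN_DRAINED,   /* read() reported EAGAIN/EWOULDBLOCK */
    TURN_CLOSED     /* EOF or a real error; caller should close the fd */
};

/*
 * Read at most `budget` bytes from fd in one turn.
 * *consumed receives the number of bytes read during this turn.
 */
enum turn_result service_turn(int fd, size_t budget, size_t *consumed)
{
    char buf[512];

    assert(consumed != NULL && budget > 0);
    *consumed = 0;
    while (*consumed < budget) {
        size_t want = budget - *consumed;
        if (want > sizeof(buf))
            want = sizeof(buf);

        ssize_t n = read(fd, buf, want);
        if (n > 0) {
            *consumed += (size_t)n;   /* a real server would process buf here */
            continue;
        }
        if (n == 0)
            return TURN_CLOSED;
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return TURN_DRAINED;
        return TURN_CLOSED;
    }
    return TURN_MORE;
}
```

A budget in bytes rather than in read() calls keeps turns comparable even when individual reads return short counts.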
The anti-starvation loop structure gives you a natural place to do other things besides I/O. This is one of the reasons experienced developers prefer this pattern even when starvation is not an immediate concern:
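- Timers: check and fire expired timers on each loop iteration, without needing a separate thread or alarm signal.
- Signals: accept pending signals between I/O operations, for example by reading from a nonblocking signalfd or by polling with sigtimedwait() and a zero timeout. A plain sigwaitinfo() call blocks until a signal arrives, so it would stall the loop.
- Idle connections: check for idle or timed-out connections on each pass and close them before they accumulate.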
This complete example demonstrates the anti-starvation pattern. It maintains an application-level ready list and services each fd in round-robin fashion.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define MAX_EVENTS  64
#define MAX_CLIENTS 256
#define PORT        8080
#define READ_CHUNK  512   /* Read this many bytes per round-robin turn */

/*
 * Application-level ready list.
 * We track which fds have been reported ready by epoll
 * but may still have data to read.
 */
int ready_list[MAX_CLIENTS];
int ready_count = 0;

/*
 * Add fd to ready list if not already present.
 * Caveat: if the list is full, the fd is silently dropped. In ET mode it
 * will not be reported again until new data arrives, so a production
 * server should cap the number of accepted clients or grow the list.
 */
void ready_list_add(int fd)
{
    for (int i = 0; i < ready_count; i++)
        if (ready_list[i] == fd) return;
    if (ready_count < MAX_CLIENTS)
        ready_list[ready_count++] = fd;
}

/* Remove fd from ready list (the last entry takes its slot) */
void ready_list_remove(int fd)
{
    for (int i = 0; i < ready_count; i++) {
        if (ready_list[i] == fd) {
            ready_list[i] = ready_list[--ready_count];
            return;
        }
    }
}

/* Add O_NONBLOCK to the fd's existing status flags */
void set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

int main(void)
{
    int server_fd, epfd, nfds;
    struct epoll_event ev, events[MAX_EVENTS];

    epfd = epoll_create1(0);
    if (epfd == -1) {
        perror("epoll_create1");
        exit(EXIT_FAILURE);
    }

    /* Create and bind server socket */
    server_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (server_fd == -1) {
        perror("socket");
        exit(EXIT_FAILURE);
    }
    int opt = 1;
    setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));
    set_nonblocking(server_fd);

    struct sockaddr_in addr = {
        .sin_family = AF_INET,
        .sin_addr.s_addr = INADDR_ANY,
        .sin_port = htons(PORT)
    };
    if (bind(server_fd, (struct sockaddr *)&addr, sizeof(addr)) == -1) {
        perror("bind");
        exit(EXIT_FAILURE);
    }
    if (listen(server_fd, 128) == -1) {
        perror("listen");
        exit(EXIT_FAILURE);
    }

    /* Register server fd with EPOLLET (edge-triggered) */
    ev.events = EPOLLIN | EPOLLET;
    ev.data.fd = server_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, server_fd, &ev);
    printf("ET server listening on port %d\n", PORT);

    for (;;) {
        /*
         * If our application ready list already has fds waiting,
         * use timeout=0 so we quickly check for new fds and then
         * continue servicing the existing list.
         * If the list is empty, block until something becomes ready.
         */
        int timeout = (ready_count > 0) ? 0 : -1;
        nfds = epoll_wait(epfd, events, MAX_EVENTS, timeout);
        if (nfds == -1) {
            if (errno == EINTR)
                continue;   /* Interrupted by a signal: just retry */
            perror("epoll_wait");
            break;
        }

        /* Add newly ready fds to application ready list */
        for (int i = 0; i < nfds; i++) {
            int fd = events[i].data.fd;
            ready_list_add(fd);
        }

        /*
         * Service ready list in round-robin.
         * Do ONE limited read per fd per loop iteration.
         * This prevents any single fd from starving others.
         */
        int i = 0;
        while (i < ready_count) {
            int fd = ready_list[i];

            if (fd == server_fd) {
                /* Accept all pending connections */
                while (1) {
                    int client_fd = accept(server_fd, NULL, NULL);
                    if (client_fd == -1) {
                        if (errno == EAGAIN || errno == EWOULDBLOCK)
                            break;   /* No more pending connections */
                        perror("accept");
                        break;
                    }
                    set_nonblocking(client_fd);
                    ev.events = EPOLLIN | EPOLLET;
                    ev.data.fd = client_fd;
                    epoll_ctl(epfd, EPOLL_CTL_ADD, client_fd, &ev);
                    printf("New client: fd=%d\n", client_fd);
                }
                /* server_fd is fully drained: remove from ready list */
                ready_list_remove(fd);
                /* Don't advance i: the next fd is now at position i */
            } else {
                /* Client fd: do ONE limited read */
                char buf[READ_CHUNK];
                ssize_t n = read(fd, buf, sizeof(buf));

                if (n > 0) {
                    /* Process data */
                    printf("fd=%d: read %zd bytes\n", fd, n);
                    /*
                     * Do NOT remove from ready list yet.
                     * There may be more data; we'll try again next round.
                     * Move to next fd (round-robin).
                     */
                    i++;
                } else if (n == 0) {
                    /* Client disconnected */
                    printf("fd=%d: disconnected\n", fd);
                    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                    close(fd);
                    ready_list_remove(fd);
                    /* Don't advance i */
                } else {
                    if (errno == EAGAIN || errno == EWOULDBLOCK) {
                        /* Buffer fully drained: remove from ready list */
                        printf("fd=%d: fully drained (EAGAIN)\n", fd);
                        ready_list_remove(fd);
                        /* Don't advance i */
                    } else {
                        perror("read");
                        epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                        close(fd);
                        ready_list_remove(fd);
                        /* Don't advance i */
                    }
                }
            }
        }

        /* Place to add: timer checks, signal processing, etc. */
    }

    close(epfd);
    close(server_fd);
    return 0;
}
- epoll_wait() uses timeout=0 when ready_count > 0, so it doesn't block; it just picks up any new events and immediately continues servicing the existing list.
- Each client fd gets only READ_CHUNK bytes per loop turn, not a full drain.
- An fd is removed from the ready list only when EAGAIN is received (buffer truly empty) or on disconnect/error.
- Round-robin is achieved by i++ after a successful read, moving to the next fd rather than looping back to the same one.
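The ready list in the example scans linearly on every add and remove, which is fine for a handful of clients. As one possible alternative, the sketch below uses a FIFO ring with a per-fd flag so that both operations take constant time. The names ready_queue, rq_push(), and rq_pop() are hypothetical, and the design assumes every fd value is below RQ_CAP.

```c
#include <assert.h>
#include <stddef.h>

#define RQ_CAP 1024   /* assumption: all fds are in the range 0..RQ_CAP-1 */

struct ready_queue {
    int           ring[RQ_CAP];    /* fds waiting for a turn, in FIFO order */
    unsigned char queued[RQ_CAP];  /* queued[fd] is 1 while fd is in ring */
    unsigned      head;            /* index of the oldest entry */
    unsigned      count;           /* number of entries in ring */
};

/* Returns 1 if fd was added, 0 if it was already queued, -1 if out of range. */
int rq_push(struct ready_queue *q, int fd)
{
    assert(q != NULL);
    if (fd < 0 || fd >= RQ_CAP)
        return -1;
    if (q->queued[fd])
        return 0;
    /* Each fd appears at most once, so the ring cannot overflow. */
    assert(q->count < RQ_CAP);
    q->ring[(q->head + q->count) % RQ_CAP] = fd;
    q->count++;
    q->queued[fd] = 1;
    return 1;
}

/* Returns the fd that has waited longest, or -1 if the queue is empty. */
int rq_pop(struct ready_queue *q)
{
    assert(q != NULL);
    if (q->count == 0)
        return -1;
    int fd = q->ring[q->head];
    q->head = (q->head + 1) % RQ_CAP;
    q->count--;
    q->queued[fd] = 0;
    return fd;
}
```

With this structure the service loop pops one fd, gives it a single bounded read, and pushes it back only if the read returned data. An fd that hits EAGAIN, EOF, or an error is simply not pushed again.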
With level-triggered mode you do not need the anti-starvation loop. Here is why:
LT mode: You read some data from fd=5 and go back to epoll_wait(). epoll reports fd=5, fd=6, and fd=7 as ready (if they all have data), so you naturally cycle through them. Partial reads on fd=5 are safe because the kernel keeps reporting it.
ET mode: If you naively drain fd=5 until EAGAIN on every notification, other fds never get a turn while fd=5 keeps receiving data. You must implement round-robin yourself. The kernel does not help you here; that is the trade-off for ET's efficiency.
This is why many event loops (libevent and Redis, for example) use LT mode by default, and why code that opts into ET, as nginx does for its connection sockets, needs careful drain loops and starvation handling in place.
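The difference between the two modes can be observed with a small experiment on a pipe. This is a sketch for illustration only: events_after_partial_read() is a hypothetical helper that registers the read end with or without EPOLLET, reads half of the available data, and reports how many events a second epoll_wait() returns.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * extra_flags: 0 for level-triggered, EPOLLET for edge-triggered.
 * Returns the number of events reported after a partial read
 * (4 of 8 bytes consumed), or -1 if the setup failed.
 */
int events_after_partial_read(uint32_t extra_flags)
{
    int p[2], result = -1;
    char buf[4];
    struct epoll_event out;

    assert(extra_flags == 0 || extra_flags == EPOLLET);
    if (pipe(p) == -1)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN | extra_flags, .data.fd = p[0] };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "12345678", 8) == 8 &&
        epoll_wait(epfd, &out, 1, 0) == 1 &&     /* first notification */
        read(p[0], buf, sizeof(buf)) == 4)       /* leave 4 bytes unread */
        result = epoll_wait(epfd, &out, 1, 0);   /* is the fd reported again? */

    close(epfd);
    close(p[0]);
    close(p[1]);
    return result;
}
```

Called with 0, the second epoll_wait() reports the fd again because data is still buffered. Called with EPOLLET, it reports nothing, so the remaining 4 bytes stay unnoticed unless the application remembers that the fd is still ready.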
Answer: First, make all file descriptors nonblocking by adding O_NONBLOCK with fcntl() (fetch the current flags with F_GETFL, then set flags | O_NONBLOCK with F_SETFL). Second, register fds with epoll_ctl() using the EPOLLET flag. Third, in the event loop, for each fd returned by epoll_wait(), call read()/write()/accept() in a loop until you receive EAGAIN or EWOULDBLOCK before returning to epoll_wait().
Answer: FD starvation occurs when one file descriptor (e.g., a client sending a large stream) keeps the server busy draining its buffer, while other file descriptors that also have data waiting never get serviced. Because ET mode requires full draining until EAGAIN, if one fd never returns EAGAIN, the server loops on it indefinitely.
Answer: Maintain an application-level ready list. When epoll_wait() reports ready fds, add them to your list. Then service each fd in the list for a limited amount of I/O per turn (e.g., one fixed-size read), cycling through all fds in round-robin order. Remove an fd from the ready list only when it returns EAGAIN, or when you close it after EOF or an error. Also, use timeout=0 in epoll_wait() when the ready list is non-empty, so new events are picked up without blocking.
Answer: If we set timeout to -1 (block indefinitely) while the ready list still holds fds that haven't been fully drained, epoll_wait() could sleep with unread data sitting in their buffers, because in ET mode no new event is generated for data that has already been reported. With timeout=0, epoll_wait() returns immediately with any newly ready fds, and we then continue servicing our existing ready list. This keeps the loop fair without blocking it unnecessarily.
Answer: The loop structure is a natural place to integrate timer expiry checks (fire callbacks for timers that have elapsed), signal handling via a nonblocking signalfd or a zero-timeout sigtimedwait() (a plain sigwaitinfo() would block the loop), connection timeout detection for idle clients, and other periodic maintenance tasks. This avoids the need for separate threads or signal handlers for these operations.
Answer: Not in the same way. With LT mode, you can read a limited amount from fd=5, return to epoll_wait(), and the kernel will report fd=5 again alongside fd=6 and fd=7 if they all have data. This natural reset at each epoll_wait() call gives other fds a fair chance. You can use blocking file descriptors and partial reads safely. Starvation from a single fd is not a concern because you are free to interleave calls to epoll_wait() with partial reads.
Answer: EAGAIN (same as EWOULDBLOCK on Linux) means the kernel receive buffer for that fd is currently empty — there is no more data to read right now. This is the correct exit condition for your drain loop. You should remove the fd from your application ready list and return to epoll_wait(). The next notification for that fd will come when the remote side sends more data, at which point the kernel generates a new edge event.
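The drain-then-new-edge sequence described in this answer can be checked with a short experiment. This is an illustrative sketch: edge_rearms_after_drain() is a hypothetical helper, and a pipe stands in for the client socket, with the write end playing the role of the remote side.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Returns 1 if the expected sequence is observed:
 * drain to EAGAIN -> no event pending -> new data -> one new event.
 */
int edge_rearms_after_drain(void)
{
    int p[2];
    char buf[64];
    struct epoll_event out;

    /* Setup: nonblocking pipe read end, registered edge-triggered. */
    int rc = pipe(p);
    assert(rc == 0);
    int epfd = epoll_create1(0);
    assert(epfd != -1);
    int flags = fcntl(p[0], F_GETFL, 0);
    rc = fcntl(p[0], F_SETFL, flags | O_NONBLOCK);
    assert(rc == 0);
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = p[0] };
    rc = epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev);
    assert(rc == 0);

    /* First edge: data arrives and is reported. */
    int wrote1 = write(p[1], "first", 5) == 5;
    int first = epoll_wait(epfd, &out, 1, 0) == 1;

    /* Drain until read() stops returning data. */
    ssize_t n;
    while ((n = read(p[0], buf, sizeof(buf))) > 0)
        ;
    int drained = (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));

    int quiet = epoll_wait(epfd, &out, 1, 0) == 0;   /* nothing pending now */
    int wrote2 = write(p[1], "second", 6) == 6;      /* "remote side" sends more */
    int again = epoll_wait(epfd, &out, 1, 0) == 1;   /* a new edge is reported */

    close(epfd);
    close(p[0]);
    close(p[1]);
    return wrote1 && first && drained && quiet && wrote2 && again;
}
```

Checking both EAGAIN and EWOULDBLOCK costs nothing on Linux, where the two values are equal, and keeps the code portable to systems where they differ.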
