The epoll API — Introduction & Core Concepts

Chapter 63 – Alternative I/O Models | Part 4 of 5
⚡ Topic: epoll
🔑 Key: Interest List, Ready List
🎯 Level: Intermediate

What is epoll and Why Does It Exist?

Imagine you are writing a web server that handles 10,000 simultaneous connections. With select() or poll(), every time you want to check for activity, you must pass the entire list of 10,000 file descriptors to the kernel. The kernel scans all of them, even if only 3 are actually active. This is O(n) — it gets slower as connections grow.

epoll (event poll) solves this. You register file descriptors with the kernel once. The kernel maintains its own list and only tells you about the ones that are actually ready. Checking for events is now O(ready) instead of O(total). That is why epoll is the backbone of high-performance servers like nginx.

epoll is Linux-specific and was introduced in Linux 2.6.

Keywords in this tutorial:

epoll, epoll instance, Interest List, Ready List, epoll_create(), epoll_create1(), epoll_ctl(), epoll_wait(), Level-Triggered, Edge-Triggered, EPOLL_CLOEXEC

🐢 Why select() and poll() Don’t Scale

📊 Scaling Behaviour: select/poll vs epoll
😩 select() / poll()
  • Pass full FD list on every call
  • Kernel scans ALL FDs every time
  • Copy FD list user→kernel each call
  • O(n) — slows with more connections
  • select() limited to 1024 FDs (FD_SETSIZE)
🚀 epoll
  • Register FDs once — kernel remembers
  • Kernel only returns READY FDs
  • No repeated FD list copying
  • O(ready events) — stays fast
  • No FD_SETSIZE-style cap (still bounded by the process FD limit and /proc/sys/fs/epoll/max_user_watches)

The root cause of select/poll inefficiency is that they are stateless — the kernel does not remember what you were watching between calls. Every call is brand new. epoll is stateful — you register once and the kernel tracks everything persistently.
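To make the stateless cost concrete, here is a minimal sketch of a poll()-based readiness check. The names count_ready_with_poll and demo_poll_scan are made up for this illustration. Note that the whole array goes into every call, and the caller still has to walk every entry afterwards.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: returns how many of the n descriptors are readable
 * right now, or -1 on error. */
int count_ready_with_poll(struct pollfd *fds, nfds_t n)
{
    assert(fds != NULL);

    /* The whole array crosses into the kernel on every call.
     * Timeout 0 means "report the current state and return". */
    if (poll(fds, n, 0) == -1)
        return -1;

    /* poll() only says how many are ready; finding them is another
     * O(n) walk over every entry, idle or not. */
    int ready = 0;
    for (nfds_t i = 0; i < n; i++) {
        if (fds[i].revents & POLLIN)
            ready++;
    }
    return ready;
}

/* Hypothetical demo: two pipes, data on only one of them. */
int demo_poll_scan(void)
{
    int a[2], b[2];
    if (pipe(a) == -1 || pipe(b) == -1)
        return -1;
    if (write(b[1], "x", 1) != 1)
        return -1;

    struct pollfd fds[2] = {
        { .fd = a[0], .events = POLLIN },
        { .fd = b[0], .events = POLLIN },
    };
    int ready = count_ready_with_poll(fds, 2);

    close(a[0]); close(a[1]); close(b[0]); close(b[1]);
    return ready;   /* 1 when only the second pipe has data */
}
```

With two descriptors the scan is trivial. With 10,000 mostly idle connections, the same loop runs 10,000 times per call to find a handful of ready entries.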

🏗️ The epoll Model — Instance, Interest List, Ready List

The central object in epoll is the epoll instance. It is a kernel data structure referenced by a normal file descriptor. Inside the instance, the kernel maintains two lists:

📊 epoll Internal Architecture

📋 Interest List
fd=3 EPOLLIN
fd=7 EPOLLIN | EPOLLOUT
fd=12 EPOLLIN
fd=15 EPOLLOUT
All monitored FDs
(added via epoll_ctl)

✅ Ready List
fd=7 EPOLLIN
fd=15 EPOLLOUT
Only FDs with events now
(returned by epoll_wait)
Ready list is always a subset of the interest list

You call epoll_wait() and it returns only the entries in the ready list. You never scan the full interest list — the kernel does all the tracking internally.
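The two lists can be observed directly. The sketch below is an illustration only (add_reader and demo_ready_list are hypothetical names, and pipes stand in for sockets): it puts two descriptors on the interest list, makes one of them readable, and checks that epoll_wait() hands back just that one.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: put fd on the interest list for read events. */
static int add_reader(int epfd, int fd)
{
    assert(epfd >= 0 && fd >= 0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Hypothetical demo: returns 1 if epoll_wait() reported exactly the one
 * descriptor that has data, 0 if it reported anything else, -1 on setup
 * failure (error paths skip cleanup for brevity). */
int demo_ready_list(void)
{
    int idle[2], busy[2];
    if (pipe(idle) == -1 || pipe(busy) == -1)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd == -1)
        return -1;

    /* Interest list: both read ends. */
    if (add_reader(epfd, idle[0]) == -1 || add_reader(epfd, busy[0]) == -1)
        return -1;

    /* Only one pipe receives data. */
    if (write(busy[1], "x", 1) != 1)
        return -1;

    /* Ready list: the array has room for two entries, but the kernel
     * fills in only the descriptor that is actually readable. */
    struct epoll_event out[2];
    int n = epoll_wait(epfd, out, 2, 0);
    int ok = (n == 1 && out[0].data.fd == busy[0]);

    close(epfd);
    close(idle[0]); close(idle[1]); close(busy[0]); close(busy[1]);
    return ok;
}
```

The data.fd field is filled in by the caller at registration time and returned untouched, which is how the program learns which descriptor an event belongs to.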

🔧 The Three epoll System Calls

The entire epoll API is just three system calls. They are all you need to monitor hundreds of thousands of file descriptors:

epoll_create() / epoll_create1()
Creates a new epoll instance. Returns a file descriptor that acts as your handle. Think of it like “open a new I/O monitoring channel.”
epoll_ctl()
Add, remove, or modify file descriptors in the interest list. You call this when connections open or close. The operation is selected with EPOLL_CTL_ADD, EPOLL_CTL_MOD, or EPOLL_CTL_DEL.
epoll_wait()
Block until one or more FDs become ready, then return only those ready FDs in an array. Your main event loop calls this repeatedly.
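As a rough sketch of how the three calls fit together, the example below creates an instance, then adds, modifies, and removes one descriptor, checking after each change what epoll_wait() reports. The names set_interest and demo_three_calls are hypothetical, and a Unix-domain socket pair is assumed as a stand-in for a real connection.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical wrapper: op is EPOLL_CTL_ADD, EPOLL_CTL_MOD or EPOLL_CTL_DEL. */
static int set_interest(int epfd, int op, int fd, uint32_t events)
{
    assert(op == EPOLL_CTL_ADD || op == EPOLL_CTL_MOD || op == EPOLL_CTL_DEL);
    /* The event argument is ignored for EPOLL_CTL_DEL. */
    struct epoll_event ev = { .events = events, .data.fd = fd };
    return epoll_ctl(epfd, op, fd, &ev);
}

/* Hypothetical demo: counts how many of the five checks below pass,
 * or returns -1 on setup failure. */
int demo_three_calls(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    /* epoll_create1(): EPOLL_CLOEXEC closes the handle across exec(). */
    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd == -1)
        return -1;

    struct epoll_event out;
    int ok = 0;

    /* ADD: watch for incoming data. None has been sent, so nothing is ready. */
    ok += (set_interest(epfd, EPOLL_CTL_ADD, sv[0], EPOLLIN) == 0);
    ok += (epoll_wait(epfd, &out, 1, 0) == 0);

    /* MOD: also ask about writability. A fresh socket has buffer space,
     * so it is reported at once with EPOLLOUT set. */
    ok += (set_interest(epfd, EPOLL_CTL_MOD, sv[0], EPOLLIN | EPOLLOUT) == 0);
    ok += (epoll_wait(epfd, &out, 1, 0) == 1 && (out.events & EPOLLOUT));

    /* DEL: stop watching. */
    ok += (set_interest(epfd, EPOLL_CTL_DEL, sv[0], 0) == 0);

    close(epfd); close(sv[0]); close(sv[1]);
    return ok;   /* 5 when every step behaves as described */
}
```

A timeout of 0 is used so that epoll_wait() only reports the current state. A server's main loop would normally pass -1 to block until something happens.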

📊 Typical epoll Usage Flow
1️⃣ Create
epoll_create1()
2️⃣ Register FDs
epoll_ctl(ADD)
3️⃣ Wait for events
epoll_wait()
4️⃣ Process events
handle ready FDs
loop back to 3
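The four steps above can be sketched as follows. This is a minimal illustration, not a complete server: run_once, count_bytes, demo_event_loop, and the MAX_EVENTS value are all assumptions made for the example, and a pipe plays the role of a client connection.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 64   /* illustrative batch size, not from the text */

/* Hypothetical handler type: called once per ready descriptor. */
typedef void (*ready_handler)(int fd, uint32_t events, void *ctx);

/* Steps 3 and 4: one pass of wait-then-process. Returns the number of
 * events handled, or -1 on error. A server would call this forever. */
int run_once(int epfd, int timeout_ms, ready_handler handle, void *ctx)
{
    assert(handle != NULL);
    struct epoll_event events[MAX_EVENTS];

    int n = epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);
    if (n == -1)
        return -1;   /* a real loop would retry when errno == EINTR */

    for (int i = 0; i < n; i++)   /* only ready entries, never all FDs */
        handle(events[i].data.fd, events[i].events, ctx);
    return n;
}

/* Hypothetical handler for the demo: read one chunk and add up the bytes. */
static void count_bytes(int fd, uint32_t events, void *ctx)
{
    char buf[256];
    if (events & EPOLLIN) {
        ssize_t got = read(fd, buf, sizeof buf);
        if (got > 0)
            *(long *)ctx += got;
    }
}

/* Hypothetical demo covering all four steps on a single pipe. */
long demo_event_loop(void)
{
    int p[2];
    long total = 0;
    if (pipe(p) == -1)
        return -1;

    int epfd = epoll_create1(0);                            /* 1. create   */
    if (epfd == -1)
        return -1;
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == -1)    /* 2. register */
        return -1;

    if (write(p[1], "hello", 5) != 5)
        return -1;
    while (run_once(epfd, 0, count_bytes, &total) > 0)      /* 3 and 4     */
        ;

    close(epfd); close(p[0]); close(p[1]);
    return total;   /* 5: the bytes written above */
}
```

The demo loop stops as soon as a pass finds nothing ready. A long-running server would instead pass a timeout of -1 and keep looping.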

⚡ Level-Triggered vs Edge-Triggered Notification

This is one of the most important and most asked-about features of epoll. Understanding the difference is critical for correctness.

📶 Level-Triggered (LT) — Default

epoll notifies you as long as there is data to read (or space to write). Even if you don’t read it this time, the next call to epoll_wait() will report the same fd again.

Think: a water tank with a float switch. As long as water is in the tank (condition is true), the switch stays ON.

✅ Safe — easier to program
✅ Compatible with select/poll logic

⚡ Edge-Triggered (ET) — EPOLLET flag

epoll notifies you only once when the state changes (e.g., data arrives). If you don’t read all the data, you will NOT be notified again until more new data arrives.

Think: a doorbell. It rings when someone arrives. If you ignore it, it does not ring again for the same person.

✅ More efficient for high throughput
⚠️ Must read ALL data before returning to epoll_wait

📊 LT vs ET — When You Get Notified
Situation | Level-Triggered | Edge-Triggered
100 bytes arrive on socket | ✅ Notified | ✅ Notified
You read 50 bytes, 50 remain | ✅ Notified again next wait | ❌ NOT notified again
50 more new bytes arrive | ✅ Notified | ✅ Notified (new edge)
You read all data, buffer empty | ❌ Not notified (nothing there) | ❌ Not notified
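The first two rows of the table can be reproduced with a short experiment. The sketch below is an illustration under stated assumptions: the function name events_after_partial_read is made up, and a pipe is used instead of a socket. Pass 0 for level-triggered or EPOLLET for edge-triggered.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical experiment: write 100 bytes, take the first notification,
 * read only 50, then ask again. Returns how many events the second
 * epoll_wait() reports, or -1 on setup failure. */
int events_after_partial_read(uint32_t mode)
{
    assert(mode == 0 || mode == EPOLLET);

    int p[2];
    char buf[100];
    memset(buf, 'x', sizeof buf);
    if (pipe(p) == -1)
        return -1;

    int epfd = epoll_create1(0);
    if (epfd == -1)
        return -1;
    struct epoll_event ev = { .events = EPOLLIN | mode, .data.fd = p[0] };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == -1)
        return -1;

    if (write(p[1], buf, 100) != 100)          /* 100 bytes arrive */
        return -1;

    struct epoll_event out;
    if (epoll_wait(epfd, &out, 1, 0) != 1)     /* both modes notify here */
        return -1;
    if (read(p[0], buf, 50) != 50)             /* 50 bytes remain */
        return -1;

    int again = epoll_wait(epfd, &out, 1, 0);  /* timeout 0: do not block */

    close(epfd); close(p[0]); close(p[1]);
    return again;
}
```

Called with 0, the second wait reports the descriptor again because data is still buffered. Called with EPOLLET, it reports nothing, even though 50 bytes are still waiting to be read.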

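⚠️ When using edge-triggered mode (EPOLLET), put the file descriptor in non-blocking mode (O_NONBLOCK) and loop on read() until it fails with errno set to EAGAIN. Otherwise the remaining data sits silently in the buffer and you are not notified about it again until new data arrives. On a blocking descriptor, the final read() would block instead of returning EAGAIN.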
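A minimal sketch of that read loop is shown below. The helper names make_nonblocking, drain_fd, and demo_drain are hypothetical, and the tiny buffer is chosen only so that the loop visibly runs more than once.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: switch fd to non-blocking mode. Without this the
 * last read() in the drain loop would block instead of failing with EAGAIN. */
int make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: read until the kernel buffer is empty. Returns the
 * number of bytes consumed, or -1 on a real error. */
long drain_fd(int fd)
{
    assert(fd >= 0);
    char buf[32];   /* deliberately small so the loop runs several times */
    long total = 0;

    for (;;) {
        ssize_t got = read(fd, buf, sizeof buf);
        if (got > 0) {
            total += got;    /* a real handler would process buf here */
        } else if (got == 0) {
            return total;    /* end of file: the peer closed its end */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return total;    /* drained: safe to go back to epoll_wait() */
        } else if (errno != EINTR) {
            return -1;       /* EINTR just retries the read */
        }
    }
}

/* Hypothetical demo: 100 bytes go in, one drain call takes them all out. */
long demo_drain(void)
{
    int p[2];
    char data[100];
    memset(data, 'x', sizeof data);
    if (pipe(p) == -1 || make_nonblocking(p[0]) == -1)
        return -1;
    if (write(p[1], data, sizeof data) != (ssize_t)sizeof data)
        return -1;

    long total = drain_fd(p[0]);
    close(p[0]); close(p[1]);
    return total;   /* 100 */
}
```

In an edge-triggered server, a handler along these lines would run once per EPOLLIN notification, so that no bytes are left behind when control returns to epoll_wait().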

select() and poll() are level-triggered only. Signal-driven I/O is edge-triggered only. epoll supports both, which is one of its key advantages.

📊 epoll vs select vs poll vs Signal-Driven I/O
Feature | select() | poll() | Signal-Driven I/O | epoll
Scales with many FDs | ❌ O(n) | ❌ O(n) | ✅ Good | ✅ Best
FD count limit | 1024 (FD_SETSIZE) | No limit | No limit | No limit
Level-triggered | ✅ | ✅ | ❌ | ✅
Edge-triggered | ❌ | ❌ | ✅ | ✅
Identifies ready FD | Must scan all | Must scan all | Yes (si_fd) | Yes (direct)
Complexity | Simple | Simple | Complex (signals) | Moderate
Portability | POSIX | POSIX | Linux (F_SETSIG) | Linux only

🎯 Interview Questions
Q1. Why does epoll scale better than select() or poll() when monitoring many file descriptors?
select() and poll() require passing the complete list of monitored FDs on every call, and the kernel scans all of them — O(n) per call. epoll is stateful: you register FDs once with epoll_ctl() and the kernel maintains the list internally. epoll_wait() returns only the FDs that are ready — O(number of ready events), not O(total FDs). As the number of idle connections grows, epoll stays fast while select/poll slows down.
Q2. What is the interest list and the ready list in epoll?
The interest list is the set of all file descriptors that a process has registered with an epoll instance via epoll_ctl(). The ready list is the subset of those FDs that currently have events (data available, writable, etc.). epoll_wait() returns entries from the ready list. The ready list is always a subset of the interest list.
Q3. Explain the difference between level-triggered and edge-triggered notification in epoll.
Level-triggered (default): epoll_wait() keeps reporting a fd as ready as long as the condition persists (e.g., data remains in the buffer). Safe and simple. Edge-triggered (EPOLLET flag): epoll_wait() notifies only when the state changes — e.g., when new data arrives. If you don’t consume all the data, you will not be notified again until more new data arrives. ET requires reading until EAGAIN to avoid missing data.
Q4. Which I/O models support level-triggered and edge-triggered modes?
select() and poll() support level-triggered only. Signal-driven I/O is edge-triggered only. epoll supports both, and of the four models compared here it is the only one that does.
Q5. What are the three system calls in the epoll API and what does each do?
epoll_create() / epoll_create1() — creates an epoll instance and returns a file descriptor handle. epoll_ctl() — adds, removes, or modifies file descriptors in the interest list. epoll_wait() — blocks until events occur, then returns the ready list entries.
Q6. Why must you read until EAGAIN when using edge-triggered epoll?
In edge-triggered mode, epoll only fires once per state transition (e.g., when new data arrives). If 100 bytes are available and you read only 50, the remaining 50 bytes sit in the buffer silently, and epoll will NOT notify you again until more new data arrives. You must put the descriptor in non-blocking mode and read in a loop until read() returns -1 with errno == EAGAIN to ensure all available data is consumed.

Next: epoll_create, epoll_ctl — System Calls Deep Dive →

Full API signatures, EPOLL_CTL_ADD / MOD / DEL, epoll_event structure, and complete server example.
