## The Syscall Tax That Epoll Can't Escape
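Epoll notifies your app when I/O is possible. You still must call read() or write() to actually move data. In the simplest loop that's two syscalls per I/O event (epoll_wait + read/write), plus a one-time epoll_ctl registration per file descriptor. A single epoll_wait can report many ready descriptors, but each one still needs its own read() or write(). Every syscall crosses the user/kernel boundary (a mode switch rather than a full context switch), and that fixed cost adds up when you handle thousands of connections.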
Io_uring flips the model: it notifies you when I/O is done. The kernel and your app share a memory region with two ring buffers (submission and completion). You post an operation to the submission queue, the kernel processes it, and writes the result to the completion queue. Instead of a syscall pair per I/O, you get one io_uring_enter() call per batch — or, with IORING_SETUP_SQPOLL, close to none during steady state.
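To make the arithmetic concrete, here is a rough back-of-the-envelope model in C. It assumes the simplest epoll loop (one `epoll_wait` per wakeup plus one `read` per ready descriptor) and an io_uring loop that submits and reaps a whole batch with a single `io_uring_enter()`. The function names are hypothetical, and the model ignores setup calls, partial reads, and SQPOLL.

```c
#include <stdio.h>

/* Hypothetical helper: syscalls an epoll loop needs to service `events`
 * readiness events when each epoll_wait() wakeup reports `batch` of them.
 * One epoll_wait per wakeup, plus one read()/write() per event. */
long epoll_syscalls(long events, long batch) {
    long wakeups = (events + batch - 1) / batch;   /* ceiling division */
    return wakeups + events;
}

/* Hypothetical helper: syscalls an io_uring loop needs for the same work
 * when each io_uring_enter() submits and reaps `batch` operations. */
long uring_syscalls(long events, long batch) {
    return (events + batch - 1) / batch;
}

/* Print both counts side by side for a given workload. */
void print_comparison(long events, long batch) {
    printf("events=%ld batch=%ld epoll=%ld io_uring=%ld\n",
           events, batch,
           epoll_syscalls(events, batch),
           uring_syscalls(events, batch));
}
```

With 1000 events and batches of 32, this model gives 1032 syscalls for the epoll loop and 32 for io_uring. Real workloads rarely batch this cleanly, so treat the numbers as an upper bound on the benefit.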
## Code Comparison: Epoll vs. Io_uring
### Epoll (readiness model)
```c
#include <sys/epoll.h>
#include <unistd.h>

int main(void) {
    int epoll_fd = epoll_create1(0);
    struct epoll_event ev = {.events = EPOLLIN, .data.fd = STDIN_FILENO};
    epoll_ctl(epoll_fd, EPOLL_CTL_ADD, STDIN_FILENO, &ev);  // one-time registration

    struct epoll_event events[1];
    epoll_wait(epoll_fd, events, 1, -1);    // syscall #1: wait until stdin is readable

    char buf[1024];
    read(STDIN_FILENO, buf, sizeof(buf));   // syscall #2: actually move the data
    return 0;
}
```
Four syscalls total: epoll_create1 and epoll_ctl (both one-time setup), then epoll_wait and read. After setup, each I/O operation costs two syscalls.
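Stdin makes the snippet above awkward to exercise, so here is a slightly fuller sketch of the same readiness pattern with error handling. The helper name `epoll_read_once` is hypothetical. It creates the epoll instance on every call to stay self-contained, whereas a real server would create it once and reuse it.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: wait until `fd` is readable, then read up to `len`
 * bytes into `buf`. Returns bytes read, 0 on timeout or EOF, or -errno. */
ssize_t epoll_read_once(int fd, char *buf, size_t len, int timeout_ms) {
    int epfd = epoll_create1(0);                        /* setup syscall */
    if (epfd < 0) return -errno;

    struct epoll_event ev = {.events = EPOLLIN, .data.fd = fd};
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {  /* one-time registration */
        int err = errno;
        close(epfd);
        return -err;
    }

    struct epoll_event ready[1];
    ssize_t got = 0;
    int n = epoll_wait(epfd, ready, 1, timeout_ms);     /* syscall #1: readiness */
    if (n < 0) {
        got = -errno;
    } else if (n == 1) {
        got = read(fd, buf, len);                       /* syscall #2: move data */
        if (got < 0) got = -errno;
    }
    close(epfd);
    return got;
}
```

Calling it on the read end of a pipe that already holds data returns the byte count. Calling it on an empty pipe returns 0 once the timeout expires.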
### Io_uring (completion model)
```c
#include <liburing.h>
#include <unistd.h>

int main(void) {
    char buf[1024];
    struct io_uring ring;
    io_uring_queue_init(32, &ring, 0);      // one-time ring setup

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, STDIN_FILENO, buf, sizeof(buf), 0);
    io_uring_submit(&ring);                 // one syscall for submission

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);         // enters the kernel only if no completion is ready yet
    // cqe->res contains bytes read (or a negative errno)
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```
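There is no per-descriptor registration step; the ring is set up once. `io_uring_submit()` and `io_uring_wait_cqe()` may each call `io_uring_enter()` once, so this single-read example is no cheaper than the epoll version. The savings come from batching: one call can submit many operations and reap many completions, and `io_uring_submit_and_wait()` folds submit and wait into a single `io_uring_enter()`. With SQPOLL, even those calls mostly disappear during steady state.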
## When Io_uring Shines
- **Zero-copy I/O**: Register buffers with `io_uring_register_buffers()` so the kernel pins and maps them once instead of on every operation. For network sends, use `IORING_OP_SEND_ZC` (kernel 6.0+) to skip copying the buffer into kernel space entirely.
- **Batch processing**: One `io_uring_enter()` can submit dozens of reads and collect their results, while an epoll loop still issues one read() or write() per ready descriptor on top of each epoll_wait.
- **Lower latency**: The completion model removes the wait-then-read round trip, so each operation crosses the user/kernel boundary fewer times, which can reduce latency under load.
## The SQPOLL Caveat
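`IORING_SETUP_SQPOLL` starts a kernel thread that polls the submission queue. While that thread is spinning it occupies a CPU core even when the queue is empty. After `sq_thread_idle` milliseconds without new submissions it goes to sleep and sets `IORING_SQ_NEED_WAKEUP`; the next submission then needs an `io_uring_enter()` call with `IORING_ENTER_SQ_WAKEUP` to wake it (liburing's `io_uring_submit()` handles this for you). It is not free, so use it only if you have sustained I/O.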
## Why You Should Care
Io_uring landed in Linux 5.1 (2019). If your servers run kernels newer than that, there's little reason to use epoll for new projects. The TinyGate rewrite showed a dramatic performance boost. It still does not beat nginx or haproxy, but the architectural advantages are clear. For from-scratch projects, io_uring is the way to go.
## What to Do Now
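1. Check your kernel version: `uname -r`. If >= 5.1, you can use io_uring, but many operations arrived later; for example `IORING_OP_READ`, which `io_uring_prep_read()` uses, needs 5.6 or newer.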
2. Install liburing (`liburing-dev` on Debian/Ubuntu, `liburing-devel` on Fedora).
3. Rewrite your I/O loop using the completion model. Start with `io_uring_queue_init()` and replace epoll_wait/read pairs with submission/completion batches.
4. For maximum performance, register buffers and use `IORING_OP_SEND_ZC` for network sends (kernel 6.0+).
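Step 1 can also be done from code. The sketch below parses the release string that `uname(2)` reports; both helper names are hypothetical. A version number is only a hint, because distributions backport features and administrators can disable io_uring. For a per-operation answer, liburing offers `io_uring_get_probe()` and `io_uring_opcode_supported()`.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Hypothetical helper: parse "major.minor" from a release string such as
 * "5.15.0-91-generic" and compare it with want_major.want_minor.
 * Returns 1 if new enough, 0 if older, -1 if the string cannot be parsed. */
int release_at_least(const char *release, int want_major, int want_minor) {
    int major = 0, minor = 0;
    if (sscanf(release, "%d.%d", &major, &minor) != 2) return -1;
    if (major != want_major) return major > want_major;
    return minor >= want_minor;
}

/* Same check against the running kernel, using uname(2). */
int running_kernel_at_least(int want_major, int want_minor) {
    struct utsname u;
    if (uname(&u) != 0) return -1;
    return release_at_least(u.release, want_major, want_minor);
}
```

For example, `running_kernel_at_least(5, 1)` covers basic io_uring, and `running_kernel_at_least(6, 0)` covers `IORING_OP_SEND_ZC`.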

