Implementing QUIC from Scratch with Rust: Runtime

on 2024-12-20

Choosing an Appropriate Runtime for Feather-QUIC

Runtime Choice for Feather-QUIC

What Does a QUIC Protocol Stack Need at Its Core?

Before writing a QUIC stack, I first need to be clear about what it must provide: connection-oriented semantics, reliability, and streams, much like TCP. From the diagram below, we can see that QUIC is built on UDP. At the bottom, QUIC only gets UDP datagram reads and writes. Without that foundation, it has nothing to stand on.

QUIC running over UDP while implementing reliability in user space

QUIC also needs a clock. This is easy to understand if we look at TCP: retransmission timers for ACKs, keepalive timers if keepalive is enabled, and the famous TIME-WAIT timer during connection teardown. QUIC is doing many of the same jobs, just without TCP's historical baggage. So a QUIC implementation also needs timestamps and timers to drive its core behavior.

Another point worth calling out: QUIC runs in user space, while TCP lives in the kernel. That is one reason people like UDP-based transports. User space makes iteration much easier; you are not waiting on kernel upgrades every time the protocol needs to evolve.

What Exactly Is a Runtime?

Maybe I am just late to the party, but I feel like people have been saying "runtime" much more often in recent years. I like the word. It makes code sound as if it has a life of its own. In my understanding, a runtime is usually the execution environment of a language or virtual machine. In network programming, I use it more narrowly: the engine that provides the event loop, I/O driver, timers, and sometimes task scheduling. In that sense, C libraries like libevent are runtimes, and systems such as NGINX and Redis also carry their own runtimes because they implement their own low-level networking loops.

When I first learned network programming, I studied the I/O models from UNIX Network Programming: blocking I/O, non-blocking I/O, and I/O multiplexing. On Linux we now also have true asynchronous I/O through io_uring, which can be useful for networking too. A runtime usually gives developers two broad ways to work with these models.

The first is asynchronous event-driven models like Reactor and Proactor. These are typically implemented using I/O multiplexing (e.g., epoll) combined with non-blocking I/O, as in the classic libevent. Developers focus on read and write events, handling data from non-blocking sockets as needed. While asynchronous I/O can also enable event-driven models, Proactor introduces a slight variation. It typically requires developers to prepare user-space buffers for read/write operations in advance. When the event fires, the I/O has already completed, so the application only needs to handle its own follow-up logic. Although this model is classic, it has a significant drawback: managing callbacks can become a cognitive burden when developing complex systems.

The second style is what modern runtimes usually try to provide: hide the event-driven model behind coroutines, so developers can write code that looks synchronous while still getting the performance benefits of event-driven I/O. OpenResty does this with Lua's stackful coroutines on top of NGINX's event loop. In Rust, tokio provides a similar experience with stackless coroutines and the mio event-driven library underneath.

Feather-QUIC's Choice

In the Rust ecosystem, we have two main options for implementing our QUIC protocol stack: mio and tokio. Using tokio would reduce cognitive load, minimizing the use of asynchronous callbacks and simplifying state machine management.

I still chose not to use tokio. To me, the QUIC stack is already part of the runtime. If I bring in tokio, I do not want to only use its UDP socket API and then expose a pile of callbacks to users. I would want people using feather-quic to benefit from tokio properly, and that means a deeper integration. At this stage, I would rather spend my energy on the QUIC stack itself. After the basic stack works, integrating it with tokio can become a separate project, and probably a good excuse to learn tokio more deeply.

So for now, I chose mio. Since I only plan to build a QUIC client first, the problem is smaller. In this implementation I do not even use asynchronous callbacks. I simply put the related logic right where mio reports read/write events (related code change).

Interesting Details in the Code

UDP Socket

The creation of a UDP socket itself involves various considerations. Many developers use sendto and recvfrom to send/receive datagrams, specifying the target address in the parameters. Personally, I prefer using bind and connect system calls to set up the four-tuple in the UDP socket's kernel structure. This allows us to use recv and send directly, with the Linux kernel matching the four-tuple for precise socket identification, simplifying the code.

This approach has traps. If the network changes, for example from Wi-Fi to Ethernet, the client has to throw away the old bind + connect socket and create a new one. I will come back to this when I work on QUIC Migration. On the server side, creating a new UDP socket for every QUIC request used to risk degrading the Linux UDP hash table under high concurrency. A recent Linux optimization upgraded that table from two-tuple lookup to four-tuple lookup (code change), so this problem is less scary now. Server-side UDP hot upgrades are still hard, though. The kernel does not naturally know whether a packet should go to the old process or the new one, because UDP has no connection concept. Projects such as NGINX solve this with eBPF helpers. I will dig into these details later when I write about QUIC connection migration.

Another interesting aspect is this: when programming with TCP, we often know that if we write extensively to a non-blocking TCP socket under poor network conditions, the TCP stack's send queue can easily fill up. This results in a EAGAIN error code when attempting to write. However, with UDP, there theoretically shouldn't be a send queue, and data written to a UDP socket should be immediately handed over to the IP layer for processing. So, the question arises: should we handle the EAGAIN error that might occur when sending data through a UDP socket?

Yes, we do. Although UDP is designed without a send queue, the Linux kernel's socket structure (sk) still has a sk_write_queue. If that queue is exhausted, sock_alloc_send_pskb may fail to allocate a send skb, and the send path can return EAGAIN. The upper bound of sk_write_queue is tied to the familiar SO_SNDBUF value. In an extreme test, you can set SO_SNDBUF on a non-blocking UDP socket to something tiny, say 4096 bytes, and then send a UDP datagram larger than that. You will see EAGAIN.

Implementation of Timers

mio lets us register UDP read and write events, but it does not give us the timer event we need. So we have to implement a small timer ourselves. That is not too hard here, because we only need one timer. Unlike a general-purpose networking library, we do not need a complex data structure for managing many timers.

For instance, when implementing timers with Linux's epoll, networking libraries typically use the timeout parameter provided by epoll_wait. However, this parameter has a limitation: it only supports millisecond-level precision. For protocol stacks, we often prefer timers with higher precision.

libevent already points to the usual answers. On newer Linux kernels, epoll_pwait2 gives nanosecond-level timeout precision. If the kernel is too old, timerfd is the more common fallback. Set a high-precision timeout, register the timerfd with epoll, and you get a more accurate timer.

Since mio does not support timer implementation, I prefer using the timerfd approach. In the Rust ecosystem, I quickly found a crate that provides timer support for mio based on timerfd: mio-timerfd. However, I encountered an annoying compilation error, as shown below:

Rust compiler error showing mio-timerfd missing the Source trait

Rust's compiler errors are usually friendly, which is one reason I like Rust. This one looked clear at first: the compiler said mio-timerfd did not implement the Source trait defined in mio. But when I opened the mio-timerfd source, the Source implementation was right there.

I was stuck for a bit. As a Rust beginner, I fell back to my old C/C++ debugging habit: inspect the Rust target directory directly and use nm to compare compiled symbols. Rust has its own mangling rules, but the symbols were still useful. For example, a symbol containing mio..event..source..Source$GT$10deregister points to the deregister function from the Source trait.

nm output comparing Source trait symbols across mio versions

When I saw two mio rlibs in the output, I suspected that the issue was likely caused by inconsistent versions of mio dependencies. The nm output further confirmed this suspicion. As shown above, the symbols for the Source trait implementation in sourcefd differ between the two mio versions. A quick Google search revealed that the cargo tree command could help me quickly identify the dependency issue with mio. The solution was straightforward: fork mio-timerfd and adapt it to be compatible with the latest version of mio.

This part also reminded me that Cargo handles multiple versions of the same crate pretty well. If a project pulls in two versions of the same crate, Cargo keeps their symbols separate. If something does not match, the compiler usually tells you early. Traditional C/C++ projects are much easier to mess up here: different library versions may expose the same symbols, and the linker decides what wins based on resolution order. That is fragile. Rust's more explicit style helps catch this kind of problem earlier. I'll leave myself a TODO here: spend some time learning how the Rust build pipeline works, not just the Cargo layer.

Exploring io-uring

At the beginning of this article, I mentioned asynchronous I/O. On Linux, that naturally means io-uring. I had heard about it for a long time but never had a good chance to try it, so I used this project as an excuse to build the smallest asynchronous callback-based networking library I could, following the Proactor model.

After briefly consulting some resources, I found that io-uring is an excellent solution for file I/O. However, it seems to spark some debate when it comes to network I/O. Fundamentally, the main advantage of io-uring over epoll is the reduction in the number of system calls. Frequent system calls lead to context switching, which undeniably impacts performance. Thus, in theory, io-uring should outperform epoll.

Blocking vs Non-blocking Sockets

The first question came up almost immediately: should the socket be non-blocking? In theory, it should not matter much because reads and writes are completed in the kernel by io-uring, and user space only calls submit_and_wait. But I still want to know whether blocking vs non-blocking sockets change kernel-side behavior under high concurrency. To answer that properly, I probably need to read the io-uring implementation.

Single vs Multiple Submissions

Another issue arose during development when I tried to submit asynchronous read events. I seemed to have two options:

Register a single read event at a time, wait for its completion, process my application logic, and then register another read event.
Register multiple read events in one go and wait for their completion sequentially.

Instinctively, I would avoid the first approach, as io-uring is designed to reduce system calls. In real-world scenarios, data often arrives in bulk, and multiple read events happening simultaneously are more common. However, using the second approach felt somewhat awkward. After digging through the documentation, I discovered the Multi-shot feature. This allows registering a single event in the Submission queue while handling a specified number of read completion events. It felt like an elegant solution.

Always pay attention to Rust unsafe code

While using the Rust io-uring crate, I encountered an issue with the Timer opcode. The timer wasn’t triggering as expected, which was frustrating. Since io-uring’s interfaces are straightforward, involving only a few system calls, mainly operating on the shared memory Submission Queue and Completion Queue, I lacked a clear approach to debug the problem. Resorting to a brute-force method, I minimized my test case to reproduce the issue.

Eventually, I realized the Timer opcode held a timestamp parameter pointing to the memory address of a timestamp variable I created during the call. The Timer opcode did not copy this timestamp into its memory. The variable was a stack variable, and its lifetime ended when the add_timer function exited. Because the push operation for the Submission Queue is unsafe, the Rust compiler didn’t warn me about the lifetime issue.

This realization prompted me to reflect on how I’d grown reliant on Rust’s lifetime checks. In unsafe Rust, I hadn’t exercised the same caution I would have in C, where I would immediately question the lifetime of input parameters. Rust’s safety guarantees had led me to assume the compiler would catch such issues. As a Rust novice, this experience reminded me to stay vigilant when writing unsafe code.

Observations and Reflections

My current takeaway: io-uring is not hard to start using, but using it well is a different story. You need to spend time on its implementation details and read mature libraries built on top of it. For example, I still do not know whether, or how, I should explore SQPOLL/kernel-side polling or zero-copy techniques for network I/O. This might explain the ongoing debates about io-uring network performance, such as Yet another comparison between io_uring and epoll on network performance and io_uring is slower than epoll. The tone in that second thread reminds me of my very first job.

When I have more time, I want to study io-uring more deeply, clean up the current code, and run some performance comparisons.

Conclusion

I developed all the changes mentioned in this article on the runtime branch. Here’s the link to the runtime code. Now, it’s time to start implementing the QUIC handshake!