Implementing QUIC from Scratch with Rust: Reliability

on 2025-03-09

A comparison between TCP and QUIC reliability, the implementation details behind QUIC ACKs and loss detection, and some notes on how I plan to improve testing and project quality.

Theory

What Reliability Means

The TCP RFC clearly lists reliability as one of TCP's core properties. More specifically, the transport layer promises to deliver data to the receiver in order, correctly, and completely, even when the network is bad and packets may be lost or reordered. That is reliability. It is also one of the biggest differences between TCP and UDP.

How TCP Provides Reliability

First, TCP needs to know whether data has reached the peer in order, correctly, and completely. As everyone knows, TCP puts Sequence Number and Acknowledgment Number fields in its header. The receiver uses Sequence Number to process data and hand it to the upper layer in send order. It also uses Acknowledgment Number to tell the sender the largest continuous Sequence Number it has received from the beginning of the stream.

The rest is the sender's job. Based on the Acknowledgment Number in packets returned by the receiver, the sender decides which already-sent data has been acknowledged, which data is still unacknowledged, and which data needs retransmission. For transmission efficiency, retransmission cannot be too frequent or too aggressive. The process of deciding whether an unacknowledged packet should be retransmitted is usually called loss detection.

With that strategy, TCP implements reliability. But there are many details I have skipped here. Besides loss detection itself, the receiver's ACK strategy has many details too: when to ACK immediately, when delayed ACK is acceptable, and what to do when the receiver has no data of its own to send. Frequent ACKs also hurt efficiency, so this is always a tradeoff.

Loss Detection

I have always understood TCP loss detection as relying mainly on two mechanisms: loss detection when the sender receives ACKs, and timeout retransmission when the sender does not receive ACKs for too long.

The first mechanism is the well-known fast retransmit and fast recovery path. When the sender receives three duplicate ACKs in a row, it should quickly retransmit the missing data. RFC 5681 gives detailed guidance. Of course there are still many details here, such as which data should be retransmitted and how the congestion window should change for efficiency.

The second mechanism, timeout retransmission, is the fallback. If the sender does not receive an acknowledgment for long enough, it actively retransmits unacknowledged data. That timeout is the familiar retransmission timeout, or RTO. The core question is how to calculate RTO. RTO is tightly tied to RTT. In other words, RTO is derived from RTT. What is RTT? RTT (round trip time) is the current round-trip time of the network path. From the definition alone, we can tell that RTT reflects the live state of the path. That is why RTO has to depend on RTT. According to RFC 6298, the default RTO is 1 second, and the formula is RTO = SRTT + 4 × RTTVAR, where SRTT is the smoothed RTT estimate and RTTVAR is the RTT variation.

Later, when we talk about QUIC reliability, I will dig into RTT measurement in more detail. QUIC learned a lot from TCP here and improves the design significantly. In most cases, the ACK-driven loss detection path triggers earlier than timeout retransmission, because RTO is usually longer and acts more like a last resort.

How TCP Reliability Evolved

Another way to ask this is: what are the flaws in TCP reliability? TCP is a very old protocol, after all. The first problem that comes to mind is that Acknowledgment Number simply does not carry enough information. It can only express the largest continuous byte range received by the receiver. If there is a hole, even if later data has already arrived, the receiver cannot tell the sender precisely through this field. The sender may have to retransmit all unacknowledged data, which hurts efficiency.

That is why Selective Acknowledgment (SACK) exists. SACK is negotiated through TCP options and then reported through SACK options in ACK segments. It tells the sender which byte ranges beyond the cumulative Acknowledgment Number have already arrived. This lets the sender retransmit missing data more accurately. I also want to mention Negative Acknowledgment (NACK). Unlike SACK, NACK tells the sender which data the receiver has not received, or in other words, which data the receiver wants retransmitted. I first ran into NACK in WebRTC. In real-time communication, NACK is more natural because audio/video traffic often does not require every packet to arrive; it only needs key packets to be delivered. As I understand it, TCP, as a general-purpose transport, does not support such an aggressive strategy.

Then there is Recent Acknowledgment (RACK). One thing to clarify: RACK is not extra acknowledgment information in the way SACK or NACK is. RACK complements fast retransmit and fast recovery. Fast retransmit has its own flaw: it requires three duplicate ACKs before declaring packet loss. In high-latency networks, that can make loss detection slow. In networks with frequent reordering, duplicate ACKs may not mean loss at all, so fast retransmit can also misjudge.

To handle this, RACK introduces time as an extra signal for deciding whether loss happened, instead of relying only on the number of duplicate ACKs. More specifically, RACK records the send time of each sent packet. Whenever an ACK arrives, it finds the largest send time among acknowledged packets, then compares that time with the send time of unacknowledged packets. If the difference is larger than some time threshold, the unacknowledged packet is considered lost and should be retransmitted. RACK can improve loss detection efficiency. When only one packet is lost, RACK plus SACK can detect it faster instead of waiting blindly for three duplicate ACKs.

Finally, timeout retransmission is slow, especially for tail loss. Tail loss means the sender sends a chunk of data, has no more data to send for now, and the last few packets are lost. In the original TCP design, because there are no more packets from the sender to drive ACK responses, it is hard for the receiver to give the sender enough ACK information to infer the loss. The sender usually falls back to timeout retransmission, which takes longer.

For this scenario, Tail Loss Probe (TLP) provides a better approach. After data is sent, TLP starts a timer. If the data is not acknowledged, the sender sends a probe packet. The timer is called Probe Timeout (PTO), and the formula is smoothed_rtt + max(4*rttvar, kGranularity) + max_ack_delay. Compared with Retransmission Timeout (RTO), PTO has a similar formula, but it can send multiple probe packets when multiple packets hit timeout, and its impact on the congestion window is smaller. In modern networks, especially wireless networks where reordering and loss are common, PTO tends to recover more efficiently than RTO.

How QUIC Reliability Differs from TCP

All of the TCP background above is here to help explain QUIC reliability. QUIC and TCP are both transport protocols trying to solve similar problems, so they share many ideas. But QUIC has no historical baggage and can absorb lessons from TCP's design. That gives QUIC several important differences.

RTT Measurement

RTO, PTO, and the time threshold in RACK all depend on RTT, because RTT reflects the real state of the network path. Accurate RTT measurement is therefore critical. TCP makes this hard because Sequence Number and Acknowledgment Number carry too little information. The most painful issue is retransmission: retransmitted data uses the same Sequence Number as the original data. When the peer acknowledges it, the sender cannot know whether the ACK corresponds to the first transmission or the retransmission. TCP has many RTT measurement algorithms, but none of them is perfect.

QUIC solves this in a simple way: QUIC packet numbers always increase, so every acknowledged QUIC packet maps to exactly one previously sent packet. Someone might ask: if QUIC packet numbers keep increasing, how does QUIC keep data ordered? QUIC uses separate fields for data ordering. For example, QUIC Crypto frames and Stream frames both use offsets to keep data ordered within their own streams. This exposes the essential problem with TCP Sequence Number: TCP couples packet numbering and data ordering together, mixing the signaling path and the data path into one field.

QUIC also accounts for delayed ACKs. Its ACK frame includes an ACK Delay field, so the receiver can tell the sender about extra application-layer response delay. That makes RTT estimation more accurate.

SACK-Like ACK Ranges

QUIC ACK ranges are similar in spirit to SACK. They carry ranges of QUIC packet numbers received by the receiver. But compared with TCP SACK, which can only report three ranges (I was surprised when I first learned that; how is three enough?), QUIC ACK ranges can report far more. Another difference is that SACK allows reneging, meaning a receiver can take back a range it previously SACKed, and the sender may need to retransmit it again.

I was shocked when I first learned about SACK Reneging. If the receiver can do that, doesn't the sender get tricked? The implementation also becomes much more complex. I was curious why TCP was designed this way, so I read RFC 2018. The reason seems to be receiver buffer pressure: the receiver is allowed to discard lower-priority data and retract earlier SACK ranges. I still feel that flow control should be the better tool here. Also, if receiver memory is precious, is sender memory free? But the people who designed it must have had their reasons, and maybe receiver-side safety risks were more important. In any case, thank you QUIC for simply forbidding reneging. That makes implementation much easier.

Only Using a PTO Timer

As mentioned above, TCP introduced TLP for tail loss. QUIC simplifies the design by using a PTO timer to replace the older TLP and RTO timers. Besides implementation simplicity, the PTO timer can be more aggressive than RTO, which improves efficiency. As discussed earlier, a PTO timer only sends probe packets, so the cost is lower. It handles tail loss and also serves as a fallback loss detection mechanism, letting the protocol enter loss probing more quickly and recover faster after packet loss.

Implementation Details

After the theory, I want to split the feature into a few core questions. Once those are handled, the implementation becomes much more straightforward.

Handling the Receiver's Responses

There are essentially two jobs here. First, we use the peer's Packet Number to maintain the state needed for generating our own ACK Frame. Second, if the peer's response carries an ACK Frame, we update our sent queue and run loss detection.

Updating QUIC Packet Number State

Why do we need to process QUIC Packet Number? Because QUIC is full duplex. Even when we are acting as a sender, we still need to maintain information for building ACK Frames back to the peer. A QUIC ACK Frame carries two kinds of information: ACK Delay, which improves RTT estimation, and ranges that work like SACK, telling the peer which packet number ranges have been received.

The second kind is the core data we maintain by processing packet numbers from the peer. I do not want to spend too much time explaining the exact ACK frame encoding, because the QUIC RFC is already clear. The key question is how we maintain the information needed to build the ACK frame.

Maintaining ACK frame state means maintaining ranges of packet numbers we have already received. When a new QUIC packet arrives, we compare its QUIC Packet Number with the existing ranges and update them. For example, if the current ranges are [3, 5], [8, 9], and the received packet number is 7, then the ranges should become [3, 5], [7, 9]. This is basically range insertion, merging, and updating.

One special case is what to do when the number of ranges exceeds our limit. QUIC does not explicitly limit the number of ranges, except that one ACK Frame must fit into a single QUIC packet and therefore stay within MTU. As an implementation, we still need a limit. I simply set mine to 18.

If the range count exceeds that limit, I check whether the packet number that caused the new range is a new packet or an old one. If it is a new packet, I evict old acknowledgment ranges because new ranges matter more to network performance, and old ranges can rely on the peer's retransmission logic as a fallback. If it is a very old packet, I save it separately and immediately send a standalone ACK frame to tell the peer it was received, without affecting the current maintained ranges.

When to Send ACK Frames

After updating the state needed to build our ACK Frame, we still need to decide when to send that ACK frame. This is the ACK frequency problem. ACK frequency affects transport performance because loss detection usually depends on ACK Frames. If ACKs are too infrequent, delayed responses hurt performance. If ACKs are too frequent, they also waste bandwidth.

The QUIC RFC suggests that the receiver should send an ACK immediately after receiving at least two ACK-eliciting packets. If the receiver decides not to ACK immediately, it must still send the ACK within its negotiated local MAX ACK Delay and use the ACK frame's ACK Delay field to tell the peer. There are special cases too. During the handshake, ACKs should be sent promptly to improve handshake performance. If packets arrive out of order, the receiver should also ACK immediately to protect transport quality.

Handling ACK Frames

When an ACK Frame arrives, the receiver mainly does a few things. First, it scans the sent queue and checks which QUIC frames have been acknowledged. The frame type matters. For Crypto frames and Stream frames, the follow-up work differs. Crypto handshake data is fully managed by the QUIC stack, so once acknowledged, the corresponding sent data can be destroyed. Stream frames involve QUIC Stream state. If the QUIC Stream send queue was full and peer acknowledgment frees space, the QUIC stack needs to notify the upper layer that the stream is writable.

QUIC ACK frames also need special handling. We have been talking about maintaining ACK frame construction state, but not about when to clean it up. When a sent ACK frame is acknowledged by the peer, that is a good opportunity. The QUIC RFC says ranges before Largest_acked in the acknowledged ACK Frame can be cleared. I like this operation because it removes a lot of implementation complexity.

Second, we trigger RTT estimation. The QUIC RFC adds extra requirements here: Largest_acked must be updated, and the packet acknowledged by the ACK Frame must contain an ACK-eliciting frame. I understand these requirements as safeguards for RTT accuracy. The last job is loss detection, which I will discuss below.

Sending Data

When to Retransmit

In one sentence: loss detection decides when to retransmit, and the PTO timer improves loss detection efficiency. We already discussed when loss detection runs: it runs when an ACK Frame is received. As for what counts as loss, the QUIC RFC blends ideas from RACK and uses ACK responses to judge loss from both time and packet-threshold dimensions. One detail to notice: for time-threshold loss detection, if no loss is detected, the implementation sets a timer for the unacknowledged packet closest to crossing the threshold. When that timer fires, loss handling runs again.

Back to the PTO timer. The PTO timer sends ACK-eliciting packets to trigger ACK Frame responses and speed up loss detection. There are several details worth noting. First, PTO timeout is calculated from RTT, but it is usually longer than the QUIC loss detection timer. This also means PTO is not cheap; once it fires, we need to do extra work.

For example, the QUIC RFC recommends changing the PTO probe packet's Packet Number enough to make the peer ACK promptly instead of delaying ACK. What I find interesting here is that QUIC allows packet number jumps. This can also make traffic sniffing slightly harder, and if you think about it, a small packet number jump does not affect reliability. In my implementation, I still send two PTO probe packets in a row to ensure the peer responds quickly, because a peer usually ACKs immediately after receiving two ACK-eliciting packets. This recovers faster than relying on a packet number jump, at the cost of sending one more packet.

What exactly is a PTO probe packet? It depends on the state. Normally, if there is no data to send, it is a Ping frame. During the handshake, it is usually handshake data such as a Crypto frame, which helps complete the handshake faster. But when I looked at NGINX QUIC as the server, I noticed that NGINX still sends Ping frames as PTO probes during the handshake. This may be for server resource protection. If there is data to send, PTO probe packets are usually Stream frames or Crypto frames.

Retransmission Priority

Normally, retransmission is constrained by the congestion window and flow control. I have not implemented those yet, so I can ignore them for now. The remaining question is what to retransmit first. For both network recovery and helping the receiver reclaim memory, retransmitted data should be sent before new data. One small thing to remember is that packet numbers still keep increasing.

Testing

After finishing this reliability feature, my strongest feeling was that I had already written many bugs. It was exactly what I expected: starting a new project is not hard; keeping its quality under control is hard.

The biggest problem is that the basic QUIC stack is not complete yet, so I cannot add the kind of automated tests I really want. Even for this reliability implementation, I cannot cover every detail because I do not yet have full data send/receive capability. For now I can only write temporary test cases for the scenarios I worry about. Besides test cases, I also plan to add fuzz testing later to improve feather-quic's robustness, so it only throws unrecoverable errors when it really has to and otherwise handles abnormal cases according to the QUIC design.

After the basic features are done, I will consider using quic-interop-runner to test the QUIC stack. Once the basic behavior is stable, I will think about performance tests and optimization. Finally, the codebase is getting a little large. I need to find ways to simplify it, and I should adjust the crate structure too.

Epilogue

If you have read this far, you can probably tell that I am not especially strong at transport-layer QoS. Before implementing QUIC reliability, I only knew about TCP fast retransmit and timeout retransmission, plus a bit of SACK and NACK. QUIC introduced many mechanisms that taught me a lot during implementation, including TCP RACK-TLP ideas and many QUIC-specific optimizations.

I originally planned to update this blog in January, but I kept procrastinating and dragged it into March. I even thought about abandoning this small project, but a colleague told me my blog was pretty good, and of course I decided to take that seriously 😂. There are still many mistakes here, both in theory and in code, but the project is fun to write. As usual, here is the related PR. For the next feature, I plan to implement QUIC data transfer, namely QUIC Stream.