openssl/doc/designs/quic-design/quic-fc.md
Hugo Landau 508e087c4c QUIC Flow Control
Reviewed-by: Paul Dale <pauli@openssl.org>
Reviewed-by: Matt Caswell <matt@openssl.org>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from https://github.com/openssl/openssl/pull/19040)
2022-09-26 08:01:55 +01:00

13 KiB

Flow Control

Introduction to QUIC Flow Control

QUIC flow control acts at both connection and stream levels. At any time, transmission of stream data could be prevented by connection-level flow control, by stream-level flow control, or both. Flow control uses a credit-based model in which the relevant flow control limit is expressed as the maximum number of bytes allowed to be sent on a stream, or across all streams, since the beginning of the stream or connection. This limit may be periodically bumped.

It is important to note that both connection and stream-level flow control relate only to the transmission of QUIC stream data. QUIC flow control at stream level counts the total number of logical bytes sent on a given stream. Note that this does not count retransmissions; thus, if a byte is sent, lost, and sent again, this still only counts as one byte for the purposes of flow control. Note that the total number of logical bytes sent on a given stream is equivalent to the current “length” of the stream. In essence, the relevant quantity is max(offset + len) for all STREAM frames (offset, len) we have ever sent for the stream.

(It is essential that this be determined correctly, as deadlock may occur if we believe we have exhausted our flow control credit whereas the peer believes we have not, as the peer may wait indefinitely for us to send more data before advancing us more flow control credit.)

QUIC flow control at connection level is based on the sum of all the logical bytes transmitted across all streams since the start of the connection.

Connection-level flow control is controlled by the MAX_DATA frame; stream-level flow control is controlled by the MAX_STREAM_DATA frame.

The DATA_BLOCKED and STREAM_DATA_BLOCKED frames defined by RFC 9000 are less important than they first appear, as peers are not allowed to rely on them. (For example, a peer is not allowed to wait until we send DATA_BLOCKED to increase our connection-level credit, and a conformant QUIC implementation can choose to never generate either of these frame types.) These frames rather serve two purposes: to enhance flow control performance, and as a debugging aid. However, their implementation is not critical.

Note that it follows from the above that the CRYPTO-frame stream is not subject to flow control.

Note that flow control and congestion control are completely separate mechanisms. In a given circumstance, either or both mechanisms may restrict our ability to transmit application data.

Consider the following diagram:

RWM   SWM           SWM'   CWM         CWM'
 |     |             |      |           |
 |     |<--    credit|   -->|           |
 |   <-|- threshold -|----->|           |
                      ----------------->
                             window size

We introduce the following terminology:

  • Controlled bytes refers to any byte which counts for purposes of flow control. A controlled byte is any byte of application data in a STREAM frame payload, the first time it is sent (retransmissions do not count).

  • (RX side only) Retirement, which refers to where we dequeue one or more controlled bytes from a QUIC stream and hand them to the application, meaning we are no longer responsible for them.

    Retirement is an important factor in our RX flow control design, as we want peers to transmit not just at the rate that our QUIC implementation can process incoming data, but also at a rate the application can handle.

  • (RX side only) The Retired Watermark (RWM), the total number of retired controlled bytes since the beginning of the connection or stream.

  • The Spent Watermark (SWM), which is the number of controlled bytes we have sent (for the TX side) or received (for the RX side). This represents the amount of flow control budget which has been spent. It is a monotonic value and never decreases. On the RX side, such bytes have not necessarily been retired yet.

  • The Credit Watermark (CWM), which is the number of bytes which have been authorized for transmission so far. This count is a cumulative count since the start of the connection or stream and thus is also monotonic.

  • The available credit, which is always simply the difference between the SWM and the CWM.

  • (RX side only) The threshold, which is how close we let the RWM get to the CWM before we choose to extend the peer more credit by bumping the CWM. The threshold is relative to (i.e., subtracted from) the CWM.

  • (RX side only) The window size, which is the amount by which we or a peer choose to bump the CWM each time, as we reach or exceed the threshold. The new CWM is calculated as the SWM plus the window size (note that it added to the SWM, not the old CWM.)

Note that:

  • If the available credit is zero, the TX side is blocked due to a lack of credit.

  • If any circumstance occurs which would cause the SWM to exceed the CWM, a flow control protocol violation has occurred and the connection should be terminated.

Connection-Level Flow Control - TX Side

TX side flow control is exceptionally simple. It can be modelled as the following state machine:

    ---> event: On TX (numBytes)
    ---> event: On TX Window Updated (numBytes)
    <--- event: On TX Blocked
    Get TX Window() -> numBytes

The On TX event is passed to the state machine whenever we send a packet. numBytes is the total number of controlled bytes we sent in the packet (i.e., the number of bytes of STREAM frame payload which are not retransmissions). This value is added to the TX-side SWM value. Note that this may be zero, though there is no need to pass the event in this case.

The On TX Window Updated event is passed to the state machine whenever we have our CWM increased. In other words, it is passed whenever we receive a MAX_DATA frame, with the integer value contained in that frame (or when we receive the initial_max_data transport parameter).

The On TX Window Updated event expresses the CWM (that is, the cumulative number of controlled bytes we are allowed to send since the start of the connection), thus it is monotonic and may never regress. If an On TX Window Update event is passed to the state machine with a value lower than that passed in any previous such event, it indicates a peer protocol error or a local programming error.

The Get TX Window function returns our credit value (that is, it returns the number of controlled bytes we are allowed to send). This value is reduced by the On TX event and increased by the On TX Window Updated event. In fact, it is simply the difference between the last On TX Window Updated value and the sum of the numBytes arguments of all On TX events so far; it is that simple.

The On TX Blocked event is emitted at the time of any edge transition where the value which would be returned by the Get TX Window function changes from non-zero to zero. This always occurs during processing of an On TX event. (This event is intended to assist in deciding when to generate DATA_BLOCKED frames.)

We must not exceed the flow control limits, else the peer may terminate the connection with an error.

An initial connection-level credit is communicated by the peer in the initial_max_data transport parameter. All other credits occur as a result of a MAX_DATA frame.

Stream-Level Flow Control - TX Side

Stream-level flow control works exactly the same as connection-level flow control for the TX side.

The On TX Window Updated event occurs in response to the MAX_STREAM_DATA frame, or based on the relevant transport parameter (initial_max_stream_data_bidi_local, initial_max_stream_data_bidi_remote, initial_max_stream_data_uni).

The On TX Blocked event can be used to decide when to generate STREAM_DATA_BLOCKED frames.

Note that the number of controlled bytes we can send in a stream is limited by both connection and stream-level flow control; thus the number of controlled bytes we can send is the lesser value of the values returned by the Get TX Window function on the connection-level and stream-level state machines, respectively.

Connection-Level Flow Control - RX Side

    ---> event: On RX Controlled Bytes (numBytes)       [internal event]
    ---> event: On Retire Controlled Bytes (numBytes)
    <--- event: Increase Window (numBytes)
    <--- event: Flow Control Error

RX side connection-level flow control provides an indication of when to generate MAX_DATA frames to bump the peer's connection-level transmission credit. It is somewhat more involved than the TX side.

The state machine receives On RX Controlled Bytes events from stream-level flow controllers. Callers do not pass the event themselves. The event is generated by a stream-level flow controller whenever we receive any controlled bytes. numBytes is the number of controlled bytes we received. (This event is generated by stream-level flow control as retransmitted stream data must be counted only once, and the stream-level flow control is therefore in the best position to determine how many controlled bytes (i.e., new, non-retransmitted stream payload bytes) have been received).

If we receive more controlled bytes than we authorized, the state machine emits the Flow Control Error event. The connection should be terminated with a protocol error in this case.

The state machine emits the Increase Window event when it thinks that the peer should be advanced more flow control credit (i.e., when the CWM should be bumped). numBytes is the new CWM value, and is monotonic with regard to all previous Increase Window events emitted by the state machine.

The state machine is passed the On Retire Controlled bytes event when one or more controlled bytes are dequeued from any stream and passed to the application.

The state machine uses the cadence of the On Retire Controlled Bytes events it receives to determine when to increase the flow control window. Thus, the On Retire Controlled Bytes event should be sent to the state machine when processing of the received controlled bytes has been completed (i.e., passed to the application).

Stream-Level Flow Control - RX Side

RX-side stream-level flow control works similarly to RX-side connection-level flow control. There are a few differences:

  • There is no On RX Controlled Bytes event.

  • The On Retire Controlled Bytes event may optionally pass the same event to a connection-level flow controller (an implementation decision), as these events should always occur at the same time.

  • An additional event is added, which replaces the On RX Controlled Bytes event:

      ---> event: On RX Stream Frame (offsetPlusLength, isFin)
    

    This event should be passed to the state machine when a STREAM frame is received. The offsetPlusLength argument is the sum of the offset field of the STREAM frame and the length of the frame's payload in bytes. The isFin argument should specify whether the STREAM frame had the FIN flag set.

    This event is used to generate the internal On RX Controlled Bytes event to the connection-level flow controller. It is also used by stream-level flow control to determine if flow control limits are violated by the peer.

    The state machine handles offsetPlusLength monotonically and ignores the event if a previous such event already had an equal or greater value. The reason this event is used instead of a On RX (numBytes) style event is that this API can be monotonic and thus easier to use (the caller does not need to remember if they have already counted a specific controlled byte in a STREAM frame, which may after all duplicate some of the controlled bytes in a previous STREAM frame).

RX Window Sizing

For RX flow control we must determine our window size. This is the value we add to the peer's current SWM to determine the new CWM each time as RWM reaches the threshold. The window size should be adapted dynamically according to network conditions.

Many implementations choose to have a mechanism for increasing the window size but not decreasing it, a simple approach which we adopt here.

The common algorithm is a so-called auto-tuning approach in which the rate of window consumption (i.e., the rate at which RWM approaches CWM after CWM is bumped) is measured and compared to the measured connection RTT. If the time it takes to consume one window size exceeds a fixed multiple of the RTT, the window size is doubled, up to an implementation-chosen maximum window size.

Auto-tuning occurs in 'epochs'. At the end of each auto-tuning epoch, a decision is made on whether to double the window size, and a new auto-tuning epoch is started.

For more information on auto-tuning, see Flow control in QUIC and QUIC Flow Control.