openssl/doc/designs/quic-design/quic-fc.md

273 lines
13 KiB
Markdown
Raw Normal View History

Flow Control
============
Introduction to QUIC Flow Control
---------------------------------
QUIC flow control acts at both connection and stream levels. At any time,
transmission of stream data could be prevented by connection-level flow control,
by stream-level flow control, or both. Flow control uses a credit-based model in
which the relevant flow control limit is expressed as the maximum number of
bytes allowed to be sent on a stream, or across all streams, since the beginning
of the stream or connection. This limit may be periodically bumped.
It is important to note that both connection and stream-level flow control
relate only to the transmission of QUIC stream data. QUIC flow control at stream
level counts the total number of logical bytes sent on a given stream. Note that
this does not count retransmissions; thus, if a byte is sent, lost, and sent
again, this still only counts as one byte for the purposes of flow control. Note
that the total number of logical bytes sent on a given stream is equivalent to
the current “length” of the stream. In essence, the relevant quantity is
`max(offset + len)` for all STREAM frames `(offset, len)` we have ever sent for
the stream.
(It is essential that this be determined correctly, as deadlock may occur if we
believe we have exhausted our flow control credit whereas the peer believes we
have not, as the peer may wait indefinitely for us to send more data before
advancing us more flow control credit.)
QUIC flow control at connection level is based on the sum of all the logical
bytes transmitted across all streams since the start of the connection.
Connection-level flow control is controlled by the `MAX_DATA` frame;
stream-level flow control is controlled by the `MAX_STREAM_DATA` frame.
The `DATA_BLOCKED` and `STREAM_DATA_BLOCKED` frames defined by RFC 9000 are less
important than they first appear, as peers are not allowed to rely on them. (For
example, a peer is not allowed to wait until we send `DATA_BLOCKED` to increase
our connection-level credit, and a conformant QUIC implementation can choose to
never generate either of these frame types.) These frames rather serve two
purposes: to enhance flow control performance, and as a debugging aid.
However, their implementation is not critical.
Note that it follows from the above that the CRYPTO-frame stream is not subject
to flow control.
Note that flow control and congestion control are completely separate
mechanisms. In a given circumstance, either or both mechanisms may restrict our
ability to transmit application data.
Consider the following diagram:
RWM SWM SWM' CWM CWM'
| | | | |
| |<-- credit| -->| |
| <-|- threshold -|----->| |
----------------->
window size
We introduce the following terminology:
- **Controlled bytes** refers to any byte which counts for purposes of flow
control. A controlled byte is any byte of application data in a STREAM frame
payload, the first time it is sent (retransmissions do not count).
- (RX side only) **Retirement**, which refers to where we dequeue one or more
controlled bytes from a QUIC stream and hand them to the application, meaning
we are no longer responsible for them.
Retirement is an important factor in our RX flow control design, as we want
peers to transmit not just at the rate that our QUIC implementation can
process incoming data, but also at a rate the application can handle.
- (RX side only) The **Retired Watermark** (RWM), the total number of retired
controlled bytes since the beginning of the connection or stream.
- The **Spent Watermark** (SWM), which is the number of controlled bytes we have
sent (for the TX side) or received (for the RX side). This represents the
amount of flow control budget which has been spent. It is a monotonic value
and never decreases. On the RX side, such bytes have not necessarily been
retired yet.
- The **Credit Watermark** (CWM), which is the number of bytes which have
been authorized for transmission so far. This count is a cumulative count
since the start of the connection or stream and thus is also monotonic.
- The available **credit**, which is always simply the difference between
the SWM and the CWM.
- (RX side only) The **threshold**, which is how close we let the RWM
get to the CWM before we choose to extend the peer more credit by bumping the
CWM. The threshold is relative to (i.e., subtracted from) the CWM.
- (RX side only) The **window size**, which is the amount by which we or a peer
choose to bump the CWM each time, as we reach or exceed the threshold. The new
CWM is calculated as the SWM plus the window size (note that it added to the
SWM, not the old CWM.)
Note that:
- If the available credit is zero, the TX side is blocked due to a lack of
credit.
- If any circumstance occurs which would cause the SWM to exceed the CWM,
a flow control protocol violation has occurred and the connection
should be terminated.
Connection-Level Flow Control - TX Side
---------------------------------------
TX side flow control is exceptionally simple. It can be modelled as the
following state machine:
---> event: On TX (numBytes)
---> event: On TX Window Updated (numBytes)
<--- event: On TX Blocked
Get TX Window() -> numBytes
The On TX event is passed to the state machine whenever we send a packet.
`numBytes` is the total number of controlled bytes we sent in the packet (i.e.,
the number of bytes of STREAM frame payload which are not retransmissions). This
value is added to the TX-side SWM value. Note that this may be zero, though
there is no need to pass the event in this case.
The On TX Window Updated event is passed to the state machine whenever we have
our CWM increased. In other words, it is passed whenever we receive a `MAX_DATA`
frame, with the integer value contained in that frame (or when we receive the
`initial_max_data` transport parameter).
The On TX Window Updated event expresses the CWM (that is, the cumulative
number of controlled bytes we are allowed to send since the start of the
connection), thus it is monotonic and may never regress. If an On TX Window
Update event is passed to the state machine with a value lower than that passed
in any previous such event, it indicates a peer protocol error or a local
programming error.
The Get TX Window function returns our credit value (that is, it returns the
number of controlled bytes we are allowed to send). This value is reduced by the
On TX event and increased by the On TX Window Updated event. In fact, it is
simply the difference between the last On TX Window Updated value and the sum of
the `numBytes` arguments of all On TX events so far; it is that simple.
The On TX Blocked event is emitted at the time of any edge transition where the
value which would be returned by the Get TX Window function changes from
non-zero to zero. This always occurs during processing of an On TX event. (This
event is intended to assist in deciding when to generate `DATA_BLOCKED`
frames.)
We must not exceed the flow control limits, else the peer may terminate the
connection with an error.
An initial connection-level credit is communicated by the peer in the
`initial_max_data` transport parameter. All other credits occur as a result of a
`MAX_DATA` frame.
Stream-Level Flow Control - TX Side
-----------------------------------
Stream-level flow control works exactly the same as connection-level flow
control for the TX side.
The On TX Window Updated event occurs in response to the `MAX_STREAM_DATA`
frame, or based on the relevant transport parameter
(`initial_max_stream_data_bidi_local`, `initial_max_stream_data_bidi_remote`,
`initial_max_stream_data_uni`).
The On TX Blocked event can be used to decide when to generate
`STREAM_DATA_BLOCKED` frames.
Note that the number of controlled bytes we can send in a stream is limited by
both connection and stream-level flow control; thus the number of controlled
bytes we can send is the lesser value of the values returned by the Get TX
Window function on the connection-level and stream-level state machines,
respectively.
Connection-Level Flow Control - RX Side
---------------------------------------
---> event: On RX Controlled Bytes (numBytes) [internal event]
---> event: On Retire Controlled Bytes (numBytes)
<--- event: Increase Window (numBytes)
<--- event: Flow Control Error
RX side connection-level flow control provides an indication of when to generate
`MAX_DATA` frames to bump the peer's connection-level transmission credit. It is
somewhat more involved than the TX side.
The state machine receives On RX Controlled Bytes events from stream-level flow
controllers. Callers do not pass the event themselves. The event is generated by
a stream-level flow controller whenever we receive any controlled bytes.
`numBytes` is the number of controlled bytes we received. (This event is
generated by stream-level flow control as retransmitted stream data must be
counted only once, and the stream-level flow control is therefore in the best
position to determine how many controlled bytes (i.e., new, non-retransmitted
stream payload bytes) have been received).
If we receive more controlled bytes than we authorized, the state machine emits
the Flow Control Error event. The connection should be terminated with a
protocol error in this case.
The state machine emits the Increase Window event when it thinks that the peer
should be advanced more flow control credit (i.e., when the CWM should be
bumped). `numBytes` is the new CWM value, and is monotonic with regard to all
previous Increase Window events emitted by the state machine.
The state machine is passed the On Retire Controlled bytes event when one or
more controlled bytes are dequeued from any stream and passed to the
application.
The state machine uses the cadence of the On Retire Controlled Bytes events it
receives to determine when to increase the flow control window. Thus, the On
Retire Controlled Bytes event should be sent to the state machine when
processing of the received controlled bytes has been *completed* (i.e., passed
to the application).
Stream-Level Flow Control - RX Side
-----------------------------------
RX-side stream-level flow control works similarly to RX-side connection-level
flow control. There are a few differences:
- There is no On RX Controlled Bytes event.
- The On Retire Controlled Bytes event may optionally pass the same event
to a connection-level flow controller (an implementation decision), as these
events should always occur at the same time.
- An additional event is added, which replaces the On RX Controlled Bytes event:
---> event: On RX Stream Frame (offsetPlusLength, isFin)
This event should be passed to the state machine when a STREAM frame is
received. The `offsetPlusLength` argument is the sum of the offset field of
the STREAM frame and the length of the frame's payload in bytes. The isFin
argument should specify whether the STREAM frame had the FIN flag set.
This event is used to generate the internal On RX Controlled Bytes event to
the connection-level flow controller. It is also used by stream-level flow
control to determine if flow control limits are violated by the peer.
The state machine handles `offsetPlusLength` monotonically and ignores the
event if a previous such event already had an equal or greater value. The
reason this event is used instead of a `On RX (numBytes)` style event is that
this API can be monotonic and thus easier to use (the caller does not need to
remember if they have already counted a specific controlled byte in a STREAM
frame, which may after all duplicate some of the controlled bytes in a
previous STREAM frame).
RX Window Sizing
----------------
For RX flow control we must determine our window size. This is the value we add
to the peer's current SWM to determine the new CWM each time as RWM reaches the
threshold. The window size should be adapted dynamically according to network
conditions.
Many implementations choose to have a mechanism for increasing the window size
but not decreasing it, a simple approach which we adopt here.
The common algorithm is a so-called auto-tuning approach in which the rate of
window consumption (i.e., the rate at which RWM approaches CWM after CWM is
bumped) is measured and compared to the measured connection RTT. If the time it
takes to consume one window size exceeds a fixed multiple of the RTT, the window
size is doubled, up to an implementation-chosen maximum window size.
Auto-tuning occurs in 'epochs'. At the end of each auto-tuning epoch, a decision
is made on whether to double the window size, and a new auto-tuning epoch is
started.
For more information on auto-tuning, see [Flow control in
QUIC](https://docs.google.com/document/d/1F2YfdDXKpy20WVKJueEf4abn_LVZHhMUMS5gX6Pgjl4/edit#heading=h.hcm2y5x4qmqt)
and [QUIC Flow
Control](https://docs.google.com/document/d/1SExkMmGiz8VYzV3s9E35JQlJ73vhzCekKkDi85F1qCE/edit#).