Reviewed-by: Tomas Mraz <tomas@openssl.org> Reviewed-by: Paul Dale <pauli@openssl.org> (Merged from https://github.com/openssl/openssl/pull/19770)
QUIC I/O Architecture
This document discusses possible implementation options for the I/O architecture internal to the libssl QUIC implementation, discusses the underlying design constraints driving this decision and introduces the resulting I/O architecture. It also identifies potential hazards to existing applications, and identifies how those hazards are mitigated.
Objectives
The OpenSSL QUIC API design is intended to meet the following objectives, amongst others:
- We want to support both blocking and non-blocking semantics for application use of the libssl APIs.
- In the case of non-blocking applications, it must be possible for an application to do its own polling and make its own event loop.
Requirements
These requirements are complicated by the fact that traditional use of the libssl API allows an application to pass an arbitrary BIO to an SSL object; not only that, separate BIOs can be passed for the read and write directions. The nature of this BIO can be arbitrary; it could be a socket, or a memory buffer.
Implementation of QUIC will require that the underlying network BIO passed to the QUIC implementation be configured to support datagram semantics instead of bytestream semantics as has been the case with traditional TLS over TCP.
Implementation of QUIC requires handling of timer events as well as the circumstances where a network socket becomes readable or writable. In many cases we need to handle these events simultaneously (e.g. wait until a socket becomes readable, or a timeout expires, whichever comes first).
Blocking vs. Non-Blocking I/O
The above constraints make it effectively a requirement that non-blocking I/O be used for the calls to the underlying network BIOs. To illustrate this point, we first consider how QUIC might be implemented using blocking I/O internally.
To function correctly and provide blocking semantics at the application level, our QUIC implementation must be able to block such that it can respond to any of the following events for the underlying network read and write BIOs immediately:
- The underlying network write BIO becomes writeable;
- The underlying network read BIO becomes readable;
- A timeout expires.
Blocking sockets and select(3)
Firstly, consider how this might be accomplished using the Berkeley sockets API. Blocking on all three wakeup conditions listed above would require use of an API such as select(3) or poll(3), regardless of whether the network socket is configured in blocking mode or not.
While in principle APIs such as select(3) can be used with a socket in blocking mode, this is not an advisable usage mode. If a socket is in blocking mode, calls to send(3) or recv(3) may block for some arbitrary period of time, meaning that our QUIC implementation cannot handle incoming data (if we are blocked on send), send outgoing data (if we are blocked on receive), or handle timeout events.
Though it can be argued that a select(3) call indicating readability or writeability should guarantee that a subsequent send(3) or recv(3) call will not block, there are several reasons why this is an extremely undesirable solution:
- Spurious wakeups from select(3) do occur in practice. The Linux select(2) manual page, for example, notes that a socket may be reported as ready for reading and a subsequent read may nevertheless block.
- The fact that a socket is writeable does not necessarily mean that a datagram of the size we wish to send is writeable, so a send(3) call could block anyway.
- This usage pattern precludes multithreaded use, barring some locking scheme, because other threads can race between the call to select(3) and the subsequent I/O call. This undermines our intention to support multi-threaded network I/O on the backend.
Moreover, our QUIC implementation does not drive the Berkeley sockets API directly but uses the BIO abstraction to access the network, so these issues are compounded by the limitations of our existing BIO interfaces. We do not have a BIO interface which provides select(3)-like functionality or which can implement the required semantics above. Therefore, trying to implement QUIC on top of blocking I/O in this way would require violating the BIO abstraction layer, and would not work with custom BIOs.
Blocking sockets and threads
Another conceptual possibility is that blocking calls could be kept ongoing in parallel threads. Under this model, there would be three threads:
- a thread which exists solely to execute blocking calls to the BIO_write of an underlying network BIO,
- a thread which exists solely to execute blocking calls to the BIO_read of an underlying network BIO,
- a thread which exists solely to wait for and dispatch timeout events.
This has a large number of disadvantages:
- There is a hard requirement for threading functionality in order to be able to support blocking semantics at the application level. Applications which require blocking semantics would only be able to function in thread assisted mode. In environments where threading support is not available or desired, our APIs would only be usable in a non-blocking fashion.
- Several threads are spawned which the application is not in control of. This undermines our general approach of providing the application with control over OpenSSL's use of resources, such as allowing the application to do its own polling or provide its own allocators.
  At a minimum for a client, there must be two threads per connection. This means that if an application opens n outgoing connections, 2n extra threads will need to be spawned.
- Blocking in BIO_write calls precludes correct implementation of QUIC. Unlike any analogue in TLS, QUIC packets are time sensitive and intended to be transmitted as soon as they are generated. QUIC packets contain fields such as the ACK Delay value, which is intended to describe the time between a packet being received and a return packet being generated. Correct calculation of this field is necessary for correct calculation of connection RTT. It is therefore important to only generate packets when they are ready to be sent, otherwise suboptimal performance will result. This is a usage model which aligns optimally to non-blocking I/O and which cannot be accommodated by blocking I/O.
- Since existing custom BIOs will not be expecting concurrent BIO_read and BIO_write calls, they will need to be adapted to support this, which is likely to require substantial rework of those custom BIOs (trivial locking of calls obviously does not work since both of these calls must be able to block on network I/O simultaneously).
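To make the ACK Delay point concrete, the sketch below follows the encoding in RFC 9000 (the field carries microseconds shifted right by ack_delay_exponent, default 3) and the RTT adjustment in RFC 9002 (the reported delay is subtracted from the latest sample unless that would go below the minimum RTT). The function names are hypothetical and the handling of max_ack_delay is omitted for brevity.

```c
#include <stdint.h>

#define ACK_DELAY_EXPONENT_DEFAULT 3

/* Encode a delay in microseconds into the ACK Delay field value. */
static uint64_t ack_delay_encode(uint64_t delay_us, unsigned int exponent)
{
    return delay_us >> exponent;
}

/* Decode an ACK Delay field value back into microseconds. */
static uint64_t ack_delay_decode(uint64_t field, unsigned int exponent)
{
    return field << exponent;
}

/*
 * RTT sample as seen by the peer: subtract the reported ACK delay from the
 * measured RTT, unless doing so would take the sample below min_rtt.
 */
static uint64_t rtt_sample_us(uint64_t latest_rtt_us, uint64_t min_rtt_us,
                              uint64_t ack_delay_us)
{
    if (latest_rtt_us >= min_rtt_us + ack_delay_us)
        return latest_rtt_us - ack_delay_us;
    return latest_rtt_us;
}
```

With a 50 ms path RTT and an ACK generated 25 ms after receipt, the peer measures 75 ms and recovers a 50 ms sample. If that same ACK then sits in a blocked write for a further 40 ms, the peer measures 115 ms but the field still says 25 ms, so the sample becomes 90 ms. The extra time spent blocked is indistinguishable from network latency.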
Moreover, this does not appear to be a realistically implementable approach:
- The question is posed of how to handle connection teardown, which does not seem to be solvable. If parallel threads are blocking in blocking BIO_read and BIO_write calls on some underlying network BIO, there needs to be some way to force these calls to return once SSL_free is called and we need to tear down the connection. However, the BIO interface does not provide any way to do this. At best we might assume the BIO is a BIO_s_dgram (but cannot assume this in the general case), but even then we can only accomplish teardown by violating the BIO abstraction and closing the underlying socket.
  This is the only portable way to ensure that a recv(3) call to the same socket returns. This obviously is a highly application-visible change (and is likely to be far more disruptive than configuring the socket into non-blocking mode).
  Moreover, it is not workable anyway because it only works for a socket-based BIO and violates the BIO abstraction. For BIOs in general, there does not appear to be any viable solution to the teardown issue.
Even if this approach were successfully implemented, applications will still need to change to using network BIOs with datagram semantics. For applications using custom BIOs, this is likely to require substantial rework of those BIOs. There is no possible way around this. Thus, even if this solution were adopted (notwithstanding the issues which preclude this noted above) for the purposes of accommodating applications using custom network BIOs in a blocking mode, these applications would still have to completely rework their implementation of those BIOs. In any case, it is expected to be very rare that sophisticated applications implementing their own custom BIOs will do so in a blocking mode.
Use of non-blocking I/O
By comparison, use of non-blocking I/O and select(3) or similar APIs on the network side makes satisfying our requirements for QUIC easy, and also allows our internal approach to I/O to be flexibly adapted in the future as requirements may evolve.
This is also the approach used by all other known QUIC implementations; it is highly unlikely that any QUIC implementations exist which use blocking network I/O since, as mentioned above, it would lead to suboptimal performance due to the ACK delay issue.
Note that this is orthogonal to whether we provide blocking I/O semantics to the application. We can use non-blocking I/O internally while using this to provide either blocking or non-blocking semantics to the application, based on what the application requests.
This approach in general requires that a network socket be configured in non-blocking mode. Though some OSes support a MSG_DONTWAIT flag which allows a single I/O operation to be made non-blocking, not all OSes support this (e.g. Windows), thus this cannot be relied on. As such, we need to configure any socket FD we use into non-blocking mode.
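On POSIX systems, configuring an FD into non-blocking mode can be sketched as below using fcntl(2). The helper names set_nonblocking and try_recv_dgram are hypothetical; on Windows the equivalent step would use ioctlsocket() with FIONBIO instead, which is not shown here.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Put fd into non-blocking mode. Returns 0 on success, -1 on failure. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    if (flags & O_NONBLOCK)
        return 0; /* already non-blocking */
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Attempt to receive one datagram without blocking. Returns the datagram
 * length, or -1 on failure; *would_block is set to 1 if the failure only
 * means that no datagram is currently available.
 */
static ssize_t try_recv_dgram(int fd, void *buf, size_t len, int *would_block)
{
    ssize_t n = recv(fd, buf, len, 0);

    *would_block = (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));
    return n;
}
```

Once the FD is non-blocking, a receive that finds nothing queued returns immediately, leaving the caller free to service timers or pending transmissions.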
Of the approaches outlined in this document, the use of non-blocking I/O has the fewest disadvantages and is the only approach which appears to actually be implementable in practice. Moreover, each disadvantage can be readily mitigated:
- We rely on having a select(3) or poll(3) like function available from the OS.
  However:
  - Firstly, we already rely on select(3) in our code, so this does not appear to raise any portability issues;
  - Secondly, we have the option of providing a custom poller interface which allows an application to provide its own implementation of a select(3)-like function. In fact, this has the potential to be quite powerful and would allow the application to implement its own pollable BIOs, and therefore perform blocking I/O on top of any custom BIO.
    For example, while historically none of our own memory-based BIOs have supported blocking semantics, a sophisticated application could if it wished choose to implement a custom blocking memory BIO and implement a custom poller which synchronises using a custom poll descriptor based around condition variables rather than sockets. Thus this scheme is highly flexible.
    (It is worth noting also that the implementation of blocking semantics at the application level also does not rely on any privileged access to the internals of the QUIC implementation and an application could if it wished build blocking semantics out of a non-blocking QUIC instance; this is not particularly difficult, though providing custom pollers here would mean there should be no need for an application to do so.)
- Configuring a socket into non-blocking mode might confuse an application.
  However:
  - Applications will already have to make changes to any network-side BIOs, for example switching from a BIO_s_socket to a BIO_s_dgram, or from a BIO pair to a BIO_s_dgram_pair. Custom BIOs will need to be substantially reworked to switch from bytestream semantics to datagram semantics. Such applications will already need substantial changes, and this is unavoidable.
    Of course, application impacts and migration guidance can (and will) all be documented.
  - In order for an application to be confused by us putting a socket into non-blocking mode, it would need to be trying to use the socket in some way. But it is not possible for an application to pass a socket to our QUIC implementation, and also try to use the socket directly, and have QUIC still work. Using QUIC necessarily requires that an application not also be trying to make use of the same socket.
  - There are some circumstances where an application might want to multiplex other protocols onto the same UDP socket, for example with protocols like RTP/RTCP or STUN; this can be facilitated using the QUIC fixed bit. However, these use cases cannot be supported without explicit assistance from a QUIC implementation and this use case cannot be facilitated by simply sharing a network socket, as incoming datagrams will not be routed correctly. (We may offer some functionality in future to allow this to be coordinated but this is not for MVP.) Thus this also is not a concern. Moreover, it is extremely unlikely that any such applications are using sockets in blocking mode anyway.
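The remark above that blocking semantics can be built on top of a non-blocking instance can be sketched with a plain non-blocking datagram socket standing in for the QUIC object. The helper names make_nonblocking and recv_blocking are hypothetical; the same retry-then-poll loop is what an application (or libssl itself) would wrap around a non-blocking call.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

static int make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    return flags < 0 ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

static long long now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Blocking receive on a non-blocking socket. Retries recv() and sleeps in
 * poll() between attempts. A negative timeout_ms waits indefinitely.
 * Returns the datagram length, or -1 with errno set (ETIMEDOUT on timeout).
 */
static ssize_t recv_blocking(int fd, void *buf, size_t len, int timeout_ms)
{
    long long deadline = timeout_ms < 0 ? -1 : now_ms() + timeout_ms;

    for (;;) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
        ssize_t n = recv(fd, buf, len, 0);
        int wait_ms = -1;

        if (n >= 0)
            return n;
        if (errno == EINTR)
            continue;
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;

        if (deadline >= 0) {
            long long left = deadline - now_ms();

            if (left <= 0) {
                errno = ETIMEDOUT;
                return -1;
            }
            wait_ms = (int)left;
        }
        /* A spurious wakeup is harmless: we simply retry recv(). */
        if (poll(&pfd, 1, wait_ms) < 0 && errno != EINTR)
            return -1;
    }
}
```

Because the actual I/O call never blocks, a spurious readiness notification costs only one extra loop iteration, which is the property the blocking-socket designs discussed earlier could not offer.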
Advantages:
- An application retains full control of its event loop in non-blocking mode.
  When using libssl in application-level blocking mode, via a custom poller interface, the application would actually be able to exercise more control over I/O than it can at present when using libssl in blocking mode.
- Feasible to implement and already working in tests. Minimises further development needed to ship.
- Does not rely on creating threads and can support blocking I/O at the application level without relying on thread assisted mode.
- Does not require an application-provided network-side custom BIO to be reworked to support concurrent calls to it.
- Allows performance-optimal implementation of QUIC RFC requirements.
- Ensures our internal I/O architecture remains flexible for future evolution without breaking compatibility.
Use of Internal Non-Blocking I/O
Based on the above evaluation, implementation has been undertaken using non-blocking I/O internally. Applications can use blocking or non-blocking I/O at the libssl API level. Network-level BIOs must operate in a non-blocking mode or be configurable by QUIC to this end.
Support of arbitrary BIOs
We need to support not just socket FDs but arbitrary BIOs as the basis for the use of QUIC. The use of QUIC with e.g. BIO_s_dgram_pair, a bidirectional memory buffer with datagram semantics, is to be supported as part of MVP. This must be reconciled with the desire to support application-managed event loops.
Broadly, the intention so far has been to enable the use of QUIC with an application event loop in application-level non-blocking mode by exposing an appropriate OS-level synchronisation primitive to the application. On *NIX platforms, this essentially means we provide the application with:
- An FD which should be polled for readability, writability, or both; and
- A deadline (if any is currently applicable).
Once either of these conditions is met, the QUIC state machine can be (potentially) advanced meaningfully, and the application is expected to reenter the QUIC state machine by calling SSL_tick() (or SSL_read() or SSL_write()).
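One way an application event loop might consume an FD and a deadline is sketched below. The wakeup_conditions structure and both function names are hypothetical and exist only for illustration; in a real application the FD would typically be registered with the application's own poller alongside its other FDs, and the point at which wait_for_tick returns is where the application would call SSL_tick().

```c
#include <limits.h>
#include <poll.h>
#include <stdint.h>

/* Hypothetical summary of what the application was told to wait for. */
struct wakeup_conditions {
    int fd;               /* FD to poll, or -1 if none */
    int want_read;        /* poll for readability */
    int want_write;       /* poll for writability */
    int has_deadline;     /* 0 if no deadline currently applies */
    uint64_t deadline_ms; /* absolute deadline, monotonic milliseconds */
};

/* Convert the absolute deadline into a poll() timeout argument. */
static int poll_timeout_ms(const struct wakeup_conditions *w, uint64_t now_ms)
{
    uint64_t left;

    if (!w->has_deadline)
        return -1; /* wait indefinitely */
    if (w->deadline_ms <= now_ms)
        return 0;  /* deadline already passed */
    left = w->deadline_ms - now_ms;
    return left > INT_MAX ? INT_MAX : (int)left;
}

/*
 * Sleep until the FD is ready or the deadline passes. Returns 1 when the
 * application should re-enter the state machine, or -1 if poll() fails.
 */
static int wait_for_tick(const struct wakeup_conditions *w, uint64_t now_ms)
{
    struct pollfd pfd;

    pfd.fd = w->fd;
    pfd.events = (short)((w->want_read ? POLLIN : 0)
                         | (w->want_write ? POLLOUT : 0));
    pfd.revents = 0;
    return poll(&pfd, 1, poll_timeout_ms(w, now_ms)) < 0 ? -1 : 1;
}
```

Both readiness and timer expiry lead to the same action, so the caller does not need to distinguish them.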
This model is readily supported when the read and write BIOs we are provided with are socket BIOs:
- The read-pollable FD is the FD of the read BIO.
- The write-pollable FD is the FD of the write BIO.
However, things become more complex when we are dealing with memory-based BIOs such as BIO_s_dgram_pair which do not naturally correspond to any OS primitive which can be used for synchronisation, or when we are dealing with an application-provided custom BIO.
Pollable and Non-Pollable BIOs
In order to accommodate these various cases, we draw a distinction between pollable and non-pollable BIOs.
- A pollable BIO is a BIO which can provide some kind of OS-level synchronisation primitive, which can be used to determine when the BIO might be able to do useful work once more.
- A non-pollable BIO has no naturally associated OS-level synchronisation primitive, but its state only changes in response to calls made to it (or to a related BIO, such as the other end of a pair).
Supporting Pollable BIOs
“OS-level synchronisation primitive” is deliberately vague. Most modern OSes use unified handle spaces (UNIX, Windows) though it is likely there are more obscure APIs on these platforms which have other handle spaces. However, this unification is not necessarily significant.
For example, Windows sockets are kernel handles and thus like any other object they can be used with the generic Win32 WaitForSingleObject() API, but not in a useful manner; the generic readiness mechanism for Windows handles is not plumbed in for socket handles, and so sockets are simply never considered ready for the purposes of this API, and a wait on a socket handle will never return. Instead, the WinSock-specific select() call must be used. On the other hand, other kinds of synchronisation primitive like a Win32 Event must use WaitForSingleObject().
Thus, while in theory most modern operating systems have unified handle spaces, in practice there are substantial usage differences between different handle types. As such, an API to expose a synchronisation primitive should be of a tagged union design supporting possible variation.
A BIO object will provide methods to retrieve a pollable OS-level synchronisation primitive which can be used to determine when the QUIC state machine can (potentially) do more work. This maintains the integrity of the BIO abstraction layer. Equivalent SSL object API calls which forward to the equivalent calls of the underlying network BIO will also be provided.
The core mechanic is as follows:
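#include <stdint.h> /* uint64_t */
/* Poll descriptor types. */
#define BIO_POLL_DESCRIPTOR_TYPE_NONE     0
#define BIO_POLL_DESCRIPTOR_TYPE_SOCK_FD  1
/* Types at or above this value are reserved for application-defined use. */
#define BIO_POLL_DESCRIPTOR_CUSTOM_START  8192
#define BIO_POLL_DESCRIPTOR_NUM_CUSTOM    4
typedef struct bio_poll_descriptor_st {
    int type;       /* tag selecting which member of value is valid */
    union {
        int fd;     /* valid for BIO_POLL_DESCRIPTOR_TYPE_SOCK_FD */
        union {
            void     *ptr;
            uint64_t  u64;
        } custom[BIO_POLL_DESCRIPTOR_NUM_CUSTOM]; /* valid for custom types */
    } value;
} BIO_POLL_DESCRIPTOR;
/* Retrieve the read and write poll descriptors of a BIO. */
int BIO_get_rpoll_descriptor(BIO *bio, BIO_POLL_DESCRIPTOR *desc);
int BIO_get_wpoll_descriptor(BIO *bio, BIO_POLL_DESCRIPTOR *desc);
/* Equivalent calls on an SSL object, forwarding to the underlying network BIOs. */
int SSL_get_rpoll_descriptor(SSL *ssl, BIO_POLL_DESCRIPTOR *desc);
int SSL_get_wpoll_descriptor(SSL *ssl, BIO_POLL_DESCRIPTOR *desc);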
Currently only a single descriptor type is defined, which is a FD on *NIX and a Winsock socket handle on Windows. These use the same type to minimise code changes needed on different platforms in the common case of an OS network socket. (Use of an int here is strictly incorrect for Windows; however, this style of usage is prevalent in the OpenSSL codebase, so for consistency we continue the pattern here.)
Poll descriptor types at or above BIO_POLL_DESCRIPTOR_CUSTOM_START are reserved for application-defined use. The value.custom field of the BIO_POLL_DESCRIPTOR structure is provided for applications to store values of their choice in. An application is free to define the semantics.
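As an illustration of an application-defined descriptor type, the sketch below repeats the declarations from this document locally so that it stands alone, then defines a custom type that carries a pointer to an application object. APP_POLL_DESCRIPTOR_TYPE_WAITER, struct app_waiter and the three helper functions are hypothetical names chosen for this example, and the choice of slot custom[0] is an application convention, not something the design mandates.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the declarations given in this document. */
#define BIO_POLL_DESCRIPTOR_TYPE_NONE     0
#define BIO_POLL_DESCRIPTOR_TYPE_SOCK_FD  1
#define BIO_POLL_DESCRIPTOR_CUSTOM_START  8192
#define BIO_POLL_DESCRIPTOR_NUM_CUSTOM    4

typedef struct bio_poll_descriptor_st {
    int type;
    union {
        int fd;
        union {
            void     *ptr;
            uint64_t  u64;
        } custom[BIO_POLL_DESCRIPTOR_NUM_CUSTOM];
    } value;
} BIO_POLL_DESCRIPTOR;

/* Hypothetical application-defined type and object. */
#define APP_POLL_DESCRIPTOR_TYPE_WAITER (BIO_POLL_DESCRIPTOR_CUSTOM_START + 0)

struct app_waiter {
    int signalled;
};

/* Fill desc so that it refers to the application's waiter object. */
static void app_fill_descriptor(BIO_POLL_DESCRIPTOR *desc,
                                struct app_waiter *w)
{
    memset(desc, 0, sizeof(*desc));
    desc->type = APP_POLL_DESCRIPTOR_TYPE_WAITER;
    desc->value.custom[0].ptr = w;
}

/* True if the descriptor type lies in the application-reserved range. */
static int poll_descriptor_is_custom(const BIO_POLL_DESCRIPTOR *desc)
{
    return desc->type >= BIO_POLL_DESCRIPTOR_CUSTOM_START;
}

/* Recover the waiter, or NULL if the descriptor is of another type. */
static struct app_waiter *
app_waiter_from_descriptor(const BIO_POLL_DESCRIPTOR *desc)
{
    if (desc->type != APP_POLL_DESCRIPTOR_TYPE_WAITER)
        return NULL;
    return desc->value.custom[0].ptr;
}
```

A custom poller would check the type tag first and only then interpret value.custom, exactly as a consumer of any tagged union would.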
libssl will not know how to poll custom poll descriptors itself, thus these are only useful when the application will provide a custom poller function, which performs polling on behalf of libssl and which implements support for those custom poll descriptors.
For BIO_s_ssl, the BIO_get_[rw]poll_descriptor functions are equivalent to the SSL_get_[rw]poll_descriptor functions. The SSL_get_[rw]poll_descriptor functions are equivalent to calling BIO_get_[rw]poll_descriptor on the underlying BIOs provided to the SSL object. For a socket BIO, this will likely just yield the socket's FD. For memory-based BIOs, see below.
Supporting Non-Pollable BIOs
Where we are provided with a non-pollable BIO, we cannot provide the application with any primitive used for synchronisation and it is assumed that the application will handle its own network I/O, for example via a BIO_s_dgram_pair.
When libssl calls BIO_get_[rw]poll_descriptor on the underlying BIO, the call fails, indicating that a non-pollable BIO is being used. Thus, if an application calls SSL_get_[rw]poll_descriptor, that call also fails.
There are various circumstances which need to be handled:
- The QUIC implementation wants to write data to the network but is currently unable to (e.g. BIO_s_dgram_pair is full).
  This is not hard as our internal TX record layer allows arbitrary buffering. The only limit comes when QUIC flow control (which only applies to application stream data) applies a limit; then calls to e.g. SSL_write must fail with SSL_ERROR_WANT_WRITE.
- The QUIC implementation wants to read data from the network but is currently unable to (e.g. BIO_s_dgram_pair is empty).
  Here calls like SSL_read need to fail with SSL_ERROR_WANT_READ; we thereby support libssl's classic non-blocking I/O interface.
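The full and empty conditions above can be pictured with a toy in-memory datagram queue. This is not BIO_s_dgram_pair and shares no code with it; all toyq_* names and the tiny capacities are hypothetical. It shows only the defining property of a non-pollable object: its state changes solely in response to calls made on it, so a caller that gets a retry indication has nothing to wait on and must simply try again later.

```c
#include <stddef.h>
#include <string.h>

#define TOYQ_SLOTS  2      /* deliberately tiny so "full" is easy to reach */
#define TOYQ_MTU    1200
#define TOYQ_RETRY  (-1)   /* queue full on write, or empty on read */
#define TOYQ_TOOBIG (-2)   /* datagram exceeds TOYQ_MTU */

struct toy_dgram_queue {
    unsigned char data[TOYQ_SLOTS][TOYQ_MTU];
    size_t len[TOYQ_SLOTS];
    size_t head;  /* index of the oldest queued datagram */
    size_t count; /* number of queued datagrams */
};

/* Queue one datagram. Returns its length, TOYQ_RETRY or TOYQ_TOOBIG. */
static int toyq_write(struct toy_dgram_queue *q, const void *buf, size_t len)
{
    size_t slot;

    if (len > TOYQ_MTU)
        return TOYQ_TOOBIG;
    if (q->count == TOYQ_SLOTS)
        return TOYQ_RETRY;
    slot = (q->head + q->count) % TOYQ_SLOTS;
    memcpy(q->data[slot], buf, len);
    q->len[slot] = len;
    q->count++;
    return (int)len;
}

/*
 * Dequeue one datagram, truncating it to cap bytes if necessary.
 * Returns the number of bytes copied, or TOYQ_RETRY if the queue is empty.
 */
static int toyq_read(struct toy_dgram_queue *q, void *buf, size_t cap)
{
    size_t n;

    if (q->count == 0)
        return TOYQ_RETRY;
    n = q->len[q->head] < cap ? q->len[q->head] : cap;
    memcpy(buf, q->data[q->head], n);
    q->head = (q->head + 1) % TOYQ_SLOTS;
    q->count--;
    return (int)n;
}
```

Datagram boundaries are preserved: each read returns at most one queued datagram, never a concatenation of several.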
It is worth noting that theoretically a memory-based BIO could be implemented which is pollable, for example using condition variables. An application could implement a custom BIO, custom poll descriptor and custom poller to facilitate this.
Configuration of Blocking vs. Non-Blocking Mode
Traditionally an SSL object has operated either in blocking mode or non-blocking mode without requiring explicit configuration; if a socket returns EWOULDBLOCK or similar, it is handled appropriately, and if a socket call blocks, there is no issue. Since the QUIC implementation is building on non-blocking I/O, this implicit configuration of non-blocking mode is not feasible.
Note that Windows does not have an API for determining whether a socket is in blocking mode, so it is not possible to use the initial state of an underlying socket to determine if the application wants to use non-blocking I/O or not. Moreover this would undermine the BIO abstraction.
As such, explicit calls are introduced to configure and query the blocking mode of an SSL (QUIC) object:
int SSL_set_blocking_mode(SSL *s, int blocking);
int SSL_get_blocking_mode(SSL *s);
Applications desiring non-blocking operation will need to call this API to configure a new QUIC connection accordingly. Blocking mode is chosen as the default for parity with traditional Berkeley sockets APIs and to make things simpler for blocking applications, which are likely to be seeking a simpler solution. However, blocking mode cannot be supported with a non-pollable BIO, and thus blocking mode defaults to off when used with such a BIO.
A method is also needed for the QUIC implementation to inform an underlying BIO that it must not block. The SSL object will call this method when it is provided with an underlying BIO. For a socket BIO this can set the socket as non-blocking; for a memory-based BIO it is a no-op; for BIO_s_ssl it is equivalent to a call to SSL_set_blocking_mode().
Internal Polling
When blocking mode is configured, the QUIC implementation will call BIO_get_[rw]poll_descriptor on the underlying BIOs and use a suitable OS function (e.g. select()) or, if configured, a custom poller function, to block. This will be implemented by an internal function which can accept up to two poll descriptors (one for the read BIO, one for the write BIO), which might be identical.
Blocking mode cannot be used with a non-pollable underlying BIO. If BIO_get_[rw]poll_descriptor is not implemented for either of the underlying read and write BIOs, blocking mode cannot be enabled and blocking mode defaults to off.
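The two rules above (blocking needs both descriptors, and the two descriptors may be identical) can be sketched as follows. The simple_poll_desc structure is a hypothetical simplification covering only the socket FD case, and poll(3) stands in for whichever OS function or custom poller is actually used.

```c
#include <poll.h>
#include <stddef.h>

/* Hypothetical, simplified poll descriptor: socket FD case only. */
struct simple_poll_desc {
    int valid; /* 0 if the BIO is non-pollable (descriptor lookup failed) */
    int fd;
};

/* Blocking mode is only possible if both directions are pollable. */
static int blocking_mode_supported(const struct simple_poll_desc *r,
                                   const struct simple_poll_desc *w)
{
    return r->valid && w->valid;
}

/*
 * Build the pollfd array for an internal blocking wait. If the read and
 * write descriptors refer to the same FD, a single entry with merged
 * events is produced. Returns the number of entries written (1 or 2),
 * or 0 if blocking mode cannot be supported.
 */
static size_t build_pollfds(const struct simple_poll_desc *r,
                            const struct simple_poll_desc *w,
                            int want_read, int want_write,
                            struct pollfd out[2])
{
    short revents_mask = (short)(want_read ? POLLIN : 0);
    short wevents_mask = (short)(want_write ? POLLOUT : 0);

    if (!blocking_mode_supported(r, w))
        return 0;

    if (r->fd == w->fd) {
        out[0].fd = r->fd;
        out[0].events = (short)(revents_mask | wevents_mask);
        out[0].revents = 0;
        return 1;
    }

    out[0].fd = r->fd;
    out[0].events = revents_mask;
    out[0].revents = 0;
    out[1].fd = w->fd;
    out[1].events = wevents_mask;
    out[1].revents = 0;
    return 2;
}
```

The resulting array would then be passed to poll(3) together with a timeout derived from the next timer deadline.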