mirror of
https://github.com/openssl/openssl.git
synced 2025-01-18 13:44:20 +08:00
dda864793e
Reviewed-by: Tomas Mraz <tomas@openssl.org> Reviewed-by: Paul Dale <pauli@openssl.org> (Merged from https://github.com/openssl/openssl/pull/19770)
486 lines
23 KiB
Markdown
QUIC I/O Architecture
=====================

This document discusses possible implementation options for the I/O architecture
internal to the libssl QUIC implementation, discusses the underlying design
constraints driving this decision and introduces the resulting I/O architecture.
It also identifies potential hazards to existing applications, and identifies
how those hazards are mitigated.

Objectives
----------

The OpenSSL QUIC API design is intended to meet the following objectives,
amongst others:

- We want to support both blocking and non-blocking semantics
  for application use of the libssl APIs.

- In the case of non-blocking applications, it must be possible
  for an application to do its own polling and make its own event
  loop.

Requirements
------------

These requirements are complicated by the fact that traditional use of the
libssl API allows an application to pass an arbitrary BIO to an SSL object; not
only that, separate BIOs can be passed for the read and write directions. The
nature of this BIO can be arbitrary; it could be a socket, or a memory buffer.

Implementation of QUIC will require that the underlying network BIO passed to
the QUIC implementation be configured to support datagram semantics instead of
bytestream semantics as has been the case with traditional TLS over TCP.

Implementation of QUIC requires handling of timer events as well as the
circumstances where a network socket becomes readable or writable. In many cases
we need to handle these events simultaneously (e.g. wait until a socket becomes
readable, or a timeout expires, whichever comes first).

Blocking vs. Non-Blocking I/O
-----------------------------

The above constraints make it effectively a requirement that non-blocking I/O be
used for the calls to the underlying network BIOs. To illustrate this point, we
first consider how QUIC might be implemented using blocking I/O internally.

To function correctly and provide blocking semantics at the application level,
our QUIC implementation must be able to block such that it can respond to any of
the following events for the underlying network read and write BIOs immediately:

- The underlying network write BIO becomes writeable;
- The underlying network read BIO becomes readable;
- A timeout expires.

### Blocking sockets and select(3)

Firstly, consider how this might be accomplished using the Berkeley sockets API.
Blocking on all three wakeup conditions listed above would require use of an API
such as select(3) or poll(3), regardless of whether the network socket is
configured in blocking mode or not.

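
As an illustration of waiting on all three wakeup conditions at once, the
following is a minimal sketch using POSIX select(3). The function name
`wait_for_event` and the `WAKE_*` flags are hypothetical and are not part of
any OpenSSL API; the sketch also assumes the FDs are below `FD_SETSIZE`.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

#define WAKE_READABLE 0x1   /* hypothetical flag: rfd became readable */
#define WAKE_WRITABLE 0x2   /* hypothetical flag: wfd became writable */

/*
 * Block until rfd is readable, wfd is writable, or timeout_ms elapses,
 * whichever comes first. Pass -1 for an FD which is not of interest.
 * Returns a bitmask of WAKE_* flags, 0 on timeout, or -1 on error.
 */
int wait_for_event(int rfd, int wfd, long timeout_ms)
{
    fd_set rset, wset;
    struct timeval tv;
    int maxfd = rfd > wfd ? rfd : wfd;
    int rc, events = 0;

    FD_ZERO(&rset);
    FD_ZERO(&wset);
    if (rfd >= 0)
        FD_SET(rfd, &rset);
    if (wfd >= 0)
        FD_SET(wfd, &wset);

    tv.tv_sec  = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    rc = select(maxfd + 1, &rset, &wset, NULL, &tv);
    if (rc < 0)
        return -1;

    /* On timeout select() clears both sets, so events stays 0. */
    if (rfd >= 0 && FD_ISSET(rfd, &rset))
        events |= WAKE_READABLE;
    if (wfd >= 0 && FD_ISSET(wfd, &wset))
        events |= WAKE_WRITABLE;
    return events;
}
```

Note that a wakeup of this kind only reports readiness; it says nothing about
whether a subsequent I/O call on a socket in blocking mode will return promptly.
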
While in principle APIs such as select(3) can be used with a socket in blocking
mode, this is not an advisable usage mode. If a socket is in blocking mode,
calls to send(3) or recv(3) may block for some arbitrary period of time, meaning
that our QUIC implementation cannot handle incoming data (if we are blocked on
send), send outgoing data (if we are blocked on receive), or handle timeout
events.

Though it can be argued that a select(3) call indicating readability or
writeability should guarantee that a subsequent send(3) or recv(3) call will not
block, there are several reasons why this is an extremely undesirable solution:

- Some OSes perform spurious wakeups from select(3). On Linux, for example,
  select(2) is documented as able to report a socket as readable even though a
  subsequent read blocks, such as when a received datagram is discarded due to
  a bad checksum.

- The fact that a socket is writeable does not necessarily mean that a datagram
  of the size we wish to send is writeable, so a send(3) call could block
  anyway.

- This usage pattern precludes multithreaded use barring some locking scheme
  due to the possibility of other threads racing between the call to select(3)
  and the subsequent I/O call. This undermines our intentions to support
  multithreaded network I/O on the backend.

Moreover, our QUIC implementation will not drive the Berkeley sockets API
directly but uses the BIO abstraction to access the network, so these issues are
then compounded by the limitations of our existing BIO interfaces. We do not
have a BIO interface which provides for select(3)-like functionality or which
can implement the required semantics above. Therefore, trying to implement QUIC
on top of blocking I/O in this way would require violating the BIO abstraction
layer, and would not work with custom BIOs.

### Blocking sockets and threads

Another conceptual possibility is that blocking calls could be kept ongoing in
parallel threads. Under this model, there would be three threads:

- a thread which exists solely to execute blocking calls to the `BIO_write` of
  an underlying network BIO,
- a thread which exists solely to execute blocking calls to the `BIO_read` of an
  underlying network BIO,
- a thread which exists solely to wait for and dispatch timeout events.

This has a large number of disadvantages:

- There is a hard requirement for threading functionality in order to be
  able to support blocking semantics at the application level. Applications
  which require blocking semantics would only be able to function in thread
  assisted mode. In environments where threading support is not available or
  desired, our APIs would only be usable in a non-blocking fashion.

- Several threads are spawned which the application is not in control of.
  This undermines our general approach of providing the application with control
  over OpenSSL's use of resources, such as allowing the application to do its
  own polling or provide its own allocators.

  At a minimum for a client, there must be two threads per connection. This
  means if an application opens `n` outgoing connections, there will need
  to be `2n` extra threads spawned.

- Blocking in `BIO_write` calls precludes correct implementation of
  QUIC. Unlike any analogue in TLS, QUIC packets are time sensitive and intended
  to be transmitted as soon as they are generated. QUIC packets contain fields
  such as the ACK Delay value, which is intended to describe the time between a
  packet being received and a return packet being generated. Correct calculation
  of this field is necessary for correct calculation of connection RTT. It is
  therefore important to only generate packets when they are ready to be sent,
  otherwise suboptimal performance will result. This is a usage model which
  aligns optimally to non-blocking I/O and which cannot be accommodated
  by blocking I/O.

- Since existing custom BIOs will not be expecting concurrent `BIO_read` and
  `BIO_write` calls, they will need to be adapted to support this, which is
  likely to require substantial rework of those custom BIOs (trivial locking of
  calls obviously does not work since both of these calls must be able to block
  on network I/O simultaneously).

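
The ACK Delay point in the list above can be made concrete with a small worked
example. The sketch below applies the acknowledgement delay adjustment described
in RFC 9002 to a single RTT sample; the function name `adjust_rtt_sample` is
hypothetical, all values are assumed to be in microseconds, and the rest of the
RTT estimator (smoothing, variance, the `max_ack_delay` clamp) is omitted.

```c
#include <stdint.h>

/*
 * Adjust one RTT sample for the peer's reported ACK Delay.
 *
 * latest_rtt: time from sending a packet to receiving its acknowledgement
 * min_rtt:    smallest RTT sample observed so far on this path
 * ack_delay:  delay the peer reported between receiving the packet and
 *             generating the acknowledgement
 *
 * The delay is only subtracted if the result would not fall below min_rtt.
 */
uint64_t adjust_rtt_sample(uint64_t latest_rtt, uint64_t min_rtt,
                           uint64_t ack_delay)
{
    if (latest_rtt >= min_rtt + ack_delay)
        return latest_rtt - ack_delay;
    return latest_rtt;
}
```

Suppose the true path RTT is 20ms and the peer holds an acknowledgement for 5ms
before generating the packet, reporting an ACK Delay of 5ms. If the packet is
sent immediately, the sample is 25ms and the adjusted value is the true 20ms. If
the same packet instead sits in a blocked `BIO_write` call for a further 30ms
after being generated, the sample is 55ms but the reported delay is still 5ms,
so the adjusted value is 50ms and the RTT estimate is inflated.
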
Moreover, this does not appear to be a realistically implementable approach:

- This approach raises the question of how to handle connection teardown, which
  does not seem to be solvable. If parallel threads are blocking in blocking
  `BIO_read` and `BIO_write` calls on some underlying network BIO, there needs
  to be some way to force these calls to return once `SSL_free` is called and we
  need to tear down the connection. However, the BIO interface does not provide
  any way to do this. *At best* we might assume the BIO is a `BIO_s_dgram`
  (but cannot assume this in the general case), but even then we can only
  accomplish teardown by violating the BIO abstraction and closing the
  underlying socket.

  This is the only portable way to ensure that a recv(3) call to the same socket
  returns. This obviously is a highly application-visible change (and is likely
  to be far more disruptive than configuring the socket into non-blocking mode).

  Moreover, it is not workable anyway because it only works for a socket-based
  BIO and violates the BIO abstraction. For BIOs in general, there does not
  appear to be any viable solution to the teardown issue.

Even if this approach were successfully implemented, applications will still
need to change to using network BIOs with datagram semantics. For applications
using custom BIOs, this is likely to require substantial rework of those BIOs.
There is no possible way around this. Thus, even if this solution were adopted
(notwithstanding the issues which preclude this noted above) for the purposes of
accommodating applications using custom network BIOs in a blocking mode, these
applications would still have to completely rework their implementation of those
BIOs. In any case, it is expected to be very rare that sophisticated
applications implementing their own custom BIOs will do so in a blocking mode.

### Use of non-blocking I/O

By comparison, use of non-blocking I/O and select(3) or similar APIs on the
network side makes satisfying our requirements for QUIC easy, and also allows
our internal approach to I/O to be flexibly adapted in the future as
requirements may evolve.

This is also the approach used by all other known QUIC implementations; it is
highly unlikely that any QUIC implementations exist which use blocking network
I/O since, as mentioned above, it would lead to suboptimal performance due to
the ACK delay issue.

Note that this is orthogonal to whether we provide blocking I/O semantics to the
application. We can use non-blocking I/O internally while using this to provide
either blocking or non-blocking semantics to the application, based on what the
application requests.

This approach in general requires that a network socket be configured in
non-blocking mode. Though some OSes support a `MSG_DONTWAIT` flag which allows a
single I/O operation to be made non-blocking, not all OSes support this (e.g.
Windows), thus this cannot be relied on. As such, we need to configure any
socket FD we use into non-blocking mode.

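
For illustration, configuring a socket FD into non-blocking mode and handling
the resulting "would block" condition might look like the following on POSIX
systems. This is a sketch only: `set_nonblocking`, `try_recv_dgram` and the
`RECV_*` constants are hypothetical names, and on Windows the equivalent
configuration would use `ioctlsocket()` with `FIONBIO`, which is not shown.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

#define RECV_WOULD_BLOCK (-1)   /* hypothetical: no datagram available now */
#define RECV_ERROR       (-2)   /* hypothetical: any other failure */

/* Put fd into non-blocking mode. Returns 0 on success, -1 on failure. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Try to receive one datagram without blocking. Returns the datagram length
 * (which may be 0), RECV_WOULD_BLOCK if nothing is queued, or RECV_ERROR.
 */
long try_recv_dgram(int fd, void *buf, size_t len)
{
    ssize_t n = recv(fd, buf, len, 0);

    if (n >= 0)
        return (long)n;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return RECV_WOULD_BLOCK;
    return RECV_ERROR;
}
```

Because the socket itself is non-blocking, no per-call flag such as
`MSG_DONTWAIT` is needed, which is the property the paragraph above relies on.
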
Of the approaches outlined in this document, the use of non-blocking I/O has the
fewest disadvantages and is the only approach which appears to actually be
implementable in practice. Moreover, each disadvantage can be readily mitigated:

- We rely on having a select(3) or poll(3) like function available from the
  OS.

  However:

  - Firstly, we already rely on select(3) in our code, so this does not appear
    to raise any portability issues;

  - Secondly, we have the option of providing a custom poller interface which
    allows an application to provide its own implementation of a
    select(3)-like function. In fact, this has the potential to be quite
    powerful and would allow the application to implement its own pollable
    BIOs, and therefore perform blocking I/O on top of any custom BIO.

    For example, while historically none of our own memory-based BIOs have
    supported blocking semantics, a sophisticated application could if it
    wished choose to implement a custom blocking memory BIO and implement a
    custom poller which synchronises using a custom poll descriptor based
    around condition variables rather than sockets. Thus this scheme is
    highly flexible.

    (It is worth noting also that the implementation of blocking semantics at
    the application level also does not rely on any privileged access to the
    internals of the QUIC implementation and an application could if it wished
    build blocking semantics out of a non-blocking QUIC instance; this is not
    particularly difficult, though providing custom pollers here would mean
    there should be no need for an application to do so.)

- Configuring a socket into non-blocking mode might confuse an application.

  However:

  - Applications will already have to make changes to any network-side BIOs,
    for example switching from a `BIO_s_socket` to a `BIO_s_dgram`, or from a
    BIO pair to a `BIO_s_dgram_pair`. Custom BIOs will need to be
    substantially reworked to switch from bytestream semantics to datagram
    semantics. Such applications will already need substantial changes, and
    this is unavoidable.

    Of course, application impacts and migration guidance can (and will) all
    be documented.

  - In order for an application to be confused by us putting a socket into
    non-blocking mode, it would need to be trying to use the socket in some
    way. But it is not possible for an application to pass a socket to our
    QUIC implementation, and also try to use the socket directly, and have
    QUIC still work. Using QUIC necessarily requires that an application not
    also be trying to make use of the same socket.

  - There are some circumstances where an application might want to multiplex
    other protocols onto the same UDP socket, for example with protocols like
    RTP/RTCP or STUN; this can be facilitated using the QUIC fixed bit.
    However, these use cases cannot be supported without explicit assistance
    from a QUIC implementation and this use case cannot be facilitated by
    simply sharing a network socket, as incoming datagrams will not be routed
    correctly. (We may offer some functionality in future to allow this to be
    coordinated but this is not for MVP.) Thus this also is not a concern.
    Moreover, it is extremely unlikely that any such applications are using
    sockets in blocking mode anyway.

Advantages:

- An application retains full control of its event loop in non-blocking mode.

  When using libssl in application-level blocking mode, via a custom poller
  interface, the application would actually be able to exercise more control
  over I/O than it can at present when using libssl in blocking mode.

- Feasible to implement and already working in tests.
  Minimises further development needed to ship.

- Does not rely on creating threads and can support blocking I/O at the
  application level without relying on thread assisted mode.

- Does not require an application-provided network-side custom BIO to be
  reworked to support concurrent calls to it.

- Allows performance-optimal implementation of QUIC RFC requirements.

- Ensures our internal I/O architecture remains flexible for future evolution
  without breaking compatibility in the future.

Use of Internal Non-Blocking I/O
--------------------------------

Based on the above evaluation, implementation has been undertaken using
non-blocking I/O internally. Applications can use blocking or non-blocking I/O
at the libssl API level. Network-level BIOs must operate in a non-blocking mode
or be configurable by QUIC to this end.

![Block Diagram](images/quic-io-arch-1.png "Block Diagram")

### Support of arbitrary BIOs

We need to support not just socket FDs but arbitrary BIOs as the basis for the
use of QUIC. The use of QUIC with e.g. `BIO_s_dgram_pair`, a bidirectional
memory buffer with datagram semantics, is to be supported as part of MVP. This
must be reconciled with the desire to support application-managed event loops.

Broadly, the intention so far has been to enable the use of QUIC with an
application event loop in application-level non-blocking mode by exposing an
appropriate OS-level synchronisation primitive to the application. On \*NIX
platforms, this essentially means we provide the application with:

- An FD which should be polled for readability, writability, or both; and
- A deadline (if any is currently applicable).

Once either of these conditions is met, the QUIC state machine can be
(potentially) advanced meaningfully, and the application is expected to reenter
the QUIC state machine by calling `SSL_tick()` (or `SSL_read()` or
`SSL_write()`).

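
An application event loop typically has to turn the deadline into a relative
timeout before it can wait on the FD. The helper below is a minimal sketch of
that step; `deadline_to_timeout` is a hypothetical name, and the deadline and
current time are assumed to be microsecond values taken from the same clock.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/*
 * Convert an absolute deadline into a relative timeout suitable for passing
 * to select(). Returns NULL (meaning "wait indefinitely") if no deadline is
 * currently applicable, otherwise fills in and returns tv. A deadline which
 * has already passed yields a zero timeout, so the caller wakes immediately.
 */
struct timeval *deadline_to_timeout(int has_deadline, uint64_t deadline_us,
                                    uint64_t now_us, struct timeval *tv)
{
    uint64_t remaining;

    if (!has_deadline)
        return NULL;

    remaining = deadline_us > now_us ? deadline_us - now_us : 0;
    tv->tv_sec  = (time_t)(remaining / 1000000);
    tv->tv_usec = (suseconds_t)(remaining % 1000000);
    return tv;
}
```

The application would then wait on the FD with the returned timeout and reenter
the QUIC state machine once either condition is met.
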
This model is readily supported when the read and write BIOs we are provided
with are socket BIOs:

- The read-pollable FD is the FD of the read BIO.
- The write-pollable FD is the FD of the write BIO.

However, things become more complex when we are dealing with memory-based BIOs
such as `BIO_s_dgram_pair` which do not naturally correspond to any OS primitive
which can be used for synchronisation, or when we are dealing with an
application-provided custom BIO.

### Pollable and Non-Pollable BIOs

In order to accommodate these various cases, we draw a distinction between
pollable and non-pollable BIOs.

- A pollable BIO is a BIO which can provide some kind of OS-level
  synchronisation primitive, which can be used to determine when
  the BIO might be able to do useful work once more.

- A non-pollable BIO has no naturally associated OS-level synchronisation
  primitive, but its state only changes in response to calls made to it (or to
  a related BIO, such as the other end of a pair).

#### Supporting Pollable BIOs

“OS-level synchronisation primitive” is deliberately vague. Most modern OSes use
unified handle spaces (UNIX, Windows) though it is likely there are more obscure
APIs on these platforms which have other handle spaces. However, this
unification is not necessarily significant.

For example, Windows sockets are kernel handles and thus like any other object
they can be used with the generic Win32 `WaitForSingleObject()` API, but not in
a useful manner; the generic readiness mechanism for Windows handles is not
plumbed in for socket handles, and so sockets are simply never considered ready
for the purposes of this API, which will never return. Instead, the
WinSock-specific `select()` call must be used. On the other hand, other kinds of
synchronisation primitive like a Win32 Event must use `WaitForSingleObject()`.

Thus while in theory most modern operating systems have unified handle spaces,
in practice there are substantial usage differences between different handle
types. As such, an API to expose a synchronisation primitive should be of a
tagged union design supporting possible variation.

A BIO object will provide methods to retrieve a pollable OS-level
synchronisation primitive which can be used to determine when the QUIC state
machine can (potentially) do more work. This maintains the integrity of the BIO
abstraction layer. Equivalent SSL object API calls which forward to the
equivalent calls of the underlying network BIO will also be provided.

The core mechanic is as follows:

```c
#define BIO_POLL_DESCRIPTOR_TYPE_NONE       0
#define BIO_POLL_DESCRIPTOR_TYPE_SOCK_FD    1
#define BIO_POLL_DESCRIPTOR_CUSTOM_START    8192

#define BIO_POLL_DESCRIPTOR_NUM_CUSTOM      4

typedef struct bio_poll_descriptor_st {
    int type;                   /* one of BIO_POLL_DESCRIPTOR_TYPE_* */
    union {
        int fd;                 /* BIO_POLL_DESCRIPTOR_TYPE_SOCK_FD */
        union {
            void        *ptr;
            uint64_t    u64;
        } custom[BIO_POLL_DESCRIPTOR_NUM_CUSTOM]; /* application-defined */
    } value;
} BIO_POLL_DESCRIPTOR;

int BIO_get_rpoll_descriptor(BIO *bio, BIO_POLL_DESCRIPTOR *desc);
int BIO_get_wpoll_descriptor(BIO *bio, BIO_POLL_DESCRIPTOR *desc);

int SSL_get_rpoll_descriptor(SSL *ssl, BIO_POLL_DESCRIPTOR *desc);
int SSL_get_wpoll_descriptor(SSL *ssl, BIO_POLL_DESCRIPTOR *desc);
```

Currently only a single descriptor type is defined, which is an FD on \*NIX and
a Winsock socket handle on Windows. These use the same type to minimise code
changes needed on different platforms in the common case of an OS network
socket. (Use of an `int` here is strictly incorrect for Windows; however, this
style of usage is prevalent in the OpenSSL codebase, so for consistency we
continue the pattern here.)

Poll descriptor types at or above `BIO_POLL_DESCRIPTOR_CUSTOM_START` are
reserved for application-defined use. The `value.custom` field of the
`BIO_POLL_DESCRIPTOR` structure is provided for applications to store values of
their choice in. An application is free to define the semantics.

libssl will not know how to poll custom poll descriptors itself, thus these are
only useful when the application will provide a custom poller function, which
performs polling on behalf of libssl and which implements support for those
custom poll descriptors.

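
As a sketch of how an application might use a custom poll descriptor type, the
example below stores a pointer to an application object in `value.custom[0]`
and dispatches on the descriptor type. The definitions are local copies of
those shown above so that the example is self-contained. Everything prefixed
`app_` or `APP_` is hypothetical, and the "poller" only checks readiness once
rather than blocking; a real custom poller would more likely wait on something
like a condition variable.

```c
#include <stdint.h>

/* Local copies of the definitions from this document. */
#define BIO_POLL_DESCRIPTOR_TYPE_NONE       0
#define BIO_POLL_DESCRIPTOR_TYPE_SOCK_FD    1
#define BIO_POLL_DESCRIPTOR_CUSTOM_START    8192
#define BIO_POLL_DESCRIPTOR_NUM_CUSTOM      4

typedef struct bio_poll_descriptor_st {
    int type;
    union {
        int fd;
        union {
            void        *ptr;
            uint64_t    u64;
        } custom[BIO_POLL_DESCRIPTOR_NUM_CUSTOM];
    } value;
} BIO_POLL_DESCRIPTOR;

/* Hypothetical application-defined descriptor type. */
#define APP_POLL_DESCRIPTOR_TYPE_MEM_CHANNEL \
    (BIO_POLL_DESCRIPTOR_CUSTOM_START + 0)

/* Hypothetical application object backing a memory-based custom BIO. */
typedef struct {
    int pending;    /* non-zero if a datagram is waiting to be read */
} app_mem_channel;

/* Fill in a poll descriptor referring to the given channel. */
void app_make_descriptor(app_mem_channel *ch, BIO_POLL_DESCRIPTOR *desc)
{
    desc->type = APP_POLL_DESCRIPTOR_TYPE_MEM_CHANNEL;
    desc->value.custom[0].ptr = ch;
}

/*
 * Single readiness check performed by the application's poller.
 * Returns 1 if ready, 0 if not ready, -1 if the type is not one we handle.
 */
int app_poll_once(const BIO_POLL_DESCRIPTOR *desc)
{
    const app_mem_channel *ch;

    if (desc->type != APP_POLL_DESCRIPTOR_TYPE_MEM_CHANNEL)
        return -1;

    ch = desc->value.custom[0].ptr;
    return ch->pending != 0;
}
```
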
For `BIO_s_ssl`, the `BIO_get_[rw]poll_descriptor` functions are equivalent to
the `SSL_get_[rw]poll_descriptor` functions. The `SSL_get_[rw]poll_descriptor`
functions are equivalent to calling `BIO_get_[rw]poll_descriptor` on the
underlying BIOs provided to the SSL object. For a socket BIO, this will likely
just yield the socket's FD. For memory-based BIOs, see below.

#### Supporting Non-Pollable BIOs

Where we are provided with a non-pollable BIO, we cannot provide the application
with any primitive used for synchronisation and it is assumed that the
application will handle its own network I/O, for example via a
`BIO_s_dgram_pair`.

When libssl calls `BIO_get_[rw]poll_descriptor` on the underlying BIO, the call
fails, indicating that a non-pollable BIO is being used. Thus, if an application
calls `SSL_get_[rw]poll_descriptor`, that call also fails.

There are various circumstances which need to be handled:

- The QUIC implementation wants to write data to the network but
  is currently unable to (e.g. `BIO_s_dgram_pair` is full).

  This is not hard as our internal TX record layer allows arbitrary buffering.
  The only limit comes when QUIC flow control (which only applies to
  application stream data) applies a limit; then calls to e.g. `SSL_write`
  must fail with `SSL_ERROR_WANT_WRITE`.

- The QUIC implementation wants to read data from the network
  but is currently unable to (e.g. `BIO_s_dgram_pair` is empty).

  Here calls like `SSL_read` need to fail with `SSL_ERROR_WANT_READ`; we
  thereby support libssl's classic nonblocking I/O interface.

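
The "full" and "empty" conditions above can be illustrated with a toy
fixed-capacity datagram queue. This is not how `BIO_s_dgram_pair` is
implemented; it is a hypothetical, single-threaded sketch (all `dgq_` and
`DGQ_` names are invented here) showing why a non-pollable memory buffer can
only report "try again later" rather than block.

```c
#include <stddef.h>
#include <string.h>

#define DGQ_SLOTS       2       /* toy capacity: at most two datagrams */
#define DGQ_MAX_DGRAM   1200    /* toy per-datagram size limit */
#define DGQ_WOULD_BLOCK (-1)    /* queue is full (send) or empty (recv) */
#define DGQ_TOO_LARGE   (-2)

typedef struct {
    unsigned char data[DGQ_SLOTS][DGQ_MAX_DGRAM];
    size_t len[DGQ_SLOTS];
    size_t head;    /* index of the oldest queued datagram */
    size_t count;   /* number of queued datagrams */
} dgram_queue;

void dgq_init(dgram_queue *q)
{
    memset(q, 0, sizeof(*q));
}

/* Queue one datagram; never blocks. Returns len or a negative DGQ_* code. */
int dgq_send(dgram_queue *q, const void *buf, size_t len)
{
    size_t tail;

    if (len > DGQ_MAX_DGRAM)
        return DGQ_TOO_LARGE;
    if (q->count == DGQ_SLOTS)
        return DGQ_WOULD_BLOCK;     /* analogous to the pair being full */

    tail = (q->head + q->count) % DGQ_SLOTS;
    memcpy(q->data[tail], buf, len);
    q->len[tail] = len;
    q->count++;
    return (int)len;
}

/* Dequeue one datagram; never blocks. Excess bytes beyond cap are dropped. */
int dgq_recv(dgram_queue *q, void *buf, size_t cap)
{
    size_t n;

    if (q->count == 0)
        return DGQ_WOULD_BLOCK;     /* analogous to the pair being empty */

    n = q->len[q->head] < cap ? q->len[q->head] : cap;
    memcpy(buf, q->data[q->head], n);
    q->head = (q->head + 1) % DGQ_SLOTS;
    q->count--;
    return (int)n;
}
```

The state of such a queue only changes when one side calls into it, so the
caller has to retry after the other side has made progress, in the same way
that an application retries after `SSL_ERROR_WANT_READ` or
`SSL_ERROR_WANT_WRITE`.
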
It is worth noting that theoretically a memory-based BIO could be implemented
which is pollable, for example using condition variables. An application could
implement a custom BIO, custom poll descriptor and custom poller to facilitate
this.

### Configuration of Blocking vs. Non-Blocking Mode

Traditionally an SSL object has operated either in blocking mode or non-blocking
mode without requiring explicit configuration; if a socket returns EWOULDBLOCK
or similar, it is handled appropriately, and if a socket call blocks, there is
no issue. Since the QUIC implementation is building on non-blocking I/O, this
implicit configuration of non-blocking mode is not feasible.

Note that Windows does not have an API for determining whether a socket is in
blocking mode, so it is not possible to use the initial state of an underlying
socket to determine if the application wants to use non-blocking I/O or not.
Moreover this would undermine the BIO abstraction.

As such, an explicit call is introduced to configure whether an SSL (QUIC)
object operates in blocking or non-blocking mode:

```c
int SSL_set_blocking_mode(SSL *s, int blocking);
int SSL_get_blocking_mode(SSL *s);
```

Applications desiring non-blocking operation will need to call this API to
configure a new QUIC connection accordingly. Blocking mode is chosen as the
default for parity with traditional Berkeley sockets APIs and to make things
simpler for blocking applications, which are likely to be seeking a simpler
solution. However, blocking mode cannot be supported with a non-pollable BIO,
and thus blocking mode defaults to off when used with such a BIO.

A method is also needed for the QUIC implementation to inform an underlying BIO
that it must not block. The SSL object will call this function when it is
provided with an underlying BIO. For a socket BIO this can set the socket as
non-blocking; for a memory-based BIO it is a no-op; for `BIO_s_ssl` it is
equivalent to a call to `SSL_set_blocking_mode()`.

### Internal Polling

When blocking mode is configured, the QUIC implementation will call
`BIO_get_[rw]poll_descriptor` on the underlying BIOs and use a suitable OS
function (e.g. `select()`) or, if configured, custom poller function, to block.
This will be implemented by an internal function which can accept up to two poll
descriptors (one for the read BIO, one for the write BIO), which might be
identical.

Blocking mode cannot be used with a non-pollable underlying BIO. If
`BIO_get_[rw]poll_descriptor` is not implemented for either of the underlying
read and write BIOs, blocking mode cannot be enabled and blocking mode defaults
to off.