Typos in doc/man* will be fixed in a different commit. Reviewed-by: Tomas Mraz <tomas@openssl.org> Reviewed-by: Paul Dale <pauli@openssl.org> (Merged from https://github.com/openssl/openssl/pull/20910)
22 KiB
Datagram BIO API revisions for sendmmsg/recvmmsg
We need to evolve the API surface of BIO which is relevant to BIO_dgram (and the eventual BIO_dgram_mem) to support APIs which allow multiple datagrams to be sent or received simultaneously, such as sendmmsg(2)/recvmmsg(2).
The adopted design
Design decisions
The adopted design makes the following design decisions:
-
We use a sendmmsg/recvmmsg-like API. The alternative API was not considered for adoption because it is an explicit goal that the adopted API be suitable for concurrent use on the same BIO.
-
We define our own structures rather than using the OS's
struct mmsghdr
. The motivations for this are:-
It ensures portability between OSes and allows the API to be used on OSes which do not support
sendmmsg
orsendmsg
. -
It allows us to use structures in keeping with OpenSSL's existing abstraction layers (e.g.
BIO_ADDR
rather thanstruct sockaddr
). -
We do not have to expose functionality which we cannot guarantee we can support on all platforms (for example, arbitrary control messages).
-
It avoids the need to include OS headers in our own public headers, which would pollute the environment of applications which include our headers, potentially undesirably.
-
-
For OSes which do not support
sendmmsg
, we emulate it using repeated calls tosendmsg
. For OSes which do not supportsendmsg
, we emulate it usingsendto
to the extent feasible. This avoids the need for code consuming these new APIs to define a fallback code path. -
We do not define any flags at this time, as the flags previously considered for adoption cannot be supported on all platforms (Win32 does not have
MSG_DONTWAIT
). -
We ensure the extensibility of our
BIO_MSG
structure in a way that preserves ABI compatibility using astride
argument which callers must set tosizeof(BIO_MSG)
. Implementations can examine the stride field to determine whether a given field is part of aBIO_MSG
. This allows us to add optional fields toBIO_MSG
at a later time without breaking ABI. All new fields must be added to the end of the structure. -
The BIO methods are designed to support stateless operation in which they are simply calls to the equivalent system calls, where supported, without changing BIO state. In particular, this means that things like retry flags are not set or cleared by
BIO_sendmmsg
orBIO_recvmmsg
.The motivation for this is that these functions are intended to support concurrent use on the same BIO. If they read or modify BIO state, they would need to be synchronised with a lock, undermining performance on what (for
BIO_dgram
) would otherwise be a straight system call. -
We do not support iovecs. The motivations for this are:
-
Not all platforms can support iovecs (e.g. Windows).
-
The only way we could emulate iovecs on platforms which don't support them is by copying the data to be sent into a staging buffer. This would defeat all of the advantages of iovecs and prevent us from meeting our zero/single-copy requirements. Moreover, it would lead to extremely surprising performance variations for consumers of the API.
-
We do not believe iovecs are needed to meet our performance requirements for QUIC. The reason for this is that aside from a minimal packet header, all data in QUIC is encrypted, so all data sent via QUIC must pass through an encrypt step anyway, meaning that all data sent will already be copied and there is not going to be any issue depositing the ciphertext in a staging buffer together with the frame header.
-
Even if we did support iovecs, we would have to impose a limit on the number of iovecs supported, because we translate from our own structures (as discussed above) and also intend these functions to be stateless and not requiire locking. Therefore the OS-native iovec structures would need to be allocated on the stack.
-
-
Sometimes, an application may wish to learn the local interface address associated with a receive operation or specify the local interface address to be used for a send operation. We support this, but require this functionality to be explicitly enabled before use.
The reason for this is that enabling this functionality generally requires that the socket be reconfigured using
setsockopt
on most platforms. Doing this on-demand would require state in the BIO to determine whether this functionality is currently switched on, which would require otherwise unnecessary locking, undermining performance in concurrent usage of this API on a given BIO. By requiring this functionality to be enabled explicitly before use, this allows this initialization to be done up front without performance cost. It also aids users of the API to understand that this functionality is not always available and to detect when this functionality is available in advance.
Design
The currently proposed design is as follows:
typedef struct bio_msg_st {
void *data;
size_t data_len;
BIO_ADDR *peer, *local;
uint64_t flags;
} BIO_MSG;
#define BIO_UNPACK_ERRNO(e) /*...*/
#define BIO_IS_ERRNO(e) /*...*/
ossl_ssize_t BIO_sendmmsg(BIO *b, BIO_MSG *msg, size_t stride,
size_t num_msg, uint64_t flags);
ossl_ssize_t BIO_recvmmsg(BIO *b, BIO_MSG *msg, size_t stride,
size_t num_msg, uint64_t flags);
The API is used as follows:
-
msg
points to an array ofnum_msg
BIO_MSG
structures. -
Both functions have identical prototypes, and return the number of messages processed in the array. If no messages were sent due to an error,
-1
is returned. If an OS-level socket error occurs, a negative valuev
is returned. The caller should determine thatv
is an OS-level socket error by callingBIO_IS_ERRNO(v)
and may obtain the OS-level socket error code by callingBIO_UNPACK_ERRNO(v)
. -
stride
must be set tosizeof(BIO_MSG)
. -
data
points to the buffer of data to be sent or to be filled with received data.data_len
is the size of the buffer in bytes on call. If the given message in the array is processed (i.e., if the return value exceeds the index of that message in the array),data_len
is updated to the actual amount of data sent or received at return time. -
flags
in theBIO_MSG
structure provides per-message flags to theBIO_sendmmsg
orBIO_recvmmsg
call. If the given message in the array is processed,flags
is written with zero or more result flags at return time. Theflags
argument to the call itself provides for global flags affecting all messages in the array. Currently, no per-message or global flags are defined and all of these fields are set to zero on call and on return. -
peer
andlocal
are optional pointers toBIO_ADDR
structures into which the remote and local addresses are to be filled. If either of these are NULL, the given addressing information is not requested. Local address support may not be available in all circumstances, in which case processing of the message fails. (This means that the function returns the number of messages processed, or -1 if the message in question is the first message.)Support for
local
must be explicitly enabled before use, otherwise attempts to use it fail.
Local address support is enabled as follows:
int BIO_dgram_set_local_addr_enable(BIO *b, int enable);
int BIO_dgram_get_local_addr_enable(BIO *b);
int BIO_dgram_get_local_addr_cap(BIO *b);
BIO_dgram_get_local_addr_cap()
returns 1 if local address support is
available. It is then enabled using BIO_dgram_set_local_addr_enable()
, which
fails if support is not available.
Options which were considered
Options for the API surface which were considered included:
sendmmsg/recvmmsg-like API
This design was chosen to form the basis of the adopted design, which is described above.
int BIO_readm(BIO *b, BIO_mmsghdr *msgvec,
unsigned len, int flags, struct timespec *timeout);
int BIO_writem(BIO *b, BIO_mmsghdr *msgvec,
unsigned len, int flags, struct timespec *timeout);
We can either define BIO_mmsghdr
as a typedef of struct mmsghdr
or redefine
an equivalent structure. The former has the advantage that we can just pass the
structures through to the syscall without copying them.
Note that in BIO_mem_dgram
we will have to process and therefore understand
the contents of struct mmsghdr
ourselves. Therefore, initially we define a
subset of struct mmsghdr
as being supported, specifically no control messages;
msg_name
and msg_iov
only.
The flags argument is defined by us. Initially we can support something like
MSG_DONTWAIT
(say, BIO_DONTWAIT
).
Implementation Questions
If we go with this, there are some issues that arise:
-
Are
BIO_mmsghdr
,BIO_msghdr
andBIO_iovec
simple typedefs for OS-provided structures, or our own independent structure definitions?-
If we use OS-provided structures:
-
We would need to include the OS headers which provide these structures in our public API headers.
-
If we choose to support these functions when OS support is not available (see discussion below), We would need to define our own structures in this case (a “polyfill” approach).
-
-
If we use our own structures:
-
We would need to translate these structures during every call.
But we would need to have storage inside the BIO_dgram for m
struct msghdr
, m*v iovecs, etc. Since we want to support multithreaded use these allocations probably will need to be on the stack, and therefore must be limited.Limiting m isn't a problem, because
sendmmsg
returns the number of messages sent, so the existing semantics we are trying to match lets us just send or receive fewer messages than we were asked to.However, it does seem like we will need to limit v, the number of iovecs per message. So what limit should we give to v, the number of iovecs? We will need a fixed stack allocation of OS iovec structures and we can allocate from this stack allocation as we iterate through the
BIO_msghdr
we have been given. So in practice we could just only send messages until we reach our iovec limit, and then return.For example, suppose we allocate 64 iovecs internally:
struct iovec vecs[64];
If the first message passed to a call to
BIO_writem
has 64 iovecs attached to it, no further messages can be sent andBIO_writem
returns 1.If three messages are sent, with 32, 32, and 1 iovecs respectively, the first two messages are sent and
BIO_writem
returns 2.So the only important thing we would need to document in this API is the limit of iovecs on a single message; in other words, the number of iovecs which must not be exceeded if a forward progress guarantee is to be made. e.g. if we allocate 64 iovecs internally,
BIO_writem
with a single message with 65 iovecs will never work and this becomes part of the API contract.Obviously these quantities of iovecs are unrealistically large. iovecs are small, so we can afford to set the limit high enough that it shouldn't cause any problems in practice. We can increase the limit later without a breaking API change, but we cannot decrease it later. So we might want to start with something small, like 8.
-
-
-
We also need to decide what to do for OSes which don't support at least
sendmsg
/recvmsg
.-
Don't provide these functions and require all users of these functions to have an alternate code path which doesn't rely on them?
- Not providing these functions on OSes that don't support at least sendmsg/recvmsg is a simple solution but adds complexity to code using BIO_dgram. (Though it does communicate to code more realistic performance expectations since it knows when these functions are actually available.)
-
Provide these functions and emulate the functionality:
-
However there is a question here as to how we implement the iovec arguments on platforms without
sendmsg
/recvmsg
. (We cannot usewritev
/readv
because we need peer address information.) Logically implementing these would then have to be done by copying buffers around internally before callingsendto
/recvfrom
, defeating the point of iovecs and providing a performance profile which is surprising to code using BIO_dgram. -
Another option could be a variable limit on the number of iovecs, which can be queried from BIO_dgram. This would be a constant set when libcrypto is compiled. It would be 1 for platforms not supporting
sendmsg
/recvmsg
. This again adds burdens on the code using BIO_dgram, but it seems the only way to avoid the surprising performance pitfall of buffer copying to emulate iovec support. There is a fair risk of code being written which accidentally works on one platform but not another, because the author didn't realise the iovec limit is 1 on some platforms. Possibly we could have an “iovec limit” variable in the BIO_dgram which is 1 by default, which can be increased by a call to a function BIO_set_iovec_limit, but not beyond the fixed size discussed above. It would return failure if not possible and this would give client code a clear way to determine if its expectations are met.
-
-
Alternate API
Could we use a simplified API? For example, could we have an API that returns
one datagram where BIO_dgram uses readmmsg
internally and queues the returned
datagrams, thereby still avoiding extra syscalls but offering a simple API.
The problem here is we want to support “single-copy” (where the data is only
copied as it is decrypted). Thus BIO_dgram needs to know the final resting place
of encrypted data at the time it makes the readmmsg
call.
One option would be to allow the user to set a callback on BIO_dgram it can use to request a new buffer, then have an API which returns the buffer:
int BIO_dgram_set_read_callback(BIO *b,
void *(*cb)(size_t len, void *arg),
void *arg);
int BIO_dgram_set_read_free_callback(BIO *b,
void (*cb)(void *buf,
size_t buf_len,
void *arg),
void *arg);
int BIO_read_dequeue(BIO *b, void **buf, size_t *buf_len);
The BIO_dgram calls the specified callback when it needs to generate internal
iovecs for its readmmsg
call, and the received datagrams can then be popped by
the application and freed as it likes. (The read free callback above is only
used in rare circumstances, such as when calls to BIO_read
and
BIO_read_dequeue
are alternated, or when the BIO_dgram is destroyed prior to
all read buffers being dequeued; see below.) For convenience we could have an
extra call to allow a buffer to be pushed back into the BIO_dgram's internal
queue of unused read buffers, which avoids the need for the application to do
its own management of such recycled buffers:
int BIO_dgram_push_read_buffer(BIO *b, void *buf, size_t buf_len);
On the write side, the application provides buffers and can get a callback when
they are freed. BIO_write_queue just queues for transmission, and the sendmmsg
call is made when calling BIO_flush
. (TBD: whether it is reasonable to
overload the semantics of BIO_flush in this way.)
int BIO_dgram_set_write_done_callback(BIO *b,
void (*cb)(const void *buf,
size_t buf_len,
int status,
void *arg),
void *arg);
int BIO_write_queue(BIO *b, const void *buf, size_t buf_len);
int BIO_flush(BIO *b);
The status argument to the write done callback will be 1 on success, some negative value on failure, and some special negative value if the BIO_dgram is being freed before the write could be completed.
For send/receive addresses, we import the BIO_(set|get)_dgram_(origin|dest)
APIs proposed in the sendmsg/recvmsg PR (#5257). BIO_get_dgram_(origin|dest)
should be called immediately after BIO_read_dequeue
and
BIO_set_dgram_(origin|dest)
should be called immediately before
BIO_write_queue
.
This approach allows BIO_dgram
to support myriad options via composition of
successive function calls in a “builder” style rather than via a single function
call with an excessive number of arguments or pointers to unwieldy ever-growing
argument structures, requiring constant revision of the central read/write
functions of the BIO API.
Note that since BIO_set_dgram_(origin|dest)
sets data on outgoing packets and
BIO_get_dgram_(origin|dest)
gets data on incoming packets, it doesn't follow
that these are accessing the same data (they are not setters and getters of a
variables called "dgram origin" and "dgram destination", even though they look
like setters and getters of the same variables from the name.) We probably want
to separate these as there is no need for a getter for outgoing packet
destination, for example, and by separating these we allow the possibility of
multithreaded use (one thread reads, one thread writes) in the future. Possibly
we should choose less confusing names for these functions. Maybe
BIO_set_outgoing_dgram_(origin|dest)
and
BIO_get_incoming_dgram_(origin|dest)
.
Pros of this approach:
-
Application can generate one datagram at a time and still get the advantages of sendmmsg/recvmmsg (fewer syscalls, etc.)
We probably want this for our own QUIC implementation built on top of this anyway. Otherwise we will need another piece to do basically the same thing and agglomerate multiple datagrams into a single BIO call. Unless we only want use
sendmmsg
constructively in trivial cases (e.g. where we send two datagrams from the same function immediately after one another... doesn't seem like a common use case.) -
Flexible support for single-copy (zero-copy).
Cons of this approach:
- Very different way of doing reads/writes might be strange to existing
applications. But the primary consumer of this new API will be our own
QUIC implementation so probably not a big deal. We can always support
BIO_read
/BIO_write
as a less efficient fallback for existing third party users of BIO_dgram.
Compatibility interop
Suppose the following sequence happens:
- BIO_read (legacy call path)
- BIO_read_dequeue (
recvmmsg
based call path with callback-allocated buffer) - BIO_read (legacy call path)
For (1) we have two options
a. Use recvmmsg
and add the received datagrams to an RX queue just as for the
BIO_read_dequeue
path. We use an OpenSSL-provided default allocator
(OPENSSL_malloc
) and flag these datagrams as needing to be freed by OpenSSL,
not the application.
When the application calls BIO_read
, a copy is performed and the internal
buffer is freed.
b. Use recvfrom
directly. This means we have a recvmmsg
path and a
recvfrom
path depending on what API is being used.
The disadvantage of (a) is it yields an extra copy relative to what we have now,
whereas with (b) the buffer passed to BIO_read
gets passed through to the
syscall and we do not have to copy anything.
Since we will probably need to support platforms without
sendmmsg
/recvmmsg
support anyway, (b) seems like the better option.
For (2) the new API is used. Since the previous call to BIO_read is essentially
“stateless” (it's just a simple call to recvfrom
, and doesn't require mutation
of any internal BIO state other than maybe the last datagram source/destination
address fields), BIO_dgram can go ahead and start using the recvmmsg
code
path. Since the RX queue will obviously be empty at this point, it is
initialised and filled using recvmmsg
, then one datagram is popped from it.
For (3) we have a legacy BIO_read
but we have several datagrams still in the
RX queue. In this case we do have to copy - we have no choice. However this only
happens in circumstances where a user of BIO_dgram alternates between old and
new APIs, which should be very unusual.
Subsequently for (3) we have to free the buffer using the free callback. This is an unusual case where BIO_dgram is responsible for freeing read buffers and not the application (the only other case being premature destruction, see below). But since this seems a very strange API usage pattern, we may just want to fail in this case.
Probably not worth supporting this. So we can have the following rule:
- After the first call to
BIO_read_dequeue
is made on a BIO_dgram, all subsequent calls to ordinaryBIO_read
will fail.
Of course, all of the above applies analogously to the TX side.
BIO_dgram_pair
We will also implement from scratch a BIO_dgram_pair. This will be provided as a BIO pair which provides identical semantics to the BIO_dgram above, both for the legacy and zero-copy code paths.
Thread safety
It is a functional assumption of the above design that we would never want to have more than one thread doing TX on the same BIO and never have more than one thread doing RX on the same BIO.
If we did ever want to do this, multiple BIOs on the same FD is one possibility (for the BIO_dgram case at least). But I don't believe there is any general intention to support multithreaded use of a single BIO at this time (unless I am mistaken), so this seems like it isn't an issue.
If we wanted to support multithreaded use of the same FD using the same BIO, we
would need to revisit the set-call-then-execute-call API approach above
(BIO_(set|get)_dgram_(origin|dest)
) as this would pose a problem. But I mainly
mention this only for completeness. Our recent learnt lessons on cache
contention suggest that this probably wouldn't be a good idea anyway.
Other questions
BIO_dgram will call the allocation function to get buffers for recvmmsg
to
fill. We might want to have a way to specify how many buffers it should offer to
recvmmsg
, and thus how many buffers it allocates in advance.
Premature destruction
If BIO_dgram is freed before all datagrams are read, the read buffer free callback is used to free any unreturned read buffers.