Datagram BIO API revisions for sendmmsg/recvmmsg
================================================

We need to evolve the API surface of BIO which is relevant to BIO_dgram (and the
eventual BIO_dgram_mem) to support APIs which allow multiple datagrams to be
sent or received simultaneously, such as sendmmsg(2)/recvmmsg(2).

The adopted design
------------------

### Design decisions

The adopted design makes the following design decisions:

- We use a sendmmsg/recvmmsg-like API. The alternative API was not considered
  for adoption because it is an explicit goal that the adopted API be suitable
  for concurrent use on the same BIO.

- We define our own structures rather than using the OS's `struct mmsghdr`.
  The motivations for this are:

  - It ensures portability between OSes and allows the API to be used
    on OSes which do not support `sendmmsg` or `sendmsg`.

  - It allows us to use structures in keeping with OpenSSL's existing
    abstraction layers (e.g. `BIO_ADDR` rather than `struct sockaddr`).

  - We do not have to expose functionality which we cannot guarantee
    we can support on all platforms (for example, arbitrary control messages).

  - It avoids the need to include OS headers in our own public headers,
    which would pollute the environment of applications which include
    our headers, potentially undesirably.

- For OSes which do not support `sendmmsg`, we emulate it using repeated
  calls to `sendmsg`. For OSes which do not support `sendmsg`, we emulate it
  using `sendto` to the extent feasible. This avoids the need for code consuming
  these new APIs to define a fallback code path.

- We do not define any flags at this time, as the flags previously considered
  for adoption cannot be supported on all platforms (Win32 does not have
  `MSG_DONTWAIT`).

- We ensure the extensibility of our `BIO_MSG` structure in a way that preserves
  ABI compatibility using a `stride` argument which callers must set to
  `sizeof(BIO_MSG)`. Implementations can examine the stride field to determine
  whether a given field is part of a `BIO_MSG`. This allows us to add optional
  fields to `BIO_MSG` at a later time without breaking ABI. All new fields must
  be added to the end of the structure (see the sketch after this list).

- The BIO methods are designed to support stateless operation in which they
  are simply calls to the equivalent system calls, where supported, without
  changing BIO state. In particular, this means that things like retry flags are
  not set or cleared by `BIO_sendmmsg` or `BIO_recvmmsg`.

  The motivation for this is that these functions are intended to support
  concurrent use on the same BIO. If they read or modify BIO state, they would
  need to be synchronised with a lock, undermining performance on what (for
  `BIO_dgram`) would otherwise be a straight system call.

- We do not support iovecs. The motivations for this are:

  - Not all platforms can support iovecs (e.g. Windows).

  - The only way we could emulate iovecs on platforms which don't support
    them is by copying the data to be sent into a staging buffer. This would
    defeat all of the advantages of iovecs and prevent us from meeting our
    zero/single-copy requirements. Moreover, it would lead to extremely
    surprising performance variations for consumers of the API.

  - We do not believe iovecs are needed to meet our performance requirements
    for QUIC. The reason for this is that aside from a minimal packet header,
    all data in QUIC is encrypted, so all data sent via QUIC must pass through
    an encrypt step anyway, meaning that all data sent will already be copied
    and there is not going to be any issue depositing the ciphertext in a
    staging buffer together with the frame header.

  - Even if we did support iovecs, we would have to impose a limit
    on the number of iovecs supported, because we translate from our own
    structures (as discussed above) and also intend these functions to be
    stateless and not require locking. Therefore the OS-native iovec structures
    would need to be allocated on the stack.

- Sometimes, an application may wish to learn the local interface address
  associated with a receive operation or specify the local interface address to
  be used for a send operation. We support this, but require this functionality
  to be explicitly enabled before use.

  The reason for this is that enabling this functionality generally requires
  that the socket be reconfigured using `setsockopt` on most platforms. Doing
  this on-demand would require state in the BIO to determine whether this
  functionality is currently switched on, which would require otherwise
  unnecessary locking, undermining performance in concurrent usage of this API
  on a given BIO. Requiring this functionality to be enabled explicitly before
  use allows this initialization to be done up front without performance cost.
  It also helps users of the API to understand that this functionality is not
  always available and to detect when this functionality is
available in advance.
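
As a minimal illustration of the stride mechanism (a hedged sketch, not code
from the adopted implementation; the macro names here are invented), an
implementation can step through the caller's array using the caller's stride
and can test whether the caller's `BIO_MSG` (defined in the next section) is
large enough to contain a given field before touching it:

```c
#include <stddef.h>
#include <openssl/bio.h>

/* Step from one message to the next using the caller's stride, not the
 * sizeof(BIO_MSG) which libcrypto itself was compiled with. */
#define BIO_MSG_N(msg, stride, n) \
    ((BIO_MSG *)((unsigned char *)(msg) + (size_t)(n) * (stride)))

/* An optional field is usable only if the caller's structure is big
 * enough to hold it. */
#define BIO_MSG_FIELD_PRESENT(stride, field) \
    ((stride) >= offsetof(BIO_MSG, field) + sizeof(((BIO_MSG *)0)->field))
```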

### Design

The currently proposed design is as follows:

```c
typedef struct bio_msg_st {
    void *data;
    size_t data_len;
    BIO_ADDR *peer, *local;
    uint64_t flags;
} BIO_MSG;

#define BIO_UNPACK_ERRNO(e) /*...*/
#define BIO_IS_ERRNO(e)     /*...*/

ossl_ssize_t BIO_sendmmsg(BIO *b, BIO_MSG *msg, size_t stride,
                          size_t num_msg, uint64_t flags);
ossl_ssize_t BIO_recvmmsg(BIO *b, BIO_MSG *msg, size_t stride,
                          size_t num_msg, uint64_t flags);
```

The API is used as follows:

- `msg` points to an array of `num_msg` `BIO_MSG` structures.

- Both functions have identical prototypes and return the number of messages
  processed in the array. If no messages were processed due to an error, `-1`
  is returned. If an OS-level socket error occurs, a negative value `v` is
  returned. The caller should determine whether `v` is an OS-level socket error
  by calling `BIO_IS_ERRNO(v)` and may obtain the OS-level socket error code by
  calling `BIO_UNPACK_ERRNO(v)`.

- `stride` must be set to `sizeof(BIO_MSG)`.

- `data` points to the buffer of data to be sent or to be filled with received
  data. `data_len` is the size of the buffer in bytes on call. If the
  given message in the array is processed (i.e., if the return value
  exceeds the index of that message in the array), `data_len` is updated
  to the actual amount of data sent or received at return time.

- `flags` in the `BIO_MSG` structure provides per-message flags to
  the `BIO_sendmmsg` or `BIO_recvmmsg` call. If the given message in the array
  is processed, `flags` is written with zero or more result flags at return
  time. The `flags` argument to the call itself provides for global flags
  affecting all messages in the array. Currently, no per-message or global
  flags are defined and all of these fields are set to zero on call and on
  return.

- `peer` and `local` are optional pointers to `BIO_ADDR` structures into
  which the remote and local addresses are to be filled. If either of these
  is NULL, the given addressing information is not requested. Local address
  support may not be available in all circumstances, in which case processing
  of the message fails. (This means that the function returns the number of
  messages processed, or `-1` if the first message fails.)

  Support for `local` must be explicitly enabled before use, otherwise
  attempts to use it fail.

  Local address support is enabled as follows:

  ```c
  int BIO_dgram_set_local_addr_enable(BIO *b, int enable);
  int BIO_dgram_get_local_addr_enable(BIO *b);
  int BIO_dgram_get_local_addr_cap(BIO *b);
  ```

  `BIO_dgram_get_local_addr_cap()` returns 1 if local address support is
  available. It is then enabled using `BIO_dgram_set_local_addr_enable()`, which
fails if support is not available.
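
Putting the above together, here is a minimal, hedged usage sketch of the
receive path. `process_datagram` is a hypothetical consumer, the buffer sizes
and counts are arbitrary, and peer/local addressing is not requested:

```c
#include <stdio.h>
#include <openssl/bio.h>

extern void process_datagram(const void *data, size_t len); /* hypothetical */

#define NUM_MSG 32

static int receive_batch(BIO *b)
{
    static unsigned char bufs[NUM_MSG][1500];
    BIO_MSG msgs[NUM_MSG];
    ossl_ssize_t n;
    size_t i;

    for (i = 0; i < NUM_MSG; ++i) {
        msgs[i].data     = bufs[i];
        msgs[i].data_len = sizeof(bufs[i]);
        msgs[i].peer     = NULL; /* peer address not requested */
        msgs[i].local    = NULL; /* local address support not enabled */
        msgs[i].flags    = 0;    /* no per-message flags are defined yet */
    }

    n = BIO_recvmmsg(b, msgs, sizeof(BIO_MSG), NUM_MSG, 0);
    if (n < 0) {
        if (BIO_IS_ERRNO(n))
            fprintf(stderr, "socket error: %d\n", (int)BIO_UNPACK_ERRNO(n));
        return 0;
    }

    /* data_len has been updated for each processed message. */
    for (i = 0; i < (size_t)n; ++i)
        process_datagram(msgs[i].data, msgs[i].data_len);

    return 1;
}
```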

Options which were considered
-----------------------------

Options for the API surface which were considered included:

### sendmmsg/recvmmsg-like API

This design was chosen to form the basis of the adopted design, which is
described above.

```c
int BIO_readm(BIO *b, BIO_mmsghdr *msgvec,
              unsigned len, int flags, struct timespec *timeout);
int BIO_writem(BIO *b, BIO_mmsghdr *msgvec,
               unsigned len, int flags, struct timespec *timeout);
```

We can either define `BIO_mmsghdr` as a typedef of `struct mmsghdr` or redefine
an equivalent structure. The former has the advantage that we can just pass the
structures through to the syscall without copying them.

Note that in `BIO_mem_dgram` we will have to process and therefore understand
the contents of `struct mmsghdr` ourselves. Therefore, initially we define a
subset of `struct mmsghdr` as being supported: specifically, no control
messages; `msg_name` and `msg_iov` only.

The flags argument is defined by us. Initially we can support something like
`MSG_DONTWAIT` (say, `BIO_DONTWAIT`).
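
If we defined our own equivalent structures rather than typedefs of the OS's,
they might look something like the following. This is a sketch only; the type
and field names are illustrative, mirroring the supported subset described
above:

```c
typedef struct bio_iovec_st {
    void *iov_base;         /* start of buffer */
    size_t iov_len;         /* length of buffer in bytes */
} BIO_iovec;

typedef struct bio_msghdr_st {
    BIO_ADDR *msg_name;     /* peer address, or NULL */
    BIO_iovec *msg_iov;     /* scatter/gather array */
    size_t msg_iovlen;      /* number of elements in msg_iov */
    /* no control message fields: only msg_name and msg_iov are supported */
} BIO_msghdr;

typedef struct bio_mmsghdr_st {
    BIO_msghdr msg_hdr;     /* message header */
    unsigned int msg_len;   /* bytes sent/received, filled in on return */
} BIO_mmsghdr;
```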

#### Implementation Questions

If we go with this, there are some issues that arise:

- Are `BIO_mmsghdr`, `BIO_msghdr` and `BIO_iovec` simple typedefs
  for OS-provided structures, or our own independent structure
  definitions?

  - If we use OS-provided structures:

    - We would need to include the OS headers which provide these
      structures in our public API headers.

    - If we choose to support these functions when OS support is not available
      (see discussion below), we would need to define our own structures in
      this case (a “polyfill” approach).

  - If we use our own structures:

    - We would need to translate these structures during every call.

      We would also need to have storage inside the BIO_dgram for *m* `struct
      msghdr`, *m\*v* iovecs, etc. Since we want to support multithreaded use,
      these allocations probably will need to be on the stack, and therefore
      must be limited.

      Limiting *m* isn't a problem, because `sendmmsg` returns the number
      of messages sent, so the existing semantics we are trying to match
      lets us just send or receive fewer messages than we were asked to.

      However, it does seem like we will need to limit *v*, the number of
      iovecs per message. So what limit should we give to *v*? We will need a
      fixed stack allocation of OS iovec structures, and we can allocate from
      this stack allocation as we iterate through the `BIO_msghdr` we have
      been given. So in practice we could simply send messages until we reach
      our iovec limit, and then return (see the sketch after this list).

      For example, suppose we allocate 64 iovecs internally:

      ```c
      struct iovec vecs[64];
      ```

      If the first message passed to a call to `BIO_writem` has 64 iovecs
      attached to it, no further messages can be sent and `BIO_writem`
      returns 1.

      If three messages are sent, with 32, 32, and 1 iovecs respectively,
      the first two messages are sent and `BIO_writem` returns 2.

      So the only important thing we would need to document in this API
      is the limit of iovecs on a single message; in other words, the
      number of iovecs which must not be exceeded if a forward progress
      guarantee is to be made. For example, if we allocate 64 iovecs
      internally, `BIO_writem` with a single message with 65 iovecs will
      never work, and this becomes part of the API contract.

      Obviously these quantities of iovecs are unrealistically large.
      iovecs are small, so we can afford to set the limit high enough
      that it shouldn't cause any problems in practice. We can increase
      the limit later without a breaking API change, but we cannot decrease
      it later. So we might want to start with something small, like 8.

- We also need to decide what to do for OSes which don't support at least
  `sendmsg`/`recvmsg`:

  - Don't provide these functions and require all users of these functions to
    have an alternate code path which doesn't rely on them?

    - Not providing these functions on OSes that don't support at least
      sendmsg/recvmsg is a simple solution but adds complexity to code using
      BIO_dgram. (Though it does communicate more realistic performance
      expectations to calling code, since that code knows when these functions
      are actually available.)

  - Provide these functions and emulate the functionality:

    - However, there is a question here as to how we implement
      the iovec arguments on platforms without `sendmsg`/`recvmsg`. (We cannot
      use `writev`/`readv` because we need peer address information.) Logically
      implementing these would then have to be done by copying buffers around
      internally before calling `sendto`/`recvfrom`, defeating the point of
      iovecs and providing a performance profile which is surprising to code
      using BIO_dgram.

    - Another option could be a variable limit on the number of iovecs,
      which can be queried from BIO_dgram. This would be a constant set
      when libcrypto is compiled. It would be 1 for platforms not supporting
      `sendmsg`/`recvmsg`. This again adds burdens on the code using
      BIO_dgram, but it seems the only way to avoid the surprising performance
      pitfall of buffer copying to emulate iovec support. There is a fair risk
      of code being written which accidentally works on one platform but not
      another, because the author didn't realise the iovec limit is 1 on some
      platforms. Possibly we could have an “iovec limit” variable in the
      BIO_dgram which is 1 by default and which can be increased by a call to
      a function `BIO_set_iovec_limit`, but not beyond the fixed size discussed
      above. It would return failure if this is not possible, giving client
code a clear way to determine if its expectations are met.
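
For illustration, here is a sketch of the stack-translation approach discussed
in this list, assuming Linux's `sendmmsg(2)`, the polyfill structures sketched
earlier, and invented names (`OSSL_IOVEC_BUDGET`, `OSSL_MSG_BUDGET`,
`bio_writem_translate`):

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>

#define OSSL_IOVEC_BUDGET 64   /* fixed stack budget for OS iovecs */
#define OSSL_MSG_BUDGET   16   /* fixed stack budget for OS message headers */

/* Translate as many messages as fit within the budgets into OS structures
 * on the stack, then hand them to a single sendmmsg(2) call. Returns the
 * number of messages sent, or -1 on error, as sendmmsg does. */
static int bio_writem_translate(int fd, const BIO_msghdr *msgs, unsigned len)
{
    struct iovec vecs[OSSL_IOVEC_BUDGET];
    struct mmsghdr os_msgs[OSSL_MSG_BUDGET];
    size_t used = 0, v;
    unsigned i, n = 0;

    for (i = 0; i < len && n < OSSL_MSG_BUDGET; ++i) {
        if (msgs[i].msg_iovlen > OSSL_IOVEC_BUDGET - used)
            break; /* iovec budget exhausted: return a short count */

        for (v = 0; v < msgs[i].msg_iovlen; ++v) {
            vecs[used + v].iov_base = msgs[i].msg_iov[v].iov_base;
            vecs[used + v].iov_len  = msgs[i].msg_iov[v].iov_len;
        }

        memset(&os_msgs[n], 0, sizeof(os_msgs[n]));
        os_msgs[n].msg_hdr.msg_iov    = vecs + used;
        os_msgs[n].msg_hdr.msg_iovlen = msgs[i].msg_iovlen;
        /* msg_name/msg_namelen would be translated from the BIO_ADDR here */

        used += msgs[i].msg_iovlen;
        ++n;
    }

    if (n == 0)
        return 0;

    return sendmmsg(fd, os_msgs, n, 0);
}
```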

### Alternate API

Could we use a simplified API? For example, could we have an API that returns
one datagram, where BIO_dgram uses `recvmmsg` internally and queues the
returned datagrams, thereby still avoiding extra syscalls but offering a
simpler API?

The problem here is that we want to support “single-copy” (where the data is
only copied as it is decrypted). Thus BIO_dgram needs to know the final resting
place of encrypted data at the time it makes the `recvmmsg` call.

One option would be to allow the user to set a callback on BIO_dgram which it
can use to request a new buffer, then have an API which returns the buffer:

```c
int BIO_dgram_set_read_callback(BIO *b,
                                void *(*cb)(size_t len, void *arg),
                                void *arg);
int BIO_dgram_set_read_free_callback(BIO *b,
                                     void (*cb)(void *buf,
                                                size_t buf_len,
                                                void *arg),
                                     void *arg);
int BIO_read_dequeue(BIO *b, void **buf, size_t *buf_len);
```

The BIO_dgram calls the specified callback when it needs to generate internal
iovecs for its `recvmmsg` call, and the received datagrams can then be popped
by the application and freed as it likes. (The read free callback above is
only used in rare circumstances, such as when calls to `BIO_read` and
`BIO_read_dequeue` are alternated, or when the BIO_dgram is destroyed prior to
all read buffers being dequeued; see below.) For convenience we could have an
extra call to allow a buffer to be pushed back into the BIO_dgram's internal
queue of unused read buffers, which avoids the need for the application to do
its own management of such recycled buffers:

```c
int BIO_dgram_push_read_buffer(BIO *b, void *buf, size_t buf_len);
```
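
A sketch of how an application might drive this read path, assuming the usual
BIO convention of returning 1 on success (`process` is a hypothetical
consumer; the trivial malloc/free allocator is purely illustrative):

```c
#include <stdlib.h>
#include <openssl/bio.h>

extern void process(void *buf, size_t buf_len); /* hypothetical consumer */

static void *my_alloc(size_t len, void *arg)
{
    (void)arg;
    return malloc(len); /* BIO_dgram requests receive buffers here */
}

static void my_free(void *buf, size_t buf_len, void *arg)
{
    (void)buf_len;
    (void)arg;
    free(buf); /* only invoked in the rare cases described above */
}

static void read_loop(BIO *b)
{
    void *buf;
    size_t buf_len;

    BIO_dgram_set_read_callback(b, my_alloc, NULL);
    BIO_dgram_set_read_free_callback(b, my_free, NULL);

    while (BIO_read_dequeue(b, &buf, &buf_len) == 1) {
        process(buf, buf_len); /* application owns dequeued buffers */
        free(buf);
    }
}
```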

On the write side, the application provides buffers and can get a callback
when they are freed. `BIO_write_queue` just queues for transmission, and the
`sendmmsg` call is made when calling `BIO_flush`. (TBD: whether it is
reasonable to overload the semantics of `BIO_flush` in this way.)

```c
int BIO_dgram_set_write_done_callback(BIO *b,
                                      void (*cb)(const void *buf,
                                                 size_t buf_len,
                                                 int status,
                                                 void *arg),
                                      void *arg);
int BIO_write_queue(BIO *b, const void *buf, size_t buf_len);
int BIO_flush(BIO *b);
```
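
A corresponding sketch for the write side (again assuming the 1-on-success
convention; `dgram1`/`dgram2` are assumed to be heap-allocated datagrams owned
by the application until the done callback fires):

```c
#include <stdlib.h>
#include <openssl/bio.h>

static void on_write_done(const void *buf, size_t buf_len, int status,
                          void *arg)
{
    (void)buf_len;
    (void)arg;
    /* status is 1 on success, negative on failure or premature free */
    free((void *)buf); /* buffer ownership returns to the application */
}

static void send_two(BIO *b, void *dgram1, size_t len1,
                     void *dgram2, size_t len2)
{
    BIO_dgram_set_write_done_callback(b, on_write_done, NULL);

    BIO_write_queue(b, dgram1, len1); /* queues only; no syscall yet */
    BIO_write_queue(b, dgram2, len2);

    BIO_flush(b); /* one sendmmsg call covering both queued datagrams */
}
```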

The status argument to the write done callback will be 1 on success, some
negative value on failure, and some special negative value if the BIO_dgram is
being freed before the write could be completed.

For send/receive addresses, we import the `BIO_(set|get)_dgram_(origin|dest)`
APIs proposed in the sendmsg/recvmsg PR (#5257). `BIO_get_dgram_(origin|dest)`
should be called immediately after `BIO_read_dequeue` and
`BIO_set_dgram_(origin|dest)` should be called immediately before
`BIO_write_queue`.

This approach allows `BIO_dgram` to support myriad options via composition of
successive function calls in a “builder” style rather than via a single
function call with an excessive number of arguments or pointers to unwieldy,
ever-growing argument structures, requiring constant revision of the central
read/write functions of the BIO API.

Note that since `BIO_set_dgram_(origin|dest)` sets data on outgoing packets
and `BIO_get_dgram_(origin|dest)` gets data on incoming packets, it doesn't
follow that these access the same data. (They are not setters and getters of
variables called "dgram origin" and "dgram destination", even though their
names make them look like setters and getters of the same variables.) We
probably want to separate these, as there is no need for a getter for outgoing
packet destination, for example, and by separating these we allow the
possibility of multithreaded use (one thread reads, one thread writes) in the
future. Possibly we should choose less confusing names for these functions,
such as `BIO_set_outgoing_dgram_(origin|dest)` and
`BIO_get_incoming_dgram_(origin|dest)`.

Pros of this approach:

- The application can generate one datagram at a time and still get the
  advantages of sendmmsg/recvmmsg (fewer syscalls, etc.)

  We probably want this for our own QUIC implementation built on top of this
  anyway. Otherwise we will need another piece to do basically the same thing
  and agglomerate multiple datagrams into a single BIO call, unless we only
  want to use `sendmmsg` constructively in trivial cases (e.g. where we send
  two datagrams from the same function immediately after one another, which
  doesn't seem like a common use case).

- Flexible support for single-copy (zero-copy).

Cons of this approach:

- A very different way of doing reads/writes might be strange to existing
  applications. *But* the primary consumer of this new API will be our own
  QUIC implementation, so this is probably not a big deal. We can always
  support `BIO_read`/`BIO_write` as a less efficient fallback for existing
  third-party users of BIO_dgram.

#### Compatibility interop

Suppose the following sequence happens:

1. BIO_read (legacy call path)
2. BIO_read_dequeue (`recvmmsg` based call path with callback-allocated buffer)
3. BIO_read (legacy call path)

For (1) we have two options:

a. Use `recvmmsg` and add the received datagrams to an RX queue just as for
   the `BIO_read_dequeue` path. We use an OpenSSL-provided default allocator
   (`OPENSSL_malloc`) and flag these datagrams as needing to be freed by
   OpenSSL, not the application.

   When the application calls `BIO_read`, a copy is performed and the internal
   buffer is freed.

b. Use `recvfrom` directly. This means we have a `recvmmsg` path and a
   `recvfrom` path, depending on which API is being used.

The disadvantage of (a) is that it yields an extra copy relative to what we
have now, whereas with (b) the buffer passed to `BIO_read` gets passed through
to the syscall and we do not have to copy anything.

Since we will probably need to support platforms without
`sendmmsg`/`recvmmsg` support anyway, (b) seems like the better option.

For (2) the new API is used. Since the previous call to BIO_read is
essentially “stateless” (it's just a simple call to `recvfrom` and doesn't
require mutation of any internal BIO state other than maybe the last datagram
source/destination address fields), BIO_dgram can go ahead and start using the
`recvmmsg` code path. Since the RX queue will obviously be empty at this
point, it is initialised and filled using `recvmmsg`, then one datagram is
popped from it.

For (3) we have a legacy `BIO_read` but we have several datagrams still in
the RX queue. In this case we do have to copy; we have no choice. However,
this only happens in circumstances where a user of BIO_dgram alternates
between old and new APIs, which should be very unusual.

Subsequently for (3) we have to free the buffer using the free callback. This
is an unusual case where BIO_dgram is responsible for freeing read buffers and
not the application (the only other case being premature destruction, see
below). But since this seems a very strange API usage pattern, we may just
want to fail in this case.

This is probably not worth supporting, so we can have the following rule:

- After the first call to `BIO_read_dequeue` is made on a BIO_dgram, all
  subsequent calls to ordinary `BIO_read` will fail.

Of course, all of the above applies analogously to the TX side.

#### BIO_dgram_pair

We will also implement from scratch a BIO_dgram_pair. This will be provided as
a BIO pair which provides identical semantics to the BIO_dgram above, both for
the legacy and zero-copy code paths.

#### Thread safety

It is a functional assumption of the above design that we would never want to
have more than one thread doing TX on the same BIO and never have more than
one thread doing RX on the same BIO.

If we did ever want to do this, multiple BIOs on the same FD is one
possibility (for the BIO_dgram case at least). But I don't believe there is
any general intention to support multithreaded use of a single BIO at this
time (unless I am mistaken), so this seems like it isn't an issue.

If we wanted to support multithreaded use of the same FD using the same BIO,
we would need to revisit the set-call-then-execute-call API approach above
(`BIO_(set|get)_dgram_(origin|dest)`), as this would pose a problem. But I
mention this mainly for completeness. Our recently learnt lessons on cache
contention suggest that this probably wouldn't be a good idea anyway.

#### Other questions

BIO_dgram will call the allocation function to get buffers for `recvmmsg` to
fill. We might want to have a way to specify how many buffers it should offer
to `recvmmsg`, and thus how many buffers it allocates in advance.

#### Premature destruction

If BIO_dgram is freed before all datagrams are read, the read buffer free
callback is used to free any unreturned read buffers.
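
A sketch of what this implies inside a hypothetical destructor (the internal
type and function names here are invented for illustration):

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct rx_entry_st {
    void *buf;
    size_t buf_len;
    struct rx_entry_st *next;
} RX_ENTRY;

typedef struct bio_dgram_sketch_st {
    RX_ENTRY *rx_head;
    void (*read_free_cb)(void *buf, size_t buf_len, void *arg);
    void *read_free_cb_arg;
} BIO_DGRAM_SKETCH;

/* Called on destruction: any buffers the application never dequeued are
 * handed back through the read free callback rather than leaked. */
static void drain_rx_queue(BIO_DGRAM_SKETCH *d)
{
    RX_ENTRY *e, *next;

    for (e = d->rx_head; e != NULL; e = next) {
        next = e->next;
        if (d->read_free_cb != NULL)
            d->read_free_cb(e->buf, e->buf_len, d->read_free_cb_arg);
        free(e);
    }
    d->rx_head = NULL;
}
```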