mirror of
https://github.com/openssl/openssl.git
synced 2024-11-21 01:15:20 +08:00
Improve chacha20 perfomance on aarch64 by interleaving scalar with SVE/SVE2
The patch will process one extra block by scalar in addition to blocks by SVE/SVE2 in parallel. This is esp. helpful in the scenario where we only have 128-bit vector length. The actual uplift to performance is complicated, depending on the vector length and input data size. SVE/SVE2 implementation don't always perform better than Neon, but it should prevail in most cases On a CPU with 256-bit SVE/SVE2, interleaved processing can handle 9 blocks in parallel (8 blocks by SVE and 1 by Scalar). on 128-bit SVE/SVE2 it is 5 blocks. Input size that is a multiple of 9/5 blocks on respective CPU can be typically handled at maximum speed. Here are test data for 256-bit and 128-bit SVE/SVE2 by running "openssl speed -evp chacha20 -bytes 576" (and other size) ----------------------------------+--------------------------------- 256-bit SVE | 128-bit SVE2 ----------------------------------|--------------------------------- Input 576 bytes 512 bytes | 320 bytes 256 bytes ----------------------------------|--------------------------------- SVE 1716361.91k 1556699.18k | 1615789.06k 1302864.40k ----------------------------------|--------------------------------- Neon 1262643.44k 1509044.05k | 680075.67k 1060532.31k ----------------------------------+--------------------------------- If the input size gets very large, the advantage of SVE/SVE2 over Neon will fade out. Signed-off-by: Daniel Hu <Daniel.Hu@arm.com> Change-Id: Ieedfcb767b9c08280d7c8c9a8648919c69728fab Reviewed-by: Tomas Mraz <tomas@openssl.org> Reviewed-by: Paul Dale <pauli@openssl.org> (Merged from https://github.com/openssl/openssl/pull/18901)
This commit is contained in:
parent
6b5c7ef771
commit
3f42f41ad1