We currently load data byte by byte so that it ends up byteswapped on
big-endian hosts. On little-endian hosts no swap is needed, so we can
just do 8-byte loads.
A SHAKE128 benchmark runs 10% faster on POWER9 with this patch applied.
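
For illustration only, a minimal sketch of the idea, not the code in this
patch. The helper names (load_le64_bytewise, host_is_little_endian,
load_le64) are hypothetical, and the runtime endianness probe is an
assumption made to keep the example portable; the real code may well
decide this at compile time.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: assemble a little-endian 64-bit value one byte at
 * a time. Gives the same result on any host, at the cost of 8 byte loads.
 */
static uint64_t load_le64_bytewise(const unsigned char *p)
{
    uint64_t v = 0;
    int i;

    for (i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Hypothetical helper: returns 1 if the lowest-addressed byte is the LSB. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);
    return first == 1;
}

/*
 * Hypothetical helper: on little endian the in-memory layout already
 * matches, so a single 8-byte copy is enough. memcpy is used so that
 * unaligned input stays well defined; compilers typically turn it into
 * one load.
 */
static uint64_t load_le64(const unsigned char *p)
{
    if (host_is_little_endian()) {
        uint64_t v;

        memcpy(&v, p, sizeof(v));
        return v;
    }
    return load_le64_bytewise(p);
}
```

Both paths return the same value for the same input; only the number of
loads differs.
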
Reviewed-by: Paul Dale <pauli@openssl.org>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from https://github.com/openssl/openssl/pull/8455)