This minimizes inter-block overhead. Performance gain naturally
varies from case to case, up to 10% was spotted so far. There is
one thing to recognize, given same circumstances gain would be
higher faster computational part is. Or in other words biggest
improvement coefficient would have been observed with assembly.
Reviewed-by: Emilia Käsper <emilia@openssl.org>
Reviewed-by: Rich Salz <rsalz@openssl.org>