Makes e. g. matrix multiplication 2x faster:
name old cpu/op new cpu/op delta
BM_convers 181ms ± 1% 62ms ± 9% -65.82% (p=0.016 n=4+5)
Tested on all possible input values (not adding tests, since they
take a long time).
Activates vectorization of the Eigen::half versions of the tanh and
logistic functions when they run on Neon. Both functions convert their
inputs to float before computing the output, and as a result of this
commit, the conversions and the computation in float are vectorized.
We currently have plenty of type definitions with the alignment
qualifier coming after the type. The compiler warns about ignoring
them:
int EIGEN_ALIGN16 ai[4];
Turn this into:
EIGEN_ALIGN16 int ai[4];
Fixes compiler errors in expressions that look like
Eigen::Matrix<Eigen::half, 3, 1>::Random().maxCoeff()
The error comes from the code that creates the initial value for
vectorized reductions. The fix is to specify the scalar type of the
reduction's initial value.
The cahnge is necessary for Eigen::half because unlike other types,
Eigen::half scalars cannot be implicitly created from integers.
To elide the memcpy, we need to first load the `src` value into
registers by making a local copy. This avoids the need to resort
to potential UB by using `reinterpret_cast`.
This change doesn't seem to affect CPU (at least not with gcc/clang).
With optimizations on, the copy is also elided.
Looks like we need to update the
`EIGEN_INHERIT_ASSIGNMENT_EQUAL_OPERATOR` for newer versions of MSVC as
well when compiling with NVCC. Fixes build issues for VS 2017.
VS2017 doesn't like deducing alias types, leading to a bunch of compile
errors for functions involving the `tuple` alias. Replacing with
`TupleImpl` seems to solve this, allowing the test to compile/pass.
The `Complex.h` file applies equally to HIP/CUDA, so placing under the
generic `GPU` folder.
The `TensorReductionCuda.h` has already been deprecated, now removing
for the next Eigen version.
MSVC does not support specializing compound assignments for
`std::complex`, since it already specializes them (contrary to the
standard).
Trying to use one of these on device will currently lead to a
duplicate definition error. This is still probably preferable
to no error though. If we remove the definitions for MSVC, then
it will compile, but the kernel will fail silently.
The only proper solution would be to define our own custom `Complex`
type.