- The current implementation computes `size + total_threads`, which can
overflow and cause CUDA_ERROR_ILLEGAL_ADDRESS when size is close to
the maximum representable value.
- The num_blocks calculation can also overflow due to the implementation
of divup().
- This patch prevents these overflows and allows the kernel to work
correctly for the full representable range of tensor sizes.
- Also adds relevant tests.
- Doing computation with uninitialized (zero-ed ? but thanks Linux) matrix, or
worse NaN on other non-linux systems.
- This commit fixes it by initializing to Random().
For some reason, having `take<n, numeric_list<T>>` for `n > 0` causes
g++-11 to ICE with
```
sorry, unimplemented: unexpected AST of kind nontype_argument_pack
```
It does work with other versions of gcc, and with clang.
I filed a GCC bug
[here](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102999).
Technically we should never actually run into this case, since you
can't take n > 0 elements from an empty list. Commenting it out
allows our Eigen tests to pass.
The sum accuracy test currently uses the default test precision for
the given scalar type. However, scalars are generated via a normal
distribution, and given a large enough count and strong enough random
generator, the expected sum is zero. This causes the test to
periodically fail.
Here we estimate an upper-bound for the error as `sqrt(N) * prec` for
summing N values, with each having an approximate epsilon of `prec`.
Also fixed a few warnings generated by MSVC when compiling the
reduction test.
reducer0.reducePacket(accum1, accum0);
reducer0.reducePacket(accum2, accum0);
reducer0.reducePacket(accum3, accum0);
For the mean reducer this will increment the count as well as adding together the accumulators and result in the wrong count being divided into the sum at the end.
Makes e. g. matrix multiplication 2x faster:
name old cpu/op new cpu/op delta
BM_convers 181ms ± 1% 62ms ± 9% -65.82% (p=0.016 n=4+5)
Tested on all possible input values (not adding tests, since they
take a long time).
Activates vectorization of the Eigen::half versions of the tanh and
logistic functions when they run on Neon. Both functions convert their
inputs to float before computing the output, and as a result of this
commit, the conversions and the computation in float are vectorized.
We currently have plenty of type definitions with the alignment
qualifier coming after the type. The compiler warns about ignoring
them:
int EIGEN_ALIGN16 ai[4];
Turn this into:
EIGEN_ALIGN16 int ai[4];
- The current implementation computes `size + total_threads`, which can
overflow and cause CUDA_ERROR_ILLEGAL_ADDRESS when size is close to
the maximum representable value.
- The num_blocks calculation can also overflow due to the implementation
of divup().
- This patch prevents these overflows and allows the kernel to work
correctly for the full representable range of tensor sizes.
- Also adds relevant tests.
Fixes compiler errors in expressions that look like
Eigen::Matrix<Eigen::half, 3, 1>::Random().maxCoeff()
The error comes from the code that creates the initial value for
vectorized reductions. The fix is to specify the scalar type of the
reduction's initial value.
The cahnge is necessary for Eigen::half because unlike other types,
Eigen::half scalars cannot be implicitly created from integers.
For vectorized 1-dimensional inputs that do not take the special
blocking path (e.g. `std::complex<...>`), there was an
index-out-of-bounds error causing the broadcast size to be
computed incorrectly. Here we fix this, and make other minor
cleanup changes.
Fixes#2351.
& and | short-circuit, && and || don't. When both arguments to those
are boolean, the short-circuiting version is usually the desired one, so
clang warns on this.
Here, it is inconsequential, so switch to && and || to suppress the warning.
To elide the memcpy, we need to first load the `src` value into
registers by making a local copy. This avoids the need to resort
to potential UB by using `reinterpret_cast`.
This change doesn't seem to affect CPU (at least not with gcc/clang).
With optimizations on, the copy is also elided.
For moderately sized inputs, running the Tree reduction quickly
fills/overflows the GPU thread stack space, leading to memory errors.
This was happening in the `cxx11_tensor_complex_gpu` test, for example.
Disabling tree reduction on GPU fixes this.
Looks like we need to update the
`EIGEN_INHERIT_ASSIGNMENT_EQUAL_OPERATOR` for newer versions of MSVC as
well when compiling with NVCC. Fixes build issues for VS 2017.
VS2017 doesn't like deducing alias types, leading to a bunch of compile
errors for functions involving the `tuple` alias. Replacing with
`TupleImpl` seems to solve this, allowing the test to compile/pass.
The `Complex.h` file applies equally to HIP/CUDA, so placing under the
generic `GPU` folder.
The `TensorReductionCuda.h` has already been deprecated, now removing
for the next Eigen version.