We currently have plenty of type definitions with the alignment
qualifier coming after the type. The compiler warns about ignoring
them:
int EIGEN_ALIGN16 ai[4];
Turn this into:
EIGEN_ALIGN16 int ai[4];
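A minimal sketch of the qualifier-first placement, using standard `alignas` as a stand-in. `SKETCH_ALIGN16` and the helper functions are hypothetical names; the real `EIGEN_ALIGN16` expands to a compiler-specific attribute that is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for EIGEN_ALIGN16, for illustration only.
#define SKETCH_ALIGN16 alignas(16)

// Qualifier placed before the type, as in the rewritten declarations.
SKETCH_ALIGN16 static int ai[4] = {0, 0, 0, 0};

// True if `p` sits on a 16-byte boundary.
bool is_aligned16(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

bool global_array_is_aligned() { return is_aligned16(ai); }

bool local_array_is_aligned() {
  SKETCH_ALIGN16 int local[4] = {1, 2, 3, 4};
  return is_aligned16(local);
}
```

Checks such as `assert(global_array_is_aligned())` can be used to confirm the request was honoured rather than ignored.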
- The current implementation computes `size + total_threads`, which can
overflow and cause CUDA_ERROR_ILLEGAL_ADDRESS when size is close to
the maximum representable value.
- The num_blocks calculation can also overflow due to the implementation
of divup().
- This patch prevents these overflows and allows the kernel to work
correctly for the full representable range of tensor sizes.
- Also adds relevant tests.
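As an illustration of the overflow described above, here is a small host-side sketch. It assumes `divup()` has the common `(x + y - 1) / y` form; `naive_divup` and `safe_divup` are hypothetical names, not the functions touched by this patch.

```cpp
#include <cassert>
#include <cstdint>

// Assumed shape of the problematic computation: x + y - 1 can wrap around
// when x is close to the maximum representable value.
template <typename T>
T naive_divup(T x, T y) {
  assert(y != 0);
  return (x + y - 1) / y;
}

// Overflow-free alternative: never forms a value larger than x.
template <typename T>
T safe_divup(T x, T y) {
  assert(y != 0);
  return x / y + (x % y != 0 ? 1 : 0);
}
```

With 32-bit unsigned indices, a size of `4294967295` and `1024` threads per block, the naive form wraps around and yields `0` blocks, while the safe form yields `4194304`.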
Fixes compiler errors in expressions that look like
Eigen::Matrix<Eigen::half, 3, 1>::Random().maxCoeff()
The error comes from the code that creates the initial value for
vectorized reductions. The fix is to specify the scalar type of the
reduction's initial value.
The change is necessary for Eigen::half because, unlike other types,
Eigen::half scalars cannot be implicitly created from integers.
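A sketch of the failure mode, using a hypothetical `HalfLike` type as a stand-in for a scalar that is only explicitly constructible from integers. `reduction_initial_value` is an illustrative name, not the Eigen function that was changed.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in: constructible from int only via an explicit constructor.
struct HalfLike {
  float value;
  explicit HalfLike(int v) : value(static_cast<float>(v)) {}
};

static_assert(!std::is_convertible<int, HalfLike>::value,
              "no implicit int -> HalfLike conversion");
static_assert(std::is_constructible<HalfLike, int>::value,
              "explicit construction from int is allowed");

// Hypothetical initial-value helper for a reduction.
template <typename Scalar>
Scalar reduction_initial_value() {
  // `return 0;` would fail to compile for HalfLike, since that needs an
  // implicit conversion. Naming the scalar type makes the construction explicit.
  return Scalar(0);
}
```

The same helper keeps working for built-in types, e.g. `assert(reduction_initial_value<double>() == 0.0)`.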
For vectorized 1-dimensional inputs that do not take the special
blocking path (e.g. `std::complex<...>`), there was an
index-out-of-bounds error causing the broadcast size to be
computed incorrectly. Here we fix this, and make other minor
cleanup changes.
Fixes #2351.
&& and || short-circuit, & and | don't. When both arguments to & or |
are boolean, the short-circuiting version is usually the desired one, so
clang warns on this.
Here, it is inconsequential, so switch to && and || to suppress the warning.
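The difference in operand evaluation can be checked with a small counter. This is a standalone sketch; `counted` and the two helper functions are hypothetical and unrelated to the code touched here.

```cpp
#include <cassert>

// Number of times `counted` has been evaluated since the last reset.
static int g_calls = 0;

// Returns `v` unchanged, recording that the operand was evaluated.
bool counted(bool v) {
  ++g_calls;
  return v;
}

// Logical AND: the right operand is skipped once the left one is false.
int evaluations_with_logical_and() {
  g_calls = 0;
  const bool r = counted(false) && counted(true);
  assert(!r);
  return g_calls;
}

// Bitwise AND on bools: both operands are always evaluated.
// This is the kind of expression clang warns about.
int evaluations_with_bitwise_and() {
  g_calls = 0;
  const bool r = counted(false) & counted(true);
  assert(!r);
  return g_calls;
}
```

When the operands have no side effects, as in this change, both forms give the same result and only the warning differs.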
To elide the memcpy, we need to first load the `src` value into
registers by making a local copy. This avoids the need to resort
to potential UB by using `reinterpret_cast`.
This change doesn't seem to affect CPU (at least not with gcc/clang).
With optimizations on, the copy is also elided.
For moderately sized inputs, running the Tree reduction quickly
fills/overflows the GPU thread stack space, leading to memory errors.
This was happening in the `cxx11_tensor_complex_gpu` test, for example.
Disabling tree reduction on GPU fixes this.
Looks like we need to update the
`EIGEN_INHERIT_ASSIGNMENT_EQUAL_OPERATOR` for newer versions of MSVC as
well when compiling with NVCC. Fixes build issues for VS 2017.
VS2017 doesn't like deducing alias types, leading to a bunch of compile
errors for functions involving the `tuple` alias. Replacing with
`TupleImpl` seems to solve this, allowing the test to compile/pass.
The `Complex.h` file applies equally to HIP/CUDA, so placing under the
generic `GPU` folder.
The `TensorReductionCuda.h` has already been deprecated, now removing
for the next Eigen version.
Replaced deprecated `DetermineVSServicePack` macro with recommended
`CMAKE_CXX_COMPILER_VERSION`.
Deleted custom `OSVersion` detection. The windows-specific code is
highly outdated, and on other systems simply returns `CMAKE_SYSTEM`.
We will get values like `windows-10.0.17763`, but this is preferable
to `unknownwin`, and saves us needing to maintain a separate cmake file.
The original test times out after 60 minutes on Windows, even when
setting flags to optimize for speed. Reducing the number of
contractions performed from 3600 to 27 for subtests 8 and 9 allows the
two to run in just over a minute each.
Some checks used incorrect values, partly from copy-paste errors,
partly from the change in behaviour introduced in !398.
Modified results to match scipy, simplified tests by updating
`VERIFY_IS_CWISE_APPROX` to work for scalars.
MSVC does not support specializing compound assignments for
`std::complex`, since it already specializes them (contrary to the
standard).
Trying to use one of these on device will currently lead to a
duplicate definition error. This is still probably preferable
to no error though. If we remove the definitions for MSVC, then
it will compile, but the kernel will fail silently.
The only proper solution would be to define our own custom `Complex`
type.
Without this flag, when compiling with nvcc, if the compute architecture of a card does
not exactly match any of those listed for `-gencode arch=compute_<arch>,code=sm_<arch>`,
then the kernel will fail to run with:
```
cudaErrorNoKernelImageForDevice: no kernel image is available for execution on the device.
```
This can happen, for example, when compiling with an older cuda version
that does not support a newer architecture (e.g. T4 is `sm_75`, but cuda
9.2 only supports up to `sm_70`).
With the `-arch=<arch>` flag, the code will compile and run at the
supplied architecture.
- Unify test/CMakeLists.txt and unsupported/test/CMakeLists.txt
- Added `EIGEN_CUDA_FLAGS` that are appended to the set of flags passed
to the cuda compiler (nvcc or clang).
The latter is to support passing custom flags (e.g. `-arch=` to nvcc,
or to disable cuda-specific warnings).
The 2979 warning is yet another "calling a `__host__` function from a
`__host__ __device__` function" warning. Although we should probably
address these eventually, they are flooding the logs. Most of these are
harmless since we only call the original from the host.
In cases where these are actually called from device, an error is generated
instead anyway.
The 2977 warning is a bit strange - although the warning suggests the
`__device__` annotation is ignored, this doesn't actually seem to be
the case. Without the `__device__` declarations, the kernel actually
fails to run when attempting to construct such objects. Again,
these warnings are flooding the logs, so disabling for now.
reinterpret_cast between unrelated types is undefined behavior and leads
to misoptimizations on some platforms.
Use the safer (and faster) version via bit_cast.
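For reference, a minimal sketch of the usual `memcpy`-based idiom behind such a cast. `bit_cast_sketch` and `float_bits` are hypothetical names, and this is not claimed to be Eigen's actual implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Reinterprets the object representation of `src` as a `Tgt` without
// aliasing through an unrelated pointer type.
template <typename Tgt, typename Src>
Tgt bit_cast_sketch(const Src& src) {
  static_assert(sizeof(Tgt) == sizeof(Src), "sizes must match");
  static_assert(std::is_trivially_copyable<Src>::value, "Src must be trivially copyable");
  static_assert(std::is_trivially_copyable<Tgt>::value, "Tgt must be trivially copyable");
  Tgt tgt;
  std::memcpy(&tgt, &src, sizeof(Tgt));  // compilers typically lower this to a register move
  return tgt;
}

// Convenience wrapper: raw bits of a float.
std::uint32_t float_bits(float f) { return bit_cast_sketch<std::uint32_t>(f); }
```

On platforms with IEEE-754 `float`, `assert(float_bits(1.0f) == 0x3f800000u)` holds.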
clang-tidy: Return type 'const T' is 'const'-qualified at the top level,
which may reduce code readability without improving const correctness
The types are somewhat long, but the affected return types are of the form:
```
const T my_func() { /**/ }
```
Change to:
```
T my_func() { /**/ }
```
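As a side note beyond the clang-tidy message, a top-level `const` on a class-type return value also keeps the result from binding to move operations. The sketch below uses a hypothetical `Tracker` type to count which assignment operator is selected.

```cpp
#include <cassert>

// Hypothetical type that records how it was assigned to.
struct Tracker {
  int copies = 0;
  int moves = 0;
  Tracker() = default;
  Tracker(const Tracker&) = default;
  Tracker(Tracker&&) = default;
  Tracker& operator=(const Tracker&) { ++copies; return *this; }
  Tracker& operator=(Tracker&&) { ++moves; return *this; }
};

const Tracker make_const() { return Tracker(); }  // const-qualified return
Tracker make_plain() { return Tracker(); }        // plain return

// A const rvalue cannot bind to Tracker&&, so copy assignment is chosen.
int copies_when_assigning_from_const_return() {
  Tracker t;
  t = make_const();
  return t.copies;
}

// A non-const rvalue binds to Tracker&&, so move assignment is chosen.
int moves_when_assigning_from_plain_return() {
  Tracker t;
  t = make_plain();
  return t.moves;
}
```

For example, `assert(copies_when_assigning_from_const_return() == 1)` holds, while the plain variant performs a move instead.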
These names are so common that, IMO, they should not exist directly in the
`Eigen::` namespace. Their presence there prevents us from using `last` or
`all` as names for parameters or local variables without triggering
warnings about shadowing or hiding the global values. Many external
projects (and our own examples) also heavily use
```
using namespace Eigen;
```
which means these conflict with external libraries as well, e.g.
`std::fill(first,last,value)`.
It seems originally these were placed in a separate namespace
`Eigen::placeholders`, which has since been deprecated. I propose
to un-deprecate this, and restore the original locations.
These symbols are also imported into `Eigen::indexing`, which
additionally imports `fix` and `seq`. An alternative is to remove the
`placeholders` namespace and stick with `indexing`.
NOTE: this is an API-breaking change.
Fixes #2321.
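A rough sketch of the intended effect, using a hypothetical `mylib` namespace in place of Eigen. The names `last_t`, `eval`, `sum_range` and `last_index` are illustrative only and do not mirror the real declarations.

```cpp
#include <cassert>
#include <cstddef>

namespace mylib {  // hypothetical library, standing in for Eigen
namespace placeholders {
struct last_t {};
static const last_t last{};  // symbolic "last index" placeholder
}  // namespace placeholders

// Resolves the placeholder against a concrete size.
inline std::size_t eval(placeholders::last_t, std::size_t size) { return size - 1; }
}  // namespace mylib

using namespace mylib;  // common in user code; no longer pulls in `last`

// A parameter named `last` does not hide or collide with a library symbol.
inline int sum_range(const int* first, const int* last) {
  int s = 0;
  for (; first != last; ++first) s += *first;
  return s;
}

// Callers opt in to the placeholder explicitly.
inline std::size_t last_index(std::size_t size) {
  return eval(placeholders::last, size);
}
```

With this layout, `assert(last_index(5) == 4)` holds and `sum_range` compiles without shadowing warnings.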
This is in preparation for adding GPU tests to the CI, allowing
us to limit building/testing to GPU-specific tests for a given
GPU-capable runner.
GPU tests are tagged with the label "gpu". The new targets
```
make buildtests_gpu
make check_gpu
```
allow building and running only the gpu tests.