Since `numeric_limits<half>::max_exponent` is a static inline constant,
it cannot be directly passed by reference. This triggers a linker error
in recent versions of `g++-powerpc64le`.
Changing `half` to take inputs by value fixes this. Wrapping
`max_exponent` with `int(...)` to make an addressable integer also fixes this
and may help with other custom `Scalar` types down the road.
Also eliminated some compile warnings for powerpc.
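To illustrate the underlying issue, a minimal standalone sketch (not Eigen code; `float` stands in for `half` here):
```
#include <iostream>
#include <limits>

void by_ref(const int& e) { std::cout << e << "\n"; }
void by_val(int e)        { std::cout << e << "\n"; }

int main() {
  // Pre-C++17, the line below ODR-uses the static constant and can fail to
  // link if no out-of-class definition is available:
  //   by_ref(std::numeric_limits<float>::max_exponent);

  // Either workaround avoids the ODR-use:
  by_val(std::numeric_limits<float>::max_exponent);       // take by value
  by_ref(int(std::numeric_limits<float>::max_exponent));  // materialize a temporary
  return 0;
}
```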
With !406, we accidentally broke arm 32-bit NEON builds, since
`vsqrt_f32` is only available for 64-bit.
Here we add back the `rsqrt` implementation for 32-bit, relying
on a `prsqrt` implementation with better handling of edge cases.
Note that several of the 32-bit NEON packet tests are currently
failing - either due to denormal handling (NEON versions flush
to zero, but scalar paths don't) or due to accuracy (e.g. sin/cos).
The original implementations will saturate if the input does not fit into an
integer type. Here we fix this by returning the input unchanged when it is too
large to have a fractional part.
Also added `pceil` for NEON.
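Roughly, the scalar version of the fix looks like this (a sketch only, not the actual packet code; the `2^23` threshold assumes single-precision `float`):
```
#include <cmath>
#include <cstdint>

float floor_without_saturation(float x) {
  // Any float with magnitude >= 2^23 has no fractional bits, so there is
  // nothing to round; returning it unchanged avoids the int conversion that
  // used to saturate. NaN and inf also take this path.
  const float kNoFractionLimit = 8388608.0f;  // 2^23
  if (!(std::fabs(x) < kNoFractionLimit)) return x;
  float r = static_cast<float>(static_cast<int32_t>(x));  // truncate toward zero
  return (r > x) ? r - 1.0f : r;                          // adjust negatives down
}
```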
Fixes #1969.
The `std::result_of` meta struct is deprecated in C++17 and removed
in C++20. It was still slipping through due to a faulty definition of
`EIGEN_HAS_STD_RESULT_OF`.
Added a new macro `EIGEN_HAS_STD_INVOKE_RESULT` and
`Eigen::internal::invoke_result` implementation with fallback for
pre C++17.
Replaces the `result_of` definition with one based on `std::invoke_result`
for C++17 and higher.
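A rough sketch of such a wrapper with a pre-C++17 fallback (names and layout are assumptions, not Eigen's exact code):
```
#include <type_traits>

namespace sketch {
#if __cplusplus >= 201703L
template <typename F, typename... Args>
struct invoke_result : std::invoke_result<F, Args...> {};
#else
template <typename F, typename... Args>
struct invoke_result { typedef typename std::result_of<F(Args...)>::type type; };
#endif
}  // namespace sketch

// Example: the result type of calling this functor with an int is double.
struct Square { double operator()(int x) const { return double(x) * x; } };
static_assert(std::is_same<sketch::invoke_result<Square, int>::type, double>::value,
              "Square(int) yields double");
```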
For completeness, added nullary op support for C++03.
Fixes #1850.
Added an `EIGEN_HAS_STD_HASH` macro that checks for C++11 support and that we
are not compiling for GPU.
`std::hash<float>` is not a device function, so cannot be used by
`std::hash<bfloat16>`. Removed `EIGEN_DEVICE_FUNC` and only
define if `EIGEN_HAS_STD_HASH`. Same for `half`.
Added `EIGEN_CUDA_HAS_FP16_ARITHMETIC` to improve readability and to
eliminate warnings about `EIGEN_CUDA_ARCH` not being defined.
Replaced a couple C-style casts with `reinterpret_cast` for aligned
loading of `half*` to `half2*`. This eliminates `-Wcast-align`
warnings in clang. Although not ideal due to potential type aliasing,
this is how CUDA handles these conversions internally.
macOS defines `int64_t` as `long long` even for C++03, and therefore expects
the template specialization `internal::make_unsigned<long long>` in that case.
Since other platforms define `int64_t` as `long` for C++03, we cannot add the
specialization unconditionally.
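The shape of the fix is roughly as follows (a hedged sketch; the guard macro and placement are assumptions, and Eigen's internal `make_unsigned` primary template is assumed to be in scope):
```
// Only Apple/C++03 builds see int64_t as long long, so only they get the
// extra specialization of the internal make_unsigned helper.
#if defined(__APPLE__) && __cplusplus < 201103L
namespace Eigen { namespace internal {
template <> struct make_unsigned<long long> { typedef unsigned long long type; };
}}  // namespace Eigen::internal
#endif
```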
In two places in `SuperLUSupport.h`, a local variable `size` is created that
is used only inside an `eigen_assert`. Remove these locals and fetch the
required values directly inside the assert statements.
This avoids annoying -Wunused warnings (and -Werror=unused errors)
in NDEBUG builds.
The original clamping bounds on `_x` actually produce finite values:
```
exp(88.3762626647950) = 2.40614e+38 < 3.40282e+38
exp(709.437) = 1.27226e+308 < 1.79769e+308
```
so with an accurate `ldexp` implementation, `pexp` fails for large
inputs, producing finite values instead of `inf`.
This adjusts the bounds slightly outside the finite range so that
the output will overflow to +/- `inf` as expected.
The previous implementations produced garbage values if the exponent did
not fit within the exponent bits. See #2131 for a complete discussion,
and !375 for other possible implementations.
Here we implement the 4-factor version. See `pldexp_impl` in
`GenericPacketMathFunctions.h` for a full description.
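A scalar sketch of the 4-factor idea (illustrative only; the packet implementation builds each power of two by bit manipulation rather than calling `std::ldexp`, and the clamp bound here assumes `double`):
```
#include <cmath>

double ldexp_4factor(double a, int e) {
  // Beyond |e| = 2099 the correctly rounded result for any finite nonzero
  // double is already 0 or +/-inf, so clamping keeps every factor finite.
  const int max_e = 2099;
  if (e > max_e) e = max_e;
  if (e < -max_e) e = -max_e;
  // Split e into four parts that each fit comfortably in the exponent range,
  // then apply the factors one by one; intermediate overflow/underflow then
  // saturates to inf/0 in the same cases as the true result (up to possible
  // last-ulp double rounding for subnormal outputs).
  const int q = e / 4, r = e - 3 * q;
  const double f = std::ldexp(1.0, q);
  return ((a * f) * f) * f * std::ldexp(1.0, r);
}
```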
The SSE `pcmp*` methods were moved down since `pcmp_le<Packet4i>`
requires `por`.
A remaining TODO is to delegate to a faster version when we know the exponent
does fit within the exponent bits.
Fixes #2131.
Currently if compiled by NVCC, the `MatrixBase::bdcSvd()` implementation
is skipped, leading to a linker error. This prevents it from running on
the host as well.
Seems it was disabled 6 years ago (5384e891) to match `jacobiSvd`, but
`jacobiSvd` is now enabled on host. Tested and runs fine on host, but
will not compile/run for device (though it's not labelled as a device
function, so this should be fine).
Fixes #2139.
We are potentially seeing some accuracy issues with these. Ideally we
would hand off to `float`, but that's not trivial with the current
setup.
We may want to consider adding `ppow<Packet>` and `HasPow`, so
implementations can more easily specialize this.
Clang does a poor job of optimizing the GEBP microkernel on 32-bit ARM,
leading to excessive 16-byte register spills, slowing down basic f32
matrix multiplication by approx 50%.
By specializing `gebp_traits`, we can eliminate the register spills.
Volatile inline ASM both acts as a barrier to prevent reordering and
enforces strict register use. In a simple f32 matrix multiply example,
this modification reduces 16-byte spills from 109 instances to zero,
leading to a 1.5x speed increase (search for `16-byte Spill` in the
assembly in https://godbolt.org/z/chsPbE).
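A hedged sketch of the general trick (not the actual `gebp_traits` specialization):
```
#include <arm_neon.h>

// Fused multiply-accumulate with an empty volatile asm that ties the
// accumulator to a NEON register ("w" constraint) and acts as a scheduling
// barrier, which discourages clang from spilling it to the stack.
inline float32x4_t madd_no_spill(float32x4_t acc, float32x4_t a, float32x4_t b) {
  acc = vmlaq_f32(acc, a, b);   // acc += a * b
#if defined(__clang__) && defined(__ARM_NEON) && !defined(__aarch64__)
  asm volatile("" : "+w"(acc));
#endif
  return acc;
}
```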
This is a replacement of !379. See there for further discussion.
Also moved `gebp_traits` specializations for NEON to
`Eigen/src/Core/arch/NEON/GeneralBlockPanelKernel.h` to be alongside
other NEON-specific code.
Fixes #2138.
Unfortunately `std::bit_and` and the like are host-only functions prior
to c++14 (since they are not `constexpr`). They also never exist in the
global namespace, so the current implementation always fails to compile with
NVCC, since `EIGEN_USING_STD` tries to import the symbol from the global
namespace on device.
To overcome these limitations, we implement these functionals here.
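A minimal sketch of such a functional (assumed form; the real definitions live in Eigen's internal namespace and cover `bit_or`, `bit_xor`, and `bit_not` as well):
```
#if defined(__CUDACC__) || defined(__HIPCC__)
#define SKETCH_DEVICE_FUNC __host__ __device__
#else
#define SKETCH_DEVICE_FUNC
#endif

template <typename T>
struct bit_and_sketch {
  SKETCH_DEVICE_FUNC T operator()(const T& a, const T& b) const {
    return a & b;  // usable in device code, unlike std::bit_and before C++14
  }
};
```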
Allows the altivec packetmath tests to pass. There were a few issues:
- `pstoreu` was missing MSQ on `_BIG_ENDIAN` systems
- `cmp_*` didn't properly handle conversion of bool flags (0x7FC instead
of 0xFFFF)
- `pfrexp` needed to set the `exponent` argument.
Related to !370, #2128
cc: @ChipKerchner @pdrocaldeira
Tested on `_BIG_ENDIAN` running on QEMU with VSX. Couldn't figure out build
flags to get it to work for little endian.
Originating from
[this SO issue](https://stackoverflow.com/questions/65901014/how-to-solve-this-all-error-2-in-this-case),
some win32 compilers define `__int32` as a `long`, but MinGW defines
`std::int32_t` as an `int`, leading to a type conflict.
To avoid this, we remove the custom `typedef` definitions for win32. The
Tensor module requires C++11 anyway, so we are guaranteed to have
included `<cstdint>` already in `Eigen/Core`.
Also re-arranged the headers to only include `<cstdint>` in one place to
avoid this type of error again.
The new `generic_pow` implementation was failing for half/bfloat16 since
their construction from int/float is not `constexpr`. Modified the
implementation in `GenericPacketMathFunctions` to remove `constexpr`.
While adding tests for half/bfloat16, found other issues related to
implicit conversions.
Also needed to implement `numext::arg` for non-integer, non-complex,
non-float/double/long double types. These seem to be implicitly
converted to `std::complex<T>`, which then fails for half/bfloat16.
NVCC and older versions of clang do not fully support `std::complex` on device,
leading to either compile errors (Cannot call `__host__` function) or worse,
runtime errors (Illegal instruction). For most functions, we can
implement specialized `numext` versions. Here we specialize the standard
operators (with the exception of stream operators and member function operators
with a scalar that are already specialized in `<complex>`) so they can be used
in device code as well.
To import these operators into the current scope, use
`EIGEN_USING_STD_COMPLEX_OPERATORS`. By default, these are imported into
the `Eigen`, `Eigen::internal`, and `Eigen::numext` namespaces.
This allows us to remove the specializations of the
sum/difference/product/quotient ops, and to treat complex
numbers like most other scalars (e.g. in tests).
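A heavily simplified sketch of the approach for one operator and one scalar type (assumed form; the real header generates all arithmetic and comparison operators per scalar type via macros, and device-side `real()`/`imag()` access relies on relaxed constexpr support):
```
#include <complex>

#if defined(__CUDACC__) || defined(__HIPCC__)
#define SKETCH_DEVICE_FUNC __host__ __device__
#else
#define SKETCH_DEVICE_FUNC
#endif

namespace sketch {
// A non-template overload for a concrete scalar type is preferred over the
// std:: template during overload resolution, so `a + b` in device code picks
// this device-safe version once it is brought into scope.
SKETCH_DEVICE_FUNC inline std::complex<float> operator+(const std::complex<float>& a,
                                                        const std::complex<float>& b) {
  return std::complex<float>(a.real() + b.real(), a.imag() + b.imag());
}
}  // namespace sketch
```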
This patch adds support for Arm's new vector extension SVE (Scalable Vector Extension). In contrast to other vector extensions that are supported by Eigen, SVE types are inherently *sizeless*. For the use in Eigen we fix their size at compile-time (note that this is not necessary in general, SVE is *length agnostic*).
During compilation the flag `-msve-vector-bits=N` has to be set, where `N` is a power of two in the range of `128` to `2048`, indicating the length of an SVE vector.
Since SVE is rather young, we decided to disable it by default even if it would be available. A user has to enable it explicitly by defining `EIGEN_ARM64_USE_SVE`.
This patch introduces the packet types `PacketXf` and `PacketXi` for packets of `float` and `int32_t` respectively. The size of these packets depends on the SVE vector length. E.g. if `-msve-vector-bits=512` is set, `PacketXf` will contain `512/32 = 16` elements.
This MR is joint work with Miguel Tairum <miguel.tairum@arm.com>.
The recent addition of vectorized pow (!330) relies on `pfrexp` and
`pldexp`. This was missing for `Eigen::half` and `Eigen::bfloat16`.
Adding tests for these packet ops also exposed an issue with handling
negative values in `pfrexp`, returning an incorrect exponent.
Added the missing implementations, corrected the exponent in `pfrexp`,
and added `packetmath` tests.
Hex literals are interpreted as unsigned, leading to a comparison between the
signed maximum supported function ID `abcd[0]` (which was negative) and the
unsigned literal `0x80000006`. This should not change the result, since signed
is implicitly converted to unsigned for the comparison, but it eliminates the
warning.
I ran some testing (comparing to `std::pow(double(x), double(y))`) for `x` in the set of all (positive) floats in the interval `[std::sqrt(std::numeric_limits<float>::min()), std::sqrt(std::numeric_limits<float>::max())]` and `y` in `{2, sqrt(2), -sqrt(2)}`, and I get the following error statistics:
```
max_rel_error = 8.34405e-07
rms_rel_error = 2.76654e-07
```
If I widen the range to all normal floats I see lower accuracy for arguments where the result is subnormal, e.g. for `y = sqrt(2)`:
```
max_rel_error = 0.666667
rms = 6.8727e-05
count = 1335165689
argmax = 2.56049e-32, 2.10195e-45 != 1.4013e-45
```
which seems reasonable, since these results are subnormals with only a couple of significant bits left.
Apparently `inf` is a macro on iOS for `std::numeric_limits<T>::infinity()`,
causing a compile error here. We don't need the local anyway since it's
only used in one spot.
Upon investigation, `JacobiSVD` is significantly faster than `BDCSVD`
for small matrices (twice as fast for 2x2, 20% faster for 3x3,
1% faster for 10x10). Since the majority of cases will be small,
let's stick with `JacobiSVD`. See !361.
In the previous code, in attempting to correct for a negative
determinant, we end up multiplying and dividing by a number that
is often very near, but not exactly +/-1. By flushing to +/-1,
we can replace a division with a multiplication, and results
are more numerically consistent.
The following commit breaks ROCm support for Eigen
f149e0ebc3
All unit tests fail with the following error
```
Building HIPCC object test/CMakeFiles/gpu_basic.dir/gpu_basic_generated_gpu_basic.cu.o
In file included from /home/rocm-user/eigen/test/gpu_basic.cu:19:
In file included from /home/rocm-user/eigen/test/main.h:356:
In file included from /home/rocm-user/eigen/Eigen/QR:11:
In file included from /home/rocm-user/eigen/Eigen/Core:166:
/home/rocm-user/eigen/Eigen/src/Core/MathFunctionsImpl.h:105:35: error: __host__ __device__ function 'complex_sqrt' cannot overload __host__ function 'complex_sqrt'
EIGEN_DEVICE_FUNC std::complex<T> complex_sqrt(const std::complex<T>& z) {
^
/home/rocm-user/eigen/Eigen/src/Core/MathFunctions.h:342:38: note: previous declaration is here
template<typename T> std::complex<T> complex_sqrt(const std::complex<T>& a_x);
^
1 error generated when compiling for gfx900.
CMake Error at gpu_basic_generated_gpu_basic.cu.o.cmake:192 (message):
Error generating file
/home/rocm-user/eigen/build/test/CMakeFiles/gpu_basic.dir//./gpu_basic_generated_gpu_basic.cu.o
test/CMakeFiles/gpu_basic.dir/build.make:63: recipe for target 'test/CMakeFiles/gpu_basic.dir/gpu_basic_generated_gpu_basic.cu.o' failed
make[3]: *** [test/CMakeFiles/gpu_basic.dir/gpu_basic_generated_gpu_basic.cu.o] Error 1
CMakeFiles/Makefile2:16618: recipe for target 'test/CMakeFiles/gpu_basic.dir/all' failed
make[2]: *** [test/CMakeFiles/gpu_basic.dir/all] Error 2
CMakeFiles/Makefile2:16625: recipe for target 'test/CMakeFiles/gpu_basic.dir/rule' failed
make[1]: *** [test/CMakeFiles/gpu_basic.dir/rule] Error 2
Makefile:5401: recipe for target 'gpu_basic' failed
make: *** [gpu_basic] Error 2
```
The error message is accurate, and the fix (provided in this commit) is trivial.
MSVC incorrectly handles `inf` cases for `std::sqrt<std::complex<T>>`.
Here we replace it with a custom version (currently used on GPU).
Also fixed the `packetmath` test, which previously skipped several
corner cases since `CHECK_CWISE1` only tests the first `PacketSize`
elements.
Since `eigen_assert` is a macro, the statements can become noops (e.g.
when compiling for GPU), so they may not execute the contained logic -- which
in this case is the entire `Ref` construction. We need to separate the assert
from statements which have consequences.
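For illustration only (plain `assert` in place of `eigen_assert`; names are made up):
```
#include <cassert>

int global_value = 0;
bool construct(int v) { global_value = v; return true; }  // work with a side effect

void broken() { assert(construct(42)); }  // whole call vanishes under NDEBUG

void fixed() {
  const bool ok = construct(42);          // the construction always runs
  assert(ok);
  (void)ok;                               // silence -Wunused in NDEBUG builds
}
```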
Fixes #2113.
The existing `Ref` class failed to consider cases where the Ref's
`Stride` setting *could* match the underlying referred object's stride,
but **didn't** at runtime. This led to trying to set invalid stride values,
causing runtime failures in some cases, and garbage due to mismatched
strides in others.
Here we add the missing runtime checks. This involves computing the
strides necessary to align with the referred object's storage, and
verifying we can actually set those strides at runtime.
In the `const` case, if it *may* be possible to refer to the original
storage at compile-time but fails at runtime, then we defer to the
`construct(...)` method that makes a copy.
Added more tests to check these cases.
Fixes #2093.
This is to support scalar `sqrt` of complex numbers `std::complex<T>` on
device, requested by Tensorflow folks.
Technically `std::complex` is not supported by NVCC on device
(though it is by clang), so the default `sqrt(std::complex<T>)` function only
works on the host. Here we create an overload to add back the
functionality.
Also modified the CMake file to add the `--expt-relaxed-constexpr` (or
equivalent) flag for NVCC to allow calling constexpr functions from
device functions, and added support for specifying the compute architecture
for NVCC (this was already available for clang).
For these to exist we would need to define `_USE_MATH_DEFINES` before
`cmath` or `math.h` is first included. However, we don't
control the include order for projects outside Eigen, so even defining
the macro in `Eigen/Core` does not fix the issue for projects that
end up including `<cmath>` before Eigen does (explicitly or transitively).
To fix this, we define `EIGEN_LOG2E` and `EIGEN_LN2` ourselves.
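The defines are, in spirit (digits shown here are a correct truncation, though the exact precision used may differ):
```
// log2(e) and ln(2), provided unconditionally so no include-order or
// _USE_MATH_DEFINES dance is needed.
#ifndef EIGEN_LOG2E
#define EIGEN_LOG2E 1.442695040888963407359924681001892137
#endif
#ifndef EIGEN_LN2
#define EIGEN_LN2 0.693147180559945309417232121458176568
#endif
```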
The following commit introduced a breakage in ROCm/HIP support for Eigen.
5ec4907434
```
Building HIPCC object test/CMakeFiles/gpu_basic.dir/gpu_basic_generated_gpu_basic.cu.o
In file included from /home/rocm-user/eigen/test/gpu_basic.cu:20:
In file included from /home/rocm-user/eigen/test/main.h:356:
In file included from /home/rocm-user/eigen/Eigen/QR:11:
In file included from /home/rocm-user/eigen/Eigen/Core:222:
/home/rocm-user/eigen/Eigen/src/Core/arch/GPU/PacketMath.h:556:10: error: use of undeclared identifier 'half2half2'; did you mean '__half2half2'?
return half2half2(from);
^~~~~~~~~~
__half2half2
/opt/rocm/hip/include/hip/hcc_detail/hip_fp16.h:547:21: note: '__half2half2' declared here
__half2 __half2half2(__half x)
^
1 error generated when compiling for gfx900.
```
The cause seems to be a copy-paste error, and the fix is trivial.
The previous code had `__host__ __device__` functions calling `__device__`
functions (e.g. `__low2half`) which caused build failures in tensorflow.
Also tried to simplify the `#ifdef` guards to make them more clear.
In the current `dense_assignment_loop` implementations, if the
destination's inner or outer size is zero at compile time and if the kernel
involves a product, we currently get a compile error (#2080). This is
triggered by attempting to multiply a non-existent row by a column (or
vice-versa).
To address this, we add a specialization for zero-sized assignments
(`AllAtOnceTraversal`) which evaluates to a no-op. We also add a static
check to ensure the size is in fact zero. This now seems to be the only
existing use of `AllAtOnceTraversal`.
Fixes #2080.
Removed redundant checks and redundant code for CUDA/HIP.
Note: there are several issues here of calling `__device__` functions
from `__host__ __device__` functions, in particular `__low2half`.
We do not address that here -- only modifying this file enough
to get our current tests to compile.
Fixes #1847.
Current implementations fail to consider half-float packets, only
half-float scalars. Added specializations for packets on AVX, AVX512 and
NEON. Added tests to `special_packetmath`.
The current `special_functions` tests would fail for half and bfloat16 due to
lack of precision. The NEON tests also fail with precision issues and
due to different handling of `sqrt(inf)`, so the Bessel and `ndtri` special
functions have been disabled.
Tested with AVX, AVX512.
The `shfl*` functions are `__device__` only; the `#ifdef`s were adjusted so
that our versions are defined whenever the corresponding CUDA/HIP ones are.
Also changed the HIP/CUDA<9.0 versions to cast to int instead of
doing the conversion `half`<->`float`.
Fixes #2083.
Adding the term e*ln(2) is split into two steps for no obvious reason.
This dates back to the original Cephes code from which the algorithm is adapted.
It appears that this was done in Cephes to prevent the compiler from reordering
the addition of the 3 terms in the approximation
log(1+x) ~= x - 0.5*x^2 + x^3*P(x)/Q(x)
which must be added in reverse order since |x| < (sqrt(2)-1).
This allows rewriting the code to just 2 pmadd and 1 padd instructions,
which on a Skylake processor speeds up the code by 5-7%.
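In terms of Eigen's packet primitives the rearranged tail looks roughly like this (a sketch; variable names are assumptions, and the polynomial term `y = x^3*P(x)/Q(x)` is computed earlier in the routine):
```
#include <Eigen/Core>

// x: reduced argument (mantissa - 1), x2 = x*x, y = x^3*P(x)/Q(x), e: exponent.
// Result: e*ln(2) + x + y - 0.5*x^2, folded into 2 pmadd + 1 padd.
template <typename Packet>
Packet plog_tail_sketch(const Packet& x, const Packet& x2,
                        const Packet& y, const Packet& e) {
  using namespace Eigen::internal;
  const Packet cst_neg_half = pset1<Packet>(-0.5f);
  const Packet cst_ln2      = pset1<Packet>(0.693147180559945f);
  Packet r = pmadd(cst_neg_half, x2, y);  // y - 0.5*x^2
  r = padd(x, r);                         // ... + x
  return pmadd(e, cst_ln2, r);            // ... + e*ln(2)
}
```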
The current impl corrupts the comparison masks when converting
from float back to bfloat16. The resulting masks are then
no longer all zeros or all ones, which breaks when used with
`pselect` (e.g. in `pmin<PropagateNumbers>`). This was
causing `packetmath_15` to fail on arm.
Introducing a simple `F32MaskToBf16Mask` corrects this (takes
the lower 16-bits for each float mask).
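A scalar sketch of the idea (the real `F32MaskToBf16Mask` operates on NEON packets):
```
#include <cstdint>
#include <cstring>

// A float comparison mask lane is all-zeros or all-ones, so its low 16 bits
// are already a valid bfloat16 mask. A value conversion from float to
// bfloat16 would instead mangle the bit pattern.
uint16_t f32_mask_to_bf16_mask(float mask_lane) {
  uint32_t bits;
  std::memcpy(&bits, &mask_lane, sizeof(bits));   // reinterpret, don't convert
  return static_cast<uint16_t>(bits & 0xFFFFu);
}
```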
Prior to this fix, `TensorContractionGpu` and the `cxx11_tensor_of_float16_gpu`
test are broken, as well as several ops in Tensorflow. The gpu functions
`__shfl*` became ambiguous now that `Eigen::half` implicitly converts to float.
Here we add the required specializations.
`bit_cast` cannot be `constexpr`, so we need to remove `EIGEN_CONSTEXPR` from
`raw_half_as_uint16(...)`. This shouldn't affect anything else, since
it is only used in `bit_cast<uint16_t,half>()`, which is not itself
`constexpr`.
Fixes #2077.
This allows the `packetmath` tests to pass for AVX512 on skylake.
Made `half` and `bfloat16` consistent in terms of ops they support.
Note the `log` tests are currently disabled for `bfloat16` since
they fail due to poor precision (they were previously disabled for
`Packet8bf` via test function specialization -- I just removed that
specialization and disabled it in the generic test).
The `half_float` test was failing with `-mcpu=cortex-a55` (native `__fp16`) due
to a bad NaN bit-pattern comparison (in the case of casting a float to `__fp16`,
the signaling `NaN` is quieted). There was also an inconsistency between
`numeric_limits<half>::quiet_NaN()` and `NumTraits::quiet_NaN()`. Here we
correct the inconsistency and compare NaNs according to the IEEE 754
definition.
Also modified the `bfloat16_float` test to match.
Tested with `cortex-a53` and `cortex-a55`.
This fixes some gcc warnings such as:
```
Eigen/src/Core/GenericPacketMath.h:655:63: warning: implicit conversion turns floating-point number into bool: 'typename __gnu_cxx::__enable_if<__is_integer<bool>::__value, double>::__type' (aka 'double') to 'bool' [-Wimplicit-conversion-floating-point-to-bool]
Packet psqrt(const Packet& a) { EIGEN_USING_STD(sqrt); return sqrt(a); }
```
Details:
- Added `scalar_sqrt_op<bool>` (`-Wimplicit-conversion-floating-point-to-bool`).
- Added `scalar_square_op<bool>` and `scalar_cube_op<bool>`
  specializations (`-Wint-in-bool-context`).
- Deprecated the above specialized ops for bool.
- Modified `cxx11_tensor_block_eval` to specialize generator for
booleans (`-Wint-in-bool-context`) and to use `abs` instead of `square` to
avoid deprecated bool ops.
Minimal implementation of AVX `Eigen::half` ops to bring in line
with `bfloat16`. Allows `packetmath_13` to pass.
Also adjusted `bfloat16` packet traits to match the supported set
of ops (e.g. Bessel is not actually implemented).
The AVX half implementation is incomplete, causing the `packetmath_13` test
to fail. This disables the test.
Also refactored the existing AVX implementation to use `bit_cast`
instead of direct access to `.x`.
A missing `inline` breaks the blas build, since the symbol gets generated in
`complex_single.cpp`, `complex_double.cpp`, `single.cpp`, and `double.cpp`.
Changed the rest of the inlines to `EIGEN_STRONG_INLINE`.
Both `Eigen::half` and `Eigen::bfloat16` are implicitly convertible to
`float` and can hence be converted to `bool` via the conversion chain
`Eigen::{half,bfloat16} -> float -> bool`.
We thus remove the explicit cast operator to `bool`.
Multiplication of column-major `DynamicSparseMatrix`es involves three
temporaries:
- two for transposing twice to sort the coefficients
(`ConservativeSparseSparseProduct.h`, L160-161)
- one for a final copy assignment (`SparseAssign.h`, L108)
The latter is avoided in an optimization for `SparseMatrix`.
Since `DynamicSparseMatrix` is deprecated in favor of `SparseMatrix`, it's not
worth the effort to optimize further, so I simply disabled counting
temporaries via a macro.
Note that due to the inclusion of `sparse_product.cpp`, the `sparse_extra`
tests actually re-run all the original `sparse_product` tests as well.
We may want to simply drop the `DynamicSparseMatrix` tests altogether, which
would eliminate the test duplication.
Related to #2048
The existing `TensorRandom.h` implementation makes the assumption that
`half` (`bfloat16`) has a `uint16_t` member `x` (`value`), which is not
always true. This currently fails on arm64, where `x` has type `__fp16`.
Added `bit_cast` specializations to allow casting to/from `uint16_t`
for both `half` and `bfloat16`. Also added tests in
`half_float`, `bfloat16_float`, and `cxx11_tensor_random` to catch
these errors in the future.
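The underlying idea is a memcpy-based reinterpretation, roughly as follows (a sketch; Eigen's actual `numext::bit_cast` and its specializations differ in detail):
```
#include <cstdint>
#include <cstring>

// Works regardless of whether the 16-bit storage is uint16_t or __fp16,
// because only the bit pattern is copied, never value-converted.
template <typename To, typename From>
To bit_cast_sketch(const From& src) {
  static_assert(sizeof(To) == sizeof(From), "bit_cast requires equal sizes");
  To dst;
  std::memcpy(&dst, &src, sizeof(To));
  return dst;
}
```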
When calling `internal::cast<S, std::complex<T>>(x)`, clang often
generates an implicit conversion warning due to an implicit cast
from type `S` to `T`. This currently affects the following tests:
- `basicstuff`
- `bfloat16_float`
- `cxx11_tensor_casts`
The implicit cast leads to widening/narrowing float conversions.
Widening warnings only seem to be generated by clang (`-Wdouble-promotion`).
To eliminate the warning, we explicitly cast the real-component first
from `S` to `T`. We also adjust tests to use `internal::cast` instead
of `static_cast` when a complex type may be involved.
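The pattern of the fix is roughly the following (a simplified sketch of the idea, not the exact `cast_impl` code):
```
#include <complex>

// Cast the real component explicitly from S to T before constructing the
// complex value, so any narrowing/widening is intentional and warning-free.
template <typename T, typename S>
std::complex<T> cast_to_complex(const S& x) {
  return std::complex<T>(static_cast<T>(x), T(0));
}
```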