eigen

mirror of https://gitlab.com/libeigen/eigen.git synced 2024-12-21 07:19:46 +08:00

Author	SHA1	Message	Date
rgreenblatt	fdf2ee62c5	Fix missing EIGEN_DEVICE_FUNC	2020-12-20 23:22:53 -05:00
Rasmus Munk Larsen	05754100fe	* Add iterative psqrt<double> for AVX and SSE when FMA is available. This provides a ~10% speedup. * Write iterative sqrt explicitly in terms of pmadd. This gives up to 7% speedup for psqrt<float> with AVX & SSE with FMA. * Remove iterative psqrt<double> for NEON, because the initial rsqrt approximation is not accurate enough for convergence in 2 Newton-Raphson steps and with 3 steps, just calling the builtin sqrt insn is faster. The following benchmarks were compiled with clang "-O2 -fast-math -mfma" and with and without -mavx. AVX+FMA (float) name old cpu/op new cpu/op delta BM_eigen_sqrt_float/1 1.08ns ± 0% 1.09ns ± 1% ~ BM_eigen_sqrt_float/8 2.07ns ± 0% 2.08ns ± 1% ~ BM_eigen_sqrt_float/64 12.4ns ± 0% 12.4ns ± 1% ~ BM_eigen_sqrt_float/512 95.7ns ± 0% 95.5ns ± 0% ~ BM_eigen_sqrt_float/4k 776ns ± 0% 763ns ± 0% -1.67% BM_eigen_sqrt_float/32k 6.57µs ± 1% 6.13µs ± 0% -6.69% BM_eigen_sqrt_float/256k 83.7µs ± 3% 83.3µs ± 2% ~ BM_eigen_sqrt_float/1M 335µs ± 2% 332µs ± 2% ~ SSE+FMA (float) name old cpu/op new cpu/op delta BM_eigen_sqrt_float/1 1.08ns ± 0% 1.09ns ± 0% ~ BM_eigen_sqrt_float/8 2.07ns ± 0% 2.06ns ± 0% ~ BM_eigen_sqrt_float/64 12.4ns ± 0% 12.4ns ± 1% ~ BM_eigen_sqrt_float/512 95.7ns ± 0% 96.3ns ± 4% ~ BM_eigen_sqrt_float/4k 774ns ± 0% 763ns ± 0% -1.50% BM_eigen_sqrt_float/32k 6.58µs ± 2% 6.11µs ± 0% -7.06% BM_eigen_sqrt_float/256k 82.7µs ± 1% 82.6µs ± 1% ~ BM_eigen_sqrt_float/1M 330µs ± 1% 329µs ± 2% ~ SSE+FMA (double) BM_eigen_sqrt_double/1 1.63ns ± 0% 1.63ns ± 0% ~ BM_eigen_sqrt_double/8 6.51ns ± 0% 6.08ns ± 0% -6.68% BM_eigen_sqrt_double/64 52.1ns ± 0% 46.5ns ± 1% -10.65% BM_eigen_sqrt_double/512 417ns ± 0% 374ns ± 1% -10.29% BM_eigen_sqrt_double/4k 3.33µs ± 0% 2.97µs ± 1% -11.00% BM_eigen_sqrt_double/32k 26.7µs ± 0% 23.7µs ± 0% -11.07% BM_eigen_sqrt_double/256k 213µs ± 0% 206µs ± 1% -3.31% BM_eigen_sqrt_double/1M 862µs ± 0% 870µs ± 2% +0.96% AVX+FMA (double) name old cpu/op new cpu/op delta BM_eigen_sqrt_double/1 1.63ns ± 0% 1.63ns ± 0% ~ BM_eigen_sqrt_double/8 6.51ns ± 0% 6.06ns ± 0% -6.95% BM_eigen_sqrt_double/64 52.1ns ± 0% 46.5ns ± 1% -10.80% BM_eigen_sqrt_double/512 417ns ± 0% 373ns ± 1% -10.59% BM_eigen_sqrt_double/4k 3.33µs ± 0% 2.97µs ± 1% -10.79% BM_eigen_sqrt_double/32k 26.7µs ± 0% 23.8µs ± 0% -10.94% BM_eigen_sqrt_double/256k 214µs ± 0% 208µs ± 2% -2.76% BM_eigen_sqrt_double/1M 866µs ± 3% 923µs ± 7% ~	2020-12-16 18:16:11 +00:00
Turing Eret	3bee9422d6	Merge branch 'lambdaknight/eigen-master'	2020-12-16 09:18:24 -07:00
Turing Eret	19e6496ce0	Replace call to FixedDimensions() with a singleton instance of FixedDimensions.	2020-12-16 07:34:44 -07:00
Rasmus Munk Larsen	6cee8d347e	Add an additional step of Newton-Raphson for `psqrt<double>` on Arm, which otherwise has an error of ~1000 ulps.	2020-12-15 04:06:41 +00:00
Turing Eret	bc7d1599fb	TensorStorage with FixedDimensions now has zero instance memory overhead. Removed m_dimension as instance member of TensorStorage with FixedDimensions and instead use the template parameter. This means that the sizeof a pure fixed-size storage is exactly equal to the data it is storing.	2020-12-14 07:19:34 -07:00
Alexander Grund	cf0b5b0344	Remove code checking for CMake < 3.5. As the CMake version is at least 3.5, the code checking for earlier versions can be removed.	2020-12-14 09:57:44 +00:00
David Tellenbach	751f18f2c0	Remove comma at the end of enumeration list to silence C++03 warnings	2020-12-13 18:11:02 +01:00
Antonio Sanchez	5dc2fbabee	Fix implicit cast to double. Triggers `-Wimplicit-float-conversion`, causing a bunch of build errors in Google due to `-Wall`.	2020-12-12 09:26:20 -08:00
Antonio Sanchez	55967f87d1	Fix NEON pmax<PropagateNumbers,Packet4bf>. Simple typo, the max impl called pmin instead of pmax for floats.	2020-12-11 21:50:52 -08:00
Antonio Sanchez	839aa505c3	Fix typo in AVX512 packet math.	2020-12-11 21:35:44 -08:00
David Tellenbach	536c8a79f2	Remove unused macro in Half.h	2020-12-12 00:53:26 +01:00
Antonio Sanchez	8c9976d7f0	Fix more SSE/AVX packet conversions for peven. MSVC doesn't like function-style casts and forces us to use intrinsics.	2020-12-11 15:46:42 -08:00
Antonio Sanchez	c6efc4e0ba	Replace M_LOG2E and M_LN2 with custom macros. For these to exist we would need to define `_USE_MATH_DEFINES` before `cmath` or `math.h` is first included. However, we don't control the include order for projects outside Eigen, so even defining the macro in `Eigen/Core` does not fix the issue for projects that end up including `<cmath>` before Eigen does (explicitly or transitively). To fix this, we define `EIGEN_LOG2E` and `EIGEN_LN2` ourselves.	2020-12-11 14:34:31 -08:00
Antonio Sanchez	e82722a4a7	Fix MSVC SSE casts. MSVC doesn't like __m128(__m128i) c-style casts, so packets need to be converted using intrinsic methods.	2020-12-11 08:52:59 -08:00
Deven Desai	f3d2ea48f5	Fix for broken ROCm/HIP Support. The following commit introduced a breakage in ROCm/HIP support for Eigen. `5ec4907434 (1958e65719641efe5483abc4ce0b61806270f6f3_525_517)` ``` Building HIPCC object test/CMakeFiles/gpu_basic.dir/gpu_basic_generated_gpu_basic.cu.o In file included from /home/rocm-user/eigen/test/gpu_basic.cu:20: In file included from /home/rocm-user/eigen/test/main.h:356: In file included from /home/rocm-user/eigen/Eigen/QR:11: In file included from /home/rocm-user/eigen/Eigen/Core:222: /home/rocm-user/eigen/Eigen/src/Core/arch/GPU/PacketMath.h:556:10: error: use of undeclared identifier 'half2half2'; did you mean '__half2half2'? return half2half2(from); ^~~~~~~~~~ __half2half2 /opt/rocm/hip/include/hip/hcc_detail/hip_fp16.h:547:21: note: '__half2half2' declared here __half2 __half2half2(__half x) ^ 1 error generated when compiling for gfx900. ``` The cause seems to be a copy-paste error, and the fix is trivial.	2020-12-11 16:14:57 +00:00
David Tellenbach	c7eb3a74cb	Don't guard psqrt for std::complex<float> with EIGEN_ARCH_ARM64	2020-12-11 12:41:52 +01:00
Everton Constantino	bccf055a7c	Add Armv8 guard on PropagateNumbers implementation.	2020-12-10 22:01:55 -03:00
Antonio Sanchez	82c0c18a83	Remove private access of std::deque::_M_impl. This no longer works on gcc or clang, so we should just remove the hack. The default should compile to similar code anyways.	2020-12-10 14:59:34 -08:00
David Tellenbach	00be0a7ff3	Fix vectorization of complex sqrt on NEON	2020-12-10 15:23:23 +00:00
David Tellenbach	8eb461a431	Remove comma at end of enumerator list in NEON PacketMath	2020-12-10 15:22:55 +01:00
David Tellenbach	2e8f850c78	Fix a typo in SparseMatrix documentation. This fixes issue #2091.	2020-12-09 14:48:24 +01:00
Rasmus Munk Larsen	125cc9a5df	Implement vectorized complex square root. Closes #1905 Measured speedup for sqrt of `complex<float>` on Skylake: SSE: ``` name old time/op new time/op delta BM_eigen_sqrt_ctype/1 49.4ns ± 0% 54.3ns ± 0% +10.01% BM_eigen_sqrt_ctype/8 332ns ± 0% 50ns ± 1% -84.97% BM_eigen_sqrt_ctype/64 2.81µs ± 1% 0.38µs ± 0% -86.49% BM_eigen_sqrt_ctype/512 23.8µs ± 0% 3.0µs ± 0% -87.32% BM_eigen_sqrt_ctype/4k 202µs ± 0% 24µs ± 2% -88.03% BM_eigen_sqrt_ctype/32k 1.63ms ± 0% 0.19ms ± 0% -88.18% BM_eigen_sqrt_ctype/256k 13.0ms ± 0% 1.5ms ± 1% -88.20% BM_eigen_sqrt_ctype/1M 52.1ms ± 0% 6.2ms ± 0% -88.18% ``` AVX2: ``` name old cpu/op new cpu/op delta BM_eigen_sqrt_ctype/1 53.6ns ± 0% 55.6ns ± 0% +3.71% BM_eigen_sqrt_ctype/8 334ns ± 0% 27ns ± 0% -91.86% BM_eigen_sqrt_ctype/64 2.79µs ± 0% 0.22µs ± 2% -92.28% BM_eigen_sqrt_ctype/512 23.8µs ± 1% 1.7µs ± 1% -92.81% BM_eigen_sqrt_ctype/4k 201µs ± 0% 14µs ± 1% -93.24% BM_eigen_sqrt_ctype/32k 1.62ms ± 0% 0.11ms ± 1% -93.29% BM_eigen_sqrt_ctype/256k 13.0ms ± 0% 0.9ms ± 1% -93.31% BM_eigen_sqrt_ctype/1M 52.0ms ± 0% 3.5ms ± 1% -93.31% ``` AVX512: ``` name old cpu/op new cpu/op delta BM_eigen_sqrt_ctype/1 53.7ns ± 0% 56.2ns ± 1% +4.75% BM_eigen_sqrt_ctype/8 334ns ± 0% 18ns ± 2% -94.63% BM_eigen_sqrt_ctype/64 2.79µs ± 0% 0.12µs ± 1% -95.54% BM_eigen_sqrt_ctype/512 23.9µs ± 1% 1.0µs ± 1% -95.89% BM_eigen_sqrt_ctype/4k 202µs ± 0% 8µs ± 1% -96.13% BM_eigen_sqrt_ctype/32k 1.63ms ± 0% 0.06ms ± 1% -96.15% BM_eigen_sqrt_ctype/256k 13.0ms ± 0% 0.5ms ± 4% -96.11% BM_eigen_sqrt_ctype/1M 52.1ms ± 0% 2.0ms ± 1% -96.13% ```	2020-12-08 18:13:35 -08:00
Antonio Sanchez	8cfe0db108	Fix host/device calls for __half. The previous code had `__host__ __device__` functions calling `__device__` functions (e.g. `__low2half`) which caused build failures in tensorflow. Also tried to simplify the `#ifdef` guards to make them more clear.	2020-12-08 20:31:02 +00:00
Everton Constantino	baf9d762b7	- Enabling PropagateNaN and PropagateNumbers for NEON. - Adding propagate tests to bfloat16.	2020-12-08 17:05:05 +00:00
Antonio Sanchez	634bd79b0e	Fix unused warning on new `dense_assignment_loop` impl.	2020-12-07 19:14:21 -08:00
Antonio Sanchez	655c3a4042	Add specialization for compile-time zero-sized dense assignment. In the current `dense_assignment_loop` implementations, if the destination's inner or outer size is zero at compile time and if the kernel involves a product, we currently get a compile error (#2080). This is triggered by attempting to multiply a non-existent row by a column (or vice-versa). To address this, we add a specialization for zero-sized assignments (`AllAtOnceTraversal`) which evaluates to a no-op. We also add a static check to ensure the size is in fact zero. This now seems to be the only existing use of `AllAtOnceTraversal`. Fixes #2080.	2020-12-07 08:38:43 -08:00
Antonio Sanchez	5ec4907434	Clean up `#if`s in GPU PacketMath. Removed redundant checks and redundant code for CUDA/HIP. Note: there are several issues here of calling `__device__` functions from `__host__ __device__` functions, in particular `__low2half`. We do not address that here -- only modifying this file enough to get our current tests to compile. Fixed: #1847	2020-12-04 16:14:03 -08:00
Rasmus Munk Larsen	f9fac1d5b0	Add log2() to Eigen.	2020-12-04 21:45:09 +00:00
Antonio Sanchez	2dbac2f99f	Fix bad NEON fp16 check	2020-12-04 13:42:18 -08:00
Antonio Sanchez	e2f21465fe	Special function implementations for half/bfloat16 packets. Current implementations fail to consider half-float packets, only half-float scalars. Added specializations for packets on AVX, AVX512 and NEON. Added tests to `special_packetmath`. The current `special_functions` tests would fail for half and bfloat16 due to lack of precision. The NEON tests also fail with precision issues and due to different handling of `sqrt(inf)`, so special functions bessel, ndtri have been disabled. Tested with AVX, AVX512.	2020-12-04 10:16:29 -08:00
David Tellenbach	305b8bd277	Remove duplicate #if clause	2020-12-04 18:55:46 +01:00
Antonio Sanchez	9ee9ac81de	Fix shfl* macros for CUDA/HIP. The `shfl*` functions are `__device__` only. Adjusted the `#ifdef`s so they are defined whenever the corresponding CUDA/HIP ones are. Also changed the HIP/CUDA<9.0 versions to cast to int instead of doing the conversion `half`<->`float`. Fixes #2083	2020-12-04 17:18:32 +00:00
shrek1402	a9a2f2bebf	The function 'prefetch' did not work correctly on the win64 platform	2020-12-04 17:18:08 +00:00
Rasmus Munk Larsen	f23dc5b971	Revert "Add log2() operator to Eigen" This reverts commit `4d91519a9b`.	2020-12-03 14:32:45 -08:00
Rasmus Munk Larsen	4d91519a9b	Add log2() operator to Eigen	2020-12-03 22:31:44 +00:00
Rasmus Munk Larsen	25d8ae7465	Small cleanup of generic plog implementations: Adding the term e*ln(2) is split into two steps for no obvious reason. This dates back to the original Cephes code from which the algorithm is adapted. It appears that this was done in Cephes to prevent the compiler from reordering the addition of the 3 terms in the approximation log(1+x) ~= x - 0.5*x^2 + x^3*P(x)/Q(x) which must be added in reverse order since |x| < (sqrt(2)-1). This allows rewriting the code to just 2 pmadd and 1 padd instructions, which on a Skylake processor speeds up the code by 5-7%.	2020-12-03 19:40:40 +00:00
Antonio Sanchez	eb4d4ae070	Include chrono in main for c++11. Hack to fix tensor tests, since min/max are overridden by `main.h`.	2020-12-03 11:27:32 -08:00
Rasmus Munk Larsen	71c85df4c1	Clean up the Tensor header and get rid of the EIGEN_SLEEP macro.	2020-12-02 11:04:04 -08:00
Antonio Sanchez	70fbcf82ed	Fix typo in `F32MaskToBf16Mask`.	2020-12-02 07:58:34 -08:00
Antonio Sanchez	2627e2f2e6	Fix neon cmp* functions for bf16. The current impl corrupts the comparison masks when converting from float back to bfloat16. The resulting masks are then no longer all zeros or all ones, which breaks when used with `pselect` (e.g. in `pmin<PropagateNumbers>`). This was causing `packetmath_15` to fail on arm. Introducing a simple `F32MaskToBf16Mask` corrects this (takes the lower 16-bits for each float mask).	2020-12-02 01:29:34 +00:00
Antonio Sanchez	ddd48b242c	Implement CUDA __shfl* for Eigen::half. Prior to this fix, `TensorContractionGpu` and the `cxx11_tensor_of_float16_gpu` test were broken, as well as several ops in Tensorflow. The gpu functions `__shfl*` became ambiguous now that `Eigen::half` implicitly converts to float. Here we add the required specializations.	2020-12-01 14:36:52 -08:00
Rasmus Munk Larsen	e57281a741	Fix a few issues for AVX512. This change enables vectorized versions of log, exp, log1p, expm1 when AVX512DQ is not available.	2020-12-01 11:31:47 -08:00
Antonio Sanchez	1992af3de2	Fix #2077, `EIGEN_CONSTEXPR` in `Half`. `bit_cast` cannot be `constexpr`, so we need to remove `EIGEN_CONSTEXPR` from `raw_half_as_uint16(...)`. This shouldn't affect anything else, since it is only used in a `bit_cast<uint16_t,half>()` which is not itself `constexpr`. Fixes #2077.	2020-12-01 03:10:21 +00:00
acxz	7b80609d49	add EIGEN_DEVICE_FUNC to methods	2020-12-01 03:08:47 +00:00
Antonio Sanchez	89f90b585d	AVX512 missing ops. This allows the `packetmath` tests to pass for AVX512 on skylake. Made `half` and `bfloat16` consistent in terms of ops they support. Note the `log` tests are currently disabled for `bfloat16` since they fail due to poor precision (they were previously disabled for `Packet8bf` via test function specialization -- I just removed that specialization and disabled it in the generic test).	2020-11-30 16:28:57 +00:00
Florian Maurin	c5985c46f5	Fix typo in doc	2020-11-30 10:53:29 +00:00
Jim Lersch	68f69414f7	Workaround for doxygen class template titles in which the template part of the class signature is lost due to a problem with forward declarations. The problem is probably caused by doxygen bug #7689. It is confirmed to be fixed in doxygen >= 1.8.19.	2020-11-27 19:52:16 -07:00
Jim Lersch	a7170f2aca	Fix doxygen class blocks that were not associated with the correct classes.	2020-11-27 08:48:11 -07:00
David Tellenbach	550e8f8f57	Include CMakeDependentOption to be able to use cmake_dependent_option	2020-11-27 13:21:49 +01:00
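
The Newton-Raphson refinement discussed in commits 05754100fe and 6cee8d347e can be illustrated with a scalar sketch. This is not Eigen's packet code: `crude_rsqrt`, `refine_rsqrt` and `sqrt_via_rsqrt` are hypothetical names, the low-accuracy hardware estimate is imitated by truncating the mantissa of a float reciprocal square root, and `std::fma` stands in for `pmadd`. It only aims to show why the accuracy of the initial estimate determines how many steps are needed.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a hardware rsqrt estimate (an assumption for
// illustration): compute 1/sqrt(x) in float and clear the low 12 mantissa
// bits, leaving roughly 11 accurate bits. Assumes x fits float's normal range.
inline double crude_rsqrt(double x) {
  float y = 1.0f / std::sqrt(static_cast<float>(x));
  std::uint32_t bits;
  std::memcpy(&bits, &y, sizeof bits);
  bits &= 0xFFFFF000u;
  std::memcpy(&y, &bits, sizeof y);
  return y;
}

// One Newton-Raphson step for y ~ 1/sqrt(x): y' = y * (1.5 - 0.5 * x * y * y).
// The relative error is roughly squared by each step.
inline double refine_rsqrt(double x, double y) {
  const double half_x_y = 0.5 * x * y;
  return y * std::fma(-half_x_y, y, 1.5);
}

// sqrt(x) = x * rsqrt(x), with a caller-chosen number of refinement steps.
inline double sqrt_via_rsqrt(double x, int steps) {
  assert(x >= 0.0 && std::isfinite(x));
  if (x == 0.0) return 0.0;  // rsqrt(0) is infinite, so handle zero separately
  double y = crude_rsqrt(x);
  for (int i = 0; i < steps; ++i) y = refine_rsqrt(x, y);
  return x * y;
}
```

With this artificially coarse estimate, one step is far from double precision while three steps reach it, which mirrors the trade-off described in the commit messages between extra refinement steps and simply calling the built-in sqrt instruction.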


11260 Commits
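
The mask problem described in commit 2627e2f2e6 can be sketched on raw bits for a single lane. This is a scalar illustration, not the NEON code: both function names are hypothetical, and the value conversion shown (round to nearest even with NaN canonicalized to `0x7FC0`) is an assumption about how a float-to-bfloat16 value conversion may behave. The only detail taken from the commit message is that the fix keeps the lower 16 bits of each float mask.

```cpp
#include <cstdint>

// Hypothetical float -> bfloat16 *value* conversion on raw bits (assumption):
// round to nearest even, and map every NaN to the canonical quiet NaN 0x7FC0.
inline std::uint16_t f32_bits_to_bf16_value(std::uint32_t bits) {
  const bool is_nan = (bits & 0x7FFFFFFFu) > 0x7F800000u;
  if (is_nan) return 0x7FC0u;
  const std::uint32_t rounding_bias = 0x7FFFu + ((bits >> 16) & 1u);
  return static_cast<std::uint16_t>((bits + rounding_bias) >> 16);
}

// Mask narrowing: a comparison mask is all zeros or all ones, so keeping the
// lower 16 bits preserves the all-zeros / all-ones property.
inline std::uint16_t f32_mask_to_bf16_mask(std::uint32_t mask) {
  return static_cast<std::uint16_t>(mask & 0xFFFFu);
}
```

An all-ones float mask has the bit pattern of a NaN, so under the assumed value conversion it would not come out as `0xFFFF`, whereas the narrowing function keeps it intact for use with a select operation.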