eigen

mirror of https://gitlab.com/libeigen/eigen.git synced 2024-12-21 07:19:46 +08:00

Author	SHA1	Message	Date
Antonio Sanchez	bde6741641	Improved std::complex sqrt and rsqrt. Replaces `std::sqrt` with `complex_sqrt` for all platforms (previously `complex_sqrt` was only used for CUDA and MSVC), and implements custom `complex_rsqrt`. Also introduces `numext::rsqrt` to simplify implementation, and modified `numext::hypot` to adhere to IEC 60559 (IEEE 754) for special cases. The `complex_sqrt` and `complex_rsqrt` implementations were found to be significantly faster than `std::sqrt<std::complex<T>>` and `1/numext::sqrt<std::complex<T>>`. Benchmark file attached. ``` GCC 10, Intel Xeon, x86_64: --------------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------------- BM_Sqrt<std::complex<float>> 9.21 ns 9.21 ns 73225448 BM_StdSqrt<std::complex<float>> 17.1 ns 17.1 ns 40966545 BM_Sqrt<std::complex<double>> 8.53 ns 8.53 ns 81111062 BM_StdSqrt<std::complex<double>> 21.5 ns 21.5 ns 32757248 BM_Rsqrt<std::complex<float>> 10.3 ns 10.3 ns 68047474 BM_DivSqrt<std::complex<float>> 16.3 ns 16.3 ns 42770127 BM_Rsqrt<std::complex<double>> 11.3 ns 11.3 ns 61322028 BM_DivSqrt<std::complex<double>> 16.5 ns 16.5 ns 42200711 Clang 11, Intel Xeon, x86_64: --------------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------------- BM_Sqrt<std::complex<float>> 7.46 ns 7.45 ns 90742042 BM_StdSqrt<std::complex<float>> 16.6 ns 16.6 ns 42369878 BM_Sqrt<std::complex<double>> 8.49 ns 8.49 ns 81629030 BM_StdSqrt<std::complex<double>> 21.8 ns 21.7 ns 31809588 BM_Rsqrt<std::complex<float>> 8.39 ns 8.39 ns 82933666 BM_DivSqrt<std::complex<float>> 14.4 ns 14.4 ns 48638676 BM_Rsqrt<std::complex<double>> 9.83 ns 9.82 ns 70068956 BM_DivSqrt<std::complex<double>> 15.7 ns 15.7 ns 44487798 Clang 9, Pixel 2, aarch64: --------------------------------------------------------------------------- Benchmark Time CPU Iterations --------------------------------------------------------------------------- BM_Sqrt<std::complex<float>> 24.2 ns 24.1 ns 28616031 BM_StdSqrt<std::complex<float>> 104 ns 103 ns 6826926 BM_Sqrt<std::complex<double>> 31.8 ns 31.8 ns 22157591 BM_StdSqrt<std::complex<double>> 128 ns 128 ns 5437375 BM_Rsqrt<std::complex<float>> 31.9 ns 31.8 ns 22384383 BM_DivSqrt<std::complex<float>> 99.2 ns 98.9 ns 7250438 BM_Rsqrt<std::complex<double>> 46.0 ns 45.8 ns 15338689 BM_DivSqrt<std::complex<double>> 119 ns 119 ns 5898944 ```	2021-01-17 08:50:57 -08:00
Guoqiang QI	38ae5353ab	1) provide a better generic paddsub op implementation 2) make paddsub op support the Packet2cf/Packet4f/Packet2f in NEON 3) make paddsub op support the Packet2cf/Packet4f in SSE	2021-01-13 22:54:03 +00:00
Antonio Sanchez	352f1422d3	Remove `inf` local variable. Apparently `inf` is a macro on iOS for `std::numeric_limits<T>::infinity()`, causing a compile error here. We don't need the local anyways since it's only used in one spot.	2021-01-12 10:33:15 -08:00
Antonio Sanchez	2044084979	Remove TODO from Transform::computeScaleRotation() Upon investigation, `JacobiSVD` is significantly faster than `BDCSVD` for small matrices (twice as fast for 2x2, 20% faster for 3x3, 1% faster for 10x10). Since the majority of cases will be small, let's stick with `JacobiSVD`. See !361.	2021-01-11 11:30:01 -08:00
Antonio Sanchez	3daf92c7a5	Transform::computeScalingRotation flush determinant to +/- 1. In the previous code, in attempting to correct for a negative determinant, we end up multiplying and dividing by a number that is often very near, but not exactly +/-1. By flushing to +/-1, we can replace a division with a multiplication, and results are more numerically consistent.	2021-01-11 10:13:38 -08:00
Antonio Sanchez	587fd6ab70	Only specialize complex `sqrt_impl` for CUDA if not MSVC. We already specialize `sqrt_impl` on windows due to MSVC's mishandling of `inf` (!355).	2021-01-11 09:15:45 -08:00
Deven Desai	2a6addb4f9	Fix for breakage in ROCm support - 210108 The following commit breaks ROCm support for Eigen `f149e0ebc3` All unit tests fail with the following error ``` Building HIPCC object test/CMakeFiles/gpu_basic.dir/gpu_basic_generated_gpu_basic.cu.o In file included from /home/rocm-user/eigen/test/gpu_basic.cu:19: In file included from /home/rocm-user/eigen/test/main.h:356: In file included from /home/rocm-user/eigen/Eigen/QR:11: In file included from /home/rocm-user/eigen/Eigen/Core:166: /home/rocm-user/eigen/Eigen/src/Core/MathFunctionsImpl.h:105:35: error: __host__ __device__ function 'complex_sqrt' cannot overload __host__ function 'complex_sqrt' EIGEN_DEVICE_FUNC std::complex<T> complex_sqrt(const std::complex<T>& z) { ^ /home/rocm-user/eigen/Eigen/src/Core/MathFunctions.h:342:38: note: previous declaration is here template<typename T> std::complex<T> complex_sqrt(const std::complex<T>& a_x); ^ 1 error generated when compiling for gfx900. CMake Error at gpu_basic_generated_gpu_basic.cu.o.cmake:192 (message): Error generating file /home/rocm-user/eigen/build/test/CMakeFiles/gpu_basic.dir//./gpu_basic_generated_gpu_basic.cu.o test/CMakeFiles/gpu_basic.dir/build.make:63: recipe for target 'test/CMakeFiles/gpu_basic.dir/gpu_basic_generated_gpu_basic.cu.o' failed make[3]: * [test/CMakeFiles/gpu_basic.dir/gpu_basic_generated_gpu_basic.cu.o] Error 1 CMakeFiles/Makefile2:16618: recipe for target 'test/CMakeFiles/gpu_basic.dir/all' failed make[2]: * [test/CMakeFiles/gpu_basic.dir/all] Error 2 CMakeFiles/Makefile2:16625: recipe for target 'test/CMakeFiles/gpu_basic.dir/rule' failed make[1]: * [test/CMakeFiles/gpu_basic.dir/rule] Error 2 Makefile:5401: recipe for target 'gpu_basic' failed make: * [gpu_basic] Error 2 ``` The error message is accurate, and the fix (provided in this commit) is trivial.	2021-01-08 18:04:40 +00:00
Antonio Sanchez	f149e0ebc3	Fix MSVC complex sqrt and packetmath test. MSVC incorrectly handles `inf` cases for `std::sqrt<std::complex<T>>`. Here we replace it with a custom version (currently used on GPU). Also fixed the `packetmath` test, which previously skipped several corner cases since `CHECK_CWISE1` only tests the first `PacketSize` elements.	2021-01-08 01:17:19 +00:00
Essex Edwards	e741b43668	Make Transform::computeRotationScaling(0,&S) continuous	2021-01-07 17:45:14 +00:00
David Tellenbach	0bdc0dba20	Add missing #endif directive in Macros.h	2021-01-07 12:32:41 +01:00
shrek1402	cb654b1c45	#define was defined incorrectly because the result_of function was deprecated in c++17 and removed in c++20. Also, EIGEN_COMP_MSVC (which is _MSC_VER) only affects result_of indirectly, which can cause errors.	2021-01-07 10:12:25 +00:00
Antonio Sanchez	52d1dd979a	Fix Ref initialization. Since `eigen_assert` is a macro, the statements can become noops (e.g. when compiling for GPU), so they may not execute the contained logic -- which in this case is the entire `Ref` construction. We need to separate the assert from statements which have consequences. Fixes #2113	2021-01-06 13:14:20 -08:00
Antonio Sanchez	166fcdecdb	Allow CwiseUnaryView to be used on device. Added `EIGEN_DEVICE_FUNC` to methods.	2021-01-06 09:16:52 -08:00
Antonio Sanchez	bb1de9dbde	Fix Ref Stride checks. The existing `Ref` class failed to consider cases where the Ref's `Stride` setting could match the underlying referred object's stride, but didn't at runtime. This led to trying to set invalid stride values, causing runtime failures in some cases, and garbage due to mismatched strides in others. Here we add the missing runtime checks. This involves computing the strides necessary to align with the referred object's storage, and verifying we can actually set those strides at runtime. In the `const` case, if it may be possible to refer to the original storage at compile-time but fails at runtime, then we defer to the `construct(...)` method that makes a copy. Added more tests to check these cases. Fixes #2093.	2021-01-05 10:41:25 -08:00
Christoph Hertzberg	12dda34b15	Eliminate boolean product warnings by factoring out a `combine_scalar_factors` helper function.	2021-01-05 18:15:30 +00:00
Antonio Sanchez	070d303d56	Add CUDA complex sqrt. This is to support scalar `sqrt` of complex numbers `std::complex<T>` on device, requested by Tensorflow folks. Technically `std::complex` is not supported by NVCC on device (though it is by clang), so the default `sqrt(std::complex<T>)` function only works on the host. Here we create an overload to add back the functionality. Also modified the CMake file to add `--relaxed-constexpr` (or equivalent) flag for NVCC to allow calling constexpr functions from device functions, and added support for specifying compute architecture for NVCC (was already available for clang).	2020-12-22 23:25:23 -08:00
rgreenblatt	fdf2ee62c5	Fix missing EIGEN_DEVICE_FUNC	2020-12-20 23:22:53 -05:00
Rasmus Munk Larsen	05754100fe	* Add iterative psqrt<double> for AVX and SSE when FMA is available. This provides a ~10% speedup. * Write iterative sqrt explicitly in terms of pmadd. This gives up to 7% speedup for psqrt<float> with AVX & SSE with FMA. * Remove iterative psqrt<double> for NEON, because the initial rsqrt approximation is not accurate enough for convergence in 2 Newton-Raphson steps and with 3 steps, just calling the builtin sqrt insn is faster. The following benchmarks were compiled with clang "-O2 -fast-math -mfma" and with and without -mavx. AVX+FMA (float) name old cpu/op new cpu/op delta BM_eigen_sqrt_float/1 1.08ns ± 0% 1.09ns ± 1% ~ BM_eigen_sqrt_float/8 2.07ns ± 0% 2.08ns ± 1% ~ BM_eigen_sqrt_float/64 12.4ns ± 0% 12.4ns ± 1% ~ BM_eigen_sqrt_float/512 95.7ns ± 0% 95.5ns ± 0% ~ BM_eigen_sqrt_float/4k 776ns ± 0% 763ns ± 0% -1.67% BM_eigen_sqrt_float/32k 6.57µs ± 1% 6.13µs ± 0% -6.69% BM_eigen_sqrt_float/256k 83.7µs ± 3% 83.3µs ± 2% ~ BM_eigen_sqrt_float/1M 335µs ± 2% 332µs ± 2% ~ SSE+FMA (float) name old cpu/op new cpu/op delta BM_eigen_sqrt_float/1 1.08ns ± 0% 1.09ns ± 0% ~ BM_eigen_sqrt_float/8 2.07ns ± 0% 2.06ns ± 0% ~ BM_eigen_sqrt_float/64 12.4ns ± 0% 12.4ns ± 1% ~ BM_eigen_sqrt_float/512 95.7ns ± 0% 96.3ns ± 4% ~ BM_eigen_sqrt_float/4k 774ns ± 0% 763ns ± 0% -1.50% BM_eigen_sqrt_float/32k 6.58µs ± 2% 6.11µs ± 0% -7.06% BM_eigen_sqrt_float/256k 82.7µs ± 1% 82.6µs ± 1% ~ BM_eigen_sqrt_float/1M 330µs ± 1% 329µs ± 2% ~ SSE+FMA (double) BM_eigen_sqrt_double/1 1.63ns ± 0% 1.63ns ± 0% ~ BM_eigen_sqrt_double/8 6.51ns ± 0% 6.08ns ± 0% -6.68% BM_eigen_sqrt_double/64 52.1ns ± 0% 46.5ns ± 1% -10.65% BM_eigen_sqrt_double/512 417ns ± 0% 374ns ± 1% -10.29% BM_eigen_sqrt_double/4k 3.33µs ± 0% 2.97µs ± 1% -11.00% BM_eigen_sqrt_double/32k 26.7µs ± 0% 23.7µs ± 0% -11.07% BM_eigen_sqrt_double/256k 213µs ± 0% 206µs ± 1% -3.31% BM_eigen_sqrt_double/1M 862µs ± 0% 870µs ± 2% +0.96% AVX+FMA (double) name old cpu/op new cpu/op delta BM_eigen_sqrt_double/1 1.63ns ± 0% 1.63ns ± 0% ~ BM_eigen_sqrt_double/8 6.51ns ± 0% 6.06ns ± 0% -6.95% BM_eigen_sqrt_double/64 52.1ns ± 0% 46.5ns ± 1% -10.80% BM_eigen_sqrt_double/512 417ns ± 0% 373ns ± 1% -10.59% BM_eigen_sqrt_double/4k 3.33µs ± 0% 2.97µs ± 1% -10.79% BM_eigen_sqrt_double/32k 26.7µs ± 0% 23.8µs ± 0% -10.94% BM_eigen_sqrt_double/256k 214µs ± 0% 208µs ± 2% -2.76% BM_eigen_sqrt_double/1M 866µs ± 3% 923µs ± 7% ~	2020-12-16 18:16:11 +00:00
Rasmus Munk Larsen	6cee8d347e	Add an additional step of Newton-Raphson for `psqrt<double>` on Arm, which otherwise has an error of ~1000 ulps.	2020-12-15 04:06:41 +00:00
David Tellenbach	751f18f2c0	Remove comma at the end of enumeration list to silence C++03 warnings	2020-12-13 18:11:02 +01:00
Antonio Sanchez	5dc2fbabee	Fix implicit cast to double. Triggers `-Wimplicit-float-conversion`, causing a bunch of build errors in Google due to `-Wall`.	2020-12-12 09:26:20 -08:00
Antonio Sanchez	55967f87d1	Fix NEON pmax<PropagateNumbers,Packet4bf>. Simple typo, the max impl called pmin instead of pmax for floats.	2020-12-11 21:50:52 -08:00
Antonio Sanchez	839aa505c3	Fix typo in AVX512 packet math.	2020-12-11 21:35:44 -08:00
David Tellenbach	536c8a79f2	Remove unused macro in Half.h	2020-12-12 00:53:26 +01:00
Antonio Sanchez	8c9976d7f0	Fix more SSE/AVX packet conversions for peven. MSVC doesn't like function-style casts and forces us to use intrinsics.	2020-12-11 15:46:42 -08:00
Antonio Sanchez	c6efc4e0ba	Replace M_LOG2E and M_LN2 with custom macros. For these to exist we would need to define `_USE_MATH_DEFINES` before `cmath` or `math.h` is first included. However, we don't control the include order for projects outside Eigen, so even defining the macro in `Eigen/Core` does not fix the issue for projects that end up including `<cmath>` before Eigen does (explicitly or transitively). To fix this, we define `EIGEN_LOG2E` and `EIGEN_LN2` ourselves.	2020-12-11 14:34:31 -08:00
Antonio Sanchez	e82722a4a7	Fix MSVC SSE casts. MSVC doesn't like __m128(__m128i) c-style casts, so packets need to be converted using intrinsic methods.	2020-12-11 08:52:59 -08:00
Deven Desai	f3d2ea48f5	Fix for broken ROCm/HIP Support. The following commit introduced a breakage in ROCm/HIP support for Eigen. `5ec4907434 (1958e65719641efe5483abc4ce0b61806270f6f3_525_517)` ``` Building HIPCC object test/CMakeFiles/gpu_basic.dir/gpu_basic_generated_gpu_basic.cu.o In file included from /home/rocm-user/eigen/test/gpu_basic.cu:20: In file included from /home/rocm-user/eigen/test/main.h:356: In file included from /home/rocm-user/eigen/Eigen/QR:11: In file included from /home/rocm-user/eigen/Eigen/Core:222: /home/rocm-user/eigen/Eigen/src/Core/arch/GPU/PacketMath.h:556:10: error: use of undeclared identifier 'half2half2'; did you mean '__half2half2'? return half2half2(from); ^~~~~~~~~~ __half2half2 /opt/rocm/hip/include/hip/hcc_detail/hip_fp16.h:547:21: note: '__half2half2' declared here __half2 __half2half2(__half x) ^ 1 error generated when compiling for gfx900. ``` The cause seems to be a copy-paste error, and the fix is trivial.	2020-12-11 16:14:57 +00:00
David Tellenbach	c7eb3a74cb	Don't guard psqrt for std::complex<float> with EIGEN_ARCH_ARM64	2020-12-11 12:41:52 +01:00
Everton Constantino	bccf055a7c	Add Armv8 guard on PropagateNumbers implementation.	2020-12-10 22:01:55 -03:00
Antonio Sanchez	82c0c18a83	Remove private access of std::deque::_M_impl. This no longer works on gcc or clang, so we should just remove the hack. The default should compile to similar code anyways.	2020-12-10 14:59:34 -08:00
David Tellenbach	00be0a7ff3	Fix vectorization of complex sqrt on NEON	2020-12-10 15:23:23 +00:00
David Tellenbach	8eb461a431	Remove comma at end of enumerator list in NEON PacketMath	2020-12-10 15:22:55 +01:00
David Tellenbach	2e8f850c78	Fix a typo in SparseMatrix documentation. This fixes issue #2091.	2020-12-09 14:48:24 +01:00
Rasmus Munk Larsen	125cc9a5df	Implement vectorized complex square root. Closes #1905 Measured speedup for sqrt of `complex<float>` on Skylake: SSE: ``` name old time/op new time/op delta BM_eigen_sqrt_ctype/1 49.4ns ± 0% 54.3ns ± 0% +10.01% BM_eigen_sqrt_ctype/8 332ns ± 0% 50ns ± 1% -84.97% BM_eigen_sqrt_ctype/64 2.81µs ± 1% 0.38µs ± 0% -86.49% BM_eigen_sqrt_ctype/512 23.8µs ± 0% 3.0µs ± 0% -87.32% BM_eigen_sqrt_ctype/4k 202µs ± 0% 24µs ± 2% -88.03% BM_eigen_sqrt_ctype/32k 1.63ms ± 0% 0.19ms ± 0% -88.18% BM_eigen_sqrt_ctype/256k 13.0ms ± 0% 1.5ms ± 1% -88.20% BM_eigen_sqrt_ctype/1M 52.1ms ± 0% 6.2ms ± 0% -88.18% ``` AVX2: ``` name old cpu/op new cpu/op delta BM_eigen_sqrt_ctype/1 53.6ns ± 0% 55.6ns ± 0% +3.71% BM_eigen_sqrt_ctype/8 334ns ± 0% 27ns ± 0% -91.86% BM_eigen_sqrt_ctype/64 2.79µs ± 0% 0.22µs ± 2% -92.28% BM_eigen_sqrt_ctype/512 23.8µs ± 1% 1.7µs ± 1% -92.81% BM_eigen_sqrt_ctype/4k 201µs ± 0% 14µs ± 1% -93.24% BM_eigen_sqrt_ctype/32k 1.62ms ± 0% 0.11ms ± 1% -93.29% BM_eigen_sqrt_ctype/256k 13.0ms ± 0% 0.9ms ± 1% -93.31% BM_eigen_sqrt_ctype/1M 52.0ms ± 0% 3.5ms ± 1% -93.31% ``` AVX512: ``` name old cpu/op new cpu/op delta BM_eigen_sqrt_ctype/1 53.7ns ± 0% 56.2ns ± 1% +4.75% BM_eigen_sqrt_ctype/8 334ns ± 0% 18ns ± 2% -94.63% BM_eigen_sqrt_ctype/64 2.79µs ± 0% 0.12µs ± 1% -95.54% BM_eigen_sqrt_ctype/512 23.9µs ± 1% 1.0µs ± 1% -95.89% BM_eigen_sqrt_ctype/4k 202µs ± 0% 8µs ± 1% -96.13% BM_eigen_sqrt_ctype/32k 1.63ms ± 0% 0.06ms ± 1% -96.15% BM_eigen_sqrt_ctype/256k 13.0ms ± 0% 0.5ms ± 4% -96.11% BM_eigen_sqrt_ctype/1M 52.1ms ± 0% 2.0ms ± 1% -96.13% ```	2020-12-08 18:13:35 -08:00
Antonio Sanchez	8cfe0db108	Fix host/device calls for __half. The previous code had `__host__ __device__` functions calling `__device__` functions (e.g. `__low2half`) which caused build failures in tensorflow. Also tried to simplify the `#ifdef` guards to make them more clear.	2020-12-08 20:31:02 +00:00
Everton Constantino	baf9d762b7	- Enabling PropagateNaN and PropagateNumbers for NEON. - Adding propagate tests to bfloat16.	2020-12-08 17:05:05 +00:00
Antonio Sanchez	634bd79b0e	Fix unused warning on new `dense_assignment_loop` impl.	2020-12-07 19:14:21 -08:00
Antonio Sanchez	655c3a4042	Add specialization for compile-time zero-sized dense assignment. In the current `dense_assignment_loop` implementations, if the destination's inner or outer size is zero at compile time and if the kernel involves a product, we currently get a compile error (#2080). This is triggered by attempting to multiply a non-existent row by a column (or vice-versa). To address this, we add a specialization for zero-sized assignments (`AllAtOnceTraversal`) which evaluates to a no-op. We also add a static check to ensure the size is in-fact zero. This now seems to be the only existing use of `AllAtOnceTraversal`. Fixes #2080.	2020-12-07 08:38:43 -08:00
Antonio Sanchez	5ec4907434	Clean up `#if`s in GPU PacketMath. Removed redundant checks and redundant code for CUDA/HIP. Note: there are several issues here of calling `__device__` functions from `__host__ __device__` functions, in particular `__low2half`. We do not address that here -- only modifying this file enough to get our current tests to compile. Fixed: #1847	2020-12-04 16:14:03 -08:00
Rasmus Munk Larsen	f9fac1d5b0	Add log2() to Eigen.	2020-12-04 21:45:09 +00:00
Antonio Sanchez	e2f21465fe	Special function implementations for half/bfloat16 packets. Current implementations fail to consider half-float packets, only half-float scalars. Added specializations for packets on AVX, AVX512 and NEON. Added tests to `special_packetmath`. The current `special_functions` tests would fail for half and bfloat16 due to lack of precision. The NEON tests also fail with precision issues and due to different handling of `sqrt(inf)`, so special functions bessel, ndtri have been disabled. Tested with AVX, AVX512.	2020-12-04 10:16:29 -08:00
David Tellenbach	305b8bd277	Remove duplicate #if clause	2020-12-04 18:55:46 +01:00
Antonio Sanchez	9ee9ac81de	Fix shfl* macros for CUDA/HIP The `shfl*` functions are `__device__` only, and adjusted `#ifdef`s so they are defined whenever the corresponding CUDA/HIP ones are. Also changed the HIP/CUDA<9.0 versions to cast to int instead of doing the conversion `half`<->`float`. Fixes #2083	2020-12-04 17:18:32 +00:00
shrek1402	a9a2f2bebf	The function 'prefetch' did not work correctly on the win64 platform	2020-12-04 17:18:08 +00:00
Rasmus Munk Larsen	f23dc5b971	Revert "Add log2() operator to Eigen" This reverts commit `4d91519a9b`.	2020-12-03 14:32:45 -08:00
Rasmus Munk Larsen	4d91519a9b	Add log2() operator to Eigen	2020-12-03 22:31:44 +00:00
Rasmus Munk Larsen	25d8ae7465	Small cleanup of generic plog implementations: Adding the term e*ln(2) is split into two steps for no obvious reason. This dates back to the original Cephes code from which the algorithm is adapted. It appears that this was done in Cephes to prevent the compiler from reordering the addition of the 3 terms in the approximation log(1+x) ~= x - 0.5*x^2 + x^3*P(x)/Q(x) which must be added in reverse order since |x| < (sqrt(2)-1). This allows rewriting the code to just 2 pmadd and 1 padd instructions, which on a Skylake processor speeds up the code by 5-7%.	2020-12-03 19:40:40 +00:00
Antonio Sanchez	70fbcf82ed	Fix typo in `F32MaskToBf16Mask`.	2020-12-02 07:58:34 -08:00
Antonio Sanchez	2627e2f2e6	Fix neon cmp* functions for bf16. The current impl corrupts the comparison masks when converting from float back to bfloat16. The resulting masks are then no longer all zeros or all ones, which breaks when used with `pselect` (e.g. in `pmin<PropagateNumbers>`). This was causing `packetmath_15` to fail on arm. Introducing a simple `F32MaskToBf16Mask` corrects this (takes the lower 16-bits for each float mask).	2020-12-02 01:29:34 +00:00
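
The `Ref` initialization fix in `52d1dd979a` above describes a general macro pitfall: if an assert macro can expand to nothing, any statement with side effects placed inside it disappears too. The sketch below illustrates that pattern in isolation. It is not Eigen's code: `SKETCH_ASSERT`, `RefLike`, `init_buggy` and `init_fixed` are hypothetical names chosen for this example, and the macro is hard-wired to a no-op to mimic a build where asserts are compiled out.

```cpp
#include <cassert>

// Hypothetical stand-in for an assert macro in a build where asserts are
// disabled: the argument expression is discarded and never evaluated.
#define SKETCH_ASSERT(expr) ((void)0)

// Hypothetical object whose setup has an observable side effect.
struct RefLike {
  bool constructed = false;
  bool construct() {
    constructed = true;
    return true;
  }
};

// Problem pattern: the side-effecting call lives inside the assert,
// so it vanishes together with the assert.
inline bool init_buggy() {
  RefLike r;
  SKETCH_ASSERT(r.construct());
  return r.constructed;
}

// Safer pattern: perform the work first, then assert on the stored result.
inline bool init_fixed() {
  RefLike r;
  const bool ok = r.construct();
  SKETCH_ASSERT(ok);
  (void)ok;  // avoid an unused-variable warning when the assert is a no-op
  return r.constructed;
}
```

With the macro disabled as above, `init_buggy()` leaves the object unconstructed while `init_fixed()` constructs it, which is the separation of assert and consequential statement that the commit message calls for.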

6399 Commits