eigen

mirror of https://gitlab.com/libeigen/eigen.git synced 2024-12-27 07:29:52 +08:00

Author	SHA1	Message	Date
David Tellenbach	8f8d77b516	Add EIGEN prefix for HAS_LGAMMA_R	2020-10-08 18:32:19 +02:00
Eugene Zhulenev	2279f2c62f	Use lgamma_r if it is available (update check for glibc 2.19+)	2020-10-08 00:26:45 +00:00
Rasmus Munk Larsen	b431024404	Don't make assumptions about NaN-propagation for pmin/pmax - it varies across platforms. Change test to only test for NaN-propagation for pfmin/pfmax.	2020-10-07 19:05:18 +00:00
Zhuyie	e4b24e7fb2	Fix Eigen::ThreadPool::CurrentThreadId returning wrong thread id when EIGEN_AVOID_THREAD_LOCAL and NDEBUG are defined	2020-09-25 09:36:43 +00:00
Rasmus Munk Larsen	e55182ac09	Get rid of initialization logic for blueNorm by making the computed constants static const or constexpr. Move macro definition EIGEN_CONSTEXPR to Core and make all methods in NumTraits constexpr when EIGEN_HAS_CONSTEXPR is 1.	2020-09-18 17:38:58 +00:00
Deven Desai	603e213d13	Fixing a CUDA / P100 regression introduced by PR 181. PR 181 ( https://gitlab.com/libeigen/eigen/-/merge_requests/181 ) adds the `__launch_bounds__(1024)` attribute to GPU kernels that did not have that attribute explicitly specified. That PR seems to cause regressions on the CUDA platform. This commit makes the changes in PR 181 applicable to HIP only.	2020-08-20 00:29:57 +00:00
David Tellenbach	d4a727d092	Disable min/max NaN propagation in test cxx11_tensor_expr. The current pmin/pmax implementation for Arm Neon propagates NaNs differently than std::min/std::max. See issue https://gitlab.com/libeigen/eigen/-/issues/1937	2020-08-14 16:16:27 +00:00
Deven Desai	46f8a18567	Adding an explicit launch_bounds(1024) attribute for GPU kernels. Starting with ROCm 3.5, the HIP compiler will change from HCC to hip-clang. This compiler change introduces a change in the default value of the `__launch_bounds__` attribute associated with a GPU kernel (the default value is the value the compiler assumes for the `__launch_bounds__` attribute when the user does not specify it explicitly). Currently (i.e. for HIP with ROCm 3.3 and older), the default value is 1024. That changes to 256 with ROCm 3.5 (i.e. the hip-clang compiler). As a consequence of this change, if a GPU kernel with a `__launch_bounds__` attribute of 256 is launched at runtime with a threads_per_block value > 256, it leads to a runtime error. This is leading to a couple of Eigen unit test failures with ROCm 3.5. This commit adds an explicit `__launch_bounds__(1024)` attribute to every GPU kernel that currently does not have it explicitly specified (and hence would end up getting the default value of 256 with the change to hip-clang).	2020-08-05 01:46:34 +00:00
Rasmus Munk Larsen	b92206676c	Inherit alignment trait from argument in TensorBroadcasting to avoid segfault when the argument is unaligned.	2020-07-28 19:19:37 +00:00
Rasmus Munk Larsen	ac2eca6b11	Update tensor reduction test to avoid undefined division of bfloat16 by int.	2020-07-22 00:35:51 +00:00
Antonio Sanchez	9cb8771e9c	Fix tensor casts for large packets and casts to/from std::complex. The original tensor casts were only defined for `SrcCoeffRatio`:`TgtCoeffRatio` 1:1, 1:2, 2:1, 4:1. Here we add the missing 1:N and 8:1. We also add casting `Eigen::half` to/from `std::complex<T>`, which was missing, to make it consistent with `Eigen::bfloat16`, and generalize the overload to work for any complex type. Tests were added to `basicstuff`, `packetmath`, and `cxx11_tensor_casts` to test all cast configurations.	2020-06-30 18:53:55 +00:00
Teng Lu	386d809bde	Support BFloat16 in Eigen	2020-06-20 19:16:24 +00:00
Ilya Tokar	231ce21535	Run two independent chains, when reducing tensors. Running two chains exposes more instruction level parallelism, by allowing to execute both chains at the same time. Results are a bit noisy, but for medium length we almost hit theoretical upper bound of 2x. BM_fullReduction_16T/3 [using 16 threads] 17.3ns ±11% 17.4ns ± 9% ~ (p=0.178 n=18+19) BM_fullReduction_16T/4 [using 16 threads] 17.6ns ±17% 17.0ns ±18% ~ (p=0.835 n=20+19) BM_fullReduction_16T/7 [using 16 threads] 18.9ns ±12% 18.2ns ±10% ~ (p=0.756 n=20+18) BM_fullReduction_16T/8 [using 16 threads] 19.8ns ±13% 19.4ns ±21% ~ (p=0.512 n=20+20) BM_fullReduction_16T/10 [using 16 threads] 23.5ns ±15% 20.8ns ±24% -11.37% (p=0.000 n=20+19) BM_fullReduction_16T/15 [using 16 threads] 35.8ns ±21% 26.9ns ±17% -24.76% (p=0.000 n=20+19) BM_fullReduction_16T/16 [using 16 threads] 38.7ns ±22% 27.7ns ±18% -28.40% (p=0.000 n=20+19) BM_fullReduction_16T/31 [using 16 threads] 146ns ±17% 74ns ±11% -49.05% (p=0.000 n=20+18) BM_fullReduction_16T/32 [using 16 threads] 154ns ±19% 84ns ±30% -45.79% (p=0.000 n=20+19) BM_fullReduction_16T/64 [using 16 threads] 603ns ± 8% 308ns ±12% -48.94% (p=0.000 n=17+17) BM_fullReduction_16T/128 [using 16 threads] 2.44µs ±13% 1.22µs ± 1% -50.29% (p=0.000 n=17+17) BM_fullReduction_16T/256 [using 16 threads] 9.84µs ±14% 5.13µs ±30% -47.82% (p=0.000 n=19+19) BM_fullReduction_16T/512 [using 16 threads] 78.0µs ± 9% 56.1µs ±17% -28.02% (p=0.000 n=18+20) BM_fullReduction_16T/1k [using 16 threads] 325µs ± 5% 263µs ± 4% -19.00% (p=0.000 n=20+16) BM_fullReduction_16T/2k [using 16 threads] 1.09ms ± 3% 0.99ms ± 1% -9.04% (p=0.000 n=20+20) BM_fullReduction_16T/4k [using 16 threads] 7.66ms ± 3% 7.57ms ± 3% -1.24% (p=0.017 n=20+20) BM_fullReduction_16T/10k [using 16 threads] 65.3ms ± 4% 65.0ms ± 3% ~ (p=0.718 n=20+20)	2020-06-16 15:55:11 -04:00
Antonio Sanchez	a7d2552af8	Remove HasCast and fix packetmath cast tests. The use of the `packet_traits<>::HasCast` field is currently inconsistent with `type_casting_traits<>`, and is unused apart from within `test/packetmath.cpp`. In addition, those packetmath cast tests do not currently reflect how casts are performed in practice: they ignore the `SrcCoeffRatio` and `TgtCoeffRatio` fields, assuming a 1:1 ratio. Here we remove the unused `HasCast`, and modify the packet cast tests to better reflect their usage.	2020-06-11 17:26:56 +00:00
Thales Sabino	1fcaaf460f	Update FindComputeCpp.cmake to fix build problems on Windows - Use standard types in SYCL/PacketMath.h to avoid compilation problems on Windows - Add EIGEN_HAS_CONSTEXPR to cxx11_tensor_argmax_sycl.cpp to fix build problems on Windows	2020-06-05 20:51:20 +00:00
Antonio Sánchez	8719b9c5bc	Disable test for 32-bit systems (e.g. ARM, i386) Both i386 and 32-bit ARM do not define __uint128_t. On most systems, if __uint128_t is defined, then so is the macro __SIZEOF_INT128__. https://stackoverflow.com/questions/18531782/how-to-know-if-uint128-t-is-defined1	2020-05-28 17:40:15 +00:00
mehdi-goli	d3e81db6c5	Eigen moved the `ScanLauncher` function inside the internal namespace. This commit applies the following changes: - Moving the `ScanLauncher` specialization inside the internal namespace to fix a compiler crash on TensorScan for the SYCL backend. - Replacing `SYCL/sycl.hpp` with `CL/sycl.hpp` in order to follow the SYCL 1.2.1 standard. - Minor fixes: commenting out an unused variable to avoid compiler warnings.	2020-05-11 16:10:33 +01:00
Rasmus Munk Larsen	2fd8a5a08f	Add parallelization of TensorScanOp for types without packet ops. Clean up the code a bit and do a few micro-optimizations to improve performance for small tensors. Benchmark numbers for Tensor<uint32_t>: name old time/op new time/op delta BM_cumSumRowReduction_1T/8 [using 1 threads] 76.5ns ± 0% 61.3ns ± 4% -19.80% (p=0.008 n=5+5) BM_cumSumRowReduction_1T/64 [using 1 threads] 2.47µs ± 1% 2.40µs ± 1% -2.77% (p=0.008 n=5+5) BM_cumSumRowReduction_1T/256 [using 1 threads] 39.8µs ± 0% 39.6µs ± 0% -0.60% (p=0.008 n=5+5) BM_cumSumRowReduction_1T/4k [using 1 threads] 13.9ms ± 0% 13.4ms ± 1% -4.19% (p=0.008 n=5+5) BM_cumSumRowReduction_2T/8 [using 2 threads] 76.8ns ± 0% 59.1ns ± 0% -23.09% (p=0.016 n=5+4) BM_cumSumRowReduction_2T/64 [using 2 threads] 2.47µs ± 1% 2.41µs ± 1% -2.53% (p=0.008 n=5+5) BM_cumSumRowReduction_2T/256 [using 2 threads] 39.8µs ± 0% 34.7µs ± 6% -12.74% (p=0.008 n=5+5) BM_cumSumRowReduction_2T/4k [using 2 threads] 13.8ms ± 1% 7.2ms ± 6% -47.74% (p=0.008 n=5+5) BM_cumSumRowReduction_8T/8 [using 8 threads] 76.4ns ± 0% 61.8ns ± 3% -19.02% (p=0.008 n=5+5) BM_cumSumRowReduction_8T/64 [using 8 threads] 2.47µs ± 1% 2.40µs ± 1% -2.84% (p=0.008 n=5+5) BM_cumSumRowReduction_8T/256 [using 8 threads] 39.8µs ± 0% 28.3µs ±11% -28.75% (p=0.008 n=5+5) BM_cumSumRowReduction_8T/4k [using 8 threads] 13.8ms ± 0% 2.7ms ± 5% -80.39% (p=0.008 n=5+5) BM_cumSumColReduction_1T/8 [using 1 threads] 59.1ns ± 0% 80.3ns ± 0% +35.94% (p=0.029 n=4+4) BM_cumSumColReduction_1T/64 [using 1 threads] 3.06µs ± 0% 3.08µs ± 1% ~ (p=0.114 n=4+4) BM_cumSumColReduction_1T/256 [using 1 threads] 175µs ± 0% 176µs ± 0% ~ (p=0.190 n=4+5) BM_cumSumColReduction_1T/4k [using 1 threads] 824ms ± 1% 844ms ± 1% +2.37% (p=0.008 n=5+5) BM_cumSumColReduction_2T/8 [using 2 threads] 59.0ns ± 0% 90.7ns ± 0% +53.74% (p=0.029 n=4+4) BM_cumSumColReduction_2T/64 [using 2 threads] 3.06µs ± 0% 3.10µs ± 0% +1.08% (p=0.016 n=4+5) BM_cumSumColReduction_2T/256 [using 2 threads] 176µs ± 0% 189µs 
±18% ~ (p=0.151 n=5+5) BM_cumSumColReduction_2T/4k [using 2 threads] 836ms ± 2% 611ms ±14% -26.92% (p=0.008 n=5+5) BM_cumSumColReduction_8T/8 [using 8 threads] 59.3ns ± 2% 90.6ns ± 0% +52.79% (p=0.008 n=5+5) BM_cumSumColReduction_8T/64 [using 8 threads] 3.07µs ± 0% 3.10µs ± 0% +0.99% (p=0.016 n=5+4) BM_cumSumColReduction_8T/256 [using 8 threads] 176µs ± 0% 80µs ±19% -54.51% (p=0.008 n=5+5) BM_cumSumColReduction_8T/4k [using 8 threads] 827ms ± 2% 180ms ±14% -78.24% (p=0.008 n=5+5)	2020-05-06 14:48:37 -07:00
Rasmus Munk Larsen	0e59f786e1	Fix accidental copy of loop variable.	2020-05-05 21:35:38 +00:00
Rasmus Munk Larsen	7b76c85daf	Vectorize and parallelize TensorScanOp. TensorScanOp is used in TensorFlow for a number of operations, such as cumulative logexp reduction and cumulative sum and product reductions. The benchmarks numbers below are for cumulative row- and column reductions of NxN matrices. name old time/op new time/op delta BM_cumSumRowReduction_1T/4 [using 1 threads ] 25.1ns ± 1% 35.2ns ± 1% +40.45% BM_cumSumRowReduction_1T/8 [using 1 threads ] 73.4ns ± 0% 82.7ns ± 3% +12.74% BM_cumSumRowReduction_1T/32 [using 1 threads ] 988ns ± 0% 832ns ± 0% -15.77% BM_cumSumRowReduction_1T/64 [using 1 threads ] 4.07µs ± 2% 3.47µs ± 0% -14.70% BM_cumSumRowReduction_1T/128 [using 1 threads ] 18.0µs ± 0% 16.8µs ± 0% -6.58% BM_cumSumRowReduction_1T/512 [using 1 threads ] 287µs ± 0% 281µs ± 0% -2.22% BM_cumSumRowReduction_1T/2k [using 1 threads ] 4.78ms ± 1% 4.78ms ± 2% ~ BM_cumSumRowReduction_1T/10k [using 1 threads ] 117ms ± 1% 117ms ± 1% ~ BM_cumSumRowReduction_8T/4 [using 8 threads ] 25.0ns ± 0% 35.2ns ± 0% +40.82% BM_cumSumRowReduction_8T/8 [using 8 threads ] 77.2ns ±16% 81.3ns ± 0% ~ BM_cumSumRowReduction_8T/32 [using 8 threads ] 988ns ± 0% 833ns ± 0% -15.67% BM_cumSumRowReduction_8T/64 [using 8 threads ] 4.08µs ± 2% 3.47µs ± 0% -14.95% BM_cumSumRowReduction_8T/128 [using 8 threads ] 18.0µs ± 0% 17.3µs ±10% ~ BM_cumSumRowReduction_8T/512 [using 8 threads ] 287µs ± 0% 58µs ± 6% -79.92% BM_cumSumRowReduction_8T/2k [using 8 threads ] 4.79ms ± 1% 0.64ms ± 1% -86.58% BM_cumSumRowReduction_8T/10k [using 8 threads ] 117ms ± 1% 18ms ± 6% -84.50% BM_cumSumColReduction_1T/4 [using 1 threads ] 23.9ns ± 0% 33.4ns ± 1% +39.68% BM_cumSumColReduction_1T/8 [using 1 threads ] 71.6ns ± 1% 49.1ns ± 3% -31.40% BM_cumSumColReduction_1T/32 [using 1 threads ] 973ns ± 0% 165ns ± 2% -83.10% BM_cumSumColReduction_1T/64 [using 1 threads ] 4.06µs ± 1% 0.57µs ± 1% -85.94% BM_cumSumColReduction_1T/128 [using 1 threads ] 33.4µs ± 1% 4.1µs ± 1% -87.67% BM_cumSumColReduction_1T/512 [using 1 
threads ] 1.72ms ± 4% 0.21ms ± 5% -87.91% BM_cumSumColReduction_1T/2k [using 1 threads ] 119ms ±53% 11ms ±35% -90.42% BM_cumSumColReduction_1T/10k [using 1 threads ] 1.59s ±67% 0.35s ±49% -77.96% BM_cumSumColReduction_8T/4 [using 8 threads ] 23.8ns ± 0% 33.3ns ± 0% +40.06% BM_cumSumColReduction_8T/8 [using 8 threads ] 71.6ns ± 1% 49.2ns ± 5% -31.33% BM_cumSumColReduction_8T/32 [using 8 threads ] 1.01µs ±12% 0.17µs ± 3% -82.93% BM_cumSumColReduction_8T/64 [using 8 threads ] 4.15µs ± 4% 0.58µs ± 1% -86.09% BM_cumSumColReduction_8T/128 [using 8 threads ] 33.5µs ± 0% 4.1µs ± 4% -87.65% BM_cumSumColReduction_8T/512 [using 8 threads ] 1.71ms ± 3% 0.06ms ±16% -96.21% BM_cumSumColReduction_8T/2k [using 8 threads ] 97.1ms ±14% 3.0ms ±23% -96.88% BM_cumSumColReduction_8T/10k [using 8 threads ] 1.97s ± 8% 0.06s ± 2% -96.74%	2020-05-05 00:19:43 +00:00
Rasmus Munk Larsen	ab773c7e91	Extend support for Packet16b: * Add ptranspose<,4> to support matmul and add unit test for Matrix<bool> Matrix<bool> * work around a bug in slicing of Tensor<bool>. * Add tensor tests This speeds up matmul for boolean matrices by about 10x name old time/op new time/op delta BM_MatMul<bool>/8 267ns ± 0% 479ns ± 0% +79.25% (p=0.008 n=5+5) BM_MatMul<bool>/32 6.42µs ± 0% 0.87µs ± 0% -86.50% (p=0.008 n=5+5) BM_MatMul<bool>/64 43.3µs ± 0% 5.9µs ± 0% -86.42% (p=0.008 n=5+5) BM_MatMul<bool>/128 315µs ± 0% 44µs ± 0% -85.98% (p=0.008 n=5+5) BM_MatMul<bool>/256 2.41ms ± 0% 0.34ms ± 0% -85.68% (p=0.008 n=5+5) BM_MatMul<bool>/512 18.8ms ± 0% 2.7ms ± 0% -85.53% (p=0.008 n=5+5) BM_MatMul<bool>/1k 149ms ± 0% 22ms ± 0% -85.40% (p=0.008 n=5+5)	2020-04-28 16:12:47 +00:00
Eugene Zhulenev	3c02fefec5	Add async evaluation support to TensorSlicingOp. Device::memcpy is not async-safe and might lead to deadlocks. Always evaluate slice expression in async mode.	2020-04-22 19:55:01 +00:00
Rasmus Munk Larsen	2f6ddaa25c	Add partial vectorization for matrices and tensors of bool. This speeds up boolean operations on Tensors by up to 25x. Benchmark numbers for the logical and of two NxN tensors: name old time/op new time/op delta BM_booleanAnd_1T/3 [using 1 threads] 14.6ns ± 0% 14.4ns ± 0% -0.96% BM_booleanAnd_1T/4 [using 1 threads] 20.5ns ±12% 9.0ns ± 0% -56.07% BM_booleanAnd_1T/7 [using 1 threads] 41.7ns ± 0% 10.5ns ± 0% -74.87% BM_booleanAnd_1T/8 [using 1 threads] 52.1ns ± 0% 10.1ns ± 0% -80.59% BM_booleanAnd_1T/10 [using 1 threads] 76.3ns ± 0% 13.8ns ± 0% -81.87% BM_booleanAnd_1T/15 [using 1 threads] 167ns ± 0% 16ns ± 0% -90.45% BM_booleanAnd_1T/16 [using 1 threads] 188ns ± 0% 16ns ± 0% -91.57% BM_booleanAnd_1T/31 [using 1 threads] 667ns ± 0% 34ns ± 0% -94.83% BM_booleanAnd_1T/32 [using 1 threads] 710ns ± 0% 35ns ± 0% -95.01% BM_booleanAnd_1T/64 [using 1 threads] 2.80µs ± 0% 0.11µs ± 0% -95.93% BM_booleanAnd_1T/128 [using 1 threads] 11.2µs ± 0% 0.4µs ± 0% -96.11% BM_booleanAnd_1T/256 [using 1 threads] 44.6µs ± 0% 2.5µs ± 0% -94.31% BM_booleanAnd_1T/512 [using 1 threads] 178µs ± 0% 10µs ± 0% -94.35% BM_booleanAnd_1T/1k [using 1 threads] 717µs ± 0% 78µs ± 1% -89.07% BM_booleanAnd_1T/2k [using 1 threads] 2.87ms ± 0% 0.31ms ± 1% -89.08% BM_booleanAnd_1T/4k [using 1 threads] 11.7ms ± 0% 1.9ms ± 4% -83.55% BM_booleanAnd_1T/10k [using 1 threads] 70.3ms ± 0% 17.2ms ± 4% -75.48%	2020-04-20 20:16:28 +00:00
Changming Sun	b1aa07a8d3	Fix a bug in TensorIndexList.h	2020-04-13 18:22:03 +00:00
jangsoopark	39142904cc	Resolve C4346 when building eigen on windows	2020-04-08 14:55:39 +09:00
Aaron Franke	5c22c7a7de	Make file formatting comply with POSIX and Unix standards UTF-8, LF, no BOM, and newlines at the end of files	2020-03-23 18:09:02 +00:00
Deven Desai	7158ed4e0e	Fixing HIP breakage caused by the recent commit that introduces Packet4h2 as the Eigen::Half packet type	2020-03-12 01:06:24 +00:00
Sami Kama	b733b8b680	Remove duplicate pset1 for half and add some comments about why we need to expose pmul/add/div/min/max on host	2020-03-10 20:28:43 +00:00
Cédric Hubert	98bfc5aaa8	Update MarketIO.h	2020-02-28 12:41:51 +00:00
Ilya Tokar	eb6cc29583	Avoid a division in NonBlockingThreadPool::Steal. Looking at profiles we spend ~10-20% of Steal on simply computing random % size. We can reduce random 32-bit int into [0, size) range with a single multiplication and shift. This transformation is described in https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/	2020-02-14 16:02:57 -05:00
Eugene Zhulenev	f584bd9b30	Fail at compile time if default executor tries to use non-default device	2020-02-06 22:43:24 +00:00
Eugene Zhulenev	3fda850c46	Remove dead code from TensorReduction.h	2020-01-29 18:45:31 +00:00
Jeff Daily	b5df8cabd7	fix hip-clang compilation due to new HIP scalar accessor	2020-01-20 21:08:52 +00:00
Deven Desai	6d284bb1b7	Fix for HIP breakage - 200115. Adding a missing EIGEN_DEVICE_FUNC attr	2020-01-16 00:51:43 +00:00
Srinivas Vasudevan	f6c6de5d63	Ensure Igamma does not NaN or Inf for large values.	2020-01-14 21:32:48 +00:00
Eugene Zhulenev	b9362fb8f7	Convert StridedLinearBufferCopy::Kind to enum class	2020-01-13 11:43:24 -08:00
Srinivas Vasudevan	2e099e8d8f	Added special_packetmath test and tweaked bounds on tests. Refactor shared packetmath code to header file. (Squashed from PR !38)	2020-01-11 10:31:21 +00:00
Matthew Powelson	2ea5a715cf	Properly initialize b vector in SplineFitting. InterpolateWithDerivative does not initialize the b vector correctly. This issue is discussed in Stack Overflow question 48382939.	2020-01-09 21:29:04 +00:00
Ilya Tokar	19876ced76	Bug #1785 : Introduce numext::rint. This provides a new op that matches std::rint and previous behavior of pround. Also adds corresponding unsupported/../Tensor op. Performance is the same as e. g. floor (tested SSE/AVX).	2020-01-07 21:22:44 +00:00
mehdi-goli	d0ae052da4	[SYCL Backend] * Adding Missing operations for vector comparison in SYCL. This caused compiler error for vector comparison when compiling SYCL * Fixing the compiler error for placement new in TensorForcedEval.h This caused compiler error when compiling SYCL backend * Reducing the SYCL warning by removing the abort function inside the kernel * Adding Strong inline to functions inside SYCL interop.	2020-01-07 15:13:37 +00:00
Deven Desai	636e2bb3fa	Fix for HIP breakage - 191220 The breakage was introduced by the following commit : `ae07801dd8` After the commit, HIPCC errors out on some tests with the following error ``` Building HIPCC object unsupported/test/CMakeFiles/cxx11_tensor_device_1.dir/cxx11_tensor_device_1_generated_cxx11_tensor_device.cu.o In file included from /home/rocm-user/eigen/unsupported/test/cxx11_tensor_device.cu:17: In file included from /home/rocm-user/eigen/unsupported/Eigen/CXX11/Tensor:100: /home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:129:12: error: no matching constructor for initialization of 'Eigen::internal::TensorBlockResourceRequirements' return {merge(lhs.shape_type, rhs.shape_type), // shape_type ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ /home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:75:8: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 3 were provided struct TensorBlockResourceRequirements { ^ /home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:75:8: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 3 were provided /home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:75:8: note: candidate constructor (the implicit copy constructor) not viable: requires 5 arguments, but 3 were provided /home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:75:8: note: candidate constructor (the implicit default constructor) not viable: requires 0 arguments, but 3 were provided ... ... ``` The fix is to explicitly declare the (implicitly called) constructor as a device func	2019-12-20 21:28:00 +00:00
Christoph Hertzberg	1e9664b147	Bug #1796 : Make matrix squareroot usable for Map and Ref types	2019-12-20 18:10:22 +01:00
Christoph Hertzberg	d86544d654	Reduce code duplication and avoid confusing Doxygen	2019-12-19 19:48:39 +01:00
Christoph Hertzberg	dde279f57d	Hide recursive meta templates from Doxygen	2019-12-19 19:47:23 +01:00
Christoph Hertzberg	c21771ac04	Use double-braces initialization (as everywhere else in the test-suite).	2019-12-19 19:20:48 +01:00
Christoph Hertzberg	a3273aeff8	Fix trivial shadow warning	2019-12-19 19:13:11 +01:00
Eugene Zhulenev	7a65219a2e	Fix TensorPadding bug in squeezed reads from inner dimension	2019-12-19 05:43:57 +00:00
Eugene Zhulenev	73e55525e5	Return const data pointer from TensorRef evaluator.data()	2019-12-18 23:19:36 +00:00
Eugene Zhulenev	ae07801dd8	Tensor block evaluation cost model	2019-12-18 20:07:00 +00:00
Jeff Daily	de07c4d1c2	fix compilation due to new HIP scalar accessor	2019-12-17 20:27:30 +00:00
Eugene Zhulenev	788bef6ab5	Reduce block evaluation overhead for small tensor expressions	2019-12-17 19:06:14 +00:00
Eugene Zhulenev	381f8f3139	Initialize non-trivially constructible types when allocating a temp buffer.	2019-12-12 01:31:30 +00:00
Eugene Zhulenev	64272c7f40	Squeeze reads from two inner dimensions in TensorPadding	2019-12-11 16:54:51 -08:00
Eugene Zhulenev	963ba1015b	Add back accidentally deleted default constructor to TensorExecutorTilingContext.	2019-12-11 18:47:55 +00:00
Eugene Zhulenev	c9220c035f	Remove block memory allocation required by removed block evaluation API	2019-12-10 17:15:55 -08:00
Eugene Zhulenev	1c879eb010	Remove V2 suffix from TensorBlock	2019-12-10 15:40:23 -08:00
Eugene Zhulenev	dbca11e880	Remove TensorBlock.h and old TensorBlock/BlockMapper	2019-12-10 14:31:44 -08:00
Deven Desai	c49f0d851a	Fix for HIP breakage detected on 191210 The following commit introduces compile errors when running eigen with hipcc `2918f85ba9` hipcc errors out because it requires the device attribute on the methods within the TensorBlockV2ResourceRequirements struct introduced by the commit above. The fix is to add the device attribute to those methods	2019-12-10 22:14:05 +00:00
Eugene Zhulenev	2918f85ba9	Do not use std::vector in getResourceRequirements	2019-12-09 16:19:55 -08:00
Artem Belevich	8056a05b54	Undo the block size change. .z is used by the EigenContractionKernelInternal().	2019-12-09 11:10:29 -08:00
Eugene Zhulenev	dbb703d44e	Add async evaluation support to TensorSelectOp	2019-12-09 18:36:13 +00:00
Janek Kozicki	11d6465326	fix AlignedVector3 inconsistent interface with other Vector classes, default constructor and operator- were missing.	2019-12-06 21:07:39 +01:00
Eugene Zhulenev	bb7ccac3af	Add recursive work splitting to EvalShardedByInnerDimContext	2019-12-05 14:51:49 -08:00
Artem Belevich	25230d1862	Improve performance of contraction kernels * Force-inline implementations. They pass around pointers to shared memory blocks. Without inlining compiler must operate via generic pointers. Inlining allows compiler to detect that we're operating on shared memory which allows generation of substantially faster code. * Fixed a long-standing typo which resulted in launching 8x more kernels than we needed (.z dimension of the block is unused by the kernel).	2019-12-05 12:48:34 -08:00
Rasmus Munk Larsen	366cf005b0	Add missing initialization in cxx11_tensor_trace.cpp.	2019-12-04 23:56:37 +00:00
Eugene Zhulenev	8f4536e852	Capture TensorMap by value inside tensor expression AST	2019-12-03 16:39:05 -08:00
Rasmus Munk Larsen	4e696901f8	Remove __host__ annotation for device-only function.	2019-12-03 14:33:19 -08:00
Rasmus Munk Larsen	ead81559c8	Use EIGEN_DEVICE_FUNC macro instead of __device__.	2019-12-03 12:08:22 -08:00
Mehdi Goli	00f32752f7	[SYCL] Rebasing the SYCL support branch on top of the Eigen upstream master branch. * Unifying all loadLocalTile from lhs and rhs to an extract_block function. * Adding get_tensor operation which was missing in TensorContractionMapper. * Adding the -D method missing from cmake for Disable_Skinny Contraction operation. * Wrapping all the indices in TensorScanSycl into Scan parameter struct. * Fixing typo in Device SYCL * Unifying load to private register for tall/skinny no shared * Unifying load to vector tile for tensor-vector/vector-tensor operation * Removing all the LHS/RHS class for extracting data from global * Removing Outputfunction from TensorContractionSkinnyNoshared. * Combining the local memory version of tall/skinny and normal tensor contraction into one kernel. * Combining the no-local memory version of tall/skinny and normal tensor contraction into one kernel. * Combining General Tensor-Vector and VectorTensor contraction into one kernel. * Making double buffering optional for Tensor contraction when the local memory version is used. * Modifying benchmark to accept custom Reduction Sizes * Disabling AVX optimization for SYCL backend on the host to allow SSE optimization to the host * Adding Test for SYCL * Modifying SYCL CMake	2019-11-28 10:08:54 +00:00
Eugene Zhulenev	5496d0da0b	Add async evaluation support to TensorReverse	2019-11-26 15:02:24 -08:00
Eugene Zhulenev	bc66c88255	Add async evaluation support to TensorPadding/TensorImagePatch/TensorShuffling	2019-11-26 11:41:57 -08:00
Hans Johnson	8c8cab1afd	STYLE: Convert CMake-language commands to lower case Ancient CMake versions required upper-case commands. Later command names became case-insensitive. Now the preferred style is lower-case.	2019-10-31 11:36:37 -05:00
Hans Johnson	6fb3e5f176	STYLE: Remove CMake-language block-end command arguments Ancient versions of CMake required else(), endif(), and similar block termination commands to have arguments matching the command starting the block. This is no longer the preferred style.	2019-10-31 11:36:27 -05:00
Gael Guennebaud	c3f6fcf2c0	bug #1747 : one more fix for MSVC regarding the Bessel implementation.	2019-11-15 11:12:35 +01:00
Gael Guennebaud	b9837ca9ae	bug #1281 : fix AutoDiffScalar's make_coherent for nested expression of constant ADs.	2019-11-14 14:58:08 +01:00
Eugene Zhulenev	13c3327f5c	Remove legacy block evaluation support	2019-11-12 10:12:28 -08:00
Rasmus Munk Larsen	0ed0338593	Fix a race in async tensor evaluation: Don't run on_done() until after device.deallocate() / evaluator.cleanup() complete, since the device might be destroyed after on_done() runs.	2019-11-11 12:26:41 -08:00
Eugene Zhulenev	c952b8dfda	Break loop dependence in TensorGenerator block access	2019-11-11 10:32:57 -08:00
Rasmus Munk Larsen	ebf04fb3e8	Fix data race in cxx11_tensor_notification test.	2019-11-08 17:44:50 -08:00
Rasmus Munk Larsen	cc3d0e6a40	Add EIGEN_HAS_INTRINSIC_INT128 macro Add a new EIGEN_HAS_INTRINSIC_INT128 macro, and use this instead of __SIZEOF_INT128__. This fixes related issues with TensorIntDiv.h when building with Clang for Windows, where support for 128-bit integer arithmetic is advertised but broken in practice.	2019-11-06 14:24:33 -08:00
Rasmus Munk Larsen	ee404667e2	Rollback of PR-746 and partial rollback of `668ab3fc47` . std::array is still not supported in CUDA device code on Windows.	2019-11-05 17:17:58 -08:00
Rasmus Larsen	0c9745903a	Merged in ezhulenev/eigen-01 (pull request PR-746) Remove internal::smart_copy and replace with std::copy	2019-11-04 20:18:38 +00:00
Eugene Zhulenev	73ecb2c57d	Cleanup includes in Tensor module after switch to C++11 and above	2019-10-29 15:49:54 -07:00
Eugene Zhulenev	e7ed4bd388	Remove internal::smart_copy and replace with std::copy	2019-10-29 11:25:24 -07:00
Eugene Zhulenev	fbc0a9a3ec	Fix CXX11Meta compilation with MSVC	2019-10-28 18:30:10 -07:00
Eugene Zhulenev	bd864ab42b	Prevent potential ODR in TensorExecutor	2019-10-28 15:45:09 -07:00
Mehdi Goli	6332aff0b2	This PR fixes: * The specialization of array class in the different namespace for GCC<=6.4 * The implicit call to `std::array` constructor using the initializer list for GCC <=6.1	2019-10-23 15:56:56 +01:00
Rasmus Larsen	8e4e29ae99	Merged in deven-amd/eigen-hip-fix-191018 (pull request PR-738) Fix for the HIP build+test errors.	2019-10-22 22:18:38 +00:00
Rasmus Munk Larsen	97c0c5d485	Add block evaluation V2 to TensorAsyncExecutor. Add async evaluation to a number of ops.	2019-10-22 12:42:44 -07:00
Deven Desai	102cf2a72d	Fix for the HIP build+test errors. The errors were introduced by this commit : After the above mentioned commit, some of the tests started failing with the following error ``` Built target cxx11_tensor_reduction Building HIPCC object unsupported/test/CMakeFiles/cxx11_tensor_reduction_gpu_5.dir/cxx11_tensor_reduction_gpu_5_generated_cxx11_tensor_reduction_gpu.cu.o In file included from /home/rocm-user/eigen/unsupported/test/cxx11_tensor_reduction_gpu.cu:16: In file included from /home/rocm-user/eigen/unsupported/Eigen/CXX11/Tensor:117: /home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlockV2.h:155:5: error: the field type is not amp-compatible DestinationBufferKind m_kind; ^ /home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlockV2.h:211:3: error: the field type is not amp-compatible DestinationBuffer m_destination; ^ ``` For some reason HIPCC does not like device code to contain enum types which do not have the base-type explicitly declared. The fix is trivial, explicitly state "int" as the basetype	2019-10-22 19:21:27 +00:00
Rasmus Munk Larsen	668ab3fc47	Drop support for c++03 in Eigen tensor. Get rid of some code used to emulate c++11 functionality with older compilers.	2019-10-18 16:42:00 -07:00
Eugene Zhulenev	df0e8b8137	Propagate block evaluation preference through rvalue tensor expressions	2019-10-17 11:17:33 -07:00
Eugene Zhulenev	0d2a14ce11	Cleanup Tensor block destination and materialized block storage allocation	2019-10-16 17:14:37 -07:00
Eugene Zhulenev	02431cbe71	TensorBroadcasting support for random/uniform blocks	2019-10-16 13:26:28 -07:00
Eugene Zhulenev	d380c23b2c	Block evaluation for TensorGenerator/TensorReverse/TensorShuffling	2019-10-14 14:31:59 -07:00
Gael Guennebaud	39fb9eeccf	bug #1747 : fix compilation with MSVC	2019-10-14 22:50:23 +02:00
Eugene Zhulenev	a411e9f344	Block evaluation for TensorGenerator + TensorReverse + fixed bug in tensor reverse op	2019-10-10 10:56:58 -07:00
Eugene Zhulenev	33e1746139	Block evaluation for TensorChipping + fixed bugs in TensorPadding and TensorSlicing	2019-10-09 12:45:31 -07:00
Gael Guennebaud	f0a4642bab	Implement c++03 compatible fix for changeset `7a43af1a33`	2019-10-09 16:00:57 +02:00
Gael Guennebaud	7a43af1a33	Fix compilation of FFTW unit test	2019-10-08 08:58:35 +02:00
Eugene Zhulenev	f74ab8cb8d	Add block evaluation to TensorEvalTo and fix few small bugs	2019-10-07 15:34:26 -07:00
Brian Zhao	3afb640b56	Fixing incorrect size in Tensor documentation.	2019-10-04 21:30:35 -07:00
Rasmus Munk Larsen	20c4a9118f	Use "pdiv" rather than operator/ to support packet types.	2019-10-04 16:54:03 -07:00
Eugene Zhulenev	98bdd7252e	Fix compilation warnings and errors with clang in TensorBlockV2 code and tests	2019-10-04 10:15:33 -07:00
Eugene Zhulenev	60ae24ee1a	Add block evaluation to TensorReshaping/TensorCasting/TensorPadding/TensorSelect	2019-10-02 12:44:06 -07:00
Eugene Zhulenev	6e40454a6e	Add beta to TensorContractionKernel and make memset optional	2019-10-02 11:06:02 -07:00
Rasmus Munk Larsen	13ef08e5ac	Move implementation of vectorized error function erf() to SpecialFunctionsImpl.h.	2019-09-27 13:56:04 -07:00
Eugene Zhulenev	7c8bc0d928	Fix cxx11_tensor_block_io test	2019-09-25 11:48:11 -07:00
Eugene Zhulenev	71d5bedf72	Fix compilation warnings and errors with clang in TensorBlockV2	2019-09-25 11:25:22 -07:00
Deven Desai	5e186b1987	Fix for the HIP build+test errors. The errors were introduced by this commit : `d38e6fbc27` After the above mentioned commit, some of the tests started failing with the following error ``` Building HIPCC object unsupported/test/CMakeFiles/cxx11_tensor_reduction_gpu_5.dir/cxx11_tensor_reduction_gpu_5_generated_cxx11_tensor_reduction_gpu.cu.o In file included from /home/rocm-user/eigen/unsupported/test/cxx11_tensor_reduction_gpu.cu:16: In file included from /home/rocm-user/eigen/unsupported/Eigen/CXX11/Tensor:29: In file included from /home/rocm-user/eigen/unsupported/Eigen/CXX11/../SpecialFunctions:70: /home/rocm-user/eigen/unsupported/Eigen/CXX11/../src/SpecialFunctions/SpecialFunctionsHalf.h:28:22: error: call to 'erf' is ambiguous return Eigen::half(Eigen::numext::erf(static_cast<float>(a))); ^~~~~~~~~~~~~~~~~~ /home/rocm-user/eigen/unsupported/test/../../Eigen/src/Core/MathFunctions.h:1600:7: note: candidate function [with T = float] float erf(const float &x) { return ::erff(x); } ^ /home/rocm-user/eigen/unsupported/Eigen/CXX11/../src/SpecialFunctions/SpecialFunctionsImpl.h:1897:5: note: candidate function [with Scalar = float] erf(const Scalar& x) { ^ In file included from /home/rocm-user/eigen/unsupported/test/cxx11_tensor_reduction_gpu.cu:16: In file included from /home/rocm-user/eigen/unsupported/Eigen/CXX11/Tensor:29: In file included from /home/rocm-user/eigen/unsupported/Eigen/CXX11/../SpecialFunctions:75: /home/rocm-user/eigen/unsupported/Eigen/CXX11/../src/SpecialFunctions/arch/GPU/GpuSpecialFunctions.h:87:23: error: call to 'erf' is ambiguous return make_double2(erf(a.x), erf(a.y)); ^~~ /home/rocm-user/eigen/unsupported/test/../../Eigen/src/Core/MathFunctions.h:1603:8: note: candidate function [with T = double] double erf(const double &x) { return ::erf(x); } ^ /home/rocm-user/eigen/unsupported/Eigen/CXX11/../src/SpecialFunctions/SpecialFunctionsImpl.h:1897:5: note: candidate function [with Scalar = double] erf(const Scalar& x) { 
^ In file included from /home/rocm-user/eigen/unsupported/test/cxx11_tensor_reduction_gpu.cu:16: In file included from /home/rocm-user/eigen/unsupported/Eigen/CXX11/Tensor:29: In file included from /home/rocm-user/eigen/unsupported/Eigen/CXX11/../SpecialFunctions:75: /home/rocm-user/eigen/unsupported/Eigen/CXX11/../src/SpecialFunctions/arch/GPU/GpuSpecialFunctions.h:87:33: error: call to 'erf' is ambiguous return make_double2(erf(a.x), erf(a.y)); ^~~ /home/rocm-user/eigen/unsupported/test/../../Eigen/src/Core/MathFunctions.h:1603:8: note: candidate function [with T = double] double erf(const double &x) { return ::erf(x); } ^ /home/rocm-user/eigen/unsupported/Eigen/CXX11/../src/SpecialFunctions/SpecialFunctionsImpl.h:1897:5: note: candidate function [with Scalar = double] erf(const Scalar& x) { ^ 3 errors generated. ``` This PR fixes the compile error by removing the "old" implementation for "erf" (assuming that the "new" implementation is what we want going forward; from a GPU point of view, both implementations are the same). This PR also fixes what seems like a cut-n-paste error in the aforementioned commit.	2019-09-25 15:39:13 +00:00
Eugene Zhulenev	f35b9ab510	Fix a bug in a packed block type in TensorContractionThreadPool	2019-09-24 16:54:36 -07:00
Rasmus Larsen	d38e6fbc27	Merged in rmlarsen/eigen (pull request PR-704) Add generic PacketMath implementation of the Error Function (erf).	2019-09-24 23:40:29 +00:00
Rasmus Munk Larsen	591a554c68	Add TODO to cleanup FMA cost modelling.	2019-09-24 16:39:25 -07:00
Eugene Zhulenev	c64396b4c6	Choose TensorBlock StridedLinearCopy type statically	2019-09-24 16:04:29 -07:00
Eugene Zhulenev	c97b208468	Add new TensorBlock api implementation + tests	2019-09-24 15:17:35 -07:00
Eugene Zhulenev	ef9dfee7bd	Tensor block evaluation V2 support for unary/binary/broadcasting	2019-09-24 12:52:45 -07:00
Christoph Hertzberg	e4c1b3c1d2	Fix implicit conversion warnings and use pnegate to negate packets	2019-09-23 16:07:43 +02:00
Christoph Hertzberg	ba0736fa8e	Fix (or mask away) conversion warnings introduced in `553caeb6a3`.	2019-09-23 15:58:05 +02:00
Rasmus Munk Larsen	1d5af0693c	Add support for asynchronous evaluation of tensor casting expressions.	2019-09-19 13:54:49 -07:00
Rasmus Munk Larsen	6de5ed08d8	Add generic PacketMath implementation of the Error Function (erf).	2019-09-19 12:48:30 -07:00
Srinivas Vasudevan	df0816b71f	Merging eigen/eigen.	2019-09-16 19:33:29 -04:00
Srinivas Vasudevan	6e215cf109	Add Bessel functions to SpecialFunctions. - Split SpecialFunctions files into a separate BesselFunctions file. In particular add: - Modified Bessel functions of the second kind k0, k1, k0e, k1e - Bessel functions of the first kind j0, j1 - Bessel functions of the second kind y0, y1	2019-09-14 12:16:47 -04:00
Eugene Zhulenev	bf8866b466	Fix maybe-uninitialized warnings in TensorContractionThreadPool	2019-09-13 14:29:55 -07:00
Eugene Zhulenev	553caeb6a3	Use ThreadLocal container in TensorContractionThreadPool	2019-09-13 12:14:44 -07:00
Srinivas Vasudevan	facdec5aa7	Add packetized versions of i0e and i1e special functions. - In particular refactor the i0e and i1e code so scalar and vectorized path share code. - Move chebevl to GenericPacketMathFunctions. A brief benchmark with building Eigen with FMA, AVX and AVX2 flags Before: CPU: Intel Haswell with HyperThreading (6 cores) Benchmark Time(ns) CPU(ns) Iterations ----------------------------------------------------------------- BM_eigen_i0e_double/1 57.3 57.3 10000000 BM_eigen_i0e_double/8 398 398 1748554 BM_eigen_i0e_double/64 3184 3184 218961 BM_eigen_i0e_double/512 25579 25579 27330 BM_eigen_i0e_double/4k 205043 205042 3418 BM_eigen_i0e_double/32k 1646038 1646176 422 BM_eigen_i0e_double/256k 13180959 13182613 53 BM_eigen_i0e_double/1M 52684617 52706132 10 BM_eigen_i0e_float/1 28.4 28.4 24636711 BM_eigen_i0e_float/8 75.7 75.7 9207634 BM_eigen_i0e_float/64 512 512 1000000 BM_eigen_i0e_float/512 4194 4194 166359 BM_eigen_i0e_float/4k 32756 32761 21373 BM_eigen_i0e_float/32k 261133 261153 2678 BM_eigen_i0e_float/256k 2087938 2088231 333 BM_eigen_i0e_float/1M 8380409 8381234 84 BM_eigen_i1e_double/1 56.3 56.3 10000000 BM_eigen_i1e_double/8 397 397 1772376 BM_eigen_i1e_double/64 3114 3115 223881 BM_eigen_i1e_double/512 25358 25361 27761 BM_eigen_i1e_double/4k 203543 203593 3462 BM_eigen_i1e_double/32k 1613649 1613803 428 BM_eigen_i1e_double/256k 12910625 12910374 54 BM_eigen_i1e_double/1M 51723824 51723991 10 BM_eigen_i1e_float/1 28.3 28.3 24683049 BM_eigen_i1e_float/8 74.8 74.9 9366216 BM_eigen_i1e_float/64 505 505 1000000 BM_eigen_i1e_float/512 4068 4068 171690 BM_eigen_i1e_float/4k 31803 31806 21948 BM_eigen_i1e_float/32k 253637 253692 2763 BM_eigen_i1e_float/256k 2019711 2019918 346 BM_eigen_i1e_float/1M 8238681 8238713 86 After: CPU: Intel Haswell with HyperThreading (6 cores) Benchmark Time(ns) CPU(ns) Iterations ----------------------------------------------------------------- BM_eigen_i0e_double/1 15.8 15.8 44097476 BM_eigen_i0e_double/8 99.3 
99.3 7014884 BM_eigen_i0e_double/64 777 777 886612 BM_eigen_i0e_double/512 6180 6181 100000 BM_eigen_i0e_double/4k 48136 48140 14678 BM_eigen_i0e_double/32k 385936 385943 1801 BM_eigen_i0e_double/256k 3293324 3293551 228 BM_eigen_i0e_double/1M 12423600 12424458 57 BM_eigen_i0e_float/1 16.3 16.3 43038042 BM_eigen_i0e_float/8 30.1 30.1 23456931 BM_eigen_i0e_float/64 169 169 4132875 BM_eigen_i0e_float/512 1338 1339 516860 BM_eigen_i0e_float/4k 10191 10191 68513 BM_eigen_i0e_float/32k 81338 81337 8531 BM_eigen_i0e_float/256k 651807 651984 1000 BM_eigen_i0e_float/1M 2633821 2634187 268 BM_eigen_i1e_double/1 16.2 16.2 42352499 BM_eigen_i1e_double/8 110 110 6316524 BM_eigen_i1e_double/64 822 822 851065 BM_eigen_i1e_double/512 6480 6481 100000 BM_eigen_i1e_double/4k 51843 51843 10000 BM_eigen_i1e_double/32k 414854 414852 1680 BM_eigen_i1e_double/256k 3320001 3320568 212 BM_eigen_i1e_double/1M 13442795 13442391 53 BM_eigen_i1e_float/1 17.6 17.6 41025735 BM_eigen_i1e_float/8 35.5 35.5 19597891 BM_eigen_i1e_float/64 240 240 2924237 BM_eigen_i1e_float/512 1424 1424 485953 BM_eigen_i1e_float/4k 10722 10723 65162 BM_eigen_i1e_float/32k 86286 86297 8048 BM_eigen_i1e_float/256k 691821 691868 1000 BM_eigen_i1e_float/1M 2777336 2777747 256 This shows anywhere from a 50% to 75% improvement on these operations. I've also benchmarked without any of these flags turned on, and got similar performance to before (if not better). Also tested packetmath.cpp + special_functions to ensure no regressions.	2019-09-11 18:34:02 -07:00
Deven Desai	cdb377d0cb	Fix for the HIP build+test errors introduced by the ndtri support. The fixes needed are * adding EIGEN_DEVICE_FUNC attribute to a couple of funcs (else HIPCC will error out when non-device funcs are called from global/device funcs) * switching to using ::<math_func> instead of std::<math_func> (only for HIPCC) in cases where the std::<math_func> is not recognized as a device func by HIPCC * removing an errant "j" from a testcase (don't know how that made it in to begin with!)	2019-09-06 16:03:49 +00:00
Eugene Zhulenev	d918bd9a8b	Update ThreadLocal to use separate Initialize/Release callables	2019-09-10 16:13:32 -07:00
Eugene Zhulenev	e3dec4dcc1	ThreadLocal container that does not rely on thread local storage	2019-09-09 15:18:14 -07:00
Srinivas Vasudevan	e38dd48a27	PR 681: Add ndtri function, the inverse of the normal distribution function.	2019-08-12 19:26:29 -04:00
Eugene Zhulenev	47fefa235f	Allow move-only done callback in TensorAsyncDevice	2019-09-03 17:20:56 -07:00
Eugene Zhulenev	a8d264fa9c	Add test for const TensorMap underlying data mutation	2019-09-03 11:38:39 -07:00
Eugene Zhulenev	f68f2bba09	TensorMap constness should not change underlying storage constness	2019-09-03 11:08:09 -07:00
Alberto Luaces	c694be1214	Fixed Tensor documentation formatting.	2019-07-23 09:24:06 +00:00
Eugene Zhulenev	79c402e40e	Fix shadow warnings in TensorContractionThreadPool	2019-08-30 15:38:31 -07:00
Eugene Zhulenev	edf2ec28d8	Fix block mapper type name in TensorExecutor	2019-08-30 15:29:25 -07:00
Eugene Zhulenev	f0b36fb9a4	evalSubExprsIfNeededAsync + async TensorContractionThreadPool	2019-08-30 15:13:38 -07:00
Eugene Zhulenev	619cea9491	Revert accidentally removed <memory> header from ThreadPool	2019-08-30 14:51:17 -07:00
Eugene Zhulenev	66665e7e76	Asynchronous expression evaluation with TensorAsyncDevice	2019-08-30 14:49:40 -07:00
Eugene Zhulenev	bc40d4522c	Const correctness in TensorMap<const Tensor<T, ...>> expressions	2019-08-28 17:46:05 -07:00
Eugene Zhulenev	6e77f9bef3	Remove shadow warnings in TensorDeviceThreadPool	2019-08-28 10:32:19 -07:00
Rasmus Larsen	84fefdf321	Merged in ezhulenev/eigen-01 (pull request PR-683) Asynchronous parallelFor in Eigen ThreadPoolDevice	2019-08-26 21:49:17 +00:00
maratek	8b5ab0e4dd	Fix get_random_seed on Native Client. Newlib in Native Client SDK does not provide ::random function. Implement get_random_seed for NaCl using ::rand, similarly to Windows version.	2019-08-23 15:25:56 -07:00
Eugene Zhulenev	6901788013	Asynchronous parallelFor in Eigen ThreadPoolDevice	2019-08-22 10:50:51 -07:00
Eugene Zhulenev	071311821e	Remove XSMM support from Tensor module	2019-08-19 11:44:25 -07:00
Rasmus Munk Larsen	facc4e4536	Disable tests for contraction with output kernels when using libxsmm, which does not support this.	2019-08-07 14:11:15 -07:00
Rasmus Munk Larsen	eab7e52db2	[Eigen] Vectorize evaluation of coefficient-wise functions over tensor blocks if the strides are known to be 1. Provides up to 20-25% speedup of the TF cross entropy op with AVX. A few benchmark numbers: name old time/op new time/op delta BM_Xent_16_10000_cpu 448µs ± 3% 389µs ± 2% -13.21% (p=0.008 n=5+5) BM_Xent_32_10000_cpu 575µs ± 6% 454µs ± 3% -21.00% (p=0.008 n=5+5) BM_Xent_64_10000_cpu 933µs ± 4% 712µs ± 1% -23.71% (p=0.008 n=5+5)	2019-08-07 12:57:42 -07:00
Rasmus Munk Larsen	0987126165	Clean up unnecessary namespace specifiers in TensorBlock.h.	2019-08-07 12:12:52 -07:00
Rasmus Munk Larsen	e2999d4c38	Fix performance regressions due to https://bitbucket.org/eigen/eigen/pull-requests/662 . The change caused the device struct to be copied for each expression evaluation, and caused, e.g., a 10% regression in the TensorFlow multinomial op on GPU: Benchmark Time(ns) CPU(ns) Iterations ---------------------------------------------------------------------- BM_Multinomial_gpu_1_100000_4 128173 231326 2922 1.610G items/s VS Benchmark Time(ns) CPU(ns) Iterations ---------------------------------------------------------------------- BM_Multinomial_gpu_1_100000_4 146683 246914 2719 1.509G items/s	2019-08-02 11:18:13 -07:00
Eugene Zhulenev	3cd148f983	Fix expression evaluation heuristic for TensorSliceOp	2019-07-09 12:10:26 -07:00
Eugene Zhulenev	6083014594	Add outer/inner chipping optimization for chipping dimension specified at runtime	2019-07-03 11:35:25 -07:00

2957 Commits