mehdi-goli
d3e81db6c5
Eigen moved the scanLauncher function inside the internal namespace.
...
This commit applies the following changes:
- Moving the `scanLauncher` specialization inside the internal namespace to fix a compiler crash on TensorScan for the SYCL backend.
- Replacing `SYCL/sycl.hpp` with `CL/sycl.hpp` in order to follow the SYCL 1.2.1 standard.
- Minor fixes: commenting out an unused variable to avoid compiler warnings.
2020-05-11 16:10:33 +01:00
Rasmus Munk Larsen
2fd8a5a08f
Add parallelization of TensorScanOp for types without packet ops.
...
Clean up the code a bit and do a few micro-optimizations to improve performance for small tensors.
Benchmark numbers for Tensor<uint32_t>:
name old time/op new time/op delta
BM_cumSumRowReduction_1T/8 [using 1 threads] 76.5ns ± 0% 61.3ns ± 4% -19.80% (p=0.008 n=5+5)
BM_cumSumRowReduction_1T/64 [using 1 threads] 2.47µs ± 1% 2.40µs ± 1% -2.77% (p=0.008 n=5+5)
BM_cumSumRowReduction_1T/256 [using 1 threads] 39.8µs ± 0% 39.6µs ± 0% -0.60% (p=0.008 n=5+5)
BM_cumSumRowReduction_1T/4k [using 1 threads] 13.9ms ± 0% 13.4ms ± 1% -4.19% (p=0.008 n=5+5)
BM_cumSumRowReduction_2T/8 [using 2 threads] 76.8ns ± 0% 59.1ns ± 0% -23.09% (p=0.016 n=5+4)
BM_cumSumRowReduction_2T/64 [using 2 threads] 2.47µs ± 1% 2.41µs ± 1% -2.53% (p=0.008 n=5+5)
BM_cumSumRowReduction_2T/256 [using 2 threads] 39.8µs ± 0% 34.7µs ± 6% -12.74% (p=0.008 n=5+5)
BM_cumSumRowReduction_2T/4k [using 2 threads] 13.8ms ± 1% 7.2ms ± 6% -47.74% (p=0.008 n=5+5)
BM_cumSumRowReduction_8T/8 [using 8 threads] 76.4ns ± 0% 61.8ns ± 3% -19.02% (p=0.008 n=5+5)
BM_cumSumRowReduction_8T/64 [using 8 threads] 2.47µs ± 1% 2.40µs ± 1% -2.84% (p=0.008 n=5+5)
BM_cumSumRowReduction_8T/256 [using 8 threads] 39.8µs ± 0% 28.3µs ±11% -28.75% (p=0.008 n=5+5)
BM_cumSumRowReduction_8T/4k [using 8 threads] 13.8ms ± 0% 2.7ms ± 5% -80.39% (p=0.008 n=5+5)
BM_cumSumColReduction_1T/8 [using 1 threads] 59.1ns ± 0% 80.3ns ± 0% +35.94% (p=0.029 n=4+4)
BM_cumSumColReduction_1T/64 [using 1 threads] 3.06µs ± 0% 3.08µs ± 1% ~ (p=0.114 n=4+4)
BM_cumSumColReduction_1T/256 [using 1 threads] 175µs ± 0% 176µs ± 0% ~ (p=0.190 n=4+5)
BM_cumSumColReduction_1T/4k [using 1 threads] 824ms ± 1% 844ms ± 1% +2.37% (p=0.008 n=5+5)
BM_cumSumColReduction_2T/8 [using 2 threads] 59.0ns ± 0% 90.7ns ± 0% +53.74% (p=0.029 n=4+4)
BM_cumSumColReduction_2T/64 [using 2 threads] 3.06µs ± 0% 3.10µs ± 0% +1.08% (p=0.016 n=4+5)
BM_cumSumColReduction_2T/256 [using 2 threads] 176µs ± 0% 189µs ±18% ~ (p=0.151 n=5+5)
BM_cumSumColReduction_2T/4k [using 2 threads] 836ms ± 2% 611ms ±14% -26.92% (p=0.008 n=5+5)
BM_cumSumColReduction_8T/8 [using 8 threads] 59.3ns ± 2% 90.6ns ± 0% +52.79% (p=0.008 n=5+5)
BM_cumSumColReduction_8T/64 [using 8 threads] 3.07µs ± 0% 3.10µs ± 0% +0.99% (p=0.016 n=5+4)
BM_cumSumColReduction_8T/256 [using 8 threads] 176µs ± 0% 80µs ±19% -54.51% (p=0.008 n=5+5)
BM_cumSumColReduction_8T/4k [using 8 threads] 827ms ± 2% 180ms ±14% -78.24% (p=0.008 n=5+5)
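For context, a minimal sketch of the code path this touches (thread count and tensor sizes are illustrative, not taken from the benchmark harness): a cumulative sum over a Tensor<uint32_t>, a type without packet ops, evaluated through a ThreadPoolDevice so the parallel scan path can be used.
```
#define EIGEN_USE_THREADS
#include <unsupported/Eigen/CXX11/Tensor>
#include <cstdint>

int main() {
  // Thread pool with 8 worker threads; the device schedules tensor
  // expressions onto it.
  Eigen::ThreadPool pool(8);
  Eigen::ThreadPoolDevice device(&pool, 8);

  // uint32_t has no packet (SIMD) ops, so this exercises the scalar scan
  // path that this commit parallelizes.
  Eigen::Tensor<uint32_t, 2> input(4096, 4096);
  input.setConstant(1);

  Eigen::Tensor<uint32_t, 2> result(4096, 4096);
  result.device(device) = input.cumsum(0);  // cumulative sum along dim 0
  return 0;
}
```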
2020-05-06 14:48:37 -07:00
Rasmus Munk Larsen
0e59f786e1
Fix accidental copy of loop variable.
2020-05-05 21:35:38 +00:00
Rasmus Munk Larsen
7b76c85daf
Vectorize and parallelize TensorScanOp.
...
TensorScanOp is used in TensorFlow for a number of operations, such as cumulative logexp reduction and cumulative sum and product reductions.
The benchmark numbers below are for cumulative row and column reductions of NxN matrices.
name old time/op new time/op delta
BM_cumSumRowReduction_1T/4 [using 1 threads ] 25.1ns ± 1% 35.2ns ± 1% +40.45%
BM_cumSumRowReduction_1T/8 [using 1 threads ] 73.4ns ± 0% 82.7ns ± 3% +12.74%
BM_cumSumRowReduction_1T/32 [using 1 threads ] 988ns ± 0% 832ns ± 0% -15.77%
BM_cumSumRowReduction_1T/64 [using 1 threads ] 4.07µs ± 2% 3.47µs ± 0% -14.70%
BM_cumSumRowReduction_1T/128 [using 1 threads ] 18.0µs ± 0% 16.8µs ± 0% -6.58%
BM_cumSumRowReduction_1T/512 [using 1 threads ] 287µs ± 0% 281µs ± 0% -2.22%
BM_cumSumRowReduction_1T/2k [using 1 threads ] 4.78ms ± 1% 4.78ms ± 2% ~
BM_cumSumRowReduction_1T/10k [using 1 threads ] 117ms ± 1% 117ms ± 1% ~
BM_cumSumRowReduction_8T/4 [using 8 threads ] 25.0ns ± 0% 35.2ns ± 0% +40.82%
BM_cumSumRowReduction_8T/8 [using 8 threads ] 77.2ns ±16% 81.3ns ± 0% ~
BM_cumSumRowReduction_8T/32 [using 8 threads ] 988ns ± 0% 833ns ± 0% -15.67%
BM_cumSumRowReduction_8T/64 [using 8 threads ] 4.08µs ± 2% 3.47µs ± 0% -14.95%
BM_cumSumRowReduction_8T/128 [using 8 threads ] 18.0µs ± 0% 17.3µs ±10% ~
BM_cumSumRowReduction_8T/512 [using 8 threads ] 287µs ± 0% 58µs ± 6% -79.92%
BM_cumSumRowReduction_8T/2k [using 8 threads ] 4.79ms ± 1% 0.64ms ± 1% -86.58%
BM_cumSumRowReduction_8T/10k [using 8 threads ] 117ms ± 1% 18ms ± 6% -84.50%
BM_cumSumColReduction_1T/4 [using 1 threads ] 23.9ns ± 0% 33.4ns ± 1% +39.68%
BM_cumSumColReduction_1T/8 [using 1 threads ] 71.6ns ± 1% 49.1ns ± 3% -31.40%
BM_cumSumColReduction_1T/32 [using 1 threads ] 973ns ± 0% 165ns ± 2% -83.10%
BM_cumSumColReduction_1T/64 [using 1 threads ] 4.06µs ± 1% 0.57µs ± 1% -85.94%
BM_cumSumColReduction_1T/128 [using 1 threads ] 33.4µs ± 1% 4.1µs ± 1% -87.67%
BM_cumSumColReduction_1T/512 [using 1 threads ] 1.72ms ± 4% 0.21ms ± 5% -87.91%
BM_cumSumColReduction_1T/2k [using 1 threads ] 119ms ±53% 11ms ±35% -90.42%
BM_cumSumColReduction_1T/10k [using 1 threads ] 1.59s ±67% 0.35s ±49% -77.96%
BM_cumSumColReduction_8T/4 [using 8 threads ] 23.8ns ± 0% 33.3ns ± 0% +40.06%
BM_cumSumColReduction_8T/8 [using 8 threads ] 71.6ns ± 1% 49.2ns ± 5% -31.33%
BM_cumSumColReduction_8T/32 [using 8 threads ] 1.01µs ±12% 0.17µs ± 3% -82.93%
BM_cumSumColReduction_8T/64 [using 8 threads ] 4.15µs ± 4% 0.58µs ± 1% -86.09%
BM_cumSumColReduction_8T/128 [using 8 threads ] 33.5µs ± 0% 4.1µs ± 4% -87.65%
BM_cumSumColReduction_8T/512 [using 8 threads ] 1.71ms ± 3% 0.06ms ±16% -96.21%
BM_cumSumColReduction_8T/2k [using 8 threads ] 97.1ms ±14% 3.0ms ±23% -96.88%
BM_cumSumColReduction_8T/10k [using 8 threads ] 1.97s ± 8% 0.06s ± 2% -96.74%
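As a reminder of what TensorScanOp computes, a small sketch of cumulative sums and products along each dimension of a 2x3 tensor (values arbitrary):
```
#include <unsupported/Eigen/CXX11/Tensor>
#include <iostream>

int main() {
  Eigen::Tensor<float, 2> t(2, 3);
  t.setValues({{1, 2, 3},
               {4, 5, 6}});

  // Prefix sums over dimension 0:     Prefix products over dimension 1:
  // {{1, 2, 3},                       {{1, 2,  6},
  //  {5, 7, 9}}                        {4, 20, 120}}
  Eigen::Tensor<float, 2> sums  = t.cumsum(0);
  Eigen::Tensor<float, 2> prods = t.cumprod(1);

  std::cout << sums << "\n\n" << prods << std::endl;
}
```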
2020-05-05 00:19:43 +00:00
Rasmus Munk Larsen
ab773c7e91
Extend support for Packet16b:
...
* Add ptranspose<*,4> to support matmul and add unit test for Matrix<bool> * Matrix<bool>
* Work around a bug in slicing of Tensor<bool>.
* Add tensor tests.
This speeds up matmul for boolean matrices by about 10x.
name old time/op new time/op delta
BM_MatMul<bool>/8 267ns ± 0% 479ns ± 0% +79.25% (p=0.008 n=5+5)
BM_MatMul<bool>/32 6.42µs ± 0% 0.87µs ± 0% -86.50% (p=0.008 n=5+5)
BM_MatMul<bool>/64 43.3µs ± 0% 5.9µs ± 0% -86.42% (p=0.008 n=5+5)
BM_MatMul<bool>/128 315µs ± 0% 44µs ± 0% -85.98% (p=0.008 n=5+5)
BM_MatMul<bool>/256 2.41ms ± 0% 0.34ms ± 0% -85.68% (p=0.008 n=5+5)
BM_MatMul<bool>/512 18.8ms ± 0% 2.7ms ± 0% -85.53% (p=0.008 n=5+5)
BM_MatMul<bool>/1k 149ms ± 0% 22ms ± 0% -85.40% (p=0.008 n=5+5)
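A minimal sketch of the kind of expression that benefits (sizes arbitrary); with Packet16b the product kernel works on 16 bool values per packet, which is where the roughly 10x speedup comes from:
```
#include <Eigen/Dense>
#include <iostream>

int main() {
  using BoolMatrix = Eigen::Matrix<bool, Eigen::Dynamic, Eigen::Dynamic>;

  BoolMatrix a = BoolMatrix::Identity(128, 128);
  BoolMatrix b = BoolMatrix::Constant(128, 128, true);

  // Product of two bool matrices: element (i, j) is effectively true exactly
  // when a(i, k) && b(k, j) holds for some k.
  BoolMatrix c = a * b;

  std::cout << c(0, 0) << std::endl;  // 1 (true)
}
```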
2020-04-28 16:12:47 +00:00
Eugene Zhulenev
3c02fefec5
Add async evaluation support to TensorSlicingOp.
...
Device::memcpy is not async-safe and might lead to deadlocks. Always evaluate the slice expression in async mode.
2020-04-22 19:55:01 +00:00
Rasmus Munk Larsen
2f6ddaa25c
Add partial vectorization for matrices and tensors of bool. This speeds up boolean operations on Tensors by up to 25x.
...
Benchmark numbers for the logical and of two NxN tensors:
name old time/op new time/op delta
BM_booleanAnd_1T/3 [using 1 threads] 14.6ns ± 0% 14.4ns ± 0% -0.96%
BM_booleanAnd_1T/4 [using 1 threads] 20.5ns ±12% 9.0ns ± 0% -56.07%
BM_booleanAnd_1T/7 [using 1 threads] 41.7ns ± 0% 10.5ns ± 0% -74.87%
BM_booleanAnd_1T/8 [using 1 threads] 52.1ns ± 0% 10.1ns ± 0% -80.59%
BM_booleanAnd_1T/10 [using 1 threads] 76.3ns ± 0% 13.8ns ± 0% -81.87%
BM_booleanAnd_1T/15 [using 1 threads] 167ns ± 0% 16ns ± 0% -90.45%
BM_booleanAnd_1T/16 [using 1 threads] 188ns ± 0% 16ns ± 0% -91.57%
BM_booleanAnd_1T/31 [using 1 threads] 667ns ± 0% 34ns ± 0% -94.83%
BM_booleanAnd_1T/32 [using 1 threads] 710ns ± 0% 35ns ± 0% -95.01%
BM_booleanAnd_1T/64 [using 1 threads] 2.80µs ± 0% 0.11µs ± 0% -95.93%
BM_booleanAnd_1T/128 [using 1 threads] 11.2µs ± 0% 0.4µs ± 0% -96.11%
BM_booleanAnd_1T/256 [using 1 threads] 44.6µs ± 0% 2.5µs ± 0% -94.31%
BM_booleanAnd_1T/512 [using 1 threads] 178µs ± 0% 10µs ± 0% -94.35%
BM_booleanAnd_1T/1k [using 1 threads] 717µs ± 0% 78µs ± 1% -89.07%
BM_booleanAnd_1T/2k [using 1 threads] 2.87ms ± 0% 0.31ms ± 1% -89.08%
BM_booleanAnd_1T/4k [using 1 threads] 11.7ms ± 0% 1.9ms ± 4% -83.55%
BM_booleanAnd_1T/10k [using 1 threads] 70.3ms ± 0% 17.2ms ± 4% -75.48%
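A minimal sketch of the kind of expression this speeds up (size and values arbitrary):
```
#include <unsupported/Eigen/CXX11/Tensor>
#include <iostream>

int main() {
  const int n = 512;
  Eigen::Tensor<bool, 2> a(n, n), b(n, n);
  a.setConstant(true);
  b.setConstant(true);
  b(0, 0) = false;

  // Element-wise logical AND of two bool tensors; with partial vectorization
  // this runs packet-wide (e.g. 16 bools per SSE packet) instead of one
  // element at a time.
  Eigen::Tensor<bool, 2> c = a && b;

  std::cout << c(0, 0) << " " << c(0, 1) << std::endl;  // 0 1
}
```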
2020-04-20 20:16:28 +00:00
Changming Sun
b1aa07a8d3
Fix a bug in TensorIndexList.h
2020-04-13 18:22:03 +00:00
jangsoopark
39142904cc
Resolve C4346 when building Eigen on Windows
2020-04-08 14:55:39 +09:00
Aaron Franke
5c22c7a7de
Make file formatting comply with POSIX and Unix standards
...
UTF-8, LF, no BOM, and newlines at the end of files
2020-03-23 18:09:02 +00:00
Deven Desai
7158ed4e0e
Fixing HIP breakage caused by the recent commit that introduces Packet4h2 as the Eigen::Half packet type
2020-03-12 01:06:24 +00:00
Sami Kama
b733b8b680
remove duplicate pset1 for half and add some comments about why we need to expose pmul/add/div/min/max on the host
2020-03-10 20:28:43 +00:00
Cédric Hubert
98bfc5aaa8
Update MarketIO.h
2020-02-28 12:41:51 +00:00
Ilya Tokar
eb6cc29583
Avoid a division in NonBlockingThreadPool::Steal.
...
Looking at profiles, we spend ~10-20% of Steal simply computing
random % size. We can reduce a random 32-bit int into the [0, size) range with
a single multiplication and shift. This transformation is described in
https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/
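A sketch of the multiply-and-shift reduction (the helper name is illustrative, not Eigen's internal code):
```
#include <cstdint>

// Map a uniformly random 32-bit value r into [0, size) without a division:
// take the high 32 bits of the 64-bit product r * size. The result is always
// less than size and stays close to uniform when r is uniform.
inline uint32_t fast_range(uint32_t r, uint32_t size) {
  return static_cast<uint32_t>((static_cast<uint64_t>(r) * size) >> 32);
}
```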
2020-02-14 16:02:57 -05:00
Eugene Zhulenev
f584bd9b30
Fail at compile time if default executor tries to use non-default device
2020-02-06 22:43:24 +00:00
Eugene Zhulenev
3fda850c46
Remove dead code from TensorReduction.h
2020-01-29 18:45:31 +00:00
Jeff Daily
b5df8cabd7
fix hip-clang compilation due to new HIP scalar accessor
2020-01-20 21:08:52 +00:00
Deven Desai
6d284bb1b7
Fix for HIP breakage - 200115. Adding a missing EIGEN_DEVICE_FUNC attr
2020-01-16 00:51:43 +00:00
Srinivas Vasudevan
f6c6de5d63
Ensure Igamma does not return NaN or Inf for large values.
2020-01-14 21:32:48 +00:00
Eugene Zhulenev
b9362fb8f7
Convert StridedLinearBufferCopy::Kind to enum class
2020-01-13 11:43:24 -08:00
Srinivas Vasudevan
2e099e8d8f
Added special_packetmath test and tweaked bounds on tests.
...
Refactor shared packetmath code to header file.
(Squashed from PR !38 )
2020-01-11 10:31:21 +00:00
Matthew Powelson
2ea5a715cf
Properly initialize b vector in SplineFitting
...
InterpolateWithDerivative does not initialize the b vector correctly. This issue is discussed in Stack Overflow question 48382939.
2020-01-09 21:29:04 +00:00
Ilya Tokar
19876ced76
Bug #1785: Introduce numext::rint.
...
This provides a new op that matches std::rint and the previous behavior of
pround. Also adds the corresponding unsupported/../Tensor op.
Performance is the same as e.g. floor (tested SSE/AVX).
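The behavioral difference shows up only in halfway cases; a small example using the standard-library counterparts that numext::rint follows:
```
#include <cmath>
#include <cstdio>

int main() {
  // std::rint rounds halfway cases with the current rounding mode
  // (round-to-nearest-even by default); std::round always rounds them away
  // from zero. numext::rint matches std::rint.
  std::printf("%.1f %.1f\n", std::rint(2.5), std::round(2.5));    // 2.0 3.0
  std::printf("%.1f %.1f\n", std::rint(-2.5), std::round(-2.5));  // -2.0 -3.0
}
```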
2020-01-07 21:22:44 +00:00
mehdi-goli
d0ae052da4
[SYCL Backend]
...
* Adding missing operations for vector comparison in SYCL; their absence caused a compiler error for vector comparison when compiling for SYCL.
* Fixing the compiler error for placement new in TensorForcedEval.h, which occurred when compiling the SYCL backend.
* Reducing SYCL warnings by removing the abort function inside the kernel.
* Adding strong inline to functions inside the SYCL interop.
2020-01-07 15:13:37 +00:00
Deven Desai
636e2bb3fa
Fix for HIP breakage - 191220
...
The breakage was introduced by the following commit:
ae07801dd8
After the commit, HIPCC errors out on some tests with the following error
```
Building HIPCC object unsupported/test/CMakeFiles/cxx11_tensor_device_1.dir/cxx11_tensor_device_1_generated_cxx11_tensor_device.cu.o
In file included from /home/rocm-user/eigen/unsupported/test/cxx11_tensor_device.cu:17:
In file included from /home/rocm-user/eigen/unsupported/Eigen/CXX11/Tensor:100:
/home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:129:12: error: no matching constructor for initialization of 'Eigen::internal::TensorBlockResourceRequirements'
return {merge(lhs.shape_type, rhs.shape_type), // shape_type
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:75:8: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 3 were provided
struct TensorBlockResourceRequirements {
^
/home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:75:8: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 3 were provided
/home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:75:8: note: candidate constructor (the implicit copy constructor) not viable: requires 5 arguments, but 3 were provided
/home/rocm-user/eigen/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h:75:8: note: candidate constructor (the implicit default constructor) not viable: requires 0 arguments, but 3 were provided
...
...
```
The fix is to explicitly declare the (implicitly called) constructor as a device func.
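A minimal sketch of this kind of fix (the struct and helper below are hypothetical stand-ins, not the actual Eigen code): giving the constructor that brace-initialization resolves to the device attribute makes it callable from device code.
```
#include <Eigen/Core>  // provides EIGEN_DEVICE_FUNC (empty in host-only builds)

// Hypothetical stand-in for TensorBlockResourceRequirements.
struct ResourceRequirementsLike {
  int shape_type;
  long size;

  // With the device attribute, `return {a, b};` inside a device function no
  // longer trips "no matching constructor" under hipcc.
  EIGEN_DEVICE_FUNC ResourceRequirementsLike(int shape, long sz)
      : shape_type(shape), size(sz) {}
};

EIGEN_DEVICE_FUNC ResourceRequirementsLike merge_like(int shape, long sz) {
  return {shape, sz};
}

int main() { return merge_like(0, 64).shape_type; }
```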
2019-12-20 21:28:00 +00:00
Christoph Hertzberg
1e9664b147
Bug #1796: Make matrix square root usable for Map and Ref types
2019-12-20 18:10:22 +01:00
Christoph Hertzberg
d86544d654
Reduce code duplication and avoid confusing Doxygen
2019-12-19 19:48:39 +01:00
Christoph Hertzberg
dde279f57d
Hide recursive meta templates from Doxygen
2019-12-19 19:47:23 +01:00
Christoph Hertzberg
c21771ac04
Use double-brace initialization (as everywhere else in the test-suite).
2019-12-19 19:20:48 +01:00
Christoph Hertzberg
a3273aeff8
Fix trivial shadow warning
2019-12-19 19:13:11 +01:00
Eugene Zhulenev
7a65219a2e
Fix TensorPadding bug in squeezed reads from inner dimension
2019-12-19 05:43:57 +00:00
Eugene Zhulenev
73e55525e5
Return const data pointer from TensorRef evaluator.data()
2019-12-18 23:19:36 +00:00
Eugene Zhulenev
ae07801dd8
Tensor block evaluation cost model
2019-12-18 20:07:00 +00:00
Jeff Daily
de07c4d1c2
fix compilation due to new HIP scalar accessor
2019-12-17 20:27:30 +00:00
Eugene Zhulenev
788bef6ab5
Reduce block evaluation overhead for small tensor expressions
2019-12-17 19:06:14 +00:00
Eugene Zhulenev
381f8f3139
Initialize non-trivially constructible types when allocating a temp buffer.
2019-12-12 01:31:30 +00:00
Eugene Zhulenev
64272c7f40
Squeeze reads from two inner dimensions in TensorPadding
2019-12-11 16:54:51 -08:00
Eugene Zhulenev
963ba1015b
Add back accidentally deleted default constructor to TensorExecutorTilingContext.
2019-12-11 18:47:55 +00:00
Eugene Zhulenev
c9220c035f
Remove block memory allocation required by removed block evaluation API
2019-12-10 17:15:55 -08:00
Eugene Zhulenev
1c879eb010
Remove V2 suffix from TensorBlock
2019-12-10 15:40:23 -08:00
Eugene Zhulenev
dbca11e880
Remove TensorBlock.h and old TensorBlock/BlockMapper
2019-12-10 14:31:44 -08:00
Deven Desai
c49f0d851a
Fix for HIP breakage detected on 191210
...
The following commit introduces compile errors when compiling Eigen with hipcc:
2918f85ba9
hipcc errors out because it requires the device attribute on the methods within the TensorBlockV2ResourceRequirements struct introduced by the commit above. The fix is to add the device attribute to those methods.
2019-12-10 22:14:05 +00:00
Eugene Zhulenev
2918f85ba9
Do not use std::vector in getResourceRequirements
2019-12-09 16:19:55 -08:00
Artem Belevich
8056a05b54
Undo the block size change.
...
.z *is* used by the EigenContractionKernelInternal().
2019-12-09 11:10:29 -08:00
Eugene Zhulenev
dbb703d44e
Add async evaluation support to TensorSelectOp
2019-12-09 18:36:13 +00:00
Janek Kozicki
11d6465326
fix AlignedVector3's inconsistent interface with other Vector classes: default constructor and operator- were missing.
2019-12-06 21:07:39 +01:00
Eugene Zhulenev
bb7ccac3af
Add recursive work splitting to EvalShardedByInnerDimContext
2019-12-05 14:51:49 -08:00
Artem Belevich
25230d1862
Improve performance of contraction kernels
...
* Force-inline implementations. They pass around pointers to shared memory
blocks. Without inlining, the compiler must operate via generic pointers.
Inlining allows the compiler to detect that we're operating on shared memory,
which allows generation of substantially faster code.
* Fixed a long-standing typo which resulted in launching 8x more kernels
than we needed (the .z dimension of the block is unused by the kernel).
2019-12-05 12:48:34 -08:00
Rasmus Munk Larsen
366cf005b0
Add missing initialization in cxx11_tensor_trace.cpp.
2019-12-04 23:56:37 +00:00
Eugene Zhulenev
8f4536e852
Capture TensorMap by value inside tensor expression AST
2019-12-03 16:39:05 -08:00