eigen

mirror of https://gitlab.com/libeigen/eigen.git synced 2024-12-27 07:29:52 +08:00

Author	SHA1	Message	Date
Eugene Zhulenev	02431cbe71	TensorBroadcasting support for random/uniform blocks	2019-10-16 13:26:28 -07:00
Eugene Zhulenev	d380c23b2c	Block evaluation for TensorGenerator/TensorReverse/TensorShuffling	2019-10-14 14:31:59 -07:00
Gael Guennebaud	39fb9eeccf	bug #1747 : fix compilation with MSVC	2019-10-14 22:50:23 +02:00
Eugene Zhulenev	a411e9f344	Block evaluation for TensorGenerator + TensorReverse + fixed bug in tensor reverse op	2019-10-10 10:56:58 -07:00
Eugene Zhulenev	33e1746139	Block evaluation for TensorChipping + fixed bugs in TensorPadding and TensorSlicing	2019-10-09 12:45:31 -07:00
Gael Guennebaud	f0a4642bab	Implement c++03 compatible fix for changeset `7a43af1a33`	2019-10-09 16:00:57 +02:00
Eugene Zhulenev	f74ab8cb8d	Add block evaluation to TensorEvalTo and fix few small bugs	2019-10-07 15:34:26 -07:00
Brian Zhao	3afb640b56	Fixing incorrect size in Tensor documentation.	2019-10-04 21:30:35 -07:00
Rasmus Munk Larsen	20c4a9118f	Use "pdiv" rather than operator/ to support packet types.	2019-10-04 16:54:03 -07:00
Eugene Zhulenev	98bdd7252e	Fix compilation warnings and errors with clang in TensorBlockV2 code and tests	2019-10-04 10:15:33 -07:00
Eugene Zhulenev	60ae24ee1a	Add block evaluation to TensorReshaping/TensorCasting/TensorPadding/TensorSelect	2019-10-02 12:44:06 -07:00
Eugene Zhulenev	6e40454a6e	Add beta to TensorContractionKernel and make memset optional	2019-10-02 11:06:02 -07:00
Rasmus Munk Larsen	13ef08e5ac	Move implementation of vectorized error function erf() to SpecialFunctionsImpl.h.	2019-09-27 13:56:04 -07:00
Eugene Zhulenev	71d5bedf72	Fix compilation warnings and errors with clang in TensorBlockV2	2019-09-25 11:25:22 -07:00
Deven Desai	5e186b1987	Fix for the HIP build+test errors. The errors were introduced by this commit : `d38e6fbc27` After the above mentioned commit, some of the tests started failing with the following error ``` Building HIPCC object unsupported/test/CMakeFiles/cxx11_tensor_reduction_gpu_5.dir/cxx11_tensor_reduction_gpu_5_generated_cxx11_tensor_reduction_gpu.cu.o In file included from /home/rocm-user/eigen/unsupported/test/cxx11_tensor_reduction_gpu.cu:16: In file included from /home/rocm-user/eigen/unsupported/Eigen/CXX11/Tensor:29: In file included from /home/rocm-user/eigen/unsupported/Eigen/CXX11/../SpecialFunctions:70: /home/rocm-user/eigen/unsupported/Eigen/CXX11/../src/SpecialFunctions/SpecialFunctionsHalf.h:28:22: error: call to 'erf' is ambiguous return Eigen::half(Eigen::numext::erf(static_cast<float>(a))); ^~~~~~~~~~~~~~~~~~ /home/rocm-user/eigen/unsupported/test/../../Eigen/src/Core/MathFunctions.h:1600:7: note: candidate function [with T = float] float erf(const float &x) { return ::erff(x); } ^ /home/rocm-user/eigen/unsupported/Eigen/CXX11/../src/SpecialFunctions/SpecialFunctionsImpl.h:1897:5: note: candidate function [with Scalar = float] erf(const Scalar& x) { ^ In file included from /home/rocm-user/eigen/unsupported/test/cxx11_tensor_reduction_gpu.cu:16: In file included from /home/rocm-user/eigen/unsupported/Eigen/CXX11/Tensor:29: In file included from /home/rocm-user/eigen/unsupported/Eigen/CXX11/../SpecialFunctions:75: /home/rocm-user/eigen/unsupported/Eigen/CXX11/../src/SpecialFunctions/arch/GPU/GpuSpecialFunctions.h:87:23: error: call to 'erf' is ambiguous return make_double2(erf(a.x), erf(a.y)); ^~~ /home/rocm-user/eigen/unsupported/test/../../Eigen/src/Core/MathFunctions.h:1603:8: note: candidate function [with T = double] double erf(const double &x) { return ::erf(x); } ^ /home/rocm-user/eigen/unsupported/Eigen/CXX11/../src/SpecialFunctions/SpecialFunctionsImpl.h:1897:5: note: candidate function [with Scalar = double] erf(const Scalar& x) { ^ In file included from /home/rocm-user/eigen/unsupported/test/cxx11_tensor_reduction_gpu.cu:16: In file included from /home/rocm-user/eigen/unsupported/Eigen/CXX11/Tensor:29: In file included from /home/rocm-user/eigen/unsupported/Eigen/CXX11/../SpecialFunctions:75: /home/rocm-user/eigen/unsupported/Eigen/CXX11/../src/SpecialFunctions/arch/GPU/GpuSpecialFunctions.h:87:33: error: call to 'erf' is ambiguous return make_double2(erf(a.x), erf(a.y)); ^~~ /home/rocm-user/eigen/unsupported/test/../../Eigen/src/Core/MathFunctions.h:1603:8: note: candidate function [with T = double] double erf(const double &x) { return ::erf(x); } ^ /home/rocm-user/eigen/unsupported/Eigen/CXX11/../src/SpecialFunctions/SpecialFunctionsImpl.h:1897:5: note: candidate function [with Scalar = double] erf(const Scalar& x) { ^ 3 errors generated. ``` This PR fixes the compile error by removing the "old" implementation for "erf" (assuming that the "new" implementation is what we want going forward. from a GPU point-of-view both implementations are the same). This PR also fixes what seems like a cut-n-paste error in the aforementioned commit	2019-09-25 15:39:13 +00:00
Eugene Zhulenev	f35b9ab510	Fix a bug in a packed block type in TensorContractionThreadPool	2019-09-24 16:54:36 -07:00
Rasmus Larsen	d38e6fbc27	Merged in rmlarsen/eigen (pull request PR-704) Add generic PacketMath implementation of the Error Function (erf).	2019-09-24 23:40:29 +00:00
Rasmus Munk Larsen	591a554c68	Add TODO to cleanup FMA cost modelling.	2019-09-24 16:39:25 -07:00
Eugene Zhulenev	c64396b4c6	Choose TensorBlock StridedLinearCopy type statically	2019-09-24 16:04:29 -07:00
Eugene Zhulenev	c97b208468	Add new TensorBlock api implementation + tests	2019-09-24 15:17:35 -07:00
Eugene Zhulenev	ef9dfee7bd	Tensor block evaluation V2 support for unary/binary/broadcsting	2019-09-24 12:52:45 -07:00
Christoph Hertzberg	e4c1b3c1d2	Fix implicit conversion warnings and use pnegate to negate packets	2019-09-23 16:07:43 +02:00
Christoph Hertzberg	ba0736fa8e	Fix (or mask away) conversion warnings introduced in `553caeb6a3` .	2019-09-23 15:58:05 +02:00
Rasmus Munk Larsen	1d5af0693c	Add support for asynchronous evaluation of tensor casting expressions.	2019-09-19 13:54:49 -07:00
Rasmus Munk Larsen	6de5ed08d8	Add generic PacketMath implementation of the Error Function (erf).	2019-09-19 12:48:30 -07:00
Srinivas Vasudevan	df0816b71f	Merging eigen/eigen.	2019-09-16 19:33:29 -04:00
Srinivas Vasudevan	6e215cf109	Add Bessel functions to SpecialFunctions. - Split SpecialFunctions files in to a separate BesselFunctions file. In particular add: - Modified bessel functions of the second kind k0, k1, k0e, k1e - Bessel functions of the first kind j0, j1 - Bessel functions of the second kind y0, y1	2019-09-14 12:16:47 -04:00
Eugene Zhulenev	bf8866b466	Fix maybe-unitialized warnings in TensorContractionThreadPool	2019-09-13 14:29:55 -07:00
Eugene Zhulenev	553caeb6a3	Use ThreadLocal container in TensorContractionThreadPool	2019-09-13 12:14:44 -07:00
Srinivas Vasudevan	facdec5aa7	Add packetized versions of i0e and i1e special functions. - In particular refactor the i0e and i1e code so scalar and vectorized path share code. - Move chebevl to GenericPacketMathFunctions. A brief benchmark with building Eigen with FMA, AVX and AVX2 flags Before: CPU: Intel Haswell with HyperThreading (6 cores) Benchmark Time(ns) CPU(ns) Iterations ----------------------------------------------------------------- BM_eigen_i0e_double/1 57.3 57.3 10000000 BM_eigen_i0e_double/8 398 398 1748554 BM_eigen_i0e_double/64 3184 3184 218961 BM_eigen_i0e_double/512 25579 25579 27330 BM_eigen_i0e_double/4k 205043 205042 3418 BM_eigen_i0e_double/32k 1646038 1646176 422 BM_eigen_i0e_double/256k 13180959 13182613 53 BM_eigen_i0e_double/1M 52684617 52706132 10 BM_eigen_i0e_float/1 28.4 28.4 24636711 BM_eigen_i0e_float/8 75.7 75.7 9207634 BM_eigen_i0e_float/64 512 512 1000000 BM_eigen_i0e_float/512 4194 4194 166359 BM_eigen_i0e_float/4k 32756 32761 21373 BM_eigen_i0e_float/32k 261133 261153 2678 BM_eigen_i0e_float/256k 2087938 2088231 333 BM_eigen_i0e_float/1M 8380409 8381234 84 BM_eigen_i1e_double/1 56.3 56.3 10000000 BM_eigen_i1e_double/8 397 397 1772376 BM_eigen_i1e_double/64 3114 3115 223881 BM_eigen_i1e_double/512 25358 25361 27761 BM_eigen_i1e_double/4k 203543 203593 3462 BM_eigen_i1e_double/32k 1613649 1613803 428 BM_eigen_i1e_double/256k 12910625 12910374 54 BM_eigen_i1e_double/1M 51723824 51723991 10 BM_eigen_i1e_float/1 28.3 28.3 24683049 BM_eigen_i1e_float/8 74.8 74.9 9366216 BM_eigen_i1e_float/64 505 505 1000000 BM_eigen_i1e_float/512 4068 4068 171690 BM_eigen_i1e_float/4k 31803 31806 21948 BM_eigen_i1e_float/32k 253637 253692 2763 BM_eigen_i1e_float/256k 2019711 2019918 346 BM_eigen_i1e_float/1M 8238681 8238713 86 After: CPU: Intel Haswell with HyperThreading (6 cores) Benchmark Time(ns) CPU(ns) Iterations ----------------------------------------------------------------- BM_eigen_i0e_double/1 15.8 15.8 44097476 BM_eigen_i0e_double/8 99.3 99.3 7014884 BM_eigen_i0e_double/64 777 777 886612 BM_eigen_i0e_double/512 6180 6181 100000 BM_eigen_i0e_double/4k 48136 48140 14678 BM_eigen_i0e_double/32k 385936 385943 1801 BM_eigen_i0e_double/256k 3293324 3293551 228 BM_eigen_i0e_double/1M 12423600 12424458 57 BM_eigen_i0e_float/1 16.3 16.3 43038042 BM_eigen_i0e_float/8 30.1 30.1 23456931 BM_eigen_i0e_float/64 169 169 4132875 BM_eigen_i0e_float/512 1338 1339 516860 BM_eigen_i0e_float/4k 10191 10191 68513 BM_eigen_i0e_float/32k 81338 81337 8531 BM_eigen_i0e_float/256k 651807 651984 1000 BM_eigen_i0e_float/1M 2633821 2634187 268 BM_eigen_i1e_double/1 16.2 16.2 42352499 BM_eigen_i1e_double/8 110 110 6316524 BM_eigen_i1e_double/64 822 822 851065 BM_eigen_i1e_double/512 6480 6481 100000 BM_eigen_i1e_double/4k 51843 51843 10000 BM_eigen_i1e_double/32k 414854 414852 1680 BM_eigen_i1e_double/256k 3320001 3320568 212 BM_eigen_i1e_double/1M 13442795 13442391 53 BM_eigen_i1e_float/1 17.6 17.6 41025735 BM_eigen_i1e_float/8 35.5 35.5 19597891 BM_eigen_i1e_float/64 240 240 2924237 BM_eigen_i1e_float/512 1424 1424 485953 BM_eigen_i1e_float/4k 10722 10723 65162 BM_eigen_i1e_float/32k 86286 86297 8048 BM_eigen_i1e_float/256k 691821 691868 1000 BM_eigen_i1e_float/1M 2777336 2777747 256 This shows anywhere from a 50% to 75% improvement on these operations. I've also benchmarked without any of these flags turned on, and got similar performance to before (if not better). Also tested packetmath.cpp + special_functions to ensure no regressions.	2019-09-11 18:34:02 -07:00
Deven Desai	cdb377d0cb	Fix for the HIP build+test errors introduced by the ndtri support. The fixes needed are * adding EIGEN_DEVICE_FUNC attribute to a couple of funcs (else HIPCC will error out when non-device funcs are called from global/device funcs) * switching to using ::<math_func> instead std::<math_func> (only for HIPCC) in cases where the std::<math_func> is not recognized as a device func by HIPCC * removing an errant "j" from a testcase (don't know how that made it in to begin with!)	2019-09-06 16:03:49 +00:00
Eugene Zhulenev	d918bd9a8b	Update ThreadLocal to use separate Initialize/Release callables	2019-09-10 16:13:32 -07:00
Eugene Zhulenev	e3dec4dcc1	ThreadLocal container that does not rely on thread local storage	2019-09-09 15:18:14 -07:00
Srinivas Vasudevan	e38dd48a27	PR 681: Add ndtri function, the inverse of the normal distribution function.	2019-08-12 19:26:29 -04:00
Eugene Zhulenev	47fefa235f	Allow move-only done callback in TensorAsyncDevice	2019-09-03 17:20:56 -07:00
Eugene Zhulenev	f68f2bba09	TensorMap constness should not change underlying storage constness	2019-09-03 11:08:09 -07:00
Alberto Luaces	c694be1214	Fixed Tensor documentation formatting.	2019-07-23 09:24:06 +00:00
Eugene Zhulenev	79c402e40e	Fix shadow warnings in TensorContractionThreadPool	2019-08-30 15:38:31 -07:00
Eugene Zhulenev	edf2ec28d8	Fix block mapper type name in TensorExecutor	2019-08-30 15:29:25 -07:00
Eugene Zhulenev	f0b36fb9a4	evalSubExprsIfNeededAsync + async TensorContractionThreadPool	2019-08-30 15:13:38 -07:00
Eugene Zhulenev	619cea9491	Revert accidentally removed <memory> header from ThreadPool	2019-08-30 14:51:17 -07:00
Eugene Zhulenev	66665e7e76	Asynchronous expression evaluation with TensorAsyncDevice	2019-08-30 14:49:40 -07:00
Eugene Zhulenev	bc40d4522c	Const correctness in TensorMap<const Tensor<T, ...>> expressions	2019-08-28 17:46:05 -07:00
Eugene Zhulenev	6e77f9bef3	Remove shadow warnings in TensorDeviceThreadPool	2019-08-28 10:32:19 -07:00
Rasmus Larsen	84fefdf321	Merged in ezhulenev/eigen-01 (pull request PR-683) Asynchronous parallelFor in Eigen ThreadPoolDevice	2019-08-26 21:49:17 +00:00
maratek	8b5ab0e4dd	Fix get_random_seed on Native Client Newlib in Native Client SDK does not provide ::random function. Implement get_random_seed for NaCl using ::rand, similarly to Windows version.	2019-08-23 15:25:56 -07:00
Eugene Zhulenev	6901788013	Asynchronous parallelFor in Eigen ThreadPoolDevice	2019-08-22 10:50:51 -07:00
Eugene Zhulenev	071311821e	Remove XSMM support from Tensor module	2019-08-19 11:44:25 -07:00
Rasmus Munk Larsen	eab7e52db2	[Eigen] Vectorize evaluation of coefficient-wise functions over tensor blocks if the strides are known to be 1. Provides up to 20-25% speedup of the TF cross entropy op with AVX. A few benchmark numbers: name old time/op new time/op delta BM_Xent_16_10000_cpu 448µs ± 3% 389µs ± 2% -13.21% (p=0.008 n=5+5) BM_Xent_32_10000_cpu 575µs ± 6% 454µs ± 3% -21.00% (p=0.008 n=5+5) BM_Xent_64_10000_cpu 933µs ± 4% 712µs ± 1% -23.71% (p=0.008 n=5+5)	2019-08-07 12:57:42 -07:00
Rasmus Munk Larsen	0987126165	Clean up unnecessary namespace specifiers in TensorBlock.h.	2019-08-07 12:12:52 -07:00
Rasmus Munk Larsen	e2999d4c38	Fix performance regressions due to https://bitbucket.org/eigen/eigen/pull-requests/662 . The change caused the device struct to be copied for each expression evaluation, and caused, e.g., a 10% regression in the TensorFlow multinomial op on GPU: Benchmark Time(ns) CPU(ns) Iterations ---------------------------------------------------------------------- BM_Multinomial_gpu_1_100000_4 128173 231326 2922 1.610G items/s VS Benchmark Time(ns) CPU(ns) Iterations ---------------------------------------------------------------------- BM_Multinomial_gpu_1_100000_4 146683 246914 2719 1.509G items/s	2019-08-02 11:18:13 -07:00
Eugene Zhulenev	3cd148f983	Fix expression evaluation heuristic for TensorSliceOp	2019-07-09 12:10:26 -07:00
Eugene Zhulenev	6083014594	Add outer/inner chipping optimization for chipping dimension specified at runtime	2019-07-03 11:35:25 -07:00
Deven Desai	7eb2e0a95b	adding the EIGEN_DEVICE_FUNC attribute to the constCast routine. Not having this attribute results in the following failures in the `--config=rocm` TF build. ``` In file included from tensorflow/core/kernels/cross_op_gpu.cu.cc:20: In file included from ./tensorflow/core/framework/register_types.h:20: In file included from ./tensorflow/core/framework/numeric_types.h:20: In file included from ./third_party/eigen3/unsupported/Eigen/CXX11/Tensor:1: In file included from external/eigen_archive/unsupported/Eigen/CXX11/Tensor:140: external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorChipping.h:356:37: error: 'Eigen::constCast': no overloaded function has restriction specifiers that are compatible with the ambient context 'data' typename Storage::Type result = constCast(m_impl.data()); ^ external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorChipping.h:356:37: error: 'Eigen::constCast': no overloaded function has restriction specifiers that are compatible with the ambient context 'data' external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorAssign.h:148:56: note: in instantiation of member function 'Eigen::TensorEvaluator<const Eigen::TensorChippingOp<1, Eigen::TensorMap<Eigen::Tensor<int, 2, 1, long>, 16, MakePointer> >, Eigen::Gpu\ Device>::data' requested here return m_rightImpl.evalSubExprsIfNeeded(m_leftImpl.data()); ``` Adding the EIGEN_DEVICE_FUNC attribute resolves those errors	2019-07-02 20:02:46 +00:00
Gael Guennebaud	ef8aca6a89	Merged in codeplaysoftware/eigen (pull request PR-667) [SYCL] : Approved-by: Gael Guennebaud <g.gael@free.fr> Approved-by: Rasmus Larsen <rmlarsen@google.com>	2019-07-02 12:45:23 +00:00
Eugene Zhulenev	4ac93f8edc	Allocate non-const scalar buffer for block evaluation with DefaultDevice	2019-07-01 10:55:19 -07:00
Mehdi Goli	9ea490c82c	[SYCL] : * Modifying TensorDeviceSYCL to use `EIGEN_THROW_X`. * Modifying TensorMacro to use `EIGEN_TRY/CATCH(X)` macro. * Modifying TensorReverse.h to use `EIGEN_DEVICE_REF` instead of `&`. * Fixing the SYCL device macro in SpecialFunctionsImpl.h.	2019-07-01 16:27:28 +01:00
Eugene Zhulenev	81a03bec75	Fix TensorReverse on GPU with m_stride[i]==0	2019-06-28 15:50:39 -07:00
Rasmus Munk Larsen	74a9dd1102	Fix preprocessor condition to only generate a warning when calling eigen::GpuDevice::synchronize() from device code, but not when calling from a non-GPU compilation unit.	2019-06-28 11:56:21 -07:00
Rasmus Munk Larsen	70d4020ad9	Remove comma causing warning in c++03 mode.	2019-06-28 11:39:45 -07:00
Eugene Zhulenev	6e7c76481a	Merge with Eigen head	2019-06-28 11:22:46 -07:00
Eugene Zhulenev	878845cb25	Add block access to TensorReverseOp and make sure that TensorForcedEval uses block access when preferred	2019-06-28 11:13:44 -07:00
Mehdi Goli	7d08fa805a	[SYCL] This PR adds the minimum modifications to the Eigen unsupported module required to run it on devices supporting SYCL. * Abstracting the pointer type so that both SYCL memory and pointer can be captured. * Converting SYCL virtual pointer to SYCL device memory in Eigen evaluator class. * Binding SYCL placeholder accessor to command group handler by using bind method in Eigen evaluator node. * Adding SYCL macro for controlling loop unrolling. * Modifying the TensorDeviceSycl.h and SYCL executor method to adopt the above changes.	2019-06-28 10:08:23 +01:00
Christoph Hertzberg	adec097c61	Remove extra comma (causes warnings in C++03)	2019-06-26 16:14:28 +02:00
Eugene Zhulenev	229db81572	Optimize evaluation strategy for TensorSlicingOp and TensorChippingOp	2019-06-25 15:41:37 -07:00
Rasmus Munk Larsen	b08527b0c1	Clean up CUDA/NVCC version macros and their use in Eigen, and a few other CUDA build failures.	2019-05-31 15:26:06 -07:00
Michael Tesch	c5019f722b	Use pade for matrix exponential also for complex values.	2019-05-08 17:04:55 +02:00
Rasmus Larsen	e92486b8c3	Merged in rmlarsen/eigen (pull request PR-643) Make Eigen build with cuda 10 and clang. Approved-by: Justin Lebar <justin.lebar@gmail.com>	2019-05-20 17:02:39 +00:00
Eugene Zhulenev	01654d97fa	Prevent potential division by zero in TensorExecutor	2019-05-17 14:02:25 -07:00
Eugene Zhulenev	96a276803c	Always evaluate Tensor expressions with broadcasting via tiled evaluation code path	2019-05-16 16:15:45 -07:00
Rasmus Munk Larsen	ab0a30e429	Make Eigen build with cuda 10 and clang.	2019-05-15 13:32:15 -07:00
Rasmus Munk Larsen	e5ac8cbd7a	A) fix deadlocks in thread pool caused by EventCount This fixed 2 deadlocks caused by sloppiness in the EventCount logic. Both most likely were introduced by cl/236729920 which includes the new EventCount algorithm: `01da8caf00` bug #1 (Prewait): Prewait must not consume existing signals. Consider the following scenario. There are 2 thread pool threads (1 and 2) and 1 external thread (3). RunQueue is empty. Thread 1 checks the queue, calls Prewait, checks RunQueue again and now is going to call CommitWait. Thread 2 checks the queue and now is going to call Prewait. Thread 3 submits 2 tasks, EventCount signals is set to 1 because only 1 waiter is registered the second signal is discarded). Now thread 2 resumes and calls Prewait and takes away the signal. Thread 1 resumes and calls CommitWait, there are no pending signals anymore, so it blocks. As the result we have 2 tasks, but only 1 thread is running. bug #2 (CancelWait): CancelWait must not take away a signal if it's not sure that the signal was meant for this thread. When one thread blocks and another submits a new task concurrently, the EventCount protocol guarantees only the following properties (similar to the Dekker's algorithm): (a) the registered waiter notices presence of the new task and does not block (b) the signaler notices presence of the waiters and wakes it (c) both the waiter notices presence of the new task and signaler notices presence of the waiter [it's only that both of them do not notice each other must not be possible, because it would lead to a deadlock] CancelWait is called for cases (a) and (c). For case (c) it is OK to take the notification signal away, but it's not OK for (a) because nobody queued a signals for us and we take away a signal meant for somebody else. Consider: Thread 1 calls Prewait, checks RunQueue, it's empty, now it's going to call CommitWait. Thread 3 submits 2 tasks, EventCount signals is set to 1 because only 1 waiter is registered the second signal is discarded). Thread 2 calls Prewait, checks RunQueue, discovers the tasks, calls CancelWait and consumes the pending signal (meant for thread 1). Now Thread 1 resumes and calls CommitWait, since there are no signals it blocks. As the result we have 2 tasks, but only 1 thread is running. Both deadlocks are only a problem if the tasks require parallelism. Most computational tasks do not require parallelism, i.e. a single thread will run task 1, finish it and then dequeue and run task 2. This fix undoes some of the sloppiness in the EventCount that was meant to reduce CPU consumption by idle threads, because we now have more threads running in these corner cases. But we still don't have pthread_yield's and maybe the strictness introduced by this change will actually help to reduce tail latency because we will have threads running when we actually need them running. B) fix deadlock in thread pool caused by RunQueue This fixed a deadlock caused by sloppiness in the RunQueue logic. Most likely this was introduced with the non-blocking thread pool. The deadlock only affects workloads that require parallelism. Most computational tasks don't require parallelism. PopBack must not fail spuriously. If it does, it can effectively lead to single thread consuming several wake up signals. Consider 2 worker threads are blocked. External thread submits a task. One of the threads is woken. It tries to steal the task, but fails due to a spurious failure in PopBack (external thread submits another task and holds the lock). The thread executes blocking protocol again (it won't block because NonEmptyQueueIndex is precise and the thread will discover pending work, but it has called PrepareWait). Now external thread submits another task and signals EventCount again. The signal is consumed by the first thread again. But now we have 2 tasks pending but only 1 worker thread running. It may be possible to fix this in a different way: make EventCount::CancelWait forward wakeup signal to a blocked thread rather then consuming it. But this looks more complex and I am not 100% that it will fix the bug. It's also possible to have 2 versions of PopBack: one will do try_to_lock and another won't. Then worker threads could first opportunistically check all queues with try_to_lock, and only use the blocking version before blocking. But let's first fix the bug with the simpler change.	2019-05-08 10:16:46 -07:00
Christoph Hertzberg	e6667a7060	Fix stupid shadow-warnings (with old clang versions)	2019-05-07 18:32:19 +02:00
Christoph Hertzberg	e54dc24d62	Restore C++03 compatibility	2019-05-07 18:30:44 +02:00
Rasmus Larsen	ac50afaffa	Merged in ezhulenev/eigen-01 (pull request PR-633) Check if gpu_assert was overridden in TensorGpuHipCudaDefines	2019-04-29 16:29:35 +00:00
Eugene Zhulenev	01d7e6ee9b	Check if gpu_assert was overridden in TensorGpuHipCudaDefines	2019-04-25 11:19:17 -07:00
Eugene Zhulenev	8ead5bb3d8	Fix doxygen warnings to enable statis code analysis	2019-04-24 12:42:28 -07:00
Rasmus Munk Larsen	144ca33321	Remove deprecation annotation from typedef Eigen::Index Index, as it would generate too many build warnings.	2019-04-24 08:50:07 -07:00
Eugene Zhulenev	a7b7f3ca8a	Add missing EIGEN_DEPRECATED annotations to deprecated functions and fix few other doxygen warnings	2019-04-23 17:23:19 -07:00
Anuj Rawat	8c7a6feb8e	Adding lowlevel APIs for optimized RHS packet load in TensorFlow SpatialConvolution Low-level APIs are added in order to optimized packet load in gemm_pack_rhs in TensorFlow SpatialConvolution. The optimization is for scenario when a packet is split across 2 adjacent columns. In this case we read it as two 'partial' packets and then merge these into 1. Currently this only works for Packet16f (AVX512) and Packet8f (AVX2). We plan to add this for other packet types (such as Packet8d) also. This optimization shows significant speedup in SpatialConvolution with certain parameters. Some examples are below. Benchmark parameters are specified as: Batch size, Input dim, Depth, Num of filters, Filter dim Speedup numbers are specified for number of threads 1, 2, 4, 8, 16. AVX512: Parameters \| Speedup (Num of threads: 1, 2, 4, 8, 16) ----------------------------\|------------------------------------------ 128, 24x24, 3, 64, 5x5 \|2.18X, 2.13X, 1.73X, 1.64X, 1.66X 128, 24x24, 1, 64, 8x8 \|2.00X, 1.98X, 1.93X, 1.91X, 1.91X 32, 24x24, 3, 64, 5x5 \|2.26X, 2.14X, 2.17X, 2.22X, 2.33X 128, 24x24, 3, 64, 3x3 \|1.51X, 1.45X, 1.45X, 1.67X, 1.57X 32, 14x14, 24, 64, 5x5 \|1.21X, 1.19X, 1.16X, 1.70X, 1.17X 128, 128x128, 3, 96, 11x11 \|2.17X, 2.18X, 2.19X, 2.20X, 2.18X AVX2: Parameters \| Speedup (Num of threads: 1, 2, 4, 8, 16) ----------------------------\|------------------------------------------ 128, 24x24, 3, 64, 5x5 \| 1.66X, 1.65X, 1.61X, 1.56X, 1.49X 32, 24x24, 3, 64, 5x5 \| 1.71X, 1.63X, 1.77X, 1.58X, 1.68X 128, 24x24, 1, 64, 5x5 \| 1.44X, 1.40X, 1.38X, 1.37X, 1.33X 128, 24x24, 3, 64, 3x3 \| 1.68X, 1.63X, 1.58X, 1.56X, 1.62X 128, 128x128, 3, 96, 11x11 \| 1.36X, 1.36X, 1.37X, 1.37X, 1.37X In the higher level benchmark cifar10, we observe a runtime improvement of around 6% for AVX512 on Intel Skylake server (8 cores). On lower level PackRhs micro-benchmarks specified in TensorFlow tensorflow/core/kernels/eigen_spatial_convolutions_test.cc, we observe the following runtime numbers: AVX512: Parameters \| Runtime without patch (ns) \| Runtime with patch (ns) \| Speedup ---------------------------------------------------------------\|----------------------------\|-------------------------\|--------- BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56) \| 41350 \| 15073 \| 2.74X BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56) \| 7277 \| 7341 \| 0.99X BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56) \| 8675 \| 8681 \| 1.00X BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56) \| 24155 \| 16079 \| 1.50X BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56) \| 25052 \| 17152 \| 1.46X BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56) \| 18269 \| 18345 \| 1.00X BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56) \| 19468 \| 19872 \| 0.98X BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432) \| 156060 \| 42432 \| 3.68X BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432) \| 132701 \| 36944 \| 3.59X AVX2: Parameters \| Runtime without patch (ns) \| Runtime with patch (ns) \| Speedup ---------------------------------------------------------------\|----------------------------\|-------------------------\|--------- BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56) \| 26233 \| 12393 \| 2.12X BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56) \| 6091 \| 6062 \| 1.00X BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56) \| 7427 \| 7408 \| 1.00X BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56) \| 23453 \| 20826 \| 1.13X BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56) \| 23167 \| 22091 \| 1.09X BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56) \| 23422 \| 23682 \| 0.99X BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56) \| 23165 \| 23663 \| 0.98X BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432) \| 72689 \| 44969 \| 1.62X BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432) \| 61732 \| 39779 \| 1.55X All benchmarks on Intel Skylake server with 8 cores.	2019-04-20 06:46:43 +00:00
Rasmus Munk Larsen	039ee52125	Tweak cost model for tensor contraction when parallelizing over the inner dimension. https://bitbucket.org/snippets/rmlarsen/MexxLo	2019-04-12 13:35:10 -07:00
Jonathon Koyle	9a3f06d836	Update TheadPoolDevice example to include ThreadPool creation and passing pointer into constructor.	2019-04-10 10:02:33 -06:00
Deven Desai	66a885b61e	adding EIGEN_DEVICE_FUNC to the recently added TensorContractionKernel constructor. Not having the EIGEN_DEVICE_FUNC attribute on it was leading to compiler errors when compiling Eigen in the ROCm/HIP path	2019-04-08 13:45:08 +00:00
Eugene Zhulenev	629ddebd15	Add missing semicolon	2019-04-02 15:04:26 -07:00
Eugene Zhulenev	4e2f6de1a8	Add support for custom packed Lhs/Rhs blocks in tensor contractions	2019-04-01 11:47:31 -07:00
Deven Desai	2dbea5510f	Merged eigen/eigen into default	2019-03-19 16:52:38 -04:00
David Tellenbach	bd9c2ae3fd	Fix include guard comments	2019-03-15 15:29:17 +01:00
Eugene Zhulenev	001f10e3c9	Fix segfaults with cuda compilation	2019-03-11 09:43:33 -07:00
Eugene Zhulenev	899c16fa2c	Fix a bug in TensorGenerator for 1d tensors	2019-03-11 09:42:01 -07:00
Eugene Zhulenev	0f8bfff23d	Fix a data race in NonBlockingThreadPool	2019-03-11 09:38:44 -07:00
Gael Guennebaud	2df4f00246	Change license from LGPL to MPL2 with agreement from David Harmon.	2019-03-07 18:17:10 +01:00
Rasmus Munk Larsen	3c3f639fe2	Merge.	2019-03-06 11:54:30 -08:00
Rasmus Munk Larsen	f4ec8edea8	Add macro EIGEN_AVOID_THREAD_LOCAL to make it possible to manually disable the use of thread_local.	2019-03-06 11:52:04 -08:00
Rasmus Munk Larsen	41cdc370d0	Fix placement of "#if defined(EIGEN_GPUCC)" guard region. Found with -Wundefined-func-template. Author: tkoeppe@google.com	2019-03-06 11:42:22 -08:00
Rasmus Munk Larsen	cc407c9d4d	Fix placement of "#if defined(EIGEN_GPUCC)" guard region. Found with -Wundefined-func-template. Author: tkoeppe@google.com	2019-03-06 11:40:06 -08:00
Eugene Zhulenev	1bc2a0a57c	Add missing return to NonBlockingThreadPool::LocalSteal	2019-03-06 10:49:49 -08:00
Eugene Zhulenev	4e4dcd9026	Remove redundant steal loop	2019-03-06 10:39:07 -08:00
Eugene Zhulenev	25abaa2e41	Check that inner block dimension is continuous	2019-03-05 17:34:35 -08:00
Eugene Zhulenev	5d9a6686ed	Block evaluation for TensorGeneratorOp	2019-03-05 16:35:21 -08:00
Eugene Zhulenev	a407e022e6	Tune tensor contraction threadpool heuristics	2019-03-05 14:19:59 -08:00
Eugene Zhulenev	56c6373f82	Add an extra check for the RunQueue size estimate	2019-03-05 11:51:26 -08:00
Eugene Zhulenev	efb5080d31	Do not initialize invalid fast_strides in TensorGeneratorOp	2019-03-04 16:58:49 -08:00
Eugene Zhulenev	b95941e5c2	Add tiled evaluation for TensorForcedEvalOp	2019-03-04 16:02:22 -08:00
Eugene Zhulenev	694084ecbd	Use fast divisors in TensorGeneratorOp	2019-03-04 11:10:21 -08:00
Rasmus Munk Larsen	cf4a1c81fa	Fix specialization for conjugate on non-complex types in TensorBase.h.	2019-03-01 14:21:09 -08:00
Rasmus Munk Larsen	6560692c67	Improve EventCount used by the non-blocking threadpool. The current algorithm requires threads to commit/cancel waiting in order they called Prewait. Spinning caused by that serialization can consume lots of CPU time on some workloads. Restructure the algorithm to not require that serialization and remove spin waits from Commit/CancelWait. Note: this reduces max number of threads from 2^16 to 2^14 to leave more space for ABA counter (which is now 22 bits). Implementation details are explained in comments.	2019-02-22 13:56:26 -08:00
Gael Guennebaud	9ac1634fdf	Fix conversion warnings	2019-02-19 21:59:53 +01:00
Rasmus Munk Larsen	071629a440	Fix incorrect value of NumDimensions in TensorContraction traits. Reported here: #1671	2019-02-19 10:49:54 -08:00
Rasmus Larsen	efeabee445	Merged in ezhulenev/eigen-01 (pull request PR-590) Do not generate no-op cast() and conjugate() expressions	2019-02-14 21:16:12 +00:00
Eugene Zhulenev	7b837559a7	Fix signed-unsigned return in RuqQueue	2019-02-14 10:40:21 -08:00
Eugene Zhulenev	f0d42d2265	Fix signed-unsigned comparison warning in RunQueue	2019-02-14 10:27:28 -08:00
Eugene Zhulenev	106ba7bb1a	Do not generate no-op cast() and conjugate() expressions	2019-02-14 09:51:51 -08:00
Eugene Zhulenev	8c2f30c790	Speedup Tensor ThreadPool RunQueu::Empty()	2019-02-13 10:20:53 -08:00
Eugene Zhulenev	21eb97d3e0	Add PacketConv implementation for non-vectorizable src expressions	2019-02-08 15:47:25 -08:00
Eugene Zhulenev	1e36166ed1	Optimize TensorConversion evaluator: do not convert same type	2019-02-08 15:13:24 -08:00
Steven Peters	953ca5ba2f	Spline.h: fix spelling "spang" -> "span"	2019-02-08 06:23:24 +00:00
Eugene Zhulenev	59998117bb	Don't do parallel_pack if we can use thread_local memory in tensor contractions	2019-02-07 09:21:25 -08:00
Eugene Zhulenev	8491127082	Do not reduce parallelism too much in contractions with small number of threads	2019-02-04 12:59:33 -08:00
Eugene Zhulenev	eb21bab769	Parallelize tensor contraction only by sharding dimension and use 'thread-local' memory for packing	2019-02-04 10:43:16 -08:00
Gael Guennebaud	d586686924	Workaround lack of support for arbitrary packet-type in Tensor by manually loading half/quarter packets in tensor contraction mapper.	2019-01-30 16:48:01 +01:00
Christoph Hertzberg	a7779a9b42	Hide some annoying unused variable warnings in g++8.1	2019-01-29 16:48:21 +01:00
Christoph Hertzberg	c9825b967e	Renaming even more `I` identifiers	2019-01-26 13:22:13 +01:00
Christoph Hertzberg	934b8a1304	Avoid `I` as an identifier, since it may clash with the C-header complex.h	2019-01-25 14:54:39 +01:00
Eugene Zhulenev	1e6d15b55b	Fix shorten-64-to-32 warning in TensorContractionThreadPool	2019-01-11 11:41:53 -08:00
Eugene Zhulenev	0abe03764c	Fix shorten-64-to-32 warning in TensorContractionThreadPool	2019-01-10 10:27:55 -08:00
Gael Guennebaud	d812f411c3	bug #1654 : fix compilation with cuda and no c++11	2019-01-09 18:00:05 +01:00
Eugene Zhulenev	e70ffef967	Optimize evalShardedByInnerDim	2019-01-08 16:26:31 -08:00
Rasmus Munk Larsen	dd6d65898a	Fix shorten-64-to-32 warning. Use regular memcpy if num_threads==0.	2018-12-12 14:45:31 -08:00
Gael Guennebaud	cf697272e1	Remove debug code.	2018-12-09 23:05:46 +01:00
Gael Guennebaud	450dc97c6b	Various fixes in polynomial solver and its unit tests: - cleanup noise in imaginary part of real roots - take into account the magnitude of the derivative to check roots. - use <= instead of < at appropriate places	2018-12-09 22:54:39 +01:00
Rasmus Munk Larsen	8a02883d58	Merged in markdryan/eigen/avx512-contraction-2 (pull request PR-554) Fix tensor contraction on AVX512 builds Approved-by: Rasmus Munk Larsen <rmlarsen@google.com>	2018-12-05 18:19:32 +00:00
Mark D Ryan	36f8f6d0be	Fix evalShardedByInnerDim for AVX512 builds evalShardedByInnerDim ensures that the values it passes for start_k and end_k to evalGemmPartialWithoutOutputKernel are multiples of 8 as the kernel does not work correctly when the values of k are not multiples of the packet_size. While this precaution works for AVX builds, it is insufficient for AVX512 builds where the maximum packet size is 16. The result is slightly incorrect float32 contractions on AVX512 builds. This commit fixes the problem by ensuring that k is always a multiple of the packet_size if the packet_size is > 8.	2018-12-05 12:29:03 +01:00
Christoph Hertzberg	0ec8afde57	Fixed most conversion warnings in MatrixFunctions module	2018-11-20 16:23:28 +01:00
Deven Desai	e7e6809e6b	ROCm/HIP specfic fixes + updates 1. Eigen/src/Core/arch/GPU/Half.h Updating the HIPCC implementation half so that it can declared as a __shared__ variable 2. Eigen/src/Core/util/Macros.h, Eigen/src/Core/util/Memory.h introducing a EIGEN_USE_STD(func) macro that calls - std::func be default - ::func when eigen is being compiled with HIPCC This change was requested in the previous HIP PR (https://bitbucket.org/eigen/eigen/pull-requests/518/pr-with-hip-specific-fixes-for-the-eigen/diff) 3. unsupported/Eigen/CXX11/src/Tensor/TensorDeviceThreadPool.h Removing EIGEN_DEVICE_FUNC attribute from pure virtual methods as it is not supported by HIPCC 4. unsupported/Eigen/CXX11/src/Tensor/TensorReduction.h Disabling the template specializations of InnerMostDimReducer as they run into HIPCC link errors	2018-11-19 18:13:59 +00:00
Rasmus Munk Larsen	72928a2c8a	Merged in rmlarsen/eigen2 (pull request PR-543) Add parallel memcpy to TensorThreadPoolDevice in Eigen, but limit the number of threads to 4, beyond which we just seem to be wasting CPU cycles as the threads contend for memory bandwidth. Approved-by: Eugene Zhulenev <ezhulenev@google.com>	2018-11-13 17:10:30 +00:00
Rasmus Munk Larsen	cda479d626	Remove accidental changes.	2018-11-12 18:34:04 -08:00
Rasmus Munk Larsen	719d9aee65	Add parallel memcpy to TensorThreadPoolDevice in Eigen, but limit the number of threads to 4, beyond which we just seem to be wasting CPU cycles as the threads contend for memory bandwidth.	2018-11-12 17:46:02 -08:00
Rasmus Munk Larsen	93f9988a7e	A few small fixes to a) prevent throwing in ctors and dtors of the threading code, and b) supporting matrix exponential on platforms with 113 bits of mantissa for long doubles.	2018-11-09 14:15:32 -08:00
Christoph Hertzberg	449ff74672	Fix most Doxygen warnings. Also add links to stable documentation from unsupported modules (by using the corresponding Doxytags file). Manually grafted from `d107a371c6`	2018-10-19 21:10:28 +02:00
Rasmus Munk Larsen	dda68f56ec	Fix GPU build due to gpu_assert not always being defined.	2018-10-18 16:29:29 -07:00
Eugene Zhulenev	9e96e91936	Move from rvalue arguments in ThreadPool enqueue* methods	2018-10-16 16:48:32 -07:00
Eugene Zhulenev	217d839816	Reduce thread scheduling overhead in parallelFor	2018-10-16 14:53:06 -07:00
Rasmus Munk Larsen	d52763bb4f	Merged in ezhulenev/eigen-02 (pull request PR-528) [TensorBlockIO] Check if it's allowed to squeeze inner dimensions Approved-by: Rasmus Munk Larsen <rmlarsen@google.com>	2018-10-16 15:39:40 +00:00
Eugene Zhulenev	900c7c61bb	Check if it's allowed to squueze inner dimensions in TensorBlockIO	2018-10-15 16:52:33 -07:00
Gael Guennebaud	f0fb95135d	Iterative solvers: unify and fix handling of multiple rhs. m_info was not properly computed and the logic was repeated in several places.	2018-10-15 23:47:46 +02:00
Gael Guennebaud	2747b98cfc	DGMRES: fix null rhs, fix restart, fix m_isDeflInitialized for multiple solve	2018-10-15 23:46:00 +02:00
Christoph Hertzberg	3f2c8b7ff0	Fix a lot of Doxygen warnings in Tensor module	2018-10-09 20:22:47 +02:00
Rasmus Munk Larsen	d16634c4d4	Fix out-of bounds access in TensorArgMax.h.	2018-10-08 16:41:36 -07:00
Gael Guennebaud	64b1a15318	Workaround stupid warning	2018-10-08 12:01:18 +02:00
Christoph Hertzberg	b92c71235d	Move struct outside of method for C++03 compatibility.	2018-10-02 18:59:10 +02:00
Christoph Hertzberg	051f9c1aff	Make code compile in C++03 mode again	2018-10-02 18:36:30 +02:00
Christoph Hertzberg	b786ce8c72	Fix conversion warning ... again	2018-10-02 18:35:25 +02:00
Christoph Hertzberg	564ca71e39	Merged in deven-amd/eigen/HIP_fixes (pull request PR-518) PR with HIP specific fixes (for the eigen nightly regression failures in HIP mode)	2018-10-01 16:51:04 +00:00
Deven Desai	94898488a6	This commit contains the following (HIP specific) updates: - unsupported/Eigen/CXX11/src/Tensor/TensorReductionGpu.h Changing "pass-by-reference" argument to be "pass-by-value" instead (in a __global__ function decl). "pass-by-reference" arguments to __global__ functions are unwise, and will be explicitly flagged as errors by the newer versions of HIP. - Eigen/src/Core/util/Memory.h - unsupported/Eigen/CXX11/src/Tensor/TensorContraction.h Changes introduced in recent commits breaks the HIP compile. Adding EIGEN_DEVICE_FUNC attribute to some functions and calling ::malloc/free instead of the corresponding std:: versions to get the HIP compile working again - unsupported/Eigen/CXX11/src/Tensor/TensorReduction.h Change introduced a recent commit breaks the HIP compile (link stage errors out due to failure to inline a function). Disabling the recently introduced code (only for HIP compile), to get the eigen nightly testing going again. Will submit another PR once we have te proper fix. - Eigen/src/Core/util/ConfigureVectorization.h Enabling GPU VECTOR support when HIP compiler is in use (for both the host and device compile phases)	2018-10-01 14:28:37 +00:00
Rasmus Munk Larsen	2088c0897f	Merged eigen/eigen into default	2018-09-28 16:00:46 -07:00
Rasmus Munk Larsen	31629bb964	Get rid of unused variable warning.	2018-09-28 16:00:09 -07:00
Eugene Zhulenev	bb13d5d917	Fix bug in copy optimization in Tensor slicing.	2018-09-28 14:34:42 -07:00
Rasmus Munk Larsen	104e8fa074	Fix a few warnings and rename a variable to not shadow "last".	2018-09-28 12:00:08 -07:00
Rasmus Munk Larsen	7c1b47840a	Merged in ezhulenev/eigen-01 (pull request PR-514) Add tests for evalShardedByInnerDim contraction + fix bugs	2018-09-28 18:37:54 +00:00
Eugene Zhulenev	524c81f3fa	Add tests for evalShardedByInnerDim contraction + fix bugs	2018-09-28 11:24:08 -07:00
Christoph Hertzberg	86ba50be39	Fix integer conversion warnings	2018-09-28 19:33:39 +02:00
Eugene Zhulenev	e95696acb3	Optimize TensorBlockCopyOp	2018-09-27 14:49:26 -07:00
Eugene Zhulenev	9f33e71e9d	Revert code lost in merge	2018-09-27 12:08:17 -07:00
Eugene Zhulenev	a7a3e9f2b6	Merge with eigen/eigen default	2018-09-27 12:05:06 -07:00
Eugene Zhulenev	9f4988959f	Remove explicit mkldnn support and redundant TensorContractionKernelBlocking	2018-09-27 11:49:19 -07:00
Rasmus Munk Larsen	d956204ab2	Remove "false &&" left over from test.	2018-09-26 17:03:30 -07:00
Rasmus Munk Larsen	3815aeed7a	Parallelize tensor contraction over the inner dimension in cases where where one or both of the outer dimensions (m and n) are small but k is large. This speeds up individual matmul microbenchmarks by up to 85%. Naming below is BM_Matmul_M_K_N_THREADS, measured on a 2-socket Intel Broadwell-based server. Benchmark Base (ns) New (ns) Improvement ------------------------------------------------------------------ BM_Matmul_1_80_13522_1 387457 396013 -2.2% BM_Matmul_1_80_13522_2 406487 230789 +43.2% BM_Matmul_1_80_13522_4 395821 123211 +68.9% BM_Matmul_1_80_13522_6 391625 97002 +75.2% BM_Matmul_1_80_13522_8 408986 113828 +72.2% BM_Matmul_1_80_13522_16 399988 67600 +83.1% BM_Matmul_1_80_13522_22 411546 60044 +85.4% BM_Matmul_1_80_13522_32 393528 57312 +85.4% BM_Matmul_1_80_13522_44 390047 63525 +83.7% BM_Matmul_1_80_13522_88 387876 63592 +83.6% BM_Matmul_1_1500_500_1 245359 248119 -1.1% BM_Matmul_1_1500_500_2 401833 143271 +64.3% BM_Matmul_1_1500_500_4 210519 100231 +52.4% BM_Matmul_1_1500_500_6 251582 86575 +65.6% BM_Matmul_1_1500_500_8 211499 80444 +62.0% BM_Matmul_3_250_512_1 70297 68551 +2.5% BM_Matmul_3_250_512_2 70141 52450 +25.2% BM_Matmul_3_250_512_4 67872 58204 +14.2% BM_Matmul_3_250_512_6 71378 63340 +11.3% BM_Matmul_3_250_512_8 69595 41652 +40.2% BM_Matmul_3_250_512_16 72055 42549 +40.9% BM_Matmul_3_250_512_22 70158 54023 +23.0% BM_Matmul_3_250_512_32 71541 56042 +21.7% BM_Matmul_3_250_512_44 71843 57019 +20.6% BM_Matmul_3_250_512_88 69951 54045 +22.7% BM_Matmul_3_1500_512_1 369328 374284 -1.4% BM_Matmul_3_1500_512_2 428656 223603 +47.8% BM_Matmul_3_1500_512_4 205599 139508 +32.1% BM_Matmul_3_1500_512_6 214278 139071 +35.1% BM_Matmul_3_1500_512_8 184149 142338 +22.7% BM_Matmul_3_1500_512_16 156462 156983 -0.3% BM_Matmul_3_1500_512_22 163905 158259 +3.4% BM_Matmul_3_1500_512_32 155314 157662 -1.5% BM_Matmul_3_1500_512_44 235434 158657 +32.6% BM_Matmul_3_1500_512_88 156779 160275 -2.2% BM_Matmul_1500_4_512_1 363358 349528 +3.8% BM_Matmul_1500_4_512_2 303134 263319 +13.1% BM_Matmul_1500_4_512_4 176208 130086 +26.2% BM_Matmul_1500_4_512_6 148026 115449 +22.0% BM_Matmul_1500_4_512_8 131656 98421 +25.2% BM_Matmul_1500_4_512_16 134011 82861 +38.2% BM_Matmul_1500_4_512_22 134950 85685 +36.5% BM_Matmul_1500_4_512_32 133165 90081 +32.4% BM_Matmul_1500_4_512_44 133203 90644 +32.0% BM_Matmul_1500_4_512_88 134106 100566 +25.0% BM_Matmul_4_1500_512_1 439243 435058 +1.0% BM_Matmul_4_1500_512_2 451830 257032 +43.1% BM_Matmul_4_1500_512_4 276434 164513 +40.5% BM_Matmul_4_1500_512_6 182542 144827 +20.7% BM_Matmul_4_1500_512_8 179411 166256 +7.3% BM_Matmul_4_1500_512_16 158101 155560 +1.6% BM_Matmul_4_1500_512_22 152435 155448 -1.9% BM_Matmul_4_1500_512_32 155150 149538 +3.6% BM_Matmul_4_1500_512_44 193842 149777 +22.7% BM_Matmul_4_1500_512_88 149544 154468 -3.3%	2018-09-26 16:47:13 -07:00
Eugene Zhulenev	71cd3fbd6a	Support multiple contraction kernel types in TensorContractionThreadPool	2018-09-26 11:08:47 -07:00
Christoph Hertzberg	2c083ace3e	Provide EIGEN_OVERRIDE and EIGEN_FINAL macros to mark virtual function overrides	2018-09-24 18:01:17 +02:00
Gael Guennebaud	c696dbcaa6	Fiw shadowing of last and all	2018-09-21 23:02:33 +02:00
Gael Guennebaud	4291f167ee	Add missing plugins to DynamicSparseMatrix -- fix sparse_extra_3	2018-09-21 14:53:43 +02:00
Rasmus Munk Larsen	8e2be7777e	Merged eigen/eigen into default	2018-09-20 11:41:15 -07:00
Rasmus Munk Larsen	5d2e759329	Initialize BlockIteratorState in a C++03 compatible way.	2018-09-20 11:40:43 -07:00
Gael Guennebaud	e04faca930	merge	2018-09-20 18:33:54 +02:00
Gael Guennebaud	d37188b9c1	Fix MPrealSupport	2018-09-20 18:30:10 +02:00
Gael Guennebaud	3c6dc93f99	Fix GPU support.	2018-09-20 18:29:21 +02:00
Gael Guennebaud	9419f506d0	Fix regression introduced by the previous fix for AVX512. It brokes the complex-complex case on SSE.	2018-09-20 17:32:34 +02:00
Christoph Hertzberg	a0166ab651	Workaround for spurious "array subscript is above array bounds" warnings with g++4.x	2018-09-20 17:08:43 +02:00
Gael Guennebaud	71496b0e25	Fix gebp kernel for real+complex in case only reals are vectorized (e.g., AVX512). This commit also removes "half-packet" from data-mappers: it was not used and conceptually broken anyways.	2018-09-20 17:01:24 +02:00
Rasmus Munk Larsen	44d8274383	Cast to longer type.	2018-09-19 13:31:42 -07:00
Rasmus Munk Larsen	d638b62dda	Silence compiler warning.	2018-09-19 13:27:55 -07:00
Rasmus Munk Larsen	db9c9df59a	Silence more compiler warnings.	2018-09-19 11:50:27 -07:00
Rasmus Munk Larsen	febd09dcc0	Silence compiler warnings in ThreadPoolInterface.h.	2018-09-19 11:11:04 -07:00
luz.paz"	f67b19a884	[PATCH 1/2] Misc. typos From 68d431b4c14ad60a778ee93c1f59ecc4b931950e Mon Sep 17 00:00:00 2001 Found via `codespell -q 3 -I ../eigen-word-whitelist.txt` where the whitelists consists of: ``` als ans cas dum lastr lowd nd overfl pres preverse substraction te uint whch ``` --- CMakeLists.txt \| 26 +++++++++---------- Eigen/src/Core/GenericPacketMath.h \| 2 +- Eigen/src/SparseLU/SparseLU.h \| 2 +- bench/bench_norm.cpp \| 2 +- doc/HiPerformance.dox \| 2 +- doc/QuickStartGuide.dox \| 2 +- .../Eigen/CXX11/src/Tensor/TensorChipping.h \| 6 ++--- .../Eigen/CXX11/src/Tensor/TensorDeviceGpu.h \| 2 +- .../src/Tensor/TensorForwardDeclarations.h \| 4 +-- .../src/Tensor/TensorGpuHipCudaDefines.h \| 2 +- .../Eigen/CXX11/src/Tensor/TensorReduction.h \| 2 +- .../CXX11/src/Tensor/TensorReductionGpu.h \| 2 +- .../test/cxx11_tensor_concatenation.cpp \| 2 +- unsupported/test/cxx11_tensor_executor.cpp \| 2 +- 14 files changed, 29 insertions(+), 29 deletions(-)	2018-09-18 04:15:01 -04:00
Eugene Zhulenev	c4627039ac	Support static dimensions (aka IndexList) in Tensor::resize(...)	2018-09-18 14:25:21 -07:00
Eugene Zhulenev	218a7b9840	Enable DSizes type promotion with c++03 compilers	2018-09-18 10:57:00 -07:00
Ravi Kiran	1f0c941c3d	Collapsed revision * Merged eigen/eigen into default	2018-09-17 18:29:12 -07:00
Rasmus Munk Larsen	03a88c57e1	Merged in ezhulenev/eigen-02 (pull request PR-498) Add DSizes index type promotion	2018-09-17 21:58:38 +00:00
Rasmus Munk Larsen	5ca0e4a245	Merged in ezhulenev/eigen-01 (pull request PR-497) Fix warnings in IndexList array_prod	2018-09-17 20:15:06 +00:00
Eugene Zhulenev	a5cd4e9ad1	Replace deprecated Eigen::DenseIndex with Eigen::Index in TensorIndexList	2018-09-17 10:58:07 -07:00
Gael Guennebaud	b311bfb752	bug #1596 : fix inclusion of Eigen's header within unsupported modules.	2018-09-17 09:54:29 +02:00
Gael Guennebaud	72f19c827a	typo	2018-09-16 22:10:34 +02:00
Eugene Zhulenev	66f056776f	Add DSizes index type promotion	2018-09-15 15:17:38 -07:00
Eugene Zhulenev	f313126dab	Fix warnings in IndexList array_prod	2018-09-15 13:47:54 -07:00
Christoph Hertzberg	42705ba574	Fix weird error for building with g++-4.7 in C++03 mode.	2018-09-15 12:43:41 +02:00
Rasmus Munk Larsen	c2383f95af	Merged in ezhulenev/eigen/fix_dsizes (pull request PR-494) Fix DSizes IndexList constructor	2018-09-15 02:36:19 +00:00
Rasmus Munk Larsen	30290cdd56	Merged in ezhulenev/eigen/moar_eigen_fixes_3 (pull request PR-493) Const cast scalar pointer in TensorSlicingOp evaluator Approved-by: Sameer Agarwal <sameeragarwal@google.com>	2018-09-15 02:35:07 +00:00
Eugene Zhulenev	f7d0053cf0	Fix DSizes IndexList constructor	2018-09-14 19:19:13 -07:00
Rasmus Munk Larsen	601e289d27	Merged in ezhulenev/eigen/moar_eigen_fixes_1 (pull request PR-492) Explicitly construct tensor block dimensions from evaluator dimensions	2018-09-15 01:36:21 +00:00
Eugene Zhulenev	71070a1e84	Const cast scalar pointer in TensorSlicingOp evaluator	2018-09-14 17:17:50 -07:00

... 2 3 4 5 6 ...

2371 Commits