eigen

mirror of https://gitlab.com/libeigen/eigen.git synced 2024-12-21 07:19:46 +08:00

Author	SHA1	Message	Date
Eugene Zhulenev	01d7e6ee9b	Check if gpu_assert was overridden in TensorGpuHipCudaDefines	2019-04-25 11:19:17 -07:00
Eugene Zhulenev	8ead5bb3d8	Fix doxygen warnings to enable static code analysis	2019-04-24 12:42:28 -07:00
Eugene Zhulenev	07355d47c6	Get rid of SequentialLinSpacedReturnType deprecation warnings in DenseBase.h	2019-04-24 11:01:35 -07:00
Rasmus Munk Larsen	144ca33321	Remove deprecation annotation from typedef Eigen::Index Index, as it would generate too many build warnings.	2019-04-24 08:50:07 -07:00
Eugene Zhulenev	a7b7f3ca8a	Add missing EIGEN_DEPRECATED annotations to deprecated functions and fix a few other doxygen warnings	2019-04-23 17:23:19 -07:00
Eugene Zhulenev	68a2a8c445	Use packet ops instead of AVX2 intrinsics	2019-04-23 11:41:02 -07:00
Anuj Rawat	8c7a6feb8e	Adding lowlevel APIs for optimized RHS packet load in TensorFlow SpatialConvolution

Low-level APIs are added in order to optimize packet load in gemm_pack_rhs in TensorFlow SpatialConvolution. The optimization is for the scenario when a packet is split across 2 adjacent columns. In this case we read it as two 'partial' packets and then merge these into 1. Currently this only works for Packet16f (AVX512) and Packet8f (AVX2). We plan to add this for other packet types (such as Packet8d) also.

This optimization shows significant speedup in SpatialConvolution with certain parameters. Some examples are below.

Benchmark parameters are specified as: Batch size, Input dim, Depth, Num of filters, Filter dim
Speedup numbers are specified for number of threads 1, 2, 4, 8, 16.

AVX512:

Parameters                  | Speedup (Num of threads: 1, 2, 4, 8, 16)
----------------------------|------------------------------------------
128, 24x24, 3, 64, 5x5      | 2.18X, 2.13X, 1.73X, 1.64X, 1.66X
128, 24x24, 1, 64, 8x8      | 2.00X, 1.98X, 1.93X, 1.91X, 1.91X
32, 24x24, 3, 64, 5x5       | 2.26X, 2.14X, 2.17X, 2.22X, 2.33X
128, 24x24, 3, 64, 3x3      | 1.51X, 1.45X, 1.45X, 1.67X, 1.57X
32, 14x14, 24, 64, 5x5      | 1.21X, 1.19X, 1.16X, 1.70X, 1.17X
128, 128x128, 3, 96, 11x11  | 2.17X, 2.18X, 2.19X, 2.20X, 2.18X

AVX2:

Parameters                  | Speedup (Num of threads: 1, 2, 4, 8, 16)
----------------------------|------------------------------------------
128, 24x24, 3, 64, 5x5      | 1.66X, 1.65X, 1.61X, 1.56X, 1.49X
32, 24x24, 3, 64, 5x5       | 1.71X, 1.63X, 1.77X, 1.58X, 1.68X
128, 24x24, 1, 64, 5x5      | 1.44X, 1.40X, 1.38X, 1.37X, 1.33X
128, 24x24, 3, 64, 3x3      | 1.68X, 1.63X, 1.58X, 1.56X, 1.62X
128, 128x128, 3, 96, 11x11  | 1.36X, 1.36X, 1.37X, 1.37X, 1.37X

In the higher level benchmark cifar10, we observe a runtime improvement of around 6% for AVX512 on Intel Skylake server (8 cores).

On lower level PackRhs micro-benchmarks specified in TensorFlow tensorflow/core/kernels/eigen_spatial_convolutions_test.cc, we observe the following runtime numbers:

AVX512:

Parameters                                                     | Runtime without patch (ns) | Runtime with patch (ns) | Speedup
---------------------------------------------------------------|----------------------------|-------------------------|---------
BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56)  | 41350                      | 15073                   | 2.74X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56)  | 7277                       | 7341                    | 0.99X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56)  | 8675                       | 8681                    | 1.00X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56)  | 24155                      | 16079                   | 1.50X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56)  | 25052                      | 17152                   | 1.46X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56) | 18269                      | 18345                   | 1.00X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56) | 19468                      | 19872                   | 0.98X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432)   | 156060                     | 42432                   | 3.68X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432)   | 132701                     | 36944                   | 3.59X

AVX2:

Parameters                                                     | Runtime without patch (ns) | Runtime with patch (ns) | Speedup
---------------------------------------------------------------|----------------------------|-------------------------|---------
BM_RHS_NAME(PackRhs, 128, 24, 24, 3, 64, 5, 5, 1, 1, 256, 56)  | 26233                      | 12393                   | 2.12X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 1, 1, 256, 56)  | 6091                       | 6062                    | 1.00X
BM_RHS_NAME(PackRhs, 32, 64, 64, 32, 64, 5, 5, 2, 2, 256, 56)  | 7427                       | 7408                    | 1.00X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 1, 1, 256, 56)  | 23453                      | 20826                   | 1.13X
BM_RHS_NAME(PackRhs, 32, 64, 64, 30, 64, 5, 5, 2, 2, 256, 56)  | 23167                      | 22091                   | 1.09X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 1, 1, 256, 56) | 23422                      | 23682                   | 0.99X
BM_RHS_NAME(PackRhs, 32, 256, 256, 4, 16, 8, 8, 2, 4, 256, 56) | 23165                      | 23663                   | 0.98X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 1, 1, 36, 432)   | 72689                      | 44969                   | 1.62X
BM_RHS_NAME(PackRhs, 32, 64, 64, 4, 16, 3, 3, 2, 2, 36, 432)   | 61732                      | 39779                   | 1.55X

All benchmarks on Intel Skylake server with 8 cores.	2019-04-20 06:46:43 +00:00
Christoph Hertzberg	4270c62812	Split the implementation of i?amax/min into two. Based on PR-627 by Sameer Agarwal. Like the Netlib reference implementation, IAMAX now uses the L1-norm instead of the L2-norm for each element. Changed IMIN accordingly.	2019-04-15 17:18:03 +02:00
Rasmus Munk Larsen	039ee52125	Tweak cost model for tensor contraction when parallelizing over the inner dimension. https://bitbucket.org/snippets/rmlarsen/MexxLo	2019-04-12 13:35:10 -07:00
Jonathon Koyle	9a3f06d836	Update ThreadPoolDevice example to include ThreadPool creation and passing pointer into constructor.	2019-04-10 10:02:33 -06:00
Deven Desai	66a885b61e	adding EIGEN_DEVICE_FUNC to the recently added TensorContractionKernel constructor. Not having the EIGEN_DEVICE_FUNC attribute on it was leading to compiler errors when compiling Eigen in the ROCm/HIP path	2019-04-08 13:45:08 +00:00
Eugene Zhulenev	629ddebd15	Add missing semicolon	2019-04-02 15:04:26 -07:00
Eugene Zhulenev	4e2f6de1a8	Add support for custom packed Lhs/Rhs blocks in tensor contractions	2019-04-01 11:47:31 -07:00
Gael Guennebaud	45e65fbb77	bug #1695 : fix a numerical robustness issue. Computing the secular equation at the middle range without a shift might give a wrong sign.	2019-03-27 20:16:58 +01:00
William D. Irons	8de66719f9	Collapsed revision from PR-619 * Add support for pcmp_eq in AltiVec/Complex.h * Fixed implementation of pcmp_eq for double The new logic is based on the logic from NEON for double.	2019-03-26 18:14:49 +00:00
Gael Guennebaud	f11364290e	ICC does not support -fno-unsafe-math-optimizations	2019-03-22 09:26:24 +01:00
Deven Desai	51e399fc15	updates requested in the PR feedback. Also dropping code within #ifdef EIGEN_HAS_OLD_HIP_FP16	2019-03-19 21:45:25 +00:00
Deven Desai	2dbea5510f	Merged eigen/eigen into default	2019-03-19 16:52:38 -04:00
Rasmus Larsen	5c93b38c5f	Merged in rmlarsen/eigen (pull request PR-618) Make clipping outside [-18:18] consistent for vectorized and non-vectorized paths of scalar_logistic_op<float>. Approved-by: Gael Guennebaud <g.gael@free.fr>	2019-03-18 15:51:55 +00:00
Gael Guennebaud	48898a988a	fix unit test in c++03: c++03 does not allow passing local or anonymous enum as template param	2019-03-18 11:38:36 +01:00
Gael Guennebaud	cf7e2e277f	bug #1692 : enable enum as sizes of Matrix and Array	2019-03-17 21:59:30 +01:00
Rasmus Munk Larsen	e42f9aa68a	Make clipping outside [-18:18] consistent for vectorized and non-vectorized paths of scalar_logistic_op<float>.	2019-03-15 17:15:14 -07:00
Rasmus Larsen	1936aac43f	Merged in tellenbach/eigen/sykline_consistent_include_guards (pull request PR-617) Fix include guard comments for Skyline module	2019-03-15 20:04:56 +00:00
David Tellenbach	bd9c2ae3fd	Fix include guard comments	2019-03-15 15:29:17 +01:00
Rasmus Munk Larsen	8450a6d519	Clean up half packet traits and add a few more missing packet ops.	2019-03-14 15:18:06 -07:00
David Tellenbach	b013176e52	Remove undefined std::complex<int>	2019-03-14 11:40:28 +01:00
David Tellenbach	97f9a46cb9	PR 593: Add variadic ctor for DiagonalMatrix with unit tests	2019-03-14 10:18:24 +01:00
Gael Guennebaud	45ab514fe2	revert debug stuff	2019-03-14 10:08:12 +01:00
Rasmus Munk Larsen	6a34003141	Remove EIGEN_MPL2_ONLY guard in IncompleteCholesky that is no longer needed after the AMD reordering code was relicensed to MPL2.	2019-03-13 11:52:41 -07:00
Gael Guennebaud	d7d2f0680e	bug #1684 : partially workaround clang's 6/7 bug #40815	2019-03-13 10:40:01 +01:00
Rasmus Larsen	690f0795d0	Merged in rmlarsen/eigen (pull request PR-615) Clean up PacketMathHalf.h and add a few missing logical packet ops.	2019-03-12 16:09:48 +00:00
Thomas Capricelli	1901433674	erm.. use proper id	2019-03-12 13:53:38 +01:00
Thomas Capricelli	90302aa8c9	update tracking code	2019-03-12 13:47:01 +01:00
Rasmus Munk Larsen	77f7d4a894	Clean up PacketMathHalf.h and add a few missing logical packet ops.	2019-03-11 17:51:16 -07:00
Eugene Zhulenev	001f10e3c9	Fix segfaults with cuda compilation	2019-03-11 09:43:33 -07:00
Eugene Zhulenev	899c16fa2c	Fix a bug in TensorGenerator for 1d tensors	2019-03-11 09:42:01 -07:00
Eugene Zhulenev	0f8bfff23d	Fix a data race in NonBlockingThreadPool	2019-03-11 09:38:44 -07:00
Gael Guennebaud	656d9bc66b	Apply SSE's pmin/pmax fix for GCC <= 5 to AVX's pmin/pmax	2019-03-10 21:19:18 +01:00
Gael Guennebaud	2df4f00246	Change license from LGPL to MPL2 with agreement from David Harmon.	2019-03-07 18:17:10 +01:00
Rasmus Munk Larsen	3c3f639fe2	Merge.	2019-03-06 11:54:30 -08:00
Rasmus Munk Larsen	f4ec8edea8	Add macro EIGEN_AVOID_THREAD_LOCAL to make it possible to manually disable the use of thread_local.	2019-03-06 11:52:04 -08:00
Rasmus Munk Larsen	41cdc370d0	Fix placement of "#if defined(EIGEN_GPUCC)" guard region. Found with -Wundefined-func-template. Author: tkoeppe@google.com	2019-03-06 11:42:22 -08:00
Rasmus Munk Larsen	cc407c9d4d	Fix placement of "#if defined(EIGEN_GPUCC)" guard region. Found with -Wundefined-func-template. Author: tkoeppe@google.com	2019-03-06 11:40:06 -08:00
Eugene Zhulenev	1bc2a0a57c	Add missing return to NonBlockingThreadPool::LocalSteal	2019-03-06 10:49:49 -08:00
Eugene Zhulenev	4e4dcd9026	Remove redundant steal loop	2019-03-06 10:39:07 -08:00
Rasmus Larsen	4d808e834a	Merged in rmlarsen/eigen_threadpool (pull request PR-606) Remove EIGEN_MPL2_ONLY guards around code re-licensed from LGPL to MPL2 in `2ca1e73239` Approved-by: Sameer Agarwal <sameeragarwal@google.com>	2019-03-06 17:59:03 +00:00
Rasmus Larsen	2ea18e505f	Merged in ezhulenev/eigen-01 (pull request PR-610) Block evaluation for TensorGeneratorOp	2019-03-06 16:49:38 +00:00
Eugene Zhulenev	25abaa2e41	Check that inner block dimension is continuous	2019-03-05 17:34:35 -08:00
Eugene Zhulenev	5d9a6686ed	Block evaluation for TensorGeneratorOp	2019-03-05 16:35:21 -08:00
Rasmus Larsen	b4861f4778	Merged in ezhulenev/eigen-01 (pull request PR-609) Tune tensor contraction threadpool heuristics	2019-03-05 23:54:40 +00:00
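
Commit 4270c62812 above notes that IAMAX now follows the Netlib reference implementation and uses the L1-norm of each element instead of the L2-norm. The difference only matters for complex input. A minimal sketch of the two measures follows, assuming the usual Netlib measure |Re(z)| + |Im(z)|. The helper names (`abs1`, `index_of_max`, `iamax_abs1`, `iamax_modulus`) are hypothetical and this is not Eigen's BLAS code; indices are 0-based here for simplicity.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Per-element L1 measure: |Re(z)| + |Im(z)| (assumed to match the Netlib reference).
template <typename T>
T abs1(const std::complex<T>& z) {
  return std::abs(z.real()) + std::abs(z.imag());
}

// Hypothetical helper: 0-based index of the first element with the largest
// measure, or -1 for an empty vector.
template <typename T, typename Measure>
std::ptrdiff_t index_of_max(const std::vector<std::complex<T>>& x, Measure measure) {
  if (x.empty()) return -1;
  std::size_t best = 0;
  T best_value = measure(x[0]);
  for (std::size_t i = 1; i < x.size(); ++i) {
    const T value = measure(x[i]);
    if (value > best_value) {  // strict '>' keeps the first maximum on ties
      best_value = value;
      best = i;
    }
  }
  return static_cast<std::ptrdiff_t>(best);
}

// Index of the maximum under the per-element L1 measure.
template <typename T>
std::ptrdiff_t iamax_abs1(const std::vector<std::complex<T>>& x) {
  return index_of_max(x, [](const std::complex<T>& z) { return abs1(z); });
}

// Index of the maximum under the per-element L2 measure (complex modulus).
template <typename T>
std::ptrdiff_t iamax_modulus(const std::vector<std::complex<T>>& x) {
  return index_of_max(x, [](const std::complex<T>& z) { return std::abs(z); });
}

// Small self-check: the two measures pick different elements for this input.
inline void demo() {
  const std::vector<std::complex<double>> x = {
      std::complex<double>(3.0, 3.0),   // |Re|+|Im| = 6, modulus about 4.24
      std::complex<double>(0.0, 5.0)};  // |Re|+|Im| = 5, modulus = 5
  assert(iamax_abs1(x) == 0);
  assert(iamax_modulus(x) == 1);
}
```

For real-valued input both measures reduce to the absolute value, so the two variants agree there.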


10597 Commits