eigen

mirror of https://gitlab.com/libeigen/eigen.git synced 2025-01-06 14:14:46 +08:00

Author	SHA1	Message	Date
Rasmus Larsen	e7b481ea74	Merged in rmlarsen/eigen (pull request PR-578) Speed up Eigen matrix*vector and vector*matrix multiplication. Approved-by: Eugene Zhulenev <ezhulenev@google.com>	2019-02-02 01:53:44 +00:00
Sameer Agarwal	b55b5c7280	Speed up row-major matrix-vector product on ARM The row-major matrix-vector multiplication code uses a threshold to check if processing 8 rows at a time would thrash the cache. This change introduces two modifications to this logic. 1. A smaller threshold for ARM and ARM64 devices. The value of this threshold was determined empirically using a Pixel2 phone, by benchmarking a large number of matrix-vector products in the range [1..4096]x[1..4096] and measuring performance separately on big and little cores with frequency pinning. On big (out-of-order) cores, this change has little to no impact. But on the small (in-order) cores, the matrix-vector products are up to 700% faster, especially on large matrices. The motivation for this change was some internal code at Google which was using hand-written NEON for implementing similar functionality, processing the matrix one row at a time, which exhibited substantially better performance than Eigen. With the current change, Eigen handily beats that code. 2. Make the logic for choosing number of simultaneous rows apply uniformly to 8, 4 and 2 rows instead of just 8 rows. Since the default threshold for non-ARM devices is essentially unchanged (32000 -> 32 * 1024), this change has no impact on non-ARM performance. This was verified by running the same set of benchmarks on a Xeon desktop.	2019-02-01 15:23:53 -08:00
Rasmus Munk Larsen	4c0fa6ce0f	Speed up Eigen matrix*vector and vector*matrix multiplication. This change speeds up Eigen matrix * vector and vector * matrix multiplication for dynamic matrices when it is known at runtime that one of the factors is a vector. The benchmarks below test c.noalias()= n_by_n_matrix * n_by_1_matrix; c.noalias()= 1_by_n_matrix * n_by_n_matrix; respectively. Benchmark measurements: SSE: Run on * (72 X 2992 MHz CPUs); 2019-01-28T17:51:44.452697457-08:00 CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB Benchmark Base (ns) New (ns) Improvement ------------------------------------------------------------------ BM_MatVec/64 1096 312 +71.5% BM_MatVec/128 4581 1464 +68.0% BM_MatVec/256 18534 5710 +69.2% BM_MatVec/512 118083 24162 +79.5% BM_MatVec/1k 704106 173346 +75.4% BM_MatVec/2k 3080828 742728 +75.9% BM_MatVec/4k 25421512 4530117 +82.2% BM_VecMat/32 352 130 +63.1% BM_VecMat/64 1213 425 +65.0% BM_VecMat/128 4640 1564 +66.3% BM_VecMat/256 17902 5884 +67.1% BM_VecMat/512 70466 24000 +65.9% BM_VecMat/1k 340150 161263 +52.6% BM_VecMat/2k 1420590 645576 +54.6% BM_VecMat/4k 8083859 4364327 +46.0% AVX2: Run on * (72 X 2993 MHz CPUs); 2019-01-28T17:45:11.508545307-08:00 CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB Benchmark Base (ns) New (ns) Improvement ------------------------------------------------------------------ BM_MatVec/64 619 120 +80.6% BM_MatVec/128 9693 752 +92.2% BM_MatVec/256 38356 2773 +92.8% BM_MatVec/512 69006 12803 +81.4% BM_MatVec/1k 443810 160378 +63.9% BM_MatVec/2k 2633553 646594 +75.4% BM_MatVec/4k 16211095 4327148 +73.3% BM_VecMat/64 925 227 +75.5% BM_VecMat/128 3438 830 +75.9% BM_VecMat/256 13427 2936 +78.1% BM_VecMat/512 53944 12473 +76.9% BM_VecMat/1k 302264 157076 +48.0% BM_VecMat/2k 1396811 675778 +51.6% BM_VecMat/4k 8962246 4459010 +50.2% AVX512: Run on *** (72 X 2993 MHz CPUs); 2019-01-28T17:35:17.239329863-08:00 CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB Benchmark Base (ns) New (ns) Improvement ------------------------------------------------------------------ BM_MatVec/64 401 111 +72.3% BM_MatVec/128 1846 513 +72.2% BM_MatVec/256 36739 1927 +94.8% BM_MatVec/512 54490 9227 +83.1% BM_MatVec/1k 487374 161457 +66.9% BM_MatVec/2k 2016270 643824 +68.1% BM_MatVec/4k 13204300 4077412 +69.1% BM_VecMat/32 324 106 +67.3% BM_VecMat/64 1034 246 +76.2% BM_VecMat/128 3576 802 +77.6% BM_VecMat/256 13411 2561 +80.9% BM_VecMat/512 58686 10037 +82.9% BM_VecMat/1k 320862 163750 +49.0% BM_VecMat/2k 1406719 651397 +53.7% BM_VecMat/4k 7785179 4124677 +47.0%	2019-01-31 14:24:08 -08:00
Gael Guennebaud	7ef879f6bf	GEBP: improves pipelining in the 1pX4 path with FMA. Prior to this change, a product with a LHS having 8 rows was faster with AVX-only than with AVX+FMA. With AVX+FMA I measured a speed up of about x1.25 in such cases.	2019-01-30 23:45:12 +01:00
Gael Guennebaud	de77bf5d6c	Fix compilation with ARM64.	2019-01-30 16:48:20 +01:00
Gael Guennebaud	eb4c6bb22d	Fix conflicts and merge	2019-01-30 15:57:08 +01:00
Gael Guennebaud	df12fae8b8	According to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101 , the previous GCC issue is fixed in GCC trunk (will be gcc 9).	2019-01-30 11:52:28 +01:00
Gael Guennebaud	3775926bba	ARM64 & GEBP: add specialization for double +30% speed up	2019-01-30 11:49:06 +01:00
Gael Guennebaud	be5b0f664a	ARM64 & GEBP: Make use of vfmaq_laneq_f32 and workaround GCC's issue in generating good ASM	2019-01-30 11:48:25 +01:00
Gael Guennebaud	8a06c699d0	bug #1669 : fix PartialPivLU/inverse with zero-sized matrices.	2019-01-29 10:27:13 +01:00
Gael Guennebaud	a2a07e62b9	Fix compilation with c++03 (local class cannot be template arguments), and make SparseMatrix::assignDiagonal truly protected.	2019-01-29 10:10:07 +01:00
Gael Guennebaud	f489f44519	bug #1574 : implement "sparse_matrix =,+=,-= diagonal_matrix" with smart insertion strategies of missing diagonal coeffs.	2019-01-28 17:29:50 +01:00
Gael Guennebaud	803fa79767	Move evaluator<SparseCompressedBase>::find(i,j) to a more general and reusable SparseCompressedBase::lower_bound(i,j) function	2019-01-28 17:24:44 +01:00
Christoph Hertzberg	5a52e35f9a	Renaming some more `I` identifiers	2019-01-26 13:18:21 +01:00
Rasmus Munk Larsen	71429883ee	Fix compilation error in NEON GEBP specialization of madd.	2019-01-25 17:00:21 -08:00
Gael Guennebaud	ec8a387972	cleanup	2019-01-24 10:24:45 +01:00
David Tellenbach	237b03b372	PR 574: use variadic template instead of initializer_list to implement fixed-size vector ctor from coefficients.	2019-01-23 00:07:19 +01:00
Gael Guennebaud	80f81f9c4b	Cleanup SFINAE in Array/Matrix(initializer_list) ctors and minor doc editing.	2019-01-22 17:08:47 +01:00
David Tellenbach	db152b9ee6	PR 572: Add initializer list constructors to Matrix and Array (include unit tests and doc) - {1,2,3,4,5,...} for fixed-size vectors only - {{1,2,3},{4,5,6}} for the general cases - {{1,2,3,4,5,....}} is allowed for both row and column-vector	2019-01-21 16:25:57 +01:00
nluehr	92774f0275	Replace host_define.h with cuda_runtime_api.h	2019-01-18 16:10:09 -06:00
Christoph Hertzberg	da0a41b9ce	Mask unused-parameter warnings, when building with NDEBUG	2019-01-18 10:41:14 +01:00
Rasmus Munk Larsen	2eccbaf3f7	Add missing logical packet ops for GPU and NEON.	2019-01-17 17:45:08 -08:00
Gael Guennebaud	ee3662abc5	Remove some useless const_cast	2019-01-17 18:27:49 +01:00
Gael Guennebaud	0fe6b7d687	Make nestByValue work again (broken since 3.3) and add unit tests.	2019-01-17 18:27:25 +01:00
Gael Guennebaud	4b7cf7ff82	Extend reshaped unit tests and remove useless const_cast	2019-01-17 17:35:32 +01:00
Gael Guennebaud	b57c9787b1	Cleanup useless const_cast and add missing broadcast assignment tests	2019-01-17 16:55:42 +01:00
Gael Guennebaud	be05d0030d	Make FullPivLU use conjugateIf<>	2019-01-17 12:01:00 +01:00
Patrick Peltzer	15e53d5d93	PR 567: makes all dense solvers inherit SolverBase (LU,Cholesky,QR,SVD). This changeset also includes: * add HouseholderSequence::conjugateIf * define int as the StorageIndex type for all dense solvers * dedicated unit tests, including assertion checking * _check_solve_assertion(): this method can be implemented in derived solver classes to implement custom checks * CompleteOrthogonalDecompositions: add applyZOnTheLeftInPlace, fix scalar type in applyZAdjointOnTheLeftInPlace(), add missing assertions * Cholesky: add missing assertions * FullPivHouseholderQR: Corrected Scalar type in _solve_impl() * BDCSVD: Unambiguous return type for ternary operator * SVDBase: Corrected Scalar type in _solve_impl()	2019-01-17 01:17:39 +01:00
Gael Guennebaud	7f32109c11	Add conjugateIf<bool> members to DenseBase, TriangularView, SelfadjointView, and make PartialPivLU use it.	2019-01-17 11:33:43 +01:00
Gael Guennebaud	562985bac4	bug #1646 : fix false aliasing detection for A.row(0) = A.col(0); This changeset completely disables the detection for vectors, for which our current mechanism cannot detect any positive aliasing anyway.	2019-01-17 00:14:27 +01:00
Rasmus Munk Larsen	7401e2541d	Fix compilation error for logical packet ops with older compilers.	2019-01-16 14:43:33 -08:00
Gael Guennebaud	0f028f61cb	GEBP: fix swapped kernel mode with AVX512 and complex scalars	2019-01-16 22:26:38 +01:00
Gael Guennebaud	e118ce86fd	GEBP: cleanup logic to choose between 4 packets or 1 packet	2019-01-16 21:47:42 +01:00
Gael Guennebaud	70e133333d	bug #1661 : fix regression in GEBP and AVX512	2019-01-16 21:22:20 +01:00
Gael Guennebaud	502f717980	bug #1646 : disable aliasing detection for empty and 1x1 expression	2019-01-16 14:33:45 +01:00
Gael Guennebaud	0b466b6933	bug #1633 : use proper type for madd temporaries, factorize RhsPacketx4.	2019-01-16 13:50:13 +01:00
Renjie Liu	dbfcceabf5	Bug: 1633: refactor gebp kernel and optimize for neon	2019-01-16 12:51:36 +08:00
Gael Guennebaud	2b70b2f570	Make Transform::rotation() an alias to Transform::linear() in the case of an Isometry	2019-01-15 22:50:42 +01:00
Gael Guennebaud	2c2c114995	Silence maybe-uninitialized warnings by gcc	2019-01-15 16:53:15 +01:00
Gael Guennebaud	6ec6bf0b0d	Enable visitor on empty matrices (the visitor is left unchanged), and protect min/maxCoeff(Index,Index) on empty matrices by an assertion (+ doc & unit tests)	2019-01-15 15:21:14 +01:00
Gael Guennebaud	027e44ed24	bug #1592 : makes partial min/max reductions trigger an assertion on inputs with a zero reduction length (+doc and tests)	2019-01-15 15:13:24 +01:00
Gael Guennebaud	f8bc5cb39e	Fix detection of vector-at-time: use Rows/Cols instead of MaxRow/MaxCols. This fixes VectorXd(n).middleCol(0,0).outerSize() which was equal to 1.	2019-01-15 15:09:49 +01:00
Gael Guennebaud	6cf7afa3d9	Typo	2019-01-15 11:04:37 +01:00
Rasmus Larsen	7b3aab0936	Merged in rmlarsen/eigen (pull request PR-570) Add support for inverse hyperbolic functions. Fix cost of division.	2019-01-14 21:31:33 +00:00
Gael Guennebaud	250dcd1fdb	bug #1652 : fix position of EIGEN_ALIGN16 attributes in Neon and Altivec	2019-01-14 21:45:56 +01:00
Rasmus Larsen	5a59452aae	Merged eigen/eigen into default	2019-01-14 10:23:23 -08:00
Gael Guennebaud	3c9e6d206d	AVX512: fix pgather/pscatter for Packet4cd and unaligned pointers	2019-01-14 17:57:28 +01:00
Gael Guennebaud	61b6eb05fe	AVX512 (r)sqrt(double) was mistakenly disabled with clang and others	2019-01-14 17:28:47 +01:00
Gael Guennebaud	ccddeaad90	fix warning	2019-01-14 16:51:16 +01:00
Gael Guennebaud	d4881751d3	Doc: add Isometry in the list of supported Mode of Transform<>	2019-01-14 16:38:26 +01:00
Greg Coombe	9d988a1e1a	Initialize isometric transforms like affine transforms. The isometric transform, like the affine transform, has an implicit last row of [0, 0, 0, 1]. This was not being properly initialized, as verified by a new test function.	2019-01-11 23:14:35 -08:00
Gael Guennebaud	4356a55a61	PR 571: Implements an accurate argument reduction algorithm for huge inputs of sin/cos and call it instead of falling back to std::sin/std::cos. This makes both the small and huge argument cases faster because: - for small inputs this removes the last pselect - for large inputs only the reduction part follows a scalar path, the rest use the same SIMD path as the small-argument case.	2019-01-14 13:54:01 +01:00
Gael Guennebaud	f566724023	Fix StorageIndex FIXME in dense LU solvers	2019-01-13 17:54:30 +01:00
Rasmus Munk Larsen	1c6e6e2c3f	Merge.	2019-01-11 17:47:11 -08:00
Rasmus Munk Larsen	28ba1b2c32	Add support for inverse hyperbolic functions. Fix cost of division.	2019-01-11 17:45:37 -08:00
Rasmus Munk Larsen	89c4001d6f	Fix warnings in ptrue for complex and half types.	2019-01-11 14:10:57 -08:00
Rasmus Munk Larsen	a49d01edba	Fix warnings in ptrue for complex and half types.	2019-01-11 13:18:17 -08:00
Rasmus Munk Larsen	df29511ac0	Fix merge.	2019-01-11 10:36:36 -08:00
Rasmus Munk Larsen	9396ace46b	Merge.	2019-01-11 10:28:52 -08:00
Rasmus Larsen	74882471d0	Merged eigen/eigen into default	2019-01-11 10:20:55 -08:00
Gael Guennebaud	9005f0111f	Replace compiler's alignas/alignof extension by respective c++11 keywords when available. This also fixes a compilation issue with gcc-4.7.	2019-01-11 17:10:54 +01:00
Mark D Ryan	3c9add6598	Remove reinterpret_cast from AVX512 complex implementation The reinterpret_casts used in ptranspose(PacketBlock<Packet8cf,4>&) ptranspose(PacketBlock<Packet8cf,8>&) don't appear to be working correctly. They're used to convert the kernel parameters to PacketBlock<Packet8d,T>& so that the complex number versions of ptranspose can be written using the existing double implementations. Unfortunately, they don't seem to work and are responsible for 9 unit test failures in the AVX512 build of tensorflow master. This commit fixes the issue by manually initialising PacketBlock<Packet8d,T> variables with the contents of the kernel parameter before calling the double version of ptranspose, and then copying the resulting values back into the kernel parameter before returning.	2019-01-11 14:02:09 +01:00
Rasmus Munk Larsen	fcfced13ed	Rename pones -> ptrue. Use _CMP_TRUE_UQ where appropriate.	2019-01-09 17:20:33 -08:00
Rasmus Munk Larsen	e15bb785ad	Collapsed revision * Add packet op "pones". Write pnot(a) as pxor(pones(a), a). * Collapsed revision * Simplify a bit. * Undo useless diffs. * Fix typo.	2019-01-09 16:34:23 -08:00
Rasmus Munk Larsen	f6ba6071c5	Fix typo.	2019-01-09 16:34:23 -08:00
Rasmus Munk Larsen	8f04442526	Collapsed revision * Collapsed revision * Add packet op "pones". Write pnot(a) as pxor(pones(a), a). * Collapsed revision * Simplify a bit. * Undo useless diffs. * Fix typo.	2019-01-09 16:34:23 -08:00
Rasmus Munk Larsen	e00521b514	Undo useless diffs.	2019-01-09 16:32:53 -08:00
Rasmus Munk Larsen	f2767112c8	Simplify a bit.	2019-01-09 16:29:18 -08:00
Rasmus Munk Larsen	cb955df9a6	Add packet op "pones". Write pnot(a) as pxor(pones(a), a).	2019-01-09 16:17:08 -08:00
Rasmus Larsen	cb3c059fa4	Merged eigen/eigen into default	2019-01-09 15:04:17 -08:00
Gael Guennebaud	d812f411c3	bug #1654 : fix compilation with cuda and no c++11	2019-01-09 18:00:05 +01:00
Gael Guennebaud	3492a1ca74	fix plog(+inf) with AVX512	2019-01-09 16:53:37 +01:00
Gael Guennebaud	47810cf5b7	Add dedicated implementations of predux_any for AVX512, NEON, and Altivec/VSX	2019-01-09 16:40:42 +01:00
Gael Guennebaud	3f14e0d19e	fix warning	2019-01-09 15:45:21 +01:00
Gael Guennebaud	aeec68f77b	Add missing pcmp_lt and others for AVX512	2019-01-09 15:36:41 +01:00
Gael Guennebaud	e6b217b8dd	bug #1652 : implements a much more accurate version of vectorized sin/cos. This new version achieves the same speed for SSE/AVX, and is slightly faster with FMA. Guarantees are as follows: - no FMA: 1ULP up to 3pi, 2ULP up to sin(25966) and cos(18838), fallback to std::sin/cos for larger inputs - FMA: 1ULP up to sin(117435.992) and cos(71476.0625), fallback to std::sin/cos for larger inputs	2019-01-09 15:25:17 +01:00
Rasmus Munk Larsen	055f0b73db	Add support for pcmp_eq and pnot, including for complex types.	2019-01-07 16:53:36 -08:00
Eugene Zhulenev	190d053e41	Explicitly set fill character when printing aligned data to ostream	2019-01-03 14:55:28 -08:00
Mark D Ryan	bc5dd4cafd	PR560: Fix the AVX512f only builds Commit `c53eececb0` introduced AVX512 support for complex numbers but required avx512dq to build. Commit `1d683ae2f5` fixed some, but it would seem not all, of the hard avx512dq dependencies. Build failures are still evident on Eigen and TensorFlow when compiling with just avx512f and no avx512dq using gcc 7.3. Looking at the code there does indeed seem to be a problem. Commit `c53eececb0` calls avx512dq intrinsics directly, e.g., _mm512_extractf32x8_ps and _mm512_and_ps. This commit fixes the issue by replacing the direct intrinsic calls with the various wrapper functions that are safe to use on avx512f only builds.	2019-01-03 14:33:04 +01:00
Gael Guennebaud	60d3fe9a89	One more stupid AVX 512 fix (I don't have direct access to AVX512 machines)	2018-12-24 13:05:03 +01:00
Gael Guennebaud	4aa667b510	Add EIGEN_STRONG_INLINE where required	2018-12-24 10:45:01 +01:00
Gael Guennebaud	961ff567e8	Add missing pcmp_lt_or_nan for AVX512	2018-12-23 22:13:29 +01:00
Gael Guennebaud	0f6f75bd8a	Implement a faster fix for sin/cos of large entries that also correctly handles INF input.	2018-12-23 17:26:21 +01:00
Gael Guennebaud	38d704def8	Make sure that psin/pcos return number in [-1,1] for large inputs (though sin/cos on large entries is quite useless because it's inaccurate)	2018-12-23 16:13:24 +01:00
Gael Guennebaud	5713fb7feb	Fix plog(+INF): it returned ~87 instead of +INF	2018-12-23 15:40:52 +01:00
Christoph Hertzberg	6dd93f7e3b	Make code compile again for older compilers. See https://stackoverflow.com/questions/7411515/	2018-12-22 13:09:07 +01:00
Gustavo Lima Chaves	1024a70e82	gebp: Add new ½ and ¼ packet rows per (peeling) round on the lhs The patch works by altering the gebp lhs packing routines to also consider ½ and ¼ packet length rows when packing, besides the original whole package and row-by-row attempts. Finally, gebp itself will try to fit a fraction of a packet at a time if: i) ½ and/or ¼ packets are available for the current context (e.g. AVX2 and SSE-sized SIMD register for x86) ii) The matrix's height is favorable to it (it may be it's too small in that dimension to take full advantage of the current/maximum packet width or it may be the case that last rows may take advantage of smaller packets before gebp goes row-by-row) This helps mitigate huge slowdowns one had on AVX512 builds when compared to AVX2 ones, for some dimensions. Gains top at an extra 1x in throughput. This patch is a complement to changeset `4ad359237a` . Since packing is changed, Eigen users which would go for very low-level API usage, like TensorFlow, will have to be adapted to work fine with the changes.	2018-12-21 11:03:18 -08:00
Gustavo Lima Chaves	e763fcd09e	Introducing "vectorized" byte on unpacket_traits structs This is a preparation to a change on gebp_traits, where a new template argument will be introduced to dictate the packet size, so it won't be bound to the current/max packet size only anymore. By having packet types defined early on gebp_traits, one has now to act on packet types, not scalars anymore, for the enum values defined on that class. One approach for reaching the vectorizable/size properties one needs there could be getting the packet's scalar again with unpacket_traits<>, then the size/Vectorizable enum entries from packet_traits<>. It turns out guards like "#ifndef EIGEN_VECTORIZE_AVX512" at AVX/PacketMath.h will hide smaller packet variations of packet_traits<> for some types (and it makes sense to keep that). In other words, one can't go back to the scalar and create a new PacketType, as this will always lead to the maximum packet type for the architecture. The less costly/invasive solution for that, thus, is to add the vectorizable info on every unpacket_traits struct as well.	2018-12-19 14:24:44 -08:00
Gael Guennebaud	efa4c9c40f	bug #1615 : slightly increase the default unrolling limit to compensate for changeset `101ea26f5e` . This solves a performance regression with clang and 3x3 matrix products.	2018-12-13 10:42:39 +01:00
Gael Guennebaud	f582ea3579	Fix compilation with expression template scalar type.	2018-12-12 22:47:00 +01:00
Gael Guennebaud	2de8da70fd	bug #1557 : fix RealSchur and EigenSolver for matrices with only zeros on the diagonal.	2018-12-12 17:30:08 +01:00
Gael Guennebaud	37c91e1836	bug #1644 : fix warning	2018-12-11 22:07:20 +01:00
Gael Guennebaud	f159cf3d75	Artificially increase l1-blocking size for AVX512. +10% speedup with current kernels. With a 6pX4 kernel (not committed yet), this provides a +20% speedup.	2018-12-11 15:36:27 +01:00
Gael Guennebaud	0a7e7af6fd	Properly set the number of registers for AVX512	2018-12-11 15:33:17 +01:00
Gael Guennebaud	7166496f70	bug #1643 : fix compilation issue with gcc and no optimization	2018-12-11 13:24:42 +01:00
Gael Guennebaud	0d90637838	enable spilling workaround on architectures with SSE/AVX	2018-12-10 23:22:44 +01:00
Gael Guennebaud	bff90bf270	workaround "may be used uninitialized" warning	2018-12-08 18:58:28 +01:00
Gael Guennebaud	81c27325ae	bug #1641 : fix testing of pandnot and fix pandnot for complex on SSE/AVX/AVX512	2018-12-08 14:27:48 +01:00
Gael Guennebaud	426bce7529	fix EIGEN_GEBP_2PX4_SPILLING_WORKAROUND for non vectorized type, and non x86/64 target	2018-12-08 09:44:21 +01:00
Gael Guennebaud	956678a4ef	bug #1515 : disable gebp's 3pX4 micro kernel for MSVC<=19.14 because of register spilling.	2018-12-07 18:03:36 +01:00
Gael Guennebaud	7b6d0ff1f6	Enable FMA with MSVC (through /arch:AVX2). To make this possible, I also had to turn the #warning regarding AVX512-FMA to a #error.	2018-12-07 15:14:50 +01:00
Gael Guennebaud	f233c6194d	bug #1637 : workaround register spilling in gebp with clang>=6.0+AVX+FMA	2018-12-07 10:01:09 +01:00
Gael Guennebaud	ae59a7652b	bug #1638 : add a warning if avx512 is enabled without SSE/AVX FMA	2018-12-07 09:23:28 +01:00
Gael Guennebaud	4e7746fe22	bug #1636 : fix gemm performance issue with gcc>=6 and no FMA	2018-12-07 09:15:46 +01:00
Gael Guennebaud	cbf2f4b7a0	AVX512f includes FMA but GCC does not define __FMA__ with -mavx512f only	2018-12-06 18:21:56 +01:00
Gael Guennebaud	1d683ae2f5	Fix compilation with avx512f only, i.e., no AVX512DQ	2018-12-06 18:11:07 +01:00
Gael Guennebaud	c53eececb0	Implement AVX512 vectorization of std::complex<float/double>	2018-12-06 15:58:06 +01:00
Gael Guennebaud	3fba59ea59	temporarily re-disable SSE/AVX vectorization of complex<> on AVX512 -> this needs to be fixed though!	2018-12-06 00:13:26 +01:00
Gael Guennebaud	1ac2695ef7	bug #1636 : fix compilation with some ABI versions.	2018-12-06 00:05:10 +01:00
Rasmus Munk Larsen	47d8b741b2	#elif -> #else to fix GPU build.	2018-12-05 13:19:31 -08:00
Christoph Hertzberg	c1d356e8b4	bug #1635 : Use infinity from Numtraits instead of creating it manually.	2018-12-05 15:01:04 +01:00
Rasmus Munk Larsen	b57b31cce9	Merged in ezhulenev/eigen-01 (pull request PR-553) Do not disable alignment with EIGEN_GPUCC Approved-by: Rasmus Munk Larsen <rmlarsen@google.com>	2018-12-04 23:47:19 +00:00
Eugene Zhulenev	0bb15bb6d6	Update checks in ConfigureVectorization.h	2018-12-03 17:10:40 -08:00
Eugene Zhulenev	fd0fbfa9b5	Do not disable alignment with EIGEN_GPUCC	2018-12-03 15:54:10 -08:00
Christoph Hertzberg	919414b9fe	bug #785 : Make Cholesky decomposition work for empty matrices	2018-12-03 16:18:15 +01:00
Gael Guennebaud	0ea7ae7213	Add missing padd for Packet8i (it was implicitly generated by clang and gcc)	2018-11-30 21:52:25 +01:00
Gael Guennebaud	ab4df3e6ff	bug #1634 : remove double copy in move-ctor of non movable Matrix/Array	2018-11-30 21:25:51 +01:00
Gael Guennebaud	c785464430	Add packet sin and cos to Altivec/VSX and NEON	2018-11-30 16:21:33 +01:00
Gael Guennebaud	69ace742be	Several improvements regarding packet-bitwise operations: - add unit tests - optimize their AVX512f implementation - add missing implementations (half, Packet4f, ...)	2018-11-30 15:56:08 +01:00
Gael Guennebaud	fa87f9d876	Add psin/pcos on AVX512 -> almost for free, at last!	2018-11-30 14:33:13 +01:00
Gael Guennebaud	c68bd2fa7a	Cleanup	2018-11-30 14:32:31 +01:00
Gael Guennebaud	f91500d303	Fix pandnot order in AVX512	2018-11-30 14:32:06 +01:00
Gael Guennebaud	b477d60bc6	Extend the generic psin_float code to handle cosine and make SSE and AVX use it (-> this adds pcos for AVX)	2018-11-30 11:26:30 +01:00
Gael Guennebaud	e19ece822d	Disable fma gcc's workaround for gcc >= 8 (based on GEMM benchmarks)	2018-11-28 17:56:24 +01:00
Gael Guennebaud	41052f63b7	same for pmax	2018-11-28 17:17:28 +01:00
Gael Guennebaud	3e95e398b6	pmin/pmax on SSE: make sure to use AVX instruction with AVX enabled, and disable gcc workaround for fixed gcc versions	2018-11-28 17:14:20 +01:00
Gael Guennebaud	aa6097395b	Add missing SSE/AVX type-casting in AVX512 mode	2018-11-28 16:09:08 +01:00
Gael Guennebaud	48fe78c375	bug #1630 : fix linspaced when requesting smaller packet size than default one.	2018-11-28 13:15:06 +01:00
Eugene Zhulenev	80f1651f35	Use explicit packet type in SSE/PacketMath pldexp	2018-11-27 17:25:49 -08:00
Benoit Jacob	a4159dba08	do not read buffers out of bounds -- load only the 4 bytes we know exist here. Could also have done a vld1_lane_f32 but doing so here, without the overhead of initializing the unused lane, would have triggered use-of-uninitialized-value errors in tools such as ASan. Note that this code is sub-optimal before or after this change: we should be reading either 2 or 4 float32 values per load-instruction (2 for ARM in-order cores with an affinity for 8-byte loads; 4 for ARM out-of-order cores able to dual-issue 16-byte load instructions with arithmetic instructions). Before or after this patch, we are only loading 4 bytes of useful data here (even if before this patch, we were technically loading 8, only to use only the 4 first).	2018-11-27 16:53:14 -05:00
Gael Guennebaud	b131a4db24	bug #1631 : fix compilation with ARM NEON and clang, and cleanup the weird pshiftright_and_cast and pcast_and_shiftleft functions.	2018-11-27 23:45:00 +01:00
Gael Guennebaud	a1a5fbbd21	Update pshiftleft to pass the shift as a true compile-time integer.	2018-11-27 22:57:30 +01:00
Gael Guennebaud	fa7fd61eda	Unify SSE/AVX psin functions. It is based on the SSE version which is much more accurate, though very slightly slower. This changeset also includes the following required changes: - add packet-float to packet-int type traits - add packet float<->int reinterpret casts - add faster pselect for AVX based on blendv	2018-11-27 22:41:51 +01:00
Benoit Jacob	7b1cb8a440	fix the build on 64-bit ARM when NEON is disabled	2018-11-27 11:11:02 -05:00
Gael Guennebaud	b5695a6008	Unify Altivec/VSX pexp(double) with default implementation	2018-11-27 13:53:05 +01:00
Gael Guennebaud	7655a8af6e	cleanup	2018-11-26 23:21:29 +01:00
Gael Guennebaud	502f92fa10	Unify SSE and AVX pexp for double.	2018-11-26 23:12:44 +01:00
Gael Guennebaud	4a347a0054	Unify NEON's pexp with generic implementation	2018-11-26 22:15:44 +01:00
Gael Guennebaud	5c8406babc	Unify Altivec/VSX's pexp with generic implementation	2018-11-26 16:47:13 +01:00
Gael Guennebaud	cf8b85d5c5	Unify SSE and AVX implementation of pexp	2018-11-26 16:36:19 +01:00
Gael Guennebaud	c2f35b1b47	Unify Altivec/VSX's plog with generic implementation, and enable it!	2018-11-26 15:58:11 +01:00
Gael Guennebaud	c24e98e6a8	Unify NEON's plog with generic implementation	2018-11-26 15:02:16 +01:00
Gael Guennebaud	2c44c40114	First step toward a unification of packet log implementation, currently only SSE and AVX are unified. To this end, I added the following functions: pzero, pcmp_*, pfrexp, pset1frombits.	2018-11-26 14:21:24 +01:00
Gael Guennebaud	5f6045077c	Make SSE/AVX pandnot(A,B) consistent with generic version, i.e., "A and not B"	2018-11-26 14:14:07 +01:00
Gael Guennebaud	0836a715d6	bug #1611 : fix plog(0) on NEON	2018-11-26 09:08:38 +01:00
Patrik Huber	95566eeed4	Fix typos	2018-11-23 22:22:14 +00:00
Gael Guennebaud	ccabdd88c9	Fix reserved usage of double __ in macro names	2018-11-23 16:01:47 +01:00
Gael Guennebaud	a7842daef2	Fix several uninitialized members from ctor	2018-11-23 15:10:28 +01:00
Gael Guennebaud	a476054879	bug #1624 : improve matrix-matrix product on ARM 64, 20% speedup	2018-11-23 10:25:19 +01:00
Gael Guennebaud	4b2cebade8	Workaround weird MSVC bug	2018-11-21 15:53:37 +01:00
Deven Desai	e7e6809e6b	ROCm/HIP specific fixes + updates 1. Eigen/src/Core/arch/GPU/Half.h Updating the HIPCC implementation half so that it can be declared as a __shared__ variable 2. Eigen/src/Core/util/Macros.h, Eigen/src/Core/util/Memory.h introducing a EIGEN_USE_STD(func) macro that calls - std::func by default - ::func when eigen is being compiled with HIPCC This change was requested in the previous HIP PR (https://bitbucket.org/eigen/eigen/pull-requests/518/pr-with-hip-specific-fixes-for-the-eigen/diff) 3. unsupported/Eigen/CXX11/src/Tensor/TensorDeviceThreadPool.h Removing EIGEN_DEVICE_FUNC attribute from pure virtual methods as it is not supported by HIPCC 4. unsupported/Eigen/CXX11/src/Tensor/TensorReduction.h Disabling the template specializations of InnerMostDimReducer as they run into HIPCC link errors	2018-11-19 18:13:59 +00:00
Gael Guennebaud	6a510fe69c	Make MaxPacketSize a true upper bound, even for fixed-size inputs	2018-11-16 11:25:32 +01:00
Mark D Ryan	670d56441c	PR 544: Set requestedAlignment correctly for SliceVectorizedTraversals Commit `aa110e681b` optimised the multiplication of small dynamically sized matrices by restricting the packet size to a maximum of 4, increasing the chances that SIMD instructions are used in the computation. However, it introduced a mismatch between the packet size and the requestedAlignment. This mismatch can lead to crashes when the destination is not aligned. This patch fixes the issue by ensuring that the AssignmentTraits are correctly computed when using a restricted packet size. * * * Bind LinearPacketType to MaxPacketSize This commit applies any packet size limit specified when instantiating copy_using_evaluator_traits to the LinearPacketType, providing that the size of the destination is not known at compile time. * * * Add unit test for restricted packet assignment A new unit test is added to check that multiplication of small dynamically sized matrices works correctly when the packet size is restricted to 4 and the destination is unaligned.	2018-11-13 16:15:08 +01:00
Nikolaus Demmel	3dc0845046	Fix typo in comment on EIGEN_MAX_STATIC_ALIGN_BYTES	2018-11-14 18:11:30 +01:00
Gael Guennebaud	7fddc6a51f	typo	2018-11-14 14:43:18 +01:00
Gael Guennebaud	449f948b2a	help doxygen linking to DenseBase::NullaryExpr	2018-11-14 14:42:59 +01:00
luz.paz	f67b19a884	[PATCH 1/2] Misc. typos From 68d431b4c14ad60a778ee93c1f59ecc4b931950e Mon Sep 17 00:00:00 2001 Found via `codespell -q 3 -I ../eigen-word-whitelist.txt` where the whitelists consists of: ``` als ans cas dum lastr lowd nd overfl pres preverse substraction te uint whch ``` --- CMakeLists.txt \| 26 +++++++++---------- Eigen/src/Core/GenericPacketMath.h \| 2 +- Eigen/src/SparseLU/SparseLU.h \| 2 +- bench/bench_norm.cpp \| 2 +- doc/HiPerformance.dox \| 2 +- doc/QuickStartGuide.dox \| 2 +- .../Eigen/CXX11/src/Tensor/TensorChipping.h \| 6 ++--- .../Eigen/CXX11/src/Tensor/TensorDeviceGpu.h \| 2 +- .../src/Tensor/TensorForwardDeclarations.h \| 4 +-- .../src/Tensor/TensorGpuHipCudaDefines.h \| 2 +- .../Eigen/CXX11/src/Tensor/TensorReduction.h \| 2 +- .../CXX11/src/Tensor/TensorReductionGpu.h \| 2 +- .../test/cxx11_tensor_concatenation.cpp \| 2 +- unsupported/test/cxx11_tensor_executor.cpp \| 2 +- 14 files changed, 29 insertions(+), 29 deletions(-)	2018-09-18 04:15:01 -04:00
Rasmus Munk Larsen	77b447c24e	Add optimized version of logistic function for float. As an example, this is about 50% faster than the existing version on Haswell using AVX.	2018-11-12 13:42:24 -08:00
Gael Guennebaud	0105146915	Fix warning in c++03	2018-11-10 09:11:38 +01:00
Gael Guennebaud	784a3f13cf	bug #1619 : fix mixing of const and non-const generic iterators	2018-11-09 21:45:10 +01:00
Gael Guennebaud	db9a9a12ba	bug #1619 : make const and non-const iterators compatible	2018-11-09 16:49:19 +01:00
Gael Guennebaud	bd9a00718f	Let doxygen see lastN	2018-11-09 11:35:48 +01:00
Gael Guennebaud	a368848473	Recent xcode versions do support EIGEN_HAS_STATIC_ARRAY_TEMPLATE	2018-11-09 10:33:17 +01:00
Gael Guennebaud	f62a0f69c6	Fix max-size in indexed-view	2018-11-08 18:40:22 +01:00
Gael Guennebaud	bf495859ff	Merged in glchaves/eigen (pull request PR-539) Vectorize row-by-row gebp loop iterations on 16 packets as well	2018-11-07 07:21:15 +00:00
Gustavo Lima Chaves	4ad359237a	Vectorize row-by-row gebp loop iterations on 16 packets as well Signed-off-by: Gustavo Lima Chaves <gustavo.lima.chaves@intel.com> Signed-off-by: Mark D. Ryan <mark.d.ryan@intel.com>	2018-11-06 10:48:42 -08:00
Matthieu Vigne	8d7a73e48e	bug #1617 : Fix SolveTriangular.solveInPlace crashing for empty matrix. This made FullPivLU.kernel() crash when used on the zero matrix. Add unit test for FullPivLU.kernel() on the zero matrix.	2018-10-31 20:28:18 +01:00
Christoph Hertzberg	66b28e290d	bug #1618 : Use different power-of-2 check to avoid MSVC warning	2018-11-01 13:23:19 +01:00
Christian von Schultz	4a40b3785d	Collapsed revision (based on pull request PR-325) * Support compiling without IO streams Add the preprocessor definition EIGEN_NO_IO which, if defined, disables all use of the IO streams part of the standard library.	2018-10-22 21:14:40 +02:00
Rasmus Munk Larsen	14054e217f	Do not rely on the compiler generating __device__ functions for constexpr in Cuda (via EIGEN_CONSTEXPR_ARE_DEVICE_FUNC). This breaks several targets in the TensorFlow Cuda build, e.g., INFO: From Compiling tensorflow/core/kernels/maxpooling_op_gpu.cu.cc: /b/f/w/run/external/eigen_archive/Eigen/src/Core/arch/GPU/Half.h(197): error: calling a __host__ function("std::equal_to<float> ::operator () const") from a __global__ function("tensorflow::_NV_ANON_NAMESPACE::MaxPoolGradBackwardNoMaskNHWC< ::Eigen::half> ") is not allowed /b/f/w/run/external/eigen_archive/Eigen/src/Core/arch/GPU/Half.h(197): error: identifier "std::equal_to<float> ::operator () const" is undefined in device code" /b/f/w/run/external/eigen_archive/Eigen/src/Core/arch/GPU/Half.h(197): error: calling a __host__ function("std::equal_to<float> ::operator () const") from a __global__ function("tensorflow::_NV_ANON_NAMESPACE::MaxPoolGradBackwardNoMaskNCHW< ::Eigen::half> ") is not allowed /b/f/w/run/external/eigen_archive/Eigen/src/Core/arch/GPU/Half.h(197): error: identifier "std::equal_to<float> ::operator () const" is undefined in device code 4 errors detected in the compilation of "/tmp/tmpxft_00000011_00000000-6_maxpooling_op_gpu.cu.cpp1.ii". ERROR: /tmpfs/tensor_flow/tensorflow/core/kernels/BUILD:3753:1: output 'tensorflow/core/kernels/_objs/pooling_ops_gpu/maxpooling_op_gpu.cu.pic.o' was not created ERROR: /tmpfs/tensor_flow/tensorflow/core/kernels/BUILD:3753:1: Couldn't build file tensorflow/core/kernels/_objs/pooling_ops_gpu/maxpooling_op_gpu.cu.pic.o: not all outputs were created or valid	2018-10-22 16:18:24 -07:00
Rasmus Munk Larsen	9caafca550	Merged in rmlarsen/eigen (pull request PR-532) Only set EIGEN_CONSTEXPR_ARE_DEVICE_FUNC for clang++ if cxx_relaxed_constexpr is available.	2018-10-19 21:37:14 +00:00
Christoph Hertzberg	449ff74672	Fix most Doxygen warnings. Also add links to stable documentation from unsupported modules (by using the corresponding Doxytags file). Manually grafted from `d107a371c6`	2018-10-19 21:10:28 +02:00
Rasmus Munk Larsen	d8f285852b	Only set EIGEN_CONSTEXPR_ARE_DEVICE_FUNC for clang++ if cxx_relaxed_constexpr is available.	2018-10-18 16:55:02 -07:00
Gael Guennebaud	0f780bb0b4	Fix float-to-double warning	2018-10-16 09:19:45 +02:00
Gael Guennebaud	a39e0f7438	bug #1612 : fix regression in "outer-vectorization" of partial reductions for PacketSize==1 (aka complex<double>)	2018-10-16 01:04:25 +02:00
Gael Guennebaud	d2d570c116	Remove useless (and broken) resize	2018-10-16 00:42:48 +02:00
Gael Guennebaud	f0fb95135d	Iterative solvers: unify and fix handling of multiple rhs. m_info was not properly computed and the logic was repeated in several places.	2018-10-15 23:47:46 +02:00
Gael Guennebaud	3a33db4de5	merge	2018-10-15 09:22:27 +02:00
Mark D Ryan	aa110e681b	PR 526: Speed up multiplication of small, dynamically sized matrices The Packet16f, Packet8f and Packet8d types are too large to use with dynamically sized matrices typically processed by the SliceVectorizedTraversal specialization of the dense_assignment_loop. Using these types is likely to lead to little or no vectorization. Significant slowdown in the multiplication of these small matrices can be observed when building with AVX and AVX512 enabled. This patch introduces a new dense_assignment_kernel that is used when computing small products whose operands have dynamic dimensions. It ensures that the PacketSize used is no larger than 4, thereby increasing the chance that vectorized instructions will be used when computing the product. I tested all 969 possible combinations of M, K, and N that are handled by the dense_assignment_loop on x86 builds. Although a few combinations are slowed down by this patch they are far outnumbered by the cases that are sped up, as the following results demonstrate. Disabling Packet8d on AVX512 builds: Total Cases: 969 Better: 511 Worse: 85 Same: 373 Max Improvement: 169.00% (4 8 6) Max Degradation: 36.50% (8 5 3) Median Improvement: 35.46% Median Degradation: 17.41% Total FLOPs Improvement: 19.42% Disabling Packet16f and Packet8f on AVX512 builds: Total Cases: 969 Better: 658 Worse: 5 Same: 306 Max Improvement: 214.05% (8 6 5) Max Degradation: 22.26% (16 2 1) Median Improvement: 60.05% Median Degradation: 13.32% Total FLOPs Improvement: 59.58% Disabling Packet8f on AVX builds: Total Cases: 969 Better: 663 Worse: 96 Same: 210 Max Improvement: 155.29% (4 10 5) Max Degradation: 35.12% (8 3 2) Median Improvement: 34.28% Median Degradation: 15.05% Total FLOPs Improvement: 26.02%	2018-10-12 15:20:21 +02:00
Eugene Zhulenev	d9392f9e55	Fix code format	2018-11-02 14:51:35 -07:00
Eugene Zhulenev	118520f04a	Workaround nvcc+msvc compiler bug	2018-11-02 14:48:28 -07:00
Christoph Hertzberg	24dc076519	Explicitly convert 0 to Scalar for custom types	2018-10-12 10:22:19 +02:00
Gael Guennebaud	43633fbaba	Fix warning with AVX512f	2018-10-11 10:13:48 +02:00
Gael Guennebaud	97e2c808e9	Fix avx512 plog(NaN) to return NaN instead of +inf	2018-10-11 10:13:13 +02:00
Gael Guennebaud	b3f66d29a5	Enable avx512 plog with clang	2018-10-11 10:12:21 +02:00
Gael Guennebaud	f0aa7e40fc	Fix regression in changeset `5335659c47`	2018-10-10 23:47:30 +02:00
Gael Guennebaud	ce243ee45b	bug #520 : add diagmat +/- diagmat operators.	2018-10-10 23:38:22 +02:00
Gael Guennebaud	5335659c47	Merged in ezhulenev/eigen-02 (pull request PR-525) Fix bug in partial reduction of expressions requiring evaluation	2018-10-10 20:59:00 +00:00
Gael Guennebaud	eec0dfd688	bug #632 : add specializations for res ?= dense +/- sparse and res ?= sparse +/- dense. They are rewritten as two compound assignments to bypass the hybrid dense-sparse iterator.	2018-10-10 22:50:15 +02:00
Eugene Zhulenev	8e6dc2c81d	Fix bug in partial reduction of expressions requiring evaluation	2018-10-10 13:23:52 -07:00
Eugene Zhulenev	2bf1a31d81	Use void type if stl-style iterators are not supported	2018-10-10 10:31:40 -07:00
Rasmus Munk Larsen	e8918743c1	Merged in ezhulenev/eigen-01 (pull request PR-523) Compile time detection for unimplemented stl-style iterators	2018-10-09 23:42:01 +00:00
Eugene Zhulenev	c0ca8a9fa3	Compile time detection for unimplemented stl-style iterators	2018-10-09 15:28:23 -07:00
Gael Guennebaud	1dd1f8e454	bug #65 : add vectorization of partial reductions along the outer-dimension, for instance: colmajor_mat.rowwise().mean()	2018-10-09 23:36:50 +02:00
Gael Guennebaud	bfa2a81a50	Make redux_vec_unroller more flexible regarding packet-type	2018-10-09 23:30:41 +02:00
Christoph Hertzberg	f6359ad795	Small Doxygen fixes	2018-10-09 19:33:35 +02:00
Gael Guennebaud	7a882c05ab	Fix compilation on CUDA	2018-10-09 17:02:16 +02:00
Gael Guennebaud	e00487f7d2	bug #1603 : add parentheses around ternary operator in function body as well as a harmless attempt to make MSVC happy.	2018-10-08 22:27:04 +02:00
Gael Guennebaud	649d4758a6	merge	2018-10-08 17:35:18 +02:00
Gael Guennebaud	774bb9d6f7	fix a doxygen issue	2018-10-08 09:30:15 +02:00
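
Note on commit `1dd1f8e454` (bug #65) above: the message mentions vectorizing partial reductions along the outer dimension, with `colmajor_mat.rowwise().mean()` as the example. The sketch below is not Eigen's implementation. It is a plain C++ illustration, under the assumption of contiguous column-major storage, of why such a reduction can be walked column by column so that the inner loop touches contiguous memory. The function name `rowwise_mean_colmajor` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Eigen: mean of each row of a
// column-major matrix stored contiguously in `data`
// (element (i, j) lives at data[i + j * rows]).
std::vector<double> rowwise_mean_colmajor(const std::vector<double>& data,
                                          std::size_t rows, std::size_t cols) {
  assert(data.size() == rows * cols);
  std::vector<double> acc(rows, 0.0);
  if (cols == 0) return acc;  // avoid dividing by zero for an empty matrix

  // Outer loop over columns, inner loop over rows: the inner loop reads
  // contiguous memory and updates one accumulator per row, which is the
  // access pattern a compiler (or hand-written packets) can vectorize.
  for (std::size_t j = 0; j < cols; ++j) {
    const double* col = data.data() + j * rows;
    for (std::size_t i = 0; i < rows; ++i) acc[i] += col[i];
  }

  // Turn the per-row sums into per-row means.
  for (std::size_t i = 0; i < rows; ++i) acc[i] /= static_cast<double>(cols);
  return acc;
}
```

For a 2x3 matrix with rows (1, 2, 3) and (4, 5, 6), the column-major buffer is `{1, 4, 2, 5, 3, 6}` and the helper returns the two row means. Reducing each row separately would instead stride through memory with step `rows`, which is the pattern the commit message suggests it avoids.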

6061 Commits