Commit Graph

10597 Commits

Author SHA1 Message Date
Gael Guennebaud
512b74aaa1 GEMM: catch all scalar-multiple variants when falling back to a coeff-based product.
Before, only s*A*B was caught, which was inconsistent with GEMM, sub-optimal,
and could even lead to compilation errors (https://stackoverflow.com/questions/54738495).
2019-02-18 11:47:54 +01:00
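For illustration, a minimal sketch (matrix names hypothetical) of the scalar-multiple product forms this changeset now treats uniformly:

  #include <Eigen/Dense>
  int main() {
    Eigen::Matrix2d A = Eigen::Matrix2d::Random();
    Eigen::Matrix2d B = Eigen::Matrix2d::Random();
    double s = 2.0;
    // Previously only the first form was recognized when falling back
    // to a coefficient-based (non-GEMM) product evaluation:
    Eigen::Matrix2d C1 = s * A * B;
    Eigen::Matrix2d C2 = A * s * B;
    Eigen::Matrix2d C3 = (A * B) * s;
    return 0;
  }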
Christoph Hertzberg
ec032ac03b Guard C++11-style default constructor; also note that this is only needed for MSVC 2019-02-16 09:44:05 +01:00
Gael Guennebaud
902a7793f7 Add possibility to bench row-major lhs and rhs 2019-02-15 16:52:34 +01:00
Gael Guennebaud
83309068b4 bug #1680: improve MSVC inlining by declaring many trivial constructors and accessors as STRONG_INLINE. 2019-02-15 16:35:35 +01:00
Gael Guennebaud
0505248f25 bug #1680: make all "block" methods strong-inline and device-functions (some were missing EIGEN_DEVICE_FUNC) 2019-02-15 16:33:56 +01:00
Gael Guennebaud
559320745e bug #1678: Fix lack of __FMA__ macro on MSVC with AVX512 2019-02-15 10:30:28 +01:00
Gael Guennebaud
d85ae650bf bug #1678: work around MSVC compilation issues with AVX512 2019-02-15 10:24:17 +01:00
Gael Guennebaud
f2970819a2 bug #1679: avoid possible division by 0 in complex-schur 2019-02-15 09:39:25 +01:00
Rasmus Munk Larsen
65e23ca7e9 Revert b55b5c7280
2019-02-14 13:46:13 -08:00
Rasmus Larsen
efeabee445 Merged in ezhulenev/eigen-01 (pull request PR-590)
Do not generate no-op cast() and conjugate() expressions
2019-02-14 21:16:12 +00:00
Eugene Zhulenev
7b837559a7 Fix signed-unsigned return in RunQueue 2019-02-14 10:40:21 -08:00
Eugene Zhulenev
f0d42d2265 Fix signed-unsigned comparison warning in RunQueue 2019-02-14 10:27:28 -08:00
Eugene Zhulenev
106ba7bb1a Do not generate no-op cast() and conjugate() expressions 2019-02-14 09:51:51 -08:00
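As a sketch of the affected expressions (assuming the unsupported Tensor module), casting to the same scalar type and conjugating a real-valued tensor are both identities:

  #include <unsupported/Eigen/CXX11/Tensor>
  int main() {
    Eigen::Tensor<float, 2> t(2, 3);
    t.setRandom();
    // Both expressions are identities for a real float tensor; after this
    // change no intermediate cast/conjugate expression node is generated.
    Eigen::Tensor<float, 2> a = t.cast<float>();  // same-type cast: no-op
    Eigen::Tensor<float, 2> b = t.conjugate();    // real scalars: no-op
    return 0;
  }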
Eugene Zhulenev
8c2f30c790 Speed up Tensor ThreadPool RunQueue::Empty() 2019-02-13 10:20:53 -08:00
Gael Guennebaud
bdcb5f3304 Let's properly use Score instead of std::abs, and remove a deprecated FIXME (a /= b computes a/b, not a * (1/b) as it did a long time ago...) 2019-02-11 22:56:19 +01:00
Gael Guennebaud
2edfc6807d Fix compilation of empty products of the form: Mx0 * 0xN 2019-02-11 18:24:07 +01:00
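A minimal repro sketch of the empty-product case (sizes hypothetical); mathematically, an Mx0 times 0xN product is an MxN zero matrix:

  #include <Eigen/Dense>
  int main() {
    Eigen::MatrixXd A(4, 0);    // M x 0
    Eigen::MatrixXd B(0, 3);    // 0 x N
    Eigen::MatrixXd C = A * B;  // 4 x 3, filled with zeros
    return C.isZero() ? 0 : 1;
  }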
Gael Guennebaud
eb46f34a8c Speed up 2x2 LU by a factor of 2, and other small fixed sizes by about 10%.
Not sure that's so critical, but this does not complicate the code base much.
2019-02-11 17:59:35 +01:00
Gael Guennebaud
dada863d23 Enable unit tests of PartialPivLU on fixed size matrices, and increase tested matrix size (blocking was not tested!) 2019-02-11 17:56:20 +01:00
Gael Guennebaud
ab6e6edc32 Speed up PartialPivLU for small matrices by passing compile-time sizes when available.
This changeset also makes better use of Map<>+OuterStride and Ref<>, yielding a surprising speed-up for small dynamic sizes as well.
The table below reports times in microseconds for 10 random matrices:
           | ------ float --------- | ------- double ------- |
     size  | before   after  ratio  |  before   after  ratio |
  fixed 1  | 0.34     0.11   2.93   |  0.35     0.11   3.06  |
  fixed 2  | 0.81     0.24   3.38   |  0.91     0.25   3.60  |
  fixed 3  | 1.49     0.49   3.04   |  1.68     0.55   3.01  |
  fixed 4  | 2.31     0.70   3.28   |  2.45     1.08   2.27  |
  fixed 5  | 3.49     1.11   3.13   |  3.84     2.24   1.71  |
  fixed 6  | 4.76     1.64   2.88   |  4.87     2.84   1.71  |
  dyn   1  | 0.50     0.40   1.23   |  0.51     0.40   1.26  |
  dyn   2  | 1.08     0.85   1.27   |  1.04     0.69   1.49  |
  dyn   3  | 1.76     1.26   1.40   |  1.84     1.14   1.60  |
  dyn   4  | 2.57     1.75   1.46   |  2.67     1.66   1.60  |
  dyn   5  | 3.80     2.64   1.43   |  4.00     2.48   1.61  |
  dyn   6  | 5.06     3.43   1.47   |  5.15     3.21   1.60  |
2019-02-11 13:58:24 +01:00
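A usage sketch of the fixed-size path benchmarked above (assuming the standard PartialPivLU API):

  #include <Eigen/Dense>
  int main() {
    Eigen::Matrix4d A = Eigen::Matrix4d::Random();
    Eigen::Vector4d b = Eigen::Vector4d::Random();
    // The compile-time 4x4 size is propagated into the decomposition.
    Eigen::PartialPivLU<Eigen::Matrix4d> lu(A);
    Eigen::Vector4d x = lu.solve(b);
    return 0;
  }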
Eugene Zhulenev
21eb97d3e0 Add PacketConv implementation for non-vectorizable src expressions 2019-02-08 15:47:25 -08:00
Eugene Zhulenev
1e36166ed1 Optimize TensorConversion evaluator: do not convert same type 2019-02-08 15:13:24 -08:00
Steven Peters
953ca5ba2f Spline.h: fix spelling "spang" -> "span" 2019-02-08 06:23:24 +00:00
Eugene Zhulenev
59998117bb Don't do parallel_pack if we can use thread_local memory in tensor contractions 2019-02-07 09:21:25 -08:00
Gael Guennebaud
013cc3a6b3 Make GEMM fall back to GEMV for runtime vectors.
This is a more general and simpler version of changeset 4c0fa6ce0f.
2019-02-07 16:24:09 +01:00
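A sketch of the case this targets: a dynamically-sized matrix whose runtime shape is a vector, so the product can be dispatched to GEMV rather than GEMM (sizes hypothetical):

  #include <Eigen/Dense>
  int main() {
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(100, 100);
    Eigen::MatrixXd v = Eigen::MatrixXd::Random(100, 1);  // a vector at runtime only
    Eigen::MatrixXd c(100, 1);
    c.noalias() = A * v;  // now dispatched to the GEMV kernel
    return 0;
  }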
Gael Guennebaud
fa2fcb4895 Backed out changeset 4c0fa6ce0f 2019-02-07 16:07:08 +01:00
Gael Guennebaud
b3c4344a68 bug #1676: work around GCC's bug in c++17 mode. 2019-02-07 15:21:35 +01:00
Rasmus Larsen
3091c03898 Merged in ezhulenev/eigen-01 (pull request PR-581)
Parallelize tensor contraction only by sharding dimension and use 'thread-local' memory for packing

Approved-by: Rasmus Larsen <rmlarsen@google.com>
Approved-by: Gael Guennebaud <g.gael@free.fr>
2019-02-05 22:45:20 +00:00
Eugene Zhulenev
8491127082 Do not reduce parallelism too much in contractions with a small number of threads 2019-02-04 12:59:33 -08:00
Eugene Zhulenev
eb21bab769 Parallelize tensor contraction only by sharding dimension and use 'thread-local' memory for packing 2019-02-04 10:43:16 -08:00
Eugene Zhulenev
6d0f6265a9 Remove duplicated comment line 2019-02-04 10:30:25 -08:00
Eugene Zhulenev
690b2c45b1 Fix GeneralBlockPanelKernel Android compilation 2019-02-04 10:29:15 -08:00
Gael Guennebaud
871e2e5339 bug #1674: disable GCC's unsafe-math-optimizations in sin/cos vectorization (results are completely wrong otherwise) 2019-02-03 08:54:47 +01:00
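The affected user-level code is ordinary vectorized sin/cos over arrays; a minimal sketch (the wrong results appeared only when GCC's unsafe-math-optimizations were active):

  #include <Eigen/Dense>
  int main() {
    Eigen::ArrayXf x = Eigen::ArrayXf::LinSpaced(16, 0.0f, 6.28f);
    // These calls use the vectorized sin/cos kernels that this changeset
    // now guards against GCC's unsafe-math-optimizations.
    Eigen::ArrayXf s = x.sin();
    Eigen::ArrayXf c = x.cos();
    return 0;
  }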
Rasmus Larsen
e7b481ea74 Merged in rmlarsen/eigen (pull request PR-578)
Speed up Eigen matrix*vector and vector*matrix multiplication.

Approved-by: Eugene Zhulenev <ezhulenev@google.com>
2019-02-02 01:53:44 +00:00
Sameer Agarwal
b55b5c7280 Speed up row-major matrix-vector product on ARM
The row-major matrix-vector multiplication code uses a threshold to
check if processing 8 rows at a time would thrash the cache.

This change introduces two modifications to this logic.

1. A smaller threshold for ARM and ARM64 devices.

The value of this threshold was determined empirically using a Pixel2
phone, by benchmarking a large number of matrix-vector products in the
range [1..4096]x[1..4096] and measuring performance separately on the
big and small cores with frequency pinning.

On the big (out-of-order) cores, this change has little to no impact,
but on the small (in-order) cores the matrix-vector products are up to
700% faster, especially on large matrices.

The motivation for this change was some internal code at Google which
was using hand-written NEON for implementing similar functionality,
processing the matrix one row at a time, which exhibited substantially
better performance than Eigen.

With the current change, Eigen handily beats that code.

2. Make the logic for choosing the number of simultaneous rows apply
uniformly to 8, 4, and 2 rows instead of just 8 rows.

Since the default threshold for non-ARM devices is essentially
unchanged (32000 -> 32 * 1024), this change has no impact on non-ARM
performance. This was verified by running the same set of benchmarks
on a Xeon desktop.
2019-02-01 15:23:53 -08:00
Rasmus Munk Larsen
4c0fa6ce0f Speed up Eigen matrix*vector and vector*matrix multiplication.
This change speeds up Eigen matrix * vector and vector * matrix multiplication for dynamic matrices when it is known at runtime that one of the factors is a vector.

The benchmarks below test

c.noalias() = n_by_n_matrix * n_by_1_matrix;
c.noalias() = 1_by_n_matrix * n_by_n_matrix;
respectively.

Benchmark measurements:

SSE:
Run on *** (72 X 2992 MHz CPUs); 2019-01-28T17:51:44.452697457-08:00
CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB
Benchmark                          Base (ns)  New (ns) Improvement
------------------------------------------------------------------
BM_MatVec/64                            1096       312    +71.5%
BM_MatVec/128                           4581      1464    +68.0%
BM_MatVec/256                          18534      5710    +69.2%
BM_MatVec/512                         118083     24162    +79.5%
BM_MatVec/1k                          704106    173346    +75.4%
BM_MatVec/2k                         3080828    742728    +75.9%
BM_MatVec/4k                        25421512   4530117    +82.2%
BM_VecMat/32                             352       130    +63.1%
BM_VecMat/64                            1213       425    +65.0%
BM_VecMat/128                           4640      1564    +66.3%
BM_VecMat/256                          17902      5884    +67.1%
BM_VecMat/512                          70466     24000    +65.9%
BM_VecMat/1k                          340150    161263    +52.6%
BM_VecMat/2k                         1420590    645576    +54.6%
BM_VecMat/4k                         8083859   4364327    +46.0%

AVX2:
Run on *** (72 X 2993 MHz CPUs); 2019-01-28T17:45:11.508545307-08:00
CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB
Benchmark                          Base (ns)  New (ns) Improvement
------------------------------------------------------------------
BM_MatVec/64                             619       120    +80.6%
BM_MatVec/128                           9693       752    +92.2%
BM_MatVec/256                          38356      2773    +92.8%
BM_MatVec/512                          69006     12803    +81.4%
BM_MatVec/1k                          443810    160378    +63.9%
BM_MatVec/2k                         2633553    646594    +75.4%
BM_MatVec/4k                        16211095   4327148    +73.3%
BM_VecMat/64                             925       227    +75.5%
BM_VecMat/128                           3438       830    +75.9%
BM_VecMat/256                          13427      2936    +78.1%
BM_VecMat/512                          53944     12473    +76.9%
BM_VecMat/1k                          302264    157076    +48.0%
BM_VecMat/2k                         1396811    675778    +51.6%
BM_VecMat/4k                         8962246   4459010    +50.2%

AVX512:
Run on *** (72 X 2993 MHz CPUs); 2019-01-28T17:35:17.239329863-08:00
CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB
Benchmark                          Base (ns)  New (ns) Improvement
------------------------------------------------------------------
BM_MatVec/64                             401       111    +72.3%
BM_MatVec/128                           1846       513    +72.2%
BM_MatVec/256                          36739      1927    +94.8%
BM_MatVec/512                          54490      9227    +83.1%
BM_MatVec/1k                          487374    161457    +66.9%
BM_MatVec/2k                         2016270    643824    +68.1%
BM_MatVec/4k                        13204300   4077412    +69.1%
BM_VecMat/32                             324       106    +67.3%
BM_VecMat/64                            1034       246    +76.2%
BM_VecMat/128                           3576       802    +77.6%
BM_VecMat/256                          13411      2561    +80.9%
BM_VecMat/512                          58686     10037    +82.9%
BM_VecMat/1k                          320862    163750    +49.0%
BM_VecMat/2k                         1406719    651397    +53.7%
BM_VecMat/4k                         7785179   4124677    +47.0%
2019-01-31 14:24:08 -08:00
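For context, a compilable sketch of the two benchmarked expressions above (names and size hypothetical):

  #include <Eigen/Dense>
  int main() {
    const int n = 256;
    Eigen::MatrixXd M = Eigen::MatrixXd::Random(n, n);
    Eigen::MatrixXd v = Eigen::MatrixXd::Random(n, 1);  // n-by-1, stored as a dynamic matrix
    Eigen::MatrixXd r = Eigen::MatrixXd::Random(1, n);  // 1-by-n, stored as a dynamic matrix
    Eigen::MatrixXd c1(n, 1), c2(1, n);
    c1.noalias() = M * v;  // BM_MatVec case
    c2.noalias() = r * M;  // BM_VecMat case
    return 0;
  }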
Gael Guennebaud
7ef879f6bf GEBP: improves pipelining in the 1pX4 path with FMA.
Prior to this change, a product with an LHS having 8 rows was faster with AVX only than with AVX+FMA.
With AVX+FMA I measured a speed-up of about x1.25 in such cases.
2019-01-30 23:45:12 +01:00
Gael Guennebaud
de77bf5d6c Fix compilation with ARM64. 2019-01-30 16:48:20 +01:00
Gael Guennebaud
d586686924 Work around lack of support for arbitrary packet types in Tensor by manually loading half/quarter packets in the tensor contraction mapper. 2019-01-30 16:48:01 +01:00
Gael Guennebaud
eb4c6bb22d Fix conflicts and merge 2019-01-30 15:57:08 +01:00
Gael Guennebaud
e3622a0396 Slightly extend the discussion of auto and move the content of the Pit Falls wiki page here.
http://eigen.tuxfamily.org/index.php?title=Pit_Falls
2019-01-30 13:09:21 +01:00
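One of the pitfalls discussed there: with expression templates, auto deduces an unevaluated expression, not a matrix. A minimal sketch:

  #include <Eigen/Dense>
  int main() {
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(3, 3);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(3, 3);
    auto P = A * B;          // P is a product *expression* referencing A and B,
                             // not a computed matrix
    Eigen::MatrixXd C = P;   // evaluated here; safe only while A and B are alive
    Eigen::MatrixXd D = (A * B).eval();  // or force immediate evaluation
    return 0;
  }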
Gael Guennebaud
df12fae8b8 According to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101, the previous GCC issue is fixed in GCC trunk (will be gcc 9). 2019-01-30 11:52:28 +01:00
Gael Guennebaud
3775926bba ARM64 & GEBP: add specialization for double (+30% speed-up) 2019-01-30 11:49:06 +01:00
Gael Guennebaud
be5b0f664a ARM64 & GEBP: Make use of vfmaq_laneq_f32 and workaround GCC's issue in generating good ASM 2019-01-30 11:48:25 +01:00
Christoph Hertzberg
a7779a9b42 Hide some annoying unused variable warnings in g++8.1 2019-01-29 16:48:21 +01:00
Gael Guennebaud
efe02292a6 Add recent gemm-related changesets and various cleanups in perf-monitoring 2019-01-29 11:53:47 +01:00
Gael Guennebaud
8a06c699d0 bug #1669: fix PartialPivLU/inverse with zero-sized matrices. 2019-01-29 10:27:13 +01:00
Gael Guennebaud
a2a07e62b9 Fix compilation with c++03 (local classes cannot be template arguments), and make SparseMatrix::assignDiagonal truly protected. 2019-01-29 10:10:07 +01:00
Gael Guennebaud
f489f44519 bug #1574: implement "sparse_matrix =,+=,-= diagonal_matrix" with smart insertion strategies for missing diagonal coeffs. 2019-01-28 17:29:50 +01:00
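A usage sketch of the new assignments (assuming this changeset; missing diagonal entries are inserted as needed):

  #include <Eigen/Sparse>
  #include <Eigen/Dense>
  int main() {
    Eigen::SparseMatrix<double> S(5, 5);
    Eigen::VectorXd d = Eigen::VectorXd::Constant(5, 2.0);
    S = d.asDiagonal();    // sparse_matrix  = diagonal_matrix
    S += d.asDiagonal();   // sparse_matrix += diagonal_matrix
    S -= d.asDiagonal();   // sparse_matrix -= diagonal_matrix
    return 0;
  }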
Gael Guennebaud
803fa79767 Move evaluator<SparseCompressedBase>::find(i,j) to a more general and reusable SparseCompressedBase::lower_bound(i,j) function 2019-01-28 17:24:44 +01:00
Gael Guennebaud
53560f9186 bug #1672: fix unit test compilation with MSVC by adding overloads of test_is* for long long (and factorize copy/paste code through a macro) 2019-01-28 13:47:28 +01:00