Commit Graph

5965 Commits

Author SHA1 Message Date
Gael Guennebaud
fa2fcb4895 Backed out changeset 4c0fa6ce0f 2019-02-07 16:07:08 +01:00
Gael Guennebaud
b3c4344a68 bug #1676: workaround GCC's bug in c++17 mode. 2019-02-07 15:21:35 +01:00
Eugene Zhulenev
6d0f6265a9 Remove duplicated comment line 2019-02-04 10:30:25 -08:00
Eugene Zhulenev
690b2c45b1 Fix GeneralBlockPanelKernel Android compilation 2019-02-04 10:29:15 -08:00
Gael Guennebaud
871e2e5339 bug #1674: disable GCC's unsafe-math-optimizations in sin/cos vectorization (results are completely wrong otherwise) 2019-02-03 08:54:47 +01:00
Rasmus Larsen
e7b481ea74 Merged in rmlarsen/eigen (pull request PR-578)
Speed up Eigen matrix*vector and vector*matrix multiplication.

Approved-by: Eugene Zhulenev <ezhulenev@google.com>
2019-02-02 01:53:44 +00:00
Sameer Agarwal
b55b5c7280 Speed up row-major matrix-vector product on ARM
The row-major matrix-vector multiplication code uses a threshold to
check if processing 8 rows at a time would thrash the cache.

This change introduces two modifications to this logic.

1. A smaller threshold for ARM and ARM64 devices.

The value of this threshold was determined empirically using a Pixel2
phone, by benchmarking a large number of matrix-vector products in the
range [1..4096]x[1..4096] and measuring performance separately on
big and little cores with frequency pinning.

On big (out-of-order) cores, this change has little to no impact. But
on the small (in-order) cores, the matrix-vector products are up to
700% faster, especially on large matrices.

The motivation for this change was some internal code at Google which
was using hand-written NEON for implementing similar functionality,
processing the matrix one row at a time, which exhibited substantially
better performance than Eigen.

With the current change, Eigen handily beats that code.

2. Make the logic for choosing the number of simultaneous rows apply
uniformly to 8, 4 and 2 rows instead of just 8 rows.

Since the default threshold for non-ARM devices is essentially
unchanged (32000 -> 32 * 1024), this change has no impact on non-ARM
performance. This was verified by running the same set of benchmarks
on a Xeon desktop.
2019-02-01 15:23:53 -08:00
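For context, the row-blocking decision described above boils down to a cache-size check; the following is a minimal standalone sketch (the constants, macros and the rowsPerPass name are illustrative assumptions, not Eigen's actual internals):

    #include <cstddef>

    // Pick how many rows to process per pass without thrashing the cache.
    inline int rowsPerPass(std::size_t rowBytes) {
    #if defined(__aarch64__) || defined(__arm__)
      const std::size_t threshold = 8 * 1024;   // smaller threshold assumed for ARM/ARM64
    #else
      const std::size_t threshold = 32 * 1024;  // default quoted in the message (32 * 1024)
    #endif
      for (int rows = 8; rows >= 2; rows /= 2)  // same check applied to 8, 4 and 2 rows
        if (rows * rowBytes <= threshold) return rows;
      return 1;                                 // otherwise process one row at a time
    }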
Rasmus Munk Larsen
4c0fa6ce0f Speed up Eigen matrix*vector and vector*matrix multiplication.
This change speeds up Eigen matrix * vector and vector * matrix multiplication for dynamic matrices when it is known at runtime that one of the factors is a vector.

The benchmarks below test

c.noalias()= n_by_n_matrix * n_by_1_matrix;
c.noalias()= 1_by_n_matrix * n_by_n_matrix;
respectively.

Benchmark measurements:

SSE:
Run on *** (72 X 2992 MHz CPUs); 2019-01-28T17:51:44.452697457-08:00
CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB
Benchmark                          Base (ns)  New (ns) Improvement
------------------------------------------------------------------
BM_MatVec/64                            1096       312    +71.5%
BM_MatVec/128                           4581      1464    +68.0%
BM_MatVec/256                          18534      5710    +69.2%
BM_MatVec/512                         118083     24162    +79.5%
BM_MatVec/1k                          704106    173346    +75.4%
BM_MatVec/2k                         3080828    742728    +75.9%
BM_MatVec/4k                        25421512   4530117    +82.2%
BM_VecMat/32                             352       130    +63.1%
BM_VecMat/64                            1213       425    +65.0%
BM_VecMat/128                           4640      1564    +66.3%
BM_VecMat/256                          17902      5884    +67.1%
BM_VecMat/512                          70466     24000    +65.9%
BM_VecMat/1k                          340150    161263    +52.6%
BM_VecMat/2k                         1420590    645576    +54.6%
BM_VecMat/4k                         8083859   4364327    +46.0%

AVX2:
Run on *** (72 X 2993 MHz CPUs); 2019-01-28T17:45:11.508545307-08:00
CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB
Benchmark                          Base (ns)  New (ns) Improvement
------------------------------------------------------------------
BM_MatVec/64                             619       120    +80.6%
BM_MatVec/128                           9693       752    +92.2%
BM_MatVec/256                          38356      2773    +92.8%
BM_MatVec/512                          69006     12803    +81.4%
BM_MatVec/1k                          443810    160378    +63.9%
BM_MatVec/2k                         2633553    646594    +75.4%
BM_MatVec/4k                        16211095   4327148    +73.3%
BM_VecMat/64                             925       227    +75.5%
BM_VecMat/128                           3438       830    +75.9%
BM_VecMat/256                          13427      2936    +78.1%
BM_VecMat/512                          53944     12473    +76.9%
BM_VecMat/1k                          302264    157076    +48.0%
BM_VecMat/2k                         1396811    675778    +51.6%
BM_VecMat/4k                         8962246   4459010    +50.2%

AVX512:
Run on *** (72 X 2993 MHz CPUs); 2019-01-28T17:35:17.239329863-08:00
CPU: Intel Skylake Xeon with HyperThreading (36 cores) dL1:32KB dL2:1024KB dL3:24MB
Benchmark                          Base (ns)  New (ns) Improvement
------------------------------------------------------------------
BM_MatVec/64                             401       111    +72.3%
BM_MatVec/128                           1846       513    +72.2%
BM_MatVec/256                          36739      1927    +94.8%
BM_MatVec/512                          54490      9227    +83.1%
BM_MatVec/1k                          487374    161457    +66.9%
BM_MatVec/2k                         2016270    643824    +68.1%
BM_MatVec/4k                        13204300   4077412    +69.1%
BM_VecMat/32                             324       106    +67.3%
BM_VecMat/64                            1034       246    +76.2%
BM_VecMat/128                           3576       802    +77.6%
BM_VecMat/256                          13411      2561    +80.9%
BM_VecMat/512                          58686     10037    +82.9%
BM_VecMat/1k                          320862    163750    +49.0%
BM_VecMat/2k                         1406719    651397    +53.7%
BM_VecMat/4k                         7785179   4124677    +47.0%
2019-01-31 14:24:08 -08:00
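The benchmark sources are not part of this log; a hypothetical reconstruction of BM_MatVec with Eigen and Google Benchmark, matching the expressions quoted above, would look like:

    #include <benchmark/benchmark.h>
    #include <Eigen/Dense>

    static void BM_MatVec(benchmark::State& state) {
      const int n = static_cast<int>(state.range(0));
      Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
      Eigen::MatrixXd b = Eigen::MatrixXd::Random(n, 1);  // n_by_1_matrix: dynamic, vector shape known only at runtime
      Eigen::MatrixXd c(n, 1);
      for (auto _ : state) {
        c.noalias() = A * b;                              // the expression being measured
        benchmark::DoNotOptimize(c.data());
      }
    }
    BENCHMARK(BM_MatVec)->Arg(64)->Arg(4096);
    BENCHMARK_MAIN();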
Gael Guennebaud
7ef879f6bf GEBP: improves pipelining in the 1pX4 path with FMA.
Prior to this change, a product with a LHS having 8 rows was faster with AVX-only than with AVX+FMA.
With AVX+FMA I measured a speed-up of about 1.25x in such cases.
2019-01-30 23:45:12 +01:00
Gael Guennebaud
de77bf5d6c Fix compilation with ARM64. 2019-01-30 16:48:20 +01:00
Gael Guennebaud
eb4c6bb22d Fix conflicts and merge 2019-01-30 15:57:08 +01:00
Gael Guennebaud
df12fae8b8 According to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89101, the previous GCC issue is fixed in GCC trunk (will be gcc 9). 2019-01-30 11:52:28 +01:00
Gael Guennebaud
3775926bba ARM64 & GEBP: add specialization for double +30% speed up 2019-01-30 11:49:06 +01:00
Gael Guennebaud
be5b0f664a ARM64 & GEBP: Make use of vfmaq_laneq_f32 and workaround GCC's issue in generating good ASM 2019-01-30 11:48:25 +01:00
Gael Guennebaud
8a06c699d0 bug #1669: fix PartialPivLU/inverse with zero-sized matrices. 2019-01-29 10:27:13 +01:00
Gael Guennebaud
a2a07e62b9 Fix compilation with c++03 (local classes cannot be template arguments), and make SparseMatrix::assignDiagonal truly protected. 2019-01-29 10:10:07 +01:00
Gael Guennebaud
f489f44519 bug #1574: implement "sparse_matrix =,+=,-= diagonal_matrix" with smart insertion strategies of missing diagonal coeffs. 2019-01-28 17:29:50 +01:00
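A short usage example of the operation this changeset enables (a sketch assuming the Eigen 3.4 API):

    #include <Eigen/Sparse>
    #include <Eigen/Dense>

    int main() {
      Eigen::SparseMatrix<double> S(5, 5);
      S.insert(1, 2) = 3.0;
      S.makeCompressed();
      Eigen::VectorXd d = Eigen::VectorXd::Constant(5, 2.0);
      S += d.asDiagonal();  // missing diagonal coefficients are inserted as needed
      return 0;
    }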
Gael Guennebaud
803fa79767 Move evaluator<SparseCompressedBase>::find(i,j) to a more general and reusable SparseCompressedBase::lower_bound(i,j) function 2019-01-28 17:24:44 +01:00
Christoph Hertzberg
5a52e35f9a Renaming some more I identifiers 2019-01-26 13:18:21 +01:00
Rasmus Munk Larsen
71429883ee Fix compilation error in NEON GEBP specialization of madd. 2019-01-25 17:00:21 -08:00
Gael Guennebaud
ec8a387972 cleanup 2019-01-24 10:24:45 +01:00
David Tellenbach
237b03b372 PR 574: use variadic template instead of initializer_list to implement fixed-size vector ctor from coefficients. 2019-01-23 00:07:19 +01:00
Gael Guennebaud
80f81f9c4b Cleanup SFINAE in Array/Matrix(initializer_list) ctors and minor doc editing. 2019-01-22 17:08:47 +01:00
David Tellenbach
db152b9ee6 PR 572: Add initializer list constructors to Matrix and Array (include unit tests and doc)
- {1,2,3,4,5,...} for fixed-size vectors only
- {{1,2,3},{4,5,6}} for the general cases
- {{1,2,3,4,5,...}} is allowed for both row- and column-vectors
2019-01-21 16:25:57 +01:00
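The three forms listed above correspond to the following usage (a sketch against the Eigen 3.4 API):

    #include <Eigen/Dense>

    int main() {
      Eigen::Vector3d v{1.0, 2.0, 3.0};                  // flat list, fixed-size vectors only
      Eigen::Matrix<int, 2, 3> m{{1, 2, 3}, {4, 5, 6}};  // one nested list per row (general case)
      Eigen::RowVector3i r{{1, 2, 3}};                   // nested form, allowed for row and column vectors
      return 0;
    }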
nluehr
92774f0275 Replace host_define.h with cuda_runtime_api.h 2019-01-18 16:10:09 -06:00
Christoph Hertzberg
da0a41b9ce Mask unused-parameter warnings, when building with NDEBUG 2019-01-18 10:41:14 +01:00
Rasmus Munk Larsen
2eccbaf3f7 Add missing logical packet ops for GPU and NEON. 2019-01-17 17:45:08 -08:00
Gael Guennebaud
ee3662abc5 Remove some useless const_cast 2019-01-17 18:27:49 +01:00
Gael Guennebaud
0fe6b7d687 Make nestByValue work again (broken since 3.3) and add unit tests. 2019-01-17 18:27:25 +01:00
Gael Guennebaud
4b7cf7ff82 Extend reshaped unit tests and remove useless const_cast 2019-01-17 17:35:32 +01:00
Gael Guennebaud
b57c9787b1 Cleanup useless const_cast and add missing broadcast assignment tests 2019-01-17 16:55:42 +01:00
Gael Guennebaud
be05d0030d Make FullPivLU use conjugateIf<> 2019-01-17 12:01:00 +01:00
Patrick Peltzer
15e53d5d93 PR 567: makes all dense solvers inherit SolverBase (LU, Cholesky, QR, SVD).
This changeset also includes:
 * add HouseholderSequence::conjugateIf
 * define int as the StorageIndex type for all dense solvers
 * dedicated unit tests, including assertion checking
 * _check_solve_assertion(): this method can be implemented in derived solver classes to perform custom checks
 * CompleteOrthogonalDecompositions: add applyZOnTheLeftInPlace, fix scalar type in applyZAdjointOnTheLeftInPlace(), add missing assertions
 * Cholesky: add missing assertions
 * FullPivHouseholderQR: Corrected Scalar type in _solve_impl()
 * BDCSVD: Unambiguous return type for ternary operator
 * SVDBase: Corrected Scalar type in _solve_impl()
2019-01-17 01:17:39 +01:00
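With all dense solvers deriving from SolverBase, the transpose/adjoint solve API becomes uniformly available; a minimal sketch, assuming the Eigen 3.4 interface:

    #include <Eigen/Dense>

    int main() {
      Eigen::MatrixXd A = Eigen::MatrixXd::Random(4, 4);
      Eigen::VectorXd b = Eigen::VectorXd::Random(4);

      Eigen::PartialPivLU<Eigen::MatrixXd> lu(A);
      Eigen::VectorXd x1 = lu.solve(b);              // solve A x = b
      Eigen::VectorXd x2 = lu.transpose().solve(b);  // solve A^T x = b, provided via SolverBase
      Eigen::VectorXd x3 = lu.adjoint().solve(b);    // solve A^* x = b
      return 0;
    }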
Gael Guennebaud
7f32109c11 Add conjugateIf<bool> members to DenseBase, TriangularView, SelfadjointView, and make PartialPivLU use it. 2019-01-17 11:33:43 +01:00
Gael Guennebaud
562985bac4 bug #1646: fix false aliasing detection for A.row(0) = A.col(0);
This changeset completely disables the detection for vectors, for which our current mechanism cannot detect any positive aliasing anyway.
2019-01-17 00:14:27 +01:00
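The expression in question, for reference (a minimal reproduction; with the fix, Eigen no longer reports a false positive for this vector-shaped assignment):

    #include <Eigen/Dense>

    int main() {
      Eigen::Matrix3d A = Eigen::Matrix3d::Random();
      A.row(0) = A.col(0);  // both sides only overlap at A(0,0), so no temporary is required
      return 0;
    }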
Rasmus Munk Larsen
7401e2541d Fix compilation error for logical packet ops with older compilers. 2019-01-16 14:43:33 -08:00
Gael Guennebaud
0f028f61cb GEBP: fix swapped kernel mode with AVX512 and complex scalars 2019-01-16 22:26:38 +01:00
Gael Guennebaud
e118ce86fd GEBP: clean up the logic to choose between 4 packets or 1 packet 2019-01-16 21:47:42 +01:00
Gael Guennebaud
70e133333d bug #1661: fix regression in GEBP and AVX512 2019-01-16 21:22:20 +01:00
Gael Guennebaud
502f717980 bug #1646: disable aliasing detection for empty and 1x1 expression 2019-01-16 14:33:45 +01:00
Gael Guennebaud
0b466b6933 bug #1633: use proper type for madd temporaries, factorize RhsPacketx4. 2019-01-16 13:50:13 +01:00
Renjie Liu
dbfcceabf5 bug #1633: refactor gebp kernel and optimize for NEON 2019-01-16 12:51:36 +08:00
Gael Guennebaud
2b70b2f570 Make Transform::rotation() an alias to Transform::linear() in the case of an Isometry 2019-01-15 22:50:42 +01:00
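Example of the Isometry case this refers to: in Isometry mode the linear part is already a rotation, so rotation() can simply return linear() (sketch against the Eigen Geometry module):

    #include <Eigen/Geometry>

    int main() {
      Eigen::Isometry3d T = Eigen::Isometry3d::Identity();
      T.rotate(Eigen::AngleAxisd(0.5, Eigen::Vector3d::UnitZ()));
      T.translate(Eigen::Vector3d(1, 2, 3));
      Eigen::Matrix3d R = T.rotation();  // equivalent to T.linear() for an Isometry
      return 0;
    }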
Gael Guennebaud
2c2c114995 Silence maybe-uninitialized warnings from gcc 2019-01-15 16:53:15 +01:00
Gael Guennebaud
6ec6bf0b0d Enable visitor on empty matrices (the visitor is left unchanged), and protect min/maxCoeff(Index*,Index*) on empty matrices by an assertion (+ doc & unit tests) 2019-01-15 15:21:14 +01:00
Gael Guennebaud
027e44ed24 bug #1592: makes partial min/max reductions trigger an assertion on inputs with a zero reduction length (+doc and tests) 2019-01-15 15:13:24 +01:00
Gael Guennebaud
f8bc5cb39e Fix detection of vector-at-compile-time: use Rows/Cols instead of MaxRows/MaxCols.
This fixes VectorXd(n).middleCol(0,0).outerSize(), which was equal to 1.
2019-01-15 15:09:49 +01:00
Gael Guennebaud
6cf7afa3d9 Typo 2019-01-15 11:04:37 +01:00
Rasmus Larsen
7b3aab0936 Merged in rmlarsen/eigen (pull request PR-570)
Add support for inverse hyperbolic functions. Fix cost of division.
2019-01-14 21:31:33 +00:00
Gael Guennebaud
250dcd1fdb bug #1652: fix position of EIGEN_ALIGN16 attributes in Neon and Altivec 2019-01-14 21:45:56 +01:00
Rasmus Larsen
5a59452aae Merged eigen/eigen into default 2019-01-14 10:23:23 -08:00
Gael Guennebaud
3c9e6d206d AVX512: fix pgather/pscatter for Packet4cd and unaligned pointers 2019-01-14 17:57:28 +01:00
Gael Guennebaud
61b6eb05fe AVX512 (r)sqrt(double) was mistakenly disabled with clang and others 2019-01-14 17:28:47 +01:00
Gael Guennebaud
ccddeaad90 fix warning 2019-01-14 16:51:16 +01:00
Gael Guennebaud
d4881751d3 Doc: add Isometry in the list of supported Mode of Transform<> 2019-01-14 16:38:26 +01:00
Greg Coombe
9d988a1e1a Initialize isometric transforms like affine transforms.
The isometric transform, like the affine transform, has an implicit last
row of [0, 0, 0, 1]. This was not being properly initialized, as verified
by a new test function.
2019-01-11 23:14:35 -08:00
Gael Guennebaud
4356a55a61 PR 571: Implements an accurate argument reduction algorithm for huge inputs of sin/cos and calls it instead of falling back to std::sin/std::cos.
This makes both the small and huge argument cases faster because:
- for small inputs this removes the last pselect
- for large inputs only the reduction part follows a scalar path;
  the rest uses the same SIMD path as the small-argument case.
2019-01-14 13:54:01 +01:00
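The user-facing entry point for this vectorized path is the array API; a small usage sketch:

    #include <Eigen/Dense>

    int main() {
      Eigen::ArrayXd x = Eigen::ArrayXd::LinSpaced(1024, 0.0, 1.0e5);
      Eigen::ArrayXd s = x.sin();  // vectorized psin; huge arguments go through the reduction path
      Eigen::ArrayXd c = x.cos();
      return 0;
    }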
Gael Guennebaud
f566724023 Fix StorageIndex FIXME in dense LU solvers 2019-01-13 17:54:30 +01:00
Rasmus Munk Larsen
1c6e6e2c3f Merge. 2019-01-11 17:47:11 -08:00
Rasmus Munk Larsen
28ba1b2c32 Add support for inverse hyperbolic functions.
Fix cost of division.
2019-01-11 17:45:37 -08:00
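Usage of the newly supported functions at the array level (a sketch assuming the Eigen 3.4 names asinh/acosh/atanh):

    #include <Eigen/Dense>

    int main() {
      Eigen::ArrayXd x = Eigen::ArrayXd::LinSpaced(8, -0.9, 0.9);
      Eigen::ArrayXd a = x.asinh();          // defined for all real x
      Eigen::ArrayXd b = (x + 2.0).acosh();  // acosh needs arguments >= 1
      Eigen::ArrayXd c = x.atanh();          // atanh needs |x| < 1
      return 0;
    }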
Rasmus Munk Larsen
89c4001d6f Fix warnings in ptrue for complex and half types. 2019-01-11 14:10:57 -08:00
Rasmus Munk Larsen
a49d01edba Fix warnings in ptrue for complex and half types. 2019-01-11 13:18:17 -08:00
Rasmus Munk Larsen
df29511ac0 Fix merge. 2019-01-11 10:36:36 -08:00
Rasmus Munk Larsen
9396ace46b Merge. 2019-01-11 10:28:52 -08:00
Rasmus Larsen
74882471d0 Merged eigen/eigen into default 2019-01-11 10:20:55 -08:00
Gael Guennebaud
9005f0111f Replace compiler's alignas/alignof extension by respective c++11 keywords when available. This also fix a compilation issue with gcc-4.7. 2019-01-11 17:10:54 +01:00
Mark D Ryan
3c9add6598 Remove reinterpret_cast from AVX512 complex implementation
The reinterpret_casts used in ptranspose(PacketBlock<Packet8cf,4>&) and
ptranspose(PacketBlock<Packet8cf,8>&) don't appear to be working
correctly.  They're used to convert the kernel parameters to
PacketBlock<Packet8d,T>& so that the complex number versions of
ptranspose can be written using the existing double implementations.
Unfortunately, they don't seem to work and are responsible for 9 unit
test failures in the AVX512 build of tensorflow master.  This commit
fixes the issue by manually initialising PacketBlock<Packet8d,T>
variables with the contents of the kernel parameter before calling
the double version of ptranspose, and then copying the resulting
values back into the kernel parameter before returning.
2019-01-11 14:02:09 +01:00
Rasmus Munk Larsen
fcfced13ed Rename pones -> ptrue. Use _CMP_TRUE_UQ where appropriate. 2019-01-09 17:20:33 -08:00
Rasmus Munk Larsen
e15bb785ad Collapsed revision
* Add packet op "pones". Write pnot(a) as pxor(pones(a), a).
* Collapsed revision
* Simplify a bit.
* Undo useless diffs.
* Fix typo.
2019-01-09 16:34:23 -08:00
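The pnot(a) = pxor(pones(a), a) identity mentioned here, shown standalone on a plain SSE2 packet rather than Eigen's internal packet ops (pones was later renamed ptrue, see the commit above):

    #include <emmintrin.h>

    static inline __m128i pones(__m128i a) {
      return _mm_cmpeq_epi32(a, a);       // compare-equal with itself yields all bits set
    }
    static inline __m128i pnot(__m128i a) {
      return _mm_xor_si128(pones(a), a);  // bitwise NOT via XOR with an all-ones packet
    }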
Rasmus Munk Larsen
f6ba6071c5 Fix typo. 2019-01-09 16:34:23 -08:00
Rasmus Munk Larsen
8f04442526 Collapsed revision
* Collapsed revision
* Add packet op "pones". Write pnot(a) as pxor(pones(a), a).
* Collapsed revision
* Simplify a bit.
* Undo useless diffs.
* Fix typo.
2019-01-09 16:34:23 -08:00
Rasmus Munk Larsen
e00521b514 Undo useless diffs. 2019-01-09 16:32:53 -08:00
Rasmus Munk Larsen
f2767112c8 Simplify a bit. 2019-01-09 16:29:18 -08:00
Rasmus Munk Larsen
cb955df9a6 Add packet op "pones". Write pnot(a) as pxor(pones(a), a). 2019-01-09 16:17:08 -08:00
Rasmus Larsen
cb3c059fa4 Merged eigen/eigen into default 2019-01-09 15:04:17 -08:00
Gael Guennebaud
d812f411c3 bug #1654: fix compilation with cuda and no c++11 2019-01-09 18:00:05 +01:00
Gael Guennebaud
3492a1ca74 fix plog(+inf) with AVX512 2019-01-09 16:53:37 +01:00
Gael Guennebaud
47810cf5b7 Add dedicated implementations of predux_any for AVX512, NEON, and Altivec/VSX 2019-01-09 16:40:42 +01:00
Gael Guennebaud
3f14e0d19e fix warning 2019-01-09 15:45:21 +01:00
Gael Guennebaud
aeec68f77b Add missing pcmp_lt and others for AVX512 2019-01-09 15:36:41 +01:00
Gael Guennebaud
e6b217b8dd bug #1652: implements a much more accurate version of vectorized sin/cos. This new version achieves the same speed for SSE/AVX, and is slightly faster with FMA. Guarantees are as follows:
- no FMA: 1ULP up to 3pi, 2ULP up to sin(25966) and cos(18838), fallback to std::sin/cos for larger inputs
- FMA: 1ULP up to sin(117435.992) and cos(71476.0625), fallback to std::sin/cos for larger inputs
2019-01-09 15:25:17 +01:00
Rasmus Munk Larsen
055f0b73db Add support for pcmp_eq and pnot, including for complex types. 2019-01-07 16:53:36 -08:00
Eugene Zhulenev
190d053e41 Explicitly set fill character when printing aligned data to ostream 2019-01-03 14:55:28 -08:00
Mark D Ryan
bc5dd4cafd PR560: Fix the AVX512f-only builds
Commit c53eececb0 introduced AVX512 support for complex numbers but required
avx512dq to build. Commit 1d683ae2f5 fixed some, but it would seem not all,
of the hard avx512dq dependencies. Build failures are still evident on
Eigen and TensorFlow when compiling with just avx512f and no avx512dq
using gcc 7.3. Looking at the code, there does indeed seem to be a problem.
Commit c53eececb0 calls avx512dq intrinsics directly, e.g., _mm512_extractf32x8_ps
and _mm512_and_ps. This commit fixes the issue by replacing the direct
intrinsic calls with the various wrapper functions that are safe to use on
avx512f-only builds.
2019-01-03 14:33:04 +01:00
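The wrapper pattern this commit relies on can be illustrated as follows (the function name is hypothetical and Eigen's actual helpers differ; the idea is to guard the avx512dq intrinsic and fall back to avx512f-only integer ops):

    #include <immintrin.h>

    static inline __m512 and_ps_wrapper(__m512 a, __m512 b) {
    #ifdef __AVX512DQ__
      return _mm512_and_ps(a, b);                      // avx512dq intrinsic
    #else
      return _mm512_castsi512_ps(_mm512_and_si512(     // avx512f-only fallback
          _mm512_castps_si512(a), _mm512_castps_si512(b)));
    #endif
    }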
Gael Guennebaud
60d3fe9a89 One more stupid AVX 512 fix (I don't have direct access to AVX512 machines) 2018-12-24 13:05:03 +01:00
Gael Guennebaud
4aa667b510 Add EIGEN_STRONG_INLINE where required 2018-12-24 10:45:01 +01:00
Gael Guennebaud
961ff567e8 Add missing pcmp_lt_or_nan for AVX512 2018-12-23 22:13:29 +01:00
Gael Guennebaud
0f6f75bd8a Implement a faster fix for sin/cos of large entries that also correctly handle INF input. 2018-12-23 17:26:21 +01:00
Gael Guennebaud
38d704def8 Make sure that psin/pcos return numbers in [-1,1] for large inputs (though sin/cos on large entries is quite useless because it's inaccurate) 2018-12-23 16:13:24 +01:00
Gael Guennebaud
5713fb7feb Fix plog(+INF): it returned ~87 instead of +INF 2018-12-23 15:40:52 +01:00
Christoph Hertzberg
6dd93f7e3b Make code compile again for older compilers.
See https://stackoverflow.com/questions/7411515/
2018-12-22 13:09:07 +01:00
Gustavo Lima Chaves
1024a70e82 gebp: Add new ½ and ¼ packet rows per (peeling) round on the lhs

The patch works by altering the gebp lhs packing routines to also
consider ½ and ¼ packet-length rows when packing, besides the original
whole-packet and row-by-row attempts. Finally, gebp itself will try
to fit a fraction of a packet at a time if:

i) ½ and/or ¼ packets are available for the current context (e.g. AVX2
   and SSE-sized SIMD register for x86)

ii) The matrix's height is favorable to it (it may be that it's too small
    in that dimension to take full advantage of the current/maximum
    packet width, or that the last rows may take
    advantage of smaller packets before gebp goes row-by-row)

This helps mitigate the huge slowdowns one had on AVX512 builds when
compared to AVX2 ones, for some dimensions. Gains top out at an extra 1x
in throughput. This patch is a complement to changeset 4ad359237a.

Since packing is changed, Eigen users which would go for very
low-level API usage, like TensorFlow, will have to be adapted to work
fine with the changes.
2018-12-21 11:03:18 -08:00
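The packing order described above, in rough outline (a pseudocode-level sketch, not Eigen's actual gebp packing routine):

    template <typename Scalar>
    void pack_lhs_rows(const Scalar* lhs, int rows, int packet_size) {
      int r = 0;
      for (; r + packet_size     <= rows; r += packet_size)     { /* pack a whole packet of rows */ }
      for (; r + packet_size / 2 <= rows; r += packet_size / 2) { /* then try half-packet rows   */ }
      for (; r + packet_size / 4 <= rows; r += packet_size / 4) { /* then quarter-packet rows    */ }
      for (; r < rows; ++r)                                     { /* finally row-by-row          */ }
      (void)lhs;  // packing of the actual data is omitted in this sketch
    }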
Gustavo Lima Chaves
e763fcd09e Introducing "vectorized" byte on unpacket_traits structs
This is a preparation for a change to gebp_traits, where a new template
argument will be introduced to dictate the packet size, so it won't be
bound to the current/max packet size only anymore.

By having packet types defined early on gebp_traits, one has now to
act on packet types, not scalars anymore, for the enum values defined
on that class. One approach for reaching the vectorizable/size
properties one needs there could be getting the packet's scalar again
with unpacket_traits<>, then the size/Vectorizable enum entries from
packet_traits<>. It turns out guards like "#ifndef
EIGEN_VECTORIZE_AVX512" at AVX/PacketMath.h will hide smaller packet
variations of packet_traits<> for some types (and it makes sense to
keep that). In other words, one can't go back to the scalar and create
a new PacketType, as this will always lead to the maximum packet type
for the architecture.

The less costly/invasive solution for that, thus, is to add the
vectorizable info on every unpacket_traits struct as well.
2018-12-19 14:24:44 -08:00
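The shape of the change, roughly (a simplified sketch of an unpacket_traits specialization gaining a vectorizable entry; field names mirror the description, not Eigen's exact code):

    template <typename Packet> struct unpacket_traits;

    struct Packet4f_stub { float v[4]; };  // stand-in for a real SIMD packet type

    template <> struct unpacket_traits<Packet4f_stub> {
      typedef float type;
      enum { size = 4, vectorizable = true };
    };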
Gael Guennebaud
efa4c9c40f bug #1615: slightly increase the default unrolling limit to compensate for changeset 101ea26f5e.
This solves a performance regression with clang and 3x3 matrix products.
2018-12-13 10:42:39 +01:00
Gael Guennebaud
f582ea3579 Fix compilation with expression template scalar type. 2018-12-12 22:47:00 +01:00
Gael Guennebaud
2de8da70fd bug #1557: fix RealSchur and EigenSolver for matrices with only zeros on the diagonal. 2018-12-12 17:30:08 +01:00
Gael Guennebaud
37c91e1836 bug #1644: fix warning 2018-12-11 22:07:20 +01:00
Gael Guennebaud
f159cf3d75 Artificially increase l1-blocking size for AVX512. +10% speedup with current kernels.
With a 6pX4 kernel (not committed yet), this provides a +20% speedup.
2018-12-11 15:36:27 +01:00
Gael Guennebaud
0a7e7af6fd Properly set the number of registers for AVX512 2018-12-11 15:33:17 +01:00
Gael Guennebaud
7166496f70 bug #1643: fix compilation issue with gcc and no optimization 2018-12-11 13:24:42 +01:00