It basically performs 4 dot products at once reducing loads of the vector and improving
instructions scheduling. With 3 cache friendly algorithms, we now handle all product
configurations with outstanding perf for large matrices.
and vector * row-major products. Currently, it is enabled only is the matrix
has DirectAccessBit flag and the product is "large enough".
Added the respective unit tests in test/product/cpp.
might be twice faster fot small fixed size matrix
* added a sparse triangular solver (sparse version
of inverseProduct)
* various other improvements in the Sparse module
* added complete implementation of sparse matrix product
(with a little glue in Eigen/Core)
* added an exhaustive bench of sparse products including GMM++ and MTL4
=> Eigen outperforms in all transposed/density configurations !
* rework PacketMath and DummyPacketMath, make these actual template
specializations instead of just overriding by non-template inline
functions
* introduce ei_ploadt and ei_pstoret, make use of them in Map and Matrix
* remove Matrix::map() methods, use Map constructors instead.
* added some glue to Eigen/Core (SparseBit, ei_eval, Matrix)
* add two new sparse matrix types:
HashMatrix: based on std::map (for random writes)
LinkedVectorMatrix: array of linked vectors
(for outer coherent writes, e.g. to transpose a matrix)
* add a SparseSetter class to easily set/update any kind of matrices, e.g.:
{ SparseSetter<MatrixType,RandomAccessPattern> wrapper(mymatrix);
for (...) wrapper->coeffRef(rand(),rand()) = rand(); }
* automatic shallow copy for RValue
* and a lot of mess !
plus:
* remove the remaining ArrayBit related stuff
* don't use alloca in product for very large memory allocation
to "public:method()" i.e. reimplementing the generic method()
from MatrixBase.
improves compilation speed by 7%, reduces almost by half the call depth
of trivial functions, making gcc errors and application backtraces
nicer...
* introduce packet(int), make use of it in linear vectorized paths
--> completely fixes the slowdown noticed in benchVecAdd.
* generalize coeff(int) to linear-access xprs
* clarify the access flag bits
* rework api dox in Coeffs.h and util/Constants.h
* improve certain expressions's flags, allowing more vectorization
* fix bug in Block: start(int) and end(int) returned dyn*dyn size
* fix bug in Block: just because the Eval type has packet access
doesn't imply the block xpr should have it too.
* make the conj functor vectorizable: it is just identity in real case,
and complex doesn't use the vectorized path anyway.
* fix bug in Block: a 3x1 block in a 4x4 matrix (all fixed-size)
should not be vectorizable, since in fixed-size we are assuming
the size to be a multiple of packet size. (Or would you prefer
Vector3d to be flagged "packetaccess" even though no packet access
is possible on vectors of that type?)
* rename:
isOrtho for vectors ---> isOrthogonal
isOrtho for matrices ---> isUnitary
* add normalize()
* reimplement normalized with quotient1 functor
- uses the common "Compressed Column Storage" scheme
- supports every unary and binary operators with xpr template
assuming binaryOp(0,0) == 0 and unaryOp(0) = 0 (otherwise a sparse
matrix doesnot make sense)
- this is the first commit, so of course, there are still several shorcommings !
(could come back to redux after it has been vectorized,
and could serve as a starting point for that)
also make the abs2 functor vectorizable (for real types).
packet access, it is not certain that it will bring a performance
improvement: benchmarking needed.
* improve logic choosing slice vectorization.
* fix typo in SSE packet math, causing crash in unaligned case.
* fix bug in Product, causing crash in unaligned case.
* add TEST_SSE3 CMake option.
* make Matrix2f (and similar) vectorized using linear path
* fix a couple of warnings and compilation issues with ICC and gcc 3.3/3.4
(cannot get Transform compiles with gcc 3.3/3.4, see the FIXME)
* use ProductReturnType<>::Type to get the correct Product xpr type
* Product is no longer instanciated for xpr types which are evaluated
* vectorization of "a.transpose() * b" for the normal product (small and fixed-size matrix)
* some cleanning
* removed ArrayBase
** Much better organization
** Fix a few bugs
** Add the ability to unroll only the inner loop
** Add an unrolled path to the Like1D vectorization. Not well tested.
** Add placeholder for sliced vectorization. Unimplemented.
* Rework of corrected_flags:
** improve rules determining vectorizability
** for vectors, the storage-order is indifferent, so we tweak it
to allow vectorization of row-vectors.
* fix compilation in benchmark, and a warning in Transpose.
to optimize matrix-diag and diag-matrix products without
making Product over complicated.
* compilation fixes in Tridiagonalization and HessenbergDecomposition
in the case of 2x2 matrices.
* added an Orientation2D small class with similar interface than Quaternion
(used by Transform to handle 2D and 3D orientations seamlessly)
* added a couple of features in Transform.
them in the ei_traits, so that they're guaranteed even if the user
specified his own non-default flags (like before).
Measured to not make compilation any slower.
flags. This ensures that unless explicitly messed up otherwise,
a Matrix type is equal to its own Eval type. This seriously reduces
the number of types instantiated. Measured +13% compile speed, -7%
binary size.
* Improve doc of Matrix template parameters.
This is the first step towards a non-selfadjoint eigen solver.
Notes:
- We might consider merging Tridiagonalization and Hessenberg toghether ?
- Or we could factorize some code into a Householder class (could also be shared with QR)
which is even better optimized by the compiler.
* Quaternion no longer inherits MatrixBase. Instead it stores the coefficients
using a Matrix<> and provides only relevant methods.
tends to show L2 norm works very well here.
(the legacy implementation is still available via a preprocessor token
to allow further experiments if needed...)
as an argument of a function. Other possibilities for the name could be "end" or "matrix" ??
* various update in Quaternion, in particular I added a lot of FIXME about the API options,
these have to be discussed and fixed.
- get the doc of the flags in Constants right
- finally give up with SEPARATE_MEMBER_PAGES: it triggers too big
Doxygen bugs, and produces too many small pages. So we have one
huge page for MatrixBase at currently 300kb and going up, so the
solution especially for users with low bandwidth will be to provide
an archive of the html documentation.
- add MatrixBase::matrixNorm(); in the non-selfadjoint case, we reduce to the
selfadjoint case by using the "C*-identity" a.k.a.
norm of x = sqrt(norm of x * x.adjoint())
* added MatrixBase::real()
* added the ability to extract a selfadjoint matrix from the
lower or upper part of a matrix, e.g.:
m.extract<Upper|SelfAdjoint>()
will ignore the strict lower part and return a selfadjoint.
This is compatible with ZeroDiag and UnitDiag.
* added ConjugateReturnType and AdjointReturnType that are type-defined to Derived&
and Transpose<Derived> if the scalar type is not complex: this avoids abusive copies in
the cache friendly Product
was a sign that we were doing something wrong. In fact, having
NestByValue as a special case of Flagged was wrong, and the previous
commit, while not buggy, was inefficient because then when the resulting
NestByValue xpr was nested -- hence copied -- the original xpr which was
already nested by value was copied again; hence instead of 1 copy we got
3 copies.
The solution was to ressuscitate the old Temporary.h (renamed
NestByValue.h) as it was the right approach.
finally that's more subtle than just using ei_nested, because when
flagging with NestByValueBit we want to store the expression by value
already, regardless of whether it already had the NestByValueBit set.
* rename temporary() ----> nestByValue()
* move the old Product.h to disabled/, replace by what was ProductWIP.h
* tweak -O and -g flags for tests and examples
* reorder the tests -- basic things go first
* simplifications, e.g. in many methoeds return derived() and count on
implicit casting to the actual return type.
* strip some not-really-useful stuff from the heaviest tests
Triangular class
- full meta-unrolling in Part
- move inverseProduct() to MatrixBase
- compilation fix in ProductWIP: introduce a meta-selector to only do
direct access on types that support it.
- phase out the old Product, remove the WIP_DIRTY stuff.
- misc renaming and fixes
- fallback to normal product for small dynamic matrices
- overloaded "c += (a * b).lazy()" to avoid the expensive and useless temporary and setZero()
in such very common cases.
* fix a couple of issues with the flags
(does not support complex and does not re-use the QR decomposition)
* Rewrite the cache friendly product to have only one instance per scalar type !
This significantly speeds up compilation time and reduces executable size.
The current drawback is that some trivial expressions might be
evaluated like conjugate or negate.
* Renamed "cache optimal" to "cache friendly"
* Added the ability to directly access matrix data of some expressions via:
- the stride()/_stride() methods
- DirectAccessBit flag (replace ReferencableBit)
* Fix compilation of Inverse.h with vectorisation
* Introduce EIGEN_GNUC_AT_LEAST(x,y) macro doing future-proof (e.g. gcc v5.0) check
* Only use ProductWIP if vectorisation is enabled
* rename EIGEN_ALWAYS_INLINE -> EIGEN_INLINE with fall-back to inline keyword
* some cleanup/indentation
(needed by the new product implementation)
* Make the packet* members template to support aligned and unaligned
access. This makes Block vectorizable. Combined with ReferencableBit,
we should be able to determine at runtime (in some specific cases) if
an aligned vectorization is possible or not.
* Improved the new product implementation to robustly handle all cases,
it now passes all the tests.
* Renamed the packet version ei_predux to ei_preduxp to avoid name collision.
* Introduce a new highly optimized matrix-matrix product for large
matrices. The code is still highly experimental and it is activated
only if you define EIGEN_WIP_PRODUCT at compile time.
Currently the third dimension of the product must be a factor of
the packet size (x4 for floats) and the right handed side matrix
must be column major.
Moreover, currently c = a*b; actually computes c += a*b !!
Therefore, the code is provided for experimentation purpose only !
These limitations will be fixed soon or later to become the default
product implementation.
extended cache optimal product to work in any row/column
major situations, and a few bugfixes (forgot to add the
Cholesky header, vectorization of CwiseBinary)
m.upper() = a+b;
only updates the upper triangular part of m.
Note that:
m = (a+b).upper();
updates all coefficients of m (but half of the additions
will be skiped)
Updated back/forward substitution to better use Eigen's capability.
part of a matrix. Triangular also provide an optimised method for forward
and backward substitution. Further optimizations regarding assignments and
products might come later.
Updated determinant() to take into account triangular matrices.
Started the QR module with a QR decompostion algorithm.
Help needed to build a QR algorithm (eigen solver) based on it.
- support dynamic sizes
- support arbitrary matrix size when the matrix can be seen as a 1D array
(except for fixed size matrices where the size in Bytes must be a factor of 16,
this is to allow compact storage of a vector of matrices)
Note that the explict vectorization is still experimental and far to be completely tested.
are provided to handle not suported types seemlessly.
Added a generic null-ary expression with null-ary functors. They replace
Zero, Ones, Identity and Random.
- let Inverse take template parameter MatrixType instead
of ExpressionType, in order to reduce executable code size
when taking inverses of xpr's.
- introduce ei_corrected_matrix_flags : the flags template
parameter to the Matrix class is only a suggestion. This
is also useful in ei_eval.
(only 30 muls for size 4)
- rework the matrix inversion: now using cofactor technique for size<=3,
so the ugly unrolling is only used for size 4 anymore, and even there
I'm looking to get rid of it.
* Use them to write an unrolled path in echelon.cpp, as an
experiment before I do this LU module.
* For floating-point types, make ei_random() use an amplitude
of 1.
using a macro and _Pragma.
- use OpenMP also in cacheOptimalProduct and in the
vectorized paths as well
- kill the vector assignment unroller. implement in
operator= the logic for assigning a row-vector in
a col-vector.
- CMakeLists support for building tests/examples
with -fopenmp and/or -msse2
- updates in bench/, especially replace identity()
by ones() which prevents underflows from perturbing
bench results.
* add -pedantic to CXXFLAGS
* cleanup intricated expressions with && and ||
which gave warnings because of "missing" parentheses
* fix compile error in NumTraits, apparently discovered
by -pedantic
Currently only the following platform/operations are supported:
- SSE2 compatible architecture
- compiler compatible with intel's SSE2 intrinsics
- float, double and int data types
- fixed size matrices with a storage major dimension multiple of 4 (or 2 for double)
- scalar-matrix product, component wise: +,-,*,min,max
- matrix-matrix product only if the left matrix is vectorizable and column major
or the right matrix is vectorizable and row major, e.g.:
a.transpose() * b is not vectorized with the default column major storage.
To use it you must define EIGEN_VECTORIZE and EIGEN_INTEL_PLATFORM.
- make use of CoeffReadCost to determine when to unroll the loops,
for now only in Product.h and in OperatorEquals.h
performance remains the same: generally still not as good as before the
big changes.
ei_xpr_copy to evaluate args when needed. Had to introduce an ugly
trick with ei_unref as when the XprCopy type is a reference one can't
directly access member typedefs such as Scalar.
in ei_xpr_copy and operator=, respectively.
* added Matrix::lazyAssign() when EvalBeforeAssigningBit must be skipped
(mainly internal use only)
* all expressions are now stored by const reference
* added Temporary xpr: .temporary() must be called on any temporary expression
not directly returned by a function (mainly internal use only)
* moved all functors in the Functors.h header
* added some preliminaries stuff for the explicit vectorization
* added "all" and "any" special redux operators
* added support bool matrices
* added support for cost model of STL functors via ei_functor_traits
(By default ei_functor_traits query the functor member Cost)
useless copies are made when evaluating nested expressions.
Changes:
- kill LazyBit, introduce EvalBeforeNestingBit and EvalBeforeAssigningBit
- product and random don't evaluate immediately anymore
- eval() always evaluates
- change the value of Dynamic to some large positive value,
in preparation of future simplifications
before the Product<> type is constructed. This resets template depth on each
intermediate evaluation, and gives simpler code. Introducing
ei_eval_if_expensive<Derived, n> which evaluates Derived if it's worth it
given that each of its coeffs will be accessed n times. Operator*
uses this with adequate values of n to evaluate args exactly when needed.