Igor Babuschkin
|
dc03b8f3a1
|
Add generic scan method
|
2016-06-03 17:37:04 +01:00 |
|
Benoit Steiner
|
c3c8ad8046
|
Align the first element of the Waiter struct instead of padding it. This reduces its memory footprint a bit while achieving the goal of preventing false sharing
|
2016-06-02 21:17:41 -07:00 |
|
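To illustrate the trade-off described in c3c8ad8046, here is a minimal sketch that compares padding a struct with aligning its first member. The struct names, the fields, and the 64-byte cache line are assumptions made for illustration; they are not taken from Eigen's actual Waiter definition.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is common on x86-64 but not universal.
constexpr std::size_t kCacheLine = 64;

// Padding approach (hypothetical layout): filler bytes follow the payload so
// that the fields of two adjacent objects never land on the same cache line.
struct PaddedWaiter {
  std::atomic<PaddedWaiter*> next{nullptr};
  std::uint32_t state{0};
  char pad[kCacheLine];
};

// Alignment approach (hypothetical layout): aligning the first member raises
// the alignment of the whole struct, so each object starts on its own line
// and sizeof is rounded up to a multiple of the line size.
struct AlignedWaiter {
  alignas(kCacheLine) std::atomic<AlignedWaiter*> next{nullptr};
  std::uint32_t state{0};
};

static_assert(alignof(AlignedWaiter) == kCacheLine,
              "aligning the first member aligns the struct");
static_assert(sizeof(AlignedWaiter) % kCacheLine == 0,
              "array elements each start on a fresh cache line");
```

On typical 64-bit targets the aligned variant occupies one cache line, while the padded variant needs more than one, which matches the smaller footprint the commit message mentions.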
Rasmus Munk Larsen
|
811aadbe00
|
Add syntactic sugar to Eigen tensors to allow more natural syntax.
Specifically, this enables expressions involving:
scalar + tensor
scalar * tensor
scalar / tensor
scalar - tensor
|
2016-06-02 12:41:28 -07:00 |
|
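The idea behind 811aadbe00 can be sketched with free operator overloads that take the scalar on the left. ToyTensor and map_left below are hypothetical stand-ins written for this example; they are not Eigen's Tensor class or its implementation. Subtraction and division are not commutative, so the scalar has to stay the left operand rather than being swapped with the tensor.

```cpp
#include <cassert>
#include <vector>

// Toy stand-in for a tensor (hypothetical, NOT Eigen's Tensor class).
struct ToyTensor {
  std::vector<double> data;
};

// Applies f(scalar, element) to every element, keeping the scalar on the left.
template <typename F>
ToyTensor map_left(double s, const ToyTensor& t, F f) {
  ToyTensor out;
  out.data.reserve(t.data.size());
  for (double x : t.data) out.data.push_back(f(s, x));
  return out;
}

// scalar OP tensor overloads; the tensor is the right-hand operand.
inline ToyTensor operator+(double s, const ToyTensor& t) {
  return map_left(s, t, [](double a, double b) { return a + b; });
}
inline ToyTensor operator*(double s, const ToyTensor& t) {
  return map_left(s, t, [](double a, double b) { return a * b; });
}
inline ToyTensor operator/(double s, const ToyTensor& t) {
  return map_left(s, t, [](double a, double b) { return a / b; });
}
inline ToyTensor operator-(double s, const ToyTensor& t) {
  return map_left(s, t, [](double a, double b) { return a - b; });
}
```

With these overloads, an expression such as `8.0 / t` divides the scalar by each element instead of dividing each element by the scalar.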
Igor Babuschkin
|
fbd7ed6ff7
|
Add tensor scan op
This is the initial implementation of a generic scan operation.
Based on this, cumsum and cumprod methods have been added to TensorBase.
|
2016-06-02 13:35:47 +01:00 |
|
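As a rough illustration of what fbd7ed6ff7 describes, the sketch below builds cumsum and cumprod on top of one generic inclusive scan. It works on std::vector rather than on tensors, and the name generic_scan is an assumption made for this example; Eigen's actual scan op and its signatures may differ.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Inclusive scan: out[i] = in[0] op in[1] op ... op in[i].
// Hypothetical helper for illustration; not Eigen's tensor scan op.
template <typename T, typename Op>
std::vector<T> generic_scan(const std::vector<T>& in, Op op) {
  std::vector<T> out;
  out.reserve(in.size());
  for (const T& x : in) {
    // The first element is copied; later ones combine with the running value.
    out.push_back(out.empty() ? x : op(out.back(), x));
  }
  return out;
}

// Cumulative sum expressed as a scan with addition.
template <typename T>
std::vector<T> cumsum(const std::vector<T>& in) {
  return generic_scan(in, std::plus<T>());
}

// Cumulative product expressed as a scan with multiplication.
template <typename T>
std::vector<T> cumprod(const std::vector<T>& in) {
  return generic_scan(in, std::multiplies<T>());
}
```

For the input {1, 2, 3, 4}, cumsum yields {1, 3, 6, 10} and cumprod yields {1, 2, 6, 24}.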
Benoit Steiner
|
0ed08fd281
|
Use a single PacketSize variable
|
2016-06-01 21:19:05 -07:00 |
|
Benoit Steiner
|
8f6fedc55f
|
Fixed compilation warning
|
2016-06-01 21:14:46 -07:00 |
|
Benoit Steiner
|
873e6ac54b
|
Silenced compilation warning generated by nvcc.
|
2016-06-01 14:20:50 -07:00 |
|
Benoit Steiner
|
d27b0ad4c8
|
Added support for mean reductions on fp16
|
2016-06-01 11:12:07 -07:00 |
|
Benoit Steiner
|
5aeb3687c4
|
Only enable optimized reductions of fp16 if the reduction functor supports them
|
2016-05-31 10:33:40 -07:00 |
|
Benoit Steiner
|
e2946d962d
|
Reimplement clamp as a static function.
|
2016-05-27 12:58:43 -07:00 |
|
Benoit Steiner
|
e96d36d4cd
|
Use NULL instead of nullptr to preserve the compatibility with cxx03
|
2016-05-27 12:54:06 -07:00 |
|
Benoit Steiner
|
abc815798b
|
Added a new operation to enable more powerful tensor indexing.
|
2016-05-27 12:22:25 -07:00 |
|
Benoit Steiner
|
1ae2567861
|
Fixed some compilation warnings
|
2016-05-26 15:57:19 -07:00 |
|
Benoit Steiner
|
1a47844529
|
Preserve the ability to vectorize the evaluation of an expression even when it involves a cast that isn't vectorized (e.g. fp16 to float)
|
2016-05-26 14:37:09 -07:00 |
|
Benoit Steiner
|
36369ab63c
|
Resolved merge conflicts
|
2016-05-26 13:39:39 -07:00 |
|
Benoit Steiner
|
28fcb5ca2a
|
Merged latest reduction improvements
|
2016-05-26 12:19:33 -07:00 |
|
Benoit Steiner
|
c1c7f06c35
|
Improved the performance of inner reductions.
|
2016-05-26 11:53:59 -07:00 |
|
Benoit Steiner
|
8288b0aec2
|
Code cleanup.
|
2016-05-26 09:00:04 -07:00 |
|
Benoit Steiner
|
2d7ed54ba2
|
Made the static storage class qualifier come first.
|
2016-05-25 22:16:15 -07:00 |
|
Benoit Steiner
|
e1fca8866e
|
Deleted unnecessary explicit qualifiers.
|
2016-05-25 22:15:26 -07:00 |
|
Benoit Steiner
|
9b0aaf5113
|
Don't mark inline functions as static since it confuses the ICC compiler
|
2016-05-25 22:10:11 -07:00 |
|
Benoit Steiner
|
037a463fd5
|
Marked unused variables as such
|
2016-05-25 22:07:48 -07:00 |
|
Benoit Steiner
|
3ac4045272
|
Made the IndexPair code compile in non cxx11 mode
|
2016-05-25 15:15:12 -07:00 |
|
Benoit Steiner
|
66556d0e05
|
Made the index pair list code more portable across various compilers
|
2016-05-25 14:34:27 -07:00 |
|
Benoit Steiner
|
034aa3b2c0
|
Improved the performance of tensor padding
|
2016-05-25 11:43:08 -07:00 |
|
Benoit Steiner
|
58026905ae
|
Added support for statically known lists of pairs of indices
|
2016-05-25 11:04:14 -07:00 |
|
Benoit Steiner
|
0835667329
|
There is no need to make the fp16 full reduction kernel a static function.
|
2016-05-24 23:11:56 -07:00 |
|
Benoit Steiner
|
b5d6b52a4d
|
Fixed compilation warning
|
2016-05-24 23:10:57 -07:00 |
|
Benoit Steiner
|
a09cbf9905
|
Merged in rmlarsen/eigen (pull request PR-188)
Minor cleanups: 1. Get rid of a few unused variables. 2. Get rid of last uses of EIGEN_USE_COST_MODEL.
|
2016-05-23 12:55:12 -07:00 |
|
Christoph Hertzberg
|
25a03c02d6
|
Fix some sign-compare warnings
|
2016-05-22 16:42:27 +02:00 |
|
Gael Guennebaud
|
ccaace03c9
|
Make EIGEN_HAS_CONSTEXPR user configurable
|
2016-05-20 15:10:08 +02:00 |
|
Gael Guennebaud
|
c3410804cd
|
Make EIGEN_HAS_VARIADIC_TEMPLATES user configurable
|
2016-05-20 15:05:38 +02:00 |
|
Gael Guennebaud
|
48bf5ec216
|
Make EIGEN_HAS_RVALUE_REFERENCES user configurable
|
2016-05-20 14:54:20 +02:00 |
|
Gael Guennebaud
|
f43ae88892
|
Rename EIGEN_HAVE_RVALUE_REFERENCES to EIGEN_HAS_RVALUE_REFERENCES
|
2016-05-20 14:48:51 +02:00 |
|
Rasmus Larsen
|
b1e080c752
|
Merged eigen/eigen into default
|
2016-05-18 15:21:50 -07:00 |
|
Rasmus Munk Larsen
|
5624219b6b
|
Merge.
|
2016-05-18 15:16:06 -07:00 |
|
Rasmus Munk Larsen
|
7df811cfe5
|
Minor cleanups: 1. Get rid of unused variables. 2. Get rid of last uses of EIGEN_USE_COST_MODEL.
|
2016-05-18 15:09:48 -07:00 |
|
Benoit Steiner
|
bb3ff8e9d9
|
Advertise the packet API of the tensor reducers iff the corresponding packet primitives are available.
|
2016-05-18 14:52:49 -07:00 |
|
Rasmus Munk Larsen
|
f519fca72b
|
Reduce overhead for small tensors and cheap ops by short-circuiting the cost computation and block size calculation in parallelFor.
|
2016-05-17 16:06:00 -07:00 |
|
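The short-circuit described in f519fca72b can be sketched as an early return taken before any block-size arithmetic. Everything below is an assumption made for illustration: the Plan struct, the function name plan_parallel_for, the threshold value, and the even block split are not Eigen's actual parallelFor logic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical threshold below which splitting the work is assumed not to pay off.
constexpr double kMinParallelCost = 100000.0;

// Result of planning: how many blocks to run and how many items per block.
struct Plan {
  std::size_t num_blocks;
  std::size_t block_size;
};

// Decides how to split n items of roughly cost_per_item each across threads.
Plan plan_parallel_for(std::size_t n, double cost_per_item,
                       std::size_t num_threads) {
  assert(num_threads > 0);
  // Fast path: tiny or cheap workloads skip the block-size computation and
  // run as a single block on the calling thread.
  if (n <= 1 || num_threads == 1 ||
      static_cast<double>(n) * cost_per_item < kMinParallelCost) {
    return Plan{n == 0 ? std::size_t{0} : std::size_t{1}, n};
  }
  // Slow path: split evenly, one block per thread at most.
  const std::size_t blocks = std::min(n, num_threads);
  const std::size_t block_size = (n + blocks - 1) / blocks;  // ceiling division
  const std::size_t num_blocks = (n + block_size - 1) / block_size;
  return Plan{num_blocks, block_size};
}
```

Under these assumptions, ten unit-cost items on eight threads stay in one block, while a million unit-cost items are split into eight blocks of 125000 items each.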
Benoit Steiner
|
86ae94462e
|
#if defined(EIGEN_USE_NONBLOCKING_THREAD_POOL) is now #if !defined(EIGEN_USE_SIMPLE_THREAD_POOL): the non blocking thread pool is the default since it's more scalable, and one needs to request the old thread pool explicitly.
|
2016-05-17 14:06:15 -07:00 |
|
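The macro change in 86ae94462e turns an opt-in into an opt-out. A minimal sketch of that pattern follows; the two pool structs and the DefaultPool alias are hypothetical placeholders, and only the macro name EIGEN_USE_SIMPLE_THREAD_POOL comes from the commit message.

```cpp
#include <cassert>
#include <type_traits>

// Placeholder pool types (hypothetical; not Eigen's thread pool classes).
struct NonBlockingPool {
  static constexpr bool is_non_blocking() { return true; }
};
struct SimplePool {
  static constexpr bool is_non_blocking() { return false; }
};

// Opt-out selection: the non-blocking pool is used unless the user defines
// EIGEN_USE_SIMPLE_THREAD_POOL before this point.
#if !defined(EIGEN_USE_SIMPLE_THREAD_POOL)
using DefaultPool = NonBlockingPool;
#else
using DefaultPool = SimplePool;
#endif
```

A build that wants the old behavior would define EIGEN_USE_SIMPLE_THREAD_POOL, for example with a -D compiler flag, before the headers are included.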
Benoit Steiner
|
997c335970
|
Fixed compilation error
|
2016-05-17 12:54:18 -07:00 |
|
Benoit Steiner
|
ebf6ada5ee
|
Fixed compilation error in the tensor thread pool
|
2016-05-17 12:33:46 -07:00 |
|
Rasmus Munk Larsen
|
0bb61b04ca
|
Merge upstream.
|
2016-05-17 10:26:10 -07:00 |
|
Rasmus Munk Larsen
|
0dbd68145f
|
Roll back changes to core. Move include of TensorFunctors.h up to satisfy dependence in TensorCostModel.h.
|
2016-05-17 10:25:19 -07:00 |
|
Rasmus Larsen
|
00228f2506
|
Merged eigen/eigen into default
|
2016-05-17 09:49:31 -07:00 |
|
Benoit Steiner
|
e7e64c3277
|
Enable the use of the packet API to evaluate tensor broadcasts. This speeds things up quite a bit:
Before:
BM_broadcasting/10 500000 3690 27.10 MFlops/s
BM_broadcasting/80 500000 4014 1594.24 MFlops/s
BM_broadcasting/640 100000 14770 27731.35 MFlops/s
BM_broadcasting/4K 5000 632711 39512.48 MFlops/s
After:
BM_broadcasting/10 500000 4287 23.33 MFlops/s
BM_broadcasting/80 500000 4455 1436.41 MFlops/s
BM_broadcasting/640 200000 10195 40173.01 MFlops/s
BM_broadcasting/4K 5000 423746 58997.57 MFlops/s
|
2016-05-17 09:24:35 -07:00 |
|
Benoit Steiner
|
5fa27574dd
|
Allow vectorized padding on GPU. This helps speed things up a little.
Before:
BM_padding/10 5000000 460 217.03 MFlops/s
BM_padding/80 5000000 460 13899.40 MFlops/s
BM_padding/640 5000000 461 888421.17 MFlops/s
BM_padding/4K 5000000 460 54316322.55 MFlops/s
After:
BM_padding/10 5000000 454 220.20 MFlops/s
BM_padding/80 5000000 455 14039.86 MFlops/s
BM_padding/640 5000000 452 904968.83 MFlops/s
BM_padding/4K 5000000 411 60750049.21 MFlops/s
|
2016-05-17 09:17:26 -07:00 |
|
Benoit Steiner
|
8d06c02ffd
|
Allow vectorized padding on GPU. This helps speed things up a little.
Before:
BM_padding/10 5000000 460 217.03 MFlops/s
BM_padding/80 5000000 460 13899.40 MFlops/s
BM_padding/640 5000000 461 888421.17 MFlops/s
BM_padding/4K 5000000 460 54316322.55 MFlops/s
After:
BM_padding/10 5000000 454 220.20 MFlops/s
BM_padding/80 5000000 455 14039.86 MFlops/s
BM_padding/640 5000000 452 904968.83 MFlops/s
BM_padding/4K 5000000 411 60750049.21 MFlops/s
|
2016-05-17 09:13:27 -07:00 |
|
Benoit Steiner
|
a80d875916
|
Added missing costPerCoeff method
|
2016-05-16 09:31:10 -07:00 |
|
Benoit Steiner
|
83ef39e055
|
Turn on the cost model by default. This results in some significant speedups for smaller tensors. For example, below are the results for the various tensor reductions.
Before:
BM_colReduction_12T/10 1000000 1949 51.29 MFlops/s
BM_colReduction_12T/80 100000 15636 409.29 MFlops/s
BM_colReduction_12T/640 20000 95100 4307.01 MFlops/s
BM_colReduction_12T/4K 500 4573423 5466.36 MFlops/s
BM_colReduction_4T/10 1000000 1867 53.56 MFlops/s
BM_colReduction_4T/80 500000 5288 1210.11 MFlops/s
BM_colReduction_4T/640 10000 106924 3830.75 MFlops/s
BM_colReduction_4T/4K 500 9946374 2513.48 MFlops/s
BM_colReduction_8T/10 1000000 1912 52.30 MFlops/s
BM_colReduction_8T/80 200000 8354 766.09 MFlops/s
BM_colReduction_8T/640 20000 85063 4815.22 MFlops/s
BM_colReduction_8T/4K 500 5445216 4591.19 MFlops/s
BM_rowReduction_12T/10 1000000 2041 48.99 MFlops/s
BM_rowReduction_12T/80 100000 15426 414.87 MFlops/s
BM_rowReduction_12T/640 50000 39117 10470.98 MFlops/s
BM_rowReduction_12T/4K 500 3034298 8239.14 MFlops/s
BM_rowReduction_4T/10 1000000 1834 54.51 MFlops/s
BM_rowReduction_4T/80 500000 5406 1183.81 MFlops/s
BM_rowReduction_4T/640 50000 35017 11697.16 MFlops/s
BM_rowReduction_4T/4K 500 3428527 7291.76 MFlops/s
BM_rowReduction_8T/10 1000000 1925 51.95 MFlops/s
BM_rowReduction_8T/80 200000 8519 751.23 MFlops/s
BM_rowReduction_8T/640 50000 33441 12248.42 MFlops/s
BM_rowReduction_8T/4K 1000 2852841 8763.19 MFlops/s
After:
BM_colReduction_12T/10 50000000 59 1678.30 MFlops/s
BM_colReduction_12T/80 5000000 725 8822.71 MFlops/s
BM_colReduction_12T/640 20000 90882 4506.93 MFlops/s
BM_colReduction_12T/4K 500 4668855 5354.63 MFlops/s
BM_colReduction_4T/10 50000000 59 1687.37 MFlops/s
BM_colReduction_4T/80 5000000 737 8681.24 MFlops/s
BM_colReduction_4T/640 50000 108637 3770.34 MFlops/s
BM_colReduction_4T/4K 500 7912954 3159.38 MFlops/s
BM_colReduction_8T/10 50000000 60 1657.21 MFlops/s
BM_colReduction_8T/80 5000000 726 8812.48 MFlops/s
BM_colReduction_8T/640 20000 91451 4478.90 MFlops/s
BM_colReduction_8T/4K 500 5441692 4594.16 MFlops/s
BM_rowReduction_12T/10 20000000 93 1065.28 MFlops/s
BM_rowReduction_12T/80 2000000 950 6730.96 MFlops/s
BM_rowReduction_12T/640 50000 38196 10723.48 MFlops/s
BM_rowReduction_12T/4K 500 3019217 8280.29 MFlops/s
BM_rowReduction_4T/10 20000000 93 1064.30 MFlops/s
BM_rowReduction_4T/80 2000000 959 6667.71 MFlops/s
BM_rowReduction_4T/640 50000 37433 10941.96 MFlops/s
BM_rowReduction_4T/4K 500 3036476 8233.23 MFlops/s
BM_rowReduction_8T/10 20000000 93 1072.47 MFlops/s
BM_rowReduction_8T/80 2000000 959 6670.04 MFlops/s
BM_rowReduction_8T/640 50000 38069 10759.37 MFlops/s
BM_rowReduction_8T/4K 1000 2758988 9061.29 MFlops/s
|
2016-05-16 08:55:21 -07:00 |
|