Benoit Steiner
c3c8ad8046
Align the first element of the Waiter struct instead of padding it. This slightly reduces its memory footprint while still preventing false sharing.
2016-06-02 21:17:41 -07:00
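The difference between padding and aligning can be sketched as follows. This is a minimal illustration, not the actual Waiter definition: the member names, the 64-byte cache-line size, and both struct names are assumptions made for the example.

```cpp
#include <atomic>
#include <cstddef>

// Assumed cache-line size; 64 bytes is common but not universal.
constexpr std::size_t kCacheLine = 64;

// Hypothetical padded variant: a full cache line of filler bytes is
// appended after the payload, so the struct is larger than one line.
struct PaddedWaiter {
  std::atomic<PaddedWaiter*> next;
  unsigned state;
  char pad[kCacheLine];
};

// Hypothetical aligned variant: aligning the first member raises the
// alignment of the whole struct, so sizeof is rounded up to a multiple
// of the cache line and array neighbours never share a line.
struct AlignedWaiter {
  alignas(kCacheLine) std::atomic<AlignedWaiter*> next;
  unsigned state;
};

static_assert(alignof(AlignedWaiter) == kCacheLine,
              "the struct inherits the alignment of its first member");
static_assert(sizeof(AlignedWaiter) % kCacheLine == 0,
              "each element occupies whole cache lines");
static_assert(sizeof(AlignedWaiter) < sizeof(PaddedWaiter),
              "aligning costs less memory than padding here");
```

With these assumed members the aligned struct fits in a single cache line, while the padded one spills into a second.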
Eugene Brevdo
39baff850c
Add TernaryFunctors and the betainc SpecialFunction.
...
TernaryFunctors and their executors allow operations on 3-tuples of inputs.
API fully implemented for Arrays and Tensors based on binary functors.
Ported the cephes betainc function (regularized incomplete beta
integral) to Eigen, with support for CPU and GPU, floats, doubles, and
half types.
Added unit tests in array.cpp and cxx11_tensor_cuda.cu
Collapsed revision
* Merged helper methods for betainc across floats and doubles.
* Added TensorGlobalFunctions with betainc(). Removed betainc() from TensorBase.
* Clean up CwiseTernaryOp checks, change igamma_helper to cephes_helper.
* betainc: merge incbcf and incbd into incbeta_cfe, and more cleanup.
* Update TernaryOp and SpecialFunctions (betainc) based on review comments.
2016-06-02 17:04:19 -07:00
Benoit Steiner
c21eaedce6
Use array_prod to compute the number of elements contained in the input tensor expression
2016-06-04 07:47:04 -07:00
Benoit Steiner
36a4500822
Merged in ibab/eigen (pull request PR-192)
...
Add generic scan method
2016-06-03 17:28:33 -07:00
Benoit Steiner
c2a102345f
Improved the performance of full reductions.
...
AFTER:
BM_fullReduction/10 4541 4543 154017 21.0M items/s
BM_fullReduction/64 5191 5193 100000 752.5M items/s
BM_fullReduction/512 9588 9588 71361 25.5G items/s
BM_fullReduction/4k 244314 244281 2863 64.0G items/s
BM_fullReduction/5k 359382 359363 1946 64.8G items/s
BEFORE:
BM_fullReduction/10 9085 9087 74395 10.5M items/s
BM_fullReduction/64 9478 9478 72014 412.1M items/s
BM_fullReduction/512 14643 14646 46902 16.7G items/s
BM_fullReduction/4k 260338 260384 2678 60.0G items/s
BM_fullReduction/5k 385076 385178 1818 60.5G items/s
2016-06-03 17:27:08 -07:00
Igor Babuschkin
dc03b8f3a1
Add generic scan method
2016-06-03 17:37:04 +01:00
Rasmus Munk Larsen
811aadbe00
Add syntactic sugar to Eigen tensors to allow more natural syntax.
...
Specifically, this enables expressions involving:
scalar + tensor
scalar * tensor
scalar / tensor
scalar - tensor
2016-06-02 12:41:28 -07:00
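The idea behind scalar-on-the-left expressions can be sketched with free operator overloads. This is a self-contained illustration only: `MiniTensor` and `scalar_lhs` are hypothetical names, and the real tensor classes build lazy expression objects rather than evaluating eagerly as done here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tensor: a flat buffer of floats.
struct MiniTensor {
  std::vector<float> data;
};

// Applies f(scalar, element) to every element, keeping the scalar on
// the left so that non-commutative operations come out right.
template <typename F>
MiniTensor scalar_lhs(float s, const MiniTensor& t, F f) {
  MiniTensor out;
  out.data.resize(t.data.size());
  for (std::size_t i = 0; i < t.data.size(); ++i) {
    out.data[i] = f(s, t.data[i]);
  }
  return out;
}

// Free functions are required because the left operand is a built-in type.
inline MiniTensor operator+(float s, const MiniTensor& t) {
  return scalar_lhs(s, t, [](float a, float b) { return a + b; });
}
inline MiniTensor operator*(float s, const MiniTensor& t) {
  return scalar_lhs(s, t, [](float a, float b) { return a * b; });
}
inline MiniTensor operator/(float s, const MiniTensor& t) {
  return scalar_lhs(s, t, [](float a, float b) { return a / b; });
}
inline MiniTensor operator-(float s, const MiniTensor& t) {
  return scalar_lhs(s, t, [](float a, float b) { return a - b; });
}
```

Division and subtraction are the cases where operand order matters: `8.f / t` divides the scalar by each element, not the other way round.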
Igor Babuschkin
fbd7ed6ff7
Add tensor scan op
...
This is the initial implementation of a generic scan operation.
Based on this, cumsum and cumprod methods have been added to TensorBase.
2016-06-02 13:35:47 +01:00
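How cumsum and cumprod fall out of one generic scan can be sketched on a plain 1-D sequence. This is an assumption-laden illustration: the function names mirror the commit message, but the signatures are invented for the example, and the tensor version scans along a chosen axis rather than over a `std::vector`.

```cpp
#include <cstddef>
#include <vector>

// Inclusive scan: out[i] = in[0] op in[1] op ... op in[i].
template <typename T, typename Op>
std::vector<T> scan(const std::vector<T>& in, Op op) {
  std::vector<T> out;
  out.reserve(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    // The first element is copied; later ones combine with the running value.
    out.push_back(i == 0 ? in[0] : op(out[i - 1], in[i]));
  }
  return out;
}

// Cumulative sum: scan with addition.
template <typename T>
std::vector<T> cumsum(const std::vector<T>& in) {
  return scan(in, [](const T& a, const T& b) { return a + b; });
}

// Cumulative product: scan with multiplication.
template <typename T>
std::vector<T> cumprod(const std::vector<T>& in) {
  return scan(in, [](const T& a, const T& b) { return a * b; });
}
```

Any other associative binary operation (max, min, logical or) can be plugged into `scan` the same way.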
Benoit Steiner
0ed08fd281
Use a single PacketSize variable
2016-06-01 21:19:05 -07:00
Benoit Steiner
8f6fedc55f
Fixed compilation warning
2016-06-01 21:14:46 -07:00
Benoit Steiner
873e6ac54b
Silenced compilation warning generated by nvcc.
2016-06-01 14:20:50 -07:00
Benoit Steiner
d27b0ad4c8
Added support for mean reductions on fp16
2016-06-01 11:12:07 -07:00
Benoit Steiner
5aeb3687c4
Only enable optimized reductions of fp16 if the reduction functor supports them
2016-05-31 10:33:40 -07:00
Benoit Steiner
e2946d962d
Reimplement clamp as a static function.
2016-05-27 12:58:43 -07:00
Benoit Steiner
e96d36d4cd
Use NULL instead of nullptr to preserve the compatibility with cxx03
2016-05-27 12:54:06 -07:00
Benoit Steiner
abc815798b
Added a new operation to enable more powerful tensor indexing.
2016-05-27 12:22:25 -07:00
Benoit Steiner
1ae2567861
Fixed some compilation warnings
2016-05-26 15:57:19 -07:00
Benoit Steiner
1a47844529
Preserve the ability to vectorize the evaluation of an expression even when it involves a cast that isn't vectorized (e.g. fp16 to float)
2016-05-26 14:37:09 -07:00
Benoit Steiner
36369ab63c
Resolved merge conflicts
2016-05-26 13:39:39 -07:00
Benoit Steiner
28fcb5ca2a
Merged latest reduction improvements
2016-05-26 12:19:33 -07:00
Benoit Steiner
c1c7f06c35
Improved the performance of inner reductions.
2016-05-26 11:53:59 -07:00
Benoit Steiner
8288b0aec2
Code cleanup.
2016-05-26 09:00:04 -07:00
Benoit Steiner
2d7ed54ba2
Made the static storage class qualifier come first.
2016-05-25 22:16:15 -07:00
Benoit Steiner
e1fca8866e
Deleted unnecessary explicit qualifiers.
2016-05-25 22:15:26 -07:00
Benoit Steiner
9b0aaf5113
Don't mark inline functions as static since it confuses the ICC compiler
2016-05-25 22:10:11 -07:00
Benoit Steiner
037a463fd5
Marked unused variables as such
2016-05-25 22:07:48 -07:00
Benoit Steiner
3ac4045272
Made the IndexPair code compile in non cxx11 mode
2016-05-25 15:15:12 -07:00
Benoit Steiner
66556d0e05
Made the index pair list code more portable across various compilers
2016-05-25 14:34:27 -07:00
Benoit Steiner
034aa3b2c0
Improved the performance of tensor padding
2016-05-25 11:43:08 -07:00
Benoit Steiner
58026905ae
Added support for statically known lists of pairs of indices
2016-05-25 11:04:14 -07:00
Benoit Steiner
0835667329
There is no need to make the fp16 full reduction kernel a static function.
2016-05-24 23:11:56 -07:00
Benoit Steiner
b5d6b52a4d
Fixed compilation warning
2016-05-24 23:10:57 -07:00
Benoit Steiner
a09cbf9905
Merged in rmlarsen/eigen (pull request PR-188)
...
Minor cleanups: 1. Get rid of a few unused variables. 2. Get rid of last uses of EIGEN_USE_COST_MODEL.
2016-05-23 12:55:12 -07:00
Christoph Hertzberg
25a03c02d6
Fix some sign-compare warnings
2016-05-22 16:42:27 +02:00
Gael Guennebaud
ccaace03c9
Make EIGEN_HAS_CONSTEXPR user configurable
2016-05-20 15:10:08 +02:00
Gael Guennebaud
c3410804cd
Make EIGEN_HAS_VARIADIC_TEMPLATES user configurable
2016-05-20 15:05:38 +02:00
Gael Guennebaud
48bf5ec216
Make EIGEN_HAS_RVALUE_REFERENCES user configurable
2016-05-20 14:54:20 +02:00
Gael Guennebaud
f43ae88892
Rename EIGEN_HAVE_RVALUE_REFERENCES to EIGEN_HAS_RVALUE_REFERENCES
2016-05-20 14:48:51 +02:00
Rasmus Larsen
b1e080c752
Merged eigen/eigen into default
2016-05-18 15:21:50 -07:00
Rasmus Munk Larsen
5624219b6b
Merge.
2016-05-18 15:16:06 -07:00
Rasmus Munk Larsen
7df811cfe5
Minor cleanups: 1. Get rid of unused variables. 2. Get rid of last uses of EIGEN_USE_COST_MODEL.
2016-05-18 15:09:48 -07:00
Benoit Steiner
bb3ff8e9d9
Advertise the packet api of the tensor reducers iff the corresponding packet primitives are available.
2016-05-18 14:52:49 -07:00
Rasmus Munk Larsen
f519fca72b
Reduce overhead for small tensors and cheap ops by short-circuiting the cost computation and block size calculation in parallelFor.
2016-05-17 16:06:00 -07:00
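One way to read the short-circuit described above is as an early return before any sharding work is done. The sketch below is hypothetical throughout: `plan_threads`, its parameters, and the `kMinCostPerThread` threshold are invented for illustration and are not taken from the parallelFor implementation.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical amount of work (in abstract cost units) that justifies
// handing a shard to another thread.
constexpr double kMinCostPerThread = 50000.0;

// Returns how many threads a loop of n items should use, given an
// estimated cost per item and the number of available threads.
std::size_t plan_threads(std::size_t n, double cost_per_item,
                         std::size_t max_threads) {
  if (n == 0 || max_threads <= 1) return 1;
  const double total = static_cast<double>(n) * cost_per_item;
  // Fast path: the whole loop is cheap, so skip the sharding arithmetic
  // below and run everything on the calling thread.
  if (total <= kMinCostPerThread) return 1;
  const std::size_t wanted =
      static_cast<std::size_t>(total / kMinCostPerThread);
  // Never use more threads than are available or than there are items.
  return std::max<std::size_t>(1, std::min(std::min(wanted, max_threads), n));
}
```

For small tensors the fast path returns immediately, which is where the overhead saving would come from under these assumptions.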
Benoit Steiner
86ae94462e
#if defined(EIGEN_USE_NONBLOCKING_THREAD_POOL) is now #if !defined(EIGEN_USE_SIMPLE_THREAD_POOL): the non blocking thread pool is the default since it's more scalable, and one needs to request the old thread pool explicitly.
2016-05-17 14:06:15 -07:00
Benoit Steiner
997c335970
Fixed compilation error
2016-05-17 12:54:18 -07:00
Benoit Steiner
ebf6ada5ee
Fixed compilation error in the tensor thread pool
2016-05-17 12:33:46 -07:00
Rasmus Munk Larsen
0bb61b04ca
Merge upstream.
2016-05-17 10:26:10 -07:00
Rasmus Munk Larsen
0dbd68145f
Roll back changes to core. Move include of TensorFunctors.h up to satisfy dependence in TensorCostModel.h.
2016-05-17 10:25:19 -07:00
Rasmus Larsen
00228f2506
Merged eigen/eigen into default
2016-05-17 09:49:31 -07:00
Benoit Steiner
e7e64c3277
Enable the use of the packet api to evaluate tensor broadcasts. This speeds things up quite a bit:
...
Before:
BM_broadcasting/10 500000 3690 27.10 MFlops/s
BM_broadcasting/80 500000 4014 1594.24 MFlops/s
BM_broadcasting/640 100000 14770 27731.35 MFlops/s
BM_broadcasting/4K 5000 632711 39512.48 MFlops/s
After:
BM_broadcasting/10 500000 4287 23.33 MFlops/s
BM_broadcasting/80 500000 4455 1436.41 MFlops/s
BM_broadcasting/640 200000 10195 40173.01 MFlops/s
BM_broadcasting/4K 5000 423746 58997.57 MFlops/s
2016-05-17 09:24:35 -07:00
Benoit Steiner
5fa27574dd
Allow vectorized padding on GPU. This helps speed things up a little
...
Before:
BM_padding/10 5000000 460 217.03 MFlops/s
BM_padding/80 5000000 460 13899.40 MFlops/s
BM_padding/640 5000000 461 888421.17 MFlops/s
BM_padding/4K 5000000 460 54316322.55 MFlops/s
After:
BM_padding/10 5000000 454 220.20 MFlops/s
BM_padding/80 5000000 455 14039.86 MFlops/s
BM_padding/640 5000000 452 904968.83 MFlops/s
BM_padding/4K 5000000 411 60750049.21 MFlops/s
2016-05-17 09:17:26 -07:00
Benoit Steiner
8d06c02ffd
Allow vectorized padding on GPU. This helps speed things up a little.
...
Before:
BM_padding/10 5000000 460 217.03 MFlops/s
BM_padding/80 5000000 460 13899.40 MFlops/s
BM_padding/640 5000000 461 888421.17 MFlops/s
BM_padding/4K 5000000 460 54316322.55 MFlops/s
After:
BM_padding/10 5000000 454 220.20 MFlops/s
BM_padding/80 5000000 455 14039.86 MFlops/s
BM_padding/640 5000000 452 904968.83 MFlops/s
BM_padding/4K 5000000 411 60750049.21 MFlops/s
2016-05-17 09:13:27 -07:00
Benoit Steiner
a80d875916
Added missing costPerCoeff method
2016-05-16 09:31:10 -07:00
Benoit Steiner
83ef39e055
Turn on the cost model by default. This results in some significant speedups for smaller tensors. For example, below are the results for the various tensor reductions.
...
Before:
BM_colReduction_12T/10 1000000 1949 51.29 MFlops/s
BM_colReduction_12T/80 100000 15636 409.29 MFlops/s
BM_colReduction_12T/640 20000 95100 4307.01 MFlops/s
BM_colReduction_12T/4K 500 4573423 5466.36 MFlops/s
BM_colReduction_4T/10 1000000 1867 53.56 MFlops/s
BM_colReduction_4T/80 500000 5288 1210.11 MFlops/s
BM_colReduction_4T/640 10000 106924 3830.75 MFlops/s
BM_colReduction_4T/4K 500 9946374 2513.48 MFlops/s
BM_colReduction_8T/10 1000000 1912 52.30 MFlops/s
BM_colReduction_8T/80 200000 8354 766.09 MFlops/s
BM_colReduction_8T/640 20000 85063 4815.22 MFlops/s
BM_colReduction_8T/4K 500 5445216 4591.19 MFlops/s
BM_rowReduction_12T/10 1000000 2041 48.99 MFlops/s
BM_rowReduction_12T/80 100000 15426 414.87 MFlops/s
BM_rowReduction_12T/640 50000 39117 10470.98 MFlops/s
BM_rowReduction_12T/4K 500 3034298 8239.14 MFlops/s
BM_rowReduction_4T/10 1000000 1834 54.51 MFlops/s
BM_rowReduction_4T/80 500000 5406 1183.81 MFlops/s
BM_rowReduction_4T/640 50000 35017 11697.16 MFlops/s
BM_rowReduction_4T/4K 500 3428527 7291.76 MFlops/s
BM_rowReduction_8T/10 1000000 1925 51.95 MFlops/s
BM_rowReduction_8T/80 200000 8519 751.23 MFlops/s
BM_rowReduction_8T/640 50000 33441 12248.42 MFlops/s
BM_rowReduction_8T/4K 1000 2852841 8763.19 MFlops/s
After:
BM_colReduction_12T/10 50000000 59 1678.30 MFlops/s
BM_colReduction_12T/80 5000000 725 8822.71 MFlops/s
BM_colReduction_12T/640 20000 90882 4506.93 MFlops/s
BM_colReduction_12T/4K 500 4668855 5354.63 MFlops/s
BM_colReduction_4T/10 50000000 59 1687.37 MFlops/s
BM_colReduction_4T/80 5000000 737 8681.24 MFlops/s
BM_colReduction_4T/640 50000 108637 3770.34 MFlops/s
BM_colReduction_4T/4K 500 7912954 3159.38 MFlops/s
BM_colReduction_8T/10 50000000 60 1657.21 MFlops/s
BM_colReduction_8T/80 5000000 726 8812.48 MFlops/s
BM_colReduction_8T/640 20000 91451 4478.90 MFlops/s
BM_colReduction_8T/4K 500 5441692 4594.16 MFlops/s
BM_rowReduction_12T/10 20000000 93 1065.28 MFlops/s
BM_rowReduction_12T/80 2000000 950 6730.96 MFlops/s
BM_rowReduction_12T/640 50000 38196 10723.48 MFlops/s
BM_rowReduction_12T/4K 500 3019217 8280.29 MFlops/s
BM_rowReduction_4T/10 20000000 93 1064.30 MFlops/s
BM_rowReduction_4T/80 2000000 959 6667.71 MFlops/s
BM_rowReduction_4T/640 50000 37433 10941.96 MFlops/s
BM_rowReduction_4T/4K 500 3036476 8233.23 MFlops/s
BM_rowReduction_8T/10 20000000 93 1072.47 MFlops/s
BM_rowReduction_8T/80 2000000 959 6670.04 MFlops/s
BM_rowReduction_8T/640 50000 38069 10759.37 MFlops/s
BM_rowReduction_8T/4K 1000 2758988 9061.29 MFlops/s
2016-05-16 08:55:21 -07:00
Benoit Steiner
b789a26804
Fixed syntax error
2016-05-16 08:51:08 -07:00
Benoit Steiner
83dfb40f66
Turn on the new thread pool by default since it scales much better over multiple cores. It is still possible to revert to the old thread pool by compiling with the EIGEN_USE_SIMPLE_THREAD_POOL define.
2016-05-13 17:23:15 -07:00
Benoit Steiner
97605c7b27
New multithreaded contraction that doesn't rely on the thread pool to run the closures in the order in which they are enqueued. This is needed in order to switch to the new non blocking thread pool, since this new thread pool can execute the closures in any order.
2016-05-13 17:11:29 -07:00
Benoit Steiner
c4fc8b70ec
Removed unnecessary thread synchronization
2016-05-13 10:49:38 -07:00
Benoit Steiner
7aa3557d31
Fixed compilation errors triggered by old versions of gcc
2016-05-12 18:59:04 -07:00
Rasmus Munk Larsen
5005b27fc8
Disabled cost model by accident. Revert.
2016-05-12 16:55:21 -07:00
Rasmus Munk Larsen
989e419328
Address comments by bsteiner.
2016-05-12 16:54:19 -07:00
Rasmus Munk Larsen
e55deb21c5
Improvements to parallelFor.
...
Move some scalar functors from TensorFunctors.h to Eigen core.
2016-05-12 14:07:22 -07:00
Benoit Steiner
ae9688f313
Worked around a compilation error triggered by nvcc when compiling a tensor concatenation kernel.
2016-05-12 12:06:51 -07:00
Benoit Steiner
2a54b70d45
Fixed potential race condition in the non blocking thread pool
2016-05-12 11:45:48 -07:00
Benoit Steiner
a071629fec
Replace implicit cast with an explicit one
2016-05-12 10:40:07 -07:00
Benoit Steiner
2f9401b061
Worked around compilation errors with older versions of gcc
2016-05-11 23:39:20 -07:00
Benoit Steiner
09653e1f82
Improved the portability of the tensor code
2016-05-11 23:29:09 -07:00
Benoit Steiner
b6a517c47d
Added the ability to load fp16 using the texture path.
...
Improved the performance of some reductions on fp16
2016-05-11 21:26:48 -07:00
Christoph Hertzberg
2150f13d65
fixed some double-promotion and sign-compare warnings
2016-05-11 23:02:26 +02:00
Benoit Steiner
217d984abc
Fixed a typo in my previous commit
2016-05-11 10:22:15 -07:00
Benoit Steiner
08348b4e48
Fix potential race condition in the CUDA reduction code.
2016-05-11 10:08:51 -07:00
Benoit Steiner
6a5717dc74
Explicitly initialize all the atomic variables.
2016-05-11 10:04:41 -07:00
Benoit Steiner
4ede059de1
Properly gate the use of half2.
2016-05-10 17:04:01 -07:00
Benoit Steiner
661e710092
Added support for fp16 to the sigmoid functor.
2016-05-10 12:25:27 -07:00
Benoit Steiner
0eb69b7552
Small improvement to the full reduction of fp16
2016-05-10 11:58:18 -07:00
Benoit Steiner
4013b8feca
Simplified the reduction code a little.
2016-05-10 09:40:42 -07:00
Benoit Steiner
4670d7d5ce
Improved the performance of full reductions on GPU:
...
Before:
BM_fullReduction/10 200000 11751 8.51 MFlops/s
BM_fullReduction/80 5000 523385 12.23 MFlops/s
BM_fullReduction/640 50 36179326 11.32 MFlops/s
BM_fullReduction/4K 1 2173517195 11.50 MFlops/s
After:
BM_fullReduction/10 500000 5987 16.70 MFlops/s
BM_fullReduction/80 200000 10636 601.73 MFlops/s
BM_fullReduction/640 50000 58428 7010.31 MFlops/s
BM_fullReduction/4K 1000 2006106 12461.95 MFlops/s
2016-05-09 17:09:54 -07:00
Benoit Steiner
c3859a2b58
Added the ability to use a scratch buffer in cuda kernels
2016-05-09 17:05:53 -07:00
Benoit Steiner
ba95e43ea2
Added a new parallelFor api to the thread pool device.
2016-05-09 10:45:12 -07:00
Benoit Steiner
dc7dbc2df7
Optimized the non blocking thread pool:
...
* Use a pseudo-random permutation of queue indices during random stealing. This ensures that all the queues are considered.
* Directly pop from a non-empty queue when we are waiting for work,
instead of first noticing that there is a non-empty queue and
then doing another round of random stealing to re-discover the non-empty
queue.
* Steal only 1 task from a remote queue instead of half of tasks.
2016-05-09 10:17:17 -07:00
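The first bullet above, visiting every queue in a pseudo-random order, can be sketched with a stride that is coprime to the number of queues. This is a minimal illustration under stated assumptions: `steal_order` and `gcd_u` are hypothetical helper names, and the random value `r` is passed in rather than drawn from a per-thread generator.

```cpp
#include <cstddef>
#include <vector>

// Greatest common divisor (Euclid's algorithm).
inline unsigned gcd_u(unsigned a, unsigned b) {
  while (b != 0) {
    const unsigned t = a % b;
    a = b;
    b = t;
  }
  return a;
}

// Returns the order in which n queues are probed for a random value r.
// Stepping by an increment coprime to n visits each queue exactly once.
std::vector<unsigned> steal_order(unsigned n, unsigned r) {
  std::vector<unsigned> order;
  if (n == 0) return order;
  // Collect all increments in [1, n] that are coprime to n.
  std::vector<unsigned> coprimes;
  for (unsigned i = 1; i <= n; ++i) {
    if (gcd_u(i, n) == 1) coprimes.push_back(i);
  }
  const unsigned inc = coprimes[r % coprimes.size()];
  unsigned victim = r % n;
  for (unsigned i = 0; i < n; ++i) {
    order.push_back(victim);
    victim += inc;
    if (victim >= n) victim -= n;  // inc <= n, so one subtraction suffices
  }
  return order;
}
```

For six queues and `r = 7`, the increment is 5 and the probe starts at queue 1, so the order is 1, 0, 5, 4, 3, 2. In a real pool the coprime table would be computed once rather than on every steal attempt.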
Benoit Steiner
c54ae65c83
Marked a few tensor operations as read only
2016-05-05 17:18:47 -07:00
Benoit Steiner
910e013506
Relaxed an assertion that was tighter than necessary.
2016-05-05 15:38:16 -07:00
Benoit Steiner
a4d6e8fef0
Strongly hint but don't force the compiler to unroll some loops in the tensor executor. This results in up to 27% faster code.
2016-05-05 09:25:55 -07:00
Benoit Steiner
f363e533aa
Added tests for full contractions using thread pools and gpu devices.
...
Fixed a couple of issues in the corresponding code.
2016-05-05 09:05:45 -07:00
Benoit Steiner
06d774bf58
Updated the contraction code to ensure that full contractions return a tensor of rank 0
2016-05-05 08:37:47 -07:00
Benoit Steiner
dd2b45feed
Removed extraneous 'explicit' keywords
2016-05-04 16:57:52 -07:00
Benoit Steiner
968ec1c2ae
Use numext::isfinite instead of std::isfinite
2016-05-03 19:56:40 -07:00
Benoit Steiner
aad9a04da4
Deleted superfluous explicit keyword.
2016-05-03 09:37:19 -07:00
Benoit Steiner
8a9228ed9b
Fixed compilation error
2016-05-01 14:48:01 -07:00
Benoit Steiner
d6c9596fd8
Added missing accessors to fixed sized tensors
2016-04-29 18:51:33 -07:00
Benoit Steiner
17fe7f354e
Deleted trailing commas
2016-04-29 18:39:01 -07:00
Benoit Steiner
e5f71aa6b2
Deleted useless trailing commas
2016-04-29 18:36:10 -07:00
Benoit Steiner
44f592dceb
Deleted unnecessary trailing commas.
2016-04-29 18:33:46 -07:00
Benoit Steiner
f100d1494c
Return the proper size (i.e. 1) for tensors of rank 0
2016-04-29 18:14:33 -07:00
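Why a rank-0 tensor has size 1 follows from treating the element count as the product of the dimensions, with the empty product equal to 1. The sketch below assumes a hypothetical `num_elements` helper over a plain vector of dimensions; it is not the library's own size computation.

```cpp
#include <cstddef>
#include <vector>

// Number of elements is the product of all dimensions. With no
// dimensions (rank 0) the loop never runs and the result stays 1,
// which matches a scalar holding exactly one value.
std::size_t num_elements(const std::vector<std::size_t>& dims) {
  std::size_t total = 1;
  for (std::size_t d : dims) {
    total *= d;
  }
  return total;
}
```

A 2 x 3 x 4 shape gives 24 elements, an empty shape gives 1, and any shape containing a zero dimension gives 0.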
Benoit Steiner
a8c0405cf5
Deleted unused default values for template parameters
2016-04-29 16:34:43 -07:00
Benoit Steiner
c07404f6a1
Restore Tensor support for non c++11 compilers
2016-04-29 15:19:19 -07:00
Benoit Steiner
ba32ded021
Fixed include path
2016-04-29 15:11:09 -07:00
Rasmus Munk Larsen
463738ccbe
Use computeProductBlockingSizes to compute blocking for both ShardByCol and ShardByRow cases.
2016-04-27 12:26:18 -07:00
Gael Guennebaud
3dddd34133
Refactor the unsupported CXX11/Core module to internal headers only.
2016-04-26 11:20:25 +02:00
Benoit Steiner
4a164d2c46
Fixed the partial evaluation of non vectorizable tensor subexpressions
2016-04-25 10:43:03 -07:00