Commit Graph

1862 Commits

Author SHA1 Message Date
Rasmus Munk Larsen
f519fca72b Reduce overhead for small tensors and cheap ops by short-circuiting the const computation and block size calculation in parallelFor. 2016-05-17 16:06:00 -07:00
Benoit Steiner
86ae94462e #if defined(EIGEN_USE_NONBLOCKING_THREAD_POOL) is now #if !defined(EIGEN_USE_SIMPLE_THREAD_POOL): the non blocking thread pool is the default since it's more scalable, and one needs to request the old thread pool explicitly. 2016-05-17 14:06:15 -07:00
Benoit Steiner
997c335970 Fixed compilation error 2016-05-17 12:54:18 -07:00
Benoit Steiner
ebf6ada5ee Fixed compilation error in the tensor thread pool 2016-05-17 12:33:46 -07:00
Rasmus Munk Larsen
0bb61b04ca Merge upstream. 2016-05-17 10:26:10 -07:00
Rasmus Munk Larsen
0dbd68145f Roll back changes to core. Move include of TensorFunctors.h up to satisfy dependence in TensorCostModel.h. 2016-05-17 10:25:19 -07:00
Rasmus Larsen
00228f2506 Merged eigen/eigen into default 2016-05-17 09:49:31 -07:00
Benoit Steiner
e7e64c3277 Enable the use of the packet api to evaluate tensor broadcasts. This speed things up quite a bit:
Before"
M_broadcasting/10        500000       3690    27.10 MFlops/s
BM_broadcasting/80        500000       4014  1594.24 MFlops/s
BM_broadcasting/640       100000      14770 27731.35 MFlops/s
BM_broadcasting/4K          5000     632711 39512.48 MFlops/s
After:
BM_broadcasting/10        500000       4287    23.33 MFlops/s
BM_broadcasting/80        500000       4455  1436.41 MFlops/s
BM_broadcasting/640       200000      10195 40173.01 MFlops/s
BM_broadcasting/4K          5000     423746 58997.57 MFlops/s
2016-05-17 09:24:35 -07:00
Benoit Steiner
5fa27574dd Allow vectorized padding on GPU. This helps speed things up a little
Before:
BM_padding/10            5000000        460   217.03 MFlops/s
BM_padding/80            5000000        460 13899.40 MFlops/s
BM_padding/640           5000000        461 888421.17 MFlops/s
BM_padding/4K            5000000        460 54316322.55 MFlops/s
After:
BM_padding/10            5000000        454   220.20 MFlops/s
BM_padding/80            5000000        455 14039.86 MFlops/s
BM_padding/640           5000000        452 904968.83 MFlops/s
BM_padding/4K            5000000        411 60750049.21 MFlops/s
2016-05-17 09:17:26 -07:00
Benoit Steiner
a910bcee43 Merged latest updates from trunk 2016-05-17 09:14:22 -07:00
Benoit Steiner
8d06c02ffd Allow vectorized padding on GPU. This helps speed things up a little.
Before:
BM_padding/10            5000000        460   217.03 MFlops/s
BM_padding/80            5000000        460 13899.40 MFlops/s
BM_padding/640           5000000        461 888421.17 MFlops/s
BM_padding/4K            5000000        460 54316322.55 MFlops/s
After:
BM_padding/10            5000000        454   220.20 MFlops/s
BM_padding/80            5000000        455 14039.86 MFlops/s
BM_padding/640           5000000        452 904968.83 MFlops/s
BM_padding/4K            5000000        411 60750049.21 MFlops/s
2016-05-17 09:13:27 -07:00
Benoit Steiner
86da77cb9b Pulled latest updates from trunk. 2016-05-17 07:21:48 -07:00
Benoit Steiner
92fc6add43 Don't rely on c++11 extension when we don't have to. 2016-05-17 07:21:22 -07:00
David Dement
ccc7563ac5 made a fix to the GMRES solver so that it now correctly reports the error achieved in the solution process 2016-05-16 14:26:41 -04:00
Benoit Steiner
a80d875916 Added missing costPerCoeff method 2016-05-16 09:31:10 -07:00
Benoit Steiner
83ef39e055 Turn on the cost model by default. This results in some significant speedups for smaller tensors. For example, below are the results for the various tensor reductions.
Before:
BM_colReduction_12T/10       1000000       1949    51.29 MFlops/s
BM_colReduction_12T/80        100000      15636   409.29 MFlops/s
BM_colReduction_12T/640        20000      95100  4307.01 MFlops/s
BM_colReduction_12T/4K           500    4573423  5466.36 MFlops/s
BM_colReduction_4T/10        1000000       1867    53.56 MFlops/s
BM_colReduction_4T/80         500000       5288  1210.11 MFlops/s
BM_colReduction_4T/640         10000     106924  3830.75 MFlops/s
BM_colReduction_4T/4K            500    9946374  2513.48 MFlops/s
BM_colReduction_8T/10        1000000       1912    52.30 MFlops/s
BM_colReduction_8T/80         200000       8354   766.09 MFlops/s
BM_colReduction_8T/640         20000      85063  4815.22 MFlops/s
BM_colReduction_8T/4K            500    5445216  4591.19 MFlops/s
BM_rowReduction_12T/10       1000000       2041    48.99 MFlops/s
BM_rowReduction_12T/80        100000      15426   414.87 MFlops/s
BM_rowReduction_12T/640        50000      39117 10470.98 MFlops/s
BM_rowReduction_12T/4K           500    3034298  8239.14 MFlops/s
BM_rowReduction_4T/10        1000000       1834    54.51 MFlops/s
BM_rowReduction_4T/80         500000       5406  1183.81 MFlops/s
BM_rowReduction_4T/640         50000      35017 11697.16 MFlops/s
BM_rowReduction_4T/4K            500    3428527  7291.76 MFlops/s
BM_rowReduction_8T/10        1000000       1925    51.95 MFlops/s
BM_rowReduction_8T/80         200000       8519   751.23 MFlops/s
BM_rowReduction_8T/640         50000      33441 12248.42 MFlops/s
BM_rowReduction_8T/4K           1000    2852841  8763.19 MFlops/s


After:
BM_colReduction_12T/10      50000000         59  1678.30 MFlops/s
BM_colReduction_12T/80       5000000        725  8822.71 MFlops/s
BM_colReduction_12T/640        20000      90882  4506.93 MFlops/s
BM_colReduction_12T/4K           500    4668855  5354.63 MFlops/s
BM_colReduction_4T/10       50000000         59  1687.37 MFlops/s
BM_colReduction_4T/80        5000000        737  8681.24 MFlops/s
BM_colReduction_4T/640         50000     108637  3770.34 MFlops/s
BM_colReduction_4T/4K            500    7912954  3159.38 MFlops/s
BM_colReduction_8T/10       50000000         60  1657.21 MFlops/s
BM_colReduction_8T/80        5000000        726  8812.48 MFlops/s
BM_colReduction_8T/640         20000      91451  4478.90 MFlops/s
BM_colReduction_8T/4K            500    5441692  4594.16 MFlops/s
BM_rowReduction_12T/10      20000000         93  1065.28 MFlops/s
BM_rowReduction_12T/80       2000000        950  6730.96 MFlops/s
BM_rowReduction_12T/640        50000      38196 10723.48 MFlops/s
BM_rowReduction_12T/4K           500    3019217  8280.29 MFlops/s
BM_rowReduction_4T/10       20000000         93  1064.30 MFlops/s
BM_rowReduction_4T/80        2000000        959  6667.71 MFlops/s
BM_rowReduction_4T/640         50000      37433 10941.96 MFlops/s
BM_rowReduction_4T/4K            500    3036476  8233.23 MFlops/s
BM_rowReduction_8T/10       20000000         93  1072.47 MFlops/s
BM_rowReduction_8T/80        2000000        959  6670.04 MFlops/s
BM_rowReduction_8T/640         50000      38069 10759.37 MFlops/s
BM_rowReduction_8T/4K           1000    2758988  9061.29 MFlops/s
2016-05-16 08:55:21 -07:00
Benoit Steiner
b789a26804 Fixed syntax error 2016-05-16 08:51:08 -07:00
Benoit Steiner
83dfb40f66 Turnon the new thread pool by default since it scales much better over multiple cores. It is still possible to revert to the old thread pool by compiling with the EIGEN_USE_SIMPLE_THREAD_POOL define. 2016-05-13 17:23:15 -07:00
Benoit Steiner
97605c7b27 New multithreaded contraction that doesn't rely on the thread pool to run the closure in the order in which they are enqueued. This is needed in order to switch to the new non blocking thread pool since this new thread pool can execute the closure in any order. 2016-05-13 17:11:29 -07:00
Benoit Steiner
c4fc8b70ec Removed unnecessary thread synchronization 2016-05-13 10:49:38 -07:00
Benoit Steiner
7aa3557d31 Fixed compilation errors triggered by old versions of gcc 2016-05-12 18:59:04 -07:00
Rasmus Munk Larsen
5005b27fc8 Diasbled cost model by accident. Revert. 2016-05-12 16:55:21 -07:00
Rasmus Munk Larsen
989e419328 Address comments by bsteiner. 2016-05-12 16:54:19 -07:00
Rasmus Munk Larsen
e55deb21c5 Improvements to parallelFor.
Move some scalar functors from TensorFunctors. to Eigen core.
2016-05-12 14:07:22 -07:00
Benoit Steiner
ae9688f313 Worked around a compilation error triggered by nvcc when compiling a tensor concatenation kernel. 2016-05-12 12:06:51 -07:00
Benoit Steiner
2a54b70d45 Fixed potential race condition in the non blocking thread pool 2016-05-12 11:45:48 -07:00
Benoit Steiner
a071629fec Replace implicit cast with an explicit one 2016-05-12 10:40:07 -07:00
Benoit Steiner
2f9401b061 Worked around compilation errors with older versions of gcc 2016-05-11 23:39:20 -07:00
Benoit Steiner
09653e1f82 Improved the portability of the tensor code 2016-05-11 23:29:09 -07:00
Benoit Steiner
fae0493f98 Fixed a couple of bugs related to the Pascalfamily of GPUs
H: Enter commit message.  Lines beginning with 'HG:' are removed.
2016-05-11 23:02:26 -07:00
Benoit Steiner
886445ce4d Avoid unnecessary conversions between floats and doubles 2016-05-11 23:00:03 -07:00
Benoit Steiner
595e890391 Added more tests for half floats 2016-05-11 21:27:15 -07:00
Benoit Steiner
b6a517c47d Added the ability to load fp16 using the texture path.
Improved the performance of some reductions on fp16
2016-05-11 21:26:48 -07:00
Christoph Hertzberg
1a1ce6ff61 Removed deprecated flag (which apparently was ignored anyway) 2016-05-11 23:05:37 +02:00
Christoph Hertzberg
2150f13d65 fixed some double-promotion and sign-compare warnings 2016-05-11 23:02:26 +02:00
Benoit Steiner
217d984abc Fixed a typo in my previous commit 2016-05-11 10:22:15 -07:00
Benoit Steiner
08348b4e48 Fix potential race condition in the CUDA reduction code. 2016-05-11 10:08:51 -07:00
Benoit Steiner
cbb14ed47e Added a few tests to validate the generation of random tensors on GPU. 2016-05-11 10:05:56 -07:00
Benoit Steiner
6a5717dc74 Explicitely initialize all the atomic variables. 2016-05-11 10:04:41 -07:00
Benoit Steiner
4ede059de1 Properly gate the use of half2. 2016-05-10 17:04:01 -07:00
Benoit Steiner
661e710092 Added support for fp16 to the sigmoid functor. 2016-05-10 12:25:27 -07:00
Benoit Steiner
0eb69b7552 Small improvement to the full reduction of fp16 2016-05-10 11:58:18 -07:00
Benoit Steiner
6bf8273bc0 Added a test to validate the new non blocking thread pool 2016-05-10 10:49:34 -07:00
Benoit Steiner
4013b8feca Simplified the reduction code a little. 2016-05-10 09:40:42 -07:00
Benoit Steiner
75bd2bd32d Fixed compilation warning 2016-05-09 19:24:41 -07:00
Benoit Steiner
4670d7d5ce Improved the performance of full reductions on GPU:
Before:
BM_fullReduction/10       200000      11751     8.51 MFlops/s
BM_fullReduction/80         5000     523385    12.23 MFlops/s
BM_fullReduction/640          50   36179326    11.32 MFlops/s
BM_fullReduction/4K            1 2173517195    11.50 MFlops/s

After:
BM_fullReduction/10       500000       5987    16.70 MFlops/s
BM_fullReduction/80       200000      10636   601.73 MFlops/s
BM_fullReduction/640       50000      58428  7010.31 MFlops/s
BM_fullReduction/4K         1000    2006106 12461.95 MFlops/s
2016-05-09 17:09:54 -07:00
Benoit Steiner
c3859a2b58 Added the ability to use a scratch buffer in cuda kernels 2016-05-09 17:05:53 -07:00
Benoit Steiner
ba95e43ea2 Added a new parallelFor api to the thread pool device. 2016-05-09 10:45:12 -07:00
Benoit Steiner
dc7dbc2df7 Optimized the non blocking thread pool:
* Use a pseudo-random permutation of queue indices during random stealing. This ensures that all the queues are considered.
 * Directly pop from a non-empty queue when we are waiting for work,
instead of first noticing that there is a non-empty queue and
then doing another round of random stealing to re-discover the non-empty
queue.
 * Steal only 1 task from a remote queue instead of half of tasks.
2016-05-09 10:17:17 -07:00
Benoit Steiner
691614bd2c Worked around a bug in nvcc on tegra x1 2016-05-07 13:28:53 -07:00
Benoit Steiner
c54ae65c83 Marked a few tensor operations as read only 2016-05-05 17:18:47 -07:00
Benoit Steiner
69a8a4e1f3 Added a test to validate full reduction on tensor of half floats 2016-05-05 16:52:50 -07:00
Benoit Steiner
678a17ba79 Made the testing of contractions on fp16 more robust 2016-05-05 16:36:39 -07:00
Benoit Steiner
e3d053e14e Refined the testing of log and exp on fp16 2016-05-05 16:24:15 -07:00
Benoit Steiner
9a48688d37 Further improved the testing of fp16 2016-05-05 15:58:05 -07:00
Benoit Steiner
910e013506 Relaxed an assertion that was tighter that necessary. 2016-05-05 15:38:16 -07:00
Benoit Steiner
28d5572658 Fixed some incorrect assertions 2016-05-05 10:02:26 -07:00
Benoit Steiner
2aba40d208 Avoid unecessary type promotion 2016-05-05 09:26:57 -07:00
Benoit Steiner
a4d6e8fef0 Strongly hint but don't force the compiler to unroll a some loops in the tensor executor. This results in up to 27% faster code. 2016-05-05 09:25:55 -07:00
Benoit Steiner
7875437ca0 Avoided unecessary type promotion 2016-05-05 09:08:42 -07:00
Benoit Steiner
f363e533aa Added tests for full contractions using thread pools and gpu devices.
Fixed a couple of issues in the corresponding code.
2016-05-05 09:05:45 -07:00
Benoit Steiner
06d774bf58 Updated the contraction code to ensure that full contraction return a tensor of rank 0 2016-05-05 08:37:47 -07:00
Christoph Hertzberg
b300a84989 Fixed some singed/unsigned comparison warnings 2016-05-05 13:36:28 +02:00
Christoph Hertzberg
dacb469bc9 Enable and fix -Wdouble-conversion warnings 2016-05-05 13:35:45 +02:00
Benoit Steiner
62b710072e Reduced the memory footprint of the cxx11_tensor_image_patch test 2016-05-04 21:08:22 -07:00
Benoit Steiner
dd2b45feed Removed extraneous 'explicit' keywords 2016-05-04 16:57:52 -07:00
Benoit Steiner
968ec1c2ae Use numext::isfinite instead of std::isfinite 2016-05-03 19:56:40 -07:00
Benoit Steiner
2c5568a757 Added a test to validate the computation of exp and log on 16bit floats 2016-05-03 12:06:07 -07:00
Benoit Steiner
aad9a04da4 Deleted superfluous explicit keyword. 2016-05-03 09:37:19 -07:00
Benoit Steiner
8a9228ed9b Fixed compilation error 2016-05-01 14:48:01 -07:00
Benoit Steiner
d6c9596fd8 Added missing accessors to fixed sized tensors 2016-04-29 18:51:33 -07:00
Benoit Steiner
17fe7f354e Deleted trailing commas 2016-04-29 18:39:01 -07:00
Benoit Steiner
e5f71aa6b2 Deleted useless trailing commas 2016-04-29 18:36:10 -07:00
Benoit Steiner
44f592dceb Deleted unnecessary trailing commas. 2016-04-29 18:33:46 -07:00
Benoit Steiner
2b890ae618 Fixed compilation errors generated by clang 2016-04-29 18:30:40 -07:00
Benoit Steiner
d217217842 Added a few tests to ensure that the dimensions of rank 0 tensors are correctly computed 2016-04-29 18:15:34 -07:00
Benoit Steiner
f100d1494c Return the proper size (ie 1) for tensors of rank 0 2016-04-29 18:14:33 -07:00
Benoit Steiner
d14105f158 Made several tensor tests compatible with cxx03 2016-04-29 17:22:37 -07:00
Benoit Steiner
c0882ef4d9 Moved a number of tensor tests that don't require cxx11 to work properly outside the EIGEN_TEST_CXX11 test section 2016-04-29 17:13:51 -07:00
Benoit Steiner
9d1dbd1ec0 Fixed teh cxx11_tensor_empty test to compile without requiring cxx11 support 2016-04-29 16:53:55 -07:00
Benoit Steiner
a8c0405cf5 Deleted unused default values for template parameters 2016-04-29 16:34:43 -07:00
Benoit Steiner
4f53178e62 Made a coupe of tensor tests compile without requiring c++11 support. 2016-04-29 16:09:54 -07:00
Benoit Steiner
1131a984a6 Made the cxx11_tensor_forced_eval compile without c++11. 2016-04-29 15:48:59 -07:00
Benoit Steiner
c07404f6a1 Restore Tensor support for non c++11 compilers 2016-04-29 15:19:19 -07:00
Benoit Steiner
ba32ded021 Fixed include path 2016-04-29 15:11:09 -07:00
Benoit Steiner
a524a26fdc Fixed a few memory leaks 2016-04-28 18:55:53 -07:00
Gael Guennebaud
318e65e0ae Fix missing inclusion of Eigen/Core 2016-04-27 23:05:40 +02:00
Rasmus Munk Larsen
463738ccbe Use computeProductBlockingSizes to compute blocking for both ShardByCol and ShardByRow cases. 2016-04-27 12:26:18 -07:00
Gael Guennebaud
3dddd34133 Refactor the unsupported CXX11/Core module to internal headers only. 2016-04-26 11:20:25 +02:00
Benoit Steiner
4a164d2c46 Fixed the partial evaluation of non vectorizable tensor subexpressions 2016-04-25 10:43:03 -07:00
Benoit Steiner
fd9401f260 Refined the cost of the striding operation. 2016-04-25 09:16:08 -07:00
Benoit Steiner
4bbc97be5e Provide access to the base threadpool classes 2016-04-21 17:59:33 -07:00
Benoit Steiner
33adce5c3a Added the ability to switch to the new thread pool with a #define 2016-04-21 11:59:58 -07:00
Benoit Steiner
f670613e4b Fixed several compilation warnings 2016-04-21 11:03:02 -07:00
Benoit Steiner
32ffce04fc Use EIGEN_THREAD_YIELD instead of std::this_thread::yield to make the code more portable. 2016-04-21 08:47:28 -07:00
Benoit Steiner
2dde1b1028 Don't crash when attempting to reduce empty tensors. 2016-04-20 18:08:20 -07:00
Benoit Steiner
a792cd357d Added more tests 2016-04-20 17:33:58 -07:00
Benoit Steiner
c7c2054bb5 Started to implement a portable way to yield. 2016-04-19 17:59:58 -07:00
Benoit Steiner
2b72163028 Implemented a more portable version of thread local variables 2016-04-19 15:56:02 -07:00
Benoit Steiner
04f954956d Fixed a few typos 2016-04-19 15:27:09 -07:00