Commit Graph

1398 Commits

Author SHA1 Message Date
Rasmus Munk Larsen
e55deb21c5 Improvements to parallelFor.
Move some scalar functors from TensorFunctors. to Eigen core.
2016-05-12 14:07:22 -07:00
Benoit Steiner
ae9688f313 Worked around a compilation error triggered by nvcc when compiling a tensor concatenation kernel. 2016-05-12 12:06:51 -07:00
Benoit Steiner
2a54b70d45 Fixed potential race condition in the non blocking thread pool 2016-05-12 11:45:48 -07:00
Benoit Steiner
a071629fec Replace implicit cast with an explicit one 2016-05-12 10:40:07 -07:00
Benoit Steiner
2f9401b061 Worked around compilation errors with older versions of gcc 2016-05-11 23:39:20 -07:00
Benoit Steiner
09653e1f82 Improved the portability of the tensor code 2016-05-11 23:29:09 -07:00
Benoit Steiner
b6a517c47d Added the ability to load fp16 using the texture path.
Improved the performance of some reductions on fp16
2016-05-11 21:26:48 -07:00
Christoph Hertzberg
1a1ce6ff61 Removed deprecated flag (which apparently was ignored anyway) 2016-05-11 23:05:37 +02:00
Christoph Hertzberg
2150f13d65 fixed some double-promotion and sign-compare warnings 2016-05-11 23:02:26 +02:00
Benoit Steiner
217d984abc Fixed a typo in my previous commit 2016-05-11 10:22:15 -07:00
Benoit Steiner
08348b4e48 Fix potential race condition in the CUDA reduction code. 2016-05-11 10:08:51 -07:00
Benoit Steiner
6a5717dc74 Explicitely initialize all the atomic variables. 2016-05-11 10:04:41 -07:00
Benoit Steiner
4ede059de1 Properly gate the use of half2. 2016-05-10 17:04:01 -07:00
Benoit Steiner
661e710092 Added support for fp16 to the sigmoid functor. 2016-05-10 12:25:27 -07:00
Benoit Steiner
0eb69b7552 Small improvement to the full reduction of fp16 2016-05-10 11:58:18 -07:00
Benoit Steiner
4013b8feca Simplified the reduction code a little. 2016-05-10 09:40:42 -07:00
Benoit Steiner
4670d7d5ce Improved the performance of full reductions on GPU:
Before:
BM_fullReduction/10       200000      11751     8.51 MFlops/s
BM_fullReduction/80         5000     523385    12.23 MFlops/s
BM_fullReduction/640          50   36179326    11.32 MFlops/s
BM_fullReduction/4K            1 2173517195    11.50 MFlops/s

After:
BM_fullReduction/10       500000       5987    16.70 MFlops/s
BM_fullReduction/80       200000      10636   601.73 MFlops/s
BM_fullReduction/640       50000      58428  7010.31 MFlops/s
BM_fullReduction/4K         1000    2006106 12461.95 MFlops/s
2016-05-09 17:09:54 -07:00
Benoit Steiner
c3859a2b58 Added the ability to use a scratch buffer in cuda kernels 2016-05-09 17:05:53 -07:00
Benoit Steiner
ba95e43ea2 Added a new parallelFor api to the thread pool device. 2016-05-09 10:45:12 -07:00
Benoit Steiner
dc7dbc2df7 Optimized the non blocking thread pool:
* Use a pseudo-random permutation of queue indices during random stealing. This ensures that all the queues are considered.
 * Directly pop from a non-empty queue when we are waiting for work,
instead of first noticing that there is a non-empty queue and
then doing another round of random stealing to re-discover the non-empty
queue.
 * Steal only 1 task from a remote queue instead of half of tasks.
2016-05-09 10:17:17 -07:00
Benoit Steiner
c54ae65c83 Marked a few tensor operations as read only 2016-05-05 17:18:47 -07:00
Benoit Steiner
910e013506 Relaxed an assertion that was tighter that necessary. 2016-05-05 15:38:16 -07:00
Benoit Steiner
28d5572658 Fixed some incorrect assertions 2016-05-05 10:02:26 -07:00
Benoit Steiner
a4d6e8fef0 Strongly hint but don't force the compiler to unroll a some loops in the tensor executor. This results in up to 27% faster code. 2016-05-05 09:25:55 -07:00
Benoit Steiner
f363e533aa Added tests for full contractions using thread pools and gpu devices.
Fixed a couple of issues in the corresponding code.
2016-05-05 09:05:45 -07:00
Benoit Steiner
06d774bf58 Updated the contraction code to ensure that full contraction return a tensor of rank 0 2016-05-05 08:37:47 -07:00
Christoph Hertzberg
dacb469bc9 Enable and fix -Wdouble-conversion warnings 2016-05-05 13:35:45 +02:00
Benoit Steiner
dd2b45feed Removed extraneous 'explicit' keywords 2016-05-04 16:57:52 -07:00
Benoit Steiner
968ec1c2ae Use numext::isfinite instead of std::isfinite 2016-05-03 19:56:40 -07:00
Benoit Steiner
aad9a04da4 Deleted superfluous explicit keyword. 2016-05-03 09:37:19 -07:00
Benoit Steiner
8a9228ed9b Fixed compilation error 2016-05-01 14:48:01 -07:00
Benoit Steiner
d6c9596fd8 Added missing accessors to fixed sized tensors 2016-04-29 18:51:33 -07:00
Benoit Steiner
17fe7f354e Deleted trailing commas 2016-04-29 18:39:01 -07:00
Benoit Steiner
e5f71aa6b2 Deleted useless trailing commas 2016-04-29 18:36:10 -07:00
Benoit Steiner
44f592dceb Deleted unnecessary trailing commas. 2016-04-29 18:33:46 -07:00
Benoit Steiner
f100d1494c Return the proper size (ie 1) for tensors of rank 0 2016-04-29 18:14:33 -07:00
Benoit Steiner
a8c0405cf5 Deleted unused default values for template parameters 2016-04-29 16:34:43 -07:00
Benoit Steiner
c07404f6a1 Restore Tensor support for non c++11 compilers 2016-04-29 15:19:19 -07:00
Benoit Steiner
ba32ded021 Fixed include path 2016-04-29 15:11:09 -07:00
Gael Guennebaud
318e65e0ae Fix missing inclusion of Eigen/Core 2016-04-27 23:05:40 +02:00
Rasmus Munk Larsen
463738ccbe Use computeProductBlockingSizes to compute blocking for both ShardByCol and ShardByRow cases. 2016-04-27 12:26:18 -07:00
Gael Guennebaud
3dddd34133 Refactor the unsupported CXX11/Core module to internal headers only. 2016-04-26 11:20:25 +02:00
Benoit Steiner
4a164d2c46 Fixed the partial evaluation of non vectorizable tensor subexpressions 2016-04-25 10:43:03 -07:00
Benoit Steiner
fd9401f260 Refined the cost of the striding operation. 2016-04-25 09:16:08 -07:00
Benoit Steiner
4bbc97be5e Provide access to the base threadpool classes 2016-04-21 17:59:33 -07:00
Benoit Steiner
33adce5c3a Added the ability to switch to the new thread pool with a #define 2016-04-21 11:59:58 -07:00
Benoit Steiner
f670613e4b Fixed several compilation warnings 2016-04-21 11:03:02 -07:00
Benoit Steiner
2dde1b1028 Don't crash when attempting to reduce empty tensors. 2016-04-20 18:08:20 -07:00
Benoit Steiner
c7c2054bb5 Started to implement a portable way to yield. 2016-04-19 17:59:58 -07:00
Benoit Steiner
2b72163028 Implemented a more portable version of thread local variables 2016-04-19 15:56:02 -07:00