Rasmus Munk Larsen
5624219b6b
Merge.
2016-05-18 15:16:06 -07:00
Rasmus Munk Larsen
7df811cfe5
Minor cleanups: 1. Get rid of unused variables. 2. Get rid of last uses of EIGEN_USE_COST_MODEL.
2016-05-18 15:09:48 -07:00
Rasmus Munk Larsen
f519fca72b
Reduce overhead for small tensors and cheap ops by short-circuiting the const computation and block size calculation in parallelFor.
2016-05-17 16:06:00 -07:00
Benoit Steiner
86ae94462e
#if defined(EIGEN_USE_NONBLOCKING_THREAD_POOL) is now #if !defined(EIGEN_USE_SIMPLE_THREAD_POOL): the non blocking thread pool is the default since it's more scalable, and one needs to request the old thread pool explicitly.
2016-05-17 14:06:15 -07:00
Benoit Steiner
997c335970
Fixed compilation error
2016-05-17 12:54:18 -07:00
Benoit Steiner
ebf6ada5ee
Fixed compilation error in the tensor thread pool
2016-05-17 12:33:46 -07:00
Rasmus Munk Larsen
0bb61b04ca
Merge upstream.
2016-05-17 10:26:10 -07:00
Rasmus Munk Larsen
0dbd68145f
Roll back changes to core. Move include of TensorFunctors.h up to satisfy dependence in TensorCostModel.h.
2016-05-17 10:25:19 -07:00
Rasmus Larsen
00228f2506
Merged eigen/eigen into default
2016-05-17 09:49:31 -07:00
Benoit Steiner
e7e64c3277
Enable the use of the packet api to evaluate tensor broadcasts. This speed things up quite a bit:
...
Before"
M_broadcasting/10 500000 3690 27.10 MFlops/s
BM_broadcasting/80 500000 4014 1594.24 MFlops/s
BM_broadcasting/640 100000 14770 27731.35 MFlops/s
BM_broadcasting/4K 5000 632711 39512.48 MFlops/s
After:
BM_broadcasting/10 500000 4287 23.33 MFlops/s
BM_broadcasting/80 500000 4455 1436.41 MFlops/s
BM_broadcasting/640 200000 10195 40173.01 MFlops/s
BM_broadcasting/4K 5000 423746 58997.57 MFlops/s
2016-05-17 09:24:35 -07:00
Benoit Steiner
5fa27574dd
Allow vectorized padding on GPU. This helps speed things up a little
...
Before:
BM_padding/10 5000000 460 217.03 MFlops/s
BM_padding/80 5000000 460 13899.40 MFlops/s
BM_padding/640 5000000 461 888421.17 MFlops/s
BM_padding/4K 5000000 460 54316322.55 MFlops/s
After:
BM_padding/10 5000000 454 220.20 MFlops/s
BM_padding/80 5000000 455 14039.86 MFlops/s
BM_padding/640 5000000 452 904968.83 MFlops/s
BM_padding/4K 5000000 411 60750049.21 MFlops/s
2016-05-17 09:17:26 -07:00
Benoit Steiner
a80d875916
Added missing costPerCoeff method
2016-05-16 09:31:10 -07:00
Benoit Steiner
83ef39e055
Turn on the cost model by default. This results in some significant speedups for smaller tensors. For example, below are the results for the various tensor reductions.
...
Before:
BM_colReduction_12T/10 1000000 1949 51.29 MFlops/s
BM_colReduction_12T/80 100000 15636 409.29 MFlops/s
BM_colReduction_12T/640 20000 95100 4307.01 MFlops/s
BM_colReduction_12T/4K 500 4573423 5466.36 MFlops/s
BM_colReduction_4T/10 1000000 1867 53.56 MFlops/s
BM_colReduction_4T/80 500000 5288 1210.11 MFlops/s
BM_colReduction_4T/640 10000 106924 3830.75 MFlops/s
BM_colReduction_4T/4K 500 9946374 2513.48 MFlops/s
BM_colReduction_8T/10 1000000 1912 52.30 MFlops/s
BM_colReduction_8T/80 200000 8354 766.09 MFlops/s
BM_colReduction_8T/640 20000 85063 4815.22 MFlops/s
BM_colReduction_8T/4K 500 5445216 4591.19 MFlops/s
BM_rowReduction_12T/10 1000000 2041 48.99 MFlops/s
BM_rowReduction_12T/80 100000 15426 414.87 MFlops/s
BM_rowReduction_12T/640 50000 39117 10470.98 MFlops/s
BM_rowReduction_12T/4K 500 3034298 8239.14 MFlops/s
BM_rowReduction_4T/10 1000000 1834 54.51 MFlops/s
BM_rowReduction_4T/80 500000 5406 1183.81 MFlops/s
BM_rowReduction_4T/640 50000 35017 11697.16 MFlops/s
BM_rowReduction_4T/4K 500 3428527 7291.76 MFlops/s
BM_rowReduction_8T/10 1000000 1925 51.95 MFlops/s
BM_rowReduction_8T/80 200000 8519 751.23 MFlops/s
BM_rowReduction_8T/640 50000 33441 12248.42 MFlops/s
BM_rowReduction_8T/4K 1000 2852841 8763.19 MFlops/s
After:
BM_colReduction_12T/10 50000000 59 1678.30 MFlops/s
BM_colReduction_12T/80 5000000 725 8822.71 MFlops/s
BM_colReduction_12T/640 20000 90882 4506.93 MFlops/s
BM_colReduction_12T/4K 500 4668855 5354.63 MFlops/s
BM_colReduction_4T/10 50000000 59 1687.37 MFlops/s
BM_colReduction_4T/80 5000000 737 8681.24 MFlops/s
BM_colReduction_4T/640 50000 108637 3770.34 MFlops/s
BM_colReduction_4T/4K 500 7912954 3159.38 MFlops/s
BM_colReduction_8T/10 50000000 60 1657.21 MFlops/s
BM_colReduction_8T/80 5000000 726 8812.48 MFlops/s
BM_colReduction_8T/640 20000 91451 4478.90 MFlops/s
BM_colReduction_8T/4K 500 5441692 4594.16 MFlops/s
BM_rowReduction_12T/10 20000000 93 1065.28 MFlops/s
BM_rowReduction_12T/80 2000000 950 6730.96 MFlops/s
BM_rowReduction_12T/640 50000 38196 10723.48 MFlops/s
BM_rowReduction_12T/4K 500 3019217 8280.29 MFlops/s
BM_rowReduction_4T/10 20000000 93 1064.30 MFlops/s
BM_rowReduction_4T/80 2000000 959 6667.71 MFlops/s
BM_rowReduction_4T/640 50000 37433 10941.96 MFlops/s
BM_rowReduction_4T/4K 500 3036476 8233.23 MFlops/s
BM_rowReduction_8T/10 20000000 93 1072.47 MFlops/s
BM_rowReduction_8T/80 2000000 959 6670.04 MFlops/s
BM_rowReduction_8T/640 50000 38069 10759.37 MFlops/s
BM_rowReduction_8T/4K 1000 2758988 9061.29 MFlops/s
2016-05-16 08:55:21 -07:00
Benoit Steiner
b789a26804
Fixed syntax error
2016-05-16 08:51:08 -07:00
Benoit Steiner
83dfb40f66
Turnon the new thread pool by default since it scales much better over multiple cores. It is still possible to revert to the old thread pool by compiling with the EIGEN_USE_SIMPLE_THREAD_POOL define.
2016-05-13 17:23:15 -07:00
Benoit Steiner
97605c7b27
New multithreaded contraction that doesn't rely on the thread pool to run the closure in the order in which they are enqueued. This is needed in order to switch to the new non blocking thread pool since this new thread pool can execute the closure in any order.
2016-05-13 17:11:29 -07:00
Benoit Steiner
c4fc8b70ec
Removed unnecessary thread synchronization
2016-05-13 10:49:38 -07:00
Benoit Steiner
7aa3557d31
Fixed compilation errors triggered by old versions of gcc
2016-05-12 18:59:04 -07:00
Rasmus Munk Larsen
5005b27fc8
Diasbled cost model by accident. Revert.
2016-05-12 16:55:21 -07:00
Rasmus Munk Larsen
989e419328
Address comments by bsteiner.
2016-05-12 16:54:19 -07:00
Rasmus Munk Larsen
e55deb21c5
Improvements to parallelFor.
...
Move some scalar functors from TensorFunctors. to Eigen core.
2016-05-12 14:07:22 -07:00
Benoit Steiner
ae9688f313
Worked around a compilation error triggered by nvcc when compiling a tensor concatenation kernel.
2016-05-12 12:06:51 -07:00
Benoit Steiner
2a54b70d45
Fixed potential race condition in the non blocking thread pool
2016-05-12 11:45:48 -07:00
Benoit Steiner
a071629fec
Replace implicit cast with an explicit one
2016-05-12 10:40:07 -07:00
Benoit Steiner
2f9401b061
Worked around compilation errors with older versions of gcc
2016-05-11 23:39:20 -07:00
Benoit Steiner
09653e1f82
Improved the portability of the tensor code
2016-05-11 23:29:09 -07:00
Benoit Steiner
b6a517c47d
Added the ability to load fp16 using the texture path.
...
Improved the performance of some reductions on fp16
2016-05-11 21:26:48 -07:00
Christoph Hertzberg
2150f13d65
fixed some double-promotion and sign-compare warnings
2016-05-11 23:02:26 +02:00
Benoit Steiner
217d984abc
Fixed a typo in my previous commit
2016-05-11 10:22:15 -07:00
Benoit Steiner
08348b4e48
Fix potential race condition in the CUDA reduction code.
2016-05-11 10:08:51 -07:00
Benoit Steiner
6a5717dc74
Explicitely initialize all the atomic variables.
2016-05-11 10:04:41 -07:00
Benoit Steiner
4ede059de1
Properly gate the use of half2.
2016-05-10 17:04:01 -07:00
Benoit Steiner
661e710092
Added support for fp16 to the sigmoid functor.
2016-05-10 12:25:27 -07:00
Benoit Steiner
0eb69b7552
Small improvement to the full reduction of fp16
2016-05-10 11:58:18 -07:00
Benoit Steiner
4013b8feca
Simplified the reduction code a little.
2016-05-10 09:40:42 -07:00
Benoit Steiner
4670d7d5ce
Improved the performance of full reductions on GPU:
...
Before:
BM_fullReduction/10 200000 11751 8.51 MFlops/s
BM_fullReduction/80 5000 523385 12.23 MFlops/s
BM_fullReduction/640 50 36179326 11.32 MFlops/s
BM_fullReduction/4K 1 2173517195 11.50 MFlops/s
After:
BM_fullReduction/10 500000 5987 16.70 MFlops/s
BM_fullReduction/80 200000 10636 601.73 MFlops/s
BM_fullReduction/640 50000 58428 7010.31 MFlops/s
BM_fullReduction/4K 1000 2006106 12461.95 MFlops/s
2016-05-09 17:09:54 -07:00
Benoit Steiner
c3859a2b58
Added the ability to use a scratch buffer in cuda kernels
2016-05-09 17:05:53 -07:00
Benoit Steiner
ba95e43ea2
Added a new parallelFor api to the thread pool device.
2016-05-09 10:45:12 -07:00
Benoit Steiner
dc7dbc2df7
Optimized the non blocking thread pool:
...
* Use a pseudo-random permutation of queue indices during random stealing. This ensures that all the queues are considered.
* Directly pop from a non-empty queue when we are waiting for work,
instead of first noticing that there is a non-empty queue and
then doing another round of random stealing to re-discover the non-empty
queue.
* Steal only 1 task from a remote queue instead of half of tasks.
2016-05-09 10:17:17 -07:00
Benoit Steiner
c54ae65c83
Marked a few tensor operations as read only
2016-05-05 17:18:47 -07:00
Benoit Steiner
910e013506
Relaxed an assertion that was tighter that necessary.
2016-05-05 15:38:16 -07:00
Benoit Steiner
a4d6e8fef0
Strongly hint but don't force the compiler to unroll a some loops in the tensor executor. This results in up to 27% faster code.
2016-05-05 09:25:55 -07:00
Benoit Steiner
f363e533aa
Added tests for full contractions using thread pools and gpu devices.
...
Fixed a couple of issues in the corresponding code.
2016-05-05 09:05:45 -07:00
Benoit Steiner
06d774bf58
Updated the contraction code to ensure that full contraction return a tensor of rank 0
2016-05-05 08:37:47 -07:00
Benoit Steiner
dd2b45feed
Removed extraneous 'explicit' keywords
2016-05-04 16:57:52 -07:00
Benoit Steiner
968ec1c2ae
Use numext::isfinite instead of std::isfinite
2016-05-03 19:56:40 -07:00
Benoit Steiner
aad9a04da4
Deleted superfluous explicit keyword.
2016-05-03 09:37:19 -07:00
Benoit Steiner
8a9228ed9b
Fixed compilation error
2016-05-01 14:48:01 -07:00
Benoit Steiner
d6c9596fd8
Added missing accessors to fixed sized tensors
2016-04-29 18:51:33 -07:00
Benoit Steiner
17fe7f354e
Deleted trailing commas
2016-04-29 18:39:01 -07:00