Commit Graph

2734 Commits

Author SHA1 Message Date
Srinivas Vasudevan
6e215cf109 Add Bessel functions to SpecialFunctions.
- Split SpecialFunctions files in to a separate BesselFunctions file.

In particular add:
    - Modified bessel functions of the second kind k0, k1, k0e, k1e
    - Bessel functions of the first kind j0, j1
    - Bessel functions of the second kind y0, y1
2019-09-14 12:16:47 -04:00
Srinivas Vasudevan
facdec5aa7 Add packetized versions of i0e and i1e special functions.
- In particular refactor the i0e and i1e code so scalar and vectorized path share code.
  - Move chebevl to GenericPacketMathFunctions.


A brief benchmark with building Eigen with FMA, AVX and AVX2 flags

Before:

CPU: Intel Haswell with HyperThreading (6 cores)
Benchmark                  Time(ns)        CPU(ns)     Iterations
-----------------------------------------------------------------
BM_eigen_i0e_double/1            57.3           57.3     10000000
BM_eigen_i0e_double/8           398            398        1748554
BM_eigen_i0e_double/64         3184           3184         218961
BM_eigen_i0e_double/512       25579          25579          27330
BM_eigen_i0e_double/4k       205043         205042           3418
BM_eigen_i0e_double/32k     1646038        1646176            422
BM_eigen_i0e_double/256k   13180959       13182613             53
BM_eigen_i0e_double/1M     52684617       52706132             10
BM_eigen_i0e_float/1             28.4           28.4     24636711
BM_eigen_i0e_float/8             75.7           75.7      9207634
BM_eigen_i0e_float/64           512            512        1000000
BM_eigen_i0e_float/512         4194           4194         166359
BM_eigen_i0e_float/4k         32756          32761          21373
BM_eigen_i0e_float/32k       261133         261153           2678
BM_eigen_i0e_float/256k     2087938        2088231            333
BM_eigen_i0e_float/1M       8380409        8381234             84
BM_eigen_i1e_double/1            56.3           56.3     10000000
BM_eigen_i1e_double/8           397            397        1772376
BM_eigen_i1e_double/64         3114           3115         223881
BM_eigen_i1e_double/512       25358          25361          27761
BM_eigen_i1e_double/4k       203543         203593           3462
BM_eigen_i1e_double/32k     1613649        1613803            428
BM_eigen_i1e_double/256k   12910625       12910374             54
BM_eigen_i1e_double/1M     51723824       51723991             10
BM_eigen_i1e_float/1             28.3           28.3     24683049
BM_eigen_i1e_float/8             74.8           74.9      9366216
BM_eigen_i1e_float/64           505            505        1000000
BM_eigen_i1e_float/512         4068           4068         171690
BM_eigen_i1e_float/4k         31803          31806          21948
BM_eigen_i1e_float/32k       253637         253692           2763
BM_eigen_i1e_float/256k     2019711        2019918            346
BM_eigen_i1e_float/1M       8238681        8238713             86


After:

CPU: Intel Haswell with HyperThreading (6 cores)
Benchmark                  Time(ns)        CPU(ns)     Iterations
-----------------------------------------------------------------
BM_eigen_i0e_double/1            15.8           15.8     44097476
BM_eigen_i0e_double/8            99.3           99.3      7014884
BM_eigen_i0e_double/64          777            777         886612
BM_eigen_i0e_double/512        6180           6181         100000
BM_eigen_i0e_double/4k        48136          48140          14678
BM_eigen_i0e_double/32k      385936         385943           1801
BM_eigen_i0e_double/256k    3293324        3293551            228
BM_eigen_i0e_double/1M     12423600       12424458             57
BM_eigen_i0e_float/1             16.3           16.3     43038042
BM_eigen_i0e_float/8             30.1           30.1     23456931
BM_eigen_i0e_float/64           169            169        4132875
BM_eigen_i0e_float/512         1338           1339         516860
BM_eigen_i0e_float/4k         10191          10191          68513
BM_eigen_i0e_float/32k        81338          81337           8531
BM_eigen_i0e_float/256k      651807         651984           1000
BM_eigen_i0e_float/1M       2633821        2634187            268
BM_eigen_i1e_double/1            16.2           16.2     42352499
BM_eigen_i1e_double/8           110            110        6316524
BM_eigen_i1e_double/64          822            822         851065
BM_eigen_i1e_double/512        6480           6481         100000
BM_eigen_i1e_double/4k        51843          51843          10000
BM_eigen_i1e_double/32k      414854         414852           1680
BM_eigen_i1e_double/256k    3320001        3320568            212
BM_eigen_i1e_double/1M     13442795       13442391             53
BM_eigen_i1e_float/1             17.6           17.6     41025735
BM_eigen_i1e_float/8             35.5           35.5     19597891
BM_eigen_i1e_float/64           240            240        2924237
BM_eigen_i1e_float/512         1424           1424         485953
BM_eigen_i1e_float/4k         10722          10723          65162
BM_eigen_i1e_float/32k        86286          86297           8048
BM_eigen_i1e_float/256k      691821         691868           1000
BM_eigen_i1e_float/1M       2777336        2777747            256


This shows anywhere from a 50% to 75% improvement on these operations.

I've also benchmarked without any of these flags turned on, and got similar
performance to before (if not better).

Also tested packetmath.cpp + special_functions to ensure no regressions.
2019-09-11 18:34:02 -07:00
Deven Desai
cdb377d0cb Fix for the HIP build+test errors introduced by the ndtri support.
The fixes needed are
 * adding EIGEN_DEVICE_FUNC attribute to a couple of funcs (else HIPCC will error out when non-device funcs are called from global/device funcs)
 * switching to using ::<math_func> instead std::<math_func> (only for HIPCC) in cases where the std::<math_func> is not recognized as a device func by HIPCC
 * removing an errant "j" from a testcase (don't know how that made it in to begin with!)
2019-09-06 16:03:49 +00:00
Eugene Zhulenev
d918bd9a8b Update ThreadLocal to use separate Initialize/Release callables 2019-09-10 16:13:32 -07:00
Eugene Zhulenev
e3dec4dcc1 ThreadLocal container that does not rely on thread local storage 2019-09-09 15:18:14 -07:00
Srinivas Vasudevan
e38dd48a27 PR 681: Add ndtri function, the inverse of the normal distribution function. 2019-08-12 19:26:29 -04:00
Eugene Zhulenev
47fefa235f Allow move-only done callback in TensorAsyncDevice 2019-09-03 17:20:56 -07:00
Eugene Zhulenev
a8d264fa9c Add test for const TensorMap underlying data mutation 2019-09-03 11:38:39 -07:00
Eugene Zhulenev
f68f2bba09 TensorMap constness should not change underlying storage constness 2019-09-03 11:08:09 -07:00
Alberto Luaces
c694be1214 Fixed Tensor documentation formatting. 2019-07-23 09:24:06 +00:00
Eugene Zhulenev
79c402e40e Fix shadow warnings in TensorContractionThreadPool 2019-08-30 15:38:31 -07:00
Eugene Zhulenev
edf2ec28d8 Fix block mapper type name in TensorExecutor 2019-08-30 15:29:25 -07:00
Eugene Zhulenev
f0b36fb9a4 evalSubExprsIfNeededAsync + async TensorContractionThreadPool 2019-08-30 15:13:38 -07:00
Eugene Zhulenev
619cea9491 Revert accidentally removed <memory> header from ThreadPool 2019-08-30 14:51:17 -07:00
Eugene Zhulenev
66665e7e76 Asynchronous expression evaluation with TensorAsyncDevice 2019-08-30 14:49:40 -07:00
Eugene Zhulenev
bc40d4522c Const correctness in TensorMap<const Tensor<T, ...>> expressions 2019-08-28 17:46:05 -07:00
Eugene Zhulenev
6e77f9bef3 Remove shadow warnings in TensorDeviceThreadPool 2019-08-28 10:32:19 -07:00
Rasmus Larsen
84fefdf321 Merged in ezhulenev/eigen-01 (pull request PR-683)
Asynchronous parallelFor in Eigen ThreadPoolDevice
2019-08-26 21:49:17 +00:00
maratek
8b5ab0e4dd Fix get_random_seed on Native Client
Newlib in Native Client SDK does not provide ::random function.
Implement get_random_seed for NaCl using ::rand, similarly to Windows version.
2019-08-23 15:25:56 -07:00
Eugene Zhulenev
6901788013 Asynchronous parallelFor in Eigen ThreadPoolDevice 2019-08-22 10:50:51 -07:00
Eugene Zhulenev
071311821e Remove XSMM support from Tensor module 2019-08-19 11:44:25 -07:00
Rasmus Munk Larsen
facc4e4536 Disable tests for contraction with output kernels when using libxsmm, which does not support this. 2019-08-07 14:11:15 -07:00
Rasmus Munk Larsen
eab7e52db2 [Eigen] Vectorize evaluation of coefficient-wise functions over tensor blocks if the strides are known to be 1. Provides up to 20-25% speedup of the TF cross entropy op with AVX.
A few benchmark numbers:

name                              old time/op             new time/op             delta
BM_Xent_16_10000_cpu              448µs ± 3%              389µs ± 2%  -13.21%
(p=0.008 n=5+5)
BM_Xent_32_10000_cpu              575µs ± 6%              454µs ± 3%  -21.00%          (p=0.008 n=5+5)
BM_Xent_64_10000_cpu              933µs ± 4%              712µs ± 1%  -23.71%          (p=0.008 n=5+5)
2019-08-07 12:57:42 -07:00
Rasmus Munk Larsen
0987126165 Clean up unnecessary namespace specifiers in TensorBlock.h. 2019-08-07 12:12:52 -07:00
Rasmus Munk Larsen
e2999d4c38 Fix performance regressions due to https://bitbucket.org/eigen/eigen/pull-requests/662.
The change caused the device struct to be copied for each expression evaluation, and caused, e.g., a 10% regression in the TensorFlow multinomial op on GPU:


Benchmark                       Time(ns)        CPU(ns)     Iterations
----------------------------------------------------------------------
BM_Multinomial_gpu_1_100000_4     128173         231326           2922  1.610G items/s

VS

Benchmark                       Time(ns)        CPU(ns)     Iterations
----------------------------------------------------------------------
BM_Multinomial_gpu_1_100000_4     146683         246914           2719  1.509G items/s
2019-08-02 11:18:13 -07:00
Eugene Zhulenev
3cd148f983 Fix expression evaluation heuristic for TensorSliceOp 2019-07-09 12:10:26 -07:00
Eugene Zhulenev
6083014594 Add outer/inner chipping optimization for chipping dimension specified at runtime 2019-07-03 11:35:25 -07:00
Deven Desai
7eb2e0a95b adding the EIGEN_DEVICE_FUNC attribute to the constCast routine.
Not having this attribute results in the following failures in the `--config=rocm` TF build.

```
In file included from tensorflow/core/kernels/cross_op_gpu.cu.cc:20:
In file included from ./tensorflow/core/framework/register_types.h:20:
In file included from ./tensorflow/core/framework/numeric_types.h:20:
In file included from ./third_party/eigen3/unsupported/Eigen/CXX11/Tensor:1:
In file included from external/eigen_archive/unsupported/Eigen/CXX11/Tensor:140:
external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorChipping.h:356:37: error:  'Eigen::constCast':  no overloaded function has restriction specifiers that are compatible with the ambient context 'data'
    typename Storage::Type result = constCast(m_impl.data());
                                    ^
external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorChipping.h:356:37: error:  'Eigen::constCast':  no overloaded function has restriction specifiers that are compatible with the ambient context 'data'
external/eigen_archive/unsupported/Eigen/CXX11/src/Tensor/TensorAssign.h:148:56: note: in instantiation of member function 'Eigen::TensorEvaluator<const Eigen::TensorChippingOp<1, Eigen::TensorMap<Eigen::Tensor<int, 2, 1, long>, 16, MakePointer> >, Eigen::Gpu\
Device>::data' requested here
    return m_rightImpl.evalSubExprsIfNeeded(m_leftImpl.data());

```

Adding the EIGEN_DEVICE_FUNC attribute resolves those errors
2019-07-02 20:02:46 +00:00
Gael Guennebaud
ef8aca6a89 Merged in codeplaysoftware/eigen (pull request PR-667)
[SYCL] :

Approved-by: Gael Guennebaud <g.gael@free.fr>
Approved-by: Rasmus Larsen <rmlarsen@google.com>
2019-07-02 12:45:23 +00:00
Eugene Zhulenev
4ac93f8edc Allocate non-const scalar buffer for block evaluation with DefaultDevice 2019-07-01 10:55:19 -07:00
Mehdi Goli
9ea490c82c [SYCL] :
* Modifying TensorDeviceSYCL to use `EIGEN_THROW_X`.
  * Modifying TensorMacro to use `EIGEN_TRY/CATCH(X)` macro.
  * Modifying TensorReverse.h to use `EIGEN_DEVICE_REF` instead of `&`.
  * Fixing the SYCL device macro in SpecialFunctionsImpl.h.
2019-07-01 16:27:28 +01:00
Eugene Zhulenev
81a03bec75 Fix TensorReverse on GPU with m_stride[i]==0 2019-06-28 15:50:39 -07:00
Rasmus Munk Larsen
74a9dd1102 Fix preprocessor condition to only generate a warning when calling eigen::GpuDevice::synchronize() from device code, but not when calling from a non-GPU compilation unit. 2019-06-28 11:56:21 -07:00
Rasmus Munk Larsen
70d4020ad9 Remove comma causing warning in c++03 mode. 2019-06-28 11:39:45 -07:00
Eugene Zhulenev
6e7c76481a Merge with Eigen head 2019-06-28 11:22:46 -07:00
Eugene Zhulenev
878845cb25 Add block access to TensorReverseOp and make sure that TensorForcedEval uses block access when preferred 2019-06-28 11:13:44 -07:00
Mehdi Goli
7d08fa805a [SYCL] This PR adds the minimum modifications to the Eigen unsupported module required to run it on devices supporting SYCL.
* Abstracting the pointer type so that both SYCL memory and pointer can be captured.
* Converting SYCL virtual pointer to SYCL device memory in Eigen evaluator class.
* Binding SYCL placeholder accessor to command group handler by using bind method in Eigen evaluator node.
* Adding SYCL macro for controlling loop unrolling.
* Modifying the TensorDeviceSycl.h and SYCL executor method to adopt the above changes.
2019-06-28 10:08:23 +01:00
Christoph Hertzberg
adec097c61 Remove extra comma (causes warnings in C++03) 2019-06-26 16:14:28 +02:00
Eugene Zhulenev
229db81572 Optimize evaluation strategy for TensorSlicingOp and TensorChippingOp 2019-06-25 15:41:37 -07:00
Rasmus Larsen
c1b0aea653 Merged in Artem-B/eigen (pull request PR-654)
Minor build improvements

Approved-by: Rasmus Larsen <rmlarsen@google.com>
2019-05-31 22:27:04 +00:00
Rasmus Munk Larsen
b08527b0c1 Clean up CUDA/NVCC version macros and their use in Eigen, and a few other CUDA build failures. 2019-05-31 15:26:06 -07:00
tra
b4c49bf00e Minor build improvements
* Allow specifying multiple GPU architectures. E.g.:
  cmake -DEIGEN_CUDA_COMPUTE_ARCH="60;70"
* Pass CUDA SDK path to clang. Without it it will default to /usr/local/cuda
which may not be the right location, if cmake was invoked with
-DCUDA_TOOLKIT_ROOT_DIR=/some/other/CUDA/path
2019-05-31 14:08:34 -07:00
Michael Tesch
c5019f722b Use pade for matrix exponential also for complex values. 2019-05-08 17:04:55 +02:00
Rasmus Larsen
e92486b8c3 Merged in rmlarsen/eigen (pull request PR-643)
Make Eigen build with cuda 10 and clang.

Approved-by: Justin Lebar <justin.lebar@gmail.com>
2019-05-20 17:02:39 +00:00
Eugene Zhulenev
01654d97fa Prevent potential division by zero in TensorExecutor 2019-05-17 14:02:25 -07:00
Eugene Zhulenev
96a276803c Always evaluate Tensor expressions with broadcasting via tiled evaluation code path 2019-05-16 16:15:45 -07:00
Rasmus Munk Larsen
ab0a30e429 Make Eigen build with cuda 10 and clang. 2019-05-15 13:32:15 -07:00
Rasmus Larsen
c8d8d5c0fc Merged in rmlarsen/eigen_threadpool (pull request PR-640)
Fix deadlocks in thread pool.

Approved-by: Eugene Zhulenev <ezhulenev@google.com>
2019-05-13 20:04:35 +00:00
Christoph Hertzberg
4ccd1ece92 bug #1707: Fix deprecation warnings, or disable warnings when testing deprecated functions 2019-05-10 14:57:05 +02:00
Rasmus Munk Larsen
e5ac8cbd7a A) fix deadlocks in thread pool caused by EventCount
This fixed 2 deadlocks caused by sloppiness in the EventCount logic.
Both most likely were introduced by cl/236729920 which includes the new EventCount algorithm:
01da8caf00

bug #1 (Prewait):
Prewait must not consume existing signals.
Consider the following scenario.
There are 2 thread pool threads (1 and 2) and 1 external thread (3). RunQueue is empty.
Thread 1 checks the queue, calls Prewait, checks RunQueue again and now is going to call CommitWait.
Thread 2 checks the queue and now is going to call Prewait.
Thread 3 submits 2 tasks, EventCount signals is set to 1 because only 1 waiter is registered the second signal is discarded).
Now thread 2 resumes and calls Prewait and takes away the signal.
Thread 1 resumes and calls CommitWait, there are no pending signals anymore, so it blocks.
As the result we have 2 tasks, but only 1 thread is running.

bug #2 (CancelWait):
CancelWait must not take away a signal if it's not sure that the signal was meant for this thread.
When one thread blocks and another submits a new task concurrently, the EventCount protocol guarantees only the following properties (similar to the Dekker's algorithm):
(a) the registered waiter notices presence of the new task and does not block
(b) the signaler notices presence of the waiters and wakes it
(c) both the waiter notices presence of the new task and signaler notices presence of the waiter
[it's only that both of them do not notice each other must not be possible, because it would lead to a deadlock]
CancelWait is called for cases (a) and (c). For case (c) it is OK to take the notification signal away, but it's not OK for (a) because nobody queued a signals for us and we take away a signal meant for somebody else.
Consider:
Thread 1 calls Prewait, checks RunQueue, it's empty, now it's going to call CommitWait.
Thread 3 submits 2 tasks, EventCount signals is set to 1 because only 1 waiter is registered the second signal is discarded).
Thread 2 calls Prewait, checks RunQueue, discovers the tasks, calls CancelWait and consumes the pending signal (meant for thread 1).
Now Thread 1 resumes and calls CommitWait, since there are no signals it blocks.
As the result we have 2 tasks, but only 1 thread is running.

Both deadlocks are only a problem if the tasks require parallelism. Most computational tasks do not require parallelism, i.e. a single thread will run task 1, finish it and then dequeue and run task 2.

This fix undoes some of the sloppiness in the EventCount that was meant to reduce CPU consumption by idle threads, because we now have more threads running in these corner cases. But we still don't have pthread_yield's and maybe the strictness introduced by this change will actually help to reduce tail latency because we will have threads running when we actually need them running.



B) fix deadlock in thread pool caused by RunQueue

This fixed a deadlock caused by sloppiness in the RunQueue logic.
Most likely this was introduced with the non-blocking thread pool.
The deadlock only affects workloads that require parallelism.
Most computational tasks don't require parallelism.

PopBack must not fail spuriously. If it does, it can effectively lead to single thread consuming several wake up signals.
Consider 2 worker threads are blocked.
External thread submits a task. One of the threads is woken.
It tries to steal the task, but fails due to a spurious failure in PopBack (external thread submits another task and holds the lock).
The thread executes blocking protocol again (it won't block because NonEmptyQueueIndex is precise and the thread will discover pending work, but it has called PrepareWait).
Now external thread submits another task and signals EventCount again.
The signal is consumed by the first thread again. But now we have 2 tasks pending but only 1 worker thread running.

It may be possible to fix this in a different way: make EventCount::CancelWait forward wakeup signal to a blocked thread rather then consuming it. But this looks more complex and I am not 100% that it will fix the bug.
It's also possible to have 2 versions of PopBack: one will do try_to_lock and another won't. Then worker threads could first opportunistically check all queues with try_to_lock, and only use the blocking version before blocking. But let's first fix the bug with the simpler change.
2019-05-08 10:16:46 -07:00