Commit aa110e681b
optimised the multiplication of small dyanmically
sized matrices by restricting the packet size to a maximum of 4, increasing
the chances that SIMD instructions are used in the computation. However, it
introduced a mismatch between the packet size and the requestedAlignment. This
mismatch can lead to crashes when the destination is not aligned. This patch
fixes the issue by ensuring that the AssignmentTraits are correctly computed
when using a restricted packet size.
* * *
Bind LinearPacketType to MaxPacketSize
This commit applies any packet size limit specified when instantiating
copy_using_evaluator_traits to the LinearPacketType, providing that the
size of the destination is not known at compile time.
* * *
Add unit test for restricted packet assignment
A new unit test is added to check that multiplication of small dynamically
sized matrices works correctly when the packet size is restricted to 4 and
the destination is unaligned.
* Support compiling without IO streams
Add the preprocessor definition EIGEN_NO_IO which, if defined,
disables all use of the IO streams part of the standard library.
INFO: From Compiling tensorflow/core/kernels/maxpooling_op_gpu.cu.cc:
/b/f/w/run/external/eigen_archive/Eigen/src/Core/arch/GPU/Half.h(197): error: calling a __host__ function("std::equal_to<float> ::operator () const") from a __global__ function("tensorflow::_NV_ANON_NAMESPACE::MaxPoolGradBackwardNoMaskNHWC< ::Eigen::half> ") is not allowed
/b/f/w/run/external/eigen_archive/Eigen/src/Core/arch/GPU/Half.h(197): error: identifier "std::equal_to<float> ::operator () const" is undefined in device code"
/b/f/w/run/external/eigen_archive/Eigen/src/Core/arch/GPU/Half.h(197): error: calling a __host__ function("std::equal_to<float> ::operator () const") from a __global__ function("tensorflow::_NV_ANON_NAMESPACE::MaxPoolGradBackwardNoMaskNCHW< ::Eigen::half> ") is not allowed
/b/f/w/run/external/eigen_archive/Eigen/src/Core/arch/GPU/Half.h(197): error: identifier "std::equal_to<float> ::operator () const" is undefined in device code
4 errors detected in the compilation of "/tmp/tmpxft_00000011_00000000-6_maxpooling_op_gpu.cu.cpp1.ii".
ERROR: /tmpfs/tensor_flow/tensorflow/core/kernels/BUILD:3753:1: output 'tensorflow/core/kernels/_objs/pooling_ops_gpu/maxpooling_op_gpu.cu.pic.o' was not created
ERROR: /tmpfs/tensor_flow/tensorflow/core/kernels/BUILD:3753:1: Couldn't build file tensorflow/core/kernels/_objs/pooling_ops_gpu/maxpooling_op_gpu.cu.pic.o: not all outputs were created or valid
The Packet16f, Packet8f and Packet8d types are too large to use with dynamically
sized matrices typically processed by the SliceVectorizedTraversal specialization of
the dense_assignment_loop. Using these types is likely to lead to little or no
vectorization. Significant slowdown in the multiplication of these small matrices can
be observed when building with AVX and AVX512 enabled.
This patch introduces a new dense_assignment_kernel that is used when
computing small products whose operands have dynamic dimensions. It ensures that the
PacketSize used is no larger than 4, thereby increasing the chance that vectorized
instructions will be used when computing the product.
I tested all 969 possible combinations of M, K, and N that are handled by the
dense_assignment_loop on x86 builds. Although a few combinations are slowed down
by this patch they are far outnumbered by the cases that are sped up, as the
following results demonstrate.
Disabling Packed8d on AVX512 builds:
Total Cases: 969
Better: 511
Worse: 85
Same: 373
Max Improvement: 169.00% (4 8 6)
Max Degradation: 36.50% (8 5 3)
Median Improvement: 35.46%
Median Degradation: 17.41%
Total FLOPs Improvement: 19.42%
Disabling Packet16f and Packed8f on AVX512 builds:
Total Cases: 969
Better: 658
Worse: 5
Same: 306
Max Improvement: 214.05% (8 6 5)
Max Degradation: 22.26% (16 2 1)
Median Improvement: 60.05%
Median Degradation: 13.32%
Total FLOPs Improvement: 59.58%
Disabling Packed8f on AVX builds:
Total Cases: 969
Better: 663
Worse: 96
Same: 210
Max Improvement: 155.29% (4 10 5)
Max Degradation: 35.12% (8 3 2)
Median Improvement: 34.28%
Median Degradation: 15.05%
Total FLOPs Improvement: 26.02%
- unsupported/Eigen/CXX11/src/Tensor/TensorReductionGpu.h
Changing "pass-by-reference" argument to be "pass-by-value" instead
(in a __global__ function decl).
"pass-by-reference" arguments to __global__ functions are unwise,
and will be explicitly flagged as errors by the newer versions of HIP.
- Eigen/src/Core/util/Memory.h
- unsupported/Eigen/CXX11/src/Tensor/TensorContraction.h
Changes introduced in recent commits breaks the HIP compile.
Adding EIGEN_DEVICE_FUNC attribute to some functions and
calling ::malloc/free instead of the corresponding std:: versions
to get the HIP compile working again
- unsupported/Eigen/CXX11/src/Tensor/TensorReduction.h
Change introduced a recent commit breaks the HIP compile
(link stage errors out due to failure to inline a function).
Disabling the recently introduced code (only for HIP compile), to get
the eigen nightly testing going again.
Will submit another PR once we have te proper fix.
- Eigen/src/Core/util/ConfigureVectorization.h
Enabling GPU VECTOR support when HIP compiler is in use
(for both the host and device compile phases)
Compiling the eigen unittests with hip-clang (HIP with clang as the underlying compiler instead of hcc or nvcc), results in compile errors. The changes in this commit fix those compile errors. The main change is to convert a few instances of "__device__" to "EIGEN_DEVICE_FUNC"
Previously the rendered math in the doc string called the optional return value
'r', while the actual parameter and the doc string text referred to the
parameter as 'z'. This changeset renames all the z's to r's to match the math.
This commit fixes the AVX512 implementations of psqrt in the same
way that 3ed67cb0bb
fixed the AVX2 version of this function. The
AVX512 versions of psqrt incorrectly return -0.0 for negative
values, instead of NaN. Fixing the issues requires adding
some additional instructions that slow down the algorithms. A
similar test to the one used in 3ed67cb0bb
shows that the
corrected Packet16f code runs at 73% of the speed of the existing code,
while the corrected Packed8d function runs at 68% of the original.
Also, a few minor fixes for GPU tests running in HIP mode.
1. Adding an include for hip/hip_runtime.h in the Macros.h file
For HIP __host__ and __device__ are macros which are defined in hip headers.
Their definitions need to be included before their use in the file.
2. Fixing the compile failure in TensorContractionGpu introduced by the commit to
"Fuse computations into the Tensor contractions using output kernel"
3. Fixing a HIP/clang specific compile error by making the struct-member assignment explicit
The major changes are
1. Moving CUDA/PacketMath.h to GPU/PacketMath.h
2. Moving CUDA/MathFunctions.h to GPU/MathFunction.h
3. Moving CUDA/CudaSpecialFunctions.h to GPU/GpuSpecialFunctions.h
The above three changes effectively enable the Eigen "Packet" layer for the HIP platform
4. Merging the "hip_basic" and "cuda_basic" unit tests into one ("gpu_basic")
5. Updating the "EIGEN_DEVICE_FUNC" marking in some places
The change has been tested on the HIP and CUDA platforms.
The AVX512 version of ptranpose for PacketBlock<Packet16h,16> was
reordering the PacketBlock argument incorrectly. This lead to errors in
the multiplication of matrices composed of 16 bit floats on AVX512
machines, if at least of the matrices was using RowMajor order. This
error is responsible for one tensorflow unit test failure on AVX512
machines:
//tensorflow/python/kernel_tests:batch_matmul_op_test
There are two major changes (and a few minor ones which are not listed here...see PR discussion for details)
1. Eigen::half implementations for HIP and CUDA have been merged.
This means that
- `CUDA/Half.h` and `HIP/hcc/Half.h` got merged to a new file `GPU/Half.h`
- `CUDA/PacketMathHalf.h` and `HIP/hcc/PacketMathHalf.h` got merged to a new file `GPU/PacketMathHalf.h`
- `CUDA/TypeCasting.h` and `HIP/hcc/TypeCasting.h` got merged to a new file `GPU/TypeCasting.h`
After this change the `HIP/hcc` directory only contains one file `math_constants.h`. That will go away too once that file becomes a part of the HIP install.
2. new macros EIGEN_GPUCC, EIGEN_GPU_COMPILE_PHASE and EIGEN_HAS_GPU_FP16 have been added and the code has been updated to use them where appropriate.
- `EIGEN_GPUCC` is the same as `(EIGEN_CUDACC || EIGEN_HIPCC)`
- `EIGEN_GPU_DEVICE_COMPILE` is the same as `(EIGEN_CUDA_ARCH || EIGEN_HIP_DEVICE_COMPILE)`
- `EIGEN_HAS_GPU_FP16` is the same as `(EIGEN_HAS_CUDA_FP16 or EIGEN_HAS_HIP_FP16)`
In addition to igamma(a, x), this code implements:
* igamma_der_a(a, x) = d igamma(a, x) / da -- derivative of igamma with respect to the parameter
* gamma_sample_der_alpha(alpha, sample) -- reparameterization derivative of a Gamma(alpha, 1) random variable sample with respect to the alpha parameter
The derivatives are computed by forward mode differentiation of the igamma(a, x) code. Although gamma_sample_der_alpha can be implemented via igamma_der_a, a separate function is more accurate and efficient due to analytical cancellation of some terms. All three functions are implemented by a method parameterized with "mode" that always computes the derivatives, but does not return them unless required by the mode. The compiler is expected to (and, based on benchmarks, does) skip the unnecessary computations depending on the mode.
This commit enables the use of Eigen on HIP kernels / AMD GPUs. Support has been added along the same lines as what already exists for using Eigen in CUDA kernels / NVidia GPUs.
Application code needs to explicitly define EIGEN_USE_HIP when using Eigen in HIP kernels. This is because some of the CUDA headers get picked up by default during Eigen compile (irrespective of whether or not the underlying compiler is CUDACC/NVCC, for e.g. Eigen/src/Core/arch/CUDA/Half.h). In order to maintain this behavior, the EIGEN_USE_HIP macro is used to switch to using the HIP version of those header files (see Eigen/Core and unsupported/Eigen/CXX11/Tensor)
Use the "-DEIGEN_TEST_HIP" cmake option to enable the HIP specific unit tests.
The functions are conventionally called i0e and i1e. The exponentially scaled version is more numerically stable. The standard Bessel functions can be obtained as i0(x) = exp(|x|) i0e(x)
The code is ported from Cephes and tested against SciPy.
bug #1548
The macro EIGEN_IDEAL_MAX_ALIGN_BYTES is being incorrectly set to 32
on AVX512 builds. It should be set to 64. In the current code it is
only set to 64 if the macro EIGEN_VECTORIZE_AVX512 is defined. This
macro does get defined in AVX512 builds in Core, but only after Macros.h,
the file that defines EIGEN_IDEAL_MAX_ALIGN_BYTES, has been included.
This commit fixes the issue by setting EIGEN_IDEAL_MAX_ALIGN_BYTES to
64 if __AVX512F__ is defined.
specializations. Otherwise causes problems with small fixed size matrix multiplication (call to
0x00 in call_assignment_no_alias in debug mode or trap in release with CUDA 9.1).
1) Q is always square
2) Q*R*P' is valid and recovers the original matrix
This implies that the size of Q is the number of rows in the original matrix, square,
and that the size of R is the size of the original matrix.