evalShardedByInnerDim ensures that the values it passes for start_k and
end_k to evalGemmPartialWithoutOutputKernel are multiples of 8 as the kernel
does not work correctly when the values of k are not multiples of the
packet_size. While this precaution works for AVX builds, it is insufficient
for AVX512 builds where the maximum packet size is 16. The result is slightly
incorrect float32 contractions on AVX512 builds.
This commit fixes the problem by ensuring that k is always a multiple of
the packet_size if the packet_size is > 8.
This patch modifies the TensorContraction class to ensure that the kc_ field is
always a multiple of the packet_size, if the packet_size is > 8. Without this
change spatial convolutions in Tensorflow do not work properly as the code that
re-arranges the input matrices can assert if kc_ is not a multiple of the
packet_size. This leads to a unit test failure,
//tensorflow/python/kernel_tests:conv_ops_test, on AVX512 builds of tensorflow.
* Support compiling without IO streams
Add the preprocessor definition EIGEN_NO_IO which, if defined,
disables all use of the IO streams part of the standard library.
INFO: From Compiling tensorflow/core/kernels/maxpooling_op_gpu.cu.cc:
/b/f/w/run/external/eigen_archive/Eigen/src/Core/arch/GPU/Half.h(197): error: calling a __host__ function("std::equal_to<float> ::operator () const") from a __global__ function("tensorflow::_NV_ANON_NAMESPACE::MaxPoolGradBackwardNoMaskNHWC< ::Eigen::half> ") is not allowed
/b/f/w/run/external/eigen_archive/Eigen/src/Core/arch/GPU/Half.h(197): error: identifier "std::equal_to<float> ::operator () const" is undefined in device code"
/b/f/w/run/external/eigen_archive/Eigen/src/Core/arch/GPU/Half.h(197): error: calling a __host__ function("std::equal_to<float> ::operator () const") from a __global__ function("tensorflow::_NV_ANON_NAMESPACE::MaxPoolGradBackwardNoMaskNCHW< ::Eigen::half> ") is not allowed
/b/f/w/run/external/eigen_archive/Eigen/src/Core/arch/GPU/Half.h(197): error: identifier "std::equal_to<float> ::operator () const" is undefined in device code
4 errors detected in the compilation of "/tmp/tmpxft_00000011_00000000-6_maxpooling_op_gpu.cu.cpp1.ii".
ERROR: /tmpfs/tensor_flow/tensorflow/core/kernels/BUILD:3753:1: output 'tensorflow/core/kernels/_objs/pooling_ops_gpu/maxpooling_op_gpu.cu.pic.o' was not created
ERROR: /tmpfs/tensor_flow/tensorflow/core/kernels/BUILD:3753:1: Couldn't build file tensorflow/core/kernels/_objs/pooling_ops_gpu/maxpooling_op_gpu.cu.pic.o: not all outputs were created or valid
The Packet16f, Packet8f and Packet8d types are too large to use with dynamically
sized matrices typically processed by the SliceVectorizedTraversal specialization of
the dense_assignment_loop. Using these types is likely to lead to little or no
vectorization. Significant slowdown in the multiplication of these small matrices can
be observed when building with AVX and AVX512 enabled.
This patch introduces a new dense_assignment_kernel that is used when
computing small products whose operands have dynamic dimensions. It ensures that the
PacketSize used is no larger than 4, thereby increasing the chance that vectorized
instructions will be used when computing the product.
I tested all 969 possible combinations of M, K, and N that are handled by the
dense_assignment_loop on x86 builds. Although a few combinations are slowed down
by this patch they are far outnumbered by the cases that are sped up, as the
following results demonstrate.
Disabling Packed8d on AVX512 builds:
Total Cases: 969
Better: 511
Worse: 85
Same: 373
Max Improvement: 169.00% (4 8 6)
Max Degradation: 36.50% (8 5 3)
Median Improvement: 35.46%
Median Degradation: 17.41%
Total FLOPs Improvement: 19.42%
Disabling Packet16f and Packed8f on AVX512 builds:
Total Cases: 969
Better: 658
Worse: 5
Same: 306
Max Improvement: 214.05% (8 6 5)
Max Degradation: 22.26% (16 2 1)
Median Improvement: 60.05%
Median Degradation: 13.32%
Total FLOPs Improvement: 59.58%
Disabling Packed8f on AVX builds:
Total Cases: 969
Better: 663
Worse: 96
Same: 210
Max Improvement: 155.29% (4 10 5)
Max Degradation: 35.12% (8 3 2)
Median Improvement: 34.28%
Median Degradation: 15.05%
Total FLOPs Improvement: 26.02%