eigen

mirror of https://gitlab.com/libeigen/eigen.git synced 2024-12-21 07:19:46 +08:00

History

Mehdi Goli f499fe9496 Adding synchronisation to convolution kernel for sycl backend.		2017-03-13 09:18:37 +00:00
..
benchmark_main.cc	Fixed the tensor benchmarks on apple devices	2016-01-28 21:08:07 -08:00
benchmark.h
contraction_benchmarks_cpu.cc	Added benchmarks for contraction on CPU.	2016-05-13 14:32:17 -07:00
README	Fixing typo in sycl Benchmark.	2017-03-08 18:20:06 +00:00
tensor_benchmarks_cpu.cc	Extended the tensor benchmark suite to support types other than floats	2016-02-23 05:28:02 +00:00
tensor_benchmarks_fp16_gpu.cu	Added a benchmark to measure the performance of full reductions of 16 bit floats	2016-05-05 14:15:11 -07:00
tensor_benchmarks_gpu.cu	Added benchmarks for full reduction	2016-02-29 14:57:52 -08:00
tensor_benchmarks_sycl_include_headers.cc	Fixing typo in sycl Benchmark.	2017-03-08 18:20:06 +00:00
tensor_benchmarks_sycl.cc	Adding sycl Benchmarks.	2017-03-08 14:17:48 +00:00
tensor_benchmarks.h	Adding synchronisation to convolution kernel for sycl backend.	2017-03-13 09:18:37 +00:00

README

The tensor benchmark suite is made of several parts.

The first part is a generic suite, in which each benchmark comes in 2 flavors: one that runs on CPU, and one that runs on GPU.

To compile the floating point CPU benchmarks, simply call:
g++ tensor_benchmarks_cpu.cc benchmark_main.cc -I ../../ -std=c++11 -O3 -DNDEBUG -pthread -mavx -o benchmarks_cpu

To compile the floating point GPU benchmarks, simply call:
nvcc tensor_benchmarks_gpu.cu benchmark_main.cc -I ../../ -std=c++11 -O2 -DNDEBUG -use_fast_math -ftz=true -arch compute_35 -o benchmarks_gpu

We also provide a version of the generic GPU tensor benchmarks that uses half floats (aka fp16) instead of regular floats. To compile these benchmarks, simply call the command line below. You'll need a recent GPU that supports compute capability 5.3 or higher to run them and nvcc 7.5 or higher to compile the code.
nvcc tensor_benchmarks_fp16_gpu.cu benchmark_main.cc -I ../../ -std=c++11 -O2 -DNDEBUG -use_fast_math -ftz=true -arch compute_53 -o benchmarks_fp16_gpu

last but not least, we also provide a suite of benchmarks to measure the scalability of the contraction code on CPU. To compile these benchmarks, call
g++ contraction_benchmarks_cpu.cc benchmark_main.cc -I ../../ -std=c++11 -O3 -DNDEBUG -pthread -mavx -o benchmarks_cpu

To compile and run the benchmark for SYCL, using ComputeCpp you currently need following passes (only for translation units containing device code):
1. The device compilation pass that generates the device code (SYCL kernels and referenced device functions) and glue code needed by the host compiler to reference the device code from host code.
{ComputeCpp_ROOT}/bin/compute++ -I ../../ -I {ComputeCpp_ROOT}/include/ -std=c++11 -mllvm -inline-threshold=1000 -Wno-ignored-attributes -sycl -intelspirmetadata -emit-llvm -no-serial-memop -sycl-compress-name -DBUILD_PLATFORM_SPIR -DNDBUG -O3 -c tensor_benchmarks_sycl.cc -DEIGEN_USE_SYCL=1
2. The host compilation pass that generates the final host binary.
clang++ -O3 -c benchmark_main.cc -pthread -I ../../ -D_GLIBCXX_USE_CXX11_ABI=0 -DEIGEN_USE_SYCL=1 -std=c++11 -o benchmark_main.o
clang++ -O3 tensor_benchmarks_sycl_include_headers.cc -pthread -I ../../ -I  {ComputeCpp_ROOT}/include/ -L  {ComputeCpp_ROOT}/lib/ -lComputeCpp -lOpenCL -D_GLIBCXX_USE_CXX11_ABI=0 -DEIGEN_USE_SYCL=1 -std=c++11 benchmark_main.o -o tensor_benchmark_sycl
export LD_LIBRARY_PATH={ComputeCpp_ROOT}/lib
3. Run the benchmark
./tensor_benchmark_sycl