eigen

mirror of https://gitlab.com/libeigen/eigen.git synced 2024-12-21 07:19:46 +08:00

Author	SHA1	Message	Date
Benoit Steiner	e7e64c3277	Enable the use of the packet api to evaluate tensor broadcasts. This speeds things up quite a bit: Before: BM_broadcasting/10 500000 3690 27.10 MFlops/s BM_broadcasting/80 500000 4014 1594.24 MFlops/s BM_broadcasting/640 100000 14770 27731.35 MFlops/s BM_broadcasting/4K 5000 632711 39512.48 MFlops/s After: BM_broadcasting/10 500000 4287 23.33 MFlops/s BM_broadcasting/80 500000 4455 1436.41 MFlops/s BM_broadcasting/640 200000 10195 40173.01 MFlops/s BM_broadcasting/4K 5000 423746 58997.57 MFlops/s	2016-05-17 09:24:35 -07:00
Benoit Steiner	5fa27574dd	Allow vectorized padding on GPU. This helps speed things up a little. Before: BM_padding/10 5000000 460 217.03 MFlops/s BM_padding/80 5000000 460 13899.40 MFlops/s BM_padding/640 5000000 461 888421.17 MFlops/s BM_padding/4K 5000000 460 54316322.55 MFlops/s After: BM_padding/10 5000000 454 220.20 MFlops/s BM_padding/80 5000000 455 14039.86 MFlops/s BM_padding/640 5000000 452 904968.83 MFlops/s BM_padding/4K 5000000 411 60750049.21 MFlops/s	2016-05-17 09:17:26 -07:00
Benoit Steiner	8d06c02ffd	Allow vectorized padding on GPU. This helps speed things up a little. Before: BM_padding/10 5000000 460 217.03 MFlops/s BM_padding/80 5000000 460 13899.40 MFlops/s BM_padding/640 5000000 461 888421.17 MFlops/s BM_padding/4K 5000000 460 54316322.55 MFlops/s After: BM_padding/10 5000000 454 220.20 MFlops/s BM_padding/80 5000000 455 14039.86 MFlops/s BM_padding/640 5000000 452 904968.83 MFlops/s BM_padding/4K 5000000 411 60750049.21 MFlops/s	2016-05-17 09:13:27 -07:00
David Dement	ccc7563ac5	made a fix to the GMRES solver so that it now correctly reports the error achieved in the solution process	2016-05-16 14:26:41 -04:00
Benoit Steiner	a80d875916	Added missing costPerCoeff method	2016-05-16 09:31:10 -07:00
Benoit Steiner	83ef39e055	Turn on the cost model by default. This results in some significant speedups for smaller tensors. For example, below are the results for the various tensor reductions. Before: BM_colReduction_12T/10 1000000 1949 51.29 MFlops/s BM_colReduction_12T/80 100000 15636 409.29 MFlops/s BM_colReduction_12T/640 20000 95100 4307.01 MFlops/s BM_colReduction_12T/4K 500 4573423 5466.36 MFlops/s BM_colReduction_4T/10 1000000 1867 53.56 MFlops/s BM_colReduction_4T/80 500000 5288 1210.11 MFlops/s BM_colReduction_4T/640 10000 106924 3830.75 MFlops/s BM_colReduction_4T/4K 500 9946374 2513.48 MFlops/s BM_colReduction_8T/10 1000000 1912 52.30 MFlops/s BM_colReduction_8T/80 200000 8354 766.09 MFlops/s BM_colReduction_8T/640 20000 85063 4815.22 MFlops/s BM_colReduction_8T/4K 500 5445216 4591.19 MFlops/s BM_rowReduction_12T/10 1000000 2041 48.99 MFlops/s BM_rowReduction_12T/80 100000 15426 414.87 MFlops/s BM_rowReduction_12T/640 50000 39117 10470.98 MFlops/s BM_rowReduction_12T/4K 500 3034298 8239.14 MFlops/s BM_rowReduction_4T/10 1000000 1834 54.51 MFlops/s BM_rowReduction_4T/80 500000 5406 1183.81 MFlops/s BM_rowReduction_4T/640 50000 35017 11697.16 MFlops/s BM_rowReduction_4T/4K 500 3428527 7291.76 MFlops/s BM_rowReduction_8T/10 1000000 1925 51.95 MFlops/s BM_rowReduction_8T/80 200000 8519 751.23 MFlops/s BM_rowReduction_8T/640 50000 33441 12248.42 MFlops/s BM_rowReduction_8T/4K 1000 2852841 8763.19 MFlops/s After: BM_colReduction_12T/10 50000000 59 1678.30 MFlops/s BM_colReduction_12T/80 5000000 725 8822.71 MFlops/s BM_colReduction_12T/640 20000 90882 4506.93 MFlops/s BM_colReduction_12T/4K 500 4668855 5354.63 MFlops/s BM_colReduction_4T/10 50000000 59 1687.37 MFlops/s BM_colReduction_4T/80 5000000 737 8681.24 MFlops/s BM_colReduction_4T/640 50000 108637 3770.34 MFlops/s BM_colReduction_4T/4K 500 7912954 3159.38 MFlops/s BM_colReduction_8T/10 50000000 60 1657.21 MFlops/s BM_colReduction_8T/80 5000000 726 8812.48 MFlops/s BM_colReduction_8T/640 20000 91451 4478.90 MFlops/s BM_colReduction_8T/4K 500 5441692 4594.16 MFlops/s BM_rowReduction_12T/10 20000000 93 1065.28 MFlops/s BM_rowReduction_12T/80 2000000 950 6730.96 MFlops/s BM_rowReduction_12T/640 50000 38196 10723.48 MFlops/s BM_rowReduction_12T/4K 500 3019217 8280.29 MFlops/s BM_rowReduction_4T/10 20000000 93 1064.30 MFlops/s BM_rowReduction_4T/80 2000000 959 6667.71 MFlops/s BM_rowReduction_4T/640 50000 37433 10941.96 MFlops/s BM_rowReduction_4T/4K 500 3036476 8233.23 MFlops/s BM_rowReduction_8T/10 20000000 93 1072.47 MFlops/s BM_rowReduction_8T/80 2000000 959 6670.04 MFlops/s BM_rowReduction_8T/640 50000 38069 10759.37 MFlops/s BM_rowReduction_8T/4K 1000 2758988 9061.29 MFlops/s	2016-05-16 08:55:21 -07:00
Benoit Steiner	b789a26804	Fixed syntax error	2016-05-16 08:51:08 -07:00
Benoit Steiner	83dfb40f66	Turn on the new thread pool by default since it scales much better over multiple cores. It is still possible to revert to the old thread pool by compiling with the EIGEN_USE_SIMPLE_THREAD_POOL define.	2016-05-13 17:23:15 -07:00
Benoit Steiner	97605c7b27	New multithreaded contraction that doesn't rely on the thread pool to run the closures in the order in which they are enqueued. This is needed in order to switch to the new non blocking thread pool since this new thread pool can execute the closures in any order.	2016-05-13 17:11:29 -07:00
Benoit Steiner	c4fc8b70ec	Removed unnecessary thread synchronization	2016-05-13 10:49:38 -07:00
Benoit Steiner	7aa3557d31	Fixed compilation errors triggered by old versions of gcc	2016-05-12 18:59:04 -07:00
Rasmus Munk Larsen	5005b27fc8	Disabled cost model by accident. Revert.	2016-05-12 16:55:21 -07:00
Rasmus Munk Larsen	989e419328	Address comments by bsteiner.	2016-05-12 16:54:19 -07:00
Rasmus Munk Larsen	e55deb21c5	Improvements to parallelFor. Move some scalar functors from TensorFunctors to Eigen core.	2016-05-12 14:07:22 -07:00
Benoit Steiner	ae9688f313	Worked around a compilation error triggered by nvcc when compiling a tensor concatenation kernel.	2016-05-12 12:06:51 -07:00
Benoit Steiner	2a54b70d45	Fixed potential race condition in the non blocking thread pool	2016-05-12 11:45:48 -07:00
Benoit Steiner	a071629fec	Replace implicit cast with an explicit one	2016-05-12 10:40:07 -07:00
Benoit Steiner	2f9401b061	Worked around compilation errors with older versions of gcc	2016-05-11 23:39:20 -07:00
Benoit Steiner	09653e1f82	Improved the portability of the tensor code	2016-05-11 23:29:09 -07:00
Benoit Steiner	b6a517c47d	Added the ability to load fp16 using the texture path. Improved the performance of some reductions on fp16	2016-05-11 21:26:48 -07:00
Christoph Hertzberg	1a1ce6ff61	Removed deprecated flag (which apparently was ignored anyway)	2016-05-11 23:05:37 +02:00
Christoph Hertzberg	2150f13d65	fixed some double-promotion and sign-compare warnings	2016-05-11 23:02:26 +02:00
Benoit Steiner	217d984abc	Fixed a typo in my previous commit	2016-05-11 10:22:15 -07:00
Benoit Steiner	08348b4e48	Fix potential race condition in the CUDA reduction code.	2016-05-11 10:08:51 -07:00
Benoit Steiner	6a5717dc74	Explicitly initialize all the atomic variables.	2016-05-11 10:04:41 -07:00
Benoit Steiner	4ede059de1	Properly gate the use of half2.	2016-05-10 17:04:01 -07:00
Benoit Steiner	661e710092	Added support for fp16 to the sigmoid functor.	2016-05-10 12:25:27 -07:00
Benoit Steiner	0eb69b7552	Small improvement to the full reduction of fp16	2016-05-10 11:58:18 -07:00
Benoit Steiner	4013b8feca	Simplified the reduction code a little.	2016-05-10 09:40:42 -07:00
Benoit Steiner	4670d7d5ce	Improved the performance of full reductions on GPU: Before: BM_fullReduction/10 200000 11751 8.51 MFlops/s BM_fullReduction/80 5000 523385 12.23 MFlops/s BM_fullReduction/640 50 36179326 11.32 MFlops/s BM_fullReduction/4K 1 2173517195 11.50 MFlops/s After: BM_fullReduction/10 500000 5987 16.70 MFlops/s BM_fullReduction/80 200000 10636 601.73 MFlops/s BM_fullReduction/640 50000 58428 7010.31 MFlops/s BM_fullReduction/4K 1000 2006106 12461.95 MFlops/s	2016-05-09 17:09:54 -07:00
Benoit Steiner	c3859a2b58	Added the ability to use a scratch buffer in cuda kernels	2016-05-09 17:05:53 -07:00
Benoit Steiner	ba95e43ea2	Added a new parallelFor api to the thread pool device.	2016-05-09 10:45:12 -07:00
Benoit Steiner	dc7dbc2df7	Optimized the non blocking thread pool: * Use a pseudo-random permutation of queue indices during random stealing. This ensures that all the queues are considered. * Directly pop from a non-empty queue when we are waiting for work, instead of first noticing that there is a non-empty queue and then doing another round of random stealing to re-discover the non-empty queue. * Steal only 1 task from a remote queue instead of half of tasks.	2016-05-09 10:17:17 -07:00
Benoit Steiner	c54ae65c83	Marked a few tensor operations as read only	2016-05-05 17:18:47 -07:00
Benoit Steiner	910e013506	Relaxed an assertion that was tighter than necessary.	2016-05-05 15:38:16 -07:00
Benoit Steiner	28d5572658	Fixed some incorrect assertions	2016-05-05 10:02:26 -07:00
Benoit Steiner	a4d6e8fef0	Strongly hint but don't force the compiler to unroll some loops in the tensor executor. This results in up to 27% faster code.	2016-05-05 09:25:55 -07:00
Benoit Steiner	f363e533aa	Added tests for full contractions using thread pools and gpu devices. Fixed a couple of issues in the corresponding code.	2016-05-05 09:05:45 -07:00
Benoit Steiner	06d774bf58	Updated the contraction code to ensure that full contractions return a tensor of rank 0	2016-05-05 08:37:47 -07:00
Christoph Hertzberg	dacb469bc9	Enable and fix -Wdouble-conversion warnings	2016-05-05 13:35:45 +02:00
Benoit Steiner	dd2b45feed	Removed extraneous 'explicit' keywords	2016-05-04 16:57:52 -07:00
Benoit Steiner	968ec1c2ae	Use numext::isfinite instead of std::isfinite	2016-05-03 19:56:40 -07:00
Benoit Steiner	aad9a04da4	Deleted superfluous explicit keyword.	2016-05-03 09:37:19 -07:00
Benoit Steiner	8a9228ed9b	Fixed compilation error	2016-05-01 14:48:01 -07:00
Benoit Steiner	d6c9596fd8	Added missing accessors to fixed sized tensors	2016-04-29 18:51:33 -07:00
Benoit Steiner	17fe7f354e	Deleted trailing commas	2016-04-29 18:39:01 -07:00
Benoit Steiner	e5f71aa6b2	Deleted useless trailing commas	2016-04-29 18:36:10 -07:00
Benoit Steiner	44f592dceb	Deleted unnecessary trailing commas.	2016-04-29 18:33:46 -07:00
Benoit Steiner	f100d1494c	Return the proper size (ie 1) for tensors of rank 0	2016-04-29 18:14:33 -07:00
Benoit Steiner	a8c0405cf5	Deleted unused default values for template parameters	2016-04-29 16:34:43 -07:00
Benoit Steiner	c07404f6a1	Restore Tensor support for non c++11 compilers	2016-04-29 15:19:19 -07:00
Benoit Steiner	ba32ded021	Fixed include path	2016-04-29 15:11:09 -07:00
Gael Guennebaud	318e65e0ae	Fix missing inclusion of Eigen/Core	2016-04-27 23:05:40 +02:00
Rasmus Munk Larsen	463738ccbe	Use computeProductBlockingSizes to compute blocking for both ShardByCol and ShardByRow cases.	2016-04-27 12:26:18 -07:00
Gael Guennebaud	3dddd34133	Refactor the unsupported CXX11/Core module to internal headers only.	2016-04-26 11:20:25 +02:00
Benoit Steiner	4a164d2c46	Fixed the partial evaluation of non vectorizable tensor subexpressions	2016-04-25 10:43:03 -07:00
Benoit Steiner	fd9401f260	Refined the cost of the striding operation.	2016-04-25 09:16:08 -07:00
Benoit Steiner	4bbc97be5e	Provide access to the base threadpool classes	2016-04-21 17:59:33 -07:00
Benoit Steiner	33adce5c3a	Added the ability to switch to the new thread pool with a #define	2016-04-21 11:59:58 -07:00
Benoit Steiner	f670613e4b	Fixed several compilation warnings	2016-04-21 11:03:02 -07:00
Benoit Steiner	2dde1b1028	Don't crash when attempting to reduce empty tensors.	2016-04-20 18:08:20 -07:00
Benoit Steiner	c7c2054bb5	Started to implement a portable way to yield.	2016-04-19 17:59:58 -07:00
Benoit Steiner	2b72163028	Implemented a more portable version of thread local variables	2016-04-19 15:56:02 -07:00
Benoit Steiner	5b1106c56b	Fixed a compilation error with nvcc 7.	2016-04-19 14:57:57 -07:00
Benoit Steiner	7129d998db	Simplified the code that launches cuda kernels.	2016-04-19 14:55:21 -07:00
Benoit Steiner	b9ea40c30d	Don't take the address of a kernel on CUDA devices that don't support this feature.	2016-04-19 14:35:11 -07:00
Benoit Steiner	884c075058	Use numext::ceil instead of std::ceil	2016-04-19 14:33:30 -07:00
Benoit Steiner	a278414d1b	Avoid an unnecessary copy of the evaluator.	2016-04-19 13:54:28 -07:00
Benoit Steiner	50968a0a3e	Use DenseIndex in the MeanReducer to avoid overflows when processing very large tensors.	2016-04-19 11:53:58 -07:00
Benoit Steiner	c8e8f93d6c	Move the evalGemm method into the TensorContractionEvaluatorBase class to make it accessible from both the single and multithreaded contraction evaluators.	2016-04-15 16:48:10 -07:00
Benoit Steiner	7cff898e0a	Deleted unnecessary variable	2016-04-15 15:46:14 -07:00
Benoit Steiner	6c43c49e4a	Fixed a few compilation warnings	2016-04-15 15:34:34 -07:00
Benoit Steiner	eb669f989f	Merged in rmlarsen/eigen (pull request PR-178) Eigen Tensor cost model part 2: Thread scheduling for standard evaluators and reductions.	2016-04-15 14:53:15 -07:00
Rasmus Munk Larsen	3718bf654b	Get rid of void* casting when calling EvalRange::run.	2016-04-15 12:51:33 -07:00
Benoit Steiner	a62e924656	Added ability to access the cache sizes from the tensor devices	2016-04-14 21:25:06 -07:00
Benoit Steiner	18e6f67426	Added support for exclusive or	2016-04-14 20:37:46 -07:00
Rasmus Munk Larsen	07ac4f7e02	Eigen Tensor cost model part 2: Thread scheduling for standard evaluators and reductions. The cost model is turned off by default.	2016-04-14 18:28:23 -07:00
Benoit Steiner	9624a1ea3d	Added missing definition of PacketSize in the gpu evaluator of convolution	2016-04-14 17:16:58 -07:00
Benoit Steiner	6fbedf5a4e	Merged in rmlarsen/eigen (pull request PR-177) Eigen Tensor cost model part 1.	2016-04-14 17:13:19 -07:00
Benoit Steiner	9c064b5a97	Cleanup	2016-04-14 16:41:31 -07:00
Benoit Steiner	1372156c41	Prepared the migration to the new non blocking thread pool	2016-04-14 16:16:42 -07:00
Rasmus Munk Larsen	aeb5494a0b	Improvements to cost model.	2016-04-14 15:52:58 -07:00
Benoit Steiner	78a51abc12	Added a more scalable non blocking thread pool	2016-04-14 15:23:10 -07:00
Rasmus Munk Larsen	d2e95492e7	Merge upstream updates.	2016-04-14 13:59:50 -07:00
Rasmus Munk Larsen	235e83aba6	Eigen cost model part 1. This implements a basic recursive framework to estimate the cost of evaluating tensor expressions.	2016-04-14 13:57:35 -07:00
Benoit Steiner	5912ad877c	Silenced a compilation warning	2016-04-14 11:40:14 -07:00
Benoit Steiner	c7167fee0e	Added support for fp16 to the sigmoid function	2016-04-14 10:08:33 -07:00
Benoit Steiner	3b76df64fc	Defer the decision to vectorize tensor CUDA code to the meta kernel. This makes it possible to decide to vectorize or not depending on the capability of the target cuda architecture. In particular, this enables us to vectorize the processing of fp16 when running on device of capability >= 5.3	2016-04-12 10:58:51 -07:00
Benoit Steiner	7d5b17087f	Added missing EIGEN_DEVICE_FUNC to the tensor conversion code.	2016-04-07 20:01:19 -07:00
Benoit Steiner	48308ed801	Added support for isinf, isnan, and isfinite checks to the tensor api	2016-04-07 09:48:36 -07:00
Benoit Steiner	cfb34d808b	Fixed a possible integer overflow.	2016-04-07 08:46:52 -07:00
Benoit Steiner	7be1eaad1e	Fixed typos in the implementation of the zeta and polygamma ops.	2016-04-06 14:15:37 -07:00
tillahoffmann	726bd5f077	Merged eigen/eigen into default	2016-04-05 18:21:05 +01:00
Gael Guennebaud	4d7e230d2f	bug #1189 : fix pow/atan2 compilation for AutoDiffScalar	2016-04-05 14:49:41 +02:00
Till Hoffmann	80eba21ad0	Merge upstream.	2016-04-01 18:18:49 +01:00
Till Hoffmann	ffd770ce94	Fixed CUDA signature.	2016-04-01 17:58:24 +01:00
tillahoffmann	49960adbdd	Merged eigen/eigen into default	2016-04-01 14:36:15 +01:00
Till Hoffmann	57239f4a81	Added polygamma function.	2016-04-01 14:35:21 +01:00
Till Hoffmann	dd5d390daf	Added zeta function.	2016-04-01 13:32:29 +01:00
Benoit Steiner	3da495e6b9	Relaxed the condition used to gate the fft code.	2016-03-31 18:11:51 -07:00


1461 Commits