* The scheduling of computation is moved out of the assignment code and into a new TensorExecutor class
* The assignment itself is now a regular node on the expression tree
* The expression evaluators start by recursively evaluating all their subexpressions if needed
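The three points above can be illustrated with a small standalone sketch. It is not the actual TensorExecutor code: the names `Leaf`, `SumOp`, `AssignOp`, `Evaluator`, `prepare` and `SequentialExecutor` are hypothetical and chosen only for illustration. The assignment is an ordinary node, each evaluator first prepares its subexpressions recursively, and the loop that schedules the work lives in a separate executor.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical leaf node: refers to a buffer of coefficients.
struct Leaf {
  std::vector<float>* data;
};

// Hypothetical binary node: coefficient-wise sum of two subexpressions.
template <typename Lhs, typename Rhs>
struct SumOp {
  Lhs lhs;
  Rhs rhs;
};

// The assignment is a regular node holding a destination and a source.
template <typename Rhs>
struct AssignOp {
  Leaf dst;
  Rhs src;
};

// One evaluator specialization per node type.
template <typename Expr>
struct Evaluator;

template <>
struct Evaluator<Leaf> {
  explicit Evaluator(const Leaf& e) : data_(e.data) {}
  void prepare() {}  // a leaf has nothing to evaluate ahead of time
  float coeff(std::size_t i) const { return (*data_)[i]; }
  float& coeffRef(std::size_t i) { return (*data_)[i]; }
  std::size_t size() const { return data_->size(); }
  std::vector<float>* data_;
};

template <typename Lhs, typename Rhs>
struct Evaluator<SumOp<Lhs, Rhs> > {
  explicit Evaluator(const SumOp<Lhs, Rhs>& e) : lhs_(e.lhs), rhs_(e.rhs) {}
  // Recursively prepare the subexpressions before any coefficient is read.
  void prepare() { lhs_.prepare(); rhs_.prepare(); }
  float coeff(std::size_t i) const { return lhs_.coeff(i) + rhs_.coeff(i); }
  std::size_t size() const { return lhs_.size(); }
  Evaluator<Lhs> lhs_;
  Evaluator<Rhs> rhs_;
};

template <typename Rhs>
struct Evaluator<AssignOp<Rhs> > {
  explicit Evaluator(const AssignOp<Rhs>& e) : dst_(e.dst), src_(e.src) {}
  void prepare() { dst_.prepare(); src_.prepare(); }
  void evalCoeff(std::size_t i) { dst_.coeffRef(i) = src_.coeff(i); }
  std::size_t size() const { return dst_.size(); }
  Evaluator<Leaf> dst_;
  Evaluator<Rhs> src_;
};

// The executor owns the scheduling; this one is a plain sequential loop.
struct SequentialExecutor {
  template <typename Expr>
  static void run(const Expr& expr) {
    Evaluator<Expr> eval(expr);
    eval.prepare();
    for (std::size_t i = 0; i < eval.size(); ++i) eval.evalCoeff(i);
  }
};
```

Because the loop lives only in the executor, a different scheduling strategy can be substituted without touching the node or evaluator types.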
Added the ability to parallelize the evaluation of a tensor expression over multiple CPU cores.
Added the ability to offload the evaluation of a tensor expression to a GPU.
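As a rough illustration of the multi-core case only, the sketch below splits the coefficient range into contiguous blocks and evaluates each block on its own thread. It assumes plain `std::thread` workers rather than whatever thread pool or device abstraction the commit actually uses, and `parallelFor` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: calls evalCoeff(i) for every i in [0, size),
// with the range split into contiguous blocks, one block per worker thread.
inline void parallelFor(std::size_t size, unsigned numThreads,
                        const std::function<void(std::size_t)>& evalCoeff) {
  if (numThreads == 0) numThreads = 1;
  // Block size rounded up so that all coefficients are covered.
  const std::size_t block = (size + numThreads - 1) / numThreads;
  std::vector<std::thread> workers;
  for (unsigned t = 0; t < numThreads; ++t) {
    const std::size_t first = std::min(size, t * block);
    const std::size_t last = std::min(size, first + block);
    if (first == last) break;  // no work left for the remaining threads
    workers.emplace_back([first, last, &evalCoeff] {
      for (std::size_t i = first; i < last; ++i) evalCoeff(i);
    });
  }
  for (std::thread& w : workers) w.join();
}
```

Each coefficient is written by exactly one thread, so the workers need no synchronization beyond the final join.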
* Added the ability to map a region of memory to a tensor
* Added basic support for unary and binary coefficient-wise expressions, such as addition or square root
* Provided an emulation layer to make it possible to compile the code with compilers (such as nvcc) that don't support C++11.
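To make the first two items concrete, here is a minimal standalone sketch of a non-owning rank-3 view over existing memory, together with one unary and one binary coefficient-wise operation. The names `TensorView3`, `cwiseSqrt` and `cwiseAdd` are hypothetical, and the column-major layout is an assumption made for the example; none of this is the API introduced by the commit.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical non-owning view: interprets an existing buffer as a
// rank-3 tensor with dimensions d0 x d1 x d2 (column-major, assumed).
struct TensorView3 {
  float* data;
  std::size_t d0, d1, d2;

  std::size_t size() const { return d0 * d1 * d2; }

  // Reads and writes go straight to the mapped memory.
  float& operator()(std::size_t i, std::size_t j, std::size_t k) const {
    return data[i + d0 * (j + d1 * k)];
  }
};

// Unary coefficient-wise operation: dst[i] = sqrt(src[i]).
inline void cwiseSqrt(const TensorView3& src, const TensorView3& dst) {
  for (std::size_t i = 0; i < src.size(); ++i) dst.data[i] = std::sqrt(src.data[i]);
}

// Binary coefficient-wise operation: dst[i] = a[i] + b[i].
inline void cwiseAdd(const TensorView3& a, const TensorView3& b,
                     const TensorView3& dst) {
  for (std::size_t i = 0; i < a.size(); ++i) dst.data[i] = a.data[i] + b.data[i];
}
```

The view never allocates or frees memory, so the caller remains responsible for keeping the underlying buffer alive.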
This commit adds an initial implementation of a class template Tensor
that allows for the storage of objects with more than two indices.
Currently, only storing data and setting the object to zero for POD
data types are implemented.
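
A minimal sketch of what such a class template could look like follows. It is an illustration only: the name `SimpleTensor`, the `std::array` index interface and the column-major layout are assumptions, not the interface of the Tensor class added by this commit. The POD restriction is approximated with a `static_assert` on trivial, standard-layout types.

```cpp
#include <array>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a rank-N tensor: stores Scalar objects that are
// addressed by Rank indices (column-major layout assumed).
template <typename Scalar, std::size_t Rank>
class SimpleTensor {
 public:
  typedef std::array<std::size_t, Rank> Index;

  explicit SimpleTensor(const Index& dims) : dims_(dims), data_(count(dims)) {}

  std::size_t size() const { return data_.size(); }

  Scalar& operator()(const Index& idx) { return data_[offset(idx)]; }
  const Scalar& operator()(const Index& idx) const { return data_[offset(idx)]; }

  // Zeroing by memset is only offered for POD-like scalar types.
  void setZero() {
    static_assert(std::is_trivial<Scalar>::value &&
                      std::is_standard_layout<Scalar>::value,
                  "setZero requires a POD scalar type");
    if (!data_.empty())
      std::memset(data_.data(), 0, data_.size() * sizeof(Scalar));
  }

 private:
  // Total number of stored coefficients: the product of all dimensions.
  static std::size_t count(const Index& dims) {
    std::size_t n = 1;
    for (std::size_t r = 0; r < Rank; ++r) n *= dims[r];
    return n;
  }

  // Linear offset of a multi-index, with the first index varying fastest.
  std::size_t offset(const Index& idx) const {
    std::size_t off = 0;
    for (std::size_t r = Rank; r-- > 0;) off = off * dims_[r] + idx[r];
    return off;
  }

  Index dims_;
  std::vector<Scalar> data_;
};
```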