packet access, it is not certain that it will bring a performance
improvement: benchmarking needed.
* improve logic choosing slice vectorization.
* fix typo in SSE packet math, causing crash in unaligned case.
* fix bug in Product, causing crash in unaligned case.
* add TEST_SSE3 CMake option.