parallel_for_each currently requires each thread to process at least 10 elements. However, when indexing, it's fine for a thread to handle just a single CU. This patch parameterizes this, and updates the one user.