syclFlow Algorithms » Parallel Reduction

tf::syclFlow provides two template methods, tf::syclFlow::reduce and tf::syclFlow::uninitialized_reduce, for creating tasks to perform parallel reductions over a range of items.

Reduce Items with an Initial Value

The reduction task created by tf::syclFlow::reduce(I first, I last, T* result, C&& bop) performs parallel reduction over a range of elements specified by [first, last) using the binary operator bop and stores the reduced result in result. It represents the parallel execution of the following reduction loop on a SYCL device:

while (first != last) {
  *result = op(*result, *first++);
}

The variable result participates in the reduction loop and must be initialized with an initial value. The following code performs a parallel reduction to sum all the numbers in the given range with an initial value 1000:

const size_t N = 1000000;

int* soln = sycl::malloc_shared<int>(1);  // solution
int* data = sycl::malloc_shared<int>(N);  // data

std::for_each(data, data+N, [](int& v){ d = 1; });
*soln = 1000;

// create a syclflow to perform parallel reduction on a SYCL device
sycl::queue queue;
tf::syclFlow syclflow(queue);
syclflow.reduce(data, data+N, soln, [] (int a, int b) { return a + b; });
syclflow.offload();

assert(sol == N + 1000);

Reduce Items without an Initial Value

You can use tf::syclFlow::uninitialized_reduce to perform parallel reduction without any initial value. This method represents a parallel execution of the following reduction loop on a SYCL device that does not assum any initial value to reduce.

*result = *first++;  // no initial values participate in the reduction loop
while (first != last) {
  *result = op(*result, *first++);
}

The variable result is overwritten with the reduced value and no initial values participate in the reduction loop. The following code performs a parallel reduction to sum all the numbers in the given range without any initial value:

const size_t N = 1000000;

int* soln = sycl::malloc_shared<int>(1);  // solution
int* data = sycl::malloc_shared<int>(N);  // data 

std::for_each(data, data+N, [](int& v){ d = 1; });
*soln = 1000;                             // no effect

// create a syclflow to perform parallel reduction on a SYCL device
sycl::queue queue;
tf::syclFlow syclflow(queue);
syclflow.uninitialized_reduce(
  data, data+N, soln, [] (int a, int b) { return a + b; }
);
syclflow.offload();

assert(sol == N);