syclFlow Algorithms » Parallel Iterations

tf::syclFlow provides two template methods, tf::syclFlow::for_each and tf::syclFlow::for_each_index, for creating tasks to perform parallel iterations over a range of items.

Index-based Parallel Iterations

Index-based parallel-for performs parallel iterations over a range [first, last) with the given step size. The indices must be of an integral type. The task created by tf::syclFlow::for_each_index(I first, I last, I step, C&& callable) represents a kernel that executes the following loop in parallel:

// positive step: first, first+step, first+2*step, ...
for(auto i=first; i<last; i+=step) {
  callable(i);
}
// negative step: first, first+step, first+2*step, ... (step < 0, so i decreases)
for(auto i=first; i>last; i+=step) {
  callable(i);
}

Each iteration i is independent of the others and is assigned one kernel thread to run the callable. The following example creates a kernel that sets each element of gpu_data to 1 over the range [0, 100) with step size 1.

taskflow.emplace_on([&](tf::syclFlow& sf){
  // ... create other gpu tasks
  // sets each element of gpu_data to 1 over the range [0, 100) with step size 1
  sf.for_each_index(0, 100, 1, [gpu_data] (int idx) {
    gpu_data[idx] = 1;
  });
}, sycl_queue);
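The loop semantics above can be pictured with a small host-only model. The sketch below is not Taskflow's implementation and does not use SYCL; num_iterations and for_each_index_model are hypothetical helpers introduced here for illustration. It assumes a nonzero step and ignores integer overflow. It computes how many iterations the range [first, last) contains and maps a logical thread number k to the index first + k*step, which is one plausible way for a kernel to assign one thread per iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical helper (not part of Taskflow): number of iterations of
//   for (i = first; i < last; i += step)   when step > 0
//   for (i = first; i > last; i += step)   when step < 0
// Assumes step != 0 and that the arithmetic does not overflow.
template <typename I>
std::size_t num_iterations(I first, I last, I step) {
  static_assert(std::is_integral<I>::value, "indices must be integral");
  assert(step != 0);
  if (step > 0) {
    // Round up so a partial final stride still counts as one iteration.
    return first < last
        ? static_cast<std::size_t>((last - first + step - 1) / step)
        : 0;
  }
  return first > last
      ? static_cast<std::size_t>((first - last - step - 1) / (-step))
      : 0;
}

// Hypothetical host-side model: logical thread k handles index first + k*step.
// On a device the iterations would run concurrently; here they run in order.
template <typename I, typename C>
void for_each_index_model(I first, I last, I step, C&& callable) {
  const std::size_t n = num_iterations(first, last, step);
  for (std::size_t k = 0; k < n; ++k) {
    callable(first + static_cast<I>(k) * step);
  }
}
```

For instance, calling for_each_index_model(10, 0, -3, f) invokes f on 10, 7, 4, and 1. Because every index is derived from k alone, no iteration depends on another, which is the property that lets each one run on its own thread.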

Iterator-based Parallel Iterations

Iterator-based parallel-for performs parallel iterations over a range specified by two STL-style iterators, first and last. The task created by tf::syclFlow::for_each(I first, I last, C&& callable) represents a parallel execution of the following loop:

for(auto i=first; i<last; i++) {
  callable(*i);
}

The two iterators, first and last, are typically raw pointers into GPU memory: first points to the first element of the range and last points one past the last element. The following example creates a for_each kernel that sets each element of gpu_data to 1 over the range [gpu_data, gpu_data + 1000).

taskflow.emplace_on([&](tf::syclFlow& sf){
  // ... create gpu tasks
  // sets each element to 1 over the range [gpu_data, gpu_data + 1000)
  sf.for_each(gpu_data, gpu_data + 1000, [] (int& item) {
    item = 1;
  });
}, sycl_queue);

Each iteration is independent of the others and is assigned one kernel thread to run the callable.