Parallel Reduction
Contents
cudaFlow provides template methods to create parallel reduction tasks on a CUDA GPU.
Include the Header
You need to include the header file, taskflow/cuda/algorithm/reduce.hpp
, for creating a parallel-reduction task.
Reduce Items with an Initial Value
The reduction task created by tf::cudaFlow::reduce(I first, I last, T* result, C&& bop) performs parallel reduction over a range of elements specified by [first, last)
using the binary operator bop
and stores the reduced result in result
. It represents the parallel execution of the following reduction loop on a GPU:
while (first != last) { *result = bop(*result, *first++); }
The variable result
participates in the reduction loop and must be initialized with an initial value. The following code performs a parallel reduction to sum all the numbers in the given range with an initial value 1000
:
const size_t N = 1000000; std::vector<int> cpu(N, -1); int sol; // solution on CPU int* res = tf::cuda_malloc_device<int>(1); // solution on GPU int* gpu = tf::cuda_malloc_device<int>(N); // data on GPU tf::cudaFlow cf; tf::cudaTask d2h = cf.copy(&sol, res, 1); tf::cudaTask h2d = cf.copy(gpu, cpu.data(), N); tf::cudaTask set = cf.single_task([res] __device__ () mutable { *res = 1000; }); tf::cudaTask kernel = cf.reduce( gpu, gpu+N, res, [] __device__ (int a, int b) { return a + b; } ); kernel.succeed(h2d, set) .precede(d2h); cf.offload(); assert(sol == N + 1000);
Reduce Items without an Initial Value
You can use tf::
*result = *first++; // no initial values to participate in the reduction loop while (first != last) { *result = bop(*result, *first++); }
The variable result
is directly assigned the reduced value without any initial value to participate in the reduction loop. The following code performs a parallel reduction to sum all the numbers in the given range without any initial value:
const size_t N = 1000000; std::vector<int> cpu(N, -1); int sol; // solution on CPU int* res = tf::cuda_malloc_device<int>(1); // solution on GPU int* gpu = tf::cuda_malloc_device<int>(N); // data on GPU tf::cudaFlow cf; tf::cudaTask d2h = cf.copy(&sol, res, 1); tf::cudaTask h2d = cf.copy(gpu, cpu.data(), N); tf::cudaTask kernel = cf.uninitialized_reduce( gpu, gpu+N, res, [] __device__ (int a, int b) { return a + b; } ); kernel.succeed(h2d) .precede(d2h); cf.offload(); assert(sol == N);
Reduce a Range of Transformed Items with an Initial Value
tf::[first, last)
using a binary reduce operator bop
and a unary transform operator uop
. It represents the parallel execution of the following reduction loop on a GPU:
while (first != last) { *result = bop(*result, uop(*first++)); }
The variable result
participates in the reduction loop and must be initialized with an initial value. The following code performs a parallel reduction to sum all the transformed numbers multiplied by 10
in the given range with an initial value 1000
:
const size_t N = 1000000; int* res = tf::cuda_malloc_shared<int>(1); // result int* vec = tf::cuda_malloc_shared<int>(N); // vector // initializes the data *res = 1000; std::iota(vec, vec + N, 0); tf::cudaFlow cf; cf.transform_reduce( vec, vec + N, res, [] __device__ (int a, int b) { return a + b; }, [] __device__ (int a) { return a*10; } ); cf.offload(); // *res = 1000 + (0*10 + 1*10 + 2*10 + 3*10 + 4*10 + ... + (N-1)*10)
Reduce a Range of Transformed Items without an Initial Value
tf::
*result = *first++; // no initial values to participate in the reduction loop while (first != last) { *result = bop(*result, uop(*first++)); }
The variable result
is directly assigned the reduced value without any initial value participating in the reduction loop. The following code performs a parallel reduction to sum all the transformed numbers multiplied by 10
in the given range without any initial value:
const size_t N = 1000000; int* res = tf::cuda_malloc_shared<int>(1); // result int* vec = tf::cuda_malloc_shared<int>(N); // vector // initializes the data *res = 1000; std::iota(vec, vec + N, 0); tf::cudaFlow cf; cf.transform_reduce( vec, vec + N, res, [] __device__ (int a, int b) { return a + b; }, [] __device__ (int a) { return a*10; } ); cf.offload(); // *res = 0*10 + 1*10 + 2*10 + 3*10 + 4*10 + ... + (N-1)*10
Miscellaneous Items
Parallel-reduction algorithms are also available in tf::