Parallel Reduction
Taskflow provides standard template methods for reducing a range of items on a CUDA GPU.
Include the Header
You need to include the header file, taskflow/cuda/algorithm/reduce.hpp, for using the parallel-reduction algorithm.
Reduce a Range of Items with an Initial Value
tf::cuda_reduce performs a parallel reduction over a range of items, [first, last), using the binary operator bop and stores the reduced result in result. It represents the parallel execution of the following reduction loop on a GPU:
while (first != last) { *result = bop(*result, *first++); }
The variable result participates in the reduction loop and must be initialized with an initial value. The following code performs a parallel reduction to sum all the numbers in the given range with an initial value of 1000:
const size_t N = 1000000;

int* res = tf::cuda_malloc_shared<int>(1);  // result
int* vec = tf::cuda_malloc_shared<int>(N);  // vector

// initializes the data
*res = 1000;
for(size_t i=0; i<N; i++) {
  vec[i] = i;
}

// queries the required buffer size to reduce N elements using the given policy
tf::cudaDefaultExecutionPolicy policy;
auto bytes  = tf::cuda_reduce_buffer_size<tf::cudaDefaultExecutionPolicy, int>(N);
auto buffer = tf::cuda_malloc_device<std::byte>(bytes);

// *res = 1000 + (0 + 1 + 2 + 3 + 4 + ... + N-1)
tf::cuda_reduce(policy,
  vec, vec + N, res, [] __device__ (int a, int b) { return a + b; }, buffer
);

// synchronize the execution
policy.synchronize();

// delete the memory
tf::cuda_free(buffer);
tf::cuda_free(res);
tf::cuda_free(vec);
The reduce algorithm runs asynchronously through the stream specified in the execution policy. You need to synchronize the stream to obtain correct results. Since the GPU reduction algorithm may require an extra buffer to store temporary results, you must provide a buffer whose size is at least the number of bytes returned from tf::cuda_reduce_buffer_size.
Reduce a Range of Items without an Initial Value
tf::cuda_uninitialized_reduce performs a parallel reduction over a range of items, [first, last), using the binary operator bop, without any initial value. It represents the parallel execution of the following reduction loop on a GPU:
*result = *first++;  // no initial values to participate in the reduction loop
while (first != last) { *result = bop(*result, *first++); }
The variable result is directly assigned the reduced value without any initial value participating in the reduction loop. The following code performs a parallel reduction to sum all the numbers in the given range without any initial value:
const size_t N = 1000000;

int* res = tf::cuda_malloc_shared<int>(1);  // result
int* vec = tf::cuda_malloc_shared<int>(N);  // vector

// initializes the data
for(size_t i=0; i<N; i++) {
  vec[i] = i;
}

// queries the required buffer size to reduce N elements using the given policy
tf::cudaDefaultExecutionPolicy policy;
auto bytes  = tf::cuda_reduce_buffer_size<tf::cudaDefaultExecutionPolicy, int>(N);
auto buffer = tf::cuda_malloc_device<std::byte>(bytes);

// *res = 0 + 1 + 2 + 3 + 4 + ... + N-1
tf::cuda_uninitialized_reduce(policy,
  vec, vec + N, res, [] __device__ (int a, int b) { return a + b; }, buffer
);

// synchronize the execution
policy.synchronize();

// delete the memory
tf::cuda_free(buffer);
tf::cuda_free(res);
tf::cuda_free(vec);
Reduce a Range of Transformed Items with an Initial Value
tf::cuda_transform_reduce performs a parallel reduction over a range of transformed items, [first, last), using a binary reduce operator bop and a unary transform operator uop, and stores the reduced result in result. It represents the parallel execution of the following reduction loop on a GPU:
while (first != last) { *result = bop(*result, uop(*first++)); }
The variable result participates in the reduction loop and must be initialized with an initial value. The following code performs a parallel reduction to sum a range of numbers, each multiplied by 10, with an initial value of 1000:
const size_t N = 1000000;

int* res = tf::cuda_malloc_shared<int>(1);  // result
int* vec = tf::cuda_malloc_shared<int>(N);  // vector

// initializes the data
*res = 1000;
for(size_t i=0; i<N; i++) {
  vec[i] = i;
}

// queries the required buffer size to reduce N elements using the given policy
tf::cudaDefaultExecutionPolicy policy;
auto bytes  = tf::cuda_reduce_buffer_size<tf::cudaDefaultExecutionPolicy, int>(N);
auto buffer = tf::cuda_malloc_device<std::byte>(bytes);

// *res = 1000 + (0*10 + 1*10 + 2*10 + 3*10 + 4*10 + ... + (N-1)*10)
tf::cuda_transform_reduce(policy,
  vec, vec + N, res,
  [] __device__ (int a, int b) { return a + b; },
  [] __device__ (int a) { return a*10; },
  buffer
);

// synchronize the execution
policy.synchronize();

// delete the memory
tf::cuda_free(buffer);
tf::cuda_free(res);
tf::cuda_free(vec);
Reduce a Range of Transformed Items without an Initial Value
tf::cuda_transform_uninitialized_reduce performs a parallel reduction over a range of transformed items, [first, last), using a binary reduce operator bop and a unary transform operator uop, without any initial value. It represents the parallel execution of the following reduction loop on a GPU:
*result = uop(*first++);  // no initial values to participate in the reduction loop
while (first != last) { *result = bop(*result, uop(*first++)); }
The variable result is directly assigned the reduced value without any initial value participating in the reduction loop. The following code performs a parallel reduction to sum a range of numbers, each multiplied by 10, without any initial value:
const size_t N = 1000000;

int* res = tf::cuda_malloc_shared<int>(1);  // result
int* vec = tf::cuda_malloc_shared<int>(N);  // vector

// initializes the data
for(size_t i=0; i<N; i++) {
  vec[i] = i;
}

// queries the required buffer size to reduce N elements using the given policy
tf::cudaDefaultExecutionPolicy policy;
auto bytes  = tf::cuda_reduce_buffer_size<tf::cudaDefaultExecutionPolicy, int>(N);
auto buffer = tf::cuda_malloc_device<std::byte>(bytes);

// *res = 0*10 + 1*10 + 2*10 + 3*10 + 4*10 + ... + (N-1)*10
tf::cuda_transform_uninitialized_reduce(policy,
  vec, vec + N, res,
  [] __device__ (int a, int b) { return a + b; },
  [] __device__ (int a) { return a*10; },
  buffer
);

// synchronize the execution
policy.synchronize();

// delete the data
tf::cuda_free(buffer);
tf::cuda_free(res);
tf::cuda_free(vec);