Parallel Scan
cudaFlow provides template methods to create parallel scan tasks on a CUDA GPU.
Include the Header
You need to include the header file, taskflow/cuda/algorithm/scan.hpp, for creating a parallel-scan task.
Scan a Range of Items
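tf::cudaFlow::inclusive_scan computes an inclusive prefix sum, using the given binary operator, over the range of elements [first, last). The term "inclusive" means that the i-th input element is included in the i-th sum. The following code computes the inclusive prefix sum over an input array and stores the result in an output array.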
const size_t N = 1000000;
int* input  = tf::cuda_malloc_shared<int>(N);  // input vector
int* output = tf::cuda_malloc_shared<int>(N);  // output vector

// initializes the data with small values so the running sums fit in an int
for(size_t i=0; i<N; i++) {
  input[i] = rand() % 100;
}

// creates a cudaFlow of one task to perform inclusive scan
tf::cudaFlow cf;
tf::cudaTask task = cf.inclusive_scan(
  input, input + N, output, [] __device__ (int a, int b) { return a + b; }
);
cf.offload();

// verifies the result
for(size_t i=1; i<N; i++) {
  assert(output[i] == output[i-1] + input[i]);
}
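To make the "inclusive" definition concrete, here is a minimal host-side sketch that computes the same kind of prefix sum on the CPU with the C++17 standard library. It is not the cudaFlow API and not how the GPU task is implemented; `inclusive_scan_reference` is a hypothetical helper name introduced only for illustration.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical CPU reference (not part of Taskflow):
// out[i] = in[0] + in[1] + ... + in[i], so in[i] IS part of out[i].
std::vector<int> inclusive_scan_reference(const std::vector<int>& in) {
  std::vector<int> out(in.size());
  std::inclusive_scan(in.begin(), in.end(), out.begin(), std::plus<int>{});
  // The first inclusive output is just the first input.
  assert(out.empty() || out.front() == in.front());
  return out;
}
```

For the input {1, 2, 3, 4} this helper returns {1, 3, 6, 10}. A reference like this can be compared element by element against the GPU output when the full range, including index 0, needs checking.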
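On the other hand, tf::cudaFlow::exclusive_scan computes an exclusive prefix sum. The term "exclusive" means that the i-th input element is not included in the i-th sum.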
// creates a cudaFlow of one task to perform exclusive scan
tf::cudaFlow cf;
tf::cudaTask task = cf.exclusive_scan(
  input, input + N, output, [] __device__ (int a, int b) { return a + b; }
);
cf.offload();

// verifies the result
for(size_t i=1; i<N; i++) {
  assert(output[i] == output[i-1] + input[i-1]);
}
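The exclusive variant can be sketched on the host in the same way. Note one assumption: `std::exclusive_scan` takes an explicit initial value, which this sketch sets to 0 by default, whereas the cudaFlow call above takes no initial value and its verification loop starts at i=1, so the value of output[0] is not specified here. `exclusive_scan_reference` is a hypothetical helper name.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical CPU reference (not part of Taskflow):
// out[i] = init + in[0] + ... + in[i-1], so in[i] is NOT part of out[i].
// The default init of 0 is an assumption made for this sketch.
std::vector<int> exclusive_scan_reference(const std::vector<int>& in, int init = 0) {
  std::vector<int> out(in.size());
  std::exclusive_scan(in.begin(), in.end(), out.begin(), init, std::plus<int>{});
  // The first exclusive output is the initial value alone.
  assert(out.empty() || out.front() == init);
  return out;
}
```

For the input {1, 2, 3, 4} this helper returns {0, 1, 3, 6}: every output is shifted one position relative to the inclusive result {1, 3, 6, 10}.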
Scan a Range of Transformed Items
tf::cudaFlow::transform_inclusive_scan transforms each item in the range [first, last) and computes an inclusive prefix sum over these transformed items. The following code multiplies each item by 10 and then computes the inclusive prefix sum over 1000000 transformed items.
const size_t N = 1000000;
int* input  = tf::cuda_malloc_shared<int>(N);  // input vector
int* output = tf::cuda_malloc_shared<int>(N);  // output vector

// initializes the data with small values so the running sums fit in an int
for(size_t i=0; i<N; i++) {
  input[i] = rand() % 100;
}

// creates a cudaFlow of one task to inclusively scan over transformed input
tf::cudaFlow cf;
tf::cudaTask task = cf.transform_inclusive_scan(
  input, input + N, output,
  [] __device__ (int a, int b) { return a + b; },  // binary scan operator
  [] __device__ (int a) { return a*10; }           // unary transform operator
);
cf.offload();

// verifies the result
for(size_t i=1; i<N; i++) {
  assert(output[i] == output[i-1] + input[i] * 10);
}
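As a small host-side illustration of the same semantics, the sketch below uses `std::transform_inclusive_scan` from the C++17 standard library, which takes its operators in the same order as the call above (binary scan operator first, unary transform second). It is a CPU reference under that assumption, not the cudaFlow implementation, and `transform_inclusive_scan_reference` is a hypothetical name.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical CPU reference (not part of Taskflow):
// out[i] = 10*in[0] + 10*in[1] + ... + 10*in[i].
std::vector<int> transform_inclusive_scan_reference(const std::vector<int>& in) {
  std::vector<int> out(in.size());
  std::transform_inclusive_scan(
    in.begin(), in.end(), out.begin(),
    std::plus<int>{},                 // binary scan operator
    [](int a) { return a * 10; }      // unary transform operator
  );
  // The first output is the transformed first input.
  assert(out.empty() || out.front() == in.front() * 10);
  return out;
}
```

For the input {1, 2, 3, 4} the transformed items are {10, 20, 30, 40}, and the helper returns {10, 30, 60, 100}.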
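Similarly, tf::cudaFlow::transform_exclusive_scan transforms each item in the range [first, last) and computes an exclusive prefix sum over these transformed items.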
const size_t N = 1000000;
int* input  = tf::cuda_malloc_shared<int>(N);  // input vector
int* output = tf::cuda_malloc_shared<int>(N);  // output vector

// initializes the data with small values so the running sums fit in an int
for(size_t i=0; i<N; i++) {
  input[i] = rand() % 100;
}

// creates a cudaFlow of one task to exclusively scan over transformed input
tf::cudaFlow cf;
tf::cudaTask task = cf.transform_exclusive_scan(
  input, input + N, output,
  [] __device__ (int a, int b) { return a + b; },  // binary scan operator
  [] __device__ (int a) { return a*10; }           // unary transform operator
);
cf.offload();

// verifies the result
for(size_t i=1; i<N; i++) {
  assert(output[i] == output[i-1] + input[i-1] * 10);
}
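The exclusive transform scan can be checked against a host-side reference as well. This sketch uses `std::transform_exclusive_scan`, which, unlike the cudaFlow call above, requires an explicit initial value. The value 0 used here is an assumption of the sketch, and `transform_exclusive_scan_reference` is a hypothetical name.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical CPU reference (not part of Taskflow):
// out[i] = init + 10*in[0] + ... + 10*in[i-1]; in[i] is left out of out[i].
// The default init of 0 is an assumption made for this sketch.
std::vector<int> transform_exclusive_scan_reference(const std::vector<int>& in,
                                                    int init = 0) {
  std::vector<int> out(in.size());
  std::transform_exclusive_scan(
    in.begin(), in.end(), out.begin(), init,
    std::plus<int>{},                 // binary scan operator
    [](int a) { return a * 10; }      // unary transform operator
  );
  // The first output is the initial value; the transform is not applied to it.
  assert(out.empty() || out.front() == init);
  return out;
}
```

For the input {1, 2, 3, 4} this helper returns {0, 10, 30, 60}, which satisfies the same recurrence as the verification loop above: each output equals the previous output plus ten times the previous input.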
Miscellaneous Items
Parallel scan algorithms are also available in tf::