Parallel Sort
cudaFlow provides template methods to create parallel sort tasks on a CUDA GPU.
Include the Header
You need to include the header file, taskflow/cuda/algorithm/sort.hpp, for creating a parallel-sort task.
Sort a Range of Items
tf::cudaFlow::sort performs an in-place parallel sort over the range [first, last) using the given comparator. The following code sorts one million random integers in increasing order on a GPU.
const size_t N = 1000000;
int* vec = tf::cuda_malloc_shared<int>(N);  // vector

// initializes the data
for(size_t i=0; i<N; vec[i++]=rand());

// create a cudaFlow of one task to perform parallel sort
tf::cudaFlow cf;
tf::cudaTask task = cf.sort(
  vec, vec + N, [] __device__ (int a, int b) { return a < b; }
);
cf.offload();

assert(std::is_sorted(vec, vec+N));
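You can pass a different comparator to tf::cudaFlow::sort to change the sort order. The following code sorts one million random integers in decreasing order on a GPU.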
const size_t N = 1000000;
int* vec = tf::cuda_malloc_shared<int>(N);  // vector

// initializes the data
for(size_t i=0; i<N; vec[i++]=rand());

// create a cudaFlow of one task to perform parallel sort
tf::cudaFlow cf;
tf::cudaTask task = cf.sort(
  vec, vec + N, [] __device__ (int a, int b) { return a > b; }
);
cf.offload();

// the host-side check must use the same ordering as the device comparator
assert(std::is_sorted(vec, vec+N, [](int a, int b){ return a > b; }));
Sort a Range of Key-Value Items
tf::cudaFlow::sort_by_key sorts a range of key-value items by key. If i and j are any two valid iterators in [k_first, k_last) such that i precedes j, and p and q are iterators in [v_first, v_first + (k_last - k_first)) corresponding to i and j respectively, then comp(*j, *i) evaluates to false. The following example sorts a range of items into ascending key order and swaps their corresponding values:
const size_t N = 4;
auto vec = tf::cuda_malloc_shared<int>(N);  // keys
auto idx = tf::cuda_malloc_shared<int>(N);  // values

// initializes the data
vec[0] = 1, vec[1] = 4, vec[2] = -5, vec[3] = 2;
idx[0] = 0, idx[1] = 1, idx[2] = 2, idx[3] = 3;

// sort keys (vec) and swap their corresponding values (idx)
tf::cudaFlow cf;
cf.sort_by_key(vec, vec+N, idx, [] __device__ (int a, int b) { return a < b; });
cf.offload();

// now vec = {-5, 1, 2, 4}
// now idx = { 2, 0, 3, 1}

// deletes the memory
tf::cuda_free(vec);
tf::cuda_free(idx);
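To make the key-value semantics concrete, the following is a minimal host-only sketch in standard C++. It does not use Taskflow or a GPU, and the helper name sort_by_key_reference is hypothetical. It only mirrors the observable result described above: keys end up ordered by comp and each value moves with its key.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-side helper (not part of Taskflow): reorders keys into
// the order given by comp and applies the same permutation to values.
template <typename K, typename V, typename C>
void sort_by_key_reference(std::vector<K>& keys, std::vector<V>& values, C comp) {
  assert(keys.size() == values.size());

  // perm[i] is the original position of the element that ends up in slot i
  std::vector<std::size_t> perm(keys.size());
  std::iota(perm.begin(), perm.end(), std::size_t{0});
  std::sort(perm.begin(), perm.end(), [&](std::size_t a, std::size_t b) {
    return comp(keys[a], keys[b]);
  });

  // gather keys and values through the permutation
  std::vector<K> sorted_keys(keys.size());
  std::vector<V> sorted_values(values.size());
  for (std::size_t i = 0; i < perm.size(); ++i) {
    sorted_keys[i] = keys[perm[i]];
    sorted_values[i] = values[perm[i]];
  }
  keys.swap(sorted_keys);
  values.swap(sorted_values);
}
```

Calling this helper with keys {1, 4, -5, 2}, values {0, 1, 2, 3}, and a less-than comparator leaves keys as {-5, 1, 2, 4} and values as {2, 0, 3, 1}, matching the comments in the GPU example.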
You can also capture the keys in the lambda and sort the values indirectly with plain tf::cudaFlow::sort. For example, the following code sorts idx indirectly using the captured keys in vec:
cf.sort(idx, idx+N, [vec] __device__ (int a, int b) { return vec[a] < vec[b]; });
The comparator here will frequently access the global memory of vec, resulting in high memory latency. Instead, you should use tf::cudaFlow::sort_by_key.
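The access pattern of the indirect approach can be seen in a host-only sketch. This is plain standard C++ with a hypothetical helper name, indirect_sort_reference, and it says nothing about how Taskflow implements either method. It only shows that every comparison dereferences the key array twice and that the key array itself is left unsorted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side helper (not part of Taskflow): sorts idx so that
// keys[idx[0]], keys[idx[1]], ... is ascending. keys is not modified.
inline void indirect_sort_reference(const std::vector<int>& keys,
                                    std::vector<int>& idx) {
  // every index must refer to a valid key
  for (int i : idx) {
    assert(i >= 0 && static_cast<std::size_t>(i) < keys.size());
  }
  std::sort(idx.begin(), idx.end(), [&keys](int a, int b) {
    return keys[a] < keys[b];  // two key lookups per comparison
  });
}
```

With keys {1, 4, -5, 2} and idx {0, 1, 2, 3}, the helper produces idx {2, 0, 3, 1} while keys stays {1, 4, -5, 2}. A key-value sort, by contrast, also leaves the keys in sorted order.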
Miscellaneous Items
Parallel sort algorithms are also available in tf::