cudaFlow Algorithms » Parallel Sort

cudaFlow provides template methods to create parallel sort tasks on a CUDA GPU.

Include the Header

Include the header file taskflow/cuda/algorithm/sort.hpp to create a parallel-sort task.

Sort a Range of Items

tf::cudaFlow::sort performs an in-place parallel sort over a range of elements specified by [first, last) using the given comparator. The following code sorts one million random integers in increasing order on a GPU.

const size_t N = 1000000;
int* vec = tf::cuda_malloc_shared<int>(N);  // vector

// initializes the data
for(size_t i=0; i<N; vec[i++]=rand());

// create a cudaFlow of one task to perform parallel sort
tf::cudaFlow cf;
tf::cudaTask task = cf.sort(
  vec, vec + N, [] __device__ (int a, int b) { return a < b; }
);
cf.offload();

assert(std::is_sorted(vec, vec+N));

You can pass a different comparator to tf::cudaFlow::sort to alter the sorting order. For example, the following code sorts one million random integers in decreasing order on a GPU.

const size_t N = 1000000;
int* vec = tf::cuda_malloc_shared<int>(N);  // vector

// initializes the data
for(size_t i=0; i<N; vec[i++]=rand());

// create a cudaFlow of one task to perform parallel sort
tf::cudaFlow cf;
tf::cudaTask task = cf.sort(
  vec, vec + N, [] __device__ (int a, int b) { return a > b; }
);
cf.offload();

assert(std::is_sorted(vec, vec+N, [](int a, int b){ return a > b; }));

Sort a Range of Key-Value Items

tf::cudaFlow::sort_by_key sorts a range of key-value items by key using the given comparator comp. After the sort, if i and j are any two valid iterators in [k_first, k_last) such that i precedes j, then comp(*j, *i) evaluates to false, and the iterators p and q in [v_first, v_first + (k_last - k_first)) that correspond to i and j point to the values originally paired with *i and *j. The following example sorts a range of items into ascending key order and permutes their corresponding values in the same way:

const size_t N = 4;
auto vec = tf::cuda_malloc_shared<int>(N);  // keys
auto idx = tf::cuda_malloc_shared<int>(N);  // values

// initializes the data
vec[0] = 1, vec[1] = 4, vec[2] = -5, vec[3] = 2;
idx[0] = 0, idx[1] = 1, idx[2] = 2,  idx[3] = 3;

// sort keys (vec) and swap their corresponding values (idx)
tf::cudaFlow cf;
cf.sort_by_key(vec, vec+N, idx, [] __device__ (int a, int b) { return a < b; });
cf.offload();

// now vec = {-5, 1, 2, 4}
// now idx = { 2, 0, 3, 1}

// deletes the memory
tf::cuda_free(vec);
tf::cuda_free(idx);
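
To make the key-value semantics above concrete, here is a minimal host-side sketch that mirrors what the example produces. It is not Taskflow code and says nothing about how the GPU algorithm is implemented; the helper names sort_by_key_reference and keys_are_ordered are hypothetical and use only the C++ standard library. It assumes comp is a strict ordering such as a < b.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-side helper (not part of Taskflow): sorts `keys` with
// `comp` and applies the same permutation to `values`.
template <typename K, typename V, typename C>
void sort_by_key_reference(std::vector<K>& keys, std::vector<V>& values, C comp) {
  assert(keys.size() == values.size());

  // perm[i] is the original position of the element that ends up in slot i
  std::vector<std::size_t> perm(keys.size());
  std::iota(perm.begin(), perm.end(), std::size_t{0});
  std::sort(perm.begin(), perm.end(), [&](std::size_t a, std::size_t b) {
    return comp(keys[a], keys[b]);
  });

  // gather keys and values through the same permutation
  std::vector<K> sorted_keys(keys.size());
  std::vector<V> sorted_values(values.size());
  for (std::size_t i = 0; i < perm.size(); ++i) {
    sorted_keys[i] = keys[perm[i]];
    sorted_values[i] = values[perm[i]];
  }
  keys.swap(sorted_keys);
  values.swap(sorted_values);
}

// Checks the postcondition on the keys by testing adjacent pairs, which is
// sufficient when `comp` is a strict weak ordering.
template <typename K, typename C>
bool keys_are_ordered(const std::vector<K>& keys, C comp) {
  for (std::size_t i = 0; i + 1 < keys.size(); ++i) {
    if (comp(keys[i + 1], keys[i])) return false;
  }
  return true;
}
```

Calling sort_by_key_reference with keys {1, 4, -5, 2}, values {0, 1, 2, 3}, and the comparator a < b leaves the keys as {-5, 1, 2, 4} and the values as {2, 0, 3, 1}, matching the GPU example.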

While you can capture the keys in the lambda and sort the values indirectly using plain tf::cudaFlow::sort, this organization results in frequent and costly accesses to global memory. For example, we can sort idx indirectly using the captured keys in vec:

cf.sort(idx, idx+N, [vec] __device__ (int a, int b) { return vec[a] < vec[b]; });

The comparator here reads vec from global memory on every comparison, resulting in high memory latency. The indirect sort also reorders only idx and leaves vec unsorted. Instead, use tf::cudaFlow::sort_by_key, which has been optimized for this purpose.
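
The access pattern of the indirect approach can be seen in a small host-side sketch. This is an illustration with the standard library only, not Taskflow code; the helper name sort_indices_by_key is hypothetical. Each comparison performs two lookups into the key array, and the key array itself is never reordered.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side helper (not part of Taskflow): sorts the index array
// by the keys it refers to. `keys` is only read, never reordered.
inline void sort_indices_by_key(std::vector<int>& idx, const std::vector<int>& keys) {
  // every index must refer to a valid key
  assert(std::all_of(idx.begin(), idx.end(), [&keys](int i) {
    return i >= 0 && static_cast<std::size_t>(i) < keys.size();
  }));

  std::sort(idx.begin(), idx.end(), [&keys](int a, int b) {
    return keys[a] < keys[b];  // two key lookups per comparison
  });
}
```

With keys {1, 4, -5, 2} and idx {0, 1, 2, 3}, the call leaves idx as {2, 0, 3, 1} while keys stays {1, 4, -5, 2}. On a GPU, those per-comparison lookups would go to global memory, which is the cost described above.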

Miscellaneous Items

Parallel sort algorithms are also available in tf::cudaFlowCapturer with the same API.