Parallel Sort
Taskflow provides standalone template methods for sorting ranges of items on a CUDA GPU.
Include the Header
You need to include the header file, taskflow/cuda/algorithm/sort.hpp, for using the parallel-sort algorithm.
Sort a Range of Items
tf::cuda_sort performs an asynchronous parallel sort over a range of items, [first, last), using the given comparator. The following code sorts one million random integers in increasing order on a GPU:
```cpp
const size_t N = 1000000;

int* vec = tf::cuda_malloc_shared<int>(N);  // vector

// initializes the data
for(size_t i=0; i<N; i++) {
  vec[i] = rand();
}

// queries the required buffer size to sort a vector
tf::cudaDefaultExecutionPolicy policy;
auto bytes  = tf::cuda_sort_buffer_size<tf::cudaDefaultExecutionPolicy, int>(N);
auto buffer = tf::cuda_malloc_device<std::byte>(bytes);

// sorts the vector
tf::cuda_sort(
  policy, vec, vec+N, [] __device__ (int a, int b) { return a < b; }, buffer
);

// synchronizes the execution and verifies the result
policy.synchronize();
assert(std::is_sorted(vec, vec+N));

// deletes the memory
tf::cuda_free(buffer);
tf::cuda_free(vec);
```
The sort algorithm runs asynchronously through the stream specified in the execution policy. You need to synchronize the stream to obtain correct results. Since the GPU sort algorithm may require an extra buffer to store the temporary results, you need to provide a buffer of size at least the bytes returned from tf::cuda_sort_buffer_size.
Sort a Range of Key-Value Items
tf::cuda_sort_by_key sorts a range of key-value items into ascending key order. If i and j are any two valid iterators in [k_first, k_last) such that i precedes j, and p and q are iterators in [v_first, v_first + (k_last - k_first)) corresponding to i and j respectively, then comp(*j, *i) is false. The following example sorts a range of items into ascending key order and swaps their corresponding values:
```cpp
const size_t N = 4;

auto vec = tf::cuda_malloc_shared<int>(N);  // keys
auto idx = tf::cuda_malloc_shared<int>(N);  // values

// initializes the data
vec[0] = 1, vec[1] = 4, vec[2] = -5, vec[3] = 2;
idx[0] = 0, idx[1] = 1, idx[2] = 2, idx[3] = 3;

// queries the required buffer size to sort a key-value range
tf::cudaDefaultExecutionPolicy policy;
auto bytes  = tf::cuda_sort_buffer_size<decltype(policy), int, int>(N);
auto buffer = tf::cuda_malloc_device<std::byte>(bytes);

// sorts keys (vec) and swaps their corresponding values (idx)
tf::cuda_sort_by_key(
  policy, vec, vec+N, idx, [] __device__ (int a, int b) { return a<b; }, buffer
);

// synchronizes the execution and verifies the result
policy.synchronize();

// now vec = {-5, 1, 2, 4}
// now idx = { 2, 0, 3, 1}

// deletes the memory
tf::cuda_free(buffer);
tf::cuda_free(vec);
tf::cuda_free(idx);
```
The buffer size required by tf::cuda_sort_by_key is the same as that of tf::cuda_sort. While you can sort the values in idx indirectly using a lambda that captures the keys in vec:
```cpp
tf::cuda_sort(policy,
  idx, idx+N, [vec] __device__ (int a, int b) { return vec[a] < vec[b]; },
  buffer
);
```
the comparator here will frequently access the global memory of vec, resulting in high memory latency. Instead, you should use tf::cuda_sort_by_key, which directly sorts keys along with their values.