GPU Tasking (cudaFlow)
Modern scientific computing typically leverages GPU-powered parallel processing cores to speed up large-scale applications. This chapter discusses how to implement CPU-GPU heterogeneous tasking algorithms with Nvidia CUDA.
Include the Header
You need to include the header file, taskflow/cuda/cudaflow.hpp, to create a tf::cudaFlow task.
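For example, a source file that creates cudaFlow tasks needs only this include, as the saxpy example below also shows:

#include <taskflow/cuda/cudaflow.hpp>  // tf::cudaFlow, tf::cudaTask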
Create a cudaFlow
Taskflow leverages CUDA Graph to enable concurrent CPU-GPU tasking using a task graph model, tf::cudaFlow. A cudaFlow is a task in a taskflow that encapsulates a graph of GPU operations and their dependencies. The following example (saxpy.cu) implements the canonical saxpy (single-precision A·X Plus Y) operation using a cudaFlow.
 1: #include <taskflow/cuda/cudaflow.hpp>
 2:
 3: // saxpy (single-precision A·X Plus Y) kernel
 4: __global__ void saxpy(int n, float a, float *x, float *y) {
 5:   int i = blockIdx.x*blockDim.x + threadIdx.x;
 6:   if (i < n) {
 7:     y[i] = a*x[i] + y[i];
 8:   }
 9: }
10:
11: // main function begins
12: int main() {
13:
14:   tf::Taskflow taskflow;
15:   tf::Executor executor;
16:
17:   const unsigned N = 1<<20;          // size of the vector
18:
19:   std::vector<float> hx(N, 1.0f);    // x vector at host
20:   std::vector<float> hy(N, 2.0f);    // y vector at host
21:
22:   float *dx{nullptr};                // x vector at device
23:   float *dy{nullptr};                // y vector at device
24:
25:   tf::Task allocate_x = taskflow.emplace(
26:     [&](){ cudaMalloc(&dx, N*sizeof(float)); }
27:   ).name("allocate_x");
28:
29:   tf::Task allocate_y = taskflow.emplace(
30:     [&](){ cudaMalloc(&dy, N*sizeof(float)); }
31:   ).name("allocate_y");
32:
33:   tf::Task cudaflow = taskflow.emplace([&](tf::cudaFlow& cf) {
34:     // create data transfer tasks
35:     tf::cudaTask h2d_x = cf.copy(dx, hx.data(), N).name("h2d_x");
36:     tf::cudaTask h2d_y = cf.copy(dy, hy.data(), N).name("h2d_y");
37:     tf::cudaTask d2h_x = cf.copy(hx.data(), dx, N).name("d2h_x");
38:     tf::cudaTask d2h_y = cf.copy(hy.data(), dy, N).name("d2h_y");
39:
40:     // launch saxpy<<<(N+255)/256, 256, 0>>>(N, 2.0f, dx, dy)
41:     tf::cudaTask kernel = cf.kernel(
42:       (N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy
43:     ).name("saxpy");
44:
45:     kernel.succeed(h2d_x, h2d_y)
46:           .precede(d2h_x, d2h_y);
47:
48:   }).name("saxpy");
49:   cudaflow.succeed(allocate_x, allocate_y);  // overlap memory alloc
50:
51:   executor.run(taskflow).wait();
52:
53:   taskflow.dump(std::cout);  // dump the taskflow
54: }
Debrief:
- Lines 3-9 define a saxpy kernel using CUDA
- Lines 19-20 declare two host vectors, hx and hy
- Lines 22-23 declare two device vector pointers, dx and dy
- Lines 25-31 declare two tasks to allocate memory for dx and dy on device, each of N*sizeof(float) bytes
- Lines 33-48 create a cudaFlow to define a GPU task graph that contains:
- two host-to-device data transfer tasks
- one saxpy kernel task
- two device-to-host data transfer tasks
- Lines 49-53 define the task dependency between host tasks and the cudaFlow tasks and execute the taskflow
tf::cudaFlow is a lightweight abstraction over CUDA Graph: you describe the GPU operations and their dependencies once, and Taskflow offloads the entire graph to the GPU in a single submission.
Compile a cudaFlow Program
Use nvcc to compile a cudaFlow program:
~$ nvcc -std=c++17 my_cudaflow.cu -I path/to/include/taskflow -O2 -o my_cudaflow
~$ ./my_cudaflow
Please visit the page Compile Taskflow with CUDA for more details.
Run a cudaFlow on Specific GPU
By default, a cudaFlow runs on the current CUDA GPU associated with the caller, which is typically GPU 0. Each CUDA GPU has an integer identifier in the range of [0, N), where N is the number of CUDA GPUs in the system. You can run a cudaFlow on a specific GPU through tf::Taskflow::emplace_on, passing the target GPU identifier as the second argument.
taskflow.emplace_on([] (tf::cudaFlow& cudaflow) {
  // here, cudaflow is under GPU 2
  // ...
}, 2);  // place the cudaFlow on GPU 2
An easy practice is to allocate unified shared memory using cudaMallocManaged
and let the CUDA runtime perform automatic memory migration between GPUs.
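As a minimal sketch of this practice (reusing the saxpy kernel and N from the example above; the buffer names ux and uy are hypothetical), a managed allocation can be passed directly to a cudaFlow placed on any GPU:

float *ux{nullptr}, *uy{nullptr};
cudaMallocManaged(&ux, N*sizeof(float));  // visible to the host and all GPUs
cudaMallocManaged(&uy, N*sizeof(float));

taskflow.emplace_on([&](tf::cudaFlow& cf) {
  // no explicit host-to-device copies are needed: the CUDA runtime
  // migrates the pages of ux and uy on demand when the kernel touches them
  cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, ux, uy).name("saxpy");
}, 2);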
Create Memory Operation Tasks
tf::cudaFlow provides methods to create memory operation tasks in two categories: raw data operations, prefixed with mem, such as memcpy and memset, that operate in bytes, and typed data operations, such as copy, fill, and zero, that take a logical count of elements. For instance, the following three methods have the same result of zeroing sizeof(int)*count bytes of the device memory area pointed to by target.
int* target;
cudaMalloc(&target, count*sizeof(int));

taskflow.emplace([&](tf::cudaFlow& cf){
  tf::cudaTask memset_target = cf.memset(target, 0, sizeof(int) * count);
  tf::cudaTask same_as_above = cf.fill(target, 0, count);
  tf::cudaTask same_as_above_again = cf.zero(target, count);
});
The method cudaFlow::fill is a more powerful variant of memset: it can fill a memory area with a value of any type T, given that sizeof(T) is 1, 2, or 4 bytes. For example, the following code sets each element in the array target to 1234.
taskflow.emplace([&](tf::cudaFlow& cf){
  cf.fill(target, 1234, count);
});
A similar distinction applies to cudaFlow::memcpy and cudaFlow::copy: memcpy transfers a number of bytes, while copy transfers a count of typed elements.
taskflow.emplace([&](tf::cudaFlow& cf){
  tf::cudaTask memcpy_target = cf.memcpy(target, source, sizeof(int) * count);
  tf::cudaTask same_as_above = cf.copy(target, source, count);
});
Study the Granularity
Creating a cudaFlow incurs overhead, which means fine-grained tasking, such as one GPU operation per cudaFlow, may not give you any performance gain. You should aggregate as many GPU operations as possible into a cudaFlow so the entire graph launches once, rather than as many separate graphs. For example, the following code creates a fine-grained saxpy task graph using one cudaFlow per GPU operation.
tf::Task h2d_x = taskflow.emplace([&](tf::cudaFlow& cf) {
  cf.copy(dx, hx.data(), N).name("h2d_x");
}).name("h2d_x");  // creates the 1st cudaFlow

tf::Task h2d_y = taskflow.emplace([&](tf::cudaFlow& cf) {
  cf.copy(dy, hy.data(), N).name("h2d_y");
}).name("h2d_y");  // creates the 2nd cudaFlow

tf::Task d2h_x = taskflow.emplace([&](tf::cudaFlow& cf) {
  cf.copy(hx.data(), dx, N).name("d2h_x");
}).name("d2h_x");  // creates the 3rd cudaFlow

tf::Task d2h_y = taskflow.emplace([&](tf::cudaFlow& cf) {
  cf.copy(hy.data(), dy, N).name("d2h_y");
}).name("d2h_y");  // creates the 4th cudaFlow

tf::Task kernel = taskflow.emplace([&](tf::cudaFlow& cf) {
  cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy).name("saxpy");
}).name("kernel");  // creates the 5th cudaFlow

kernel.succeed(h2d_x, h2d_y)
      .precede(d2h_x, d2h_y);
The following code aggregates the five GPU operations using one cudaFlow to achieve better performance.
tf::Task cudaflow = taskflow.emplace([&](tf::cudaFlow& cf) {
  tf::cudaTask h2d_x = cf.copy(dx, hx.data(), N).name("h2d_x");
  tf::cudaTask h2d_y = cf.copy(dy, hy.data(), N).name("h2d_y");
  tf::cudaTask d2h_x = cf.copy(hx.data(), dx, N).name("d2h_x");
  tf::cudaTask d2h_y = cf.copy(hy.data(), dy, N).name("d2h_y");
  // the task variable is named 'kernel' so it does not shadow the saxpy function
  tf::cudaTask kernel = cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy)
                          .name("saxpy");
  kernel.succeed(h2d_x, h2d_y)
        .precede(d2h_x, d2h_y);
}).name("saxpy");  // creates one cudaFlow
Offload a cudaFlow
By default, the executor offloads and executes the cudaFlow once, if the cudaFlow is never offloaded from within its callable. During the execution, the executor first materializes the cudaFlow by mapping it to a native CUDA graph, creates an executable graph from the native CUDA graph, and then submits the executable graph to the CUDA runtime. Similar to tf::Executor, tf::cudaFlow provides offload methods to run the GPU task graph: offload runs it once, offload_n runs it a given number of times, and offload_until runs it repeatedly until a predicate returns true.
taskflow.emplace([](tf::cudaFlow& cf) {
  // ... create CUDA tasks
  cf.offload();      // offload the cudaFlow and run it once
  cf.offload_n(10);  // offload the cudaFlow and run it 10 times
  cf.offload_until([repeat=5] () mutable {
    return repeat-- == 0;
  });                // offload the cudaFlow and run it five times
});
After you offload a cudaFlow, it is considered executed, and the executor will not run an offloaded cudaFlow after leaving the cudaFlow task callable. On the other hand, if a cudaFlow is not offloaded, the executor runs it once. For example, the following two versions represent the same execution logic.
// version 1: explicitly offload a cudaFlow once
taskflow.emplace([](tf::cudaFlow& cf) {
  cf.single_task([] __device__ (){});
  cf.offload();
});

// version 2 (same as version 1): executor offloads the cudaFlow once
taskflow.emplace([](tf::cudaFlow& cf) {
  cf.single_task([] __device__ (){});
});
Update a cudaFlow
Many GPU applications require you to launch a cudaFlow multiple times and update node parameters (e.g., kernel parameters and memory addresses) between iterations. tf::cudaFlow lets you update the parameters of a created task between offloads and rerun the graph, as long as the graph topology stays the same.
1: taskflow.emplace([&] (tf::cudaFlow& cf) {
2:   tf::cudaTask task = cf.kernel(grid1, block1, shm1, my_kernel, args1...);
3:   cf.offload();  // immediately run the cudaFlow once
4:
5:   cf.kernel(task, grid2, block2, shm2, my_kernel, args2...);
6:   cf.offload();  // run the cudaFlow again with the same graph topology
7:                  // but with different kernel parameters
8: });
Debrief:
- Line 2 creates a kernel task to run my_kernel with the given parameters
- Line 3 offloads the cudaFlow and performs an immediate execution
- Line 5 updates the parameters of my_kernel through its task
- Line 6 executes the cudaFlow again with updated kernel parameters
Between successive offloads (i.e., executions of a cudaFlow), you can update the task parameters, such as changing the kernel execution parameters and memory operation parameters. However, you must NOT change the topology of an offloaded cudaFlow. Each task-creation method in tf::cudaFlow has a corresponding overload that takes an existing task and updates its parameters in place, as sketched below.
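For instance, here is a sketch of the same update pattern applied to a copy task (assuming the dx and hx buffers from the earlier saxpy example, plus a second, hypothetical host buffer hx2):

taskflow.emplace([&](tf::cudaFlow& cf) {
  tf::cudaTask h2d = cf.copy(dx, hx.data(), N);  // copy hx to device
  cf.offload();                                  // run once with hx as the source

  cf.copy(h2d, dx, hx2.data(), N);               // rebind the source to hx2
  cf.offload();                                  // same topology, new source buffer
});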
Use cudaFlow in a Standalone Environment
You can use tf::cudaFlow in a standalone environment, without going through tf::Taskflow and tf::Executor, and offload it to a GPU directly.
tf::cudaFlow cf;  // create a standalone cudaFlow

tf::cudaTask h2d_x = cf.copy(dx, hx.data(), N).name("h2d_x");
tf::cudaTask h2d_y = cf.copy(dy, hy.data(), N).name("h2d_y");
tf::cudaTask d2h_x = cf.copy(hx.data(), dx, N).name("d2h_x");
tf::cudaTask d2h_y = cf.copy(hy.data(), dy, N).name("d2h_y");
// the task variable is named 'kernel' so it does not shadow the saxpy function
tf::cudaTask kernel = cf.kernel((N+255)/256, 256, 0, saxpy, N, 2.0f, dx, dy)
                        .name("saxpy");

kernel.succeed(h2d_x, h2d_y)   // kernel runs after host-to-device copy
      .precede(d2h_x, d2h_y);  // kernel runs before device-to-host copy

cf.offload();  // offload and run the standalone cudaFlow once
When using a cudaFlow in a standalone environment, you are responsible for choosing the GPU context it runs on. The following example creates a cudaFlow and executes it on GPU 0.
{
  tf::cudaScopedDevice gpu(0);  // switch to GPU 0 within this scope
  tf::cudaFlow cf;              // create a standalone cudaFlow on GPU 0
  cf.offload();                 // run the cudaFlow once on GPU 0
}