GPU Tasking (syclFlow)
Taskflow supports SYCL, a general-purpose heterogeneous programming model, to program heterogeneous tasks in a single-source C++ environment. This chapter discusses how to write SYCL C++ kernel code with Taskflow, based on the SYCL 2020 Specification.
Include the Header
You need to include the header file, taskflow/sycl/syclflow.hpp, for using tf::syclFlow.
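A minimal translation unit that uses syclFlow therefore begins with this single include (as the saxpy example below shows, it also makes the core Taskflow classes, such as tf::Executor and tf::Taskflow, available):

#include <taskflow/sycl/syclflow.hpp>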
Create a syclFlow
Taskflow introduces a task graph-based programming model, tf::syclFlow, to program SYCL tasks and their dependencies. The following example (saxpy.cpp) implements the canonical saxpy (A·X Plus Y) task graph using tf::syclFlow.
 1: #include <taskflow/sycl/syclflow.hpp>
 2:
 3: constexpr size_t N = 1000000;
 4:
 5: int main() {
 6:
 7:   tf::Executor executor;
 8:   tf::Taskflow taskflow("saxpy example");
 9:
10:   sycl::queue queue{sycl::gpu_selector{}};
11:
12:   // allocate shared memory that is accessible on both host and device
13:   float* X = sycl::malloc_shared<float>(N, queue);
14:   float* Y = sycl::malloc_shared<float>(N, queue);
15:
16:   // create a syclFlow to perform the saxpy operation
17:   taskflow.emplace_on([&](tf::syclFlow& sf){
18:     tf::syclTask fillX = sf.fill(X, 1.0f, N).name("fillX");
19:     tf::syclTask fillY = sf.fill(Y, 2.0f, N).name("fillY");
20:     tf::syclTask saxpy = sf.parallel_for(sycl::range<1>(N),
21:       [=] (sycl::id<1> id) {
22:         X[id] = 3.0f * X[id] + Y[id];
23:       }
24:     ).name("saxpy");
25:     saxpy.succeed(fillX, fillY);
26:   }, queue).name("syclFlow");
27:
28:   executor.run(taskflow).wait();  // run the taskflow
29:   taskflow.dump(std::cout);       // dump the taskflow
30:
31:   // free the shared memory to avoid memory leak
32:   sycl::free(X, queue);
33:   sycl::free(Y, queue);
34: }
Debrief:
- Lines 7-8 create an executor and a taskflow
- Line 10 creates a SYCL queue on a GPU device selected by sycl::gpu_selector (see the note after this list for a device-agnostic alternative)
- Lines 13-14 allocate shared memory that is accessible on both host and device
- Lines 17-26 create a syclFlow to define the saxpy task graph that contains:
  - one fill task to fill the memory area X with 1.0f
  - one fill task to fill the memory area Y with 2.0f
  - one kernel task to perform the saxpy operation on the GPU
- Lines 28-29 execute the taskflow and dump its graph to a DOT format
- Lines 32-33 deallocate the shared memory to avoid a memory leak
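If the machine has no GPU, constructing the queue with sycl::gpu_selector fails at runtime. A device-agnostic sketch (assuming any available SYCL device is acceptable) uses the default selector instead:

sycl::queue queue{sycl::default_selector{}};  // let the SYCL runtime pick a device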
Compile a syclFlow Program
Use DPC++ clang to compile a syclFlow program:
# the -fsycl-targets flag below selects the CUDA target
~$ clang++ -fsycl -fsycl-unnamed-lambda \
     -fsycl-targets=nvptx64-nvidia-cuda \
     -I path/to/taskflow -pthread -std=c++17 saxpy.cpp -o saxpy
~$ ./saxpy
Please visit the page Compile Taskflow with SYCL for more details.
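If you compile for devices reachable through DPC++'s default SPIR-V backend (e.g., Intel CPUs and GPUs), you can presumably drop the -fsycl-targets flag (a sketch, assuming the default target suits your device):

~$ clang++ -fsycl -fsycl-unnamed-lambda \
     -I path/to/taskflow -pthread -std=c++17 saxpy.cpp -o saxpy
~$ ./saxpy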
Create Memory Operation Tasks
tf::syclFlow provides a set of methods for creating tasks that perform common memory operations, such as copy and fill, on memory areas pointed to by unified shared memory (USM) pointers. The following example creates two copy tasks and one fill task that sets the first N/2 elements in the vector to -1.
sycl::queue queue;

size_t N = 1000;

int* hvec = new int[N];
std::fill_n(hvec, N, 100);  // initialize the host vector to 100
int* dvec = sycl::malloc_device<int>(N, queue);

// create a syclFlow task to set the first N/2 elements to -1
taskflow.emplace_on([&](tf::syclFlow& syclflow){
  tf::syclTask ch2d = syclflow.copy(dvec, hvec, N);
  tf::syclTask fill = syclflow.fill(dvec, -1, N/2);
  tf::syclTask cd2h = syclflow.copy(hvec, dvec, N);
  fill.precede(cd2h)
      .succeed(ch2d);
}, queue);

executor.run(taskflow).wait();

// inspect the result
for(size_t i=0; i<N; i++) {
  (i < N/2) ? assert(hvec[i] == -1) : assert(hvec[i] == 100);
}
Both tf::syclFlow::copy and tf::syclFlow::fill operate on typed data. You can use tf::syclFlow::memcpy and tf::syclFlow::memset to operate on untyped data (i.e., arrays of bytes).
taskflow.emplace_on([&](tf::syclFlow& syclflow){
  tf::syclTask ch2d = syclflow.memcpy(dvec, hvec, N*sizeof(int));
  tf::syclTask mset = syclflow.memset(dvec, -1, N/2*sizeof(int));
  tf::syclTask cd2h = syclflow.memcpy(hvec, dvec, N*sizeof(int));
  mset.precede(cd2h)
      .succeed(ch2d);
}, queue);
Create Kernel Tasks
SYCL allows a simple execution model in which a kernel is invoked over an N-dimensional index space defined by sycl::range<N>, where N is one, two, or three. Each work item in such a kernel executes independently across a set of partitioned work groups. tf::syclFlow::parallel_for defines several variants to create a kernel task. The following variant takes a sycl::range and a sycl::id to set each element in data to 1.0f, when it is not necessary to query the global range of the index space being executed across.
tf::syclTask task = syclflow.parallel_for(
  sycl::range<1>(N), [data](sycl::id<1> id){ data[id] = 1.0f; }
);
For the same example, the following variant exposes the low-level functionality of work items and work groups using sycl::nd_range and sycl::nd_item. This becomes valuable when an execution requires groups of work items that communicate and synchronize.
// partition the N-element range into N/M work groups, each of M work items
tf::syclTask task = syclflow.parallel_for(
  sycl::nd_range<1>{sycl::range<1>(N), sycl::range<1>(M)},
  [data](sycl::nd_item<1> item){
    auto id = item.get_global_linear_id();
    data[id] = 1.0f;
    // query detailed work group information
    // item.get_group_linear_id();
    // item.get_local_linear_id();
    // ...
  }
);
All the kernel methods defined in the SYCL queue are applicable to tf::syclFlow::parallel_for.
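Since the variants mirror the SYCL queue's parallel_for overloads, a two-dimensional kernel presumably works the same way (a sketch, assuming data points to an N-by-N row-major array in unified shared memory):

tf::syclTask task2d = syclflow.parallel_for(
  sycl::range<2>(N, N),
  [data, N](sycl::id<2> id){
    data[id[0]*N + id[1]] = 1.0f;  // linearize the 2-D index (row-major)
  }
);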
Create Command Group Function Object Tasks
SYCL provides a way to encapsulate a device-side operation and all its data and event dependencies in a single command group function object. The function object accepts an argument of a command group handler constructed by the SYCL runtime. The command group handler is the heart of SYCL programming, as it defines nearly all kernel-related methods, including submission, execution, and synchronization. You can directly create a SYCL task from a command group function object using tf::syclFlow::on.
tf::syclTask task = syclflow.on(
  [=] (sycl::handler& handler) {
    handler.require(accessor);   // accessor: a sycl::accessor defined elsewhere
    handler.single_task([=](){   // place a single-threaded kernel function
      data[0] = 1;
    });
  }
);
Offload a syclFlow
By default, the executor offloads and executes the syclFlow once. When a syclFlow is being executed, its task graph will be materialized by the Taskflow runtime and submitted to its associated SYCL queue in a topological order of task dependencies defined in that graph. You can explicitly execute a syclFlow using different offload methods:
taskflow.emplace_on([](tf::syclFlow& sf) {
  // ... create SYCL tasks
  sf.offload();      // offload the syclFlow and run it once
  sf.offload_n(10);  // offload the syclFlow and run it 10 times
  sf.offload_until([repeat=5] () mutable {
    return repeat-- == 0;
  });                // offload the syclFlow and run it five times
}, queue);
After you offload a syclFlow, it is considered executed, and the executor will not run it again after leaving the syclFlow task callable. On the other hand, if a syclFlow is never offloaded, the executor runs it once. For example, the following two versions represent the same execution logic.
// version 1: explicitly offload a syclFlow once
taskflow.emplace_on([](tf::syclFlow& sf) {
  sf.single_task([](){});
  sf.offload();
}, queue);

// version 2 (same as version 1): executor offloads the syclFlow once
taskflow.emplace_on([](tf::syclFlow& sf) {
  sf.single_task([](){});
}, queue);
Update a syclFlow
You can update a task in an offloaded syclFlow and rebind it to another task type. For example, you can rebind a memory operation task to a parallel-for kernel task, and vice versa.
size_t N = 10000;

sycl::queue queue;

int* data = sycl::malloc_shared<int>(N, queue);

taskflow.emplace_on([&](tf::syclFlow& syclflow){
  // create a task to set each element to -1
  tf::syclTask task = syclflow.fill(data, -1, N);
  syclflow.offload();
  std::for_each(data, data+N, [](int i){ assert(i == -1); });

  // rebind the task to a parallel-for kernel task setting each element to 100
  syclflow.rebind_parallel_for(task, sycl::range<1>(N), [data](sycl::id<1> id){
    data[id] = 100;
  });
  syclflow.offload();
  std::for_each(data, data+N, [](int i){ assert(i == 100); });
}, queue);

executor.run(taskflow).wait();
Each method of task creation in tf::syclFlow has a corresponding method for rebinding a task to that task type (e.g., tf::syclFlow::parallel_for and tf::syclFlow::rebind_parallel_for).
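For instance, continuing the example above, the kernel task could be rebound back to a fill task (a hedged sketch: rebind_fill is assumed here as the rebind counterpart of tf::syclFlow::fill, mirroring the rebind_parallel_for call shown above):

// assumption: rebind_fill is the rebind counterpart of tf::syclFlow::fill
syclflow.rebind_fill(task, data, -1, N);  // task now fills each element with -1 again
syclflow.offload();
std::for_each(data, data+N, [](int i){ assert(i == -1); });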
Use syclFlow in a Standalone Environment
You can use tf::syclFlow in a standalone environment without going through tf::Taskflow, by constructing it directly from a SYCL queue.
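In the snippet below, hx, hy, dx, and dy are not created by the syclFlow itself; they are assumed to be allocated beforehand, for example (a sketch):

std::vector<float> hx(N, 1.0f), hy(N, 2.0f);       // host-side data
float* dx = sycl::malloc_device<float>(N, queue);  // device-side data
float* dy = sycl::malloc_device<float>(N, queue);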
sycl::queue queue;

tf::syclFlow sf(queue);  // create a standalone syclFlow

tf::syclTask h2d_x = sf.copy(dx, hx.data(), N).name("h2d_x");
tf::syclTask h2d_y = sf.copy(dy, hy.data(), N).name("h2d_y");

tf::syclTask d2h_x = sf.copy(hx.data(), dx, N).name("d2h_x");
tf::syclTask d2h_y = sf.copy(hy.data(), dy, N).name("d2h_y");

tf::syclTask saxpy = sf.parallel_for(
  sycl::range<1>(N), [=] (sycl::id<1> id) {
    dx[id] = 2.0f * dx[id] + dy[id];
  }
).name("saxpy");

saxpy.succeed(h2d_x, h2d_y)   // kernel runs after host-to-device copy
     .precede(d2h_x, d2h_y);  // kernel runs before device-to-host copy

sf.offload();  // offload and run the standalone syclFlow once