Advanced / Other Programming Models
Sathish Vadhiyar
OpenCL – Command Queues, Runtime Compilation, Multiple Devices
Sources: OpenCL overview from AMD; OpenCL learning kit from AMD
Introduction
OpenCL is a programming framework for heterogeneous computing resources.
Resources include CPUs, GPUs, Cell Broadband Engine, FPGAs, and DSPs.
It has many similarities with CUDA.
Command Queues
A command queue is the mechanism by which the host requests that an action be performed by the device.
  Perform a memory transfer, begin executing, etc.
An interesting concept: kernels are enqueued, and dependencies are satisfied using events.
A separate command queue is required for each device.
Commands within the queue can be synchronous or asynchronous.
Commands can execute in-order or out-of-order.
Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
Example – Image Rotation
Slides 8, of lecture 5 in the OpenCL University Kit
Synchronization
Synchronization in OpenCL
Synchronization is required if we use an out-of-order command queue or multiple command queues.
Coarse synchronization granularity: per command queue.
Finer synchronization granularity: per OpenCL operation, using events.
OpenCL Command Queue Control
Command queue synchronization methods work on a per-queue basis.
Flush: clFlush( cl_command_queue )
  Sends all commands in the queue to the compute device.
  No guarantee that they will be complete when clFlush returns.
Finish: clFinish( cl_command_queue )
  Waits for all commands in the command queue to complete before proceeding (the host blocks on this call).
Barrier: clEnqueueBarrier( cl_command_queue )
  Enqueues a synchronization point that ensures all prior commands in the queue have completed before any further commands execute.
  (clEnqueueBarrier is the OpenCL 1.1 call; OpenCL 1.2 deprecates it in favor of clEnqueueBarrierWithWaitList.)
OpenCL Events
The previous OpenCL synchronization functions operate only at per-command-queue granularity.
OpenCL events are needed to synchronize at function granularity.
Explicit synchronization is required for:
  Out-of-order command queues
  Multiple command queues
Using User Events
A simple example of a user event being triggered and used in a command queue:
  // Create the user event that will gate the write of buf1
  user_event = clCreateUserEvent(ctx, NULL);
  clEnqueueWriteBuffer(cq, buf1, CL_FALSE, ..., 1, &user_event, NULL);
  // The write of buf1 is now enqueued and waiting on user_event
  X = foo();  // Lots of complicated host processing code
  clSetUserEventStatus(user_event, CL_COMPLETE);
  // The clEnqueueWriteBuffer to buf1 can now proceed, using the output of foo()
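The gating behaviour in the user-event example can be mimicked with a toy model in plain C++. This is only a sketch of the semantics, not OpenCL code: Event, Command, and ToyQueue are hypothetical stand-ins for cl_event, an enqueued command, and an out-of-order command queue, and "running" a command just records its name.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for cl_event: only tracks whether it has completed.
struct Event {
    bool complete = false;
};

// Toy stand-in for an enqueued command with an event wait list.
struct Command {
    std::string name;
    std::vector<std::shared_ptr<Event>> wait_list;  // all must be complete before running
    std::shared_ptr<Event> done;                    // marked complete once the command has run
    bool finished;
};

// Toy out-of-order queue: any command whose wait list is satisfied may run.
class ToyQueue {
public:
    // Enqueue a command; returns the event that marks its completion.
    std::shared_ptr<Event> enqueue(const std::string& name,
                                   std::vector<std::shared_ptr<Event>> wait_list = {}) {
        for (const auto& e : wait_list) assert(e != nullptr);
        auto done = std::make_shared<Event>();
        commands_.push_back(Command{name, std::move(wait_list), done, false});
        return done;
    }

    // Run every ready command, repeating until no further progress is possible.
    // Returns the names of the commands in the order they ran.
    std::vector<std::string> drain() {
        std::vector<std::string> ran;
        bool progress = true;
        while (progress) {
            progress = false;
            for (auto& c : commands_) {
                if (c.finished || !ready(c)) continue;
                c.finished = true;
                c.done->complete = true;
                ran.push_back(c.name);
                progress = true;
            }
        }
        return ran;
    }

private:
    static bool ready(const Command& c) {
        for (const auto& e : c.wait_list)
            if (!e->complete) return false;
        return true;
    }
    std::vector<Command> commands_;
};
```

In this model, a command enqueued with a user event in its wait list stays pending across calls to drain() until the host sets that event's complete flag, which plays the role of clSetUserEventStatus(user_event, CL_COMPLETE). Commands without unmet dependencies run in the meantime.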
Multiple Devices
Multiple Devices
OpenCL can also be used to program multiple devices (CPU, GPU, Cell, DSP, etc.).
OpenCL does not assume that data can be transferred directly between devices, so commands exist only to move data from host to device or from device to host.
Copying from one device to another requires an intermediate transfer to the host.
OpenCL events are used to synchronize execution on different devices within a context.
Compiling Code for Multiple Devices
Charm++
Source: Tutorial slides from the Parallel Programming Lab, UIUC
Authors: Laxmikant Kale, Eric Bohm
Virtualization: Object-based Decomposition
In MPI, the number of processes is typically equal to the number of processors.
Virtualization: divide the computation into a large number of pieces.
  Independent of the number of processors.
  Typically larger than the number of processors.
Let the system map objects to processors.
The Charm++ Model
Parallel objects (chares) communicate via asynchronous method invocations (entry methods).
The runtime system maps chares onto processors and schedules execution of entry methods.
Chares can be dynamically created on any available processor.
Chares can be accessed from remote processors.
Charm++ Basics
10/17/2015 CS
Processor Virtualization
[Figure: user view vs. system implementation]
The user is only concerned with the interaction between objects (VPs).
Adaptive Overlap via Data-driven Objects
Problem: processors wait too long at "receive" statements.
With virtualization, you get data-driven execution.
  There are multiple entities (objects, threads) on each processor.
  No single object or thread holds up the processor.
  Each one is "continued" when its data arrives.
So: this achieves automatic and adaptive overlap of computation and communication.
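Data-driven execution can be illustrated with a toy per-processor scheduler in plain C++. This is a sketch of the idea only and does not use the Charm++ runtime: Message, Worker, and Scheduler are hypothetical names, and the "network" is just a call to deliver().

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// A message addressed to one object on this processor (hypothetical layout).
struct Message {
    int target;   // index of the receiving object
    int payload;  // data carried by the message
};

// Toy data-driven object: its "entry method" runs only when data arrives.
struct Worker {
    int sum = 0;
    void receive(int payload) { sum += payload; }
};

// Toy scheduler: it serves whichever object has data, never blocking on one object.
class Scheduler {
public:
    explicit Scheduler(std::size_t num_objects) : objects_(num_objects) {
        assert(num_objects > 0);
    }

    // Called when a message arrives for an object on this processor.
    void deliver(const Message& m) {
        assert(m.target >= 0 &&
               static_cast<std::size_t>(m.target) < objects_.size());
        inbox_.push_back(m);
    }

    // Process pending messages in arrival order.
    // Returns the sequence of object indices that were "continued".
    std::vector<int> run() {
        std::vector<int> order;
        while (!inbox_.empty()) {
            Message m = inbox_.front();
            inbox_.pop_front();
            objects_[static_cast<std::size_t>(m.target)].receive(m.payload);
            order.push_back(m.target);
        }
        return order;
    }

    int sum_of(std::size_t i) const { return objects_.at(i).sum; }

private:
    std::vector<Worker> objects_;
    std::deque<Message> inbox_;
};
```

The execution order follows message arrival rather than object index, so an object whose data has not arrived yet (object 1 if only objects 0 and 2 receive messages) never stalls the others.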
Load Balancing
Using Dynamic Mapping to Processors
Migration
  Charm objects can migrate from one processor to another.
  Migration creates a new object on the destination processor while destroying the original.
  Use this for dynamic (and static, initial) load balancing.
Measurement-based, predictive strategies
  Based on object communication patterns and computational loads.
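As an illustration of a measurement-based strategy, the sketch below remaps objects using a simple greedy heuristic: the heaviest remaining object goes to the currently least-loaded processor. This heuristic is an assumption chosen for illustration, not the strategy the slides or the Charm++ runtime prescribe, and it ignores communication patterns. The names greedy_map and per_proc_load are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// loads[i] is the measured load of object i.
// Returns placement[i] = processor chosen for object i.
std::vector<int> greedy_map(const std::vector<double>& loads, int num_procs) {
    assert(num_procs > 0);
    std::vector<std::size_t> order(loads.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    // Heaviest first; stable sort keeps ties in object-index order.
    std::stable_sort(order.begin(), order.end(),
                     [&loads](std::size_t a, std::size_t b) { return loads[a] > loads[b]; });

    std::vector<double> proc_load(static_cast<std::size_t>(num_procs), 0.0);
    std::vector<int> placement(loads.size(), 0);
    for (std::size_t obj : order) {
        // min_element returns the first (lowest-numbered) least-loaded processor.
        auto least = std::min_element(proc_load.begin(), proc_load.end());
        placement[obj] = static_cast<int>(least - proc_load.begin());
        *least += loads[obj];
    }
    return placement;
}

// Total load assigned to each processor under a given placement.
std::vector<double> per_proc_load(const std::vector<double>& loads,
                                  const std::vector<int>& placement, int num_procs) {
    assert(num_procs > 0 && loads.size() == placement.size());
    std::vector<double> total(static_cast<std::size_t>(num_procs), 0.0);
    for (std::size_t i = 0; i < loads.size(); ++i) {
        assert(placement[i] >= 0 && placement[i] < num_procs);
        total[static_cast<std::size_t>(placement[i])] += loads[i];
    }
    return total;
}
```

In a migratable-object system, the difference between the old and new placement would determine which objects migrate; here the functions only compute the mapping.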
Summary: Primary Advantages
Automatic mapping
Migration and load balancing
Asynchronous, message-driven communication
Computation-communication overlap
How does it look?
Asynchronous Hello World
The program's asynchronous flow:
  The mainchare sends a message to the Hello object.
  The Hello object prints "Hello World!".
  The Hello object sends a message back to the mainchare.
  The mainchare quits the application.
Code and Workflow
Hello World: Array Version
Main Code
Array Code
Result
$ ./charmrun +p3 ./hello 10
Running “Hello World” with 10 elements using 3 processors.
“Hello” from Hello chare #0 on processor 0 (told by -1)
“Hello” from Hello chare #1 on processor 0 (told by 0)
“Hello” from Hello chare #2 on processor 0 (told by 1)
“Hello” from Hello chare #3 on processor 0 (told by 2)
“Hello” from Hello chare #4 on processor 1 (told by 3)
“Hello” from Hello chare #5 on processor 1 (told by 4)
“Hello” from Hello chare #6 on processor 1 (told by 5)
“Hello” from Hello chare #7 on processor 2 (told by 6)
“Hello” from Hello chare #8 on processor 2 (told by 7)
“Hello” from Hello chare #9 on processor 2 (told by 8)
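The placement seen in this output (chares 0 to 3 on processor 0, 4 to 6 on processor 1, 7 to 9 on processor 2) is consistent with a block distribution in which the leftover elements go to the lowest-numbered processors. The sketch below reproduces that placement in plain C++. It is an assumption inferred from the output shown, not a statement of how the Charm++ runtime chooses its mapping, and block_map is a hypothetical name.

```cpp
#include <cassert>

// Block distribution: each processor gets num_elements / num_procs consecutive
// elements, and the first (num_elements % num_procs) processors get one extra.
int block_map(int index, int num_elements, int num_procs) {
    assert(num_procs > 0 && index >= 0 && index < num_elements);
    int base = num_elements / num_procs;    // minimum elements per processor
    int extra = num_elements % num_procs;   // processors holding base + 1 elements
    int boundary = extra * (base + 1);      // first index owned by a "base"-sized block
    if (index < boundary) return index / (base + 1);
    // Reached only when base > 0, so the division is safe.
    return extra + (index - boundary) / base;
}
```

For 10 elements on 3 processors this gives block sizes of 4, 3, and 3. Each chare in the run above is also "told by" its predecessor (index minus one), with -1 marking the initial message sent to chare #0.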