OpenCL
Sathish Vadhiyar
Sources: OpenCL quick overview from AMD; OpenCL learning kit from AMD
Introduction
- OpenCL is a programming framework for heterogeneous computing resources
- Resources include CPUs, GPUs, Cell Broadband Engine, FPGAs, DSPs
OpenCL Platform Model
- Each OpenCL implementation (i.e. an OpenCL library from AMD, NVIDIA, etc.) defines platforms which enable the host system to interact with OpenCL-capable devices
- Currently each vendor supplies only a single platform per implementation
Many similarities with CUDA….
Command Queues
- A command queue is the mechanism for the host to request that an action be performed by the device
  - Perform a memory transfer, begin executing, etc.
- Interesting concept of enqueuing kernels and satisfying dependencies using events
- A separate command queue is required for each device
- Commands within the queue can be synchronous or asynchronous
- Commands can execute in-order or out-of-order
Perhaad Mistry & Dana Schaa, Northeastern Univ Computer Architecture Research Lab, with Ben Gaster, AMD © 2011
This provides asynchronous execution of multiple kernels on a device, a feature present in Fermi GPUs
Memory Objects
- Memory objects are OpenCL data that can be moved on and off devices
- Objects are classified as either buffers or images
- Buffers
  - Contiguous chunks of memory, stored sequentially, that can be accessed directly (arrays, pointers, structs)
  - Read/write capable
- Images
  - Opaque objects (2D or 3D)
  - Can only be accessed via the read_image and write_image built-in functions (e.g. read_imagef, write_imagef)
  - Can either be read or written in a kernel, but not both
Example: Vector Addition
Example Kernel
- Simple vector addition kernel:

    __kernel void vecadd(__global int* A,
                         __global int* B,
                         __global int* C)
    {
        int tid = get_global_id(0);
        C[tid] = A[tid] + B[tid];
    }
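To check device results on the host, it can help to have a plain C reference that runs the same kernel body once per global ID. The following is a minimal sketch, not part of the original slides; the names vecadd_work_item and vecadd_reference are hypothetical, and the loop index stands in for the value get_global_id(0) would return.

```c
#include <stddef.h>

/* Body of the vecadd kernel for a single work-item. In OpenCL the index
   comes from get_global_id(0); here it is passed in explicitly. */
static void vecadd_work_item(size_t tid, const int *A, const int *B, int *C)
{
    C[tid] = A[tid] + B[tid];
}

/* Hypothetical helper: run the kernel body serially, once per global ID.
   A device would run these iterations as independent work-items. */
void vecadd_reference(const int *A, const int *B, int *C, size_t global_size)
{
    for (size_t tid = 0; tid < global_size; tid++) {
        vecadd_work_item(tid, A, B, C);
    }
}
```

Comparing the output of vecadd_reference with the buffer read back from the device is one simple way to validate a first OpenCL program.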
Executing the Kernel
- Need to set the dimensions of the index space, and (optionally) of the work-group sizes
- Kernels execute asynchronously from the host
  - clEnqueueNDRangeKernel just adds it to the queue, but doesn't guarantee that it will start executing
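In OpenCL 1.x, when a work-group size is given explicitly, the global size in each dimension must be evenly divisible by it. A common way to satisfy this is to round the problem size up to the next multiple of the work-group size. The helper below is a sketch of that arithmetic only; the name round_up_global_size is hypothetical and it makes no OpenCL calls.

```c
#include <stddef.h>

/* Hypothetical helper: smallest multiple of local_size that is >= n.
   Returns 0 if local_size is 0, since no valid size exists in that case. */
size_t round_up_global_size(size_t n, size_t local_size)
{
    if (local_size == 0) {
        return 0;
    }
    size_t remainder = n % local_size;
    return remainder == 0 ? n : n + (local_size - remainder);
}
```

For example, 1000 elements with a work-group size of 256 give a global size of 1024. When the global size is padded like this, the kernel would also need the real element count as an extra argument and a guard such as `if (tid < n)`; the vecadd kernel as shown in the slides has no such guard.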
Big Picture
Example 2 – Image Rotation
Slides 8, of lecture 5 in the OpenCL University Kit
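The referenced slides are not reproduced here. As a rough host-side sketch of what an image rotation computes, the plain C function below maps every output pixel back to a source pixel by an inverse rotation about the image centre. This is an assumption about the general technique, not the kit's kernel: the names nearest_index and rotate_image_reference are hypothetical, sampling is nearest-neighbour, and pixels that map outside the source are set to 0.

```c
/* Map a continuous coordinate to the nearest valid index in [0, n),
   or return -1 if it falls outside the image. */
static int nearest_index(float v, int n)
{
    if (v < -0.5f || v >= (float)n - 0.5f) {
        return -1;
    }
    return (int)(v + 0.5f);
}

/* Hypothetical reference: rotate a W x H row-major image about its centre.
   The caller supplies cos(theta) and sin(theta). Each (ix, iy) iteration
   corresponds to one work-item of a 2D index space. */
void rotate_image_reference(const float *src, float *dst, int W, int H,
                            float cos_theta, float sin_theta)
{
    float x0 = (W - 1) / 2.0f;
    float y0 = (H - 1) / 2.0f;
    for (int iy = 0; iy < H; iy++) {
        for (int ix = 0; ix < W; ix++) {
            float dx = (float)ix - x0;
            float dy = (float)iy - y0;
            /* Inverse-rotate the output pixel to find its source pixel. */
            int sx = nearest_index(dx * cos_theta + dy * sin_theta + x0, W);
            int sy = nearest_index(-dx * sin_theta + dy * cos_theta + y0, H);
            dst[iy * W + ix] = (sx >= 0 && sy >= 0) ? src[sy * W + sx] : 0.0f;
        }
    }
}
```

Because each output pixel is computed independently, the two loops translate naturally into a two-dimensional NDRange with one work-item per pixel.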
Synchronization
Synchronization in OpenCL
- Synchronization is required if we use an out-of-order command queue or multiple command queues
- Coarse synchronization granularity
  - Per command queue basis
- Finer synchronization granularity
  - Per OpenCL operation basis using events
OpenCL Command Queue Control
- Command queue synchronization methods work on a per-queue basis
- Flush: clFlush( cl_command_queue )
  - Send all commands in the queue to the compute device
  - No guarantee that they will be complete when clFlush returns
- Finish: clFinish( cl_command_queue )
  - Waits for all commands in the command queue to complete before proceeding (host blocks on this call)
- Barrier: clEnqueueBarrier( cl_command_queue )
  - Enqueue a synchronization point that ensures all prior commands in a queue have completed before any further commands execute
OpenCL Events
- Previous OpenCL synchronization functions only operated on a per-command-queue granularity
- OpenCL events are needed to synchronize at a function granularity
- Explicit synchronization is required for
  - Out-of-order command queues
  - Multiple command queues
- OpenCL events are data-types defined by the specification for storing timing information returned by the device
Using User Events
- A simple example of user events being triggered and used in a command queue

    // Create a user event which will start the write of buf1
    user_event = clCreateUserEvent(ctx, NULL);
    clEnqueueWriteBuffer(cq, buf1, CL_FALSE, ..., 1, &user_event, NULL);
    // The write of buf1 is now enqueued and waiting on user_event
    X = foo();  // Lots of complicated host processing code
    clSetUserEventStatus(user_event, CL_COMPLETE);
    // The clEnqueueWriteBuffer to buf1 can now proceed using the output of foo()
Events for Asynchronous I/O
- Two command queues created on the same device
- Different from the asymptotic analysis case of dividing computation between queues
  - In this case we use different queues for I/O and compute
- We have no output data moving from host to device for each image, so using separate command queues will also allow for latency hiding
[Figure: timeline of two queues. I/O queue: Copy(Image0), Copy(Image1), Copy(Image2). Compute queue: ComputeKernel(Image0), ComputeKernel(Image1), ComputeKernel(Image2).]
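The benefit of the two-queue arrangement can be estimated with a small timing model. The sketch below is an illustration under stated assumptions, not a measurement and not OpenCL code: the function names are hypothetical, every copy takes t_copy and every kernel takes t_kernel, kernel i waits on the event of copy i, and each image has its own device buffer so a copy never waits for a kernel.

```c
/* Hypothetical model: total time for n images when copies run on an I/O
   queue and kernels on a separate compute queue, both in-order. */
double overlapped_time(int n, double t_copy, double t_kernel)
{
    double copy_done = 0.0;    /* completion time of the latest copy */
    double kernel_done = 0.0;  /* completion time of the latest kernel */
    for (int i = 0; i < n; i++) {
        copy_done += t_copy;   /* I/O queue: copies run back to back */
        /* Compute queue: wait for copy i's event and the previous kernel. */
        double start = copy_done > kernel_done ? copy_done : kernel_done;
        kernel_done = start + t_kernel;
    }
    return kernel_done;
}

/* Single in-order queue: every copy and kernel runs one after another. */
double serial_time(int n, double t_copy, double t_kernel)
{
    return n * (t_copy + t_kernel);
}
```

With three images, a copy time of 2 and a kernel time of 5, the model gives 21 time units for a single queue and 17 with separate queues, because every copy after the first is hidden behind a running kernel.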
Multiple Devices
Multiple Devices
- OpenCL can also be used to program multiple devices (CPU, GPU, Cell, DSP etc.)
- OpenCL does not assume that data can be transferred directly between devices, so commands only exist to move data from a host to a device, or from a device to a host
  - Copying from one device to another requires an intermediate transfer to the host
- OpenCL events are used to synchronize execution on different devices within a context
Compiling Code for Multiple Devices