1 OpenCL Ryan Renna

2 Overview
- Introduction
- History
- Anatomy of OpenCL
- Execution Model
- Memory Model
- Implementation
- Applications
- The Future

3 Goals
- Knowledge that is transferable to all APIs
- Overview of concepts rather than API-specific terminology
- Avoid coding examples as much as possible

4 Introduction

5 What is OpenCL
A Language:
- Open Computing Language, it’s C-like!
- Executes code across mixed platforms consisting of CPUs, GPUs and other processors.
An API:
- Runs on the “Host” to manipulate and control OpenCL objects and code.
- Deals with devices as abstract processing units.

6 Why Use GPUs?
- Modern GPUs are made up of many highly parallel processing units, known as “Stream Processors”
- Most modern PCs have a GPU that sits idle during most day-to-day processing
- Using the GPU for this kind of work is known as “General-Purpose Computation on Graphics Processing Units”, or GPGPU

7 The Stream Processor
- Any device capable of Stream Processing, a paradigm related to SIMD
- Given a set of data (the stream), a series of functions (called Kernel functions) is applied to each element
- On-chip memory is used to minimize external memory bandwidth
Did you know: The Cell processor, developed by Toshiba, Sony & IBM, is a Stream Processor?

8 Streams
- Most commonly 2D grids (Textures)
- Map well to Matrix Algebra, Image Processing, Physics simulations, etc.
Did you know: The latest ATI card has 1600 individual Stream Processors?

9 Kernel Functions
Traditional sequential method:
    for (int i = 0; i < 100 * 4; i++) {
        result[i] = source0[i] + source1[i];
    }
The same process, using the kernel “vector_sum”:
    for (int el = 0; el < 100; el++) {
        vector_sum(result[el], source0[el], source1[el]);
    }
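To make the comparison concrete, here is a minimal sketch in plain C (not OpenCL C) of what a kernel like `vector_sum` might do. The `float4` struct, the pointer-based signature and the helper `vector_sum_all` are assumptions made for illustration; the slide shows only the call. Each call handles one 4-wide element, so 100 calls cover the same 400 floats as the sequential loop.

```c
#include <stddef.h>

/* Hypothetical 4-wide vector type, standing in for OpenCL's built-in float4. */
typedef struct { float x, y, z, w; } float4;

/* Kernel-style function: operates on exactly one element of the stream. */
static void vector_sum(float4 *result, const float4 *a, const float4 *b)
{
    result->x = a->x + b->x;
    result->y = a->y + b->y;
    result->z = a->z + b->z;
    result->w = a->w + b->w;
}

/* Host-side emulation (hypothetical helper): applies the kernel to every
 * element in turn. On an OpenCL device, each iteration would instead run
 * as its own work-item, potentially in parallel. */
static void vector_sum_all(float4 *result, const float4 *source0,
                           const float4 *source1, size_t count)
{
    for (size_t el = 0; el < count; el++)
        vector_sum(&result[el], &source0[el], &source1[el]);
}
```

Because no iteration reads what another iteration writes, the loop order does not matter, which is what makes this work a good fit for a stream processor.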

10 An “Open” Computing Language
- Machines with multiple CPUs and multiple GPUs, all from different vendors, can work together.

11 History

12 GPGPU
- General-Purpose Computation on Graphics Processing Units
- Coined in 2002, with the rise of using GPUs for non-graphics applications
- Hardware-specific GPGPU APIs have been created:
  - CUDA (NVIDIA, 2007)
  - Close To Metal (ATI, 2006)

14 The next step
- OpenCL:
  - Initially developed by Apple
  - Refined in collaboration with AMD, Intel, IBM and NVIDIA
  - Submitted to the Khronos Group
  - The specification for OpenCL 1.0 was finished 5 months later

15 You may remember me from such open standards as…
- OpenGL – 2D and 3D graphics API
- OpenSL ES – audio API for embedded systems
- OpenGL ES – OpenGL for embedded systems. Used in most smartphones.
- COLLADA – XML-based schema for storing 3D assets.

16 Anatomy of OpenCL

17 API – Platform Layer
- Compute Device
  - A processor that executes data-parallel programs. Contains Compute Units.
- Compute Unit
  - A processing element. Example: a core of a CPU.
- Queues
  - Submit work to a compute device. Can be in-order or out-of-order.
- Context
  - A collection of compute devices. Enables memory sharing across devices.
- Host
  - Container of Contexts. Represents the computer itself.

18 Host Example
- A host computer with one device group
  - A dual-core CPU
  - A GPU with 8 Stream Processors

19 API – Runtime Layer
- Memory Objects
  - Buffers: blocks of memory, accessed as arrays, pointers or structs
  - Images: 2D or 3D images
- Executable Objects
  - Kernel: a data-parallel function that is executed by a compute device
  - Program: a group of kernels and functions
- Synchronization
  - Events
Caveat: Each image can be read or written in a kernel, but not both.

20 Example Flow
[Diagram: three stages. Compile Code: a Program with a collection of Kernels is built into CPU & GPU Binaries. Create Data & Arguments: Memory Objects (Buffers, Images). Send to Execution: the Compute Device, through an In-Order Queue or an Out-of-Order Queue.]

21 Execution Model of OpenCL

22 N-D Space
- The N-dimensional computation domain is called the N-D Space; it defines the total number of elements of execution
  - Defined by the Global Dimensions
- Each element of execution, representing an instance of a kernel, is called a work-item
- Work-items are grouped into local workgroups
  - Their size is defined by the Local Dimensions

23 Work-Items
- Work-items in different workgroups run in parallel independently (no synchronization between them)
- Work-items within the same workgroup can be synchronized, and share workgroup memory
- Each work-item runs as its own thread
- Thousands of lightweight threads can be running at a time, and are managed by the device
- Each work-item is assigned a unique global id and a local id within its workgroup, and each workgroup is assigned a workgroup id
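The relationship between the three ids can be sketched in plain C for a single dimension. This is an illustration only: `work_item_pos`, `split_global_id` and `join_global_id` are hypothetical names, ids are zero-based as in OpenCL, and a global offset of zero is assumed. Inside a real kernel the same values come from the built-ins `get_global_id()`, `get_group_id()` and `get_local_id()`.

```c
#include <stddef.h>

/* Position of one work-item along a single dimension. */
typedef struct {
    size_t group_id;  /* which workgroup the work-item belongs to */
    size_t local_id;  /* position of the work-item inside that workgroup */
} work_item_pos;

/* Splits a zero-based global id into workgroup id and local id.
 * local_size is the workgroup size along this dimension (must be non-zero). */
static work_item_pos split_global_id(size_t global_id, size_t local_size)
{
    work_item_pos pos;
    pos.group_id = global_id / local_size;
    pos.local_id = global_id % local_size;
    return pos;
}

/* Inverse mapping: global id = group id * local size + local id. */
static size_t join_global_id(work_item_pos pos, size_t local_size)
{
    return pos.group_id * local_size + pos.local_id;
}
```

For example, with an assumed workgroup size of 16, global id 16 splits into workgroup 1 and local id 0: it is the first work-item of the second workgroup.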

24 Example – Image Filter
Executed on a 128 x 128 image, our Global Dimensions are 128, 128. We will have 16,384 work-items in total. We can then define Local Dimensions of 16, 16, which divide the image evenly into 64 workgroups of 256 work-items each. Since workgroups are executed together, and work-items can only be synchronized within workgroups, picking your Global and Local Dimensions is problem specific. If we asked for the local id of the work-item with global id 16 along one dimension, we’d receive 0 (ids are zero-based), as it’s the 1st work-item of the 2nd workgroup.
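The arithmetic behind this example can be checked with a small sketch in plain C. The helper names are hypothetical, and the 16 x 16 local size used in the note below is an assumption chosen because it divides 128 evenly; a local size that does not divide the global size is reported as invalid by returning 0.

```c
#include <stddef.h>

/* Total number of work-items in a 2D range. */
static size_t count_work_items_2d(const size_t global[2])
{
    return global[0] * global[1];
}

/* Number of workgroups in a 2D range, or 0 if the local dimensions
 * are zero or do not divide the global dimensions evenly. */
static size_t count_workgroups_2d(const size_t global[2], const size_t local[2])
{
    if (local[0] == 0 || local[1] == 0)
        return 0;
    if (global[0] % local[0] != 0 || global[1] % local[1] != 0)
        return 0;
    return (global[0] / local[0]) * (global[1] / local[1]);
}
```

For a 128 x 128 image this gives 16,384 work-items; a 16 x 16 local size yields 64 workgroups, while a 30 x 30 local size is rejected because 30 does not divide 128.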

25 Memory Model of OpenCL

26 Memory Model
- Private
  - Per work-item
- Local
  - Shared within a workgroup
- Global/Constant
  - Not synchronized, per device
- Host Memory
[Diagram: the Host with its Host Memory, connected to a Compute Device. The device holds Global / Constant Memory; each Compute Unit has its own Local Memory; each Work Item has its own Private memory.]

27 Intermission

28 Implementation

29 Identifying Parallelizable Routines
- Key thoughts:
  - Work-items should be independent of each other
  - Work-items within a workgroup can share data, but workgroups run independently of one another, so they cannot depend on each other’s results
  - Find tasks that are independent and highly repeated; pay attention to loops
  - Transferring data over the PCI bus has overhead, so parallelization is only justified for large data sets or ones that require a lot of mathematical computation

30 An Example – Class Average
- Let’s imagine we were writing an application that computed the class average
- There are two tasks we’d need to perform:
  - Compute the final grade for each student
  - Obtain a class average by averaging the final grades

32 Pseudo Code
- Compute the final grade for each student
    foreach (student in class) {
        grades = student.getGrades();
        sum = 0;
        count = 0;
        foreach (grade in grades) {
            sum += grade;
            count++;
        }
        student.averageGrade = sum / count;
    }

33 Pseudo Code
    foreach (student in class) {
        grades = student.getGrades();
        sum = 0;
        count = 0;
        foreach (grade in grades) {
            sum += grade;
            count++;
        }
        student.averageGrade = sum / count;
    }
- This code can be isolated.
    __kernel void calcGrade(__global const float* input, __global float* output)
    {
        int i = get_global_id(0);
        // Do work on class[i]
    }
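As a rough illustration of how the isolated body could look once filled in, here is a plain C emulation of both tasks. Several things are assumptions not found in the slides: the grades are stored in one flat array with a fixed number of grades per student, the index `i` is passed as a parameter instead of coming from `get_global_id(0)`, and the names `calc_grade` and `class_average` are hypothetical.

```c
#include <stddef.h>

/* Kernel-style body: computes the final grade of student i.
 * input holds grades_per_student grades for each student, back to back.
 * grades_per_student must be non-zero. */
static void calc_grade(const float *input, float *output,
                       size_t grades_per_student, size_t i)
{
    float sum = 0.0f;
    for (size_t g = 0; g < grades_per_student; g++)
        sum += input[i * grades_per_student + g];
    output[i] = sum / (float)grades_per_student;
}

/* Host-side emulation: one "work-item" per student, then the second task,
 * averaging the final grades. students must be non-zero. */
static float class_average(const float *input, float *output,
                           size_t students, size_t grades_per_student)
{
    float total = 0.0f;
    /* Task 1: each iteration is independent, so it maps to one work-item. */
    for (size_t i = 0; i < students; i++)
        calc_grade(input, output, grades_per_student, i);
    /* Task 2: combine the per-student results into one class average. */
    for (size_t i = 0; i < students; i++)
        total += output[i];
    return total / (float)students;
}
```

The first loop is the part worth moving to the device, since no student’s grade depends on another’s. The second loop combines all results and is kept sequential here for simplicity.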

34 Determining the Data Dimensions
- First decide how to represent your problem; this will tell you the dimensionality of your Global and Local dimensions.
- Global dimensions are problem specific
- Local dimensions are algorithm specific
- Local dimensions must have the same number of dimensions as Global.
- Local dimensions must divide the global space evenly
- Passing NULL as the workgroup size argument lets the OpenCL implementation choose the workgroup size, but the kernel then cannot rely on a particular workgroup size when synchronizing work-items
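When the problem size is not a multiple of the chosen local size, a common workaround (not taken from the slides) is to round the global size up to the next multiple and have the kernel skip the padding work-items, for example with a guard such as `if (i >= n) return;`. A minimal sketch in plain C, where `round_up_global_size` is a hypothetical helper name:

```c
#include <stddef.h>

/* Rounds problem_size up to the next multiple of local_size so that the
 * local size divides the global size evenly. local_size must be non-zero. */
static size_t round_up_global_size(size_t problem_size, size_t local_size)
{
    size_t remainder = problem_size % local_size;
    if (remainder == 0)
        return problem_size;           /* already an exact multiple */
    return problem_size + (local_size - remainder);
}
```

With an assumed local size of 16, a problem of 100 elements would be launched with a global size of 112, and the last 12 work-items would do nothing.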

35 Execution Steps
- An OpenCL calculation needs to perform 6 key steps:
  - Initialization
  - Allocate Resources
  - Creating Programs/Kernels
  - Execution
  - Read the Result(s)
  - Clean Up
Warning! Code Ahead

36 Initialization
- Store the kernel source in a string/char array
    const char* kernel_source =
        "__kernel void calcGrade(__global const float* input,\n"
        "                        __global float* output)\n"
        "{\n"
        "    int i = get_global_id(0);\n"
        "    // Do work on class[i]\n"
        "}\n";

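37 Initialization
- Select a device and create a context in which to run the calculation
    #include <CL/cl.h>   /* <OpenCL/opencl.h> on Mac OS X */
    cl_int err;
    cl_platform_id platform;
    cl_device_id devices;
    cl_context context;
    cl_command_queue cmd_queue;
    /* Pick the first platform, then the first GPU it exposes */
    err = clGetPlatformIDs(1, &platform, NULL);
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &devices, NULL);
    context = clCreateContext(0, 1, &devices, NULL, NULL, &err);
    cmd_queue = clCreateCommandQueue(context, devices, 0, NULL);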

38 Allocation
- Allocate the memory/storage that will be used on the device and push the data to it
    cl_mem ax_mem = clCreateBuffer(context, CL_MEM_READ_ONLY, atom_buffer_size, NULL, NULL);
    err = clEnqueueWriteBuffer(cmd_queue, ax_mem, CL_TRUE, 0, atom_buffer_size, (void*)values, 0, NULL, NULL);

39 Program/Kernel Creation
- Programs and Kernels are read in from source and compiled to binary
    cl_program program[1];
    cl_kernel kernel[1];
    program[0] = clCreateProgramWithSource(context, 1, (const char**)&kernel_source, NULL, &err);
    /* 0 and NULL for the device list: build for all devices in the context */
    err = clBuildProgram(program[0], 0, NULL, NULL, NULL, NULL);
    kernel[0] = clCreateKernel(program[0], "calcGrade", &err);

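40 Execution
- Arguments to the kernel are set and the kernel is executed on all data
    size_t global_work_size[1], local_work_size[1];
    global_work_size[0] = x;
    local_work_size[0] = x / 2;
    /* Kernel arguments are the device buffers, not the host arrays */
    err = clSetKernelArg(kernel[0], 0, sizeof(cl_mem), &ax_mem);
    /* val_mem is the output buffer read back later; its creation is not shown on the slides */
    err = clSetKernelArg(kernel[0], 1, sizeof(cl_mem), &val_mem);
    err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 1, NULL, global_work_size, local_work_size, 0, NULL, NULL);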

41 Read the Result(s)
- We read back the results to the Host
    err = clEnqueueReadBuffer(cmd_queue, val_mem, CL_TRUE, 0, grid_buffer_size, val, 0, NULL, NULL);
Note: If we were working on images, the function clEnqueueReadImage() would be called instead.

42 Clean Up
- Clean up memory, release all OpenCL objects.
- Can check the OpenCL reference counts and ensure they equal zero
    clReleaseMemObject(ax_mem);
    clReleaseMemObject(val_mem);
    clReleaseKernel(kernel[0]);
    clReleaseProgram(program[0]);
    clReleaseCommandQueue(cmd_queue);
    clReleaseContext(context);

43 Advanced Techniques
- Instead of finding the first GPU, we could create a context out of all OpenCL devices, or decide dynamically which dimensions / devices would perform best.
- Debugging can be done more efficiently on the CPU than on a GPU; printf functions will work inside a kernel there

44 Applications

45 Applications
- Raytracing
- Weather forecasting, Climate research
- Physics Simulations
- Computational finance
- Computer Vision
- Signal processing, Speech processing
- Cryptography / Cryptanalysis
- Neural Networks
- Database operations
- …Many more!

46 The Future

47 OpenGL Interoperability
- OpenCL + OpenGL
  - Efficient, inter-API communication
  - OpenCL efficiently shares resources with OpenGL (doesn’t copy)
  - OpenCL objects can be created from OpenGL objects
  - OpenGL 4.0 has been designed to align both standards so they work closely together
- Example Implementation:
  - Vertex and Image data generated with OpenCL
  - Rendered with OpenGL
  - Post-processed with OpenCL Kernels

48 Competitor
- DirectCompute by Microsoft
  - Bundled with DirectX 11
  - Requires a DX10 or DX11 graphics card
  - Requires Windows Vista or 7
  - Close to OpenCL feature-wise
  - Internet Explorer 9 and Firefox 3.7 both use DirectX to speed up DOM tree rendering (Windows only)

49 Overview
- With OpenCL
  - Leverage CPUs, GPUs and other processors to accelerate parallel computation
  - Get dramatic speedups for computationally intensive applications
  - Write accelerated, portable code across different devices and architectures

50 Getting Started…
- ATI Stream SDK
  - Support for OpenCL/OpenGL interoperability
  - Support for OpenCL/DirectX interoperability
- CUDA Toolkit
- OpenCL.NET
  - OpenCL wrapper for .NET languages

51 The End? No… The Beginning


