OpenCL Ryan Renna

Overview  Introduction  History  Anatomy of OpenCL  Execution Model  Memory Model  Implementation  Applications  The Future 2

Goals  Knowledge that is transferable to all APIs  Overview of concepts rather than API specific terminology  Avoid coding examples as much as possible 3

Introduction

What is OpenCL A Language:  Open Computing Language, and it’s C-like!  Executes code across mixed platforms consisting of CPUs, GPUs and other processors. An API:  Runs on the “Host”; manipulates and controls OpenCL objects and code.  Deals with devices as abstract processing units 5

Why Use GPUs?  Modern GPUs are made up of highly parallel processing units, which have been named “Stream Processors”  Modern PCs all have a dedicated GPU which sits idle for most day-to-day processing  This strategy is known as “General-Purpose Computation on Graphics Processing Units”, or GPGPU 6

The Stream Processor  Any device capable of Stream Processing; related to SIMD  Given a set of data (the stream), a series of functions (called kernel functions) is applied to each element  On-chip memory is used to minimize external memory bandwidth Did you know: The Cell processor, developed by Sony, Toshiba & IBM, is a Stream Processor? 7

Streams  Most commonly 2D grids (Textures)  Maps well to Matrix Algebra, Image Processing, Physics simulations, etc Did you know: The latest ATI card has 1600 individual Stream Processors? Did you know: The latest ATI card has 1600 individual Stream Processors? 8

Kernel Functions

Traditional sequential method:

for (int i = 0; i < 100 * 4; i++) {
    result[i] = source0[i] + source1[i];
}

The same process, using the kernel “vector_sum”:

for (int el = 0; el < 100; el++) {
    vector_sum(result[el], source0[el], source1[el]);
}

9
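For comparison, a minimal sketch (not from the original slides) of what vector_sum could look like as an OpenCL C kernel, treating each element as a float4; the argument layout is assumed for illustration:

__kernel void vector_sum(__global const float4* source0,
                         __global const float4* source1,
                         __global float4* result)
{
    int el = get_global_id(0);              // one work-item per 4-float element
    result[el] = source0[el] + source1[el];
}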

An “Open” Computing Language  Machines with multiple CPUs and multiple GPUs, all from different vendors, can work together. 10

History

GPGPU  General-Purpose Computation on Graphical Processing Units  Coined in 2002, with the rise of using GPUs for non-graphics applications  Hardware specific GPGPU APIs have been created : CUDA NVidia 2007 Close To Metal ATI

GPGPU  General-Purpose Computation on Graphical Processing Units  Coined in 2002, with the rise of using GPUs for non-graphics applications  Hardware specific GPGPU APIs have been created : CUDA NVidia 2007 Close To Metal ATI

The next step  OpenCL:  Developed by Apple computers  Collaborated with AMD, Intel, IBM and NVidia to refine the proposal  Submitted to the Khronos Group  The specification for OpenCL 1.0 was finished 5 months later 14

You may remember me from such open standards as…  OpenGL – 2D and 3D graphics API  OpenAL – 3D audio API  OpenGL ES – OpenGL for embedded systems. Used in all smartphones.  COLLADA – XML-based schema for storing 3D assets. 15

Anatomy of OpenCL

API – Platform Layer
 Compute Device – A processor that executes data-parallel programs. Contains Compute Units.
 Compute Unit – A processing element. Example: a core of a CPU.
 Queues – Submit work to a compute device. Can be in-order or out-of-order.
 Context – A collection of compute devices. Enables memory sharing across devices.
 Host – Container of Contexts. Represents the computer itself.
17
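As a rough illustration (assumed code, not from the slides, and assuming a cl_device_id device and cl_int err obtained earlier), the host can ask a Compute Device how many Compute Units it exposes:

cl_uint units;
err = clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                      sizeof(units), &units, NULL);
// e.g. each core of a CPU, or each SIMD engine of a GPU, counts as one compute unit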

Host Example  A host computer with one device group  A Dual-core CPU  A GPU with 8 Stream Processors 18

API – Runtime Layer
 Memory Objects
 Buffers – Blocks of memory, accessed as arrays, pointers or structs
 Images – 2D or 3D images
 Executable Objects
 Kernel – A data-parallel function that is executed by a compute device
 Program – A group of kernels and functions
 Synchronization – Events
Caveat: Each image can be read or written in a kernel, but not both.
19
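A small sketch (assumed code, using the OpenCL 1.0/1.1-era clCreateImage2D call and an existing context and err) of creating one buffer and one image:

cl_image_format fmt = { CL_RGBA, CL_UNORM_INT8 };
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                            1024 * sizeof(float), NULL, &err);
cl_mem img = clCreateImage2D(context, CL_MEM_WRITE_ONLY, &fmt,
                             512, 512, 0, NULL, &err);
// Inside a kernel this image could only be written (write-only), per the caveat above.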

Example Flow (diagram): Compile Code (a Program with a collection of Kernels is built into CPU & GPU binaries) → Create Data & Arguments (Memory Objects: Buffers, Images) → Send to Execution (a Compute Device, via an in-order or out-of-order queue) 20

Execution Model of OpenCL

N-D Space  The N-dimensional computation domain is called the N-D Space; it defines the total number of elements of execution, i.e. the Global Dimensions  Each element of execution, representing an instance of a kernel, is called a work-item  Work-items are grouped into local workgroups, whose size is defined by the Local Dimensions 22

Work-Items  Global work-items don’t belong to a workgroup and run in parallel independently (no synchronization)  Local work-items can be synchronized within a workgroup, and share workgroup memory  Each work-item runs as its own thread  Thousands of lightweight threads can be running at a time, and are managed by the device  Each work-item is assigned a unique id and a local id within its workgroup, and each workgroup is assigned a workgroup id 23
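Inside a kernel these ids are available through built-in functions; a small sketch (not from the slides):

__kernel void show_ids(__global int* out)
{
    size_t gid = get_global_id(0);    // unique id across the whole N-D space
    size_t lid = get_local_id(0);     // id within this work-item's workgroup
    size_t grp = get_group_id(0);     // id of the workgroup itself
    out[gid] = (int)(grp * get_local_size(0) + lid);   // reconstructs gid (zero offset)
}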

Example – Image Filter Executed on a 128 x 128 image, our Global Dimensions are 128, 128. We will have 16,384 work-items in total. We can then define Local Dimensions of 16, 16, which divide the global space evenly. Since workgroups are executed together, and work-items can only be synchronized within workgroups, picking your Global and Local Dimensions is problem specific. In one dimension, if we asked for the local id of work-item 17, we’d receive 1, as it’s the 2nd work-item of the 2nd workgroup. 24
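On the host side, the sizes for this example might be set up roughly like this (a sketch; cmd_queue and filter_kernel are assumed names):

size_t global[2] = { 128, 128 };   // one work-item per pixel: 16,384 in total
size_t local[2]  = { 16, 16 };     // 8 x 8 = 64 workgroups of 256 work-items each
err = clEnqueueNDRangeKernel(cmd_queue, filter_kernel, 2, NULL,
                             global, local, 0, NULL, NULL);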

Memory Model of OpenCL

Memory Model  Private  Per work-item  Local  Shared within a workgroup  Global/Constant  Not synchronized, per device  Host Memory Compute Device Host Host Memory Global / Constant Memory Local Memory.. Compute Unit 1.. Compute Unit 1 Work Item Private Work Item.. Compute Unit 2.. Compute Unit 2 Work Item Private Work Item 26

Intermission 27

Implementation

Identifying Parallelizable Routines  Key thoughts:  Work-items should be independent of each other  Work-items in a workgroup share data, but workgroups execute independently of one another, so they cannot depend on each other’s results  Find tasks that are independent and highly repeated; pay attention to loops  Transferring data over a PCI bus has overhead, so parallelization is only justified for large data sets, or ones with lots of mathematical computation 29

30 An Example – Class Average  Let’s imagine we were writing an application that computed the class average  There are two tasks we’d need to perform:  Compute the final grade for each student  Obtain a class average by averaging the final grades

Pseudo Code 32

 Compute the final grade for each student

foreach (student in class) {
    grades = student.getGrades();
    sum = 0;
    count = 0;
    foreach (grade in grades) {
        sum += grade;
        count++;
    }
    student.averageGrade = sum / count;
}

Pseudo Code 33

 The per-student loop body above can be isolated into a kernel:

__kernel void calcGrade(__global const float* input, __global float* output)
{
    int i = get_global_id(0);
    // Do work on class[i]
}
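A fuller sketch of what calcGrade could become, assuming each student’s grades are packed into one flat array with numGrades entries per student (the data layout is an assumption for illustration):

__kernel void calcGrade(__global const float* grades,
                        __global float* averages,
                        const int numGrades)
{
    int student = get_global_id(0);              // one work-item per student
    float sum = 0.0f;
    for (int g = 0; g < numGrades; g++)
        sum += grades[student * numGrades + g];
    averages[student] = sum / numGrades;
}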

Determining the Data Dimensions  First decide how to represent your problem; this will tell you the dimensionality of your Global and Local dimensions.  Global dimensions are problem specific  Local dimensions are algorithm specific  Local dimensions must have the same number of dimensions as Global  Local dimensions must divide the global space evenly  Passing NULL as the workgroup size argument will let OpenCL pick the most efficient setup, but no synchronization will be possible between work-items 34
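For example, letting the runtime pick the workgroup size is just a NULL in the enqueue call (a sketch; the variable names are assumed from the later slides):

err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 1, NULL,
                             global_work_size, NULL,   // NULL local size: runtime decides
                             0, NULL, NULL);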

Execution Steps  An OpenCL calculation needs to perform 6 key steps:  Initialization  Allocate Resources  Create Programs/Kernels  Execution  Read the Result(s)  Clean Up Warning! Code Ahead 35

Initialization  Store the kernel in a string/char array

const char* kernel_source =
    "__kernel void calcGrade(__global const float* input,  \n"
    "                        __global float* output)       \n"
    "{                                                      \n"
    "    int i = get_global_id(0);                          \n"
    "    // Do work on class[i]                             \n"
    "}                                                      \n";

36

Initialization  Selecting a device and creating a context in which to run the calculation

cl_int err;
cl_context context;
cl_device_id device;
cl_command_queue cmd_queue;

err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &device, NULL);  // NULL: default platform
context = clCreateContext(0, 1, &device, NULL, NULL, &err);
cmd_queue = clCreateCommandQueue(context, device, 0, NULL);

37

Allocation  Allocate memory/storage that will be used on the device and push the data to the device

cl_mem ax_mem = clCreateBuffer(context, CL_MEM_READ_ONLY, atom_buffer_size, NULL, NULL);
err = clEnqueueWriteBuffer(cmd_queue, ax_mem, CL_TRUE, 0, atom_buffer_size,
                           (void*)values, 0, NULL, NULL);

38

Program/Kernel Creation  Programs are created from source and built into device binaries; kernels are then created from the program by name

cl_program program[1];
cl_kernel kernel[1];

program[0] = clCreateProgramWithSource(context, 1, (const char**)&kernel_source, NULL, &err);
err = clBuildProgram(program[0], 0, NULL, NULL, NULL, NULL);
kernel[0] = clCreateKernel(program[0], "calcGrade", &err);

39

Execution  Arguments to the kernel are set and the kernel is executed over all the data

size_t global_work_size[1], local_work_size[1];
global_work_size[0] = x;
local_work_size[0] = x / 2;

err = clSetKernelArg(kernel[0], 0, sizeof(cl_mem), &ax_mem);  // buffer args take the cl_mem object
err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 1, NULL,
                             global_work_size, local_work_size, 0, NULL, NULL);

40

Read the Result(s)  We read back the results to the Host

err = clEnqueueReadBuffer(cmd_queue, val_mem, CL_TRUE, 0, grid_buffer_size,
                          val, 0, NULL, NULL);

Note: If we were working on images, the function clEnqueueReadImage() would be called instead.
41

Clean Up  Clean up memory and release all OpenCL objects  Can check the OpenCL reference counts and ensure they equal zero

clReleaseKernel(kernel[0]);
clReleaseProgram(program[0]);
clReleaseCommandQueue(cmd_queue);
clReleaseContext(context);

42

Advanced Techniques  Instead of finding the first GPU, we could create a context out of all OpenCL devices, or dynamically choose the devices and dimensions that would perform best for the problem  Debugging can be done more efficiently on the CPU than on a GPU; printf calls will work inside a kernel there 43
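A rough sketch (assumed code, requiring stdlib.h and an existing platform id and err) of building a context from every available device instead of just the first GPU:

cl_uint num_devices;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);
cl_device_id* all_devices = (cl_device_id*)malloc(num_devices * sizeof(cl_device_id));
clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, num_devices, all_devices, NULL);
cl_context ctx = clCreateContext(NULL, num_devices, all_devices, NULL, NULL, &err);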

Applications

 Raytracing  Weather forecasting, Climate research  Physics Simulations  Computational finance  Computer Vision  Signal processing, Speech processing  Cryptography / Cryptanalysis  Neural Networks  Database operations  …Many more! 45

The Future

OpenGL Interoperability  OpenCL + OpenGL  Efficient, inter-API communication  OpenCL efficiently shares resources with OpenGL (it doesn’t copy them)  OpenCL objects can be created from OpenGL objects  OpenGL 4.0 was designed to align the two standards so they work closely together  Example Implementation: Vertex and image data generated with OpenCL, rendered with OpenGL, post-processed with OpenCL kernels 47
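A sketch of the sharing pattern (assumed names; the context must have been created with GL-sharing properties): an existing OpenGL vertex buffer object gl_vbo is wrapped and used without copying.

cl_mem vbo_cl = clCreateFromGLBuffer(context, CL_MEM_READ_WRITE, gl_vbo, &err);
clEnqueueAcquireGLObjects(cmd_queue, 1, &vbo_cl, 0, NULL, NULL);
// ... enqueue kernels that generate vertex data directly into vbo_cl ...
clEnqueueReleaseGLObjects(cmd_queue, 1, &vbo_cl, 0, NULL, NULL);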

Competitor  DirectCompute by Microsoft  Bundled with DirectX 11  Requires a DX10 or 11 graphic card  Requires Windows Vista or 7  Close to OpenCL feature wise  Internet Explorer 9 and Firefox 3.7 both use DirectX to speed up dom tree rendering (Windows Only) 48

Overview  With OpenCL  Leverage CPUs, GPUs and other processors to accelerate parallel computation  Get dramatic speedups for computationally intensive applications  Write accelerated portable code across different devices and architectures 49

Getting Started…  ATI Stream SDK  Support for OpenCL/OpenGL interoperability  Support for OpenCL/DirectX interoperability   Cuda Toolkit   OpenCL.NET  OpenCL Wrapper for.NET languages 

The End? No… The Beginning 51