CS/EE 217 GPU Architecture and Parallel Programming Lecture 22: Introduction to OpenCL

Objective
To understand the OpenCL programming model: basic concepts and data types, the OpenCL application programming interface (basics), and simple examples that illustrate basic concepts and functionalities.
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Programs
An OpenCL "program" contains one or more "kernels" and any supporting routines that run on a target device. An OpenCL kernel is the basic unit of parallel code that can be executed on a target device.
[Figure: an OpenCL program containing Kernel A, Kernel B, Kernel C, and miscellaneous support functions]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Execution Model
An integrated host+device application: serial or modestly parallel parts run in host C code, while highly parallel parts run in device SPMD kernel code.
[Figure: alternation of serial host code and parallel device kernels, shown in CUDA-style launch notation: Serial Code (host) ... KernelA<<<nBlk, nTid>>>(args) ... Serial Code (host) ... KernelB<<<nBlk, nTid>>>(args)]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Kernels
Code that actually executes on target devices. The kernel body is instantiated once for each work item. An OpenCL work item is equivalent to a CUDA thread, and each work item gets a unique index.

__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *result) {
    int id = get_global_id(0);
    result[id] = a[id] + b[id];
}

© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

Array of Parallel Work Items
An OpenCL kernel is executed by an array of work items. All work items run the same code (SPMD). Each work item has an index that it uses to compute memory addresses and make control decisions.

    int id = get_global_id(0);
    result[id] = a[id] + b[id];

[Figure: work items 0 through 7 each executing this code with a different id]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

Work Groups: Scalable Cooperation
Divide the monolithic work item array into work groups. Work items within a work group cooperate via shared (local) memory, atomic operations, and barrier synchronization. Work items in different work groups cannot cooperate.
[Figure: work items 0-7 form work group 0, 8-15 form work group 1, ..., 56-63 form work group 7; every work item executes int id = get_global_id(0); result[id] = a[id] + b[id];]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010
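To make the indexing concrete, here is a small kernel sketch (not from the original slides; names are illustrative) that records how a work item's global, group, and local indices relate. For a 1-D NDRange, get_global_id(0) equals get_group_id(0) * get_local_size(0) + get_local_id(0).

__kernel void show_ids(__global int *global_ids,
                       __global int *group_ids,
                       __global int *local_ids) {
    int gid = get_global_id(0);          // index into the whole NDRange
    group_ids[gid]  = get_group_id(0);   // which work group this item belongs to
    local_ids[gid]  = get_local_id(0);   // index within that work group
    global_ids[gid] = gid;
}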

OpenCL

Multidimensional Work Indexing
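As a hedged illustration of multidimensional indexing (not from the slides), a 2-D NDRange lets each work item index a matrix element directly:

__kernel void madd(__global const float *a, __global const float *b,
                   __global float *c, int width) {
    int col = get_global_id(0);   // NDRange dimension 0
    int row = get_global_id(1);   // NDRange dimension 1
    c[row * width + col] = a[row * width + col] + b[row * width + col];
}

On the host, the launch passes work_dim = 2 with two-element global (and optionally local) size arrays, e.g. size_t Gsz[2] = {width, height}; size_t Bsz[2] = {16, 16};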

OpenCL Data Parallel Model Summary
Parallel work is submitted to devices by launching kernels. Kernels run over a global dimension index range (NDRange), broken up into "work groups" and "work items". Work items executing within the same work group can synchronize with each other with barriers or memory fences. Work items in different work groups cannot synchronize with each other, except by launching a new kernel.
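As a concrete sketch of the NDRange decomposition (illustrative sizes, not from the slides), a 1-D launch of 1024 work items with a work group size of 64 produces 16 work groups:

size_t global_size = 1024;   // total work items in the NDRange
size_t local_size  = 64;     // work items per work group (1024 / 64 = 16 groups)
clerr = clEnqueueNDRangeKernel(clcmdq, clkern, 1, NULL,
                               &global_size, &local_size, 0, NULL, NULL);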

OpenCL Host Code
Prepares and triggers device code execution: creates and manages device context(s) and associated work queue(s), performs memory allocations and memory copies, and launches kernels. OpenCL programs are normally compiled entirely at runtime, which must be managed by host code.
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Hardware Abstraction
OpenCL exposes CPUs, GPUs, and other accelerators as "devices". Each "device" contains one or more "compute units", i.e. cores, SMs, etc. Each "compute unit" contains one or more SIMD "processing elements".
[Figure: an OpenCL device composed of compute units, each containing processing elements (PEs)]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010
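The compute unit count of a device can be queried at runtime. A minimal sketch (not from the slides), assuming a cldevs device list like the one obtained in the context setup code shown later in this lecture:

cl_uint num_cu;
clerr = clGetDeviceInfo(cldevs[0], CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(num_cu), &num_cu, NULL);
printf("Device 0 exposes %u compute units\n", num_cu);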

An Example of Physical Reality Behind the OpenCL Abstraction
[Figure: a CPU (host) connected to a GPU with local DRAM (device)]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Context
A context contains one or more devices. OpenCL memory objects are associated with a context, not with a specific device. clCreateBuffer() is the main data object allocation function; it is an error if an allocation is too large for any device in the context. Each device needs its own work queue(s), and memory transfers are associated with a command queue (and thus with a specific device).
[Figure: an OpenCL context containing two OpenCL devices]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Context Setup Code (simple)

cl_int clerr = CL_SUCCESS;
cl_context clctx = clCreateContextFromType(0, CL_DEVICE_TYPE_ALL,
                                           NULL, NULL, &clerr);

size_t parmsz;
clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES, 0, NULL, &parmsz);

cl_device_id* cldevs = (cl_device_id *) malloc(parmsz);
clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES, parmsz, cldevs, NULL);

cl_command_queue clcmdq = clCreateCommandQueue(clctx, cldevs[0], 0, &clerr);

© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010
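The call above passes 0 (NULL) as the properties argument, leaving the platform choice to the implementation; many implementations expect an explicit platform. A minimal sketch of that variant (not on the original slide):

cl_platform_id platform;
clerr = clGetPlatformIDs(1, &platform, NULL);   // take the first available platform
cl_context_properties props[] =
    { CL_CONTEXT_PLATFORM, (cl_context_properties) platform, 0 };
cl_context clctx = clCreateContextFromType(props, CL_DEVICE_TYPE_ALL,
                                           NULL, NULL, &clerr);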

OpenCL Memory Model Overview
Global memory is the main means of communicating R/W data between host and device. Its contents are visible to all threads, and accesses have long latency. We will focus on global memory for now.
[Figure: an NDRange containing work groups, each with local memory and per-thread private memory; the host attaches to global memory]

OpenCL Device Memory Allocation
clCreateBuffer()
- Allocates an object in the device global memory
- Returns a handle (cl_mem) to the object
- Requires five parameters: the OpenCL context, flags for the access type by the device, the size of the allocated object, a host memory pointer (if used in copy-from-host mode), and a pointer for the returned error code
clReleaseMemObject()
- Frees the object, given its handle
[Figure: the CUDA-style memory hierarchy diagram: grid, blocks with shared memory and registers, threads, global memory, host]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Device Memory Allocation (cont.)
Code example: allocate a 1024-element single precision float array and attach the allocated storage to d_a ("d_" is often used to indicate a device data structure).

const int VECTOR_SIZE = 1024;
cl_mem d_a;
int size = VECTOR_SIZE * sizeof(float);
d_a = clCreateBuffer(clctx, CL_MEM_READ_ONLY, size, NULL, NULL);
clReleaseMemObject(d_a);

© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Device Command Execution
[Figure: an application submits commands into per-device command queues; each command queue feeds one OpenCL device within an OpenCL context]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010
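Commands execute asynchronously with respect to the host once enqueued. As a brief sketch (not on the slide), the host can force submission and wait for completion of everything in a queue:

clFlush(clcmdq);    // submit all enqueued commands to the device
clFinish(clcmdq);   // block until every command in the queue has completed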

OpenCL Host-to-Device Data Transfer
clEnqueueWriteBuffer(): memory data transfer to the device. Requires nine parameters:
- OpenCL command queue
- Destination OpenCL memory buffer
- Blocking flag
- Offset in bytes
- Size (in bytes) of the written data
- Host memory pointer
- Number and list of events to be completed before execution of this command
- Event object tied to this command (for asynchronous transfer later)
[Figure: the CUDA-style memory hierarchy diagram: grid, blocks with shared memory and registers, threads, global memory, host]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Device-to-Host Data Transfer
clEnqueueReadBuffer(): memory data transfer to the host. Requires nine parameters:
- OpenCL command queue
- Source OpenCL memory buffer
- Blocking flag
- Offset in bytes
- Size (in bytes) of the read data
- Destination host memory pointer
- Number and list of events to be completed before execution of this command
- Event object tied to this command (for asynchronous transfer later)
The blocking flag is a boolean that specifies when control returns from the function. If it is TRUE (a blocking call), control will not return until the data has been read and copied from device memory. If it is FALSE, control may return and the following statements may be executed, but there is no guarantee that the copy from device memory has completed. The same applies to clEnqueueWriteBuffer, except that it is a write operation. The offset specifies the offset in bytes from the start of the buffer at which the data is read or written. The event object (last parameter) can be tested to see whether this command has completed before proceeding with other operations.
[Figure: the CUDA-style memory hierarchy diagram: grid, blocks with shared memory and registers, threads, global memory, host]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Host-Device Data Transfer (cont.)
Code example: transfer a 64 x 64 single precision float array; a is in host memory and d_a is in device memory.

int mem_size = 64 * 64 * sizeof(float);
clEnqueueWriteBuffer(clcmdq, d_a, CL_FALSE, 0, mem_size,
                     (const void *) a, 0, NULL, NULL);
clEnqueueReadBuffer(clcmdq, d_result, CL_FALSE, 0, mem_size,
                    (void *) host_result, 0, NULL, NULL);

© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010
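Both calls above use CL_FALSE (non-blocking), so the host must not touch host_result until the read has completed. A minimal sketch of enforcing that with an event (not on the original slide):

cl_event read_done;
clEnqueueReadBuffer(clcmdq, d_result, CL_FALSE, 0, mem_size,
                    (void *) host_result, 0, NULL, &read_done);
clWaitForEvents(1, &read_done);   // host_result is safe to use after this returns
clReleaseEvent(read_done);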

OpenCL Host-Device Data Transfer (cont.)
clCreateBuffer and clEnqueueWriteBuffer can be combined into a single command using special flags, e.g.:

d_A = clCreateBuffer(clctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                     mem_size, h_A, NULL);

This combines two flags; CL_MEM_COPY_HOST_PTR is to be used only if a valid host pointer is specified. The call creates a memory buffer on the device and copies data from h_A into d_A. It includes an implicit clEnqueueWriteBuffer operation for all devices/command queues tied to the context clctx, and typically translates to a single createBuffer operation for that context followed by clEnqueueWriteBuffer operations for all command queues tied to it. Thus, for multiple devices tied to that context, using this command will copy the data to all devices. The spec places no restriction on the nature of this call (blocking or not), so it is up to the driver to choose what to do. Since the blocking behaviour is not defined and, more importantly, there are no events tied to this command, it is best to reserve its use for read-only data.

OpenCL Memories
__global – large, long latency
__private – on-chip device registers
__local – memory accessible from multiple PEs or work items; may be SRAM or DRAM, must query…
__constant – read-only constant cache
Device memory is managed explicitly by the programmer, as with CUDA. Pinned memory buffer allocations are created using the CL_MEM_USE_HOST_PTR flag.
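As a hedged sketch of the qualifiers in use (not from the slides; names are illustrative), each work group below stages a tile of its input into __local memory, synchronizes with a barrier, then scales the tile by a __constant factor:

__kernel void scale_tile(__global const float *in,
                         __global float *out,
                         __constant float *factor,
                         __local float *tile) {
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    tile[lid] = in[gid];              // stage into on-chip local memory
    barrier(CLK_LOCAL_MEM_FENCE);     // all work items in the group see the staged tile
    out[gid] = tile[lid] * factor[0];
}

On the host, the __local argument is sized (but not backed by a buffer) with clSetKernelArg(clkern, 3, local_size * sizeof(float), NULL).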

OpenCL Kernel Execution Launch
[Figure: the application enqueues kernel launch commands into per-device command queues within an OpenCL context]
© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010

OpenCL Kernel Compilation Example
The OpenCL kernel source code is given to OpenCL as one big raw string (or several strings):

const char* vaddsrc =
    "__kernel void vadd(__global float *d_A, __global float *d_B, __global float *d_C, int N) { \n"
    […etc and so forth…]

Set compiler flags, compile the source, and retrieve a handle to the "vadd" kernel:

cl_program clpgm;
clpgm = clCreateProgramWithSource(clctx, 1, &vaddsrc, NULL, &clerr);

char clcompileflags[4096];
sprintf(clcompileflags, "-cl-mad-enable");
clerr = clBuildProgram(clpgm, 0, NULL, clcompileflags, NULL, NULL);
cl_kernel clkern = clCreateKernel(clpgm, "vadd", &clerr);

© Wen-mei W. Hwu and John Stone, Urbana July 22, 2010
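When clBuildProgram fails, the compiler's diagnostics can be retrieved from the build log. A brief sketch (not on the slide), reusing the cldevs array from the context setup code:

if (clerr != CL_SUCCESS) {
    size_t logsz;
    clGetProgramBuildInfo(clpgm, cldevs[0], CL_PROGRAM_BUILD_LOG, 0, NULL, &logsz);
    char *log = (char *) malloc(logsz);
    clGetProgramBuildInfo(clpgm, cldevs[0], CL_PROGRAM_BUILD_LOG, logsz, log, NULL);
    fprintf(stderr, "Build log:\n%s\n", log);
    free(log);
}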

Summary: Host code for vadd

int main() {
    // allocate and initialize host (CPU) memory
    float *h_A = …, *h_B = …;

    // allocate device (GPU) memory
    cl_mem d_A, d_B, d_C;
    d_A = clCreateBuffer(clctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                         N * sizeof(float), h_A, NULL);
    d_B = clCreateBuffer(clctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                         N * sizeof(float), h_B, NULL);
    d_C = clCreateBuffer(clctx, CL_MEM_WRITE_ONLY,
                         N * sizeof(float), NULL, NULL);

    clkern = clCreateKernel(clpgm, "vadd", NULL);
    clerr = clSetKernelArg(clkern, 0, sizeof(cl_mem), (void *)&d_A);
    clerr = clSetKernelArg(clkern, 1, sizeof(cl_mem), (void *)&d_B);
    clerr = clSetKernelArg(clkern, 2, sizeof(cl_mem), (void *)&d_C);
    clerr = clSetKernelArg(clkern, 3, sizeof(int), &N);

No error-handling code is shown in this example (due to lack of space, as it is quite verbose), but it is very important, especially when debugging errors. I will demonstrate an example during the lab introduction.
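Since the note above omits error handling, one common pattern is a small checking helper. A sketch (illustrative; the macro name is not from the slides):

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>   /* on macOS: <OpenCL/opencl.h> */

#define CL_CHECK(err, msg)                                                \
    do {                                                                  \
        if ((err) != CL_SUCCESS) {                                        \
            fprintf(stderr, "%s failed (error %d)\n", (msg), (int)(err)); \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

Used after each API call, e.g. CL_CHECK(clerr, "clSetKernelArg(0)");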

Summary of Host Code (cont.)

    cl_event event = NULL;
    clerr = clEnqueueNDRangeKernel(clcmdq, clkern, 2, NULL,
                                   Gsz, Bsz, 0, NULL, &event);
    clerr = clWaitForEvents(1, &event);
    clEnqueueReadBuffer(clcmdq, d_C, CL_TRUE, 0, N * sizeof(float),
                        h_C, 0, NULL, NULL);
    clReleaseMemObject(d_A);
    clReleaseMemObject(d_B);
    clReleaseMemObject(d_C);
}
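The slides release only the memory objects; a fuller teardown (a sketch, not on the slides) would also release the kernel, program, command queue, and context created earlier:

clReleaseKernel(clkern);
clReleaseProgram(clpgm);
clReleaseCommandQueue(clcmdq);
clReleaseContext(clctx);
free(cldevs);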

Any MORE Questions? Read Chapter 14