Portability with OpenCL

High Performance Landscape

High performance computing is trending towards large numbers of cores and accelerators. It would be nice to write your code once and have it be portable across these platforms. The selection of a programming model can impact your portability.

Platforms

- Multi-core CPU: an 8-core Sandy Bridge with AVX instructions has 64 GPU core-equivalents (8 cores × 8-wide single-precision AVX lanes); each core can issue 5 instructions per clock cycle and can store the state of two threads. A standard CPU will execute at a higher clock rate than a standard GPU.
- The Intel Xeon Phi coprocessor has up to 61 cores, each of which performs 16 single-precision operations in a single instruction, or 976 GPU core-equivalents.
- An NVIDIA Kepler GPU (SMX architecture) has up to 2880 GPU cores.
- An AMD Radeon GPU has up to 32 compute units that can each issue 4 instructions per cycle, which could be treated as 128 cores.

CPU + Accelerator Architecture (figure)

Memory Management

CUDA C requires the programmer to allocate device memory and explicitly copy data between host and device. OpenCL requires the programmer to allocate buffers and copy data to the buffers; it hides some important details, however, in that it doesn't expose exactly where a buffer lives at various points during program execution. OpenACC allows the programmer to rely entirely on the compiler for memory management to get started, but offers optional data constructs and clauses to control and optimize when data is allocated on the device and moved between host and device.
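As a minimal sketch of the contrast (the array names, N, and the computation are illustrative, not taken from these notes): CUDA C spells out allocation and copies explicitly, while the OpenACC version delegates data movement to the compiler, tuned by optional data clauses.

// CUDA C: explicit device allocation and host-device copies
float *d_a;
cudaMalloc((void **)&d_a, N * sizeof(float));
cudaMemcpy(d_a, a, N * sizeof(float), cudaMemcpyHostToDevice);
// ... launch a kernel that works on d_a ...
cudaMemcpy(a, d_a, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_a);

// OpenACC: the compiler manages movement; data clauses optionally tune it
#pragma acc data copyin(a[0:N]) copyout(b[0:N])
{
    #pragma acc parallel loop
    for (int i = 0; i < N; i++)
        b[i] = 2.0f * a[i];
}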

Parallelism Scheduling

CUDA and OpenCL have thread, block, and grid abstractions. OpenMP provides core control, but most of the process is automated. OpenACC exposes three levels of parallelism: gang, worker, and vector parallelism.
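For illustration, a hedged OpenACC sketch of the three levels on a nested loop (the loop bounds and arrays are assumptions for the example); gang roughly corresponds to a CUDA block / OpenCL work group, and vector to a CUDA thread / OpenCL work item:

#pragma acc parallel loop gang      // gang parallelism: ~ CUDA blocks / OpenCL work groups
for (int i = 0; i < N; i++) {
    #pragma acc loop worker vector  // worker and vector parallelism: ~ CUDA threads
    for (int j = 0; j < M; j++)
        c[i][j] = a[i][j] + b[i][j];
}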

Multithreading

All three devices require oversubscription to keep the compute units busy; that is, the program must expose extra (slack) parallelism so a compute unit can swap in another active thread when a thread stalls on memory or other long-latency operations. In CUDA and OpenCL, slack parallelism comes from creating more blocks or work groups than there are cores. OpenMP allocates multiple iterations per core. OpenACC worker-level parallelism is intended to address this issue directly: on GPUs, iterations of a worker-parallel loop will run on the same core.
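A hedged OpenCL sketch of choosing launch sizes with slack parallelism (the work-group size of 256 and the factor of 8 are illustrative choices, not from these notes; GPUDevices is the device list created later in these notes):

// Query the number of compute units, then launch several work groups per
// compute unit so a stalled work group can be swapped out for another
cl_uint computeUnits;
clGetDeviceInfo(GPUDevices[0], CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(computeUnits), &computeUnits, NULL);
size_t localWorkSize  = 256;
size_t globalWorkSize = computeUnits * localWorkSize * 8;  // oversubscription factor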

Caching and Scratchpad Memories

In CUDA and OpenCL, the programmer must manage the scratchpad memory explicitly, using CUDA __shared__ or OpenCL __local memory. OpenACC instead has a cache directive that lets the programmer tell the implementation which data has enough reuse to be cached locally.
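A minimal OpenCL sketch of explicit scratchpad use (the kernel name and the tile size of 256 are assumptions, and a work-group size of at most 256 is presumed); CUDA code would declare the tile __shared__ instead, and OpenACC would express the same intent with its cache directive:

__kernel void scaleTile(__global const float *in, __global float *out) {
    __local float tile[256];          // scratchpad memory (CUDA: __shared__)
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);
    tile[lid] = in[gid];              // stage data into fast local memory
    barrier(CLK_LOCAL_MEM_FENCE);     // make the tile visible to the whole work group
    out[gid] = 2.0f * tile[lid];
}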

Portability

There are three levels of portability. First is language portability: a programmer can use the same language to write programs for different targets, even if the programs themselves must differ. Second is functional portability: a programmer can write one program that will run on different targets, though not all targets will get the best performance. Third is performance portability: a programmer can write one program that gives good performance across many targets.

Portability (continued)

CUDA provides reasonable portability across NVIDIA devices, but there is no pretense that it provides cross-vendor portability, or even performance portability of CUDA source code. OpenCL is designed to provide language and functional portability; research has demonstrated that even across similar devices, such as NVIDIA and AMD GPUs, retuning or rewriting a program can have a significant impact on performance. OpenACC is also intended to provide performance portability across devices, and there is some initial evidence to support this claim.

OpenCL to CUDA Data Parallelism Model Mapping

OpenCL Parallelism Concept    CUDA Equivalent
kernel                        kernel
host program                  host program
NDRange (index space)         grid
work item                     thread
work group                    block
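To make the mapping concrete, here is the CUDA analogue of the vectorAdd OpenCL kernel used later in these notes (the launch configuration assumes N is a multiple of 256):

// CUDA kernel: each thread computes one element, as each work item does in OpenCL
__global__ void vectorAdd(const int *a, const int *b, int *c) {
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;  // = get_global_id(0)
    c[gid] = a[gid] + b[gid];
}

// Launch a grid of blocks (an NDRange of work groups in OpenCL terms):
// vectorAdd<<<N / 256, 256>>>(d_a, d_b, d_c);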

Overview of OpenCL Execution Model (figure)

Mapping of OpenCL Dimensions and Indices to CUDA

- get_global_id(0): global index of the work item in the x dimension (CUDA: blockIdx.x * blockDim.x + threadIdx.x)
- get_local_id(0): local index of the work item within the work group in the x dimension (CUDA: threadIdx.x)
- get_global_size(0): size of the NDRange in the x dimension (CUDA: gridDim.x * blockDim.x)
- get_local_size(0): size of each work group in the x dimension (CUDA: blockDim.x)

Conceptual OpenCL Device Architecture (figure)

Mapping OpenCL Memory Types to CUDA

OpenCL Memory Type    CUDA Equivalent
global memory         global memory
constant memory       constant memory
local memory          shared memory
private memory        local memory

OpenCL Context for Device Management (figure)

Structure of an OpenCL main program:

1. Get information about the platform and the devices available on the system
2. Select devices to use
3. Create an OpenCL context
4. Create an OpenCL command queue
5. Create memory buffers on the device
6. Transfer data from host to device memory buffers
7. Create a kernel program object
8. Build (compile) the kernel in-line (or load a precompiled binary)
9. Create an OpenCL kernel object
10. Set kernel arguments
11. Execute the kernel
12. Read device memory and copy the results to host memory

18 Platform "The host plus a collection of devices managed by the OpenCL framework that allow an application to share resources and execute kernels on devices in the platform." Platforms represented by a cl_platform object, initialized with clGetPlatformID()

Simple code for identifying the platform:

// Platform
cl_platform_id platform;
clGetPlatformIDs(1,          // number of platform entries
                 &platform,  // returned list of OpenCL platform IDs (here just one)
                 NULL);      // if non-NULL, receives the number of platforms available

Context

"The environment within which the kernels execute and the domain in which synchronization and memory management is defined. The context includes a set of devices, the memory accessible to those devices, the corresponding memory properties and one or more command-queues used to schedule execution of a kernel(s) or operations on memory objects." -- The OpenCL Specification, version 1.1

Code for creating a context:

// Context
cl_context_properties props[3];
props[0] = (cl_context_properties) CL_CONTEXT_PLATFORM;
props[1] = (cl_context_properties) platform;
props[2] = (cl_context_properties) 0;
cl_context GPUContext = clCreateContextFromType(props, CL_DEVICE_TYPE_GPU,
                                                NULL, NULL, NULL);

// Context info: query the devices in the context
size_t ParmDataBytes;
clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, 0, NULL, &ParmDataBytes);
cl_device_id* GPUDevices = (cl_device_id*) malloc(ParmDataBytes);
clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, ParmDataBytes, GPUDevices, NULL);

Command Queue

"An object that holds commands that will be executed on a specific device. The command-queue is created on a specific device in a context. Commands to a command-queue are queued in-order but may be executed in-order or out-of-order. ..." -- The OpenCL Specification, version 1.1

Simple code for creating a command queue:

// Create command queue
cl_command_queue GPUCommandQueue =
    clCreateCommandQueue(GPUContext, GPUDevices[0], 0, NULL);

Allocating memory on the device

Use clCreateBuffer:

cl_mem clCreateBuffer(
    cl_context   context,       // OpenCL context, from clCreateContextFromType()
    cl_mem_flags flags,         // bit field specifying type of allocation/usage
                                // (CL_MEM_READ_WRITE, ...)
    size_t       size,          // number of bytes in the buffer memory object
    void        *host_ptr,      // pointer to buffer data (may be previously allocated)
    cl_int      *errcode_ret);  // receives an error code if an error occurs
// Returns a memory object.

Sample code for allocating device memory for the source data:

// Source data on host, two vectors
int *A, *B;
A = new int[N];
B = new int[N];
for (int i = 0; i < N; i++) {
    A[i] = rand() % 1000;
    B[i] = rand() % 1000;
}
...
// Allocate GPU memory for the source vectors and copy the host data into it
cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof(int) * N, A, NULL);
cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof(int) * N, B, NULL);

Sample code for allocating device memory for the results:

// Allocate GPU memory for the output vector
cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY,
                                        sizeof(int) * N, NULL, NULL);

Kernel Program

Simple programs might keep the kernel in the same file as the host code, as in our CUDA examples. In that case the kernel needs to be formed into strings in a character array. If it is in a separate file, that file can be read into the host program as a character string.

Kernel program

const char* OpenCLSource[] = {
    "__kernel void vectorAdd (const __global int* a,",
    "                         const __global int* b,",
    "                         __global int* c)",
    "{",
    "    unsigned int gid = get_global_id(0);",
    "    c[gid] = a[gid] + b[gid];",
    "}"
};
...
int main(int argc, char **argv) {
    ...
}

If the kernel is in the same program as the host code, it needs to be given as strings (I think it can be a single string). __kernel is the OpenCL qualifier that marks kernel code; __global marks kernel memory (memory objects allocated from the global memory pool); the double underscores are optional in OpenCL qualifiers. get_global_id(0) returns the global work-item ID in the given dimension (0 here).

Kernel in a separate file:

// Load the kernel source code into the array source_str
FILE *fp;
char *source_str;
size_t source_size;

fp = fopen("vector_add_kernel.cl", "r");
if (!fp) {
    fprintf(stderr, "Failed to load kernel.\n");
    exit(1);
}
source_str = (char*) malloc(MAX_SOURCE_SIZE);
source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
fclose(fp);

Create kernel program object:

const char* OpenCLSource[] = { ... };

int main(int argc, char **argv)
...
// Create OpenCL program object
cl_program OpenCLProgram = clCreateProgramWithSource(
    GPUContext,
    7,             // number of strings in the kernel program array
    OpenCLSource,
    NULL,          // string lengths; used if the strings are not null-terminated
    NULL);         // receives an error code if an error occurs

This example uses a single file for both host and kernel code. clCreateProgramWithSource() can equally be used with a separate kernel file read into the host program.

Build kernel program:

// Build the program (OpenCL JIT compilation)
clBuildProgram(OpenCLProgram,  // program object from clCreateProgramWithSource()
               0,              // number of devices
               NULL,           // list of devices, if more than one
               NULL,           // build options
               NULL,           // function pointer to a notification routine called when
                               // the build completes; if given, clBuildProgram returns
                               // immediately, otherwise only when the build is complete
               NULL);          // arguments for the notification routine

Creating kernel objects:

// Create a handle to the compiled OpenCL function
cl_kernel OpenCLVectorAdd = clCreateKernel(
    OpenCLProgram,  // built program from clBuildProgram()
    "vectorAdd",    // name of the function with the __kernel qualifier
    NULL);          // receives an error code if an error occurs

Set kernel arguments:

// Set kernel arguments: kernel object from clCreateKernel(), argument index,
// size of the argument, pointer to the argument's data (from clCreateBuffer())
clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem), (void*)&GPUVector1);
clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&GPUVector2);
clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&GPUOutputVector);

Enqueue a command to execute the kernel on the device:

// Launch the kernel
size_t WorkSize[1] = {N};         // total number of work items
size_t localWorkSize[1] = {256};  // number of work items in each work group

clEnqueueNDRangeKernel(GPUCommandQueue,
                       OpenCLVectorAdd,  // kernel object from clCreateKernel()
                       1,                // number of dimensions of the work items
                       NULL,             // global work offset
                       WorkSize,         // array giving the number of global work items
                       localWorkSize,    // array giving the number of work items
                                         // that make up a work group
                       0,                // number of events to complete before this command
                       NULL,             // event wait list
                       NULL);            // event

Function to copy from a buffer object to host memory

The following function enqueues commands to read from a buffer object into host memory:

cl_int clEnqueueReadBuffer(cl_command_queue command_queue,
                           cl_mem buffer,
                           cl_bool blocking_read,
                           size_t offset,
                           size_t cb,
                           void *ptr,
                           cl_uint num_events_in_wait_list,
                           const cl_event *event_wait_list,
                           cl_event *event)

-- The OpenCL Specification, version 1.1

Function to copy from host memory to a buffer object

The following function enqueues commands to write to a buffer object from host memory:

cl_int clEnqueueWriteBuffer(cl_command_queue command_queue,
                            cl_mem buffer,
                            cl_bool blocking_write,
                            size_t offset,
                            size_t cb,
                            const void *ptr,
                            cl_uint num_events_in_wait_list,
                            const cl_event *event_wait_list,
                            cl_event *event)

-- The OpenCL Specification, version 1.1
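A hedged usage sketch, mirroring the read on the next slide: write the host array A into the device buffer GPUVector1 created earlier (blocking write):

clEnqueueWriteBuffer(GPUCommandQueue,  // command queue from clCreateCommandQueue()
                     GPUVector1,       // device buffer from clCreateBuffer()
                     CL_TRUE,          // write is blocking
                     0,                // byte offset in the buffer
                     N * sizeof(int),  // size of the data to write, in bytes
                     A,                // pointer to the host data
                     0, NULL, NULL);   // no event wait list, no event returned

In the sample program the source data was already copied at buffer creation via CL_MEM_COPY_HOST_PTR, so this call is shown only to illustrate the API.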

Copy data back from the kernel:

// Copy the output back to CPU memory
int *C;
C = new int[N];
clEnqueueReadBuffer(GPUCommandQueue,  // command queue from clCreateCommandQueue()
                    GPUOutputVector,  // device buffer from clCreateBuffer()
                    CL_TRUE,          // read is blocking
                    0,                // byte offset in the buffer
                    N * sizeof(int),  // size of the data to read, in bytes
                    C,                // pointer to host buffer to receive the data
                    0,                // number of events to complete before this command
                    NULL,             // event wait list
                    NULL);            // event

Results from the GPU (C++ here):

cout << "C[" << 0   << "]: " << A[0]   << "+" << B[0]   << "=" << C[0]   << "\n";
cout << "C[" << N-1 << "]: " << A[N-1] << "+" << B[N-1] << "=" << C[N-1] << "\n";

Clean-up:

// Cleanup
free(GPUDevices);
clReleaseKernel(OpenCLVectorAdd);
clReleaseProgram(OpenCLProgram);
clReleaseCommandQueue(GPUCommandQueue);
clReleaseContext(GPUContext);
clReleaseMemObject(GPUVector1);
clReleaseMemObject(GPUVector2);
clReleaseMemObject(GPUOutputVector);

Compiling

You need the OpenCL header:

#include <CL/cl.h>    // on a Mac: #include <OpenCL/opencl.h>

and you must link against the OpenCL library. Compile the OpenCL host program main.c using gcc in two phases:

gcc -c -I /path-to-include-dir-with-cl.h/ main.c -o main.o
gcc -L /path-to-lib-folder-with-OpenCL-libfile/ main.o -o host -lOpenCL