ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson,


OpenCL
These notes will introduce OpenCL.
ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, April 7, 2011, OpenCL.ppt

OpenCL (Open Computing Language)
A standard based upon C for portable parallel applications, both task parallel and data parallel. Focuses on multi-platform support (multiple CPUs, GPUs, …). Development was initiated by Apple; the standard is developed by the Khronos Group, which also manages OpenGL. OpenCL 1.0 dates from 2008 and was released with Mac OS X 10.6 (Snow Leopard). OpenCL 1.1 followed in June 2010. It has similarities with CUDA, and an implementation is available for NVIDIA GPUs.
Wikipedia, "OpenCL", http://en.wikipedia.org/wiki/OpenCL

OpenCL Programming Model
Uses a data parallel programming model, similar to CUDA. The host program launches kernel routines as in CUDA, but OpenCL allows just-in-time compilation of kernels during host execution. OpenCL "work items" correspond to CUDA threads. OpenCL "work groups" correspond to CUDA thread blocks. Work items in the same work group can be synchronized with a barrier, as in CUDA.

Sample OpenCL code to add two vectors
To illustrate OpenCL commands, we will use OpenCL code to add two vectors, A and B, which are transferred to the device (GPU), with the result, C, returned to the host (CPU), similar to the CUDA vector addition.

Structure of OpenCL main program
1. Get information about platform and devices available on system
2. Select devices to use
3. Create an OpenCL context for the selected devices
4. Create an OpenCL command queue
5. Create memory buffers on device
6. Transfer data from host to device memory buffers
7. Create kernel program object
8. Build (compile) kernel in-line (or load precompiled binary)
9. Create OpenCL kernel object
10. Set kernel arguments
11. Execute kernel
12. Read kernel memory and copy to host memory

Platform
"The host plus a collection of devices managed by the OpenCL framework that allow an application to share resources and execute kernels on devices in the platform."
Platforms are represented by a cl_platform_id object, initialized with clGetPlatformIDs().
http://opencl.codeplex.com/wikipage?title=OpenCL%20Tutorials%20-%201

Simple code for identifying platform
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);
Arguments, in order: the number of platform entries (1 here); the list of OpenCL platforms found (platform IDs), in our case just one platform, stored in platform; a pointer that returns the number of OpenCL platforms available (ignored if NULL).

Context “The environment within which the kernels execute and the domain in which synchronization and memory management is defined. The context includes a set of devices, the memory accessible to those devices, the corresponding memory properties and one or more command-queues used to schedule execution of a kernel(s) or operations on memory objects.” The OpenCL Specification version 1.1 http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf

Code for context
    // Context
    cl_context_properties props[3];
    props[0] = (cl_context_properties) CL_CONTEXT_PLATFORM;
    props[1] = (cl_context_properties) platform;
    props[2] = (cl_context_properties) 0;
    cl_context GPUContext = clCreateContextFromType(props, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

    // Context info
    size_t ParmDataBytes;
    clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, 0, NULL, &ParmDataBytes);
    cl_device_id* GPUDevices = (cl_device_id*)malloc(ParmDataBytes);
    clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, ParmDataBytes, GPUDevices, NULL);

Command Queue “An object that holds commands that will be executed on a specific device. The command-queue is created on a specific device in a context. Commands to a command-queue are queued in-order but may be executed in-order or out-of-order. ...” The OpenCL Specification version 1.1 http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf

Simple code for creating a command queue
    // Create command-queue
    cl_command_queue GPUCommandQueue = clCreateCommandQueue(GPUContext, GPUDevices[0], 0, NULL);

Allocating memory on device
Use clCreateBuffer:
    cl_mem clCreateBuffer(cl_context context, cl_mem_flags flags, size_t size, void *host_ptr, cl_int *errcode_ret)
context: OpenCL context, from clCreateContextFromType().
flags: bit field to specify type of allocation/usage (CL_MEM_READ_WRITE, …).
size: number of bytes in the buffer memory object.
host_ptr: pointer to buffer data (may be previously allocated).
errcode_ret: returns an error code if an error occurs.
The function returns the memory object.

Sample code for allocating memory on device for source data
    // source data on host, two vectors
    int *A, *B;
    A = new int[N];
    B = new int[N];
    for (int i = 0; i < N; i++) {
        A[i] = rand()%1000;
        B[i] = rand()%1000;
    }
    …
    // Allocate GPU memory for source vectors
    cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(int)*N, A, NULL);
    cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(int)*N, B, NULL);

Sample code for allocating memory on device for results on GPU
    // Allocate GPU memory for output vector
    cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY, sizeof(int)*N, NULL, NULL);

Kernel Program
Simple programs might be in the same file as the host code, as in our CUDA examples. In that case the kernel needs to be formed into strings in a character array. If the kernel is in a separate file, that file can be read into the host program as a character string.

Kernel program
If in the same program as the host, the kernel needs to be formed into strings (it can also be a single string):
    const char* OpenCLSource[] = {
        "__kernel void vectorAdd (const __global int* a,",
        " const __global int* b,",
        " __global int* c)",
        "{",
        " unsigned int gid = get_global_id(0);",
        " c[gid] = a[gid] + b[gid];",
        "}"
    };
    …
    int main(int argc, char **argv) {
    }
__kernel: OpenCL qualifier to indicate kernel code.
__global: OpenCL qualifier to indicate the memory space of the argument (memory objects allocated from the global memory pool).
get_global_id(0): returns the global work-item ID in the given dimension (0 here).
Double underscores are optional in OpenCL qualifiers.

Kernel in a separate file
    // Load the kernel source code into the array source_str
    FILE *fp;
    char *source_str;
    size_t source_size;
    fp = fopen("vector_add_kernel.cl", "r");
    if (!fp) {
        fprintf(stderr, "Failed to load kernel.\n");
        exit(1);
    }
    source_str = (char*)malloc(MAX_SOURCE_SIZE);
    source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
    fclose(fp);
http://mywiki-science.wikispaces.com/OpenCL

Create kernel program object
    const char* OpenCLSource[] = { … };
    int main(int argc, char **argv) {
        …
        // Create OpenCL program object
        cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7, OpenCLSource, NULL, NULL);
This example uses a single file for both host and kernel code. clCreateProgramWithSource() can also be used with a separate kernel file read into the host program.
Arguments, in order: the context; the number of strings in the kernel program array (7 here); the array of strings; an array giving the lengths of the strings, used if the strings are not null-terminated (NULL here); a pointer used to return an error code if an error occurs.

Build kernel program
    // Build the program (OpenCL JIT compilation)
    clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);
Arguments, in order: program object from clCreateProgramWithSource(); number of devices; list of devices, if more than one; build options; function pointer to a notification routine called when the build is complete (if one is given, clBuildProgram returns immediately, otherwise it returns only when the build is complete); arguments for the notification routine.

Creating Kernel Objects
    // Create a handle to the compiled OpenCL function
    cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "vectorAdd", NULL);
Arguments, in order: the built program from clBuildProgram(); the name of the function with the __kernel qualifier; a pointer used to return an error code.

Set Kernel Arguments
    // Set kernel arguments
    clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem), (void*)&GPUVector1);
    clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&GPUVector2);
    clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&GPUOutputVector);
Arguments, in order: kernel object from clCreateKernel(); which argument; size of the argument; pointer to the data for the argument, from clCreateBuffer().

Enqueue a command to execute kernel on device
    // Launch the kernel
    size_t WorkSize[1] = {N};           // Total number of work items
    size_t localWorkSize[1] = {256};    // No of work items in work group
    clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLVectorAdd, 1, NULL, WorkSize, localWorkSize, 0, NULL, NULL);
Arguments, in order: the command queue; kernel object from clCreateKernel(); dimensions of work items; offset used with work items; array describing the number of global work items; array describing the number of work items that make up a work group; number of events to complete before this command; event wait list; event.

Function to copy from buffer object to host memory
The following function enqueues commands to read from a buffer object to host memory:
    cl_int clEnqueueReadBuffer(cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_read, size_t offset, size_t cb, void *ptr, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event)
The OpenCL Specification version 1.1 http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf

Function to copy from host memory to buffer object
The following function enqueues commands to write to a buffer object from host memory:
    cl_int clEnqueueWriteBuffer(cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_write, size_t offset, size_t cb, const void *ptr, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event)
The OpenCL Specification version 1.1 http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf

Copy data back from kernel
    // Copy the output back to CPU memory
    int *C;
    C = new int[N];
    clEnqueueReadBuffer(GPUCommandQueue, GPUOutputVector, CL_TRUE, 0, N*sizeof(int), C, 0, NULL, NULL);
Arguments, in order: command queue from clCreateCommandQueue(); device buffer from clCreateBuffer(); read is blocking (CL_TRUE); byte offset in buffer; size of data to read in bytes; pointer to buffer in host to write data; number of events to complete before this command; event wait list; event.

Results from GPU (C++ here)
    cout << "C[" << 0 << "]: " << A[0] << "+" << B[0] << "=" << C[0] << "\n";
    cout << "C[" << N-1 << "]: " << A[N-1] << "+" << B[N-1] << "=" << C[N-1] << "\n";

Clean-up
    // Cleanup
    free(GPUDevices);
    clReleaseKernel(OpenCLVectorAdd);
    clReleaseProgram(OpenCLProgram);
    clReleaseCommandQueue(GPUCommandQueue);
    clReleaseContext(GPUContext);
    clReleaseMemObject(GPUVector1);
    clReleaseMemObject(GPUVector2);
    clReleaseMemObject(GPUOutputVector);

Compiling
Need the OpenCL header:
    #include <CL/cl.h>
(for Mac: #include <OpenCL/opencl.h>) and link to the OpenCL library. Compile the OpenCL host program main.c using gcc in two phases:
    gcc -c -I /path-to-include-dir-with-cl.h/ main.c -o main.o
    gcc main.o -L /path-to-lib-folder-with-OpenCL-libfile/ -l OpenCL -o host
Ref: http://www.thebigblob.com/getting-started-with-opencl-and-gpu-computing/

Make File (Program called scalarmulocl)
CC = g++
LD = g++ -lm
CFLAGS = -Wall -shared
CDEBUG =
LIBOCL = -L/nfs-home/mmishra2/NVIDIA_GPU_Computing_SDK/OpenCL/common/lib
INCOCL = -I/nfs-home/mmishra2/NVIDIA_GPU_Computing_SDK/OpenCL/common/inc
SRCS = scalarmulocl.cpp
OBJS = scalarmulocl.o
EXE = scalarmulocl.a

all: $(EXE)

$(OBJS): $(SRCS)
	$(CC) $(CFLAGS) $(INCOCL) -I/usr/include -c $(SRCS)

$(EXE): $(OBJS)
	$(LD) -L/usr/local/lib $(OBJS) $(LIBOCL) -o $(EXE) -l OpenCL

clean:
	rm -f $(OBJS) *~
	clear

References: http://mywiki-science.wikispaces.com/OpenCL
Submitted by: Manisha Mishra

Compiling and Executing the program
To compile: make
To run: ./scalarmulocl.a
Submitted by: Manisha Mishra

Questions

More Information Chapter 11 of Programming Massively Parallel Processors by D. B. Kirk and W-M W. Hwu, Morgan Kaufmann, 2010

clGetPlatformIDs
Obtain the list of platforms available.
    cl_int clGetPlatformIDs(cl_uint num_entries, cl_platform_id *platforms, cl_uint *num_platforms)
Parameters
num_entries: The number of cl_platform_id entries that can be added to platforms. If platforms is not NULL, num_entries must be greater than zero.
platforms: Returns a list of OpenCL platforms found. The cl_platform_id values returned in platforms can be used to identify a specific OpenCL platform. If the platforms argument is NULL, this argument is ignored. The number of OpenCL platforms returned is the minimum of the value specified by num_entries and the number of OpenCL platforms available.
num_platforms: Returns the number of OpenCL platforms available. If num_platforms is NULL, this argument is ignored.
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/

Includes
    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>     // OpenCL header for C
    #include <iostream>    // C++ input/output
    using namespace std;