Heterogeneous Computing using OpenCL, Lecture 2. F21DP Distributed and Parallel Technology. Sven-Bodo Scholz.

Example: Squares
result[i] = data[i] * data[i]

Selecting a Platform
Each OpenCL implementation (i.e., an OpenCL library from AMD, NVIDIA, etc.) defines platforms, which enable the host system to interact with OpenCL-capable devices.
– Currently, each vendor supplies only a single platform per implementation.
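As an illustrative sketch (not from the slides), the usual two-call pattern for querying platforms looks like this, assuming #include <CL/cl.h> and <stdlib.h>:

cl_uint num_platforms;
clGetPlatformIDs(0, NULL, &num_platforms);        /* first call: query the count */

cl_platform_id *platforms = malloc(num_platforms * sizeof(cl_platform_id));
clGetPlatformIDs(num_platforms, platforms, NULL); /* second call: fetch the IDs  */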

Selecting Devices
Once a platform is selected, we can query for the devices that it knows how to interact with.
We can specify which types of devices we are interested in (e.g., all devices, CPUs only, GPUs only).
As with clGetPlatformIDs, this call is performed twice:
– The first call determines the number of devices; the second retrieves the device objects.
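A matching sketch for devices, reusing the platforms array from the sketch above and restricting the query to GPUs:

cl_uint num_devices;
clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices);

cl_device_id *devices = malloc(num_devices * sizeof(cl_device_id));
clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, num_devices, devices, NULL);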

Contexts
A context is an abstract execution environment.
[Figure: an empty context]
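A minimal sketch of creating a context for the devices queried above (error handling elided; num_devices and devices are carried over from the previous sketch):

cl_int err;
cl_context context = clCreateContext(NULL,        /* default properties */
                                     num_devices, devices,
                                     NULL, NULL,  /* no error callback  */
                                     &err);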

Command Queues
Command queues associate a context with a device.
[Figure: a context with one command queue per device]
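A sketch using the OpenCL 1.x call (in OpenCL 2.0 and later it is superseded by clCreateCommandQueueWithProperties):

cl_command_queue queue = clCreateCommandQueue(context, devices[0],
                                              0,   /* no properties */
                                              &err);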

Lucky You!
/*********************************************************************
 * initGPU : sets up the OpenCL environment for using a GPU.
 *           Note that the system may have more than one GPU, in which
 *           case the one that has been pre-configured will be chosen.
 *           If anything goes wrong in the course, error messages will
 *           be printed to stderr and the last error encountered will
 *           be returned.
 *********************************************************************/
extern cl_int initGPU ();

Chooses a platform and device, and creates a context and a command queue for you.

Memory Objects
[Figure: a context holding uninitialized OpenCL memory objects; the original data will be transferred later to/from these objects. The original input/output data live on the host and are not OpenCL memory objects.]
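A sketch of creating device buffers for the squares example, assuming count holds the number of floats:

cl_mem input  = clCreateBuffer(context, CL_MEM_READ_ONLY,
                               count * sizeof(float), NULL, &err);
cl_mem output = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                               count * sizeof(float), NULL, &err);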

Transferring Data
[Figure: images are written to a device; the images are drawn twice to show that they are both part of the context (on the host) and physically on the device.]
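A sketch of a blocking host-to-device copy, moving the host array data into the input buffer from the previous sketch:

clEnqueueWriteBuffer(queue, input,
                     CL_TRUE, 0,             /* blocking write, no offset */
                     count * sizeof(float), data,
                     0, NULL, NULL);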

Programs
[Figure: a context holding a program object]

Compiling Programs
This function (clBuildProgram) compiles and links an executable from the program object for each device in the context.
– If device_list is supplied, then only those devices are targeted.
Optional preprocessor, optimization, and other options can be supplied via the options argument.
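A sketch of creating and building a program from the KernelSource string shown later in these slides; retrieving the build log on failure assumes #include <stdio.h>:

cl_program program = clCreateProgramWithSource(context, 1, &KernelSource,
                                               NULL, &err);
err = clBuildProgram(program, 0, NULL,  /* all devices in the context */
                     NULL,              /* no extra compiler options  */
                     NULL, NULL);
if (err != CL_SUCCESS) {                /* on failure, print the log  */
  char log[4096];
  clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG,
                        sizeof(log), log, NULL);
  fprintf(stderr, "%s\n", log);
}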

Kernels
[Figure: a context holding kernel objects]

Runtime Compilation
There is a high overhead for compiling programs and creating kernels.
– Each operation only has to be performed once (at the beginning of the program).
– The kernel objects can be reused any number of times by setting different arguments.
The flow: read the source code into an array, then clCreateProgramWithSource (or clCreateProgramWithBinary), then clBuildProgram, then clCreateKernel.
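The final step of that flow, sketched for the square kernel used in the example below:

cl_kernel kernel = clCreateKernel(program, "square", &err);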

Kernel Arguments
[Figure: data (e.g., images) are set as kernel arguments]
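A sketch of setting the three arguments of the square kernel; the indices must match the parameter order in the kernel source, and count is assumed to be an unsigned int on the host:

clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
clSetKernelArg(kernel, 2, sizeof(unsigned int), &count);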

Lucky You II
/*******************************************************************************
 * setupKernel : this routine prepares a kernel for execution. It takes the
 *               following arguments:
 *               - the kernel source as a string
 *               - the name of the kernel function as a string
 *               - the number of arguments (must match those specified in the
 *                 kernel source!)
 *               - followed by the actual arguments. Each argument to the
 *                 kernel results in two or three arguments to this function,
 *                 depending on whether these are pointers to float-arrays or
 *                 integer values.
 *
 *               Legal argument sets are:
 *                 FloatArr::clarg_type, num_elems::int, pointer::float *, and
 *                 IntConst::clarg_type, number::int
 ******************************************************************************/

typedef enum { FloatArr, IntConst } clarg_type;

Lucky You II (cont.)
/*******************************************************************************
 *               If anything goes wrong in the course, error messages will be
 *               printed to stderr. The pointer to the fully prepared kernel
 *               will be returned.
 *
 *               Note that this function actually performs quite a few OpenCL
 *               tasks. It compiles the source, it allocates memory on the
 *               device, and it copies over all float arrays. If a more
 *               sophisticated behaviour is needed, you may have to fall back
 *               to using OpenCL directly.
 ******************************************************************************/
extern cl_kernel setupKernel( const char *kernel_source, char *kernel_name,
                              int num_args, ...);

count = 1024;
kernel = setupKernel( KernelSource, "square", 3,
                      FloatArr, count, data,
                      FloatArr, count, results,
                      IntConst, count);

Lucky You II (cont.)
const char *KernelSource =
  "__kernel void square(                 \n"
  "    __global float* input,            \n"
  "    __global float* output,           \n"
  "    const unsigned int count)         \n"
  "{                                     \n"
  "    int i = get_global_id(0);         \n"
  "    output[i] = input[i] * input[i];  \n"
  "}                                     \n";

data    = (float *) malloc (count * sizeof (float));
results = (float *) malloc (count * sizeof (float));
for (int i = 0; i < count; i++)
  data[i] = rand () / (float) RAND_MAX;

kernel = setupKernel( KernelSource, "square", 3,
                      FloatArr, count, data,
                      FloatArr, count, results,
                      IntConst, count);

Executing the Kernel
[Figure: an index space of threads is created (dimensions match the data)]

Executing the Kernel
A thread structure is defined by the index space that is created.
– Each thread executes the same kernel on different data.
SIMT = Single Instruction, Multiple Threads
[Figure: each thread in the index space executes the kernel]
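A sketch of launching the kernel over a one-dimensional index space, with sizes matching the runKernel example below:

size_t global[1] = {1024};  /* total number of work-items */
size_t local[1]  = {32};    /* work-items per work-group  */
clEnqueueNDRangeKernel(queue, kernel,
                       1,     /* one-dimensional index space */
                       NULL,  /* no global offset            */
                       global, local,
                       0, NULL, NULL);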

Copying Data Back
[Figure: results are copied back from the GPU into host memory]
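A sketch of the corresponding blocking device-to-host copy, reading the output buffer into the host array results:

clEnqueueReadBuffer(queue, output, CL_TRUE, 0,
                    count * sizeof(float), results,
                    0, NULL, NULL);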

Lucky You III
/*******************************************************************************
 * runKernel : this routine executes the kernel given as first argument.
 *             The thread-space is defined through the next two arguments:
 *             dim identifies the dimensionality of the thread-space and
 *             global is a vector of length dim that gives the upper bounds
 *             for all axes. The argument local specifies the size of the
 *             individual warps, which need to have the same dimensionality
 *             as the overall range.
 *             If anything goes wrong in the course, error messages will be
 *             printed to stderr and the last error encountered will be
 *             returned.
 *
 *             Note that this function not only executes the kernel with the
 *             given range and warp-size, it also copies back *all* arguments
 *             from the kernel after the kernel's completion. If a more
 *             sophisticated behaviour is needed, you may have to fall back
 *             to using OpenCL directly.
 ******************************************************************************/
extern cl_int runKernel( cl_kernel kernel, int dim, size_t *global, size_t *local);

size_t global[1] = {1024};
size_t local[1]  = {32};
runKernel( kernel, 1, global, local);

Finally: Release the Resources
/*******************************************************************************
 * freeDevice : this routine releases all acquired resources.
 *              If anything goes wrong in the course, error messages will be
 *              printed to stderr and the last error encountered will be
 *              returned.
 ******************************************************************************/
extern cl_int freeDevice();

OpenCL Timing
OpenCL provides "events" which can be used for timing kernels.
– Events will be discussed in detail in Lecture 11.
We pass an event to the OpenCL enqueue-kernel function to capture timestamps.
The code snippet below can be used to time a kernel:
– Add the profiling-enable flag when creating the command queue.
– By taking the difference of the start and end timestamps, we discount overheads like time spent in the command queue.

cl_event event_timer;
clEnqueueNDRangeKernel( myqueue, myKernel, 2, 0, globalws, localws,
                        0, NULL, &event_timer);
clWaitForEvents(1, &event_timer);   /* profiling info is only valid once
                                       the kernel has completed */

cl_ulong starttime, endtime;
clGetEventProfilingInfo( event_timer, CL_PROFILING_COMMAND_START,
                         sizeof(cl_ulong), &starttime, NULL);
clGetEventProfilingInfo( event_timer, CL_PROFILING_COMMAND_END,
                         sizeof(cl_ulong), &endtime, NULL);
unsigned long elapsed = (unsigned long)(endtime - starttime);
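A sketch of the first bullet above, creating a profiling-enabled queue; the names context, devices, and err are carried over from the earlier sketches:

cl_command_queue myqueue = clCreateCommandQueue(context, devices[0],
                                                CL_QUEUE_PROFILING_ENABLE,
                                                &err);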