Lecture 15 Introduction to OpenCL


Lecture 15 Introduction to OpenCL
Kyu Ho Park, May 17, 2016
Ref:
1. David Kirk and Wen-mei Hwu, Programming Massively Parallel Processors, MK and NVIDIA.
2. David Kaeli et al., Heterogeneous Computing with OpenCL 2.0, MK.
3. https://www.fixstars.com/en/openCL/Book/OpenCLProgrammingBook
4. OpenCL Introduction, PRACE/LinkSCEEM Winter School, CAPS

OpenCL
Motivation of OpenCL: to standardize a development platform for the fast-growing field of parallel computing. For shared-memory multiprocessors there is OpenMP, but for the CPU-GPU heterogeneous computing model, OpenCL is needed. It was initiated by Apple and is managed by the Khronos Group, which also manages the OpenGL standard. OpenCL supports AMD/ATI and NVIDIA GPUs, x86 CPUs, DSPs, and FPGAs.

Data Parallelism Model
OpenCL Parallelism      CUDA Equivalent
Kernel                  Kernel
Host Program            Host Program
NDRange (index space)   Grid
Work Item               Thread
Work Group              Block

OpenCL Execution
The host launches kernel functions much as in CUDA. When a kernel function is launched, its code is run by work items; a work item in OpenCL corresponds to a thread in CUDA. Work items are identified by a global index space called an NDRange, and work items are organized into work groups.

OpenCL Execution Model
[Figure: a 2-D NDRange divided into work groups. Global Size(0) and Global Size(1) give the index-space dimensions, Local Size(0) and Local Size(1) give the work-group dimensions, and each work group has a 2-D ID such as (0,0), (0,1), ..., (4,4); each cell within a group is a work item.]

Indexing Work Items
In CUDA, the global index of a thread = blockIdx.x*blockDim.x + threadIdx.x.
In OpenCL, a work item gets its global index by calling the API function get_global_id(): get_global_id(0) returns the global work-item index in the x dimension, and get_global_id(1) returns the index in the y dimension.

Indexing in OpenCL
To get the work-item index within a work group: get_local_id(0) returns the local index of the work item within its work group in the x dimension (threadIdx.x in CUDA).
To get the size of the NDRange in the y dimension: get_global_size(1) (gridDim.y*blockDim.y in CUDA).
To get the size of each work group: get_local_size(0) (blockDim.x in CUDA).

Kernels and OpenCL Programming Model
Dimensions of the NDRange: size_t indexSpace[3] = {1024, 1, 1};
Dimensions of the work-group size: size_t workgroupSize[3] = {64, 1, 1};

Platform Model
The platform model specifies one host and one or more devices; the devices execute OpenCL kernels.
[Figure: a host connected to Device 0, Device 1, ...; each device contains compute units, and each compute unit contains processing elements (PEs).]

APIs for the Platform
clGetPlatformIDs();
clGetDeviceIDs();

[Figure from OpenCL Introduction, PRACE/LinkSCEEM Winter School, CAPS]

[Figure from OpenCL Introduction, PRACE/LinkSCEEM Winter School, CAPS]

OpenCL Memory
[Figure from OpenCL Introduction, PRACE/LinkSCEEM Winter School, CAPS]

Device Architecture
A compute unit can be a CPU core, a DSP, or an FPGA block. Each compute unit consists of one or more processing elements (PEs), which correspond to SPs in CUDA.

Memory Types
OpenCL            CUDA                         OpenCL qualifier
Global Memory     Global Memory                __global
Constant Memory   Constant Memory              __constant
Local Memory      Shared Memory                __local
Private Memory    Registers and Local Memory   __private

OpenCL Memory Model
OpenCL defines three types of memory objects: buffers, images, and pipes.
Buffers: a buffer is equivalent to an array in C created using malloc(), with data elements stored contiguously in memory. cl_mem clCreateBuffer() allocates space for the buffer and returns a memory object; it is similar to malloc in C.

Images
An image is an OpenCL memory object that abstracts the storage of physical data to allow device-specific optimizations; this lets the hardware take advantage of spatial locality. cl_mem clCreateImage();

Pipes
A pipe is an ordered sequence of data items stored on a FIFO basis. cl_mem clCreatePipe();

Kernel Functions
OpenCL kernel:  __kernel void function() { ... }
CUDA kernel:    __global__ void function() { ... }

Kernel Example
void c_mul(int n, float *A, float *B, float *C)
{
    for (int i = 0; i < n; i++)
        C[i] = A[i] * B[i];
}

__kernel void cl_mul(__global float *A, __global float *B, __global float *C)
{
    int id = get_global_id(0);
    C[id] = A[id] * B[id];
}

Creating a Kernel
1. Turn the program source code into a program object: clCreateProgramWithSource();
2. Compile it: clBuildProgram();
3. Create the kernel: cl_kernel clCreateKernel(cl_program program, const char *kernel_name, cl_int *errcode_ret);
4. Set kernel arguments: cl_int clSetKernelArg(cl_kernel kernel, cl_uint arg_index, size_t arg_size, const void *arg_value);
5. Execute the kernel on a device: cl_int clEnqueueNDRangeKernel();

Execution Model
To execute a kernel on a device, a context must first be configured; the context enables the host to pass commands and data to the device.
Context: an abstract environment within which coordination and memory management for kernel execution are well defined.

OpenCL Context
To execute an OpenCL program we need: a context, a program, a kernel, a command queue, and a buffer.
The typical flow: write to buffer, execute kernel, read from buffer, release resources.

[Figure: a context containing a command queue, a program with its kernels, and buffers, together with the devices they are associated with.]

API to Create a Context
cl_context clCreateContext(
    const cl_context_properties *properties,
    cl_uint num_devices,
    const cl_device_id *devices,
    void (CL_CALLBACK *pfn_notify)(const char *errinfo,
                                   const void *private_info,
                                   size_t cb, void *user_data),
    void *user_data,
    cl_int *errcode_ret);
or clCreateContextFromType(); // creates a context that automatically includes all devices of the specified type (CPU, GPU, or all devices)
To query information about a context: clGetContextInfo();

Command Queues
The execution model specifies that devices perform tasks based on commands sent from the host to the device. Commands include executing kernels, data transfers, and synchronization.
Command queue: what the host uses to request action by a device. When the host decides to work with a device, one command queue should be created per device.
Command-queue creation:
cl_command_queue clCreateCommandQueueWithProperties(
    cl_context context,
    cl_device_id device,
    cl_command_queue_properties properties,
    cl_int *errcode_ret);

Request the device to send data to the host: clEnqueueReadBuffer();
Request that a kernel be executed on the device: clEnqueueNDRangeKernel();
Barrier operations:
cl_int clFlush(cl_command_queue command_queue); // issues all queued commands to the device
cl_int clFinish(cl_command_queue command_queue); // blocks until all queued commands have completed


Procedure to Execute an OpenCL Program
1. Discover the platform and devices
2. Create a context
3. Create a command queue per device
4. Create memory objects (buffers) to hold data
5. Copy the input data onto the device
6. Create and compile a program from the OpenCL source code
7. Extract the kernel from the program
8. Execute the kernel
9. Copy output data back to the host
10. Release the OpenCL resources

Initialization [Figure from Ref. 3: https://www.fixstars.com/en/openCL/Book/OpenCLProgrammingBook]

Discovering the platform and devices [Figure from Ref. 3: https://www.fixstars.com/en/openCL/Book/OpenCLProgrammingBook]

Creating a context and a command queue [Figure from Ref. 3: https://www.fixstars.com/en/openCL/Book/OpenCLProgrammingBook]

Building a kernel [Figure from Ref. 3: https://www.fixstars.com/en/openCL/Book/OpenCLProgrammingBook]

Executing the Kernel and Releasing Resources

Compiling an OpenCL Program

Second Term Reading List
10. Parallel Genetic Algorithm on CUDA Architecture, Petr Pospichal, Jiri Jaros, and Josef Schwarz, 2010: 양은주, May 19, 2016
11. Texture Memory, Chap. 7 of CUDA by Example: 전민수, May 19, 2016
12. Atomics, Chap. 9 of CUDA by Example: 이상록, May 26, 2016
13. Sparse Matrix-Vector Product: 장형욱, May 26, 2016
14. Solving Ordinary Differential Equations on GPUs: 윤종민, June 2, 2016
15. Fast Fourier Transform on GPUs: 이한섭, June 2, 2016
16. Building an Efficient Hash Table on GPU: 김태우, June 9, 2016
17. Efficient Batch LU and QR Decomposition on GPU, Brouwer and Taunay: 채종욱, June 9, 2016
18. CUDA vs OpenACC, Hoshino et al.: 홍찬솔, June 9, 2016

Description of Term Project (Homework #6)
1. Evaluation Guideline:
   Homework (5 homeworks): 30%
   Term Project: 20%
   Presentations: 10%
   Mid-term Exam: 15%
   Final Exam: 25%
2. Schedule:
   (1) May 26: Proposal Submission
   (2) June 7: Progress Report Submission
   (3) June 24: Final Report Submission
3. Project Guidelines:
   (1) Team based (2 students/team)
   (2) Subject: free to choose
   (3) Show your implementation
       a. in C
       b. in CUDA C
       c. in OpenCL (optional, bonus points)
   (4) You have to analyze and explain the performance of each implementation with a detailed description of your design.