OpenCL Patrick Cozzi University of Pennsylvania CIS 565 - Spring 2011
Administrivia: Assignment 5 posted, due Friday, 03/25, at 11:59pm. Project: one page pitch due Sunday, 03/20, at 11:59pm; 10 minute pitch in class on Monday, 03/21; email me your time slot preference.
From Monday: Page-Locked Host Memory. Can be mapped into the address space of the device on some systems. cudaMemcpy() vs cudaMemcpyAsync()
Image from: http://www.khronos.org
OpenCL: Open Computing Language. For heterogeneous parallel-computing systems. Cross-platform: implementations for ATI GPUs, NVIDIA GPUs, and x86 CPUs. Is cross-platform really one size fits all? A high performance OpenCL application that is cross-platform will need runtime checks to fully take advantage of the system; likewise, the workloads for GPU and CPU parallel algorithms are quite different. Image from: http://developer.apple.com/softwarelicensing/agreements/opencl.html
OpenCL is standardized: initiated by Apple, developed by the Khronos Group.
Image from: http://www.khronos.org
Image from: http://www.khronos.org
Image from: http://www.khronos.org
OpenCL: API similar to OpenGL. Based on the C language. Easy transition from CUDA to OpenCL.
OpenCL and CUDA: Many OpenCL features have a one-to-one mapping to CUDA features. OpenCL has more complex platform and device management and a more complex kernel launch; the extra complexity comes from its support for multi-platform and multi-vendor portability.
OpenCL and CUDA: A Compute Unit (CU) corresponds to a CUDA streaming multiprocessor (SM), a CPU core, etc. A Processing Element corresponds to a CUDA streaming processor (SP), a CPU ALU, etc.
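A minimal sketch (not from the original slides) of querying each GPU device's compute units with clGetDeviceInfo; it assumes at least one OpenCL platform with a GPU device and omits error checking for brevity.

// Sketch: list each GPU device's name and number of compute units.
#include <stdio.h>
#include <CL/cl.h>   // <OpenCL/opencl.h> on Mac OS X

int main(void)
{
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, 0);

    cl_uint numDevices;
    cl_device_id devices[8];
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 8, devices, &numDevices);

    for (cl_uint i = 0; i < numDevices; ++i)
    {
        char name[256];
        cl_uint computeUnits;
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, 0);
        clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(computeUnits), &computeUnits, 0);
        printf("%s: %u compute units\n", name, computeUnits);
    }
    return 0;
}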
OpenCL and CUDA Image from: http://developer.amd.com/zones/OpenCLZone/courses/pages/Introductory-OpenCL-SAAHPC10.aspx
OpenCL and CUDA terminology:

CUDA                   OpenCL
Kernel                 Kernel
Host program           Host program
Thread                 Work item
Block                  Work group
Grid                   NDRange (index space)
OpenCL and CUDA: Work Item (CUDA thread) – executes kernel code. Index Space (CUDA grid) – defines work items and how data is mapped to them. Work Group (CUDA block) – work items in a work group can synchronize.
OpenCL and CUDA: CUDA uses threadIdx and blockIdx, which combine to create a global thread ID. Example: blockIdx.x * blockDim.x + threadIdx.x
OpenCL and CUDA: In OpenCL, each work item (thread) has a unique global index, retrieved with get_global_id().

CUDA                                       OpenCL
threadIdx.x                                get_local_id(0)
blockIdx.x * blockDim.x + threadIdx.x      get_global_id(0)
OpenCL and CUDA:

CUDA                          OpenCL
gridDim.x                     get_num_groups(0)
blockIdx.x                    get_group_id(0)
blockDim.x                    get_local_size(0)
gridDim.x * blockDim.x        get_global_size(0)
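As an illustrative sketch (not from the original slides), the mapping between the two indexing schemes can be written directly in an OpenCL kernel: in 1D, with no global work offset, get_global_id(0) equals get_group_id(0) * get_local_size(0) + get_local_id(0). The kernel name and buffers below are made up for illustration.

// Sketch: each work item writes its global index computed two ways; the values match.
__kernel void indexDemo(__global int *globalIds, __global int *computedIds)
{
    int i = get_global_id(0);                              // like blockIdx.x * blockDim.x + threadIdx.x
    int computed = get_group_id(0) * get_local_size(0)     // like blockIdx.x * blockDim.x
                 + get_local_id(0);                        //   + threadIdx.x

    globalIds[i]   = i;
    computedIds[i] = computed;
}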
OpenCL and CUDA Recall CUDA: Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
OpenCL and CUDA. In OpenCL: index space diagram. Work groups (e.g., Work Group (0,0) through (2,1)) tile the index space, and each work group contains work items (e.g., Work Item (0,0) through (4,2)). get_global_size(0) and get_global_size(1) give the dimensions of the whole index space; get_local_size(0) and get_local_size(1) give the dimensions of a single work group.
Image from: http://developer.amd.com
OpenCL and CUDA Mapping to NVIDIA hardware: Image from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf
OpenCL and CUDA Recall the CUDA memory model: Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
OpenCL and CUDA In OpenCL: Image from http://developer.amd.com/zones/OpenCLZone/courses/pages/Introductory-OpenCL-SAAHPC10.aspx
OpenCL and CUDA memory spaces:

CUDA               OpenCL
Global memory      Global memory
Constant memory    Constant memory
Shared memory      Local memory
Local memory       Private memory
OpenCL and CUDA: Both have barriers (__syncthreads() in CUDA, barrier() in OpenCL) and memory fences (in CL: mem_fence(), read_mem_fence(), write_mem_fence()). The compiler may reorder memory reads/writes to better exploit local (shared) and global memory; barriers and fences constrain that reordering.
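A minimal kernel sketch (not from the original slides) showing OpenCL local memory, the counterpart of CUDA shared memory, together with barrier(); the kernel name and buffer layout are made up for illustration.

// Sketch: reverse each work group's chunk of the input through local memory.
// __local corresponds to CUDA __shared__; barrier(CLK_LOCAL_MEM_FENCE)
// corresponds to __syncthreads().
__kernel void reverseChunks(__global const float *in,
                            __global float *out,
                            __local float *tile)          // one element per work item
{
    int lid  = get_local_id(0);
    int gid  = get_global_id(0);
    int last = get_local_size(0) - 1;

    tile[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);   // all writes to tile are visible to the work group

    out[gid] = tile[last - lid];
}

On the host, the __local argument would be set with clSetKernelArg(hKernel, 2, workGroupSize * sizeof(cl_float), NULL), i.e. a size and a NULL pointer.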
Image from: http://www.khronos.org
OpenCL and CUDA. Kernel functions. Recall CUDA:

__global__ void vecAdd(float *a, float *b, float *c)
{
    int i = threadIdx.x;
    c[i] = a[i] + b[i];
}
OpenCL and CUDA. In OpenCL:

__kernel void vecAdd(__global const float *a,
                     __global const float *b,
                     __global float *c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}
Slides from: http://developer.amd.com/zones/OpenCLZone/courses/pages/Introductory-OpenCL-SAAHPC10.aspx
Rough analogies: OpenCL programs are like OpenGL shader programs, OpenCL buffers are like OpenGL buffers, and OpenCL command queues are like CUDA streams. Slide from: http://developer.amd.com/zones/OpenCLZone/courses/pages/Introductory-OpenCL-SAAHPC10.aspx
OpenCL API Walkthrough. OpenCL host code for running our vecAdd kernel:

__kernel void vecAdd(__global const float *a,
                     __global const float *b,
                     __global float *c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}

See the NVIDIA OpenCL JumpStart Guide for the full code example: http://developer.download.nvidia.com/OpenCL/NVIDIA_OpenCL_JumpStart_Guide.pdf
OpenCL API

// create OpenCL device & context
cl_context hContext;
hContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU, 0, 0, 0);

Create a context for a GPU.
OpenCL API

// query all devices available to the context
size_t nContextDescriptorSize;
clGetContextInfo(hContext, CL_CONTEXT_DEVICES, 0, 0, &nContextDescriptorSize);
cl_device_id *aDevices = (cl_device_id *)malloc(nContextDescriptorSize);
clGetContextInfo(hContext, CL_CONTEXT_DEVICES, nContextDescriptorSize, aDevices, 0);

Retrieve an array of the GPU devices in the context.
OpenCL API

// create a command queue for the first device the context reported
cl_command_queue hCmdQueue;
hCmdQueue = clCreateCommandQueue(hContext, aDevices[0], 0, 0);

Create a command queue (like a CUDA stream) for the first GPU.
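A small optional sketch, not part of the original walkthrough: the third argument of clCreateCommandQueue takes property flags, so the same call can create a profiling-enabled queue, and event objects can then time enqueued commands. The names hProfQueue and evt are made up here.

// Sketch: queue with profiling enabled, then time an enqueued command.
cl_command_queue hProfQueue =
    clCreateCommandQueue(hContext, aDevices[0], CL_QUEUE_PROFILING_ENABLE, 0);

cl_event evt;
// ... enqueue work on hProfQueue, passing &evt as the last argument ...
clWaitForEvents(1, &evt);

cl_ulong start, end;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, 0);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(end),   &end,   0);
// (end - start) is the command's execution time in nanoseconds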
OpenCL API

// create & compile program
cl_program hProgram;
hProgram = clCreateProgramWithSource(hContext, 1, source, 0, 0);
clBuildProgram(hProgram, 0, 0, 0, 0, 0);

A program contains one or more kernels. Think DLL. Provide kernel source as a string. Can also compile offline.
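If clBuildProgram fails, the compiler output can be retrieved with clGetProgramBuildInfo. A hedged sketch, not in the original slides, reusing hProgram and aDevices[0] from above (assumes <stdio.h> and <stdlib.h>):

// Sketch: print the build log when compilation fails.
cl_int err = clBuildProgram(hProgram, 0, 0, 0, 0, 0);
if (err != CL_SUCCESS)
{
    size_t logSize;
    clGetProgramBuildInfo(hProgram, aDevices[0], CL_PROGRAM_BUILD_LOG,
                          0, 0, &logSize);
    char *log = (char *)malloc(logSize);
    clGetProgramBuildInfo(hProgram, aDevices[0], CL_PROGRAM_BUILD_LOG,
                          logSize, log, 0);
    printf("Build failed:\n%s\n", log);
    free(log);
}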
OpenCL API

// create kernel
cl_kernel hKernel;
hKernel = clCreateKernel(hProgram, "vecAdd", 0);

Create kernel from program.
OpenCL API

// allocate host vectors
float* pA = new float[cnDimension];
float* pB = new float[cnDimension];
float* pC = new float[cnDimension];

// initialize host memory
randomInit(pA, cnDimension);
randomInit(pB, cnDimension);
OpenCL API

cl_mem hDeviceMemA = clCreateBuffer(
    hContext,
    CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
    cnDimension * sizeof(cl_float),
    pA, 0);
cl_mem hDeviceMemB = /* ... */

Create buffers for kernel input. Read only in the kernel. Written by the host.
OpenCL API

hDeviceMemC = clCreateBuffer(
    hContext,
    CL_MEM_WRITE_ONLY,
    cnDimension * sizeof(cl_float),
    0, 0);

Create buffer for kernel output.
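As an alternative to initializing buffers with CL_MEM_COPY_HOST_PTR, data can be copied explicitly after creation. A minimal sketch, not in the original walkthrough, reusing hContext, hCmdQueue, and pA; hDeviceMemA2 is a made-up name.

// Sketch: create an uninitialized buffer, then copy host data into it.
cl_mem hDeviceMemA2 = clCreateBuffer(hContext, CL_MEM_READ_ONLY,
                                     cnDimension * sizeof(cl_float), 0, 0);
clEnqueueWriteBuffer(hCmdQueue, hDeviceMemA2,
                     CL_TRUE,                         // blocking write
                     0,                               // offset
                     cnDimension * sizeof(cl_float),  // bytes
                     pA, 0, 0, 0);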
OpenCL API

// setup parameter values
clSetKernelArg(hKernel, 0, sizeof(cl_mem), (void *)&hDeviceMemA);
clSetKernelArg(hKernel, 1, sizeof(cl_mem), (void *)&hDeviceMemB);
clSetKernelArg(hKernel, 2, sizeof(cl_mem), (void *)&hDeviceMemC);

Kernel arguments are set by index.
OpenCL API

// execute kernel
clEnqueueNDRangeKernel(hCmdQueue, hKernel, 1, 0, &cnDimension, 0, 0, 0, 0);

// copy results from device back to host
clEnqueueReadBuffer(hCmdQueue, hDeviceMemC, CL_TRUE, 0,
                    cnDimension * sizeof(cl_float), pC, 0, 0, 0);

Passing 0 for the local work size lets OpenCL pick the work group size. CL_TRUE makes the read blocking.
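If you want to choose the work group size yourself, pass an explicit local work size instead. A hedged sketch, not in the original, assuming cnDimension is a multiple of the chosen size (256 here is just an example value):

// Sketch: launch with an explicit work group size of 256 work items.
size_t localSize = 256;   // must divide cnDimension evenly in this sketch
cl_int err = clEnqueueNDRangeKernel(hCmdQueue, hKernel, 1, 0,
                                    &cnDimension, &localSize, 0, 0, 0);
// err == CL_SUCCESS on success; e.g. CL_INVALID_WORK_GROUP_SIZE if 256 is not
// supported for this kernel/device or does not divide the global size
clFinish(hCmdQueue);      // wait for the kernel to complete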
OpenCL API

delete [] pA;
delete [] pB;
delete [] pC;

clReleaseMemObject(hDeviceMemA);
clReleaseMemObject(hDeviceMemB);
clReleaseMemObject(hDeviceMemC);
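The remaining OpenCL objects created above would be released the same way; a short sketch completing the cleanup (not shown in the original):

// Sketch: release the rest of the objects created during the walkthrough.
clReleaseKernel(hKernel);
clReleaseProgram(hProgram);
clReleaseCommandQueue(hCmdQueue);
clReleaseContext(hContext);
free(aDevices);   // allocated with malloc when querying the context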