Programming Massively Parallel Processors Lecture Slides for Chapter 11: OpenCL for CUDA Programmers © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign

OpenCL to CUDA Data Parallelism Model Mapping

OpenCL Parallelism Concept    CUDA Equivalent
kernel                        kernel
host program                  host program
NDRange (index space)         grid
work item                     thread
work group                    block

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign

Overview of OpenCL Execution Model © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign

Mapping of OpenCL Dimensions and Indices to CUDA

OpenCL API Call       Explanation                                                      CUDA Equivalent
get_global_id(0)      global index of the work item in the x dimension                 blockIdx.x * blockDim.x + threadIdx.x
get_local_id(0)       local index of the work item within the work group, x dimension  threadIdx.x
get_global_size(0)    size of the NDRange in the x dimension                           gridDim.x * blockDim.x
get_local_size(0)     size of each work group in the x dimension                       blockDim.x

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
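
As a concrete check of the first table row, a minimal sketch (our own, not from the slides) of the same flattened x index in each model:

/* OpenCL (inside a kernel): */
int i = get_global_id(0);

/* CUDA (inside a kernel), computing the same value for the corresponding thread: */
int i = blockIdx.x * blockDim.x + threadIdx.x;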

Conceptual OpenCL Device Architecture © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign

Mapping OpenCL Memory Types to CUDA

OpenCL Memory Type    CUDA Equivalent
global memory         global memory
constant memory       constant memory
local memory          shared memory
private memory        local memory

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
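
A minimal OpenCL kernel sketch (our own illustration, not from the slides) showing where each memory type appears, with the CUDA counterpart noted in comments:

__kernel void memdemo(__global float *gdata,     /* global memory: CUDA global memory */
                      __constant float *cdata,   /* constant memory: CUDA __constant__ memory */
                      __local float *ldata) {    /* local memory: CUDA __shared__ memory */
    int lid = get_local_id(0);
    float p = gdata[get_global_id(0)] + cdata[0];  /* p is private memory: CUDA registers/local memory */
    ldata[lid] = p;
    barrier(CLK_LOCAL_MEM_FENCE);                  /* CUDA: __syncthreads() */
    gdata[get_global_id(0)] = ldata[lid];
}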

A Simple OpenCL Kernel Example

__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *result) {
    int id = get_global_id(0);
    result[id] = a[id] + b[id];
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
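
For comparison, a minimal CUDA counterpart (our sketch, assuming the launch creates exactly one thread per element, just as the OpenCL version assumes one work item per element):

__global__ void vadd(const float *a, const float *b, float *result) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;  /* plays the role of get_global_id(0) */
    result[id] = a[id] + b[id];
}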

OpenCL Context for Device Management © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign

Creating OpenCL Context and Device Queue

…
cl_int clerr = CL_SUCCESS;

/* Create a context spanning all available OpenCL device types. */
cl_context clctx = clCreateContextFromType(0, CL_DEVICE_TYPE_ALL, NULL, NULL, &clerr);

/* Query the context's device list: the first call obtains its size, the second fills it. */
size_t parmsz;
clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES, 0, NULL, &parmsz);
cl_device_id *cldevs = (cl_device_id *) malloc(parmsz);
clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES, parmsz, cldevs, NULL);

/* Create a command queue on the first device in the context. */
cl_command_queue clcmdq = clCreateCommandQueue(clctx, cldevs[0], 0, &clerr);

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
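
In real host code each clerr return should be checked, and the objects released once they are no longer needed; a minimal sketch (ours, not from the slide):

if (clerr != CL_SUCCESS) {
    /* handle the error; every clerr assignment above should be checked the same way */
}
clReleaseCommandQueue(clcmdq);
clReleaseContext(clctx);
free(cldevs);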

OpenCL Version of DCS Kernel 3 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign

Figure 11.10 Mapping DCS NDRange to OpenCL Device © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign

Data Access Indexing: OpenCL vs. CUDA

OpenCL:
__kernel void clenergy(…) {
    unsigned int xindex  = (get_global_id(0) - get_local_id(0)) * UNROLLX
                           + get_local_id(0);
    unsigned int yindex  = get_global_id(1);
    unsigned int outaddr = get_global_size(0) * UNROLLX * yindex + xindex;

CUDA:
__global__ void cuenergy(…) {
    unsigned int xindex  = blockIdx.x * blockDim.x * UNROLLX + threadIdx.x;
    unsigned int yindex  = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned int outaddr = gridDim.x * blockDim.x * UNROLLX * yindex + xindex;

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign

Inner Loop of the OpenCL DCS Kernel

…
for (atomid = 0; atomid < numatoms; atomid++) {
    float dy   = coory - atominfo[atomid].y;
    float dyz2 = (dy * dy) + atominfo[atomid].z;

    float dx1 = coorx - atominfo[atomid].x;
    float dx2 = dx1 + gridspacing_coalesce;
    float dx3 = dx2 + gridspacing_coalesce;
    float dx4 = dx3 + gridspacing_coalesce;

    float charge = atominfo[atomid].w;
    energyvalx1 += charge * native_rsqrt(dx1*dx1 + dyz2);
    energyvalx2 += charge * native_rsqrt(dx2*dx2 + dyz2);
    energyvalx3 += charge * native_rsqrt(dx3*dx3 + dyz2);
    energyvalx4 += charge * native_rsqrt(dx4*dx4 + dyz2);
}

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
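
The loop assumes surrounding declarations that the transcript does not preserve. Roughly, following the VMD DCS kernels (our reconstruction; the names and the precomputed-z convention are assumptions):

float coorx = gridspacing * xindex;   /* x coordinate of this work-item's first grid point */
float coory = gridspacing * yindex;   /* y coordinate of the grid point row */
/* stride in x between the UNROLLX points one work-item handles, kept coalesced: */
float gridspacing_coalesce = gridspacing * BLOCKSIZEX;
/* one running-sum accumulator per unrolled grid point: */
float energyvalx1 = 0.0f, energyvalx2 = 0.0f,
      energyvalx3 = 0.0f, energyvalx4 = 0.0f;
/* atominfo[atomid].z is assumed to hold the precomputed squared z distance,
   folded in on the host, which is why dyz2 adds it directly. */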

Host Code for Building an OpenCL Kernel

(The six-line code listing on this slide did not survive the transcript.)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
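
In its place, a minimal sketch of the standard OpenCL build sequence (ours; the variable names clprgm and clkern, the source string clenergysrc, and the build options are assumptions, not the slide's exact code):

extern const char *clenergysrc;  /* hypothetical: the DCS kernel source as a C string */

/* Wrap the source in a program object, compile it for the context's devices,
   then extract a kernel object by its function name. */
cl_program clprgm = clCreateProgramWithSource(clctx, 1, &clenergysrc, NULL, &clerr);
clerr = clBuildProgram(clprgm, 0, NULL, "-cl-fast-relaxed-math", NULL, NULL);
cl_kernel clkern = clCreateKernel(clprgm, "clenergy", &clerr);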

Host Code for OpenCL Kernel Launch

/* Allocate device buffers for the output grid and the atom data. */
doutput = clCreateBuffer(clctx, CL_MEM_READ_WRITE, volmemsz, NULL, NULL);
datominfo = clCreateBuffer(clctx, CL_MEM_READ_ONLY,
                           MAXATOMS * sizeof(cl_float4), NULL, NULL);
…
/* Bind the kernel arguments. */
clerr = clSetKernelArg(clkern, 0, sizeof(int), &runatoms);
clerr = clSetKernelArg(clkern, 1, sizeof(float), &zplane);
clerr = clSetKernelArg(clkern, 2, sizeof(cl_mem), &doutput);
clerr = clSetKernelArg(clkern, 3, sizeof(cl_mem), &datominfo);

/* Launch a 2-D NDRange and wait for its completion event. */
cl_event event;
clerr = clEnqueueNDRangeKernel(clcmdq, clkern, 2, NULL, Gsz, Bsz, 0, NULL, &event);
clerr = clWaitForEvents(1, &event);
clerr = clReleaseEvent(event);

/* Copy the result back (blocking read) and release the buffers. */
clEnqueueReadBuffer(clcmdq, doutput, CL_TRUE, 0, volmemsz, energy, 0, NULL, NULL);
clReleaseMemObject(doutput);
clReleaseMemObject(datominfo);

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010 ECE408, University of Illinois, Urbana-Champaign
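
Gsz and Bsz are not defined on the slide. A plausible setup for the 2-D launch (our assumption; global_items_x and global_items_y are hypothetical names). Note the key difference from CUDA: the OpenCL global work size counts total work items, not work groups as a CUDA grid dimension does.

size_t Bsz[2] = {BLOCKSIZEX, BLOCKSIZEY};          /* work-group size, cf. CUDA block dims */
size_t Gsz[2] = {global_items_x, global_items_y};  /* total work items per dimension, each a
                                                      multiple of the matching Bsz entry;
                                                      CUDA would pass group counts instead */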