Slide 1: Programming Massively Parallel Processors
Lecture Slides for Chapter 11: OpenCL for CUDA Programmers
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408, University of Illinois, Urbana-Champaign

Slide 3: OpenCL to CUDA Data Parallelism Model Mapping

    OpenCL Parallelism Concept    CUDA Equivalent
    kernel                        kernel
    host program                  host program
    NDRange (index space)         grid
    work item                     thread
    work group                    block
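
To make the mapping concrete, here is a minimal host-side sketch (not from the slides; the sizes are hypothetical, and the queue and kernel names are the ones that appear on the later slides). CUDA launches a grid of blocks, while OpenCL enqueues an NDRange whose global size counts total work items:

    /* CUDA launch of 1024 blocks of 256 threads:  vadd<<<1024, 256>>>(a, b, r);
       The equivalent OpenCL NDRange is sized in work items, not work groups. */
    size_t Bsz = 256;          /* work-group size (CUDA: blockDim.x)             */
    size_t Gsz = 1024 * Bsz;   /* NDRange size    (CUDA: gridDim.x * blockDim.x) */
    clEnqueueNDRangeKernel(clcmdq, clkern, 1, NULL, &Gsz, &Bsz, 0, NULL, NULL);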

Slide 4: Overview of OpenCL Execution Model

Slide 5: Mapping of OpenCL Dimensions and Indices to CUDA

    OpenCL API Call       Explanation                                        CUDA Equivalent
    get_global_id(0)      global index of the work item in the x dimension   blockIdx.x * blockDim.x + threadIdx.x
    get_local_id(0)       local index of the work item within its work       threadIdx.x
                          group in the x dimension
    get_global_size(0)    size of the NDRange in the x dimension             gridDim.x * blockDim.x
    get_local_size(0)     size of each work group in the x dimension         blockDim.x
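
As a quick illustration (a sketch, not from the slides; variable names are illustrative), a row-major 2D global index comes out the same under both sets of primitives:

    /* OpenCL kernel fragment; each line's CUDA equivalent is in the comment. */
    int gx    = get_global_id(0);    /* blockIdx.x * blockDim.x + threadIdx.x */
    int gy    = get_global_id(1);    /* blockIdx.y * blockDim.y + threadIdx.y */
    int width = get_global_size(0);  /* gridDim.x * blockDim.x                */
    int idx   = gy * width + gx;     /* identical flattening in both models   */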

Slide 6: Conceptual OpenCL Device Architecture

Slide 7: Mapping OpenCL Memory Types to CUDA

    OpenCL Memory Type    CUDA Equivalent
    global memory         global memory
    constant memory       constant memory
    local memory          shared memory
    private memory        local memory
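
A minimal kernel sketch (hypothetical names, not from the slides) showing where each OpenCL address-space qualifier appears; the CUDA counterpart is noted per line:

    __kernel void memdemo(__global   float *out,      /* CUDA: global memory    */
                          __constant float *coeff,    /* CUDA: __constant__     */
                          __local    float *scratch)  /* CUDA: __shared__ array */
    {
        int   lid = get_local_id(0);
        float t   = coeff[0] * lid;        /* private variable (CUDA: register) */
        scratch[lid] = t;                  /* visible to the whole work group   */
        barrier(CLK_LOCAL_MEM_FENCE);      /* CUDA: __syncthreads()             */
        out[get_global_id(0)] = scratch[lid];
    }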

Slide 8: A Simple OpenCL Kernel Example

    __kernel void vadd(__global const float *a, __global const float *b,
                       __global float *result) {
        int id = get_global_id(0);
        result[id] = a[id] + b[id];
    }
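
For context, a hedged host-side sketch of launching this vadd kernel, assuming a context clctx, queue clcmdq, and kernel object clkern already exist (buffer and host-array names are hypothetical; error checks omitted):

    size_t n = 1 << 20;
    cl_mem d_a = clCreateBuffer(clctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), h_a, NULL);
    cl_mem d_b = clCreateBuffer(clctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), h_b, NULL);
    cl_mem d_r = clCreateBuffer(clctx, CL_MEM_WRITE_ONLY,
                                n * sizeof(float), NULL, NULL);
    clSetKernelArg(clkern, 0, sizeof(cl_mem), &d_a);
    clSetKernelArg(clkern, 1, sizeof(cl_mem), &d_b);
    clSetKernelArg(clkern, 2, sizeof(cl_mem), &d_r);
    size_t gsz = n, lsz = 256;       /* total work items; work-group size */
    clEnqueueNDRangeKernel(clcmdq, clkern, 1, NULL, &gsz, &lsz, 0, NULL, NULL);
    /* Blocking read copies the result back to the host. */
    clEnqueueReadBuffer(clcmdq, d_r, CL_TRUE, 0, n * sizeof(float), h_r,
                        0, NULL, NULL);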

Slide 9: OpenCL Context for Device Management

Slide 10: Creating OpenCL Context and Device Queue

    cl_int clerr = CL_SUCCESS;
    /* Create a context spanning all available device types. */
    cl_context clctx = clCreateContextFromType(0, CL_DEVICE_TYPE_ALL,
                                               NULL, NULL, &clerr);
    /* Query the size of the device list, then retrieve it. */
    size_t parmsz;
    clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES, 0, NULL, &parmsz);
    cl_device_id *cldevs = (cl_device_id *) malloc(parmsz);
    clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES, parmsz, cldevs, NULL);
    /* Create a command queue on the first device in the context. */
    cl_command_queue clcmdq = clCreateCommandQueue(clctx, cldevs[0], 0, &clerr);
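
A small follow-up sketch (not on the slide): once the device list is in hand, clGetDeviceInfo can confirm which device the queue was created on, and the malloc'ed list should eventually be freed:

    char devname[256];
    clerr = clGetDeviceInfo(cldevs[0], CL_DEVICE_NAME,
                            sizeof(devname), devname, NULL);
    printf("Using OpenCL device: %s\n", devname);
    free(cldevs);   /* the device list was malloc'ed above */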

Slide 11: OpenCL Version of DCS Kernel 3

Slide 12: Figure: Mapping DCS NDRange to OpenCL Device

Slide 13: Data Access Indexing: OpenCL vs. CUDA

    OpenCL:
    __kernel void clenergy(…) {
        unsigned int xindex  = (get_global_id(0) - get_local_id(0)) * UNROLLX
                               + get_local_id(0);
        unsigned int yindex  = get_global_id(1);
        unsigned int outaddr = get_global_size(0) * UNROLLX * yindex + xindex;

    CUDA:
    __global__ void cuenergy(…) {
        unsigned int xindex  = blockIdx.x * blockDim.x * UNROLLX + threadIdx.x;
        unsigned int yindex  = blockIdx.y * blockDim.y + threadIdx.y;
        unsigned int outaddr = gridDim.x * blockDim.x * UNROLLX * yindex + xindex;
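
Since get_global_id(0) - get_local_id(0) is simply the work-group's starting offset, the OpenCL index above can also be written with the group primitives directly (an equivalent form, assuming no global work offset; not from the slide):

    unsigned int xindex = get_group_id(0) * get_local_size(0) * UNROLLX
                          + get_local_id(0);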

Slide 14: Inner Loop of the OpenCL DCS Kernel

    …
    for (atomid = 0; atomid < numatoms; atomid++) {
        float dy   = coory - atominfo[atomid].y;
        float dyz2 = (dy * dy) + atominfo[atomid].z;
        float dx1  = coorx - atominfo[atomid].x;
        float dx2  = dx1 + gridspacing_coalesce;
        float dx3  = dx2 + gridspacing_coalesce;
        float dx4  = dx3 + gridspacing_coalesce;
        float charge = atominfo[atomid].w;
        energyvalx1 += charge * native_rsqrt(dx1*dx1 + dyz2);
        energyvalx2 += charge * native_rsqrt(dx2*dx2 + dyz2);
        energyvalx3 += charge * native_rsqrt(dx3*dx3 + dyz2);
        energyvalx4 += charge * native_rsqrt(dx4*dx4 + dyz2);
    }

Slide 15: Host Code for Building an OpenCL Kernel
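
The slide's code did not survive the transcript. A minimal sketch of the usual build sequence, using the context from slide 10; the source-string variable, the build option, and the kernel name clenergy are assumptions here:

    extern const char *clkernsrc;   /* kernel source text, e.g. loaded from a .cl file */
    cl_program clpgm = clCreateProgramWithSource(clctx, 1, &clkernsrc,
                                                 NULL, &clerr);
    clerr = clBuildProgram(clpgm, 0, NULL, "-cl-mad-enable", NULL, NULL);
    cl_kernel clkern = clCreateKernel(clpgm, "clenergy", &clerr);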

Slide 16: Host Code for OpenCL Kernel Launch

    /* Allocate the device output buffer and the read-only atom data buffer. */
    doutput   = clCreateBuffer(clctx, CL_MEM_READ_WRITE, volmemsz, NULL, NULL);
    datominfo = clCreateBuffer(clctx, CL_MEM_READ_ONLY,
                               MAXATOMS * sizeof(cl_float4), NULL, NULL);
    …
    clerr = clSetKernelArg(clkern, 0, sizeof(int),    &runatoms);
    clerr = clSetKernelArg(clkern, 1, sizeof(float),  &zplane);
    clerr = clSetKernelArg(clkern, 2, sizeof(cl_mem), &doutput);
    clerr = clSetKernelArg(clkern, 3, sizeof(cl_mem), &datominfo);
    cl_event event;
    /* Enqueue the 2D NDRange, then wait for the kernel to finish. */
    clerr = clEnqueueNDRangeKernel(clcmdq, clkern, 2, NULL, Gsz, Bsz,
                                   0, NULL, &event);
    clerr = clWaitForEvents(1, &event);
    clerr = clReleaseEvent(event);
    …
    /* Blocking read copies the energy grid back to the host. */
    clEnqueueReadBuffer(clcmdq, doutput, CL_TRUE, 0, volmemsz, energy,
                        0, NULL, NULL);
    clReleaseMemObject(doutput);
    clReleaseMemObject(datominfo);
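
The slide releases only the buffers; the objects created on the earlier slides would normally be released the same way once the computation is done (a hedged cleanup sketch using the names from slides 10 and 15):

    clReleaseKernel(clkern);
    clReleaseProgram(clpgm);
    clReleaseCommandQueue(clcmdq);
    clReleaseContext(clctx);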