© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2012 ECE408/CS483, ECE 498AL, University of Illinois, Urbana-Champaign ECE408 / CS483 Applied Parallel Programming.

Presentation transcript:

ECE408 / CS483 Applied Parallel Programming
Lecture 27: Introduction to OpenACC
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, ECE 498AL, University of Illinois, Urbana-Champaign

Objective
To understand the OpenACC programming model
– basic concepts and pragma types
– simple examples to illustrate basic concepts and functionalities
(© Wen-mei W. Hwu and John Stone, Urbana, July 22, 2010)

OpenACC
The OpenACC Application Programming Interface provides a set of
– compiler directives (pragmas),
– library routines, and
– environment variables
that can be used to write data-parallel Fortran, C, and C++ programs that run on accelerator devices, including GPUs and CPUs.
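A hedged illustration (not part of the original slide) of the three parts of the interface is sketched below: a compiler directive, a runtime library routine (acc_get_num_devices from openacc.h), and the ACC_DEVICE_TYPE environment variable; the device-type constant acc_device_nvidia is an implementation-specific value assumed here.

/* Sketch only: directive + library routine + environment variable.
   Assumes an OpenACC compiler such as nvc -acc or pgcc -acc. */
#include <stdio.h>
#include <openacc.h>                 /* runtime library routines */

int main(void)
{
    /* library routine: how many NVIDIA devices are visible? */
    int ndev = acc_get_num_devices(acc_device_nvidia);
    printf("accelerators found: %d\n", ndev);

    float x[1000];
    /* compiler directive: offload the loop to the accelerator */
    #pragma acc parallel loop copyout(x[0:1000])
    for (int i = 0; i < 1000; i++)
        x[i] = 2.0f * i;

    /* environment variable (set in the shell, not in code):
       ACC_DEVICE_TYPE=host ./a.out  would select the host CPU */
    printf("x[999] = %f\n", x[999]);
    return 0;
}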

OpenACC Pragmas
In C and C++, the #pragma directive is the mechanism for providing the compiler with information that is not specified in the standard language.
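As a minimal sketch of this idea (my example, not the lecture's): a compiler that does not recognize the acc pragma ignores it and still builds a correct sequential loop, while an OpenACC compiler uses the extra information to offload the loop.

/* The pragma carries information beyond standard C; a non-OpenACC
   compiler skips the unknown pragma and compiles the plain loop. */
void scale(int n, float a, float *x)
{
    #pragma acc parallel loop copy(x[0:n])
    for (int i = 0; i < n; i++)
        x[i] = a * x[i];
}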

Simple Matrix-Matrix Multiplication in OpenACC

 1  void computeAcc(float *P, const float *M, const float *N, int Mh, int Mw, int Nw)
 2  {
 3
 4    #pragma acc parallel loop copyin(M[0:Mh*Mw]) copyin(N[0:Nw*Mw]) copyout(P[0:Mh*Nw])
 5    for (int i=0; i<Mh; i++) {
 6      #pragma acc loop
 7      for (int j=0; j<Nw; j++) {
 8        float sum = 0;
 9        for (int k=0; k<Mw; k++) {
10          float a = M[i*Mw+k];
11          float b = N[k*Nw+j];
12          sum += a*b;
13        }
14        P[i*Nw+j] = sum;
15      }
16    }
17  }
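A possible host-side driver for computeAcc, not shown in the lecture, might look like the sketch below; the matrix sizes and initial values are made up, and the copyin/copyout clauses inside computeAcc handle all device transfers.

/* Hypothetical driver for computeAcc; sizes and data are placeholders. */
#include <stdlib.h>

void computeAcc(float *P, const float *M, const float *N,
                int Mh, int Mw, int Nw);

int main(void)
{
    int Mh = 256, Mw = 512, Nw = 128;
    float *M = malloc(sizeof(float) * Mh * Mw);
    float *N = malloc(sizeof(float) * Mw * Nw);
    float *P = malloc(sizeof(float) * Mh * Nw);

    for (int i = 0; i < Mh * Mw; i++) M[i] = 1.0f;
    for (int i = 0; i < Mw * Nw; i++) N[i] = 2.0f;

    computeAcc(P, M, N, Mh, Mw, Nw);   /* P = M x N on the accelerator */

    free(M); free(N); free(P);
    return 0;
}

With the PGI/NVIDIA HPC compilers such a program would typically be built with something like pgcc -acc or nvc -acc.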

Some Observations
The code is almost identical to the sequential version, except for the two #pragma lines at line 4 and line 6. OpenACC uses the compiler-directive mechanism to extend the base language.
– The #pragma at line 4 tells the compiler to generate code for the 'i' loop at lines 5 through 16 so that the loop iterations are executed in parallel on the accelerator.
– The copyin and copyout clauses specify how the matrix data should be transferred between the host and the accelerator.
– The #pragma at line 6 instructs the compiler to map the inner 'j' loop to the second level of parallelism on the accelerator.

Motivation
OpenACC programmers can often start by writing a sequential version and then annotating it with OpenACC directives,
– leaving most of the details of kernel generation and data transfer to the OpenACC compiler.
OpenACC code can be compiled by non-OpenACC compilers, which simply ignore the pragmas.
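To make this workflow concrete, here is a hedged before/after sketch (not from the slides) for a simple vector add; the annotated version differs from the sequential one only by the added pragma.

/* Sequential starting point. */
void vadd_seq(int n, const float *a, const float *b, float *c)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Same loop after annotation; the compiler generates the kernel
   and the host-device transfers. */
void vadd_acc(int n, const float *a, const float *b, float *c)
{
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}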

Frequently Encountered Issues
Some OpenACC pragmas are hints to the OpenACC compiler, which may or may not be able to act on them.
– The performance of an OpenACC program therefore depends heavily on the quality of the compiler.
– This is much less the case in CUDA or OpenCL.
Some OpenACC programs may behave differently, or even incorrectly, if the pragmas are ignored.
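One hedged illustration of the last point (my example, not the lecture's): a dot product that omits a reduction clause is still correct when the pragma is ignored and the loop runs sequentially, but may give a wrong sum if a compiler parallelizes it exactly as written.

/* Correct sequentially; if the pragma is honored without the
   reduction clause, each gang works on its own copy of 'sum'
   (or races on it) and the result may be wrong. The commented
   line shows the safe form. */
float dot(int n, const float *x, const float *y)
{
    float sum = 0.0f;
    /* #pragma acc parallel loop reduction(+:sum) copyin(x[0:n], y[0:n]) */
    #pragma acc parallel loop copyin(x[0:n], y[0:n])
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}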

OpenACC Device Model
Currently OpenACC does not allow synchronization across threads.

OpenACC Execution Model

Parallel vs. Loop Constructs

#pragma acc parallel loop copyin(M[0:Mh*Mw]) copyin(N[0:Nw*Mw]) copyout(P[0:Mh*Nw])
for (int i=0; i<Mh; i++) {
  …
}

is equivalent to:

#pragma acc parallel copyin(M[0:Mh*Mw]) copyin(N[0:Nw*Mw]) copyout(P[0:Mh*Nw])
{
  #pragma acc loop
  for (int i=0; i<Mh; i++) {
    …
  }
}

(a parallel region that consists of just a loop)

Parallel Construct
A parallel construct is executed on an accelerator. One can specify the number of gangs and the number of workers in each gang.

#pragma acc parallel copyout(a) num_gangs(1024) num_workers(32)
{
  a = 23;
}

1024*32 workers will be created; a = 23 will be executed redundantly by all 1024 gang leads.

What does each "Gang Loop" do?

#pragma acc parallel num_gangs(1024)
{
  for (int i=0; i<2048; i++) {
    …
  }
}

#pragma acc parallel num_gangs(1024)
{
  #pragma acc loop gang
  for (int i=0; i<2048; i++) {
    …
  }
}

In the first region, all 1024 gangs redundantly execute the full 2048-iteration loop; in the second, the gang loop distributes the 2048 iterations across the 1024 gangs.

Worker Loop

#pragma acc parallel num_gangs(1024) num_workers(32)
{
  #pragma acc loop gang
  for (int i=0; i<2048; i++) {
    #pragma acc loop worker
    for (int j=0; j<512; j++) {
      foo(i,j);
    }
  }
}

1024*32 = 32K workers will be created, each executing 1M/32K = 32 instances of foo().

#pragma acc parallel num_gangs(32)
{
  Statement 1; Statement 2;
  #pragma acc loop gang
  for (int i=0; i<n; i++) {
    Statement 3; Statement 4;
  }
  Statement 5; Statement 6;
  #pragma acc loop gang
  for (int i=0; i<m; i++) {
    Statement 7; Statement 8;
  }
  Statement 9;
  if (condition) Statement 10;
}

Statements 1 and 2 are redundantly executed by the 32 gangs.
The n for-loop iterations are distributed to the 32 gangs.

Kernel Regions

#pragma acc kernels
{
  #pragma acc loop num_gangs(1024)
  for (int i=0; i<2048; i++) {
    a[i] = b[i];
  }
  #pragma acc loop num_gangs(512)
  for (int j=0; j<2048; j++) {
    c[j] = a[j]*2;
  }
  for (int k=0; k<2048; k++) {
    d[k] = c[k];
  }
}

Kernel constructs are descriptive of programmer intentions.
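For contrast, a hedged sketch (not in the slides) of the same copy loop under the two constructs: kernels is descriptive, so the compiler analyzes the loop and decides whether and how to parallelize it, while parallel loop is prescriptive, with the programmer asserting that the iterations are independent. The helper names below are made up.

/* kernels: the compiler decides how (and whether) to parallelize. */
void copy_kernels(int n, const float *b, float *a)
{
    #pragma acc kernels copyin(b[0:n]) copyout(a[0:n])
    for (int i = 0; i < n; i++)
        a[i] = b[i];
}

/* parallel loop: the programmer asserts the iterations are independent. */
void copy_parallel(int n, const float *b, float *a)
{
    #pragma acc parallel loop copyin(b[0:n]) copyout(a[0:n])
    for (int i = 0; i < n; i++)
        a[i] = b[i];
}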

ANY MORE QUESTIONS?
READ CHAPTER 15