Lecture 14: Introduction to OpenACC
Kyu Ho Park, May 12, 2016
Ref:
1. David Kirk and Wen-mei Hwu, Programming Massively Parallel Processors, MK and NVIDIA.
2. Jeff Larkin, Introduction to OpenACC, NVIDIA.

Acceleration Methods[2]
1. Using Libraries: "drop-in" acceleration
2. OpenACC Directives: easy way to accelerate
3. Programming Languages: very flexible method

OpenACC: an open programming standard for parallel computing using GPU directives.
 A high-level programming model for accelerator-based architectures.
 Developed by a consortium consisting of CAPS, Cray, NVIDIA and the Portland Group.
Reasons to use OpenACC:
 Productivity
 Portability
 Performance feedback

Ex: Program in C
#pragma acc parallel num_gangs(1024)
{
    for (int i = 0; i < 2048; i++) {
    }
}

Ex: Program in Fortran
!$acc parallel [clause ...]
    ...
!$acc end parallel
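For comparison, a complete, compilable C version in the same spirit (a minimal sketch, not code from the slides; the saxpy function, the data clauses, and the compile command, e.g. pgcc -acc saxpy.c, are assumptions about a typical OpenACC toolchain):

#include <stdio.h>
#include <stdlib.h>

/* y = a*x + y, offloaded with a combined parallel loop directive. */
void saxpy(int n, float a, float *restrict x, float *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    int n = 2048;
    float *x = (float *)malloc(n * sizeof(float));
    float *y = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy(n, 3.0f, x, y);               /* y[i] = 3*1 + 2 = 5 */
    printf("y[0] = %f (expect 5.0)\n", y[0]);

    free(x);
    free(y);
    return 0;
}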

[Mathew Colgrove, Directive-based Accelerator Programming With OpenACC]

OpenACC: the standard for GPU directives
 Simple: the easy way to accelerate compute-intensive applications.
 Open: makes GPU programming straightforward and portable across parallel and multi-core processors.
 Powerful: allows complete access to the massive parallel power of a GPU.
 High-Level: compiler directives to specify parallel regions in C and Fortran.
 Portable across OSes, host CPUs, accelerators and compilers.

Benefit of OpenACC[1]
 OpenACC programmers can start by writing a sequential version and then annotate their sequential program with OpenACC directives.
 OpenACC provides an incremental path for moving legacy applications to accelerators. Adding directives disturbs the existing code less than other approaches.
 OpenACC allows a programmer to write programs in such a way that, when the directives are ignored, the program still runs sequentially and gives the same result as when it is run in parallel (this property does not hold automatically for all OpenACC programs).
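A small sketch of this property (illustrative code, not from the slides): the same source builds with an OpenACC compiler or with a plain C compiler, which simply ignores the pragma, and the reduction result is the same either way.

#include <stdio.h>

int main(void)
{
    const int n = 2048;
    double v[2048];
    double sum = 0.0;
    for (int i = 0; i < n; i++) v[i] = 1.0;

    /* With an OpenACC compiler this loop is offloaded; a compiler that does
       not know OpenACC treats the pragma as a no-op and runs the same loop
       sequentially, producing the same result. */
    #pragma acc parallel loop reduction(+:sum) copyin(v[0:n])
    for (int i = 0; i < n; i++)
        sum += v[i];

    printf("sum = %f (expect 2048.0)\n", sum);
    return 0;
}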

Execution Model
 OpenACC: Host + Accelerator device
 Host:
   Main program runs on the CPU
   Code is transferred to the accelerator
   Execution is started
   Wait for completion
 Accelerator device:
   Gangs
   Workers
   Vector operations

OpenACC Task Granularity: Gang, Worker, Vector
The accelerator is a collection of Processing Elements (PEs), where each PE is multithreaded and each thread can execute vector instructions.

Relations between CUDA and OpenACC Tasks
OpenACC    CUDA
Gang       Block
Worker     Warp
Vector     Thread
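A brief sketch (not from the lecture) of how these levels line up with an explicit CUDA launch configuration; the function name vec_add, the sizes, and the clauses are illustrative assumptions, and the compiler remains free to pick its own schedule:

/* Hypothetical vector addition showing how the clauses relate to a CUDA
   launch such as add<<<8, 256>>>(...):
     num_gangs(8)       ~ 8 thread blocks   (gang   -> block)
     vector_length(256) ~ 256 threads/block (vector -> thread)
   Workers correspond roughly to warps. */
void vec_add(const float *restrict a, const float *restrict b,
             float *restrict c, int n)
{
    #pragma acc parallel loop gang vector num_gangs(8) vector_length(256) \
                copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}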

Memory Model
 The host memory and the device memory are separate spaces in OpenACC.
 The host cannot access the device memory directly, and the device cannot access the host memory directly.
 In CUDA C, programmers must explicitly code data movement through APIs.
 In OpenACC, we just annotate which memory objects need to be transferred.
Ex:
#pragma acc parallel loop copyin(M[0:Mh*Mw]) copyin(N[0:Mw*Nw]) copyout(P[0:Mh*Nw])
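As a hedged sketch of how the annotation above might appear in context (the matmul loop body is an assumption based on a standard matrix multiply, not the lecture's exact code):

/* Simple matrix multiply P = M x N, with M of size Mh x Mw and N of size
   Mw x Nw.  The data clauses describe what to move; there are no explicit
   cudaMalloc/cudaMemcpy calls as there would be in CUDA C. */
void matmul(const float *restrict M, const float *restrict N,
            float *restrict P, int Mh, int Mw, int Nw)
{
    #pragma acc parallel loop copyin(M[0:Mh*Mw], N[0:Mw*Nw]) copyout(P[0:Mh*Nw])
    for (int row = 0; row < Mh; row++) {
        #pragma acc loop
        for (int col = 0; col < Nw; col++) {
            float sum = 0.0f;
            for (int k = 0; k < Mw; k++)
                sum += M[row * Mw + k] * N[k * Nw + col];
            P[row * Nw + col] = sum;
        }
    }
}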

OpenACC Programs: Accelerator Compute Constructs
 parallel
 kernels
 loop
 data

Parallel Construct
 To specify which part of the program is to be executed on the accelerator, there are two constructs:
   the parallel construct
   the kernels construct
 When the program encounters a parallel construct, the execution of the code within the structured block of the construct (called a parallel region) is moved to the accelerator. Gangs of workers are created on the accelerator to execute the parallel region.
 Initially, only one worker (called the gang lead) within each gang executes the parallel region; the other workers are conceptually idle at this point. They are deployed when there is more parallel work at an inner level.
Ex:
#pragma acc parallel copyout(a) num_gangs(1024) num_workers(32)
{
    a = 32;
}

Kernels Construct
 A kernels region may be broken into a sequence of kernels, each of which is executed on the accelerator, whereas a whole parallel region becomes a single kernel.
Ex:
#pragma acc kernels
{
    #pragma acc loop num_gangs(1024)
    for (int i = 0; i < 2048; i++) {
        a[i] = b[i];
    }
    #pragma acc loop num_gangs(512)
    for (int j = 0; j < 2048; j++) {
        c[j] = a[j] * 2;
    }
    for (int k = 0; k < 2048; k++) {
        d[k] = c[k];
    }
}

Loop Construct - Gang Loop
Ex:
#pragma acc parallel num_gangs(1024)
{
    for (int i = 0; i < 2048; i++) {
        c[i] = a[i] + b[i];
    }
}

#pragma acc parallel num_gangs(1024)
{
    #pragma acc loop gang
    for (int i = 0; i < 2048; i++) {
        c[i] = a[i] + b[i];
    }
}

Loop Construct - Worker Loop
Ex:
#pragma acc parallel num_gangs(1024) num_workers(32)
{
    #pragma acc loop gang
    for (int i = 0; i < 2048; i++) {
        #pragma acc loop worker
        for (int j = 0; j < 512; j++) {
            foo(i, j);
        }
    }
}

Loop Construct - Vector Loop
Ex:
#pragma acc parallel num_gangs(1024) num_workers(32) vector_length(32)
{
    #pragma acc loop gang
    for (int i = 0; i < 2048; i++) {
        #pragma acc loop worker
        for (int j = 0; j < 512; j++) {
            #pragma acc loop vector
            for (int k = 0; k < 1024; k++) {
                foo(i, j, k);
            }
        }
    }
}
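In practice the explicit num_gangs/num_workers/vector_length counts are often omitted and only the parallelism levels are named; a brief sketch (illustrative only, not from the slides; scale2d and its arguments are hypothetical) of the common gang-outer/vector-inner pattern:

/* Outer loop split across gangs, inner loop across vector lanes;
   the compiler chooses the actual counts. */
void scale2d(float *restrict a, int rows, int cols)
{
    #pragma acc parallel loop gang copy(a[0:rows*cols])
    for (int i = 0; i < rows; i++) {
        #pragma acc loop vector
        for (int j = 0; j < cols; j++)
            a[i * cols + j] *= 2.0f;
    }
}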

Data Management
 Data clauses:
   copyin(list)
   copyout(list)
   copy(list)
   create(list)
   deviceptr(list)
   and others...
 Data construct:
#pragma acc data copy(field) create(tmpfield)
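A hedged sketch of how such a data region might be used (the smooth function, its loops and array bounds are illustrative assumptions, not the lecture's code): field and tmpfield stay resident on the device across both compute regions, so no extra host/device transfers happen between the two loops.

void smooth(float *restrict field, float *restrict tmpfield, int n)
{
    #pragma acc data copy(field[0:n]) create(tmpfield[0:n])
    {
        #pragma acc parallel loop
        for (int i = 1; i < n - 1; i++)
            tmpfield[i] = 0.5f * (field[i - 1] + field[i + 1]);

        #pragma acc parallel loop
        for (int i = 1; i < n - 1; i++)
            field[i] = tmpfield[i];
    }
}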

Asynchronous Computation and Data Transfer
 The 'async' clause can be added to a parallel, kernels, or update directive to enable asynchronous execution.
 If there is no async clause, the host process waits until the parallel region, kernels region, or updates are complete before continuing.
Ex:
#pragma acc parallel async
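A minimal sketch (not from the slides) combining async with the wait directive; the process function, queue numbers and loop bodies are illustrative assumptions:

/* Both compute regions are queued asynchronously and the host continues;
   "#pragma acc wait" blocks until all outstanding asynchronous work is done. */
void process(float *restrict a, float *restrict b, int n)
{
    #pragma acc parallel loop async(1) copy(a[0:n])
    for (int i = 0; i < n; i++)
        a[i] *= 2.0f;

    #pragma acc parallel loop async(2) copy(b[0:n])
    for (int i = 0; i < n; i++)
        b[i] += 1.0f;

    /* ... other host work could overlap with the device here ... */

    #pragma acc wait
}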

Reading and Presentation List
1. MRI and CT Processing with MATLAB and CUDA: 강은희, 이주영
2. Matrix Multiplication with CUDA, Robert Hochberg, 2012: 박겨레
3. Optimizing Matrix Transpose in CUDA, Greg Ruetsch and Paulius Micikevicius, 2010: 박일우
4. NVIDIA Profiler User's Guide: 노성철
5. Monte Carlo Methods in CUDA: 조정석
6. Optimizing Parallel Reduction in CUDA, Mark Harris, NVIDIA: 박주연
7. Deep Learning and Multi-GPU: 박종찬
8. Image Processing with CUDA, Jia Tse, 2006: 최우석, 김치현
9. Image Convolution with CUDA, Victor Podlozhnyuk, 2007: Homework#4

Second Term Reading List
10. Parallel Genetic Algorithm on CUDA Architecture, Petr Pospichal, Jiri Jaros, and Josef Schwarz, 2010: 양은주, May 19, 2016
11. Texture Memory, Chap 7 of CUDA by Example: 전민수, May 19, 2016
12. Atomics, Chap 9 of CUDA by Example: 이상록, May 26, 2016
13. Sparse Matrix-Vector Product: 장형욱, May 26, 2016
14. Solving Ordinary Differential Equations on GPUs: 윤종민, June 2, 2016
15. Fast Fourier Transform on GPUs: 이한섭, June 2, 2016
16. Building an Efficient Hash Table on GPU: 김태우, June 9, 2016
17. Efficient Batch LU and QR Decomposition on GPU, Brouwer and Taunay: 채종욱, June 9, 2016
18. CUDA vs OpenACC, Hoshino et al.: 홍찬솔, June 9, 2016

Description of Term Project (Homework#6)
1. Evaluation Guideline:
   Homework (5 homeworks): 30%
   Term Project: 20%
   Presentations: 10%
   Mid-term Exam: 15%
   Final Exam: 25%
2. Schedule:
   (1) May 26: Proposal Submission
   (2) June 7: Progress Report Submission
   (3) June 24: Final Report Submission
3. Project Guidelines:
   (1) Team-based (2 students/team)
   (2) Subject: free to choose
   (3) Show your implementation
       a. in C
       b. in CUDA C
       c. in OpenCL (optional)
   (4) You have to analyze and explain the performance of each implementation, with a detailed description of your design.