OpenACC for Fortran: PGI Compilers for Heterogeneous Supercomputing

OpenACC Features
- Single source, many targets (host+GPU, multicore, ...)
- Data management
  - structured data regions, unstructured data lifetimes
  - user-managed data coherence
- Parallelism management
  - parallel construct, kernels construct, loop directive
  - gang, worker, vector levels of parallelism
- Concurrency (async, wait)
- Interoperability (CUDA, OpenMP)

!$acc data copyin(a(:,:), v(:)) copy(x(:))
!$acc parallel
!$acc loop gang
do j = 1, n
   sum = 0.0
   !$acc loop vector reduction(+:sum)
   do i = 1, n
      sum = sum + a(i,j) * v(i)
   enddo
   x(j) = sum
enddo
!$acc end parallel
!$acc end data

!$acc data copyin(a(:,:), v(:)) copy(x(:))
call matvec( a, v, x, n )
!$acc end data
...

subroutine matvec( m, v, r, n )
   real :: m(:,:), v(:), r(:)
   !$acc parallel present(m,v,r)
   !$acc loop gang
   do j = 1, n
      sum = 0.0
      !$acc loop vector reduction(+:sum)
      do i = 1, n
         sum = sum + m(i,j) * v(i)
      enddo
      r(j) = sum
   enddo
   !$acc end parallel
end subroutine

!$acc data copyin(a(:,:), v(:)) copy(x(:))
call matvec( a, v, x, n )
!$acc end data
...

subroutine matvec( m, v, r, n )
   real :: m(:,:), v(:), r(:)
   !$acc parallel default(present)
   !$acc loop gang
   do j = 1, n
      sum = 0.0
      !$acc loop vector reduction(+:sum)
      do i = 1, n
         sum = sum + m(i,j) * v(i)
      enddo
      r(j) = sum
   enddo
   !$acc end parallel
end subroutine

!$acc data copyin(a, v, ...) copy(x)
call init( x, n )
do iter = 1, niter
   call matvec( a, v, x, n )
   call interp( b, x, n )
   !$acc update host( x )
   write(...) x
   call exch( x )
   !$acc update device( x )
enddo
!$acc end data
...

call init( v, n )
call fill( a, n )
!$acc data copy( x )
do iter = 1, niter
   call matvec( a, v, x, n )
   call interp( b, x, n )
   !$acc update host( x )
   write(...) x
   call exch( x )
   !$acc update device( x )
enddo
!$acc end data
...

subroutine init( v, n )
   real, allocatable :: v(:)
   allocate(v(n))
   v(1) = 0
   do i = 2, n
      v(i) = ....
   enddo
   !$acc enter data copyin(v)
end subroutine

use vmod
use amod
call initv( n )
call fill( n )
!$acc data copy( x )
do iter = 1, niter
   call matvec( a, v, x, n )
   call interp( b, x, n )
   !$acc update host( x )
   write(...) x
   call exch( x )
   !$acc update device( x )
enddo
!$acc end data
...

module vmod
   real, allocatable :: v(:)
contains
   subroutine initv( n )
      allocate(v(n))
      v(1) = 0
      do i = 2, n
         v(i) = ....
      enddo
      !$acc enter data copyin(v)
   end subroutine
   subroutine finiv
      !$acc exit data delete(v)
      deallocate(v)
   end subroutine
end module

use vmod
use amod
call initv( n )
call fill( n )
!$acc data copy( x )
do iter = 1, niter
   call matvec( a, v, x, n )
   call interp( b, x, n )
   !$acc update host( x )
   write(...) x
   call exch( x )
   !$acc update device( x )
enddo
!$acc end data
...

module vmod
   real, allocatable :: v(:)
   !$acc declare create(v)
contains
   subroutine initv( n )
      allocate(v(n))
      v(1) = 0
      do i = 2, n
         v(i) = ....
      enddo
      !$acc update device(v)
   end subroutine
   subroutine finiv
      deallocate(v)
   end subroutine
end module

Data Management
- Data construct: from acc data to acc end data
  - single entry, single exit (no goto in or out, no return)
- Data region: the dynamic extent of the data construct
  - the region includes any routines called during the data construct
- Dynamic data lifetime: from enter data to exit data
- Data is either present or not present on the device

!$acc data copy(x) copyin(v)
do iter = 1, niter
   call matvec( a, v, x, n )
   call interp( b, x, n )
   !$acc update host( x )
   write(...) x
   call exch( x )
   !$acc update device( x )
enddo
!$acc end data
...

subroutine matvec( m, v, r, n )
   real :: m(:,:), v(:), r(:)
   !$acc parallel present(v,r)
   !$acc loop gang
   do j = 1, n
      sum = 0.0
      !$acc loop vector &
      !$acc& reduction(+:sum)
      do i = 1, n
         sum = sum + m(i,j)*v(i)
      enddo
      r(j) = sum
   enddo
   !$acc end parallel
end subroutine

Data Management
- Data clauses:
  - copy: allocate + copyin at entry, copyout + deallocate at exit
  - copyin: allocate + copyin at entry, deallocate at exit
  - copyout: allocate at entry, copyout + deallocate at exit
  - create: allocate at entry, deallocate at exit
  - delete: deallocate at exit (only on exit data)
  - present: the data must already be present
- No data movement occurs if the data is already present
  - use the update directive for unconditional data movement
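
The deck does not otherwise show create or copyout in use; the sketch below (arrays t, r, a and the loops are placeholders, not from the original slides) illustrates both: t lives only on the device, and r is allocated on the device and copied back to the host at end data.

!$acc data copyin(a(:)) create(t(:)) copyout(r(:))
!$acc parallel loop
do i = 1, n
   t(i) = 2.0*a(i)        ! t exists only on the device (create)
enddo
!$acc parallel loop
do i = 1, n
   r(i) = t(i) + a(i)     ! r is copied out to the host at end data
enddo
!$acc end data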

!$acc data copy(x) copyin(v)
do iter = 1, niter
   call matvec( a, v, x, n )
   call interp( b, x, n )
   !$acc update host( x )
   write(...) x
   call exch( x )
   !$acc update device( x )
enddo
!$acc end data
...

subroutine matvec( m, v, r, n )
   real :: m(:,:), v(:), r(:)
   !$acc parallel copy(r) &
   !$acc& copyin(v,m)
   !$acc loop gang
   do j = 1, n
      sum = 0.0
      !$acc loop vector &
      !$acc& reduction(+:sum)
      do i = 1, n
         sum = sum + m(i,j)*v(i)
      enddo
      r(j) = sum
   enddo
   !$acc end parallel
end subroutine

Data Management
- Declare directive
  - create
    - allocatable: allocated on both host and device
    - static: statically allocated on both host and device
  - copyin
    - in a procedure: allocate and initialize for the lifetime of the procedure
  - present
    - in a procedure: the data must be present during the procedure
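
The slides only demonstrate declare create on an allocatable array; a minimal sketch of the static case follows (the module name coefs, array c, and the values are hypothetical): the fixed-size module array gets a device copy for its whole lifetime, and update moves values to it.

module coefs
   real :: c(1000)              ! statically sized module array
   !$acc declare create(c)      ! allocated on both host and device
contains
   subroutine setc()
      integer :: i
      do i = 1, 1000
         c(i) = 1.0 / i
      enddo
      !$acc update device(c)    ! copy host values to the device copy
   end subroutine
end module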

!$acc data copy(x) copyin(v)
do iter = 1, niter
   call matvec( a, v, x, n )
   call interp( b, x, n )
   !$acc update host( x )
   write(...) x
   call exch( x )
   !$acc update device( x )
enddo
!$acc end data
...

subroutine matvec( m, v, r, n )
   real :: m(:,:), v(:), r(:)
   !$acc declare copyin(v,m)
   !$acc declare present(r)
   !$acc parallel
   !$acc loop gang
   do j = 1, n
      sum = 0.0
      !$acc loop vector &
      !$acc& reduction(+:sum)
      do i = 1, n
         sum = sum + m(i,j)*v(i)
      enddo
      r(j) = sum
   enddo
   !$acc end parallel
end subroutine

Data Management
- Update directive
  - device(x,y,z)
  - host(x,y,z) or self(x,y,z)
  - the data must be present
  - subarrays are allowed, even noncontiguous subarrays (see the sketch below)
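
A hedged sketch of updating noncontiguous subarrays (x, its bounds, and exchange_halo are placeholders, not from the original slides): in Fortran's column-major layout the rows x(1,:) and x(n,:) are strided, yet they can still be named in an update.

! refresh only the boundary rows of x on the host, exchange them,
! then push the (noncontiguous) boundary rows back to the device
!$acc update host( x(1,:), x(n,:) )
call exchange_halo( x )            ! hypothetical host-side exchange
!$acc update device( x(1,:), x(n,:) )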

!$acc data copy(x) copyin(v)
do iter = 1, niter
   call matvec( a, v, x, n )
   call interp( b, x, n )
   !$acc update host( x )
   write(...) x
   call exch( x )
   !$acc update device( x )
enddo
!$acc end data
...

subroutine matvec( m, v, r, n )
   real :: m(:,:), v(:), r(:)
   !$acc declare copyin(v,m)
   !$acc declare present(r)
   !$acc parallel
   !$acc loop gang
   do j = 1, n
      sum = 0.0
      !$acc loop vector &
      !$acc& reduction(+:sum)
      do i = 1, n
         sum = sum + m(i,j)*v(i)
      enddo
      r(j) = sum
   enddo
   !$acc end parallel
end subroutine

Parallelism Management
- Parallel construct: from acc parallel to acc end parallel
- Parallel region: the dynamic extent of the parallel construct
  - may call procedures on the device (acc routine directive)
- gang, worker, vector parallelism
- launches a kernel with a fixed number of gangs, a fixed number of workers, and a fixed vector length
- usually just use acc parallel loop ....

!$acc parallel present(a,b,c)
do i = 1, n
   a(i) = b(i) + c(i)
enddo
!$acc end parallel
...

!$acc parallel present(a,b,c)
!$acc loop gang vector
do i = 1, n
   a(i) = b(i) + c(i)
enddo
!$acc end parallel
...

!$acc parallel present(a,b,c)
!$acc loop gang
do j = 1, n
   !$acc loop vector
   do i = 1, n
      a(i,j) = b(i) + c(j)
   enddo
enddo
!$acc end parallel
...

Parallelism Management
- Loop directive
  - acc loop seq: run this loop sequentially
  - acc loop gang: run this loop across gangs
  - acc loop vector: run this loop in vector/SIMD mode
  - acc loop auto: let the compiler detect whether this loop is parallel
  - acc loop independent: assert that this loop IS parallel
  - acc loop reduction(+:variable): sum reduction
  - acc loop private(t): a private copy of t for each loop iteration
  - acc loop collapse(2): collapse two nested loops into one (see the sketch below)
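
The collapse, private, and reduction clauses are not demonstrated elsewhere in the deck; here is a minimal sketch combining them (arrays a and b, scalar err, and the loop bounds are placeholders, not from the original slides).

! the two nested loops are collapsed into one parallel loop,
! t is private to each iteration, and err is a sum reduction
err = 0.0
!$acc parallel loop collapse(2) private(t) reduction(+:err)
do j = 1, n
   do i = 1, n
      t = a(i,j) - b(i,j)
      err = err + t*t
   enddo
enddo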

Parallelism Management
- Kernels construct: from acc kernels to acc end kernels
- Kernels region: the dynamic extent of the kernels construct
  - may call procedures on the device (acc routine directive)
- gang, worker, vector parallelism
- launches one or more kernels
- usually just use acc kernels loop ....

!$acc kernels present(a,b,c)
do i = 1, n
   a(i) = b(i) + c(i)
enddo
!$acc end kernels
...

!$acc kernels present(a,b,c)
do j = 1, n
   do i = 1, n
      a(i,j) = b(i) + c(j)
   enddo
enddo
!$acc end kernels
...

Parallelism Management
- acc parallel
  - more prescriptive, more like OpenMP parallel
  - user-specified parallelism
  - acc loop implies acc loop independent
- acc kernels
  - more descriptive, depends more on compiler analysis
  - compiler-discovered parallelism
  - acc loop implies acc loop auto
  - less useful in C/C++

Building OpenACC Programs
- pgfortran (pgf90, pgf95)
- -help (-help -ta, -help -acc)
- -acc: enable OpenACC directives
- -ta: select the target accelerator (-ta=tesla)
- -Minfo or -Minfo=accel
- compile, link, and run as normal

Building OpenACC Programs
- -acc=sync: ignore async clauses
- -acc=noautopar: disable autoparallelization
- -ta=tesla:cc20,cc30,cc35: select compute capability
- -ta=tesla:cuda7.0: select the CUDA toolkit version
- -ta=tesla:nofma: disable fused multiply-add
- -ta=tesla:nordc: disable relocatable device code
- -ta=tesla:fastmath: use the fast, lower-precision math library
- -ta=tesla:managed: allocate data in managed memory
- -ta=multicore: generate parallel multicore (host) code

Building OpenACC Programs
- -Minline: enable procedure inlining
- -Minline=levels:2: two levels of inlining
- -O: enable optimization
- -fast: more optimization
- -tp: set the target processor (the default is the build processor)
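
Putting the pieces together, a typical build line might look like the following (the source file and program names are placeholders; cc35 and cuda7.0 are just one plausible combination of the suboptions listed above).

pgfortran -acc -ta=tesla:cc35,cuda7.0 -Minfo=accel -fast -o myprog myprog.f90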

Running OpenACC Programs
- ACC_DEVICE_NUM: set the device number to use
- PGI_ACC_TIME: set to collect profile information
- PGI_ACC_NOTIFY: bitmask selecting which activity to report
  - 1: kernel launches
  - 2: data uploads/downloads
  - 4: wait events
  - 8: region entry/exit
  - 16: data allocates/frees
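
Because PGI_ACC_NOTIFY is a bitmask, the values add: for example, 1 + 2 = 3 reports both kernel launches and data transfers. A bash-style invocation (the program name is a placeholder) might be:

export PGI_ACC_NOTIFY=3    # 1 (kernel launches) + 2 (data uploads/downloads)
./myprog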

Performance Tuning
- Data management
  - data regions or dynamic data management
  - minimize the frequency and volume of data traffic
- Parallelism management
  - get as many loops running in parallel as possible
- Kernel schedule tuning
  - which loops run in gang mode, which in vector mode

Data Management
- Profile to find where data movement occurs
- Insert data directives to remove data movement
- Insert update directives to manage coherence
- See async below

Kernel Schedule Tuning
- Look at -Minfo messages, the profile, PGI_ACC_TIME
- Is enough gang parallelism generated?
  - gangs = thread blocks
  - too few if gangs << SM count
- Is too much vector parallelism generated?
  - vector = thread
  - too much if vector length >> loop trip count
- Loop collapsing
- Worker parallelism for intermediate loops (see the sketch below)
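
Worker parallelism is not demonstrated elsewhere in the deck; a minimal sketch for a triply nested loop follows (the array names and bounds are placeholders, not from the original slides): the outer loop runs across gangs, the middle loop across workers, and the inner loop in vector mode.

!$acc parallel loop gang
do k = 1, nz
   !$acc loop worker
   do j = 1, ny
      !$acc loop vector
      do i = 1, nx
         a(i,j,k) = b(i,j,k) + c(i,j,k)
      enddo
   enddo
enddo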

!$acc parallel present(a,b,c)
!$acc loop gang vector
do i = 1, n
   a(i) = b(i) + c(i)
enddo
!$acc end parallel
...

!$acc parallel present(a,b,c) num_gangs(30) vector_length(64)
!$acc loop gang
do j = 1, n
   !$acc loop vector
   do i = 1, n
      a(i,j) = b(i) + c(j)
   enddo
enddo
!$acc end parallel
...

!$acc kernels present(a,b,c)
!$acc loop gang(32)
do j = 1, n
   !$acc loop vector(64)
   do i = 1, n
      a(i,j) = b(i) + c(j)
   enddo
enddo
!$acc end kernels
...

!$acc kernels present(a,b,c)
!$acc loop gang vector(4)
do j = 1, n
   !$acc loop gang vector(32)
   do i = 1, n
      a(i,j) = b(i) + c(j)
   enddo
enddo
!$acc end kernels
...

Routines
- Must tell the compiler which routines to compile for the device
  - acc routine
- Must tell the compiler what parallelism is used in the routine
  - acc routine gang / worker / vector / seq
- May be used to interface to native CUDA C

subroutine asub( a, b, x, n )
   real a(*), b(*)
   real x
   integer n
   integer i
   !$acc loop gang vector
   do i = 1, n
      a(i) = x*b(i)
   enddo
end subroutine

subroutine asub( a, b, x, n )
   !$acc routine gang
   real a(*), b(*)
   real, value :: x
   integer, value :: n
   integer i
   !$acc loop gang vector
   do i = 1, n
      a(i) = x*b(i)
   enddo
end subroutine

!$acc routine(asub) gang

interface
   subroutine asub(a,b,x,n)
      !$acc routine gang
      real a(*), b(*)
      real, value :: x
      integer, value :: n
   end subroutine
end interface

use asub_mod
!$acc parallel present(a,b,x) num_gangs(n/32) vector_length(32)
call asub(a, b, x, n)
!$acc end parallel

!$acc parallel present(a,b,c) num_gangs(n) vector_length(64)
!$acc loop gang
do j = 1, n
   call asubv( a(1,j), b, c(j), n )
enddo
!$acc end parallel
...

subroutine asubv( a, b, x, n )
   !$acc routine vector
   ...
   !$acc loop vector
   do i = 1, n
      a(i) = x*b(i)
   enddo
end subroutine

!$acc parallel present(a,b,c) num_gangs(n) vector_length(64)
call msub( a, b, c, n )
!$acc end parallel
...

subroutine msub( a, b, c, n )
   !$acc routine gang
   !$acc routine(asubv) vector
   ...
   !$acc loop gang
   do j = 1, n
      call asubv( a(1,j), b, c(j), n )
   enddo
end subroutine

Routines
- The routine must know it is being compiled for the device
- Caller and callee must agree on the level of parallelism
  - modules
- Scalar arguments passed by value are more efficient

Asynchronous Operation
- async clause on parallel, kernels, enter data, exit data, update (and on data, as a PGI extension)
- the async argument is the queue number to use
  - PGI supports 16 queues; they map to CUDA streams
  - the default is the "synchronous" queue (not the null queue)
- wait directive to synchronize the host with async queue(s)
- wait directive to synchronize between async queues
- behavior of the synchronous queue with -Mcuda[lib]

!$acc parallel loop gang ... async
do j = 1, n
   call asubv( a(1,j), b, c(j), n )
enddo
...
!$acc parallel loop gang ... async
do j = 1, n
   call dosomethingelse(...)
enddo
...
!$acc parallel loop gang ... async
do j = 1, n
   call doother(...)
enddo
!$acc wait

!$acc parallel loop gang ... async(1)
do j = 1, n
   call asubv( a(1,j), b, c(j), n )
enddo
...
!$acc parallel loop gang ... async(1)
do j = 1, n
   call dosomethingelse(...)
enddo
...
!$acc parallel loop gang ... async(1)
do j = 1, n
   call doother(...)
enddo
!$acc wait

!$acc parallel loop gang ... async(1)
do j = 1, n
   call asubv( a(1,j), b, c(j), n )
enddo
...
!$acc parallel loop gang ... async(1)
do j = 1, n
   call dosomethingelse(...)
enddo
...
!$acc parallel loop gang ... async(1)
do j = 1, n
   call doother(...)
enddo
!$acc wait(1)

!$acc parallel loop gang ... async(1)
do j = 1, n
   call asubv( a(1,j), b, c(j), n )
enddo
...
!$acc parallel loop gang ... async(2)
do j = 1, n
   call dosomethingelse(...)
enddo
...
!$acc parallel loop gang ... async(3)
do j = 1, n
   call doother(...)
enddo
!$acc wait

!$acc parallel loop gang ... async(1)
do j = 1, n
   call asubv( a(1,j), b, c(j), n )
enddo
...
!$acc parallel loop gang ... async(2)
do j = 1, n
   call dosomethingelse(...)
enddo
...
!$acc parallel loop gang ... async(3)
do j = 1, n
   call doother(...)
enddo
!$acc wait(1,2)

!$acc parallel loop gang ... async(1)
do j = 1, n
   call asubv( a(1,j), b, c(j), n )
enddo
...
!$acc parallel loop gang ... async(2)
do j = 1, n
   call dosomethingelse(...)
enddo
...
!$acc wait(2) async(1)
!$acc parallel loop gang ... async(1)
do j = 1, n
   call doother(...)
enddo
!$acc wait(1)

!$acc parallel loop gang ... async(1)
do j = 1, n
   call asubv( a(1,j), b, c(j), n )
enddo
...
!$acc update host(a) async(1)
...
!$acc parallel loop gang ... async(1)
do j = 1, n
   call doother(...)
enddo
...
!$acc wait(1)