GPU Computing with OpenACC Directives


GPU Computing with OpenACC Directives. Presented by John Urbanic, Pittsburgh Supercomputing Center. Authored by Mark Harris, NVIDIA Corporation.

GPUs Reaching a Broader Set of Developers
Application areas: CAE, CFD, finance, rendering, data analytics, life sciences, defense, weather, climate, plasma physics.
As I mentioned at the start of this talk, GPGPU started as the domain of a few hungry graphics PhD students. Its momentum is growing rapidly, and we're now seeing GPU computing reach a broader set of developers. It's being used by research universities, supercomputing centers, and oil and gas companies, as well as a number of software developers in a variety of fields. But there is still a lot of software out there that could be accelerated, and not every programmer can rewrite part of his or her application to use CUDA, for one reason or another.
(Slide chart: adoption over time, from early adopters in 2004 (research, universities, supercomputing centers, oil & gas) to 100,000's of developers at present.)

3 Ways to Accelerate Applications: Libraries ("drop-in" acceleration), OpenACC directives (easily accelerate applications), and programming languages (maximum flexibility).
Many professional applications have already been ported to GPUs. If you are a developer interested in accelerating your own application, you have 3 main options for adding GPU acceleration. In order of increasing effort, they are:
1. Drop in pre-optimized GPU-accelerated libraries as an alternative to MKL, IPP, FFTW, and other widely used libraries.
2. Use OpenACC directives (compiler "hints") in your source code to automatically parallelize compute-intensive loops.
3. Use the programming languages you already know to implement your own parallel algorithms.
For each performance-critical area in your application, these approaches can be used independently or together. For example, you might use a GPU-accelerated library to perform some initial calculations on your data and then use OpenACC to accelerate your own code to perform custom computations that are not (yet) available in a library. CUDA is a parallel computing platform that supports all of these approaches.

OpenACC Directives: simple compiler hints; the compiler parallelizes the code; works on many-core GPUs and multicore CPUs.

Your original Fortran or C code, with an OpenACC compiler hint added:

  Program myscience
     ... serial code ...
  !$acc kernels
     do k = 1,n1
        do i = 1,n2
           ... parallel code ...
        enddo
     enddo
  !$acc end kernels
     ...
  End Program myscience

OpenACC is an open programming standard for parallel computing on accelerators such as GPUs, using compiler directives. OpenACC compiler directives are simple "hints" to the compiler that identify parallel regions of the code to accelerate. The compiler automatically accelerates these regions without requiring changes to the underlying code. Directives expose parallelism to the compiler so that it can do the detailed work of mapping the computation onto parallel accelerators such as GPUs. Additional directive options can be used to provide more information to the compiler, for example about loop dependencies and data sharing. Compilers without support for OpenACC will ignore the directives as if they were code comments, ensuring that the code remains portable across many-core GPUs and multi-core CPUs.

Familiar to OpenMP Programmers

OpenMP (CPU):

  main() {
    double pi = 0.0; long i;
  #pragma omp parallel for reduction(+:pi)
    for (i=0; i<N; i++) {
      double t = (double)((i+0.05)/N);
      pi += 4.0/(1.0+t*t);
    }
    printf("pi = %f\n", pi/N);
  }

OpenACC (CPU or GPU):

  main() {
    double pi = 0.0; long i;
  #pragma acc kernels
    for (i=0; i<N; i++) {
      double t = (double)((i+0.05)/N);
      pi += 4.0/(1.0+t*t);
    }
    printf("pi = %f\n", pi/N);
  }

High-level directives: add a few lines of code and let the auto-parallelizing compiler and runtime determine the best mapping of functions to the available processors and their respective capabilities. While this simplistic approach will yield some performance gains, in many cases code will require restructuring for best performance.

OpenACC: Open Programming Standard for Parallel Computing
"OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan." --Buddy Bland, Titan Project Director, Oak Ridge National Lab
"OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others. We look forward to releasing a version of this proposal in the next release of OpenMP." --Michael Wong, CEO, OpenMP Directives Board
OpenACC was developed by The Portland Group (PGI), Cray, CAPS, and NVIDIA. PGI, Cray, and CAPS have spent over 2 years developing and shipping commercial compilers that use directives to enable GPU acceleration as core technology. The small differences between their approaches allowed the formation of a group to standardize a single directives approach for accelerators and CPUs. OpenACC's basis in existing technology and initial support by multiple vendors will help accelerate its rate of adoption, providing a stable and portable platform for developers to target.

OpenACC: The Standard for GPU Directives
- Easy: directives are the easy path to accelerate compute-intensive applications.
- Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors.
- Powerful: GPU directives allow complete access to the massive parallel power of a GPU.
OpenACC is easy, open, and powerful. Directives provide access to the high performance of parallel accelerators without having to learn a parallel programming language or rewrite large portions of existing code. OpenACC is an open standard supported by multiple vendors, enabling straightforward development of portable accelerated code. OpenACC directives allow complete access to the parallel power of GPUs, in many cases achieving nearly the performance of direct programming of parallel kernels in languages such as CUDA C and CUDA Fortran.

High-level, with low-level access
- Compiler directives to specify parallel regions in C, C++, and Fortran
  - OpenACC compilers offload parallel regions from host to accelerator
  - Portable across OSes, host CPUs, accelerators, and compilers
- Create high-level heterogeneous programs
  - Without explicit accelerator initialization
  - Without explicit data or program transfers between host and accelerator
- Programming model allows programmers to start simple
  - Enhance with additional guidance for the compiler on loop mappings, data location, and other performance details
- Compatible with other GPU languages and libraries
  - Interoperate between CUDA C/Fortran and GPU libraries, e.g. CUFFT, CUBLAS, CUSPARSE, etc. (see the sketch below)
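As an illustration of the library-interoperability point, here is a minimal, hypothetical sketch (not from the original slides): OpenACC manages the device copies of x and y, and host_data hands their device addresses to the legacy cuBLAS SAXPY routine. The function name and the use of the legacy cublas.h API (with cublasInit() assumed to have been called elsewhere) are assumptions.

  #include <cublas.h>   /* legacy cuBLAS API; assumed available and initialized via cublasInit() */

  void saxpy_cublas(int n, float a, float *x, float *restrict y)
  {
      /* OpenACC creates and manages the device copies of x and y */
  #pragma acc data copyin(x[0:n]) copy(y[0:n])
      {
          /* host_data exposes the device addresses so cuBLAS can use them directly */
  #pragma acc host_data use_device(x, y)
          {
              cublasSaxpy(n, a, x, 1, y, 1);   /* y = a*x + y, computed on the GPU */
          }
      }
  }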

Directives: Easy & Powerful
- Real-Time Object Detection, Global Manufacturer of Navigation Systems: 5x in 40 hours
- Valuation of Stock Portfolios using Monte Carlo, Global Technology Consulting Company: 2x in 4 hours
- Interaction of Solvents and Biomolecules, University of Texas at San Antonio: 5x in 8 hours
"Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications." -- Developer at the Global Manufacturer of Navigation Systems
In initial trials of OpenACC, users have achieved significant speedups of existing code in just hours with only minor modifications. Parallel computations such as object detection in image sequences, Monte Carlo stock portfolio valuations, and molecular dynamics simulations are good examples of the types of computations that have the potential for good speedups with OpenACC. In all three of these examples, most of the time was spent exploring the existing code to find hot spots with loops that can be parallelized. The developers of all three examples were experts in their domains, with a good understanding of their application's performance and data, but with little or no experience programming GPUs. Even without GPU experience, they were able to get significant acceleration with relatively little effort. In some cases it is necessary to make small changes to applications to expose sufficient parallelism, for example changing the layout of data in memory so that memory can be accessed in parallel by many threads without cache conflicts. We find that changes like this are valuable in general because they often benefit performance on the CPU even before OpenACC directives are added, and since the same code can be compiled with and without the directives, all platforms benefit.
Additional information:
- Global Manufacturer of Navigation Systems (object detection): the application detects scene objects in image sequences, also using point clouds from laser scans (scans with low density). The computationally intensive part is calculating several features for each window, such as mean color, median, and distance variance.
- Global Technology Consulting Company: doing a couple of things, one finding the valuation of a stock portfolio using Monte Carlo simulation, the other finding option prices for a portfolio of stocks using Black-Scholes. It is compute-intensive in that, for a large basket of instruments, each stock/option is calculated in a loop. Just by adding one line ("#pragma acc for") into the code, the developer got 2x immediately.
- University of Texas at San Antonio: in the beginning stages of developing algorithms to describe protein and solvent interactions. Having just finished modeling the work in Matlab and Fortran, the researcher needs to develop production code, with the goal of algorithms and solutions much faster than existing ones, expecting a big impact in the computational study of organic and biomolecules.

Small Effort. Real Impact.
- Large Oil Company: 3x in 7 days. Solving billions of equations iteratively for oil production at the world's largest petroleum reservoirs.
- Univ. of Houston, Prof. M.A. Kayali: 20x in 2 days. Studying magnetic systems for innovations in magnetic storage media and memory, field sensors, and biomagnetism.
- Univ. of Melbourne, Prof. Kerry Black: 65x in 2 days. Better understanding the complex lifecycles of snapper fish in Port Phillip Bay.
- Ufa State Aviation, Prof. Arthur Yuldashev: 7x in 4 weeks. Generating stochastic geological models of oilfield reservoirs with borehole data.
- GAMESS-UK, Dr. Wilkinson, Prof. Naidoo: 10x. Used in various fields such as investigating biofuel production and molecular sensors.
* Achieved using the PGI Accelerator Compiler

Focus on Exposing Parallelism
With directives, tuning work focuses on exposing parallelism, which makes codes inherently better. Example: application tuning work using directives for the new Titan system at ORNL.
- S3D: research more efficient combustion with next-generation fuels. Tuning the top 3 kernels (90% of runtime): 3 to 6x faster on CPU+GPU vs. CPU+CPU, and the all-CPU version also improved by 50%.
- CAM-SE: answer questions about specific climate change adaptation and mitigation scenarios. Tuning the top key kernel (50% of runtime): 6.5x faster on CPU+GPU vs. CPU+CPU, and performance of the CPU version improved by 100%.
The great thing about compiler directives is that the work you do to expose parallelism to the compiler, to make it perform better with directives, is not platform-specific work; it is work on improving the code in general. What we found working with engineers at Oak Ridge in preparation for Titan is that preparing the code to get good speedups from directives also made it run substantially better on Jaguar, their all-CPU system. In the S3D combustion code, the focus is on the top 3 kernels, which make up 90% of the runtime. This effort made the code 3-6x faster on CPU+GPU than on CPU only, but the effort to expose the parallelism to directives also improved the CPU-only version by a factor of 1.5. A similar effort on the CAM-SE climate code provided a 6.5x speedup on CPU+GPU, and the code changes also improved the CPU version by a factor of 2.

OpenACC Specification and Website
- Full OpenACC 1.0 specification available online: http://www.openacc-standard.org
- Quick reference card also available
- Beta implementations available now from PGI, Cray, and CAPS

Start Now with OpenACC Directives
Sign up for a free trial of the directives compiler now: a free trial license to the PGI Accelerator tools for a quick ramp-up. NVIDIA has partnered with PGI to offer a free trial compiler license so that you can try OpenACC directives and experience how easy they are to use. You can sign up for a 15-day trial license at www.nvidia.com/gpudirectives to access a free download of the PGI Accelerator compiler. The website has links to documentation, examples, and brief white papers that you can read over a cup of coffee to get started. All you have to do is provide your contact information and register to download PGI Accelerator. You don't need to become a GPU expert to try it out. If you already have a PC with an NVIDIA GPU you can start right away, and if not, the website has some suggestions. www.nvidia.com/gpudirectives

A Very Simple Exercise: SAXPY

SAXPY in C:

  void saxpy(int n, float a,
             float *x, float *restrict y)
  {
  #pragma acc kernels
    for (int i = 0; i < n; ++i)
      y[i] = a*x[i] + y[i];
  }
  ...
  // Perform SAXPY on 1M elements
  saxpy(1<<20, 2.0, x, y);

SAXPY in Fortran:

  subroutine saxpy(n, a, x, y)
    real :: x(:), y(:), a
    integer :: n, i
  !$acc kernels
    do i=1,n
      y(i) = a*x(i)+y(i)
    enddo
  !$acc end kernels
  end subroutine saxpy
  ...
  ! Perform SAXPY on 1M elements
  call saxpy(2**20, 2.0, x_d, y_d)

(Note the change to a 1M-element vector: 1<<20 in C, 2**20 in Fortran.)

Directive Syntax
Fortran: !$acc directive [clause [,] clause] …]
  Often paired with a matching end directive surrounding a structured code block: !$acc end directive
C: #pragma acc directive [clause [,] clause] …]
  Often followed by a structured code block

kernels: Your First OpenACC Directive
Each loop is executed as a separate kernel on the GPU. (Kernel: a parallel function that runs on the GPU.)

  !$acc kernels
  do i=1,n
    a(i) = 0.0
    b(i) = 1.0
    c(i) = 2.0
  end do            ! -> kernel 1

  do i=1,n
    a(i) = b(i) + c(i)
  end do            ! -> kernel 2
  !$acc end kernels

Kernels Construct
Fortran:
  !$acc kernels [clause …]
    structured block
  !$acc end kernels
C:
  #pragma acc kernels [clause …]
  { structured block }
Clauses: if( condition ), async( expression ), and also any data clause (more later).

C tip: the restrict keyword (example below)
- A declaration of intent given by the programmer to the compiler, applied to a pointer, e.g. float *restrict ptr
- Meaning: "for the lifetime of ptr, only it or a value directly derived from it (such as ptr + 1) will be used to access the object to which it points"*
- Limits the effects of pointer aliasing
- OpenACC compilers often require restrict to determine independence; otherwise the compiler can't parallelize loops that access ptr
- Note: if the programmer violates the declaration, behavior is undefined
* http://en.wikipedia.org/wiki/Restrict
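For illustration, a minimal sketch (not from the original slides; the function name is hypothetical) of why restrict matters: without it, the compiler must assume out and in may overlap and cannot prove the iterations are independent.

  #include <stddef.h>

  /* restrict on out promises the compiler that out never aliases in,
     so every iteration writes a distinct location and can run in parallel. */
  void scale_array(size_t n, float a, const float *in, float *restrict out)
  {
  #pragma acc kernels
      for (size_t i = 0; i < n; ++i)
          out[i] = a * in[i];
  }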

Complete SAXPY example code
- Trivial first example: apply a loop directive and learn the compiler commands.
- *restrict: "I promise y does not alias x"

  #include <stdlib.h>

  void saxpy(int n, float a,
             float *x, float *restrict y)
  {
  #pragma acc kernels
    for (int i = 0; i < n; ++i)
      y[i] = a * x[i] + y[i];
  }

  int main(int argc, char **argv)
  {
    int N = 1<<20; // 1 million floats
    if (argc > 1)
      N = atoi(argv[1]);

    float *x = (float*)malloc(N * sizeof(float));
    float *y = (float*)malloc(N * sizeof(float));

    for (int i = 0; i < N; ++i) {
      x[i] = 2.0f;
      y[i] = 1.0f;
    }

    saxpy(N, 3.0f, x, y);

    return 0;
  }

Compile and run
C:       pgcc -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.c
Fortran: pgf90 -acc -ta=nvidia -Minfo=accel -o saxpy_acc saxpy.f90

Compiler output:

  pgcc -acc -Minfo=accel -ta=nvidia -o saxpy_acc saxpy.c
  saxpy:
       8, Generating copyin(x[:n-1])
          Generating copy(y[:n-1])
          Generating compute capability 1.0 binary
          Generating compute capability 2.0 binary
       9, Loop is parallelizable
          Accelerator kernel generated
          9, #pragma acc loop worker, vector(256) /* blockIdx.x threadIdx.x */
             CC 1.0 : 4 registers; 52 shared, 4 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 4 shared, 64 constant, 0 local memory bytes; 100% occupancy

Example: Jacobi Iteration
- Iteratively converges to the correct value (e.g. temperature) by computing new values at each point from the average of neighboring points.
- Common, useful algorithm.
- Example: solve the Laplace equation in 2D: ∇²f(x,y) = 0
(Slide figure: five-point stencil around A(i,j), using neighbors A(i+1,j), A(i-1,j), A(i,j-1), A(i,j+1).)

  A_{k+1}(i,j) = ( A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1) ) / 4

Jacobi Iteration: C Code

  while ( error > tol && iter < iter_max )            // Iterate until converged
  {
    error = 0.0;

    for ( int j = 1; j < n-1; j++ ) {                 // Iterate across matrix elements
      for ( int i = 1; i < m-1; i++ ) {
        // Calculate new value from neighbors
        Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                             A[j-1][i] + A[j+1][i]);
        // Compute max error for convergence
        error = max(error, abs(Anew[j][i] - A[j][i]));
      }
    }

    for ( int j = 1; j < n-1; j++ ) {                 // Swap input/output arrays
      for ( int i = 1; i < m-1; i++ ) {
        A[j][i] = Anew[j][i];
      }
    }

    iter++;
  }

Jacobi Iteration: Fortran Code

  do while ( err > tol .and. iter < iter_max )        ! Iterate until converged
    err = 0._fp_kind

    do j=1,m                                          ! Iterate across matrix elements
      do i=1,n
        ! Calculate new value from neighbors
        Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + &
                                   A(i , j-1) + A(i , j+1))
        ! Compute max error for convergence
        err = max(err, Anew(i,j) - A(i,j))
      end do
    end do

    do j=1,m-2                                        ! Swap input/output arrays
      do i=1,n-2
        A(i,j) = Anew(i,j)
      end do
    end do

    iter = iter + 1
  end do

OpenMP C Code

  while ( error > tol && iter < iter_max )
  {
    error = 0.0;

  #pragma omp parallel for shared(m, n, Anew, A)      // Parallelize loop across CPU threads
    for ( int j = 1; j < n-1; j++ ) {
      for ( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                             A[j-1][i] + A[j+1][i]);
        error = max(error, abs(Anew[j][i] - A[j][i]));
      }
    }

  #pragma omp parallel for shared(m, n, Anew, A)      // Parallelize loop across CPU threads
    for ( int j = 1; j < n-1; j++ ) {
      for ( int i = 1; i < m-1; i++ ) {
        A[j][i] = Anew[j][i];
      }
    }

    iter++;
  }

OpenMP Fortran Code

  do while ( err > tol .and. iter < iter_max )
    err = 0._fp_kind

  !$omp parallel do shared(m,n,Anew,A) reduction(max:err)   ! Parallelize loop across CPU threads
    do j=1,m
      do i=1,n
        Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + &
                                   A(i , j-1) + A(i , j+1))
        err = max(err, Anew(i,j) - A(i,j))
      end do
    end do

  !$omp parallel do shared(m,n,Anew,A)                      ! Parallelize loop across CPU threads
    do j=1,m-2
      do i=1,n-2
        A(i,j) = Anew(i,j)
      end do
    end do

    iter = iter + 1
  end do

Exercises: General Instructions (compiling)
- Exercises are in the "exercises" directory in your home directory; solutions are in the "solutions" directory.
- To compile, use one of the provided makefiles:
    > cd exercises/001-laplace2D
    C:       > make
    Fortran: > make -f Makefile_f90
- Remember these compiler flags: -acc -ta=nvidia -Minfo=accel

Exercises: General Instructions (running)
- To run, use qsub with one of the provided job files:
    > qsub laplace_acc.job
    > qstat              # prints qsub status
  Output is placed in laplace_acc.job.o<job#> when finished.
- The OpenACC job file looks like this:
    #!/bin/csh
    #PBS -l walltime=3:00
    ./laplace2d_acc
- The OpenMP version specifies the number of cores to use (edit this to control the number of cores):
    setenv OMP_NUM_THREADS 6
    ./laplace2d_omp

GPU startup overhead
- If no other GPU process is running, the GPU driver may be swapped out (Linux specific); starting it up can take 1-2 seconds.
- Two options:
  - Run nvidia-smi in persistence mode (requires root permissions)
  - Run "nvidia-smi -q -l 30" in the background
- If your running time is off by ~2 seconds from the results in these slides, suspect this.
- nvidia-smi should be running in persistence mode for these exercises.

Exercise 1: Jacobi Kernels
Task: use acc kernels to parallelize the Jacobi loop nests.
- Edit laplace2D.c or laplace2D.f90 (your choice) in the 001-laplace2D-kernels directory; add directives where they help.
- Figure out the proper compilation command (similar to the SAXPY example).
- Compile both with and without OpenACC parallelization; optionally compile with OpenMP (the original code has OpenMP directives).
- Run the OpenACC version with laplace_acc.job, the OpenMP version with laplace_omp.job.
Q: can you get a speedup with just kernels directives? Versus 1 CPU core? Versus 6 CPU cores?

Exercise 1 Solution: OpenACC C

  while ( error > tol && iter < iter_max )
  {
    error = 0.0;

  #pragma acc kernels                                 // Execute GPU kernel for loop nest
    for ( int j = 1; j < n-1; j++ ) {
      for ( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                             A[j-1][i] + A[j+1][i]);
        error = max(error, abs(Anew[j][i] - A[j][i]));
      }
    }

  #pragma acc kernels                                 // Execute GPU kernel for loop nest
    for ( int j = 1; j < n-1; j++ ) {
      for ( int i = 1; i < m-1; i++ ) {
        A[j][i] = Anew[j][i];
      }
    }

    iter++;
  }

Exercise 1 Solution: OpenACC Fortran

  do while ( err > tol .and. iter < iter_max )
    err = 0._fp_kind

  !$acc kernels                                       ! Generate GPU kernel for loop nest
    do j=1,m
      do i=1,n
        Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + &
                                   A(i , j-1) + A(i , j+1))
        err = max(err, Anew(i,j) - A(i,j))
      end do
    end do
  !$acc end kernels

  !$acc kernels                                       ! Generate GPU kernel for loop nest
    do j=1,m-2
      do i=1,n-2
        A(i,j) = Anew(i,j)
      end do
    end do
  !$acc end kernels

    iter = iter + 1
  end do

Exercise 1 Solution: C Makefile

  CC       = pgcc
  CCFLAGS  =
  ACCFLAGS = -acc -ta=nvidia -Minfo=accel
  OMPFLAGS = -fast -mp -Minfo

  BIN = laplace2d_omp laplace2d_acc

  all: $(BIN)

  laplace2d_acc: laplace2d.c
  	$(CC) $(CCFLAGS) $(ACCFLAGS) -o $@ $<

  laplace2d_omp: laplace2d.c
  	$(CC) $(CCFLAGS) $(OMPFLAGS) -o $@ $<

  clean:
  	$(RM) $(BIN)

Exercise 1 Solution: Fortran Makefile

  F90      = pgf90
  CCFLAGS  =
  ACCFLAGS = -acc -ta=nvidia -Minfo=accel
  OMPFLAGS = -fast -mp -Minfo

  BIN = laplace2d_f90_omp laplace2d_f90_acc

  all: $(BIN)

  laplace2d_f90_acc: laplace2d.f90
  	$(F90) $(CCFLAGS) $(ACCFLAGS) -o $@ $<

  laplace2d_f90_omp: laplace2d.f90
  	$(F90) $(CCFLAGS) $(OMPFLAGS) -o $@ $<

  clean:
  	$(RM) $(BIN)

Exercise 1: Compiler output (C)

  pgcc -acc -ta=nvidia -Minfo=accel -o laplace2d_acc laplace2d.c
  main:
       57, Generating copyin(A[:4095][:4095])
           Generating copyout(Anew[1:4094][1:4094])
           Generating compute capability 1.3 binary
           Generating compute capability 2.0 binary
       58, Loop is parallelizable
       60, Loop is parallelizable
           Accelerator kernel generated
           58, #pragma acc loop worker, vector(16) /* blockIdx.y threadIdx.y */
           60, #pragma acc loop worker, vector(16) /* blockIdx.x threadIdx.x */
               Cached references to size [18x18] block of 'A'
               CC 1.3 : 17 registers; 2656 shared, 40 constant, 0 local memory bytes; 75% occupancy
               CC 2.0 : 18 registers; 2600 shared, 80 constant, 0 local memory bytes; 100% occupancy
       64, Max reduction generated for error
       69, Generating copyout(A[1:4094][1:4094])
           Generating copyin(Anew[1:4094][1:4094])
       70, Loop is parallelizable
       72, Loop is parallelizable
           70, #pragma acc loop worker, vector(16) /* blockIdx.y threadIdx.y */
           72, #pragma acc loop worker, vector(16) /* blockIdx.x threadIdx.x */
               CC 1.3 : 8 registers; 48 shared, 8 constant, 0 local memory bytes; 100% occupancy
               CC 2.0 : 10 registers; 8 shared, 56 constant, 0 local memory bytes; 100% occupancy

Exercise 1: Performance
CPU: Intel Xeon X5680, 6 cores @ 3.33 GHz; GPU: NVIDIA Tesla M2070

                            Execution Time (s)   Speedup
  CPU 1 OpenMP thread            69.80             --
  CPU 2 OpenMP threads           44.76            1.56x
  CPU 4 OpenMP threads           39.59            1.76x
  CPU 6 OpenMP threads           39.71            1.76x
  OpenACC GPU                   162.16            0.24x   FAIL

CPU speedups are vs. 1 CPU core; the OpenACC speedup is vs. 6 CPU cores.

Huge Data Transfer Bottleneck!
What went wrong? Add -ta=nvidia,time to the compiler command line.

  Accelerator Kernel Timing data
  /usr/users/6/harrism/openacc-workshop/solutions/001-laplace2D-kernels/laplace2d.c
    main
      69: region entered 1000 times
          time(us): total=77524918 init=240 region=77524678
                    kernels=4422961 data=66464916          (kernels = 4.4 s, data = 66.5 s)
          w/o init: total=77524678 max=83398 min=72025 avg=77524
          72: kernel launched 1000 times
              grid: [256x256]  block: [16x16]
              time(us): total=4422961 max=4543 min=4345 avg=4422
      57: region entered 1000 times
          time(us): total=82135902 init=216 region=82135686
                    kernels=8346306 data=66775717          (kernels = 8.3 s, data = 66.8 s)
          w/o init: total=82135686 max=159083 min=76575 avg=82135
          60: kernel launched 1000 times
              time(us): total=8201000 max=8297 min=8187 avg=8201
          64: kernel launched 1000 times
              grid: [1]  block: [256]
              time(us): total=145306 max=242 min=143 avg=145
    acc_init.c  acc_init
      29: region entered 1 time
          time(us): init=158248

Computation: 12.7 seconds. Data movement: 133.3 seconds. A huge data transfer bottleneck!

Basic Concepts
(Slide diagram: CPU and CPU memory connected to the GPU and GPU memory over the PCI bus; data is transferred across the bus, and computation is offloaded to the GPU.)
For efficiency, decouple data movement and compute offload.

Excessive Data Transfers

  while ( error > tol && iter < iter_max )
  {
    error = 0.0;

    // A, Anew resident on host: copied to the accelerator when the kernels region starts
  #pragma acc kernels
    for ( int j = 1; j < n-1; j++ ) {
      for ( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                             A[j-1][i] + A[j+1][i]);
        error = max(error, abs(Anew[j][i] - A[j][i]));
      }
    }
    // A, Anew resident on accelerator: copied back to the host when the region ends
    ...
  }

These copies happen every iteration of the outer while loop!
(Note: there are two #pragma acc kernels regions, so there are 4 copies per while loop iteration!)

Data Management

Data Construct
Manage data movement. Data regions may be nested.
Fortran:
  !$acc data [clause …]
    structured block
  !$acc end data
C:
  #pragma acc data [clause …]
  { structured block }
General clauses: if( condition ), async( expression )

Data Clauses (see the combined sketch below)
- copy( list ): allocates memory on the GPU, copies data from host to GPU when entering the region, and copies data back to the host when exiting the region.
- copyin( list ): allocates memory on the GPU and copies data from host to GPU when entering the region.
- copyout( list ): allocates memory on the GPU and copies data to the host when exiting the region.
- create( list ): allocates memory on the GPU but does not copy.
- present( list ): data is already present on the GPU from another containing data region.
- Also present_or_copy[in|out], present_or_create, and deviceptr.
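As a combined illustration of these clauses, here is a hypothetical sketch (not from the slides; the function, array names, and sizes are assumptions): the input only travels host to GPU, the output only GPU to host, and the scratch buffer never leaves the device.

  // in  : read-only input, only needs to go host -> GPU  (copyin)
  // out : result, only needs to come GPU -> host         (copyout)
  // tmp : scratch space that never needs to leave the GPU (create)
  void smooth(int n, const float *in, float *restrict out, float *restrict tmp)
  {
  #pragma acc data copyin(in[0:n]) copyout(out[0:n]) create(tmp[0:n])
      {
  #pragma acc kernels
          for (int i = 0; i < n; ++i)
              tmp[i] = 0.5f * in[i];

  #pragma acc kernels
          for (int i = 1; i < n-1; ++i)
              out[i] = tmp[i-1] + tmp[i+1];   /* boundary elements of out left untouched in this sketch */
      }
  }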

Array Shaping
- The compiler sometimes cannot determine the size of arrays, so you must specify it explicitly using data clauses and an array "shape".
- C:       #pragma acc data copyin(a[0:size-1]), copyout(b[s/4:3*s/4])
- Fortran: !$acc data copyin(a(1:size)), copyout(b(s/4:3*s/4))
- Note: data clauses can be used on data, kernels, or parallel constructs.

Update Construct
Used to update existing data after it has changed in its corresponding copy (e.g. update the device copy after the host copy changes). Moves data from GPU to host, or host to GPU. Data movement can be conditional and asynchronous. (A short sketch follows.)
Fortran: !$acc update [clause …]
C:       #pragma acc update [clause …]
Clauses: host( list ), device( list ), if( expression ), async( expression )
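A hedged sketch (not from the slides; the function name, loop body, and output cadence are assumptions) of a typical use: data stays on the GPU across iterations, and a device-to-host update is issued only when the host needs to inspect it.

  #include <stdio.h>

  void evolve(int n, int nsteps, float *restrict a)
  {
  #pragma acc data copy(a[0:n])
      {
          for (int step = 0; step < nsteps; ++step) {
  #pragma acc kernels
              for (int i = 0; i < n; ++i)
                  a[i] *= 1.01f;                    /* stand-in for real work on the GPU */

              if (step % 100 == 0) {
  #pragma acc update host(a[0:n])                   /* refresh the host copy only when needed */
                  printf("step %d: a[0] = %f\n", step, a[0]);
              }
          }
      }
  }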

Exercise 2: Jacobi Data Directives
Task: use acc data to minimize transfers in the Jacobi example.
- Start from the given laplace2D.c or laplace2D.f90 (your choice) in the 002-laplace2d-data directory.
- Add directives where they help (hint: the [do] while loop).
Q: What speedup can you get with data + kernels directives? Versus 1 CPU core? Versus 6 CPU cores?

Exercise 2 Solution: OpenACC C

  // Copy A in at the beginning of the region, out at the end; allocate Anew on the accelerator
  #pragma acc data copy(A), create(Anew)
  while ( error > tol && iter < iter_max )
  {
    error = 0.0;

  #pragma acc kernels
    for ( int j = 1; j < n-1; j++ ) {
      for ( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                             A[j-1][i] + A[j+1][i]);
        error = max(error, abs(Anew[j][i] - A[j][i]));
      }
    }

  #pragma acc kernels
    for ( int j = 1; j < n-1; j++ ) {
      for ( int i = 1; i < m-1; i++ ) {
        A[j][i] = Anew[j][i];
      }
    }

    iter++;
  }

Exercise 2 Solution: OpenACC Fortran

  ! Copy A in at the beginning of the region, out at the end; allocate Anew on the accelerator
  !$acc data copy(A), create(Anew)
  do while ( err > tol .and. iter < iter_max )
    err = 0._fp_kind

  !$acc kernels
    do j=1,m
      do i=1,n
        Anew(i,j) = .25_fp_kind * (A(i+1, j ) + A(i-1, j ) + &
                                   A(i , j-1) + A(i , j+1))
        err = max(err, Anew(i,j) - A(i,j))
      end do
    end do
  !$acc end kernels
    ...
    iter = iter + 1
  end do
  !$acc end data

Here we have a more in-depth code example which shows Fortran code that performs Jacobi iteration, applying a 2D Laplacian operator to every grid cell at each iteration of an outer do while loop. The inner two do loops iterate over the 2D grid cells and apply the Laplacian operator; the outer do while loop iterates until the solution has converged. You really don't have to understand any of what I just said in order to understand OpenACC. Suffice to say that the inner loops over the (i,j) grid cells represent thousands of independent operations that can be run in parallel, and the outer do while loop represents running this parallel computation hundreds of times. The directives we insert into the code are shown in green on the slide. To accelerate this code, we start outside the outer loop, where we use an "acc data copy" directive to specify that copies of the arrays A and Anew should be kept on the accelerator. The compiler generates copy commands to move the data from the host CPU to the accelerator before the do while loop begins, and to move the results back from the accelerator to the host after the while loop ends. Doing this data movement only before and after the outer loop ensures that these expensive copies happen only once, rather than before and after every computation on the accelerator. On the inner parallel loop nest, we use an "acc kernels" directive to specify that the compiler should automatically generate parallel device code for the accelerator to implement these loops. The compiler detects that each iteration of these loops is independent, and generates the equivalent of a parallel CUDA function for the GPU. The GPU can process thousands of iterations of these loops concurrently. We close off both directives with "acc end" at the end of the corresponding loops. The result is not only a very fast parallel implementation of the Laplacian operation, but also efficient allocation and copying of data to and from the GPU, without having to write any GPU-specific code. And since we haven't changed the underlying source code, we can run the sequential version on a CPU just as we did before adding the directives.

Exercise 2: Performance
CPU: Intel Xeon X5680, 6 cores @ 3.33 GHz; GPU: NVIDIA Tesla M2070

                            Execution Time (s)   Speedup
  CPU 1 OpenMP thread            69.80             --
  CPU 2 OpenMP threads           44.76            1.56x
  CPU 4 OpenMP threads           39.59            1.76x
  CPU 6 OpenMP threads           39.71            1.76x
  OpenACC GPU                    13.65            2.9x

CPU speedups are vs. 1 CPU core; the OpenACC speedup is vs. 6 CPU cores.
Note: the same code runs in 9.78 s on an NVIDIA Tesla M2090 GPU.

Further speedups
- OpenACC gives us more detailed control over parallelization via gang, worker, and vector clauses.
- By understanding more about the OpenACC execution model and GPU hardware organization, we can get higher speedups on this code.
- By understanding bottlenecks in the code via profiling, we can reorganize the code for higher performance.
- We will tackle these in later exercises.

Finding Parallelism in Your Code
- (Nested) for loops are best for parallelization.
- Large loop counts are needed to offset GPU/memcpy overhead.
- Iterations of loops must be independent of each other; to help the compiler, use the restrict keyword (C) or the independent clause (see the sketch after this list).
- The compiler must be able to figure out the sizes of data regions; you can use directives to explicitly control the sizes.
- Pointer arithmetic should be avoided if possible; use subscripted arrays rather than pointer-indexed arrays.
- Function calls within an accelerated region must be inlineable.
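For illustration, a hypothetical sketch (not from the slides) of the independent clause: it asserts to the compiler that iterations do not conflict when the compiler cannot prove this itself, e.g. with an index array that the programmer knows contains distinct entries.

  /* idx is assumed (by the programmer) to hold n distinct indices, so the
     writes below never collide; "independent" asserts this to the compiler. */
  void gather_scale(int n, const int *idx, const float *src, float *restrict dst)
  {
  #pragma acc kernels loop independent
      for (int i = 0; i < n; ++i)
          dst[idx[i]] = 2.0f * src[i];
  }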

Tips and Tricks (PGI)
- Use the time option to learn where time is being spent: -ta=nvidia,time
- Eliminate pointer arithmetic.
- Inline function calls in directives regions: -inline or -inline,levels(<N>)
- Use contiguous memory for multi-dimensional arrays (see the sketch below).
- Use data regions to avoid excessive memory transfers.
- Use conditional compilation with the _OPENACC macro.
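As an illustration of the contiguous-memory tip, a hypothetical sketch (not from the slides; the helper names are assumptions): allocate a 2D grid as one flat block and index it manually, rather than as an array of row pointers, so the compiler can move it as a single shaped region.

  #include <stdlib.h>

  /* One contiguous allocation: rows are adjacent in memory. */
  float *alloc_grid(int rows, int cols)
  {
      return (float*)malloc((size_t)rows * cols * sizeof(float));
  }

  void init_grid(int rows, int cols, float *restrict grid)
  {
  #pragma acc kernels copyout(grid[0:rows*cols])
      for (int j = 0; j < rows; ++j)
          for (int i = 0; i < cols; ++i)
              grid[j*cols + i] = 0.0f;    /* manual 2D indexing into the flat array */
  }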

OpenACC Learning Resources
- OpenACC info, specification, FAQ, samples, and more: http://openacc.org
- PGI OpenACC resources: http://www.pgroup.com/resources/accel.htm

Complete OpenACC API

Directive Syntax
Fortran: !$acc directive [clause [,] clause] …]
  Often paired with a matching end directive surrounding a structured code block: !$acc end directive
C: #pragma acc directive [clause [,] clause] …]
  Often followed by a structured code block

Kernels Construct
Fortran:
  !$acc kernels [clause …]
    structured block
  !$acc end kernels
C:
  #pragma acc kernels [clause …]
  { structured block }
Clauses: if( condition ), async( expression ), and any data clause.

Kernels Construct
Each loop is executed as a separate kernel on the GPU.

  !$acc kernels
  do i=1,n
    a(i) = 0.0
    b(i) = 1.0
    c(i) = 2.0
  end do            ! -> kernel 1

  do i=1,n
    a(i) = b(i) + c(i)
  end do            ! -> kernel 2
  !$acc end kernels

Parallel Construct
Fortran:
  !$acc parallel [clause …]
    structured block
  !$acc end parallel
C:
  #pragma acc parallel [clause …]
  { structured block }
Clauses: if( condition ), async( expression ), num_gangs( expression ), num_workers( expression ), vector_length( expression ), private( list ), firstprivate( list ), reduction( operator:list ), and any data clause.

Parallel Clauses (see the sketch below)
- num_gangs( expression ): controls how many parallel gangs are created (CUDA gridDim).
- num_workers( expression ): controls how many workers are created in each gang (CUDA blockDim).
- vector_length( expression ): controls the vector length of each worker (SIMD execution).
- private( list ): a copy of each variable in list is allocated to each gang.
- firstprivate( list ): private variables initialized from the host.
- reduction( operator:list ): private variables combined across gangs.
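A hedged sketch (not from the slides; the function name, array size, and the specific gang/vector counts are assumptions) combining several of these clauses on a parallel loop with an explicit reduction:

  float sum_of_squares(int n, const float *restrict x)
  {
      float total = 0.0f;

  #pragma acc parallel loop num_gangs(256) vector_length(128) \
          reduction(+:total) copyin(x[0:n])
      for (int i = 0; i < n; ++i)
          total += x[i] * x[i];   /* partial sums per gang are combined by the reduction */

      return total;
  }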

Loop Construct
Detailed control of the parallel execution of the following loop.
Fortran:
  !$acc loop [clause …]
    loop
  !$acc end loop
C:
  #pragma acc loop [clause …]
    loop
Combined directives: !$acc parallel loop [clause …] and !$acc kernels loop [clause …] (with the corresponding #pragma acc forms in C).

Loop Clauses
- collapse( n ): applies the directive to the following n nested loops.
- seq: executes the loop sequentially on the GPU.
- private( list ): a copy of each variable in list is created for each iteration of the loop.
- reduction( operator:list ): private variables combined across iterations.

Loop Clauses Inside a parallel Region
- gang: shares iterations across the gangs of the parallel region.
- worker: shares iterations across the workers of the gang.
- vector: executes the iterations in SIMD mode.

Loop Clauses Inside a kernels Region (see the sketch below)
- gang [( num_gangs )]: shares iterations across at most num_gangs gangs.
- worker [( num_workers )]: shares iterations across at most num_workers workers of a single gang.
- vector [( vector_length )]: executes the iterations in SIMD mode with maximum vector_length.
- independent: specifies that the loop iterations are independent.
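A hedged sketch (not from the slides; the function name, matrix layout, and vector length are assumptions) of explicit gang/vector mapping on a loop nest inside a kernels region:

  void scale_matrix(int rows, int cols, float a, float *restrict m)
  {
  #pragma acc kernels copy(m[0:rows*cols])
      {
  #pragma acc loop gang
          for (int j = 0; j < rows; ++j) {
  #pragma acc loop vector(128)
              for (int i = 0; i < cols; ++i)
                  m[j*cols + i] *= a;   /* each gang takes a row; vector lanes take columns */
          }
      }
  }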

Other Syntax

Other Directives (a wait/async sketch follows)
- cache construct: cache data in the software-managed data cache (CUDA shared memory).
- host_data construct: makes the address of device data available on the host.
- wait directive: waits for asynchronous GPU activity to complete.
- declare directive: specifies that data is to be allocated in device memory for the duration of an implicit data region created during the execution of a subprogram.
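A hedged sketch (not from the slides; the function name and workloads are assumptions) pairing the async clause with the wait directive to overlap GPU work with host work:

  void overlap(int n, float *restrict a, float *restrict b)
  {
  #pragma acc kernels async(1) copy(a[0:n])
      for (int i = 0; i < n; ++i)
          a[i] = a[i] * a[i];              /* GPU work launched asynchronously on queue 1 */

      for (int i = 0; i < n; ++i)          /* meanwhile, the host works on b */
          b[i] = b[i] + 1.0f;

  #pragma acc wait(1)                      /* block until the asynchronous GPU work (and copy back) is done */
  }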

Runtime Library Routines (see the sketch below)
Fortran: use openacc  (or #include "openacc_lib.h")
C:       #include "openacc.h"
Routines: acc_get_num_devices, acc_set_device_type, acc_get_device_type, acc_set_device_num, acc_get_device_num, acc_async_test, acc_async_test_all, acc_async_wait, acc_async_wait_all, acc_shutdown, acc_on_device, acc_malloc, acc_free
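A hedged sketch (not from the slides; the round-robin-by-rank scheme and function name are assumptions) using a couple of these routines to pick a device at startup:

  #include <stdio.h>
  #include "openacc.h"

  /* Query the available NVIDIA devices and attach this process to one of them. */
  void pick_device(int rank)
  {
      int ndev = acc_get_num_devices(acc_device_nvidia);
      if (ndev > 0) {
          acc_set_device_num(rank % ndev, acc_device_nvidia);  /* simple round-robin by rank */
          printf("rank %d using device %d of %d\n", rank, rank % ndev, ndev);
      }
  }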

Environment and Conditional Compilation (sketch below)
- ACC_DEVICE device: specifies which device type to connect to.
- ACC_DEVICE_NUM num: specifies which device number to connect to.
- _OPENACC: preprocessor macro for conditional compilation; set to the OpenACC version.
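A hedged sketch (not from the slides; the function name and fallback message are assumptions) of conditional compilation with the _OPENACC macro:

  #include <stdio.h>
  #ifdef _OPENACC
  #include "openacc.h"
  #endif

  void report_build(void)
  {
  #ifdef _OPENACC
      /* compiled with an OpenACC compiler: the macro holds the spec version */
      printf("OpenACC enabled, version macro = %d\n", _OPENACC);
  #else
      printf("OpenACC not enabled; running the host-only code path\n");
  #endif
  }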

Try out OpenACC directives today. Thanks for listening, and remember: you can access all the presentations in this series by clicking the attachments link in your viewing window. Thank you.