Lecture 2: GPU Programming with OpenACC


OPENACC ONLINE COURSE
Lecture 2: GPU Programming with OpenACC
John Urbanic, Pittsburgh Supercomputing Center
October 26, 2017

Course Syllabus:
October 19: Introduction to OpenACC
October 26: GPU Programming with OpenACC
November 2: Optimizing and Best Practices for OpenACC

Questions? Email openacc@nvidia.com

Homework solution

Exercise 1: C Solution

while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {

    #pragma acc kernels                 /* generate a parallel kernel */
    for(i = 1; i <= ROWS; i++) {
        for(j = 1; j <= COLUMNS; j++) {
            Temperature[i][j] = 0.25 * (Temperature_last[i+1][j] + Temperature_last[i-1][j] +
                                        Temperature_last[i][j+1] + Temperature_last[i][j-1]);
        }
    }

    dt = 0.0; // reset largest temperature change

    #pragma acc kernels                 /* generate a parallel kernel */
    for(i = 1; i <= ROWS; i++){
        for(j = 1; j <= COLUMNS; j++){
            dt = fmax( fabs(Temperature[i][j]-Temperature_last[i][j]), dt);
            Temperature_last[i][j] = Temperature[i][j];
        }
    }

    if((iteration % 100) == 0) {
        track_progress(iteration);
    }

    iteration++;
}

Exercise 1: Fortran Solution

do while ( dt > max_temp_error .and. iteration <= max_iterations)

   !$acc kernels      ! generate a parallel kernel
   do j=1,columns
      do i=1,rows
         temperature(i,j)=0.25*(temperature_last(i+1,j)+temperature_last(i-1,j)+ &
                                temperature_last(i,j+1)+temperature_last(i,j-1) )
      enddo
   enddo
   !$acc end kernels

   dt=0.0

   !$acc kernels      ! generate a parallel kernel
   do j=1,columns
      do i=1,rows
         dt = max( abs(temperature(i,j) - temperature_last(i,j)), dt )
         temperature_last(i,j) = temperature(i,j)
      enddo
   enddo
   !$acc end kernels

   if( mod(iteration,100).eq.0 ) then
      call track_progress(temperature, iteration)
   endif

   iteration = iteration+1

enddo

CPU Performance: 3372 steps to convergence

Moving to the GPU

Change the target accelerator flag from "-ta=multicore" to "-ta=tesla":

% pgcc -acc -ta=tesla -Minfo=accel -DMAX_ITER=4000 laplace_kernels.c -o laplace.out
main:
     65, Generating implicit copyin(Temperature_last[:][:])
         Generating implicit copyout(Temperature[1:1000][1:1000])
     66, Loop is parallelizable
     67, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         66, #pragma acc loop gang, vector(4)  /* blockIdx.y threadIdx.y */
         67, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
     76, Generating implicit copyin(Temperature[1:1000][1:1000])
         Generating implicit copy(Temperature_last[1:1000][1:1000])
     77, Loop is parallelizable
     78, Loop is parallelizable
         77, #pragma acc loop gang, vector(4)  /* blockIdx.y threadIdx.y */
         78, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
     79, Generating implicit reduction(max:dt)

Running on the GPU

% ./laplace.out
…
---------- Iteration number: 3100 ------------
[995,995]: 97.58  [996,996]: 98.18  [997,997]: 98.71  [998,998]: 99.17  [999,999]: 99.55  [1000,1000]: 99.86
---------- Iteration number: 3200 ------------
[995,995]: 97.62  [996,996]: 98.21  [997,997]: 98.73  [998,998]: 99.18  [999,999]: 99.56  [1000,1000]: 99.86
---------- Iteration number: 3300 ------------
[995,995]: 97.66  [996,996]: 98.24  [997,997]: 98.75  [998,998]: 99.19  [999,999]: 99.56  [1000,1000]: 99.87

Max error at iteration 3372 was 0.009995
Total time was 40.462415 seconds.

The run slows down by almost 7x compared to the original serial version of the code!

What went wrong?

% pgprof --cpu-profiling-mode top-down ./laplace.out
…
==455== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 57.41%  13.2519s     13488  982.49us     959ns  1.3721ms  [CUDA memcpy HtoD]
 36.08%  8.32932s     10116  823.38us  2.6230us  1.5129ms  [CUDA memcpy DtoH]
  3.42%  788.42ms      3372  233.81us  231.79us  235.79us  main_80_gpu
  2.80%  646.58ms      3372  191.75us  189.69us  194.04us  main_69_gpu
  0.29%  66.916ms      3372  19.844us  19.519us  20.223us  main_81_gpu_red

======== CPU profiling result (top down):
Time(%)      Time  Name
…
  9.15%  16.3609s  | __pgi_uacc_dataexitdone
  9.11%  16.2809s  | __pgi_uacc_dataonb

13.2 seconds are spent copying data from host to device and 8.3 seconds copying data from device to host. On top of the copies themselves, there is additional CPU overhead to create and delete the device data (the __pgi_uacc_* entries in the CPU profile).

Architecture Basic Concept (simplified, but sadly true): the CPU, with its shared cache and high-capacity memory, is connected to the GPU, with its high-bandwidth memory, only through the PCIe bus.

Excessive Data Transfers: the stencil values A(i,j), A(i+1,j), A(i-1,j) and A(i,j-1) are shipped between CPU memory and GPU memory across the PCIe bus multiple times each iteration.

Excessive Data Transfers

while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {

    /* Temperature, Temperature_old resident on host: copied to the device here */
    #pragma acc kernels
    for(i = 1; i <= ROWS; i++) {
        for(j = 1; j <= COLUMNS; j++) {
            Temperature[i][j] = 0.25 * (Temperature_old[i+1][j] + …
        }
    }
    /* Temperature, Temperature_old resident on device: copied back to the host here */

    dt = 0.0;

    /* Temperature, Temperature_old resident on host: copied to the device here */
    #pragma acc kernels
    for(i = 1; i <= ROWS; i++) {
        for(j = 1; j <= COLUMNS; j++) {
            Temperature[i][j] = 0.25 * (Temperature_old[i+1][j] + …
        }
    }
    /* Temperature, Temperature_old resident on device: copied back to the host here */
}

4 copies happen every iteration of the outer while loop!

Data Management: The First, Most Important, and Possibly Only OpenACC Optimization

Data Construct Syntax and Scope

C:
#pragma acc data [clause …]
{
    structured block
}

Fortran:
!$acc data [clause …]
    structured block
!$acc end data

Data Clauses

copy( list )     Allocates memory on the GPU, copies data from host to GPU when entering the region, and copies data back to the host when exiting the region.
                 Principal use: for many important data structures in your code, this is a logical default to input, modify and return the data.

copyin( list )   Allocates memory on the GPU and copies data from host to GPU when entering the region.
                 Principal use: think of this like an array you would use as just an input to a subroutine.

copyout( list )  Allocates memory on the GPU and copies data to the host when exiting the region.
                 Principal use: a result that isn't overwriting the input data structure.

create( list )   Allocates memory on the GPU but does not copy.
                 Principal use: temporary arrays.
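As a quick illustration (not from the slides; the function smooth and its arrays u, coeff, result and tmp are hypothetical), here is a minimal sketch of how the four clauses might work together around a pair of kernels:

#include <stdlib.h>

/* Hypothetical helper: smooth u into result, using coeff as read-only input
   and tmp as device-only scratch space.                                     */
void smooth(const double *coeff, double *u, double *result, int n)
{
    double *tmp = malloc(n * sizeof(double));   /* never needed on the host */

    #pragma acc data copy(u[0:n]) copyin(coeff[0:n]) copyout(result[1:n-2]) create(tmp[0:n])
    {
        #pragma acc kernels             /* tmp is written on the device only */
        for (int i = 1; i < n - 1; i++)
            tmp[i] = coeff[i] * (u[i-1] + u[i+1]);

        #pragma acc kernels             /* u is modified; result holds the smoothed values */
        for (int i = 1; i < n - 1; i++) {
            result[i] = tmp[i];
            u[i]      = 0.5 * (u[i] + tmp[i]);
        }
    }   /* here: u and the interior of result return to the host; coeff and tmp are simply released */

    free(tmp);
}

Only u makes the round trip; coeff travels one way in, result one way out, and tmp never crosses the PCIe bus at all.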

Array Shaping

Compilers sometimes cannot determine the size of arrays, so we must specify it explicitly using data clauses with an array "shape". The compiler will let you know when you need to do this. Sometimes you will want to do it anyway, for your own efficiency reasons.

C:
#pragma acc data copyin(a[0:count]), copyout(b[s/4:3*s/4])

Fortran:
!$acc data copyin(a(1:end)), copyout(b(s/4:3*s/4))

Fortran uses start:end while C uses start:count.
Data clauses can be used on data, kernels or parallel constructs.
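For instance, in a hypothetical routine (not part of the exercise) that receives its arrays as bare pointers, the compiler has no size information at all, so shaping is mandatory. A minimal sketch:

void scale(const double *a, double *b, int n)
{
    /* 'a' and 'b' are just pointers here, so we must say how many
       elements to move: C shapes are [start:count].               */
    #pragma acc data copyin(a[0:n]) copyout(b[0:n])
    {
        #pragma acc kernels
        for (int i = 0; i < n; i++)
            b[i] = 2.0 * a[i];
    }
}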

Compiler will (increasingly) often make a good guess…

C Code:

int main(int argc, char *argv[])
{
    int i;
    double A[2000], B[1000], C[1000];

    #pragma acc kernels
    for (i=0; i<1000; i++){
        A[i] = 4 * i;
        B[i] = B[i] + 2;
        C[i] = A[i] + 2 * B[i];
    }
}

Compiler Output:

pgcc -acc -Minfo=accel loops.c
main:
      6, Generating implicit copy(B[:])
         Generating implicit copyout(C[:])
         Generating implicit copyout(A[:1000])
      7, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
          7, #pragma acc loop gang, vector(128)

Smart, smarter, smartest: the compiler copies B both ways (it is read and written), only copies C out (it is only written), and copies out just the first 1000 elements of the 2000-element A.

Data Regions Have Real Consequences

Simplest kernel:

int main(int argc, char** argv){
    float A[1000];

    #pragma acc kernels          // A[] copied to the GPU here...
    for( int iter = 1; iter < 1000 ; iter++){
        A[iter] = 1.0;
    }                            // ...and copied back to the host here

    A[10] = 2.0;                 // runs on the host
    printf("A[10] = %f", A[10]); // runs on the host
}

Output: A[10] = 2.0

With a global data region:

int main(int argc, char** argv){
    float A[1000];

    #pragma acc data copy(A)
    {
        #pragma acc kernels      // A[] was already copied to the GPU at the data region entry
        for( int iter = 1; iter < 1000 ; iter++){
            A[iter] = 1.0;
        }

        A[10] = 2.0;             // still runs on the host, touching only the host copy
    }                            // A[] copied back to the host here, overwriting A[10]

    printf("A[10] = %f", A[10]);
}

Output: A[10] = 1.0

Data Regions vs. Compute Regions

int main(int argc, char** argv){
    float A[1000];

    #pragma acc data copy(A)     // data region begins
    {
        #pragma acc kernels      // compute region
        for( int iter = 1; iter < 1000 ; iter++){
            A[iter] = 1.0;
        }

        A[10] = 2.0;
    }                            // data region ends

    printf("A[10] = %f", A[10]);
}

Output: A[10] = 1.0

Data Movement Decisions

Much like loop data dependencies, sometimes the compiler needs your human intelligence to make high-level decisions about data movement. Otherwise, it must remain conservative, sometimes at great cost. You must think about when data truly needs to migrate, and see if that is better than the default.

Besides the scope-based data clauses, there are OpenACC options to let us manage data movement more explicitly or asynchronously. We could manage the above behavior with the update construct:

C:       #pragma acc update [self(), device(), …]
Fortran: !$acc update [self(), device(), …]

Example: #pragma acc update self(Temp_array)   // get the host a copy from the device
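As a minimal sketch (the array A, its size, and the printing step are illustrative, not the course example), update self() lets the host peek at device data in the middle of a long-lived data region without tearing the region down:

#include <stdio.h>
#define N 1000

int main(void)
{
    float A[N];

    #pragma acc data create(A)            /* A lives on the device for the whole region */
    {
        #pragma acc kernels
        for (int i = 0; i < N; i++)
            A[i] = 1.0f * i;

        #pragma acc update self(A[0:3])   /* refresh only the first few host values...   */
        printf("A[0..2] = %f %f %f\n", A[0], A[1], A[2]);   /* ...so the host can print them */

        #pragma acc kernels               /* the device copy is untouched and keeps working */
        for (int i = 0; i < N; i++)
            A[i] *= 2.0f;
    }
    return 0;
}

This is exactly the kind of control the printout hint in Exercise 2 is pointing at.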

Analyzing / Profiling

OpenACC Development Cycle

Analyze your code and predict where potential parallelism can be uncovered. Use a profiler to help understand what is happening in the code and where parallelism may exist.

Parallelize your code, starting with the most time-consuming parts. Focus on maintaining correct results from your program.

Optimize your code, focusing on maximizing performance. Performance may not increase all at once during early parallelization.

This is the OpenACC development cycle: Analyze -> Parallelize -> Optimize. Rinse and repeat. It is slightly over-simplified, however: it is very important to profile your code before beginning, and it is equally important to continue profiling as you make changes, verifying that each change affects the code in the way you expected. The modules are designed to follow this format; this module covers the Analyze portion, but we expect students to profile in every module.

Profiling GPU Code with PGPROF

This is the view of PGPROF after profiling a GPU application. The first obvious difference compared to our command-line profile is a visual timeline of events.

Profiling GPU Code with PGPROF

MemCpy(HtoD): data transfers from the Host to the Device (CPU to GPU)
MemCpy(DtoH): data transfers from the Device to the Host (GPU to CPU)
Compute: our computational functions

Inspecting the PGPROF Timeline

Zooming in gives us a better view of the timeline. It looks like our program is spending a significant amount of time transferring data between the host and device. We also see that the compute regions are very small, with a lot of distance between them. The timeline shows:
- Device memory transfers
- Compute on the device
- CPU overhead time, including device data creation and deletion, and the virtual-to-physical host memory copy

Note: we use a 1 kernel solution here (available from the Resources console).

Profile After Adding a Data Region

After adding a data region, we see lots of compute, and data movement only when we update. The timeline shows:
- Device compute
- Update to host
- Pinned-to-virtual host memory copy

Exercise 2: Use acc data to minimize transfers

Q: What speedup can you get with the data + kernels directives?

Start with your Exercise 1 solution (or ours). Add data directives where they help. Think: when should I move data between host and GPU? Think how you would do it by hand, then determine which data clauses will implement that plan.

Hint: you may find it helpful to ignore the output at first and just concentrate on getting the solution to converge quickly (at 3372 steps). Then worry about updating the printout (hint, hint).

Exercises: General Instructions

The exercise is found at https://nvlabs.qwiklab.com
Create an account or log in (use the same email you registered for the course with).
Open the class: OpenACC Course Oct 2017.

TODAY WE DISCUSSED
What the OpenACC data directives are
Why they are important for GPU use
How to spot these, and many other types of performance hotspots, with a profiler

Questions? Email openacc@nvidia.com

OpenACC Resources

Guides ● Talks ● Tutorials ● Videos ● Books ● Spec ● Code Samples ● Teaching Materials ● Events ● Success Stories ● Courses ● Slack ● Stack Overflow

Resources: https://www.openacc.org/resources
Success Stories: https://www.openacc.org/success-stories
FREE Compilers and Tools: https://www.openacc.org/tools
Events: https://www.openacc.org/events
Community Slack: https://www.openacc.org/community#slack

OpenACC for Everyone: New PGI Community Edition Now Available (FREE)

DOWNLOAD: https://www.pgroup.com/products/community.htm

The slide compares the free Community Edition with the paid PGI editions:

PROGRAMMING MODELS   OpenACC, CUDA Fortran, OpenMP, C/C++/Fortran Compilers and Tools (all editions)
PLATFORMS            X86, OpenPOWER, NVIDIA GPU (all editions)
UPDATES              Community: 1-2 times a year; paid editions: 6-9 times a year
SUPPORT              Community: User Forums; paid editions: PGI Support / PGI Professional Services
LICENSE              Community: Annual; paid editions: Perpetual / Volume/Site

New Book: OpenACC for Programmers

Edited by Sunita Chandrasekaran and Guido Juckeland.
Discover how OpenACC makes scalable parallel programming easier and more practical.
Get productive with OpenACC code editors, compilers, debuggers, and performance analysis tools.
Build your first real-world OpenACC programs.
Overcome common performance, portability, and interoperability challenges.
Efficiently distribute tasks across multiple processors.

READ MORE: http://www.informit.com/store/openacc-for-programmers-concepts-and-strategies-9780134694283
Use code OPENACC during checkout for 35% off. *Discount taken off list price. Offer only good at informit.com.

XSEDE Monthly Workshop Series

NSF-funded for the national scientific community. Wide Area Classroom (WAC) format: classroom-based with 25 remote sites per event; 50 events and 8000 students and counting. The next OpenACC event is November 7th. Contact Tom Maiden, tmaiden@psc.edu.

Course Syllabus:
October 19: Introduction to OpenACC
October 26: GPU Programming with OpenACC
November 2: Optimizing and Best Practices for OpenACC

Questions? Email openacc@nvidia.com