
Profiling and Tuning OpenACC Code

Profiling Tools (PGI)
- Use the time option to learn where time is being spent: -ta=nvidia,time
- NVIDIA Visual Profiler
- 3rd-party profiling tools that are CUDA-aware (but those are outside the scope of this talk)
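
For example, with the PGI compilers the time sub-option rides along on the target flag at compile time (a sketch; the file names are illustrative):

    pgcc -acc -ta=nvidia,time -Minfo=accel laplace2d.c -o laplace2d

The accumulated Accelerator Kernel Timing data is then printed automatically when the program exits.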

PGI Accelerator profiling
Compiler automatically instruments the code, outputs profile data: -ta=nvidia,time

    Accelerator Kernel Timing data
    /usr/users/7/jwoolley/openacc-workshop/solutions/003-laplace2D-loop/laplace2d.c
      main
        66: region entered 1000 times
            time(us): total= init=110 region= kernels= data=0
            w/o init: total= max=13486 min=5269 avg=
            : kernel launched 1000 times
                grid: [16x512]  block: [32x8]
                time(us): total= max=5426 min=5200 avg=5320
    /usr/users/7/jwoolley/openacc-workshop/solutions/003-laplace2D-loop/laplace2d.c
      main
        53: region entered 1000 times
            time(us): total= init=171 region= kernels= data=0
    ...
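
As a rough worked example of reading this output (assuming the 4096 x 4096 single-precision grid used in the workshop code, which is an assumption to check against your own version): the stencil kernel above averaged 5320 us per launch. Each grid point reads four neighbors and writes one result, so, ignoring caching, one launch moves roughly

    5 * 4096 * 4096 * 4 bytes ~ 336 MB
    336 MB / 5.32 ms          ~ 63 GB/s

That is well short of the Tesla M2070's roughly 150 GB/s peak memory bandwidth, which is exactly the kind of gap the profiling exercise later in this deck asks you to chase down.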

PGI Accelerator profiling
Compiler automatically instruments the code and outputs profile data, providing insight into API-level efficiency:
- How many bytes of data were copied in and out?
- How many times was each kernel launched, and how long did they take?
- What kernel grid and block dimensions were used?
...but it provides relatively little insight (at present) into how efficient the kernels themselves were.
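
As a concrete example of the first question (again assuming a 4096 x 4096 single-precision grid): with the data region used in the solution later in this deck, #pragma acc data copy(A), copyin(Anew), the array A (4096 * 4096 * 4 bytes, about 67 MB) is copied to the device once and back once for the whole run, and Anew is copied in once; the data=0 fields in the timing output above confirm that no transfers occur per region entry.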

Profiling Tools
Need a profiling tool that is more aware of the inner workings of the GPU to provide deeper insights, e.g. the NVIDIA Visual Profiler.

NVIDIA Visual Profiler

Note: today we are using the CUDA 4.0 Visual Profiler. CUDA 4.1 and later include a revamped profiler called nvvp; try it on your own codes after the workshop.

Exercise 4: Jacobi Profiling
Task: use NVIDIA Visual Profiler data to identify additional optimization opportunities in the Jacobi example.
- Start from the given laplace2d.c or laplace2d.f90 (your choice), in the 004-laplace2d-profiling directory
- Use computeprof to examine the provided laplace2d.cvp project
- Identify areas for possible improvement
- Modify the code where it helps (hint: look at bandwidth utilization; a sketch of the kernel you will be replacing follows below)
Q: What speedup can you get by improving the kernels? Does it help the CPU code as well? By how much?
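
For reference, the kernel that the solution replaces looked roughly like this in the earlier version of the code (a sketch, not the exact file contents): a second loop nest that simply copied Anew back into A, pure memory traffic with no arithmetic, which is what the bandwidth hint above points at.

    #pragma acc kernels loop
    for( int j = 1; j < n-1; j++ ) {
        for( int i = 1; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];   /* memcpy-style kernel: moves data, computes nothing */
        }
    }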


NVIDIA Visual Profiler: PSC Workshop
Tips for use of computeprof in PSC's shared environment:
- If you need to profile your own code, submit a PBS job that lets you run computeprof via remote X on the compute node
- Your profiling session on the compute node will be limited to 5 minutes
- Set the timeout for each profile pass in the profiler to 5 seconds (the default is 30 seconds)
- SAVE YOUR SESSION as soon as the profile has been gathered, and exit the profiler to release the compute node
- Use an instance of computeprof running on the login node to study the saved session offline while someone else uses the compute node
- For this exercise, please try to use ONLY the pre-saved profile if possible

Exercise 4 Solution: OpenACC C

    #pragma acc data copy(A), copyin(Anew)
    while ( error > tol && iter < iter_max )
    {
        error = 0.0;

    #pragma acc kernels loop
        for( int j = 1; j < n-1; j++ ) {
    #pragma acc loop gang(16) vector(32)
            for( int i = 1; i < m-1; i++ ) {
                Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1]
                                   + A[j-1][i] + A[j+1][i]);
            }
        }

    #pragma acc kernels loop
        for( int j = 1; j < n-1; j++ ) {
    #pragma acc loop gang(16) vector(32)
            for( int i = 1; i < m-1; i++ ) {
                A[j][i] = 0.25 * (Anew[j][i+1] + Anew[j][i-1]
                                + Anew[j-1][i] + Anew[j+1][i]);
                error = max(error, fabs(A[j][i] - Anew[j][i]));
            }
        }

        iter += 2;
    }

Notes:
- Need to switch back to copying Anew in to the accelerator so that the halo cells will be correct
- Replace the memcpy kernel with a second instance of the stencil kernel
- The max reduction on 'error' only needs to be calculated once per pair of sweeps, so it was removed from the first loop
- Only need half as many trips through the while loop now (hence iter += 2)
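
Note that the excerpt uses max() and fabs() without declarations. fabs() comes from math.h; max() is not standard C, so the workshop's laplace2d.c is assumed to supply something equivalent to this sketch:

    #include <math.h>                            /* fabs() */
    #ifndef max
    #define max(a,b) (((a) > (b)) ? (a) : (b))   /* assumed helper; not standard C */
    #endif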

Exercise 4: Performance vs. original
[Performance chart. CPU: Intel Xeon X… GHz; GPU: NVIDIA Tesla M2070]

Thank you