Debugging and Optimization Tools
Richard Gerber, NERSC User Services
David Skinner, NERSC Outreach, Software & Programming Group
UCB CS267, February 15, 2011
Outline
– Introduction
– Debugging
– Performance / Optimization
See slides and videos from NERSC Hopper Training: tutorials/courses/CS267/ (newweb -> www sometime soon)
Scope of Today's Talks
– Debugging and optimization tools (R. Gerber)
– Some basic strategies for parallel performance (D. Skinner)
Take-aways
– Common problems to look out for
– How tools work in general
– A few specific tools you can try
– Where to get more information
Debugging
Typical Problems
"Serial"
– Invalid memory references
– Array references out of bounds
– Divide by zero
– Uninitialized variables
Parallel
– Unmatched sends/receives
– Blocking receive before corresponding send (see the sketch below)
– Out-of-order collectives
– Race conditions
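To make "blocking receive before corresponding send" concrete, here is a minimal sketch (not from the slides) of a two-rank MPI deadlock: both ranks post a blocking receive first, so neither ever reaches its send.

/* deadlock.c - classic MPI deadlock, illustrative only */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, other, x = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;   /* assumes exactly 2 ranks */

    /* BUG: both ranks block here waiting for a message that is never
       sent. Fixes: have one rank send first, use MPI_Sendrecv, or post
       a nonblocking MPI_Irecv before the send. */
    MPI_Recv(&x, 1, MPI_INT, other, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Send(&rank, 1, MPI_INT, other, 0, MPI_COMM_WORLD);

    printf("rank %d received %d\n", rank, x);
    MPI_Finalize();
    return 0;
}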
Tools
printf, write
– Versatile, sometimes useful
– Doesn't scale well
– Not interactive
Compiler / runtime
– Turn on bounds checking, exception handling
– Check dereferencing of NULL pointers
Serial gdb
– GNU debugger: serial, command-line interface
– See "man gdb"; a short example session follows
Parallel GUI debuggers (X Windows)
– DDT
– Totalview
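For reference, a minimal serial gdb session (commands only; "myprog" and "some_var" are stand-in names, and the code must be compiled with -g so symbols are available):

  % cc -g -O0 myprog.c -o myprog
  % gdb ./myprog
  (gdb) break main        # stop at the start of main
  (gdb) run               # run until the breakpoint (or the crash)
  (gdb) next              # step over the next line ("step" steps into calls)
  (gdb) print some_var    # inspect a variable
  (gdb) backtrace         # show the call stack, e.g. after a segfault
  (gdb) quit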
(Demo: see the DDT video.)
Compiler Runtime Bounds Checking
Out-of-bounds reference in the source code for program "flip":

  ...
  allocate(put_seed(random_size))
  ...
  bad_index = random_size + 1
  put_seed(bad_index) = 67

Compile with bounds checking enabled (PGI flags):

  ftn -c -g -Ktrap=fp -Mbounds flip.f90
  ftn -c -g -Ktrap=fp -Mbounds printit.f90
  ftn -o flip flip.o printit.o -g

Run in an interactive batch session; the runtime reports the bad subscript:

  % qsub -I -q debug -l mppwidth=48
  % cd $PBS_O_WORKDIR
  % aprun -n 48 ./flip
  0: Subscript out of range for array put_seed (flip.f90: 50),
     subscript=35, lower bound=1, upper bound=34, dimension=1
Performance / Optimization
Performance Questions
– How can we tell if a program is performing well? Or isn't?
– If performance is not "good," how can we pinpoint why?
– How can we identify the causes?
– What can we do about it?
Performance Metrics
Primary metric: application run time
– but it gives little indication of efficiency
Derived measures:
– rates (e.g., messages per unit time, flops per second, clocks per instruction), cache utilization
Indirect measures:
– speedup, parallel efficiency, scalability
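The indirect measures have standard definitions (not spelled out on the slide): with T(1) the run time on one processor and T(p) the run time on p processors,

  speedup:             S(p) = T(1) / T(p)
  parallel efficiency: E(p) = S(p) / p    (1.0 means perfect scaling)

For example, a code that takes 100 s serially and 30 s on 4 processors has S(4) = 3.3 and E(4) ≈ 0.83.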
Optimization Strategies
Serial
– Leverage instruction-level parallelism (ILP) on the processor
– Feed the pipelines
– Exploit data locality (see the loop-order sketch below)
– Reuse data in cache
Parallel
– Minimize latency effects
– Maximize work relative to communication
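As an illustration of "exploit data locality" (a sketch, not from the slides): C stores 2-D arrays row-major, so making the row index the outer loop walks memory contiguously and reuses each cache line; swapping the loops strides through memory and misses cache constantly. (Fortran is the opposite: column-major, so the first index should vary fastest.)

/* locality.c - loop order vs. cache behavior (illustrative) */
#include <stdio.h>
#define N 2048
static double a[N][N];

int main(void) {
    double sum = 0.0;

    /* Cache-friendly: inner loop walks the contiguous row dimension */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Cache-hostile version of the same reduction (strides N*8 bytes
       per access):
       for (int j = 0; j < N; j++)
           for (int i = 0; i < N; i++)
               sum += a[i][j];
    */
    printf("%f\n", sum);
    return 0;
}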
Identifying Targets for Optimization
Sampling
– Regularly interrupt the program and record where it is
– Build up a statistical profile
Tracing / instrumenting
– Insert hooks into the program to record and time events
Hardware event counters
– Special registers count events on the processor, e.g., floating-point instructions
– Many possible events
– But only a few (~4) counters can be used at once
Typical Process
1. (Sometimes) Modify your code with macros, API calls, or timers (see the timer sketch below)
2. Compile your code
3. Transform your binary for profiling/tracing with a tool
4. Run the transformed binary
   – A data file is produced
5. Interpret the results with a tool
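A sketch of step 1's "insert timers" approach using the standard MPI_Wtime call (the work routine here is a made-up stand-in):

/* timing.c - hand-inserted timer around a region of interest */
#include <mpi.h>
#include <stdio.h>

/* stand-in for the real work being measured */
static void solver_step(void) {
    volatile double s = 0.0;
    for (int i = 0; i < 10000000; i++) s += i * 0.5;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    double t0 = MPI_Wtime();          /* wall-clock time in seconds */
    solver_step();
    double t1 = MPI_Wtime();
    printf("solver_step: %.6f s\n", t1 - t0);
    MPI_Finalize();
    return 0;
}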
Performance Tools
NERSC vendor tools:
– CrayPat
Community tools:
– TAU (U. Oregon, via ACTS)
– PAPI (Performance Application Programming Interface)
– gprof
IPM: Integrated Performance Monitoring
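For reference, the basic gprof workflow (standard flags; "myprog" is a stand-in name):

  % cc -pg -g myprog.c -o myprog     # compile and link with -pg
  % ./myprog                          # run normally; writes gmon.out
  % gprof ./myprog gmon.out > profile.txt

gprof samples the program counter, so it yields a flat profile and a call graph but, unlike TAU (below), no per-process or per-thread detail.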
Introduction to CrayPat
Suite of tools that provides a wide range of performance-related information
Can be used for both sampling and tracing user codes
– with or without hardware or network performance counters
– Built on PAPI
Supports Fortran, C, C++, UPC, MPI, Coarray Fortran, OpenMP, Pthreads, SHMEM
Man pages: intro_craypat(1), intro_app2(1), intro_papi(1)
Using CrayPat
1. Access the tools
   – module load perftools
2. Build your application, keeping the .o files
   – make clean
   – make
3. Instrument the application
   – pat_build ... a.out
   – Result: a new file, a.out+pat
4. Run the instrumented application to find the top time-consuming routines
   – aprun ... a.out+pat
   – Result: a new file XXXXX.xf (or a directory containing .xf files)
5. Run pat_report on that file and view the results
   – pat_report XXXXX.xf > my_profile
   – vi my_profile
   – pat_report also produces a new file, XXXXX.ap2 (used by Apprentice)
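Putting the steps together (the pat_build options are the ones used in the Apprentice example later in these slides: -u traces user functions, -g mpi traces the MPI group; the run size is illustrative):

  % module load perftools
  % make clean && make                    # builds a.out, keeps .o files
  % pat_build -u -g mpi a.out             # produces a.out+pat
  % aprun -n 48 ./a.out+pat               # produces a.out+pat+PID.xf
  % pat_report a.out+pat+PID.xf > my_profile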
Guidelines to Identify the Need for Optimization

Derived metric                                 Optimization needed when*    PAT_RT_HWPC group
Computational intensity                        < 0.5 ops/ref                0, 1
L1 cache hit ratio                             < 90%                        0, 1, 2
L1 cache utilization (misses)                  < 1 avg hit                  0, 1, 2
L1+L2 cache hit ratio                          < 92%                        2
L1+L2 cache utilization (misses)               < 1 avg hit                  2
TLB utilization                                < 0.9 avg use                1
(FP Multiply / FP Ops) or (FP Add / FP Ops)    < 25%                        5
Vectorization                                  < 1.5 for dp; 3 for sp       12 (13, 14)

* Suggested by Cray; the last column is the PAT_RT_HWPC hardware-counter group that collects the metric.
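For intuition on the first row (not from the slide): a triad loop a(i) = b(i) + s*c(i) performs 2 floating-point operations per 3 memory references, a computational intensity of roughly 0.67 ops/ref; a kernel that falls below the 0.5 threshold is likely memory-bound rather than compute-bound.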
Using Apprentice
Optional visualization tool for Cray's perftools data
Use it in an X Windows environment
Takes as input a data file (XXX.ap2) prepared by pat_report
1. module load perftools
2. ftn -c mpptest.f
3. ftn -o mpptest mpptest.o
4. pat_build -u -g mpi mpptest
5. aprun -n 16 mpptest+pat
6. pat_report mpptest+pat+PID.xf > my_report
7. app2 [--limit_per_pe tags] [XXX.ap2]
Apprentice Basic View
[Screenshot of the Apprentice overview window]
– Can select a new (additional) data file and do a screen dump
– Can select other views of the data
– Can drag the "calipers" to focus the view on portions of the run
PAPI
PAPI (Performance API) provides a standard interface to the performance counters in major microprocessors
Predefined actual and derived counters are supported on the system
– To see the list, run 'papi_avail' on a compute node via aprun:
    module load perftools
    aprun -n 1 papi_avail
AMD native events are also provided; use 'papi_native_avail':
    aprun -n 1 papi_native_avail
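Codes can also read the counters directly through PAPI's C API. A minimal sketch (the event choices and toy loop are illustrative; check papi_avail for the events your processor supports):

/* papi_example.c - count FP ops and L1 data-cache misses around a loop */
#include <stdio.h>
#include <papi.h>

int main(void) {
    int es = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return 1;
    }
    PAPI_create_eventset(&es);
    PAPI_add_event(es, PAPI_FP_OPS);   /* floating-point operations */
    PAPI_add_event(es, PAPI_L1_DCM);   /* L1 data-cache misses */

    double s = 0.0;
    PAPI_start(es);
    for (int i = 0; i < 1000000; i++)  /* region being measured */
        s += i * 0.5;
    PAPI_stop(es, counts);

    printf("FP ops: %lld  L1 DCM: %lld  (sum=%g)\n",
           counts[0], counts[1], s);
    return 0;
}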
TAU
Tuning and Analysis Utilities
Fortran, C, C++, Java performance tool
Procedure:
– Insert macros
– Run the program
– View results with pprof
More info than gprof
– E.g., per-process and per-thread info; supports pthreads
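A sketch of the "insert macros" step using TAU's C instrumentation interface (treat the details as illustrative: macro names follow TAU's manual-instrumentation API, and builds normally go through a TAU compiler wrapper such as tau_cc.sh, which varies by installation):

/* tau_example.c - manual TAU instrumentation around a routine */
#include <TAU.h>
#include <stdio.h>

static double work(int n) {
    double s = 0.0;
    TAU_PROFILE_TIMER(t, "work", "", TAU_USER);  /* declare a timer */
    TAU_PROFILE_START(t);
    for (int i = 0; i < n; i++) s += i * 0.5;
    TAU_PROFILE_STOP(t);
    return s;
}

int main(int argc, char **argv) {
    TAU_PROFILE_INIT(argc, argv);
    TAU_PROFILE_SET_NODE(0);   /* non-MPI run: set the node id manually */
    printf("%f\n", work(1000000));
    return 0;
}

After the run, pprof in the working directory prints the per-routine profile.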
IPM
Integrated Performance Monitoring
MPI profiling, hardware counter metrics, I/O profiling (?)
IPM requires no code modification and no instrumented binary
– Just "module load ipm" before running your program, on systems that support dynamic libraries
– Otherwise, link with the IPM library
IPM uses hooks already in the MPI library to intercept your MPI calls and wrap them with timers and counters
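Typical use on a system with dynamic linking (a sketch; module names and report details vary by site):

  % module load ipm
  % aprun -n 32 ./a.out     # run unchanged; a profile banner is printed at the end of the run

The "hooks already in the MPI library" are the PMPI profiling interface: the MPI standard gives every call a name-shifted PMPI_ entry point, so a wrapper library can time MPI_Send and then forward to PMPI_Send without touching user code.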
IPM
Example IPM output (numeric values elided; "..." marks them):

# host    : s05601/C00_AIX        mpi_tasks : 32 on 2 nodes
# start   : 11/30/04/14:35:34     wallclock : ... sec
# stop    : 11/30/04/14:36:00     %comm     : ...
# gbytes  : ...e-01 total         gflop/sec : ...e+00 total
#
#                  [total]    min    max
# wallclock          ...
# user               ...
# system             ...
# mpi                ...
# %comm              ...
# gflop/sec          ...
# gbytes             ...
# PM_FPU0_CMPL       ...
# PM_FPU1_CMPL       ...
# PM_FPU_FMA         ...
# PM_INST_CMPL       ...
# PM_LD_CMPL         ...
# PM_ST_CMPL         ...
# PM_TLB_MISS        ...
# PM_CYC             ...
#
#                  [time]   [calls]
# MPI_Send           ...
# MPI_Wait           ...
# MPI_Irecv          ...
# MPI_Barrier        ...
# MPI_Reduce         ...
# MPI_Comm_rank      ...
# MPI_Comm_size      ...