Profiling Tools on the NERSC Crays and IBM/SP
NERSC User Services
National Energy Research Scientific Computing Center

2 Outline
Profiling Tools on NERSC platforms
–Cray PVP (killeen, seymour)
–Cray T3E (mcurie)
–IBM/SP (gseaborg)
UNIX profiling/performance analysis tools
References

3 Why Profile?
Characterise the application:
–Is the code CPU bound?
–Is the code I/O bound?
–Is the code memory bound?
–Analyse communication patterns (distributed-memory codes)
Focus the optimisation effort... and ultimately...
Improve performance and resource utilisation

4 Cray PVP/T3E - Application Characterization
Job accounting (ja)
–ja ./a.out
–ja -st -n a.out (see next slide for sample output)
Look out for:
–Maximum Memory Used > available memory
–Total I/O wait time (locked + unlocked) > 50% of User CPU time
–Multitasking breakdown for parallel codes

5 Job accounting: summary report
Elapsed Time              :     8 Seconds
User CPU Time             :       Seconds
Multitasking/Multistreaming Breakdown
  (Concurrent CPUs * Connect seconds = CPU seconds)
  (Avg.) 3.99 concurrent CPUs
System CPU Time           :       Seconds
I/O Wait Time (Locked)    :
I/O Wait Time (Unlocked)  :
CPU Time Memory Integral  :       Mword-seconds
Data Transferred          :       MWords
Maximum memory used       :       MWords

6 HPM - Hardware Performance Monitor
Helps locate CPU-related code bottlenecks
Reports use of vector registers, instruction buffers, memory ports
hpm {options} ./a.out {prog_arguments}
–options = -g2 -> memory access information
–options = -g3 -> vector register information
Look for: the ratio of Floating Ops/CPU second to CPU memory references per second should reflect the floating-point intensity of the code (a worked example follows below)
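A worked illustration of that ratio (not from the original slides): for a DAXPY-style loop there are two floating-point operations and three memory references per iteration, so the expected ratio is roughly 0.67, which can be compared against the rates hpm reports.

/* Minimal sketch: estimating the expected ratio of floating-point
 * operations to memory references for a kernel, for comparison with
 * the rates reported by hpm. */
#include <stdio.h>

#define N 1000000

static double x[N], y[N];

int main(void)
{
    double a = 3.0;

    /* DAXPY-style loop: per iteration, 2 floating-point operations
     * (one multiply, one add) and 3 memory references
     * (load x[i], load y[i], store y[i]). */
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    /* Expected floating ops : memory references ratio = 2/3 ~= 0.67.
     * A much lower measured floating-op rate relative to memory
     * references suggests the loop is memory bound. */
    printf("expected flops per memory reference: %.2f\n", 2.0 / 3.0);
    return 0;
}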

7 Sample hpm output: (hpm -g0 ./a.out)
Million inst/sec (MIPS)  :  7.67
Avg. clock periods/inst  :
% CP holding issue       :
Inst. buffer fetches/sec :  0.04M
Floating adds/sec        : 15.40M
Floating multiplies/sec  : 24.36M
Floating reciprocal/sec  :  0.28M
Cache hits/sec           :  0.00M
CPU mem. references/sec  : 34.64M
Floating ops/CPU second  : 40.5M
(the report also lists the corresponding raw instruction and operation counts)

8 Cray PVP - CPU Bound Codes: prof/profview
Instruments code to report the % of CPU time spent in function calls
–f90 -lprof prog.f90
–./a.out -> generates prof.data
–prof -st ./a.out > prof.report
Chart (over) indicates the relative distribution of CPU execution time by function call
–prof -x a.out > pgm.prof
–profview pgm.prof

9 Profview - Sample Output

10 I/O and Memory Bound Codes: procstat/procview
–procstat -m -i -R a.raw a.out
–procview a.raw
I/O Analysis: Reports -> Files -> All User Files (Long Report); examine Bytes Processed or I/O Wait Time
Memory Analysis: Reports -> Processes -> Maximum Memory Used (Long Format)

11 I/O Bound Codes: procview
procview indicates which files consume the most real (wall-clock) time for I/O processing

12 Memory Bound Codes: procview
–A "high" time to complete memory requests (> 10% of elapsed time) may indicate a memory-bound code
–Use the Graphs option to produce a plot of memory use over the elapsed time of the application

13 ATExpert - Autotasking Prediction
Analysis of source code to predict autotasking performance on a dedicated Cray PVP
–f90 -eX -O3 -r4 -o {prog_name} prog.f90
–./a.out
–atexpert -> shows predicted speed-up

14 ATExpert Sample Output
Indicates a predicted speed-up of 4.3 on a dedicated 8-processor PVP when the source code is autotasked

15 Also available on Cray PVP
–Flowtrace/flowview: times subroutines and functions during program execution (using operating system timers)
–Jumptrace/jumpview: provides exact timing of functions/subroutines by analysis of the machine instructions in the program
–Perftrace/perfview: times subroutines/functions based on statistics gathered from the HPM tool

16 Cray T3E - Apprentice
Locates performance problems/inefficiencies: MPI and shared-memory performance, load balance and communication, memory use
Provides hardware performance information and tuning recommendations (Displays -> Observations)
Compile/link:
–f90 -o {prog} -eA {prog_name.f90} -lapp
–cc -o {prog} -happrentice {prog_name.c} -lapp
Run the code to generate app.rif

17 Output from: apprentice app.rif

18 Cray T3E - PAT
Generates a profile of CPU time in functions; load balance across PEs; hardware counter info
Compile and link with the PAT library:
–f90 -o exe -lpat {source.f} pat.cld
Run the program as normal:
–mpprun -n {procs} {exe} -> generates exe.pif
View the results:
–pat executable exe.pif

19 Profile based on relative CPU time in function calls
Load Balance Histogram for routine "COLL"

20 Cray T3E - ACTS/TAU
Performance analysis of distributed/shared-memory applications (C++ in particular)
–module load tau
–instrument programs with TAU macros (see the sketch below)
–add $(TAU_DEFS), $(TAULIBS) to compile/link
–run the application; view the tracefile with pprof or VAMPIR
Reference
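A minimal sketch of manual TAU macro instrumentation in C, assuming the TAU.h header and the TAU_PROFILE_* macros shipped with the tau module; the exact names and required compile flags come from $(TAU_DEFS), so consult the TAU documentation for your installation.

/* Minimal sketch of manual TAU instrumentation (assumed API: TAU.h and
 * the TAU_PROFILE_* macros; compile with the flags from $(TAU_DEFS)). */
#include <TAU.h>

void work(int n)
{
    /* Declare and time a user-defined region. */
    TAU_PROFILE_TIMER(t, "work", "", TAU_USER);
    TAU_PROFILE_START(t);

    volatile double s = 0.0;
    for (int i = 0; i < n; i++)
        s += (double)i;

    TAU_PROFILE_STOP(t);
}

int main(int argc, char **argv)
{
    TAU_PROFILE_INIT(argc, argv);   /* initialise TAU                */
    TAU_PROFILE_SET_NODE(0);        /* single-node (non-MPI) example */

    work(1000000);
    return 0;
}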

21 Cray T3E - Vampir
Analysis of message-passing characteristics: generates a display of MPI activity over the instrumented time period (e.g. sender, receiver, message size, elapsed time)
–module load VAMPIR; module load vampirtrace
–facility to instrument with VAMPIRtrace calls
–generate a trace file using TAU or VAMPIRtrace
Reference

22 IBM/SP - Xprofiler
Graphical interface for gprof profiles of parallel applications
–Compile and link the code with "-g -pg"
–poe ./a.out -procs {n}
  generates a gmon.out.{n} file for each process
  may introduce significant (up to a factor of 2) overhead
–(In $TMPDIR) xprofiler ./a.out gmon.out.*
Report menu provides the (gprof) text profile
Source statement profiling shown (a minimal program to try this on is sketched below)
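A minimal sketch (not from the original slides) of the kind of program such a profile highlights; compiled with -g -pg as above, the gprof/Xprofiler report should attribute nearly all CPU time to the deliberately expensive routine.

/* Minimal sketch of a program to profile with gprof/Xprofiler.
 * Serial case for simplicity; e.g. compile with "xlc -g -pg hot.c -o hot"
 * (compiler choice is an assumption; use the usual MPI wrappers and
 * poe for parallel codes), run it, then "xprofiler ./hot gmon.out.*". */
#include <stdio.h>

/* Deliberately expensive routine: the profile should show it at the top. */
static double hot_loop(long n)
{
    double s = 0.0;
    for (long i = 1; i <= n; i++)
        s += 1.0 / (double)i;   /* harmonic sum, O(n) work */
    return s;
}

/* Cheap routine for contrast. */
static double cheap(void)
{
    return 42.0;
}

int main(void)
{
    double a = hot_loop(50000000L);
    double b = cheap();
    printf("%f %f\n", a, b);
    return 0;
}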


24 Statement-level profile available by clicking on the relevant function in the graphical output - use the Show Source Code option

25 IBM/SP - Visualization Tool (VT)
Message-passing trace visualization; realtime system activity monitor (limited); MPI load-balance overview
–poe ./a.out -procs {n} -tlevel=3
–copy a.out.trc to $TMPDIR
–(In $TMPDIR) invoke vt
–In trace visualization mode, "Play" a.out.trc
see next slide for a sample of Interprocessor Communication during program execution

26 (sample: Interprocessor Communication display during program execution)

27 IBM/SP: system_stats
IBM internal tool
–module load sptools
–instrument the code with a system_stats() call
–link with $(SPTOOLS), run the code as normal
Sample output - Summary of the utilization of system resources (one row per node):
node  hostname  wall(s)  user(s)  sys(s)  size(KB)  pswitches

28 IBM/SP - trace-mpi
IBM internal tool - quantitative information on MPI calls
–module load USG ; module load trace-mpi
–Fortran - add $(TRACE_MPIF) to the build
–C - add $(TRACE_MPI) to the build
–poe ./a.out -procs {n} - generates an mpi.trace_file for each process (the executable must call MPI_Finalize; see the sketch below)
–summary mpi.trace_file.{n} (see over)
Useful check for load balance:
–grep "Total Communication" mpi.trace_file.*
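A minimal sketch (not from the original slides) of an MPI program that trace-mpi could summarise; it uses only standard MPI-1 calls and reaches MPI_Finalize so that the per-process trace files are written.

/* Minimal sketch of an MPI program suitable for trace-mpi (or any MPI
 * tracer): standard MPI-1 calls only.  MPI_Finalize must be reached so
 * the per-process trace files are flushed. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Some traced communication: a broadcast and a reduction. */
    local = (double)rank;
    MPI_Bcast(&local, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    local = (double)rank * (double)rank;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of squares over %d ranks = %f\n", size, global);

    MPI_Finalize();   /* required for the trace files to be written */
    return 0;
}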

29 MPI message-passing summary for mpi.trace_file.3
MPI Function      #calls    Avg Bytes    Time (sec)
MPI_Allreduce :
MPI_Barrier   :
MPI_Bcast     :
MPI_Scatter   :
MPI_Comm_rank :
MPI_Comm_size :
MPI_Isend     :
MPI_Recv      :
MPI_Wait      :
Total Communication Information: WALL = , CPU = 15.53, MBYTES =
The total amount of wall time =

30 Upcoming on the SP
–ACTS/TAU (C/C++) is currently being ported to the IBM/SP
–VAMPIR has been ordered, awaiting delivery
–The Performance Monitor Toolkit (HPM) should be available with the Phase II system (requires AIX 4.3.4)
–Also, see the Performance API (PAPI) project

31 General/UNIX Profiling Tools
Command-line profilers and system analysis:
–prof/gprof (enabled for MPI on the IBM/SP)
–csh time command: time ./a.out
–vmstat -> look for high paging over an extended time period (the application may require more memory)
Fortran/C function timers (see the sketch below):
–getrusage
–rtc, irtc
–etime, dtime, mclock
–MPI_Wtime
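A minimal sketch of region timing with getrusage, one of the timers listed above; in MPI codes, MPI_Wtime() plays the corresponding role for wall-clock time.

/* Minimal sketch: timing a code region with getrusage(), one of the
 * timers listed above.  Reports user CPU seconds consumed by the region. */
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

static double user_seconds(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return (double)ru.ru_utime.tv_sec + 1.0e-6 * (double)ru.ru_utime.tv_usec;
}

int main(void)
{
    double t0 = user_seconds();

    /* Region being timed: some throwaway floating-point work. */
    volatile double s = 0.0;
    for (long i = 0; i < 50000000L; i++)
        s += (double)i * 1.0e-9;

    double t1 = user_seconds();
    printf("region used %.3f user CPU seconds\n", t1 - t0);
    return 0;
}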

32 Reference Material
NERSC web pages
Cray PVP/Cray T3E
–Optimizing Code on Cray PVP Systems
–Cray T3E C and Fortran Optimization Guides
IBM/SP
LLNL Workshop on Performance Tools