Lecture 5: Parallel Tools Landscape – Part 2
Allen D. Malony
Department of Computer and Information Science


Performance and Debugging Tools
Performance measurement and analysis:
– Scalasca
– Vampir
– HPCToolkit
– Open|SpeedShop
– Periscope
– mpiP
– VTune
– Paraver
– PerfExpert
Modeling and prediction:
– Prophesy
– MuMMI
Autotuning frameworks:
– Active Harmony
– Orio and PBound
Debugging:
– STAT
Parallel Performance Tools: A Short Course, Beihang University, December 2-4, 2013

Periscope Technische Universität München (Germany)

Periscope  Automated profile-based performance analysis ❍ Iterative on-line performance analysis ◆ Multiple distributed hierarchical agents ❍ Automatic search for bottlenecks based on properties formalizing expert knowledge ◆ MPI wait states, OpenMP overheads and imbalances ◆ Processor utilization hardware counters ❍ Clustering of processes/threads with similar properties ❍ Eclipse-based integrated environment  Supports ❍ SGI Altix Itanium2, IBM Power and x86-based architectures  Developed by TU Munich ❍ Released as open-source ❍ Parallel Performance Tools: A Short Course, Beihang University, December 2-4, 2013

Periscope properties & strategies (examples)
 MPI
  ❍ Excessive MPI communication time
  ❍ Excessive MPI time due to many small messages
  ❍ Excessive MPI time in receive due to late sender
  ❍ …
 OpenMP
  ❍ Load imbalance in parallel region/section
  ❍ Sequential computation in master/single/ordered region
  ❍ …
 Hardware performance counters (platform-specific)
  ❍ Cycles lost due to cache misses
    ◆ High L1/L2/L3 demand load miss rate
  ❍ Cycles lost due to no instruction to dispatch
  ❍ …

Periscope plug-in to Eclipse environment
[Screenshot: SIR outline view, properties view, project view, source code view]

Benchmark Instrumentation
 Instrumentation, either manual or using Score-P
 Change compiler to the Score-P wrapper script
 Compile & link

Periscope Online Analysis
 Periscope is started via its frontend, which automatically starts the application and the hierarchy of analysis agents.
 Run psc_frontend --help for brief usage information:

% psc_frontend --help
Usage: psc_frontend
  [--help]               (displays this help message)
  [--quiet]              (do not display debug messages)
  [--registry=host:port] (address of the registry service, optional)
  [--port=n]             (local port number, optional)
  [--maxfan=n]           (max. number of child agents, default=4)
  [--timeout=secs]       (timeout for startup of agent hierarchy)
  [--delay=n]            (search delay in phase executions)
  [--appname=name]
  [--apprun=commandline]
  [--mpinumprocs=number of MPI processes]
  [--ompnumthreads=number of OpenMP threads]
  …
  [--strategy=name]
  [--sir=name]
  [--phase=(FileID,RFL)]
  [--debug=level]

Periscope Online Analysis
 Run Periscope by executing psc_frontend with the following command and options:
 Experiment with strategies and OpenMP thread counts.
 Strategies: MPI, OMP, scalability_OMP

livetau$ psc_frontend --apprun=./bt-mz_B.4 --strategy=MPI --mpinumprocs=4 --ompnumthreads=1 --phase="OA_phase"
[psc_frontend][DBG0:fe] Agent network UP and RUNNING. Starting search.
NAS Parallel Benchmarks BT Benchmark
[...]
Time step 200
BT Benchmark Completed
End Periscope run! Search took 60.5 seconds (33.3 seconds for startup)

Starting Periscope GUI
 Start Eclipse with the Periscope GUI from the console:
% eclipse
 Or by double-clicking the Eclipse icon on the desktop

Creating Fortran Project
 File -> New -> Project... -> Fortran -> Fortran Project
 Input project name
 Unmark "Use default location" and provide path to BT folder
 Select project type
 Press Finish

Add BT-MZ as a Source Folder  Right-click -> File-> New -> Fortran Source Folder Parallel Performance Tools: A Short Course, Beihang University, December 2-4,

Add BT-MZ as a Source Folder  Choose BT-MZ as a source folder Parallel Performance Tools: A Short Course, Beihang University, December 2-4,

Loading properties
 Expand the BT project, search for *.psc, and Right-click -> Periscope -> Load all properties

Periscope GUI
[Screenshot: Periscope properties view, SIR outline view, project explorer view, source code view]

Periscope GUI report exploration features
 Multi-functional table is used in the GUI for Eclipse for the visualization of bottlenecks
 Multiple-criteria sorting algorithm
 Complex categorization utility
 Search engine using regular expressions
 Filtering operations
 Direct navigation from the bottlenecks to their precise source location using the default IDE editor for that source file type (e.g., CDT/Photran editor)
 SIR outline view shows a combination of the standard intermediate representation (SIR) of the analysed application and the distribution of its bottlenecks. The main goals of this view are to assist navigation in the source code and attract the developer's attention to the most problematic code areas.

Properties clustering
 Clustering can effectively summarize displayed properties and identify similar performance behaviour possibly hidden in the large amount of data
 Right-click -> Cluster properties by type

Properties clustering
[Screenshot: severity value of Cluster 1; processes belonging to Cluster 1; region and property where clustering was performed]

mpiP: Lightweight, Scalable MPI Profiling Lawrence Livermore National Laboratory (USA)

So, You Need to Look at a New Application …
Scenarios:
 New application development
 Analyze/optimize external application
 Suspected bottlenecks
First goal: overview of …
 Communication frequency and intensity
 Types and complexity of communication
 Source code locations of expensive MPI calls
 Differences between processes

Basic Principle of Profiling MPI
Intercept all MPI API calls
 Using wrappers for all MPI calls
Aggregate statistics over time
 Number of invocations
 Data volume
 Time spent during function execution
Multiple aggregation options/granularity
 By function name or type
 By source code location (call stack)
 By process rank

mpiP: Efficient MPI Profiling
Open source MPI profiling library
 Developed at LLNL, maintained by LLNL & ORNL
 Available from SourceForge
 Works with any MPI library
Easy-to-use and portable design
 Relies on PMPI instrumentation
 No additional tool daemons or support infrastructure
 Single text file as output
 Optional: GUI viewer

Running with mpiP 101 / Experimental Setup
mpiP works on binary files
 Uses standard development chain
 Use of "-g" recommended
Run option 1: relink
 Specify libmpiP.a/.so on the link line
 Portable solution, but requires object files
Run option 2: library preload
 Set preload variable (e.g., LD_PRELOAD) to mpiP
 Transparent, but only on supported systems

Running with mpiP 101 / Running

bash-3.2$ srun -n4 smg2000
mpiP:
mpiP: mpiP V3.1.2 (Build Dec /17:31:26)
mpiP: Direct questions and errors to mpip-
mpiP:
Running with these driver parameters:
  (nx, ny, nz) = (60, 60, 60)
  (Px, Py, Pz) = (4, 1, 1)
  (bx, by, bz) = (1, 1, 1)
  (cx, cy, cz) = ( , , )
  (n_pre, n_post) = (1, 1)
  dim = 3
  solver ID = 0
=============================================
Struct Interface:
=============================================
Struct Interface:
  wall clock time = seconds
  cpu clock time = seconds
=============================================
Setup phase times:
=============================================
SMG Setup:
  wall clock time = seconds
  cpu clock time = seconds
=============================================
Solve phase times:
=============================================
SMG Solve:
  wall clock time = seconds
  cpu clock time = seconds
Iterations = 7
Final Relative Residual Norm = e-07
mpiP:
mpiP: Storing mpiP output in [./smg2000-p mpiP].
mpiP:
bash-3.2$

mpiP 101 / Output
  Command : ./smg2000-p -n
  Version :
  MPIP Build date : Dec ,
  Start time :
  Stop time :
  Timer Used :
  MPIP env var :
  Collector Rank :
  Collector PID :
  Final Output Dir :
  Report generation :
  MPI Task Assignment : 0
  MPI Task Assignment : 1
  MPI Task Assignment : 2
  MPI Task Assignment : 3 hera31

mpiP 101 / Output – Overview
MPI Time (seconds)
Task   AppTime   MPITime   MPI%
  *

mpiP 101 / Output – Callsites
Callsites:
ID Lev File/Address        Line Parent_Funct                  MPI_Call
 1  0  communication.c     1405 hypre_CommPkgUnCommit         Type_free
 2  0  timing.c             419 hypre_PrintTiming             Allreduce
 3  0  communication.c      492 hypre_InitializeCommunication Isend
 4  0  struct_innerprod.c   107 hypre_StructInnerProd         Allreduce
 5  0  timing.c             421 hypre_PrintTiming             Allreduce
 6  0  coarsen.c            542 hypre_StructCoarsen           Waitall
 7  0  coarsen.c            534 hypre_StructCoarsen           Isend
 8  0  communication.c     1552 hypre_CommTypeEntryBuildMPI   Type_free
 9  0  communication.c     1491 hypre_CommTypeBuildMPI        Type_free
10  0  communication.c      667 hypre_FinalizeCommunication   Waitall
11  0  smg2000.c            231 main                          Barrier
12  0  coarsen.c            491 hypre_StructCoarsen           Waitall
13  0  coarsen.c            551 hypre_StructCoarsen           Waitall
14  0  coarsen.c            509 hypre_StructCoarsen           Irecv
15  0  communication.c     1561 hypre_CommTypeEntryBuildMPI   Type_free
16  0  struct_grid.c        366 hypre_GatherAllBoxes          Allgather
17  0  communication.c     1487 hypre_CommTypeBuildMPI        Type_commit
18  0  coarsen.c            497 hypre_StructCoarsen           Waitall
19  0  coarsen.c            469 hypre_StructCoarsen           Irecv
20  0  communication.c     1413 hypre_CommPkgUnCommit         Type_free
21  0  coarsen.c            483 hypre_StructCoarsen           Isend
22  0  struct_grid.c        395 hypre_GatherAllBoxes          Allgatherv
23  0  communication.c      485 hypre_InitializeCommunication Irecv

mpiP 101 / Output – per Function Timing
Aggregate Time (top twenty, descending, milliseconds)
Call          Site   Time   App%   MPI%   COV
Waitall
Isend
Irecv
Waitall
Type_commit
Type_free
Waitall
Type_free
Type_free
Type_free
Isend
Isend
Type_free
Irecv
Irecv

mpiP Output – per Function Message Size
Aggregate Sent Message Size (top twenty, descending, bytes)
Call          Site   Count   Total   Avrg   Sent%
Isend
Isend
Isend
Allreduce
Allgatherv
Allreduce
Allreduce
Allgather

mpiP 101 / Output – per Callsite Timing
Callsite Time statistics (all, milliseconds):
Name         Site   Rank   Count   Max   Mean   Min   App%   MPI%
Allgather
Allgather
Allgather
Allgather
Allgather    16  *
Allgatherv
Allgatherv
Allgatherv
Allgatherv
Allgatherv   22  *
Allreduce
Allreduce
Allreduce
Allreduce
Allreduce     2  *

mpiP 101 / Output – per Callsite Message Size
Callsite Message Sent statistics (all, sent bytes)
Name         Site   Rank   Count   Max   Mean   Min   Sum
Allgather
Allgather
Allgather
Allgather
Allgather    16  *
Allgatherv
Allgatherv
Allgatherv
Allgatherv
Allgatherv   22  *
Allreduce
Allreduce
Allreduce
Allreduce
Allreduce     2  *

Fine Tuning the Profile Run
mpiP Advanced Features
 User-controlled stack trace depth
 Reduced output for large-scale experiments
 Application control to limit scope
 Measurements for MPI I/O routines
Controlled by MPIP environment variable
 Set by user before profile run
 Command-line-style argument list
 Example: MPIP="-c -o -k 4"

mpiP Parameters
Param.   Description                                 Default
-c       Concise output / no callsite data
-f dir   Set output directory
-k n     Set callsite stack traceback size to n      1
-l       Use less memory for data collection
-n       Do not truncate pathnames
-o       Disable profiling at startup
-s n     Set hash table size                         256
-t x     Print threshold                             0.0
-v       Print concise & verbose output

Controlling the Stack Trace
Callsites are determined using stack traces
 Starting from the current call stack going backwards
 Useful to avoid MPI wrappers
 Helps to distinguish library invocations
Tradeoff: stack trace depth
 Too short: can't distinguish invocations
 Too long: extra overhead / too many call sites
User can set stack trace depth
 -k parameter

Concise Output
Output file contains many details
 Users often only interested in summary
 Per callsite/task data harms scalability
Option to provide concise output
 Same basic format
 Omit collection of per callsite/task data
User controls output format through parameters
 -c = concise output only
 -v = provide concise and full output files

Limiting Scope
By default, mpiP measures the entire execution
 Any event between MPI_Init and MPI_Finalize
Optional: controlling mpiP from within the application
 Disable data collection at startup (-o)
 Enable using MPI_Pcontrol(x)
  ❍ x=0: Disable profiling
  ❍ x=1: Enable profiling
  ❍ x=2: Reset profile data
  ❍ x=3: Generate full report
  ❍ x=4: Generate concise report

Limiting Scope / Example

for (i = 1; i < 10; i++) {
    switch (i) {
    case 5:
        MPI_Pcontrol(2);  /* reset & start in iteration 5 */
        MPI_Pcontrol(1);
        break;
    case 6:
        MPI_Pcontrol(0);  /* stop & report in iteration 6 */
        MPI_Pcontrol(4);
        break;
    default:
        break;
    }
    /* ... compute and communicate for one timestep ... */
}

Platforms
Highly portable design
 Built on top of PMPI, which is part of any MPI
 Very few dependencies
Tested on many platforms, including
 Linux (x86, x86-64, IA-64, MIPS64)
 BG/L & BG/P
 AIX (Power 3/4/5)
 Cray XT3/4/5 with Catamount and CNL
 Cray X1E

mpiP Installation
Download from SourceForge
 Current release version:
 CVS access to development version
Autoconf-based build system with options to
 Disable I/O support
 Pick a timing option
 Choose name demangling scheme
 Build on top of the suitable stack tracer
 Set maximal stack trace depth

mpiPView: The GUI for mpiP
 Optional: displaying mpiP data in a GUI
 Implemented as part of the Tool Gear project
 Reads mpiP output file
 Provides connection to source files
 Usage process
  ❍ First: select input metrics
  ❍ Hierarchical view of all callsites
  ❍ Source panel once a callsite is selected
  ❍ Ability to remap source directories

Example

VTune

Paraver Barcelona Supercomputing Center (Spain)

BSC Tools
 Since 1991
 Based on traces
 Open source
 Core tools:
  ❍ Paraver (paramedir) – offline trace analysis
  ❍ Dimemas – message-passing simulator
  ❍ Extrae – instrumentation
 Focus
  ❍ Detail, flexibility, intelligence

Extrae  Major BSC instrumentation package  When / where ❍ Parallel programming model runtime ◆ MPI, OpenMP, pthreads, OmpSs, CUDA, OpenCL, MIC… ◆ API entry/exit, OpenMP outlined routines ❍ Selected user functions ❍ Periodic samples ❍ User events  Additional information ❍ Counters ◆ PAPI ◆ Network counters ◆ OS counters ❍ Link to source code ◆ Callstack Parallel Performance Tools: A Short Course, Beihang University, December 2-4,

How does Extrae intercept your app?
 LD_PRELOAD
  ❍ Works on production binaries
  ❍ Specific library for each combination of runtimes
  ❍ Does not require knowledge of the application
 Dynamic instrumentation
  ❍ Works on production binaries
  ❍ Just specify functions to be instrumented
  ❍ Based on DynInst (U. Wisconsin / U. Maryland)
 Other possibilities
  ❍ Link instrumentation library statically (i.e., BG/Q, …)
  ❍ OmpSs (instrumentation calls injected by compiler + linked to library)

How to use Extrae?
 Adapt job submission script
  ❍ Specify LD_PRELOAD library and XML instrumentation control file
 Specify the data to be captured in the .xml instrumentation control file
 Run and get the trace …
Extrae User's Guide available in … Default control files and further examples within installation in $EXTRAE_HOME/share/example

Paraver – Performance data browser
Goal = flexibility
 No semantics
 Programmable
Trace visualization/analysis
 Timelines
 Raw data
 2D/3D tables (statistics)
Comparative analyses
 Multiple traces
 Synchronize scales + trace manipulation

Paraver mathematical foundation
 Every behavioral aspect/metric described as a function of time
  ❍ Possibly aggregated along
    ◆ the process model dimension (thread, process, application, workload)
    ◆ the resource model dimension (core, node, system)
  ❍ Language to describe how to compute such functions of time (GUI)
    ◆ Basic operators (from) trace records
    ◆ Ways of combining them
 Those functions of time can be rendered into a 2D image
  ❍ Timeline
 Statistics can be computed for each possible value or range of values of that function of time
  ❍ Tables: profiles and histograms

Paraver mathematical foundation
Trace → [Filter] → sub-trace / subset of records → [Semantic] → function of time: (S1,t1), (S2,t2), (S3,t3), … → [Display] → series of values
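A minimal sketch of this pipeline in C (illustrative, not Paraver's API or trace format): a sub-trace of (state, interval) records is treated as a piecewise-constant function of time and aggregated into a per-state time profile, which is what a profile table in the GUI ultimately contains.

```c
#include <string.h>

#define NSTATES 3  /* assumed encoding: 0 = running, 1 = MPI, 2 = idle */

typedef struct {
    int state;          /* semantic value S_i */
    double start, end;  /* interval over which the function of time holds S_i */
} Record;

/* Aggregate a series of (S_i, t_i) records into a per-state time profile. */
void profile(const Record *trace, int n, double totals[NSTATES]) {
    memset(totals, 0, NSTATES * sizeof(double));
    for (int i = 0; i < n; i++)
        totals[trace[i].state] += trace[i].end - trace[i].start;
}
```

A histogram is the same aggregation with value ranges of the function of time as bins instead of discrete states.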

Timelines  Representation ❍ Function of time ❍ Colour encoding ❍ Not null gradient ◆ Black for zero value ◆ Light green  Dark blue  Non linear rendering to address scalability J. Labarta, et al.: “Scalability of tracing and visualization tools”, PARCO 2005 Parallel Performance Tools: A Short Course, Beihang University, December 2-4,

Tables: Profiles, histograms, correlations
 From timelines to tables
[Figure: useful-duration timeline rendered as an MPI calls profile and a histogram of useful duration]

Dimemas: Coarse-grain, trace-driven simulation
 Simulation: highly non-linear model
  ❍ Linear components
    ◆ Point-to-point communication
    ◆ Sequential processor performance
      – Global CPU speed
      – Per block/subroutine
  ❍ Non-linear components
    ◆ Synchronization semantics
      – Blocking receives
      – Rendezvous
    ◆ Resource contention
      – CPU
      – Communication subsystem: links (half/full duplex), busses
[Architecture sketch: CPUs with local memory connected through network links and busses]
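The linear point-to-point component above is commonly written as T = L + size/BW (latency plus size over bandwidth). A tiny sketch in my own notation, not Dimemas code:

```c
/* Predicted point-to-point transfer time: the linear part of the model.
 * latency_s in seconds, bandwidth_Bps in bytes/second, bytes as a count. */
double transfer_time(double latency_s, double bandwidth_Bps, double bytes) {
    return latency_s + bytes / bandwidth_Bps;
}
```

For example, a 1 MB message at L = 25 us and BW = 100 MB/s costs 25e-6 + 1e6/100e6 ≈ 10 ms, which is why the parameter sweeps on the next slide vary exactly L and BW.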

Are all parts of an app. equally sensitive to network?
MPIRE 32 tasks, no network contention
[Four timelines, all windows at the same scale: L=25us BW=100MB/s; L=1000us BW=100MB/s; L=25us BW=10MB/s]

Ideal machine
 The impossible machine: BW = ∞, L = 0
 Actually describes/characterizes intrinsic application behavior
  ❍ Load balance problems?
  ❍ Dependence problems?
[Timelines: real run vs. ideal network, Nehalem cluster, 256 processes; regions marked waitall, sendrec, alltoall, allgather + sendrecv, allreduce]
 Impact on practical machines?

Prophesy

MuMMI

Active Harmony University of Maryland (USA) University of Wisconsin (USA)

Why Automate Performance Tuning?
 Too many parameters that impact performance.
 Optimal performance for a given system depends on:
  ❍ Details of the processor
  ❍ Details of the inputs (workload)
  ❍ Which nodes are assigned to the program
  ❍ Other things running on the system
 Parameters come from:
  ❍ User code
  ❍ Libraries
  ❍ Compiler choices
Automated parameter tuning can be used for adaptive tuning in complex software.

Automated Performance Tuning
 Goal: Maximize achieved performance
 Problems:
  ❍ Large number of parameters to tune
  ❍ Shape of objective function unknown
  ❍ Multiple libraries and coupled applications
  ❍ Analytical model may not be available
  ❍ Shape of objective function could change during execution!
 Requirements:
  ❍ Runtime tuning for long-running programs
  ❍ Don't run too many configurations
  ❍ Don't try really bad configurations

I really mean don't let people hand tune
 Example: a dense matrix multiply kernel
 Various options:
  ❍ Original program: 30.1 sec
  ❍ Hand-tuned (by developer): 11.4 sec
  ❍ Auto-tuned version of the hand-tuned code: 15.9 sec
  ❍ Auto-tuned original program: 8.5 sec
 What happened?
  ❍ Hand tuning prevented analysis
    ◆ Auto-tuned transformations were then not possible

Active Harmony
 Runtime performance optimization
  ❍ Can also support training runs
 Automatic library selection (code)
  ❍ Monitor library performance
  ❍ Switch library if necessary
 Automatic performance tuning (parameter)
  ❍ Monitor system performance
  ❍ Adjust runtime parameters
 Hooks for compiler frameworks
  ❍ Integrated with USC/ISI CHiLL

Basic Client API Overview
 Only five core functions needed
  ❍ Initialization/teardown (4 functions)
  ❍ Critical loop analysis (2 functions)
  ❍ Available for C, C++, Fortran, and Chapel

hdesc_t *harmony_init(appName, iomethod)
int harmony_connect(hdesc, host, port)
int harmony_bundle_file(hdesc, filename)
harmony_bind_int(hdesc, varName, &ptr)

int harmony_fetch(hdesc)
int harmony_report(hdesc, performance)
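The critical-loop pattern the API implies — fetch a candidate configuration, run with it, report the measured performance back to the search — can be sketched abstractly. The code below is a hypothetical stand-in (no real Active Harmony calls; the Tuner type and its exhaustive "strategy" are my own), showing only the shape of the fetch/report loop.

```c
/* Hypothetical tuner skeleton: exhaustive strategy over one integer parameter. */
typedef struct {
    int next;         /* next configuration to hand out */
    int best;         /* best configuration seen so far */
    double best_cost; /* its measured cost */
} Tuner;

void tuner_init(Tuner *t) { t->next = 1; t->best = -1; t->best_cost = 1e300; }

/* fetch: return the next configuration to try, or -1 when the search is done. */
int tuner_fetch(Tuner *t) { return (t->next <= 8) ? t->next++ : -1; }

/* report: feed measured performance back so the search can adapt. */
void tuner_report(Tuner *t, int config, double cost) {
    if (cost < t->best_cost) { t->best_cost = cost; t->best = config; }
}
```

In the real client API the application binds its tunable variables once, then repeats fetch/measure/report around its critical loop; a smarter strategy (such as Parallel Rank Order, below) replaces the exhaustive enumeration.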

Harmony Bundle Definitions
 External bundle definition file
 Equivalent client API routines available

code_generation]$ cat bundle.cfg
harmonyApp gemm {
  { harmonyBundle TI { int { } global } }
  { harmonyBundle TJ { int { } global } }
  { harmonyBundle TK { int { } global } }
  { harmonyBundle UI { int {1 8 1} global } }
  { harmonyBundle UJ { int {1 8 1} global } }
}

(Annotated on the slide: bundle name, value class, value range, bundle scope.)

A Bit More About Harmony Search
 Pre-execution
  ❍ Sensitivity discovery phase
  ❍ Used to order, not eliminate, search dimensions
 Online algorithm
  ❍ Use Parallel Rank Order search
    ◆ Variation of the Rank Order algorithm
      – Part of the class of generating set algorithms
    ◆ Different configurations on different nodes

Parallel Rank Ordering Algorithm
 All but the best point of the simplex move.
 Computations can be done in parallel.

Concern: Performance Variability Performance has two components: – Deterministic: a function of the tuned parameters – Stochastic: a function of the parameters and of interference such as high-priority job throughput – More of a problem on clusters

Performance Variability Impact on Optimization  We do not need the exact performance value.  We only need to compare different configurations.  Approach: take multiple samples and apply a simple aggregation operator.  Goal: choose an operator that yields lower variance.

Integrating Compiler Transformations

Auto-tuning Compiler Transformations  Described as a series of Recipes  Recipes consist of a sequence of operations ❍ permute([stmt],order): change the loop order ❍ tile(stmt,loop,size,[outer-loop]): tile loop at level loop ❍ unroll(stmt,loop,size): unroll stmt's loop at level loop ❍ datacopy(stmt,loop,array,[index]): ◆ Make a local copy of the data ❍ split(stmt,loop,condition): split stmt's loop level loop into multiple loops ❍ nonsingular(matrix)

CHiLL + Active Harmony  Ananta Tiwari, Chun Chen, Jacqueline Chame, Mary Hall, Jeffrey K. Hollingsworth, "A Scalable Auto-tuning Framework for Compiler Optimization," IPDPS 2009, Rome, May 2009.  Generate and evaluate different optimizations that would have been prohibitively time-consuming for a programmer to explore manually.

SMG2000 Optimization
Outlined code:
for (si = 0; si < stencil_size; si++)
  for (kk = 0; kk < hypre__mz; kk++)
    for (jj = 0; jj < hypre__my; jj++)
      for (ii = 0; ii < hypre__mx; ii++)
        rp[((ri+ii)+(jj*hypre__sy3))+(kk*hypre__sz3)] -=
          Ap_0[((ii+(jj*hypre__sy1))+(kk*hypre__sz1)) + (((A->data_indices)[i])[si])] *
          xp_0[((ii+(jj*hypre__sy2))+(kk*hypre__sz2)) + ((*dxp_s)[si])];
CHiLL transformation recipe:
permute([2,3,1,4])
tile(0,4,TI)
tile(0,3,TJ)
tile(0,3,TK)
unroll(0,6,US)
unroll(0,7,UI)
Constraints on search:
0 ≤ TI, TJ, TK ≤ 122
0 ≤ UI ≤ 16
0 ≤ US ≤ 10
compilers ∈ {gcc, icc}
Search space: 122×122×122×16×10×2 ≈ 581M points

SMG2000 Search and Results  Selected parameters: TI=122, TJ=106, TK=56, UI=8, US=3, Comp=gcc  Performance gain on residual computation: 2.37x  Performance gain on full app: 27.23% improvement  Parallel search evaluates 490 points and converges in 20 steps

Compiling New Code Variants at Runtime
[Diagram: a code server runs code-generation tools on the outlined code section, compiling candidate variants v1 … vN into shared objects (v1.so … vN.so). At each search step SS1 … SSN, Active Harmony sends code-transformation parameters to the code server; the running application stalls briefly (stall_phase) until the READY signal, loads the new variant, and reports performance measurements PM1 … PMN back to the search.]

Auto-Tuning for Communication  Optimize overlap ❍ Hide communication behind as much computation as possible ❍ Tune how often to call poll routines ◆ Messages don't move unless the application calls into the MPI library ◆ Calling too often results in slowdown  Portability ❍ No reliance on special hardware support for overlap  Optimize local computation ❍ Improve cache performance with loop tiling  Parameterize & auto-tune ❍ Cope with the complex trade-offs among these optimization techniques

Parallel 3-D FFT Performance  NEW: our method with overlap  TH: another method with overlap (not as fast)  FFTW: a popular library without overlap
[Chart: complex-number 3-D FFT on 256 cores of Hopper; NEW achieves a 1.76x speedup over FFTW.]

Conclusions and Future Work  Ongoing work ❍ More end-to-end application studies ❍ Continued evaluation of online code generation ❍ Communication optimization ❍ Re-using data from prior runs  Conclusions ❍ Auto-tuning can be done at many levels ◆ Offline – using training runs (choices fixed for an entire run) – compiler options – programmer-supplied per-run tunable parameters – compiler transformations ◆ Online – training or production (choices change during execution) – programmer-supplied per-timestep parameters – compiler transformations ❍ It works! ◆ Real programs run faster

Orio

Pbound

PerfExpert

STAT: Stack Trace Analysis Tool Lawrence Livermore National Laboratory (USA) University of Wisconsin (USA) University of New Mexico (USA)

STAT: Aggregating Stack Traces for Debugging  Existing debuggers don't scale o Inherent limits in the approaches o Need for new, scalable methodologies  Need to pre-analyze and reduce data o Fast tools to gather state o Help select nodes on which to run conventional debuggers  Scalable tool: STAT o Stack Trace Analysis Tool o Goal: identify equivalence classes of tasks o Hierarchical and distributed aggregation of stack traces from all tasks o Stack trace merge in <1s from 200K+ cores

Distinguishing Behavior with Stack Traces

3D-Trace Space/Time Analysis

Scalable Representation 288 Nodes / 10 Snapshots