Performance Instrumentation and Measurement for Terascale Systems. Jack Dongarra, Shirley Moore, Philip Mucci (University of Tennessee); Sameer Shende and Allen Malony (University of Oregon)


Performance Instrumentation and Measurement for Terascale Systems Jack Dongarra, Shirley Moore, Philip Mucci University of Tennessee Sameer Shende, and Allen Malony University of Oregon

Requirements for Terascale Systems Performance framework must support a wide range of –Performance problems (e.g., single-node performance, synchronization and communication overhead, load balancing) –Performance evaluation methods (e.g., parameter-based modeling, bottleneck detection and diagnosis) –Programming environments (e.g., multiprocess and/or multithreaded, parallel and distributed, large-scale) Need for a flexible and extensible performance observation framework

Research Problems Appropriate level and location for implementing instrumentation and measurement How to make the framework modular and extensible Appropriate compromise between level of detail/accuracy and instrumentation cost

Instrumentation Strategies Source code instrumentation –Manual or using a preprocessor Library-level instrumentation –e.g., MPI and OpenMP profiling interfaces Binary rewriting –e.g., Pixie, ATOM, EEL, PAT Dynamic instrumentation –e.g., DyninstAPI
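The MPI profiling interface mentioned above lets a measurement library wrap any MPI call without touching the application: every MPI routine also has a PMPI-prefixed alias, so a tool defines its own MPI_Send, measures it, and forwards to the real implementation. A minimal sketch (requires an MPI implementation to compile; the accumulator variables are illustrative, not part of any tool's actual API):

```c
#include <mpi.h>

/* Link-time interposition via the MPI profiling interface: this
   definition shadows the library's MPI_Send, times the call, and
   forwards to the real implementation through PMPI_Send. */
static double send_seconds = 0.0;   /* illustrative accumulators */
static long   send_calls   = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);
    send_seconds += MPI_Wtime() - t0;
    send_calls++;
    return rc;
}
```

Linking this wrapper ahead of the MPI library is all that is needed; the application is neither modified nor recompiled, which is exactly why the profiling interface is listed as a distinct strategy from source instrumentation.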

Types of Measurements Profiling Tracing Real-time Analysis

Profiling Recording of summary information during execution –inclusive, exclusive time, # calls, hardware statistics, … Reflects performance behavior of program entities –functions, loops, basic blocks –user-defined “semantic” entities Very good for low-cost performance assessment Helps to expose performance bottlenecks and hotspots Implemented through –sampling: periodic OS interrupts or hardware counter traps –instrumentation: direct insertion of measurement code

Tracing –Recording of information about significant points (events) during program execution entering/exiting code region (function, loop, block, …) thread/process interactions (e.g., send/receive message) –Save information in event record timestamp CPU identifier, thread identifier Event type and event-specific information – Event trace is a time-sequenced stream of event records – Can be used to reconstruct dynamic program behavior –Typically requires code instrumentation

Real-time Analysis Allows evaluation of program performance during execution Examples –Paradyn –Autopilot –Perfometer

TAU Performance System Architecture [architecture diagram; labels include the EPILOG and Paraver trace formats]

TAU Instrumentation Manually using the TAU instrumentation API Automatically using –Program Database Toolkit (PDT) –MPI profiling library –OPARI OpenMP rewriting tool Uses PAPI to access hardware counter data

Program Database Toolkit (PDT) Program code analysis framework for developing source-based tools High-level interface to source code information Integrated toolkit for source code parsing, database creation, and database query –commercial grade front end parsers –portable IL analyzer, database format, and access API –open software approach for tool development Targets and integrates multiple source languages Used in TAU to build automated performance instrumentation tools

PDT Components Language front end –Edison Design Group (EDG): C, C++ –Mutek Solutions Ltd.: F77, F90 –creates an intermediate-language (IL) tree IL Analyzer –processes the intermediate language (IL) tree –creates “program database” (PDB) formatted file DUCTAPE (Bernd Mohr, ZAM, Germany) –C++ program Database Utilities and Conversion Tools APplication Environment –processes and merges PDB files –C++ library to access the PDB for PDT applications

OPARI: Basic Usage (f90) Reset OPARI state information –rm -f opari.rc Call OPARI for each input source file –opari file1.f90 ... opari fileN.f90 Generate the OPARI runtime table and compile it with ANSI C –opari -table opari.tab.c; cc -c opari.tab.c Compile the modified files *.mod.f90 using OpenMP Link the resulting object files, the OPARI runtime table opari.tab.o, and the TAU POMP RTL
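Put end to end, the build sequence above might look like the following shell sketch (assumes OPARI and a POMP runtime library are installed; compiler names, OpenMP flags, and the -lpomp library name vary by platform and are illustrative):

```shell
rm -f opari.rc                    # reset OPARI state information
opari file1.f90                   # instrument each input source file
opari file2.f90
opari -table opari.tab.c          # generate the OPARI runtime table
cc -c opari.tab.c                 # compile the table with ANSI C
f90 -omp -c file1.mod.f90 file2.mod.f90   # compile modified sources
f90 -omp -o app file1.mod.o file2.mod.o \
    opari.tab.o -lpomp            # link table and POMP RTL
```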

TAU Analysis Profile analysis –pprof parallel profiler with text-based display –Racy / jRacy graphical interface to pprof (Tcl/Tk) jRacy is a Java implementation of Racy –ParaProf Next-generation parallel profile analysis and display Trace analysis and visualization –Trace merging and clock adjustment (if necessary) –Trace format conversion (ALOG, SDDF, Vampir) –Vampir (Pallas) trace visualization –Paraver (CEPBA) trace visualization

TAU Pprof Display

jRacy (NAS Parallel Benchmark – LU) n: node c: context t: thread Global profiles Individual profile Routine profile across all nodes

ParaProf Scalable Profiler Re-implementation of jRacy tool Target flexibility in profile input source –Profile files, performance database, online Target scalability in profile size and display –Will include three-dimensional display support Provide more robust analysis and extension –Derived performance statistics

ParaProf Architecture

512-Processor Profile (SAMRAI)

Three-dimensional Profile Displays 500-processor Uintah execution (University of Utah)

Overview of PAPI Performance Application Programming Interface The purpose of the PAPI project is to design, standardize, and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors. Parallel Tools Consortium project Reference implementations for all major HPC platforms Installed and in use at major government labs and academic sites Becoming a de facto industry standard Incorporated into many performance analysis tools –e.g., HPCView, SvPablo, TAU, Vampir, Vprof

PAPI Counter Interfaces PAPI provides three interfaces to the underlying counter hardware: 1. The low-level interface provides functions for setting options, accessing native events, callbacks on counter overflow, etc. 2. The high-level interface simply provides the ability to start, stop, and read the counters for a specified list of events. 3. Graphical tools to visualize information.

PAPI Implementation [layer diagram] Portable layer: tools call the PAPI high-level and low-level interfaces. Machine-specific layer: the PAPI machine-dependent substrate maps those calls onto the operating system kernel extension and the hardware performance counters.

PAPI Preset Events Proposed standard set of events deemed most relevant for application performance tuning Defined in papiStdEventDefs.h Mapped to native events on a given platform –Run tests/avail to see list of PAPI preset events available on a platform
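The preset events and the high-level interface come together in a short program: the classic high-level calls start a list of presets, run the code of interest, and read the accumulated counts. A sketch using the era's PAPI_start_counters/PAPI_stop_counters calls (requires libpapi; whether PAPI_FP_OPS maps to a native event depends on the platform, which is why the return code is checked):

```c
#include <stdio.h>
#include <papi.h>

int main(void) {
    /* Two preset events from papiStdEventDefs.h; availability is
       platform-dependent (cf. the tests/avail utility). */
    int events[2] = { PAPI_TOT_CYC, PAPI_FP_OPS };
    long long values[2];

    if (PAPI_start_counters(events, 2) != PAPI_OK)
        return 1;   /* a preset is not mapped on this platform */

    volatile double s = 0.0;             /* work being measured */
    for (int i = 0; i < 1000000; i++) s += i * 0.5;

    if (PAPI_stop_counters(values, 2) != PAPI_OK)
        return 1;
    printf("cycles: %lld  fp ops: %lld\n", values[0], values[1]);
    return 0;
}
```

The low-level interface would instead create an event set explicitly, which is what a tool needs for native events or overflow callbacks.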

Scalability of PAPI Instrumentation Overhead of library calls to read counters can be excessive. Statistical sampling can reduce overhead. PAPI substrate for Alpha Tru64 UNIX –Built on top of DADD/DCPI (Dynamic Access to DCPI Data/Digital Continuous Profiling Interface) –Sampling approach supported in hardware –1-2% overhead compared to 30% on other platforms Using sampling and hardware profiling support on Itanium/Itanium2

Vampir v3.x: Hardware Counter Data Counter Timeline Display

What is DynaProf? A portable tool to instrument a running executable with probes that monitor application performance. Simple command line interface. Open source software. A work in progress…

DynaProf Methodology Make collection of run-time performance data easy by: –Avoiding instrumentation and recompilation –Using the same tool with different probes –Providing useful and meaningful probe data –Providing different kinds of probes –Allowing custom probes

Why the “Dyna”? Instrumentation is selectively inserted directly into the program’s address space. Why is this a better way? –No perturbation of compiler optimizations –Complete language independence –Multiple Insert/Remove instrumentation cycles

DynaProf Design GUI, command line & script driven user interface Uses GNU readline for command line editing and command completion. Instrumentation is done using: –Dyninst on Linux, Solaris and IRIX –DPCL on AIX

DynaProf Commands load, list [module pattern], use [probe args], instr module [probe args], instr function [probe args], stop, continue, run [args], info, unload

DynaProf Probe Design Probes provided with the distribution –Wallclock probe –PAPI probe –Perfometer probe Can be written in any compiled language Probes export three functions with a standardized interface. Easy to roll your own (<1 day) Supports separate probes for MPI/OpenMP/Pthreads

Future development GUI development Additional probes –Perfex probe –Vprof probe –TAU probe Better support for parallel applications

Perfometer Application is instrumented with PAPI –call perfometer() –call mark_perfometer(int color, char *label) Application is started. At the call to perfometer, a signal handler and a timer are set up to collect and send the information to a Java applet containing the graphical view. Sections of code that are of interest can be designated with specific colors. Real-time display or trace file

Perfometer Display [annotated screenshot: machine info, process & real time, flop/s rate, flop/s min/max]

Perfometer Parallel Interface

Conclusions The TAU and PAPI projects are addressing important research problems involved in constructing a flexible and extensible performance observation framework. Widespread adoption of PAPI demonstrates the value of a portable interface to low-level, architecture-specific performance monitoring hardware. The TAU framework provides flexible mechanisms for instrumentation and measurement.

Conclusions (cont.) Terascale systems require scalable low-overhead means of collecting performance data. –Statistical sampling support in PAPI –TAU filtering and feedback schemes for focusing instrumentation –Real-time monitoring capabilities (Dynaprof, Perfometer) PAPI and TAU infrastructure is designed for interoperability, flexibility, and extensibility.

More Information –Software, documentation, and mailing lists are available from the TAU, PDT, PAPI, and OPARI project web sites.