Allen D. Malony, Sameer S. Shende, Alan Morris, Robert Bell, Kevin Huck, Nick Trebon, Suravee Suthikulpanit, Kai Li, Li Li

Slides:

Advertisements

Similar presentations

Machine Learning-based Autotuning with TAU and Active Harmony Nicholas Chaimov University of Oregon Paradyn Week 2013 April 29, 2013.

Advertisements

K T A U Kernel Tuning and Analysis Utilities Department of Computer and Information Science Performance Research Laboratory University of Oregon.

Dynamic performance measurement control Dynamic event grouping Multiple configurable counters Selective instrumentation Application-Level Performance Access.

Workload Characterization using the TAU Performance System Sameer Shende, Allen D. Malony, Alan Morris University of Oregon {sameer,

Sameer Shende, Allen D. Malony, and Alan Morris {sameer, malony, Steven Parker, and J. Davison de St. Germain {sparker,

Robert Bell, Allen D. Malony, Sameer Shende Department of Computer and Information Science Computational Science.

Scalability Study of S3D using TAU Sameer Shende

Sameer Shende Department of Computer and Information Science Neuro Informatics Center University of Oregon Tool Interoperability.

Allen D. Malony, Sameer Shende Department of Computer and Information Science Computational Science Institute University.

Profiling S3D on Cray XT3 using TAU Sameer Shende

TAU: Tuning and Analysis Utilities. TAU Performance System Framework  Tuning and Analysis Utilities  Performance system framework for scalable parallel.

TAU Parallel Performance System DOD UGC 2004 Tutorial Allen D. Malony, Sameer Shende, Robert Bell Univesity of Oregon.

The TAU Performance Technology for Complex Parallel Systems (Performance Analysis Bring Your Own Code Workshop, NRL Washington D.C.) Sameer Shende, Allen.

Nick Trebon, Alan Morris, Jaideep Ray, Sameer Shende, Allen Malony {ntrebon, amorris, Department of.

On the Integration and Use of OpenMP Performance Tools in the SPEC OMP2001 Benchmarks Bernd Mohr 1, Allen D. Malony 2, Rudi Eigenmann 3 1 Forschungszentrum.

Allen D. Malony Department of Computer and Information Science Performance Research Laboratory University of Oregon Performance Technology.

Allen D. Malony, Sameer Shende Department of Computer and Information Science Computational Science Institute University.

The TAU Performance System: Advances in Performance Mapping Sameer Shende University of Oregon.

TAU Performance System Alan Morris, Sameer Shende, Allen D. Malony University of Oregon {amorris, sameer,

Performance Tools BOF, SC’07 5:30pm – 7pm, Tuesday, A9 Sameer S. Shende Performance Research Laboratory University.

Performance Instrumentation and Measurement for Terascale Systems Jack Dongarra, Shirley Moore, Philip Mucci University of Tennessee Sameer Shende, and.

Allen D. Malony Department of Computer and Information Science Computational Science Institute University of Oregon TAU Performance.

June 2, 2003ICCS Performance Instrumentation and Measurement for Terascale Systems Jack Dongarra, Shirley Moore, Philip Mucci University of Tennessee.

Allen D. Malony Department of Computer and Information Science Performance Research Laboratory NeuroInformatics Center University.

TAU Parallel Performance System DOD UGC 2004 Tutorial Part 1: TAU Overview and Architecture.

Performance Evaluation of S3D using TAU Sameer Shende

TAU: Performance Regression Testing Harness for FLASH Sameer Shende

TAU Performance Toolkit (WOMPAT 2004 OpenMP Lab) Sameer Shende, Allen D. Malony University of Oregon {sameer,

Scalability Study of S3D using TAU Sameer Shende

Allen D. Malony, Sameer Shende, Robert Bell Department of Computer and Information Science Computational Science Institute, NeuroInformatics.

Kai Li, Allen D. Malony, Robert Bell, Sameer Shende Department of Computer and Information Science Computational.

The TAU Performance System Sameer Shende, Allen D. Malony, Robert Bell University of Oregon.

Sameer Shende, Allen D. Malony Computer & Information Science Department Computational Science Institute University of Oregon.

Performance Observation Sameer Shende and Allen D. Malony cs.uoregon.edu.

SC’01 Tutorial Nov. 7, 2001 TAU Performance System Framework  Tuning and Analysis Utilities  Performance system framework for scalable parallel and distributed.

Allen D. Malony Performance Research Laboratory (PRL) Neuroinformatics Center (NIC) Department.

Research & Development Roadmap 1. Outline A New Communication Framework Giving Bro Control over the Network Security Monitoring for Industrial Control.

Paradyn Week – April 14, 2004 – Madison, WI DPOMP: A DPCL Based Infrastructure for Performance Monitoring of OpenMP Applications Bernd Mohr Forschungszentrum.

Integrated Performance Views in Charm++: Projections meets TAU Scott Biersdorff Allen D. Malony Department Computer and Information Science University.

Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer,

Allen D. Malony Department of Computer and Information Science Performance Research Laboratory University of Oregon Performance Technology.

John Mellor-Crummey Robert Fowler Nathan Tallent Gabriel Marin Department of Computer Science, Rice University Los Alamos Computer Science Institute HPCToolkit.

Profile Analysis with ParaProf Sameer Shende Performance Reseaerch Lab, University of Oregon

Issues Autonomic operation (fault tolerance) Minimize interference to applications Hardware support for new operating systems Resource management (global.

Dynamic performance measurement control Dynamic event grouping Multiple configurable counters Selective instrumentation Application-Level Performance Access.

ASC Tri-Lab Code Development Tools Workshop Thursday, July 29, 2010 Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA This work.

Allen D. Malony Department of Computer and Information Science TAU Performance Research Laboratory University of Oregon Discussion:

1 SciDAC High-End Computer System Performance: Science and Engineering Jack Dongarra Innovative Computing Laboratory University of Tennesseehttp://

Allen D. Malony, Sameer S. Shende, Robert Bell Kai Li, Li Li, Kevin Huck Department of Computer.

Shangkar Mayanglambam, Allen D. Malony, Matthew J. Sottile Computer and Information Science Department Performance.

Allen D. Malony Department of Computer and Information Science Performance Research Laboratory University.

Integrated Performance Views in Charm++: Projections meets TAU Scott Biersdorff Allen D. Malony Department Computer and Information Science University.

Allen D. Malony Department of Computer and Information Science Performance Research Laboratory.

Performane Analyzer Performance Analysis and Visualization of Large-Scale Uintah Simulations Kai Li, Allen D. Malony, Sameer Shende, Robert Bell Performance.

Online Performance Analysis and Visualization of Large-Scale Parallel Applications Kai Li, Allen D. Malony, Sameer Shende, Robert Bell Performance Research.

Kai Li, Allen D. Malony, Sameer Shende, Robert Bell

Introduction to the TAU Performance System®

Performance Technology for Scalable Parallel Systems

TAU integration with Score-P

Tutorial Outline – Part 1

Allen D. Malony, Sameer Shende

TAU Parallel Performance System

Performance Technology for Complex Parallel and Distributed Systems

A configurable binary instrumenter

TAU Parallel Performance System

TAU: A Framework for Parallel Performance Analysis

Outline Introduction Motivation for performance mapping SEAA model

Allen D. Malony, Sameer Shende

Parallel Program Analysis Framework for the DOE ACTS Toolkit

TAU Performance DataBase Framework (PerfDBF)

Presentation transcript:

Allen D. Malony, Sameer S. Shende, Alan Morris, Robert Bell, Kevin Huck, Nick Trebon, Suravee Suthikulpanit, Kai Li, Li Li Department of Computer and Information Science Performance Research Laboratory University of Oregon TAU Parallel Performance System

DOE NNSA/ASC Presentation, SC20042 Outline  Motivation  TAU architecture and toolkit  Instrumentation  Measurement  Analysis  Example DOE ASC applications  TAU status  Conclusion

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC20043 Problem Domain  ASC defines leading edge parallel systems and software  Large-scale systems and heterogenous platforms  Multi-model, multi-module simulation  Complex, multi-layered software integration  Multi-language programming  Mixed-model parallelism and hybrid parallelism  Complexity challenges performance analysis tools  System diversity requires portable tools  Need for cross-language support  Support different parallel computation models  Operate at scale

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC20044 Research Motivation  Tools for parallel performance problem solving  Empirical-based performance optimization process  Performance technology concerns characterization Performance Tuning Performance Diagnosis Performance Experimentation Performance Observation hypotheses properties Instrumentation Measurement Analysis Visualization Performance Technology Experiment management Performance database Performance Technology

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC20045 TAU Performance System  Tuning and Analysis Utilities (12+ year project effort)  Performance system framework for HPC systems  Integrated, scalable, flexible, and parallel  Targets a general complex system computation model  Entities: nodes / contexts / threads  Multi-level: system / software / parallelism  Measurement and analysis abstraction  Integrated toolkit for performance instrumentation, measurement, analysis, and visualization  Portable performance profiling and tracing facility  Open software approach with technology integration  University of Oregon, Research Center Jülich, LANL

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC20046 TAU Performance System Objectives  Multi-level performance instrumentation  Multi-language automatic source instrumentation  Flexible and configurable performance measurement  Widely-ported parallel performance profiling system  Computer system architectures and operating systems  Different programming languages and compilers  Support for multiple parallel programming paradigms  Multi-threading, message passing, mixed-mode, hybrid  Enable performance mapping across semantic levels  Support for object-oriented and generic programming  Integration in complex software systems and applications

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC20047 memory Node VM space Context SMP Threads node memory … … Interconnection Network Inter-node message communication * * physical view model view General Complex System Computation Model  Node: physically distinct shared memory machine  Message passing node interconnection network  Context: distinct virtual memory space within node  Thread: execution threads (user/system) in context

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC20048 TAU Performance System Architecture

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC20049 TAU Performance System Architecture

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU Instrumentation Approach  Support for standard program events  Routines and statement-level blocks  Classes and templates  Based on begin/end (paired) event semantics  Support for user-defined events  “User-defined timers” (begin/end semantics)  Atomic events (standard and user-defined)  Selection of event statistics  Support definition of “semantic” entities for mapping  Support for event groups  Instrumentation optimization

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU Instrumentation Mechanisms  Flexible instrumentation mechanisms at multiple levels  Source code  manual  automatic C, C++, F77/90/95 (Program Database Toolkit (PDT)) OpenMP (directive rewriting (Opari), POMP spec)  Object code  pre-instrumented libraries (e.g., MPI using PMPI)  statically-linked and dynamically-linked  Executable code  dynamic instrumentation (pre-execution) (DyninstAPI)  virtual machine instrumentation (e.g., Java using JVMPI)  TAU_COMPILER to automate instrumentation process

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Multi-Level Instrumentation and Mapping  Multiple interfaces  Simultaneously active  Information sharing between interfaces  Selective instrumentation  Within/between levels  Associate performance data with high-level semantic abstractions User-level abstractions problem domain source code object codelibraries instrumentation executable runtime image compiler linkerOS VM instrumentation performance data run preprocessor

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU Source Instrumentation  Automatic source instrumentation (tau_instrumentor)  Routine entry/exit and class method entry/exit  Block entry/exit and statement level (to be added)  Uses an instrumentation specification file  Include/exclude list for events and files  Uses command line options for group selection  Instrumentation event selection (tau_select)  Automatic generation of instrumentation specification file  Instrumentation language to describe event constraints  Event identity and location  Event performance properties (e.g., overhead analysis)  Create tau_select scripts for performance experiments

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Program Database Toolkit (PDT)  Program code analysis framework  Use to develop source-based tools  High-level interface to source code information  Integrated toolkit for source code parsing, database creation, and database query  Commercial grade front-end parsers  Portable IL analyzer, database format, and access API  Open software approach for tool development  Multiple source languages  Implement automatic performance instrumentation tools  tau_instrumentor

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Program Database Toolkit (PDT) Application / Library C / C++ parser Fortran parser F77/90/95 C / C++ IL analyzer Fortran IL analyzer Program Database Files IL DUCTAPE PDBhtml SILOON CHASM TAU_instr Program documentation Application component glue C++ / F90/95 interoperability Automatic source instrumentation

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC PDT Status  Cleanscape Flint parser fully integrated for F90/95  Flint parser is very robust  Produces PDB records for TAU instrumentation  Linux x86, HP Tru64, IBM AIX  Tested on SAGE, POP, ESMF, PET benchmarking codes  C++ and Fortran statement-level information  for/while loops, declarations, initialization, assignment,…  PDB records defined for most constructs  PDT applications  CHASM: C++ / Fortran 90/95 interoperability  CCA: proxy generation, component instrumentation

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU Performance Measurement  TAU supports profiling and tracing measurement  Robust timing and hardware performance support  Online profile access and sampling  Extension of TAU measurement for multiple counters  User-defined TAU counters and system-level metrics  Integration with trace measurement  Support for memory and callpath profiling  Fully portable parallel performance tracing solution  Hierarchical trace merging and trace translation  Component software monitoring  Online performance profile overhead compensation

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU Measurement Mechanisms  Performance data sources  High-resolution timer library (real-time / virtual clocks)  General software counter library (user-defined events)  Hardware performance counters  PCL (Performance Counter Library) (ZAM, Germany)  PAPI (Performance API) (UTK, Ptools Consortium)  consistent, abstract, portable API  Organization  Node, context, thread levels  Profile groups for collective events (runtime selective)  Performance data mapping between software levels

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU Measurement Mechanisms (continued)  Parallel profiling  Function-level, block-level, statement-level  Supports user-defined events  TAU parallel profile data stored during execution  Hardware counts values  Support for multiple counters  Support for callgraph and callpath profiling  Tracing  All profile-level events  Inter-process communication events  Inclusion of counter data in traced events

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Performance Analysis and Visualization  Analysis of parallel profile and trace measurement  Parallel profile analysis  ParaProf: parallel profile analysis and presentation  ParaVis: parallel performance data visualization (proto)  Profile generation from trace data  Performance data management framework (PerfDMF)  Parallel trace analysis  Translation to VTF (V3.0) and EPILOG formats  Integration with VNG (Technical University of Dresden)  Online parallel analysis and visualization  Integration with CUBE browser (UTK, FZJ)

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC ParaProf Framework Architecture  Portable, extensible, and scalable tool for profile analysis  Try to offer “best of breed” capabilities to analysts  Build as profile analysis framework for extensibility

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU PerfDMF Architecture

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Selected Applications of TAU  Center for Simulation of Accidental Fires and Explosion  University of Utah, ASCI ASAP Center, C-SAFE  Uintah Computational Framework (UCF) (C++)  Center for Simulation of Dynamic Response of Materials  California Institute of Technology, ASCI ASAP Center  Virtual Testshock Facility (VTF) (Python, Fortran 90)  Earth Systems Modeling Framework (ESMF)  NSF, NOAA, DOE, NASA, …  Instrumentation for ESMF framework and applications  C, C++, and Fortran 95 code modules  MPI wrapper library for MPI calls

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Selected Applications of TAU (continued)  Lawrence Livermore National Lab  Hydrodynamics (Miranda)  Radiation diffusion (KULL)  C++ automatic instrumentation, callpath profiling  Sandia National Lab  DOE CCTTSS SciDAC project  Common component architecture (CCA) integration  Combustion code (C++, Fortran 90, GrACE, MPI)  Los Alamos National Lab  Monte Carlo transport (MCNP) (Susan Post)  ASCI Q validation and scaling  SAIC’s Adaptive Grid Eulerian (SAGE) (Jack Horner)  Fortran 90 automatic instrumentation testcase

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Component-Based Scientific Applications  How to support performance analysis and tuning process consistent with application development methodology?  Common Component Architecture (CCA) applications  Performance tools should integrate with software  Design performance observation component  Measurement port and measurement interfaces  Build support for application component instrumentation  Interpose a proxy component for each port  Inside proxy, track caller/callee invocations and timings  Automate the process of proxy component creation  using PDT for static analysis of components  include support for selective instrumentation

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Flame Reaction-Diffusion (Sandia, J. Ray) CCAFFEINE

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Earth Systems Modeling Framework  Coupled modeling with modular software framework  Instrumentation for ESMF framework and applications  PDT automatic instrumentation  Fortran 95 code modules  C / C++ code modules  MPI wrapper library for MPI calls  ESMF Component instrumentation (using CCA)  CCA measurement port manual instrumentation  Proxy generation using PDT and runtime interposition  Significant use of callpath profiling by ESMF team

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU’s Paraprof Profile Browser (ESMF Data) Callpath profile Global profile

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC CUBE Browser (UTK, FZJ) (ESMF Data) metric calltree location TAU callpath profile data converted to CUBE form

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU Traces with Hardware Counters (ESMF)

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU Traces with User-Defined Counters

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Uintah Computational Framework (UCF)  University of Utah, Center for Simulation of Accidental Fires and Explosions (C-SAFE), DOE ASCI Center  UCF analysis  Scheduling  MPI library  Components  Performance mapping  Use for online and offline visualization  ParaVis tools F 500 processes

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Scatterplot Displays (UCF, 500 processes)  Each point coordinate determined by three values: MPI_Reduce MPI_Recv MPI_Waitsome  Min/Max value range  Effective for cluster analysis Relation between MPI_Recv and MPI_Waitsome

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Online Unitah Performance Profiling  Demonstration of profiling sampling capability  Multiple profile samples  Each profile taken at major iteration (~ 60 seconds)  Colliding elastic disks C-SAFE application  Test material point method (MPM) code  Executed on 512 processors ASCI Blue Pacific at LLNL  Example  3D bargraph visualization  MPI execution time  Performance mapping  Multiple time steps

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Online Unitah Performance Profiling

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Miranda Performance Analysis  Miranda is a research hydrodynamics code at LLNL  Fortran 95, MPI  Mostly synchronous  MPI_ALLTOALL on  Np x,y communicators  Some MPI reductions and broadcasts for statistics  Good communications scaling  ACL and MCR Linux cluster  Up to 1728 CPUs  Fixed workload per CPU  Ported to BlueGene/L

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Profiling of Miranda on BG/L (Miller, LLNL) 128 Nodes512 Nodes1024 Nodes  Profile code performance (automatic instrumentation)  Scaling studies (problem size, number of processors)  Run on 8K and 16K processors just two week ago!

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Fine Grained Profiling via Tracing on Miranda  Use TAU to generate VTF3 traces for Vampir analysis  Combines MPI calls with HW counter information  Detailed code behavior to focus optimization efforts

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Max Heap Memory (KB) used for problem on 16 processors of ASC Frost at LLNL Memory Usage Analysis in Miranda on BG/L  BG/L will have limited memory per node (512 MB)  Miranda uses TAU to profile memory usage  Streamlines code  Squeeze larger problems on the machine  TAU’s footprint is small  Approximately 60 bytes per event per thread

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU Performance System Status  Computing platforms (selected)  IBM SP / pSeries, SGI Origin 2K/3K, Cray T3E / SV-1 / X1, HP (Compaq) SC (Tru64), Sun, Hitachi SR8000, NEC SX-5/6, Linux (IA-32/64, Alpha, PPC, PA-RISC, Power, Opteron), Apple (G4/5, OS X), Windows, BG/L  Programming languages  C, C++, Fortran 77/90/95, HPF, Java, OpenMP, Python  Thread libraries  pthreads, SGI sproc, Java,Windows, OpenMP  Compilers (selected)  Intel (KCC, KAP/Pro), PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray, IBM (xlc, xlf), HP, NEC, Absoft

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Concluding Remarks  Complex ASC parallel systems and software pose challenging performance analysis problems that require robust methodologies and tools  To build more sophisticated performance tools, existing proven performance technology must be utilized  Performance tools must be integrated with software and systems models and technology  Performance engineered software  Function consistently and coherently  TAU performance system offers robust performance technology that can be broadly integrated in next- generation scalable software and systems

TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Acknowledgements  Department of Energy (DOE)  MICS office  “Performance Technology for Tera-class Parallel Computer Systems: Evolution of the TAU Performance System”  “Performance Analysis of Parallel Component Software”  NNSA/ASC  University of Utah DOE ASCI Level 1 sub-contract  ASCI Level 3 project (LANL, LLNL, SNL)  Research Centre Juelich  John von Neumann Institute for Computing  Dr. Bernd Mohr  Los Alamos National Laboratory