Allen D. Malony, Sameer S. Shende, Alan Morris, Robert Bell, Kevin Huck, Nick Trebon, Suravee Suthikulpanit, Kai Li, Li Li Department of Computer and Information Science Performance Research Laboratory University of Oregon TAU Parallel Performance System
DOE NNSA/ASC Presentation, SC20042 Outline Motivation TAU architecture and toolkit Instrumentation Measurement Analysis Example DOE ASC applications TAU status Conclusion
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC20043 Problem Domain ASC defines leading edge parallel systems and software Large-scale systems and heterogenous platforms Multi-model, multi-module simulation Complex, multi-layered software integration Multi-language programming Mixed-model parallelism and hybrid parallelism Complexity challenges performance analysis tools System diversity requires portable tools Need for cross-language support Support different parallel computation models Operate at scale
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC20044 Research Motivation Tools for parallel performance problem solving Empirical-based performance optimization process Performance technology concerns characterization Performance Tuning Performance Diagnosis Performance Experimentation Performance Observation hypotheses properties Instrumentation Measurement Analysis Visualization Performance Technology Experiment management Performance database Performance Technology
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC20045 TAU Performance System Tuning and Analysis Utilities (12+ year project effort) Performance system framework for HPC systems Integrated, scalable, flexible, and parallel Targets a general complex system computation model Entities: nodes / contexts / threads Multi-level: system / software / parallelism Measurement and analysis abstraction Integrated toolkit for performance instrumentation, measurement, analysis, and visualization Portable performance profiling and tracing facility Open software approach with technology integration University of Oregon, Research Center Jülich, LANL
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC20046 TAU Performance System Objectives Multi-level performance instrumentation Multi-language automatic source instrumentation Flexible and configurable performance measurement Widely-ported parallel performance profiling system Computer system architectures and operating systems Different programming languages and compilers Support for multiple parallel programming paradigms Multi-threading, message passing, mixed-mode, hybrid Enable performance mapping across semantic levels Support for object-oriented and generic programming Integration in complex software systems and applications
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC20047 memory Node VM space Context SMP Threads node memory … … Interconnection Network Inter-node message communication * * physical view model view General Complex System Computation Model Node: physically distinct shared memory machine Message passing node interconnection network Context: distinct virtual memory space within node Thread: execution threads (user/system) in context
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC20048 TAU Performance System Architecture
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC20049 TAU Performance System Architecture
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU Instrumentation Approach Support for standard program events Routines and statement-level blocks Classes and templates Based on begin/end (paired) event semantics Support for user-defined events “User-defined timers” (begin/end semantics) Atomic events (standard and user-defined) Selection of event statistics Support definition of “semantic” entities for mapping Support for event groups Instrumentation optimization
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU Instrumentation Mechanisms Flexible instrumentation mechanisms at multiple levels Source code manual automatic C, C++, F77/90/95 (Program Database Toolkit (PDT)) OpenMP (directive rewriting (Opari), POMP spec) Object code pre-instrumented libraries (e.g., MPI using PMPI) statically-linked and dynamically-linked Executable code dynamic instrumentation (pre-execution) (DyninstAPI) virtual machine instrumentation (e.g., Java using JVMPI) TAU_COMPILER to automate instrumentation process
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Multi-Level Instrumentation and Mapping Multiple interfaces Simultaneously active Information sharing between interfaces Selective instrumentation Within/between levels Associate performance data with high-level semantic abstractions User-level abstractions problem domain source code object codelibraries instrumentation executable runtime image compiler linkerOS VM instrumentation performance data run preprocessor
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU Source Instrumentation Automatic source instrumentation (tau_instrumentor) Routine entry/exit and class method entry/exit Block entry/exit and statement level (to be added) Uses an instrumentation specification file Include/exclude list for events and files Uses command line options for group selection Instrumentation event selection (tau_select) Automatic generation of instrumentation specification file Instrumentation language to describe event constraints Event identity and location Event performance properties (e.g., overhead analysis) Create tau_select scripts for performance experiments
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Program Database Toolkit (PDT) Program code analysis framework Use to develop source-based tools High-level interface to source code information Integrated toolkit for source code parsing, database creation, and database query Commercial grade front-end parsers Portable IL analyzer, database format, and access API Open software approach for tool development Multiple source languages Implement automatic performance instrumentation tools tau_instrumentor
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Program Database Toolkit (PDT) Application / Library C / C++ parser Fortran parser F77/90/95 C / C++ IL analyzer Fortran IL analyzer Program Database Files IL DUCTAPE PDBhtml SILOON CHASM TAU_instr Program documentation Application component glue C++ / F90/95 interoperability Automatic source instrumentation
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC PDT Status Cleanscape Flint parser fully integrated for F90/95 Flint parser is very robust Produces PDB records for TAU instrumentation Linux x86, HP Tru64, IBM AIX Tested on SAGE, POP, ESMF, PET benchmarking codes C++ and Fortran statement-level information for/while loops, declarations, initialization, assignment,… PDB records defined for most constructs PDT applications CHASM: C++ / Fortran 90/95 interoperability CCA: proxy generation, component instrumentation
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU Performance Measurement TAU supports profiling and tracing measurement Robust timing and hardware performance support Online profile access and sampling Extension of TAU measurement for multiple counters User-defined TAU counters and system-level metrics Integration with trace measurement Support for memory and callpath profiling Fully portable parallel performance tracing solution Hierarchical trace merging and trace translation Component software monitoring Online performance profile overhead compensation
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU Measurement Mechanisms Performance data sources High-resolution timer library (real-time / virtual clocks) General software counter library (user-defined events) Hardware performance counters PCL (Performance Counter Library) (ZAM, Germany) PAPI (Performance API) (UTK, Ptools Consortium) consistent, abstract, portable API Organization Node, context, thread levels Profile groups for collective events (runtime selective) Performance data mapping between software levels
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU Measurement Mechanisms (continued) Parallel profiling Function-level, block-level, statement-level Supports user-defined events TAU parallel profile data stored during execution Hardware counts values Support for multiple counters Support for callgraph and callpath profiling Tracing All profile-level events Inter-process communication events Inclusion of counter data in traced events
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Performance Analysis and Visualization Analysis of parallel profile and trace measurement Parallel profile analysis ParaProf: parallel profile analysis and presentation ParaVis: parallel performance data visualization (proto) Profile generation from trace data Performance data management framework (PerfDMF) Parallel trace analysis Translation to VTF (V3.0) and EPILOG formats Integration with VNG (Technical University of Dresden) Online parallel analysis and visualization Integration with CUBE browser (UTK, FZJ)
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC ParaProf Framework Architecture Portable, extensible, and scalable tool for profile analysis Try to offer “best of breed” capabilities to analysts Build as profile analysis framework for extensibility
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU PerfDMF Architecture
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Selected Applications of TAU Center for Simulation of Accidental Fires and Explosion University of Utah, ASCI ASAP Center, C-SAFE Uintah Computational Framework (UCF) (C++) Center for Simulation of Dynamic Response of Materials California Institute of Technology, ASCI ASAP Center Virtual Testshock Facility (VTF) (Python, Fortran 90) Earth Systems Modeling Framework (ESMF) NSF, NOAA, DOE, NASA, … Instrumentation for ESMF framework and applications C, C++, and Fortran 95 code modules MPI wrapper library for MPI calls
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Selected Applications of TAU (continued) Lawrence Livermore National Lab Hydrodynamics (Miranda) Radiation diffusion (KULL) C++ automatic instrumentation, callpath profiling Sandia National Lab DOE CCTTSS SciDAC project Common component architecture (CCA) integration Combustion code (C++, Fortran 90, GrACE, MPI) Los Alamos National Lab Monte Carlo transport (MCNP) (Susan Post) ASCI Q validation and scaling SAIC’s Adaptive Grid Eulerian (SAGE) (Jack Horner) Fortran 90 automatic instrumentation testcase
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Component-Based Scientific Applications How to support performance analysis and tuning process consistent with application development methodology? Common Component Architecture (CCA) applications Performance tools should integrate with software Design performance observation component Measurement port and measurement interfaces Build support for application component instrumentation Interpose a proxy component for each port Inside proxy, track caller/callee invocations and timings Automate the process of proxy component creation using PDT for static analysis of components include support for selective instrumentation
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Flame Reaction-Diffusion (Sandia, J. Ray) CCAFFEINE
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Earth Systems Modeling Framework Coupled modeling with modular software framework Instrumentation for ESMF framework and applications PDT automatic instrumentation Fortran 95 code modules C / C++ code modules MPI wrapper library for MPI calls ESMF Component instrumentation (using CCA) CCA measurement port manual instrumentation Proxy generation using PDT and runtime interposition Significant use of callpath profiling by ESMF team
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU’s Paraprof Profile Browser (ESMF Data) Callpath profile Global profile
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC CUBE Browser (UTK, FZJ) (ESMF Data) metric calltree location TAU callpath profile data converted to CUBE form
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU Traces with Hardware Counters (ESMF)
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU Traces with User-Defined Counters
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Uintah Computational Framework (UCF) University of Utah, Center for Simulation of Accidental Fires and Explosions (C-SAFE), DOE ASCI Center UCF analysis Scheduling MPI library Components Performance mapping Use for online and offline visualization ParaVis tools F 500 processes
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Scatterplot Displays (UCF, 500 processes) Each point coordinate determined by three values: MPI_Reduce MPI_Recv MPI_Waitsome Min/Max value range Effective for cluster analysis Relation between MPI_Recv and MPI_Waitsome
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Online Unitah Performance Profiling Demonstration of profiling sampling capability Multiple profile samples Each profile taken at major iteration (~ 60 seconds) Colliding elastic disks C-SAFE application Test material point method (MPM) code Executed on 512 processors ASCI Blue Pacific at LLNL Example 3D bargraph visualization MPI execution time Performance mapping Multiple time steps
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Online Unitah Performance Profiling
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Miranda Performance Analysis Miranda is a research hydrodynamics code at LLNL Fortran 95, MPI Mostly synchronous MPI_ALLTOALL on Np x,y communicators Some MPI reductions and broadcasts for statistics Good communications scaling ACL and MCR Linux cluster Up to 1728 CPUs Fixed workload per CPU Ported to BlueGene/L
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Profiling of Miranda on BG/L (Miller, LLNL) 128 Nodes512 Nodes1024 Nodes Profile code performance (automatic instrumentation) Scaling studies (problem size, number of processors) Run on 8K and 16K processors just two week ago!
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Fine Grained Profiling via Tracing on Miranda Use TAU to generate VTF3 traces for Vampir analysis Combines MPI calls with HW counter information Detailed code behavior to focus optimization efforts
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Max Heap Memory (KB) used for problem on 16 processors of ASC Frost at LLNL Memory Usage Analysis in Miranda on BG/L BG/L will have limited memory per node (512 MB) Miranda uses TAU to profile memory usage Streamlines code Squeeze larger problems on the machine TAU’s footprint is small Approximately 60 bytes per event per thread
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC TAU Performance System Status Computing platforms (selected) IBM SP / pSeries, SGI Origin 2K/3K, Cray T3E / SV-1 / X1, HP (Compaq) SC (Tru64), Sun, Hitachi SR8000, NEC SX-5/6, Linux (IA-32/64, Alpha, PPC, PA-RISC, Power, Opteron), Apple (G4/5, OS X), Windows, BG/L Programming languages C, C++, Fortran 77/90/95, HPF, Java, OpenMP, Python Thread libraries pthreads, SGI sproc, Java,Windows, OpenMP Compilers (selected) Intel (KCC, KAP/Pro), PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray, IBM (xlc, xlf), HP, NEC, Absoft
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Concluding Remarks Complex ASC parallel systems and software pose challenging performance analysis problems that require robust methodologies and tools To build more sophisticated performance tools, existing proven performance technology must be utilized Performance tools must be integrated with software and systems models and technology Performance engineered software Function consistently and coherently TAU performance system offers robust performance technology that can be broadly integrated in next- generation scalable software and systems
TAU Parallel Performance SystemDOE NNSA/ASC Presentation, SC Acknowledgements Department of Energy (DOE) MICS office “Performance Technology for Tera-class Parallel Computer Systems: Evolution of the TAU Performance System” “Performance Analysis of Parallel Component Software” NNSA/ASC University of Utah DOE ASCI Level 1 sub-contract ASCI Level 3 project (LANL, LLNL, SNL) Research Centre Juelich John von Neumann Institute for Computing Dr. Bernd Mohr Los Alamos National Laboratory