Allen D. Malony Performance Research Laboratory (PRL) Neuroinformatics Center (NIC) Department of Computer and Information Science University of Oregon The TAU Parallel Performance System
OSDL
Outline
- Research interests and motivation
- TAU performance system
  - Instrumentation
  - Measurement
  - Analysis tools
    - Parallel profile analysis (ParaProf)
    - Performance data management (PerfDMF)
    - Performance data mining (PerfExplorer)
- TAU status
- Open Trace Format (OTF)
- ZeptoOS and KTAU
Research Motivation
- Tools for performance problem solving
  - Empirical-based performance optimization process
  - Performance technology concerns
- Optimization process (diagram): Performance Observation, Performance Experimentation, Performance Diagnosis, Performance Tuning; links between stages labeled characterization, properties, hypotheses
- Performance technology (diagram): instrumentation, measurement, analysis, visualization; experiment management, performance data storage
TAU Performance System
- Tuning and Analysis Utilities (14+ year project effort)
- Performance system framework for HPC systems
  - Integrated, scalable, flexible, and parallel
- Targets a general complex system computation model
  - Entities: nodes / contexts / threads
  - Multi-level: system / software / parallelism
  - Measurement and analysis abstraction
- Integrated toolkit for performance problem solving
  - Instrumentation, measurement, analysis, and visualization
  - Portable performance profiling and tracing facility
  - Performance data management and data mining
- Partners: LLNL, ANL, Research Center Jülich, LANL

TAU Parallel Performance System Goals
- Portable (open source) parallel performance system
  - Computer system architectures and operating systems
  - Different programming languages and compilers
- Multi-level, multi-language performance instrumentation
- Flexible and configurable performance measurement
- Support for multiple parallel programming paradigms
  - Multi-threading, message passing, mixed-mode, hybrid, object oriented (generic), component
- Support for performance mapping
- Integration of leading performance technology
- Scalable (very large) parallel performance analysis
General Complex System Computation Model
- Node: physically distinct shared memory machine
  - Message passing node interconnection network
- Context: distinct virtual memory space within node
- Thread: execution threads (user/system) in context
- Diagram, physical view: SMP nodes, each with node memory, joined by an interconnection network carrying inter-node message communication
- Diagram, model view: a node contains contexts (each with its own VM space), and a context contains threads
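The node / context / thread triple is what identifies one stream of measurements. As a minimal sketch, the C fragment below represents that triple and derives a per-thread profile file name from it. The struct and function names are hypothetical, and the `profile.<node>.<context>.<thread>` naming is stated here as an assumption about how TAU labels its profile files, not taken from these slides.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One measurement location in the node/context/thread model.
 * Hypothetical type for illustration, not a TAU data structure. */
struct exec_location {
    int node;     /* physically distinct shared memory machine */
    int context;  /* virtual memory space within the node */
    int thread;   /* thread of execution within the context */
};

/* Write "profile.<node>.<context>.<thread>" into out.
 * Returns the name length, or -1 if out is too small. */
int profile_filename(struct exec_location loc, char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);
    int w = snprintf(out, out_size, "profile.%d.%d.%d",
                     loc.node, loc.context, loc.thread);
    if (w < 0 || (size_t)w >= out_size)
        return -1;
    return (int)strlen(out);
}
```

For example, thread 1 of context 0 on node 3 maps to the name `profile.3.0.1`.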
TAU Performance System Architecture
TAU Instrumentation Approach
- Support for standard program events
  - Routines, classes and templates
  - Statement-level blocks
- Support for user-defined events
  - Begin/End events (“user-defined timers”)
  - Atomic events (e.g., size of memory allocated/freed)
  - Selection of event statistics
- Support definition of “semantic” entities for mapping
- Support for event groups (aggregation, selection)
- Instrumentation optimization
  - Eliminate instrumentation in lightweight routines

TAU Instrumentation Mechanisms
- Source code
  - Manual (TAU API, TAU component API)
  - Automatic (robust)
    - C, C++, F77/90/95 (Program Database Toolkit (PDT))
    - OpenMP (directive rewriting (Opari), POMP2 spec)
- Object code
  - Pre-instrumented libraries (e.g., MPI using PMPI)
  - Statically-linked and dynamically-linked
- Executable code
  - Dynamic instrumentation (pre-execution) (DynInstAPI)
  - Virtual machine instrumentation (e.g., Java using JVMPI)
- TAU_COMPILER to automate instrumentation process
Multi-Level Instrumentation and Mapping
- Multiple interfaces
- Information sharing
  - Between interfaces
- Event selection
  - Within/between levels
- Mapping
  - Associate performance data with high-level semantic abstractions
- Diagram: instrumentation can be applied at each level, from user-level abstractions and the problem domain, through source code (preprocessor), object code and libraries (compiler, linker), and the executable, to the runtime image (OS, VM); a run produces performance data
Program Database Toolkit (PDT) (diagram)
- Input: Application / Library
- C / C++ parser and Fortran parser (F77/90/95) produce IL
- C / C++ IL analyzer and Fortran IL analyzer produce Program Database Files
- DUCTAPE gives tools access to the program database
  - PDBhtml: program documentation
  - SILOON: application component glue
  - CHASM: C++ / F90/95 interoperability
  - tau_instrumentor: automatic source instrumentation
Program Database Toolkit (PDT)
- Program code analysis framework
  - Develop source-based tools
- High-level interface to source code information
- Integrated toolkit for source code parsing, database creation, and database query
  - Commercial grade front-end parsers
  - Portable IL analyzer, database format, and access API
  - Open software approach for tool development
- Multiple source languages
- Implement automatic performance instrumentation tools
  - tau_instrumentor
TAU Measurement Approach
- Portable and scalable parallel profiling solution
  - Multiple profiling types and options
  - Event selection and control (enabling/disabling, throttling)
  - Online profile access and sampling
  - Online performance profile overhead compensation
- Portable and scalable parallel tracing solution
  - Trace translation to Open Trace Format (OTF)
  - Trace streams and hierarchical trace merging
- Robust timing and hardware performance support
- Multiple counters (hardware, user-defined, system)
- Performance measurement for CCA component software
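Event throttling, listed above under event selection and control, can be pictured as a simple rule: stop measuring an event once it has been called many times while doing very little work per call. The C fragment below is a minimal sketch of such a rule under that assumption. The struct, the function name, and both thresholds are hypothetical and are not TAU's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Running totals for one instrumented event (hypothetical layout). */
struct event_stats {
    long   calls;           /* number of completed calls */
    double inclusive_usec;  /* total inclusive time in microseconds */
    int    enabled;         /* 1 while the event is still measured */
};

/* Disable an event that was called more than max_calls times and
 * averages less than min_usec_per_call microseconds per call.
 * Returns 1 if this check disabled the event, 0 otherwise. */
int throttle_check(struct event_stats *e, long max_calls,
                   double min_usec_per_call)
{
    assert(e != NULL && max_calls > 0 && min_usec_per_call > 0.0);
    if (!e->enabled || e->calls <= max_calls)
        return 0;
    if (e->inclusive_usec / (double)e->calls < min_usec_per_call) {
        e->enabled = 0;
        return 1;
    }
    return 0;
}
```

With assumed thresholds of 100000 calls and 10 microseconds per call, a routine called 200000 times at 2 microseconds per call would be switched off, while one averaging 40 microseconds per call would stay instrumented.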
TAU Measurement Mechanisms
- Parallel profiling
  - Function-level, block-level, statement-level
  - Supports user-defined events and mapping events
  - TAU parallel profile stored (dumped) during execution
  - Support for flat, callgraph/callpath, phase profiling
  - Support for memory profiling
- Tracing
  - All profile-level events
  - Inter-process communication events
  - Inclusion of multiple counter data in traced events

Types of Parallel Performance Profiling
- Flat profiles
  - Metric (e.g., time) spent in an event (callgraph nodes)
  - Exclusive/inclusive, # of calls, child calls
- Callpath profiles (Calldepth profiles)
  - Time spent along a calling path (edges in callgraph)
  - “main => f1 => f2 => MPI_Send” (event name)
  - TAU_CALLPATH_LENGTH environment variable
- Phase profiles
  - Flat profiles under a phase (nested phases are allowed)
  - Default “main” phase
  - Supports static or dynamic (per-iteration) phases
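A callpath event name such as “main => f1 => f2 => MPI_Send” is the call stack joined by arrows, limited to a configured length. The C fragment below is a minimal sketch of that naming scheme. It assumes the limit keeps the innermost routines, which is a guess about the behavior behind TAU_CALLPATH_LENGTH, and `callpath_name` is a hypothetical helper rather than a TAU function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build a callpath event name from a call stack, keeping only the
 * innermost `depth` routines. stack[0] is the outermost routine.
 * Returns the name length, or -1 if out is too small.
 * Hypothetical helper for illustration, not part of TAU. */
int callpath_name(const char *const *stack, int n, int depth,
                  char *out, size_t out_size)
{
    assert(n > 0 && depth > 0 && out_size > 0);
    int first = n > depth ? n - depth : 0;
    size_t used = 0;
    out[0] = '\0';
    for (int i = first; i < n; i++) {
        /* Separator goes before every routine except the first kept one. */
        int w = snprintf(out + used, out_size - used, "%s%s",
                         i > first ? " => " : "", stack[i]);
        if (w < 0 || (size_t)w >= out_size - used)
            return -1;
        used += (size_t)w;
    }
    return (int)strlen(out);
}
```

With the stack main, f1, f2, MPI_Send and a depth of 2, the resulting event name is `f2 => MPI_Send`; a depth of 4 or more yields the full path.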
Performance Analysis and Visualization
- Analysis of parallel profile and trace measurement
- Parallel profile analysis
  - ParaProf: parallel profile analysis and presentation
  - ParaVis: parallel performance visualization package
  - Profile generation from trace data (tau2pprof)
- Performance data management framework (PerfDMF)
- Parallel trace analysis
  - Translation to VTF (V3.0), EPILOG, OTF formats
  - Integration with VNG (Technical University of Dresden)
- Online parallel analysis and visualization
- Integration with CUBE browser (KOJAK, UTK, FZJ)

ParaProf Parallel Performance Profile Analysis
- Diagram labels: HPMToolkit, MpiP, TAU; raw files; PerfDMF managed (database); metadata; Application, Experiment, Trial

Example Applications
- sPPM: ASCI benchmark, Fortran, C, MPI, OpenMP or pthreads
- Miranda: research hydrodynamics code, Fortran, MPI
- GYRO: tokamak turbulence simulation, Fortran, MPI
- FLASH: physics simulation, Fortran, MPI
- WRF: weather research and forecasting, Fortran, MPI
- S3D: 3D combustion, Fortran, MPI
ParaProf – Flat Profile (Miranda, BG/L)
- 8K processors
- Rows labeled by node, context, thread
- Miranda: hydrodynamics, Fortran + MPI, LLNL
- Run to 64K

ParaProf – Stacked View (Miranda)

ParaProf – Callpath Profile (Flash)
- Flash: thermonuclear flashes, Fortran + MPI, Argonne

ParaProf – Histogram View (Miranda)
- 8k processors
- 16k processors

NAS BT – Flat Profile
- How is MPI_Wait() distributed relative to solver direction?
- Application routine names reflect phase semantics

NAS BT – Phase Profile (Main and X, Y, Z)
- Main phase shows nested phases and immediate events

ParaProf – 3D Full Profile (Miranda)
- 16k processors

ParaProf – 3D Full Profile (Flash)
- 128 processors

ParaProf Bar Plot (Zoom in/out +/-)

ParaProf – 3D Scatterplot (Miranda)
- Each point is a “thread” of execution
- A total of four metrics shown in relation
- ParaVis 3D profile visualization library
  - JOGL

ParaProf – Callgraph Zoom (Flash)
- Zoom in (+)
- Zoom out (-)

Performance Tracing on Miranda
- Use TAU to generate VTF3 traces for Vampir analysis
  - MPI calls with HW counter information (not shown)
  - Detailed code behavior to focus optimization efforts

S3D on Lemieux (TAU-to-VTF3, Vampir)
- S3D: 3D combustion, Fortran + MPI, PSC

S3D on Lemieux (Zoomed)
Runtime MPI Shared Library Instrumentation
- We can now interpose the MPI wrapper library for applications that have already been compiled (no recompilation or re-linking necessary!)
  - Uses LD_PRELOAD for Linux
  - Soon on AIX using MPI_EUILIB/MPI_EUILIBPATH
- Simply compile TAU with MPI support and prefix your MPI program with tau_load.sh
    % mpirun -np 4 tau_load.sh a.out
- Requires shared library MPI
- Approach will work with other shared libraries
Workload Characterization
- Idea: partition performance data for individual functions based on runtime parameters
- Enable by configuring with -PROFILEPARAM
- TAU_PROFILE_PARAM1L(value, "name")
- Simple example:
    void foo(int input) {
      TAU_PROFILE("foo", "", TAU_DEFAULT);
      TAU_PROFILE_PARAM1L(input, "input");
      ...
    }
Workload Characterization (continued)
- 5 seconds spent in function “foo” becomes
  - 2 seconds for “foo [ = ]”
  - 1 second for “foo [ = ]”
  - …
- Currently used in MPI wrapper library
  - Allows for partitioning of time spent in MPI routines based on parameters (message size, message tag, destination node)
  - Can be extrapolated to infer specifics about the MPI subsystem and system as a whole
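Partitioning a routine's time by parameter value amounts to keeping one bucket per (routine, parameter, value) combination. The C fragment below is a minimal sketch of that bookkeeping. The types, the function name, the fixed table size, and the `routine [ name = value ]` key format are all assumptions made for illustration and do not reproduce TAU's internal naming.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_BUCKETS 16

/* One bucket per (routine, parameter value) pair. */
struct param_bucket {
    char   name[64];   /* e.g. "foo [ input = 25 ]" (illustrative format) */
    double seconds;    /* time accumulated for this parameter value */
};

struct param_profile {
    struct param_bucket buckets[MAX_BUCKETS];
    int count;
};

/* Add seconds to the bucket for routine/param/value, creating the bucket
 * on first use. Returns the bucket, or NULL if the table is full. */
struct param_bucket *param_profile_add(struct param_profile *p,
                                       const char *routine, const char *param,
                                       long value, double seconds)
{
    char key[64];
    assert(p != NULL && routine != NULL && param != NULL);
    snprintf(key, sizeof key, "%s [ %s = %ld ]", routine, param, value);
    for (int i = 0; i < p->count; i++) {
        if (strcmp(p->buckets[i].name, key) == 0) {
            p->buckets[i].seconds += seconds;
            return &p->buckets[i];
        }
    }
    if (p->count == MAX_BUCKETS)
        return NULL;
    struct param_bucket *b = &p->buckets[p->count++];
    strcpy(b->name, key);
    b->seconds = seconds;
    return b;
}
```

Applied to an MPI wrapper, the value could be a message size, so that time in a send routine is split into one bucket per size.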
Characterization Based on Message Size
- Simple example: send/receive power-of-two message sizes (up to 32 MB)
    #include <mpi.h>

    /* Static storage: 16M ints (64 MB) would overflow a typical stack. */
    static int buffer[16*1024*1024];

    /* Run with at least two processes: rank 0 sends, rank 1 receives. */
    int main(int argc, char **argv) {
      int rank, size, i, j;
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      for (i = 0; i < 1000; i++)
        for (j = 1; j < 16*1024*1024; j *= 2) {
          if (rank == 0) {
            MPI_Send(buffer, j, MPI_INT, 1, 42, MPI_COMM_WORLD);
          } else if (rank == 1) {
            MPI_Status status;
            MPI_Recv(buffer, j, MPI_INT, 0, 42, MPI_COMM_WORLD, &status);
          }
        }
      MPI_Finalize();
      return 0;
    }
Characterization Results
- Two different message sizes (~3.3MB and ~4K)

Important Questions for Application Developers
- How does performance vary with different compilers?
- Is poor performance correlated with certain OS features?
- Has a recent change caused unanticipated performance?
- How does performance vary with MPI variants?
- Why is one application version faster than another?
- What is the reason for the observed scaling behavior?
- Did two runs exhibit similar performance?
- How are performance data related to application events?
- Which machines will run my code the fastest and why?
- Which benchmarks predict my code performance best?
Performance Problem Solving Goals
- Answer questions at multiple levels of interest
  - Data from low-level measurements and simulations
    - Used to predict application performance
  - High-level performance data spanning dimensions
    - Machine, applications, code revisions, data sets
    - Examine broad performance trends
- Discover general correlations between application performance and features of the external environment
- Develop methods to predict application performance from lower-level metrics
- Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system
Performance Data Management (PerfDMF)
- K. Huck, A. Malony, R. Bell, A. Morris, “Design and Implementation of a Parallel Performance Data Management Framework,” ICPP (awarded best paper)
Performance Data Mining (Objectives)
- Conduct parallel performance analysis in a systematic, collaborative and reusable manner
  - Manage performance complexity
  - Discover performance relationships and properties
  - Automate process
- Multi-experiment performance analysis
- Large-scale performance data reduction
  - Summarize characteristics of large processor runs
- Implement extensible analysis framework
  - Abstraction / automation of data mining operations
  - Interface to existing analysis and data mining tools
Performance Data Mining (PerfExplorer)
- Performance knowledge discovery framework
  - Data mining analysis applied to parallel performance data
    - comparative, clustering, correlation, dimension reduction, …
  - Use the existing TAU infrastructure
    - TAU performance profiles, PerfDMF
  - Client-server based system architecture
- Technology integration
  - Java API and toolkit for portability
  - PerfDMF
  - R-project/Omegahat, Octave/Matlab statistical analysis
  - WEKA data mining package
  - JFreeChart for visualization, vector output (EPS, SVG)

Performance Data Mining (PerfExplorer)
- K. Huck and A. Malony, “PerfExplorer: A Performance Data Mining Framework For Large-Scale Parallel Computing,” SC 2005.

PerfExplorer Analysis Methods
- Data summaries, distributions, scatterplots
- Clustering
  - k-means
  - Hierarchical
- Correlation analysis
- Dimension reduction
  - PCA
  - Random linear projection
  - Thresholds
- Comparative analysis
- Data management views
Cluster Analysis
- Performance data represented as vectors: each dimension is the cumulative time for an event
- k-means: k random centers are selected and instances are grouped with the "closest" (Euclidean) center
- New centers are calculated and the process repeated until stabilization or max iterations
- Dimension reduction necessary for meaningful results
- Virtual topology, summaries constructed
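The k-means loop described above can be sketched in C as follows. This is a minimal illustration and not PerfExplorer's implementation, which builds on existing data mining packages. One simplification is assumed: the caller supplies the initial centers instead of the function choosing them at random, which keeps the example deterministic. The function name and argument layout are hypothetical.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* Squared Euclidean distance between two vectors of length dim. */
static double dist2(const double *a, const double *b, int dim)
{
    double s = 0.0;
    for (int d = 0; d < dim; d++) {
        double diff = a[d] - b[d];
        s += diff * diff;
    }
    return s;
}

/* k-means over n row-major vectors of length dim (for example one row per
 * thread, one column per event's cumulative time).
 * centers: k rows; initial centers on entry, final centers on return.
 * label:   receives the cluster index of each row.
 * Returns the number of iterations performed. */
int kmeans(const double *data, int n, int dim,
           double *centers, int k, int *label, int max_iter)
{
    assert(n > 0 && dim > 0 && k > 0 && k <= n && max_iter > 0);
    for (int i = 0; i < n; i++)
        label[i] = -1;

    int iter = 0;
    while (iter < max_iter) {
        int changed = 0;
        iter++;
        /* Assignment step: attach each row to its closest center. */
        for (int i = 0; i < n; i++) {
            int best = 0;
            double best_d = DBL_MAX;
            for (int c = 0; c < k; c++) {
                double d = dist2(data + (size_t)i * dim,
                                 centers + (size_t)c * dim, dim);
                if (d < best_d) { best_d = d; best = c; }
            }
            if (best != label[i]) { label[i] = best; changed = 1; }
        }
        if (!changed)
            break;  /* stabilized: no row changed cluster */
        /* Update step: move each center to the mean of its members. */
        for (int c = 0; c < k; c++) {
            int count = 0;
            for (int i = 0; i < n; i++)
                count += (label[i] == c);
            if (count == 0)
                continue;  /* empty cluster keeps its previous center */
            for (int d = 0; d < dim; d++) {
                double sum = 0.0;
                for (int i = 0; i < n; i++)
                    if (label[i] == c)
                        sum += data[(size_t)i * dim + d];
                centers[(size_t)c * dim + d] = sum / count;
            }
        }
    }
    return iter;
}
```

The cost of each iteration grows with rows times centers times dimensions, which is one practical reason to reduce the number of event dimensions before clustering large runs.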
sPPM Cluster Analysis

Flash Clustering on 16K BG/L Processors
- Four significant events automatically selected
- Clusters and correlations are visible

Correlation Analysis
- Describes strength and direction of a linear relationship between two variables (events) in the data
Comparative Analysis
- Relative speedup, efficiency
  - total runtime, by event, one event, by phase
- Breakdown of total runtime
- Group fraction of total runtime
- Correlating events to total runtime
- Timesteps per second
- Performance Evaluation Research Center (PERC)
  - PERC tools study (led by ORNL, Pat Worley)
  - In-depth performance analysis of select applications
  - Evaluate performance analysis requirements
  - Test tool functionality and ease of use
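Relative speedup and efficiency compare each trial against a base case instead of a single-processor run. The C fragment below is a minimal sketch using the conventional definitions, which are assumed here because the slides do not spell out a formula: speedup is base time divided by trial time, and efficiency scales that speedup by the ratio of processor counts. Function names are hypothetical.

```c
#include <assert.h>

/* Speedup of a trial relative to a base trial: T_base / T. */
double relative_speedup(double base_time, double time)
{
    assert(base_time > 0.0 && time > 0.0);
    return base_time / time;
}

/* Efficiency relative to the base trial:
 * (T_base * P_base) / (T * P). A value of 1.0 means perfect scaling. */
double relative_efficiency(int base_procs, double base_time,
                           int procs, double time)
{
    assert(base_procs > 0 && procs > 0 && base_time > 0.0 && time > 0.0);
    return (base_time * base_procs) / (time * procs);
}
```

For instance, with a 16-processor base case taking 100 seconds and a 64-processor trial taking 40 seconds, the relative speedup is 2.5 and the relative efficiency is 0.625. The same functions apply to a single event or phase by passing that event's time instead of total runtime.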
PerfExplorer Interface
- Select experiments and trials of interest
- Data organized in application, experiment, trial structure (will allow arbitrary in future)
- Experiment metadata

PerfExplorer Interface
- Select analysis
Timesteps per Second (plots for B1-std, B2-cy, B3-gtc)
- Cray X1 is the fastest to solution in all 3 tests
- FFT (nl2) improves time for B3-gtc only
- TeraGrid faster than p690 for B1-std?
- Plots generated automatically

Relative Efficiency (B1-std)
- By experiment (B1-std)
  - Total runtime (Cheetah (red))
- By event for one experiment
  - Coll_tr (blue) is significant
- By experiment for one event
  - Shows how Coll_tr behaves for all experiments
- 16 processor base case
- Plot labels: Cheetah, Coll_tr
PerfExplorer Future Work
- Extensions to PerfExplorer framework
- Examine properties of performance data
- Automated guidance of analysis
- Workflow scripting for repeatable analysis
- Dependency modeling (go beyond correlation)
- Time-series analysis of phase-based data

Open Trace Format (OTF)
- Features
  - Hierarchical trace format
  - Replacement for proprietary formats such as STF (Pallas and Intel)
  - Efficient streams based parallel access
  - Tracing library available on IBM BG/L platform
- Development of OTF supported by LLNL
- Joint development effort
  - ZiH / Technical University of Dresden
  - ParaTools, Inc.

OTF Options

Vampir and VNG
- Commercial trace based tools
  - Developed at ZiH, T.U. Dresden
  - Wolfgang Nagel, Holger Brunst and others…
- Vampir Trace Visualizer
  - Known also as Intel ® Trace Analyzer v4.0
  - Sequential program
- Vampir Next Generation (VNG)
  - Client (vng) runs on a desktop, server (vngd) on a cluster
  - Parallel trace analysis
  - Orders of magnitude bigger traces (more memory)
  - State of the art in parallel trace visualization
Vampir Next Generation (VNG) Architecture
- Diagram labels, data path: Parallel Program, Monitor System, Event Streams, File System, Trace 1 … Trace N, Parallel I/O
- Diagram labels, analysis server: Master, Worker 1 … Worker m, Message Passing
- Diagram labels, client side: Internet, Visualization Client
- Classic analysis: monolithic, sequential, merged traces
- Client screenshot: timeline with 16 visible traces, segment indicator, thumbnail of 768 processes
TAU Tracing Enhancements
- Configure TAU with -TRACE -vtf= -otf= options
    % configure -TRACE -vtf= …
    % configure -TRACE -otf= …
  - Generates tau_merge, tau2vtf, tau2otf tools in / /bin
- Instrument and execute application
    % tau_f90.sh app.f90 -o app
    % mpirun -np 4 app
- Merge and convert trace files to VTF3/OTF format
    % tau_treemerge.pl
    % tau2vtf tau.trc tau.edf app.vpt.gz
    % vampir app.vpt.gz
  OR
    % tau2otf tau.trc tau.edf app.otf -n
    % vampir app.otf
  OR use VNG to analyze OTF/VTF trace files
TAU Eclipse Integration
- Eclipse GUI integration of existing TAU tools
- New Eclipse plug-in for code instrumentation
- Integration with CDT and FDT
- Java, C/C++, and Fortran projects
  - Can be instrumented and run from within Eclipse
  - Each project can be given multiple build configurations corresponding to available TAU makefiles
  - All TAU configuration options are available
  - ParaProf tool can be launched automatically

TAU Eclipse Integration
- Screenshots: TAU configuration, TAU experimentation

TAU Eclipse Future Work
- Development of the TAU Eclipse plugins for Java and the CDT/FDT is ongoing
- Planned features include:
  - Full integration with the Eclipse Parallel Tools project
  - Database storage of project performance data
  - Refinement of the plugin settings interface to allow easier selection of TAU runtime and compile-time options
  - Accessibility of TAU configuration and command-line tools via the Eclipse UI
ZeptoOS and TAU
- DOE OS/RTS for Extreme Scale Scientific Computation
  - OS research for petascale systems
- ZeptoOS project
  - Scalable, adaptive components for petascale architectures
  - Argonne National Laboratory and University of Oregon
- University of Oregon
  - Kernel-level performance monitoring
  - OS component performance assessment and tuning
  - KTAU (Kernel Tuning and Analysis Utilities)
    - Integration of TAU infrastructure in Linux kernel
    - Integration with ZeptoOS (light-weight Linux-based kernel)
    - Installation on BG/L and other platforms (e.g., Cray XT3)
  - Port to 32-bit and 64-bit Linux platforms
Linux Kernel Profiling using TAU – Goals
- Fine-grained kernel-level performance measurement
  - Parallel applications
  - Support both profiling and tracing
  - Both process-centric and system-wide view
- Merge user-space performance with kernel-space
  - User-space: (TAU) profile/trace
  - Kernel-space: (KTAU) profile/trace
- Detailed program-OS interaction data
  - Including interrupts (IRQ)
- Analysis and visualization compatible with TAU

KTAU Architecture

KTAU On BG/L

KTAU Future Work
- Dynamic measurement control
  - Enable/disable events w/o recompilation or reboot
- Add new performance data sources
  - Look into hardware counters
- Improve user-space integration
  - Full callpaths and phase-based profiling
  - Merged user/kernel traces
- Integration with monitoring technology
  - SuperMon, MRNet, TAUg
- New porting efforts
  - IA-64, PPC-64 and AMD Opteron
- System characterization studies
TAU Performance System Status
- Computing platforms
  - IBM, SGI, Cray, HP, Sun, Hitachi, NEC, Linux clusters, Apple, Windows, …
- Programming languages
  - C, C++, Fortran 90/95, UPC, HPF, Java, OpenMP, Python
- Thread libraries
  - pthreads, SGI sproc, Java, Windows, OpenMP
- Communications libraries
  - MPI-1/2, PVM, shmem, …
- Compilers
  - IBM, Intel, PGI, GNU, Fujitsu, Sun, NAG, Microsoft, SGI, Cray, HP, NEC, Absoft, Lahey, PathScale, Open64
Project Affiliations (selected)
- Lawrence Livermore National Lab
  - Hydrodynamics (Miranda), radiation diffusion (KULL)
  - Open Trace Format (OTF) implementation on BG/L
- Argonne National Lab
  - ZeptoOS project and KTAU
  - Astrophysical thermonuclear flashes (Flash)
- Center for Simulation of Accidental Fires and Explosion
  - University of Utah, ASCI ASAP Center, C-SAFE
  - Uintah Computational Framework (UCF)
- Oak Ridge National Lab
  - Contribution to the Joule Report (S3D, AORSA3D)

Project Affiliations (continued)
- Sandia National Lab
  - Simulation of turbulent reactive flows (S3D)
  - Combustion code (CFRFS)
- Los Alamos National Lab
  - Monte Carlo transport (MCNP)
  - SAIC’s Adaptive Grid Eulerian (SAGE)
- CCSM / ESMF / WRF climate/earth/weather simulation
  - NSF, NOAA, DOE, NASA, …
- Common component architecture (CCA) integration
- Performance Evaluation Research Center (PERC)
  - DOE SciDAC center

Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science
    - MICS, Argonne National Lab
  - ASC/NNSA
    - University of Utah ASC/NNSA Level 1
    - ASC/NNSA, Lawrence Livermore National Lab
- Department of Defense (DoD)
  - HPC Modernization Office (HPCMO)
  - Programming Environment and Training (PET)
- NSF Software and Tools for High-End Computing
- Research Centre Juelich
- Los Alamos National Laboratory
- ParaTools

Acknowledgements
- Dr. Sameer Shende, Senior Scientist
- Alan Morris, Senior Software Engineer
- Wyatt Spear, PRL staff
- Scott Biersdorff, PRL staff
- Robert Yelle, PRL staff
- Kevin Huck, Ph.D. student
- Aroon Nataraj, Ph.D. student
- Kai Li, Ph.D. student
- Li Li, Ph.D. student
- Suravee Suthikulpanit, M.S. student