Performance Technology for Productive, High-End Parallel Computing: the TAU Parallel Performance System

Allen D. Malony
Performance Research Laboratory (PRL)
Neuroinformatics Center (NIC)
Department of Computer and Information Science
University of Oregon
HLRS 2006: Performance Technology for Productive, High-End Parallel Computing

Outline
- Research interests and motivation
- TAU performance system
  - Instrumentation
  - Measurement
  - Analysis tools
- Parallel profile analysis (ParaProf)
- Performance data management (PerfDMF)
- Performance data mining (PerfExplorer)
- Open Trace Format (OTF)
- Conclusions and future work
Research Motivation
- Tools for performance problem solving
- Empirical-based performance optimization process
- Performance technology concerns
[Diagram: empirical optimization cycle of performance tuning, diagnosis, experimentation, and observation (hypotheses, properties, characterization), supported by instrumentation, measurement, analysis, visualization, experiment management, performance data storage, performance data mining, and model-based methods]
Challenges in Performance Problem Solving
- How to make the process more effective (productive)?
  - Process is likely to change as parallel systems evolve
- What are the important events and performance metrics?
  - Tied to application structure and computational model
  - Tied to application domain and algorithms
- What are the significant issues that will affect the technology used to support the process?
- Enhance application development and optimization
  - Process and tools can/must be more application-aware
  - Tools have poor support for application-specific aspects
- Integrate performance technology and process

Performance Process, Technology, and Scale
- How does our view of this process change when we consider very large-scale parallel systems?
- Scaling complicates observation and analysis
  - Performance data size: standard approaches deliver a lot of data with little value
  - Measurement overhead and intrusion: tradeoff with analysis accuracy; "noise" in the system
  - Analysis complexity increases
- What will enhance productive application development?
  - Process and technology evolution
  - Nature of application development may change

Role of Intelligence, Automation, and Knowledge
- Scale forces the process to become more intelligent
  - Even with intelligent and application-specific tools, deciding what to analyze is difficult and intractable
- More automation and knowledge-based decision making
  - Build automatic/autonomic capabilities into the tools
  - Support broader experimentation methods and refinement
  - Access and correlate data from several sources
  - Automate performance data analysis / mining / learning
  - Include predictive features and experiment refinement
  - Knowledge-driven adaptation and optimization guidance
- Address scale issues through increased expertise
TAU Performance System
- Tuning and Analysis Utilities (14+ year project effort)
- Performance system framework for HPC systems
  - Integrated, scalable, flexible, and parallel
- Targets a general complex system computation model
  - Entities: nodes / contexts / threads
  - Multi-level: system / software / parallelism
  - Measurement and analysis abstraction
- Integrated toolkit for performance problem solving
  - Instrumentation, measurement, analysis, and visualization
  - Portable performance profiling and tracing facility
  - Performance data management and data mining
- Partners: LLNL, ANL, Research Center Jülich, LANL

TAU Parallel Performance System Goals
- Portable (open source) parallel performance system
  - Computer system architectures and operating systems
  - Different programming languages and compilers
- Multi-level, multi-language performance instrumentation
- Flexible and configurable performance measurement
- Support for multiple parallel programming paradigms
  - Multi-threading, message passing, mixed-mode, hybrid, object-oriented (generic), component-based
- Support for performance mapping
- Integration of leading performance technology
- Scalable (very large) parallel performance analysis
General Complex System Computation Model
- Node: physically distinct shared memory machine
  - Message passing node interconnection network
- Context: distinct virtual memory space within node
- Thread: execution threads (user/system) in context
[Diagram: physical view (SMP nodes with memory, connected by an interconnection network for inter-node message communication) vs. model view (nodes containing contexts containing threads)]
TAU Performance System Architecture
[Architecture diagram]
TAU Instrumentation Approach
- Support for standard program events
  - Routines, classes, and templates
  - Statement-level blocks
- Support for user-defined events
  - Begin/end events ("user-defined timers")
  - Atomic events (e.g., size of memory allocated/freed)
  - Selection of event statistics
- Support definition of "semantic" entities for mapping
- Support for event groups (aggregation, selection)
- Instrumentation optimization
  - Eliminate instrumentation in lightweight routines
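The last bullet can be made concrete: a measurement system can disable (throttle) an event at runtime once it proves both frequent and cheap, because the probe cost then outweighs the information gained. A minimal sketch in C; the thresholds are illustrative assumptions for this example, not TAU's documented defaults:

```c
#include <assert.h>

/* Decide whether to stop instrumenting an event.  An event is throttled
   once it has been called "often enough" AND its average cost per call is
   below a floor, i.e. the routine is lightweight relative to probe cost. */
int should_throttle(long num_calls, double total_usec,
                    long max_calls, double min_usec_per_call) {
    if (num_calls < max_calls) return 0;      /* not enough evidence yet */
    return (total_usec / num_calls) < min_usec_per_call;
}
```

A runtime would consult this check on event exit and replace the event's probes with no-ops once it returns true.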
TAU Instrumentation Mechanisms
- Source code
  - Manual (TAU API, TAU component API)
  - Automatic (robust): C, C++, F77/90/95 (Program Database Toolkit (PDT)); OpenMP (directive rewriting (Opari), POMP2 spec)
- Object code
  - Pre-instrumented libraries (e.g., MPI using PMPI)
  - Statically linked and dynamically linked
- Executable code
  - Dynamic instrumentation (pre-execution) (DynInstAPI)
  - Virtual machine instrumentation (e.g., Java using JVMPI)
- TAU_COMPILER to automate the instrumentation process
Multi-Level Instrumentation and Mapping
- Multiple interfaces
- Information sharing between interfaces
- Event selection within/between levels
- Mapping: associate performance data with high-level semantic abstractions
[Diagram: instrumentation at every level, from user-level abstractions in the problem domain through source code, preprocessor, compiler, linker, libraries, object code, executable, runtime image, VM, and OS, all producing performance data at run time]
Program Database Toolkit (PDT)
[Diagram: C/C++ and Fortran (F77/90/95) parsers process an application or library into an intermediate language (IL); C/C++ and Fortran IL analyzers produce program database (PDB) files, accessed through DUCTAPE by tools such as PDBhtml (program documentation), SILOON (application component glue), CHASM (C++ / F90/95 interoperability), and tau_instrumentor (automatic source instrumentation)]
Program Database Toolkit (PDT)
- Program code analysis framework
  - Develop source-based tools
- High-level interface to source code information
- Integrated toolkit for source code parsing, database creation, and database query
  - Commercial-grade front-end parsers
  - Portable IL analyzer, database format, and access API
  - Open software approach for tool development
  - Multiple source languages
- Implements automatic performance instrumentation tools (tau_instrumentor)
TAU Measurement Approach
- Portable and scalable parallel profiling solution
  - Multiple profiling types and options
  - Event selection and control (enabling/disabling, throttling)
  - Online profile access and sampling
  - Online performance profile overhead compensation
- Portable and scalable parallel tracing solution
  - Trace translation to EPILOG, VTF3, and OTF
  - Trace streams (OTF) and hierarchical trace merging
- Robust timing and hardware performance support
- Multiple counters (hardware, user-defined, system)
- Performance measurement for CCA component software
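Overhead compensation can be sketched as follows: if the cost of one timer start/stop probe pair is calibrated at startup (an assumption of this sketch, e.g. by timing a tight loop of null timer pairs), a routine's measured inclusive time can be corrected by the number of probe pairs executed beneath it:

```c
#include <assert.h>

/* Subtract the estimated cost of the measurement probes themselves from a
   routine's measured inclusive time.  probe_pair_usec is the calibrated
   cost of one start/stop probe pair; descendant_timer_pairs counts the
   instrumented calls made while this routine's timer was running. */
double compensated_time(double measured_usec, long descendant_timer_pairs,
                        double probe_pair_usec) {
    double t = measured_usec - descendant_timer_pairs * probe_pair_usec;
    return t > 0.0 ? t : 0.0;   /* never report negative time */
}
```

Clamping at zero matters at scale: for very lightweight routines the correction can exceed the measurement noise, and a negative time would corrupt downstream analysis.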
TAU Measurement Mechanisms
- Parallel profiling
  - Function-level, block-level, statement-level
  - Supports user-defined events and mapping events
  - TAU parallel profile stored (dumped) during execution
  - Support for flat, callgraph/callpath, and phase profiling
  - Support for memory profiling
- Tracing
  - All profile-level events
  - Inter-process communication events
  - Inclusion of multiple counter data in traced events
Types of Parallel Performance Profiling
- Flat profiles
  - Metric (e.g., time) spent in an event (callgraph nodes)
  - Exclusive/inclusive, number of calls, child calls
- Callpath profiles (calldepth profiles)
  - Time spent along a calling path (edges in callgraph)
  - "main => f1 => f2 => MPI_Send" (event name)
  - TAU_CALLPATH_LENGTH environment variable
- Phase profiles
  - Flat profiles under a phase (nested phases are allowed)
  - Default "main" phase
  - Supports static or dynamic (per-iteration) phases
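The flat-profile statistics above (inclusive vs. exclusive time, call counts) can be illustrated with a small C sketch that replays a stream of timestamped enter/exit events. The record layout and event ids here are invented for the example; a real measurement layer reads the clock at the probe site instead:

```c
#include <assert.h>

#define MAX_EVENTS 64
#define MAX_DEPTH  32

/* Per-event flat-profile statistics. */
typedef struct { double incl, excl; int calls; } Timer;

typedef enum { ENTER, EXIT } Kind;
typedef struct { Kind kind; int id; double t; } Rec;  /* one probe record */

/* Replay ENTER/EXIT records, accumulating inclusive and exclusive time.
   Exclusive time is inclusive time minus time spent in instrumented
   callees; a running-child accumulator per stack level tracks that. */
void flat_profile(const Rec *recs, int n, Timer timers[MAX_EVENTS]) {
    int stack[MAX_DEPTH]; double start[MAX_DEPTH], child[MAX_DEPTH];
    int top = -1;
    for (int i = 0; i < n; i++) {
        if (recs[i].kind == ENTER) {
            top++;
            stack[top] = recs[i].id;
            start[top] = recs[i].t;
            child[top] = 0.0;                 /* callee time seen so far */
        } else {
            double incl = recs[i].t - start[top];
            Timer *tm = &timers[stack[top]];
            tm->incl += incl;
            tm->excl += incl - child[top];    /* exclusive = incl - callees */
            tm->calls++;
            top--;
            if (top >= 0) child[top] += incl; /* credit the parent */
        }
    }
}
```

A callpath profile is the same bookkeeping keyed on the whole stack contents (up to TAU_CALLPATH_LENGTH entries) rather than on the top-of-stack event alone.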
Performance Analysis and Visualization
- Analysis of parallel profile and trace measurement
- Parallel profile analysis
  - ParaProf: parallel profile analysis and presentation
  - ParaVis: parallel performance visualization package
  - Profile generation from trace data (tau2pprof)
- Performance data management framework (PerfDMF)
- Parallel trace analysis
  - Translation to VTF (V3.0), EPILOG, and OTF formats
  - Integration with VNG (Technical University of Dresden)
- Online parallel analysis and visualization
- Integration with CUBE browser (KOJAK, UTK, FZJ)
ParaProf Parallel Performance Profile Analysis
[Diagram: raw profile files (HPMToolkit, mpiP, TAU) or PerfDMF-managed database data, organized with metadata by application, experiment, and trial, feed into ParaProf]
Example Applications
- sPPM: ASCI benchmark; Fortran, C, MPI, OpenMP or pthreads
- Miranda: research hydrodynamics code; Fortran, MPI
- GYRO: tokamak turbulence simulation; Fortran, MPI
- FLASH: physics simulation; Fortran, MPI
- WRF: weather research and forecasting; Fortran, MPI
- S3D: 3D combustion; Fortran, MPI
ParaProf – Flat Profile (Miranda, BG/L)
- 8K processors (node, context, thread)
- Miranda: hydrodynamics, Fortran + MPI, LLNL
- Run to 64K processors

ParaProf – Stacked View (Miranda)

ParaProf – Callpath Profile (Flash)
- Flash: thermonuclear flashes, Fortran + MPI, Argonne

ParaProf – Histogram View (Miranda)
- 8K processors vs. 16K processors
NAS BT – Flat Profile
- How is MPI_Wait() distributed relative to solver direction?
- Application routine names reflect phase semantics

NAS BT – Phase Profile (Main and X, Y, Z)
- Main phase shows nested phases and immediate events

ParaProf – 3D Full Profile (Miranda)
- 16K processors

ParaProf – 3D Full Profile (Flash)
- 128 processors

ParaProf Bar Plot
- Zoom in/out with + and -
ParaProf – 3D Scatterplot (Miranda)
- Each point is a "thread" of execution
- A total of four metrics shown in relation
- ParaVis 3D profile visualization library (JOGL)

ParaProf – Callgraph Zoom (Flash)
- Zoom in (+), zoom out (-)

Performance Tracing on Miranda
- Use TAU to generate VTF3 traces for Vampir analysis
- MPI calls with HW counter information (not shown)
- Detailed code behavior to focus optimization efforts

S3D on Lemieux (TAU-to-VTF3, Vampir)
- S3D: 3D combustion, Fortran + MPI, PSC

S3D on Lemieux (Zoomed)
Runtime MPI Shared Library Instrumentation
- We can now interpose the MPI wrapper library for applications that have already been compiled (no re-compilation or re-linking necessary!)
- Uses LD_PRELOAD for Linux
- Soon on AIX using MPI_EUILIB / MPI_EUILIBPATH
- Simply compile TAU with MPI support and prefix your MPI program with tau_load.sh:
  % mpirun -np 4 tau_load.sh a.out
- Requires a shared-library MPI
Workload Characterization
- Idea: partition performance data for individual functions based on runtime parameters
- Enable by configuring with -PROFILEPARAM
- TAU_PROFILE_PARAM1L(value, "name")
- Simple example:

  void foo(int input) {
    TAU_PROFILE("foo", "", TAU_DEFAULT);
    TAU_PROFILE_PARAM1L(input, "input");
    ...
  }
Workload Characterization
- 5 seconds spent in function "foo" becomes
  - 2 seconds for "foo [ = ]"
  - 1 second for "foo [ = ]"
  - …
- Currently used in the MPI wrapper library
  - Allows partitioning of time spent in MPI routines based on parameters (message size, message tag, destination node)
- Can be extrapolated to infer specifics about the MPI subsystem and system as a whole
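The partitioning described above can be sketched as a lookup keyed on (event name, parameter value), so each distinct value gets its own profile entry. The table layout and the `charge` helper here are hypothetical illustrations, not TAU's implementation:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One profile entry per (event, parameter-value) pair, e.g. time charged
   to MPI_Send is split out by message size. */
#define MAX_PART 128
typedef struct { char name[64]; long param; double time; int calls; } Entry;
static Entry table[MAX_PART];
static int nentries = 0;

/* Charge usec of time to the entry for (event, param), creating it on
   first sight.  Linear search keeps the sketch short; a real system
   would hash. */
Entry *charge(const char *event, long param, double usec) {
    for (int i = 0; i < nentries; i++)
        if (table[i].param == param && strcmp(table[i].name, event) == 0) {
            table[i].time += usec;
            table[i].calls++;
            return &table[i];
        }
    Entry *e = &table[nentries++];
    snprintf(e->name, sizeof e->name, "%s", event);
    e->param = param;
    e->time = usec;
    e->calls = 1;
    return e;
}
```

An MPI wrapper would call `charge("MPI_Send", count * type_size, elapsed)` on each send, yielding per-message-size timing like the slide's "foo [ ... ]" entries.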
Workload Characterization
- Simple example: send/receive messages with power-of-two sizes (0-32 MB)

  #include <mpi.h>

  int main(int argc, char **argv) {
    int rank, size, i, j;
    static int buffer[16*1024*1024];  /* static: too large for the stack */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < 1000; i++) {
      for (j = 1; j < 16*1024*1024; j *= 2) {
        if (rank == 0) {
          MPI_Send(buffer, j, MPI_INT, 1, 42, MPI_COMM_WORLD);
        } else {
          MPI_Status status;
          MPI_Recv(buffer, j, MPI_INT, 0, 42, MPI_COMM_WORLD, &status);
        }
      }
    }
    MPI_Finalize();
    return 0;
  }
Workload Characterization
- Two different message sizes (~3.3 MB and ~4 KB)
Hypothetical Mapping Example
- Particles distributed on surfaces of a cube

  Particle* P[MAX];  /* Array of particles */

  int GenerateParticles() {
    /* distribute particles over all faces of the cube */
    for (int face = 0, last = 0; face < 6; face++) {
      /* particles on this face */
      int particles_on_this_face = num(face);
      for (int i = last; i < last + particles_on_this_face; i++) {
        /* particle properties are a function of face */
        P[i] = ... f(face); ...
      }
      last += particles_on_this_face;
    }
  }
Hypothetical Mapping Example (continued)
- How much time (flops) is spent processing face i particles?
- What is the distribution of performance among faces?
- How is this determined if execution is parallel?

  int ProcessParticle(Particle *p) {
    /* perform some computation on p */
  }

  int main() {
    GenerateParticles();            /* create a list of particles */
    for (int i = 0; i < N; i++)     /* iterates over the list */
      ProcessParticle(P[i]);
  }
No Performance Mapping versus Mapping
- Typical performance tools report performance with respect to routines
  - Does not provide support for mapping
- TAU's performance mapping can observe performance with respect to the scientist's programming and problem abstractions
- [Screenshots: TAU without mapping vs. TAU with mapping]
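The mapping idea can be sketched in C: time measured inside ProcessParticle() is charged to a per-face timer established when the particle was generated, so the profile answers "how much time per face?" instead of "how much time in ProcessParticle()?". The names below are illustrative, not TAU's mapping API:

```c
#include <assert.h>

#define FACES 6

/* The external association created in GenerateParticles(): each particle
   carries the semantic entity (its cube face) it belongs to. */
typedef struct { int face; } Particle;

static double face_time[FACES];   /* one "semantic" timer per face */

/* Attribute the measured cost of processing one particle to the timer of
   the face it came from, rather than to the routine itself. */
void process_particle(const Particle *p, double measured_usec) {
    face_time[p->face] += measured_usec;
}
```

In a parallel run, each thread keeps its own face_time array and the per-face distributions are merged at analysis time, which is how the "distribution of performance among faces" question from the previous slide gets answered.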
Component-Based Scientific Applications
- How to support the performance analysis and tuning process consistent with the application development methodology?
- Common Component Architecture (CCA) applications
- Performance tools should integrate with the software
- Design a performance observation component
  - Measurement port and measurement interfaces
- Build support for application component instrumentation
  - Interpose a proxy component for each port
  - Inside the proxy, track caller/callee invocations and timings
- Automate the process of proxy component creation
  - Using PDT for static analysis of components
  - Include support for selective instrumentation

Flame Reaction-Diffusion (Sandia, CCAFFEINE)

Earth System Modeling Framework (ESMF)
- Coupled modeling with a modular software framework
- Instrumentation for the ESMF framework and applications
  - PDT automatic instrumentation: Fortran 95 code modules, C/C++ code modules
  - MPI wrapper library for MPI calls
  - ESMF component instrumentation (using CCA)
  - CCA measurement port manual instrumentation
  - Proxy generation using PDT and runtime interposition
- Significant callpath profiling used by ESMF team

Using TAU Component in ESMF/CCA
Important Questions for Application Developers
- How does performance vary with different compilers?
- Is poor performance correlated with certain OS features?
- Has a recent change caused unanticipated performance?
- How does performance vary with MPI variants?
- Why is one application version faster than another?
- What is the reason for the observed scaling behavior?
- Did two runs exhibit similar performance?
- How are performance data related to application events?
- Which machines will run my code the fastest, and why?
- Which benchmarks predict my code's performance best?

Performance Problem Solving Goals
- Answer questions at multiple levels of interest
- Data from low-level measurements and simulations: use to predict application performance
- High-level performance data spanning dimensions (machine, applications, code revisions, data sets): examine broad performance trends
- Discover general correlations between application performance and features of the external environment
- Develop methods to predict application performance from lower-level metrics
- Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system
Automatic Performance Analysis Tool (Concept)
[Diagram: building the application (build information) and executing it (environment / performance data) populate a performance database; offline analysis produces simple analysis feedback such as "105% faster!" and "72% faster!"]
Performance Data Management (PerfDMF)
- K. Huck, A. Malony, R. Bell, and A. Morris, "Design and Implementation of a Parallel Performance Data Management Framework," ICPP (awarded best paper)
Performance Data Mining (Objectives)
- Conduct parallel performance analysis in a systematic, collaborative, and reusable manner
  - Manage performance complexity
  - Discover performance relationships and properties
  - Automate the process
- Multi-experiment performance analysis
- Large-scale performance data reduction
  - Summarize characteristics of large processor runs
- Implement an extensible analysis framework
  - Abstraction / automation of data mining operations
  - Interface to existing analysis and data mining tools
Performance Data Mining (PerfExplorer)
- Performance knowledge discovery framework
  - Data mining analysis applied to parallel performance data: comparative, clustering, correlation, dimension reduction, …
  - Uses the existing TAU infrastructure: TAU performance profiles, PerfDMF
  - Client-server based system architecture
- Technology integration
  - Java API and toolkit for portability
  - PerfDMF
  - R-project / Omegahat, Octave / Matlab statistical analysis
  - WEKA data mining package
  - JFreeChart for visualization, vector output (EPS, SVG)

Performance Data Mining (PerfExplorer)
- K. Huck and A. Malony, "PerfExplorer: A Performance Data Mining Framework for Large-Scale Parallel Computing," SC 2005.

PerfExplorer Analysis Methods
- Data summaries, distributions, scatterplots
- Clustering: k-means, hierarchical
- Correlation analysis
- Dimension reduction: PCA, random linear projection, thresholds
- Comparative analysis
- Data management views
Cluster Analysis
- Performance data represented as vectors: each dimension is the cumulative time for an event
- k-means: k random centers are selected, and instances are grouped with the "closest" (Euclidean) center
- New centers are calculated and the process repeated until stabilization or the maximum number of iterations
- Dimension reduction necessary for meaningful results
- Virtual topology and summaries constructed
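The procedure above can be sketched in C. Each vector is one thread of execution with cumulative time per event as its components; for determinism this sketch seeds centers from the first k vectors rather than at random, and fixes the dimension and a small maximum k at compile time:

```c
#include <assert.h>
#include <string.h>

#define DIM   2    /* events per vector (after dimension reduction) */
#define MAX_K 8    /* sketch-only cap on the number of clusters */

/* Squared Euclidean distance between two performance vectors. */
static double dist2(double *a, double *b) {
    double s = 0.0;
    for (int d = 0; d < DIM; d++) s += (a[d] - b[d]) * (a[d] - b[d]);
    return s;
}

/* k-means: assign each vector to its nearest center, recompute centers
   as cluster means, repeat for max_iter rounds (or until stable). */
void kmeans(double (*x)[DIM], int n, int k, int max_iter,
            double (*center)[DIM], int *label) {
    for (int c = 0; c < k; c++) memcpy(center[c], x[c], sizeof center[c]);
    for (int it = 0; it < max_iter; it++) {
        /* assignment step */
        for (int i = 0; i < n; i++) {
            int best = 0;
            for (int c = 1; c < k; c++)
                if (dist2(x[i], center[c]) < dist2(x[i], center[best]))
                    best = c;
            label[i] = best;
        }
        /* update step: move each center to the mean of its members */
        double sum[MAX_K][DIM]; int cnt[MAX_K];
        memset(sum, 0, sizeof sum);
        memset(cnt, 0, sizeof cnt);
        for (int i = 0; i < n; i++) {
            cnt[label[i]]++;
            for (int d = 0; d < DIM; d++) sum[label[i]][d] += x[i][d];
        }
        for (int c = 0; c < k; c++)
            if (cnt[c])
                for (int d = 0; d < DIM; d++)
                    center[c][d] = sum[c][d] / cnt[c];
    }
}
```

With thousands of events per thread the raw vectors are too high-dimensional for Euclidean distance to separate anything, which is why the slide insists on dimension reduction before clustering.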
sPPM Cluster Analysis

Hierarchical and K-means Clustering (sPPM)
Miranda Clusters, Average Values (16K CPUs)
- Two primary clusters due to MPI_Alltoall behavior
- Also an inverse relationship between MPI_Barrier and MPI_Group_translate_ranks

Miranda Modified
- After code modifications, work distribution is even
- MPI_Barrier and MPI_Group_translate_ranks are no longer significant contributors to run time

Flash Clustering on 16K BG/L Processors
- Four significant events automatically selected
- Clusters and correlations are visible
Correlation Analysis
- Describes the strength and direction of a linear relationship between two variables (events) in the data
Comparative Analysis
- Relative speedup, efficiency
  - Total runtime, by event, one event, by phase
- Breakdown of total runtime
- Group fraction of total runtime
- Correlating events to total runtime
- Timesteps per second
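The relative speedup and efficiency metrics listed above are computed directly from trial runtimes, with the baseline typically being the smallest processor count in the study; the same two formulas apply whether the input time is total runtime, one event's time, or one phase's time:

```c
#include <assert.h>

/* Relative speedup of a trial against a baseline trial. */
double rel_speedup(double base_time, double time) {
    return base_time / time;
}

/* Parallel efficiency: speedup normalized by the growth in processor
   count relative to the baseline run (1.0 = perfect scaling). */
double efficiency(double base_time, int base_procs, double time, int procs) {
    return (base_time / time) * ((double)base_procs / procs);
}
```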
User-Defined Views
- Reorganization of data for multiple parametric studies
- Construction of views / sub-views with simple operators
- Simple "wizard"-like interface for creating views
- Example view hierarchies: Application > Processors > Problem size; Application > Problem type > Processors

PerfExplorer Future Work
- Extensions to the PerfExplorer framework
  - Examine properties of performance data
  - Automated guidance of analysis
  - Workflow scripting for repeatable analysis
  - Dependency modeling (go beyond correlation)
  - Time-series analysis of phase-based data
Open Trace Format (OTF) Features
- Hierarchical trace format
- Replacement for proprietary formats such as STF (Pallas / Intel)
- Efficient stream-based parallel access
- Tracing library available on the IBM BG/L platform
- Development of OTF supported by LLNL
- Joint development effort: ZIH / Technical University of Dresden and ParaTools, Inc.

Open Trace Format (OTF)
Vampir and VNG
- Commercial trace-based tools
- Developed at ZIH, TU Dresden (Wolfgang Nagel, Holger Brunst, and others)
- Vampir Trace Visualizer
  - Formerly also known as Intel® Trace Analyzer v4.0
  - Based on sequential trace analysis
- Vampir Next Generation (VNG)
  - Client (vng) runs on a desktop; server (vngd) on a cluster
  - Parallel trace analysis
  - Orders of magnitude bigger traces (more memory)
  - State of the art in parallel trace visualization
Vampir Next Generation (VNG) Architecture
[Diagram: classic monolithic sequential analysis vs. VNG: a parallel analysis server (master plus workers 1..m) reads merged traces (trace 1..N) from the file system and serves a remote visualization client over the Internet; a monitor system produces event streams from the parallel program; the client shows a 768-process thumbnail, a segment indicator, and a timeline with 16 visible traces covering process, parallel I/O, and message-passing activity]
VNG Timeline Display (Miranda on BG/L)

VNG Timeline Zoomed In

VNG Grouping of Interprocess Communications

VNG Process Timeline with PAPI Counters

VNG Calltree Display

OTF/VNG Support for Counters
TAU Tracing Enhancements
- Configure TAU with the -TRACE and -vtf= / -otf= options:
  % configure -TRACE -vtf= …
  % configure -TRACE -otf= …
- Generates the tau_merge, tau2vtf, and tau2otf tools in the TAU bin directory
- Instrument and execute the application:
  % tau_f90.sh app.f90 -o app
  % mpirun -np 4 app
- Merge and convert trace files to VTF3/OTF format:
  % tau_treemerge.pl
  % tau2vtf tau.trc tau.edf app.vpt.gz
  % vampir app.vpt.gz
  or
  % tau2otf tau.trc tau.edf app.otf -n …
  % vampir app.otf
  or use VNG to analyze OTF/VTF trace files
VNG Communication Matrix Display

VNG Process Activity Chart
TAU Eclipse Integration
- Eclipse GUI integration of existing TAU tools
- New Eclipse plug-in for code instrumentation
- Integration with CDT and FDT
- Java, C/C++, and Fortran projects can be instrumented and run from within Eclipse
- Each project can be given multiple build configurations corresponding to available TAU makefiles
  - All TAU configuration options are available
- ParaProf tool can be launched automatically

TAU Eclipse Integration
- TAU configuration
- TAU experimentation

TAU Eclipse Future Work
- Development of the TAU Eclipse plug-ins for Java and the CDT/FDT is ongoing
- Planned features include:
  - Full integration with the Eclipse Parallel Tools project
  - Database storage of project performance data
  - Refinement of the plug-in settings interface to allow easier selection of TAU runtime and compile-time options
  - Accessibility of TAU configuration and command-line tools via the Eclipse UI
ZeptoOS and TAU
- DOE OS/RTS for Extreme Scale Scientific Computation
  - ZeptoOS: scalable components for petascale architectures
  - Argonne National Laboratory and University of Oregon
- University of Oregon
  - Kernel-level performance monitoring
  - OS component performance assessment and tuning
  - KTAU (Kernel Tuning and Analysis Utilities)
    - Integration of TAU infrastructure in the Linux kernel
    - Integration with ZeptoOS, installation on BG/L
    - Port to 32-bit and 64-bit Linux platforms

Linux Kernel Profiling using TAU – Goals
- Fine-grained kernel-level performance measurement
  - Parallel applications
  - Support both profiling and tracing
  - Both process-centric and system-wide views
- Merge user-space performance with kernel-space
  - User-space: TAU profile/trace
  - Kernel-space: KTAU profile/trace
- Detailed program-OS interaction data
  - Including interrupts (IRQ)
- Analysis and visualization compatible with TAU
TAU Performance System Status
- Computing platforms: IBM, SGI, Cray, HP, Sun, Hitachi, NEC, Linux clusters, Apple, Windows, …
- Programming languages: C, C++, Fortran 90/95, UPC, HPF, Java, OpenMP, Python
- Thread libraries: pthreads, SGI sproc, Java, Windows, OpenMP
- Communication libraries: MPI-1/2, PVM, shmem, …
- Compilers: IBM, Intel, PGI, GNU, Fujitsu, Sun, NAG, Microsoft, SGI, Cray, HP, NEC, Absoft, Lahey, PathScale, Open64
Papers at European Conferences 2006
- L. Li and A. Malony, "Model-based Performance Diagnosis for Master-Worker Parallel Computations," EuroPar
- A. Nataraj, A. Malony, A. Morris, and S. Shende, "Early Experiences with KTAU on the IBM BG/L," EuroPar
- L. Li and A. Malony, "Model-based Performance Diagnosis of Wavefront Parallel Computations," HPCC
- W. Spear, A. Malony, A. Morris, and S. Shende, "Integrating TAU with Eclipse: A Performance Analysis System in an Integrated Development Environment," HPCC
- K. Huck, A. Malony, S. Shende, and A. Morris, "TAUg: Runtime Global Performance Data Access using MPI," EuroPVM/MPI
- C. Hoge, A. Malony, and D. Keith, "Client-side Task Support in Matlab for Concurrent Distributed Execution," DAPSYS
- A. Nataraj, A. Malony, S. Shende, and A. Morris, "Kernel-level Measurement for Integrated Performance Views: the KTAU Project," Cluster 2006 (distinguished paper)
Project Affiliations (selected)
- Lawrence Livermore National Lab
  - Hydrodynamics (Miranda), radiation diffusion (KULL)
  - Open Trace Format (OTF) implementation on BG/L
- Argonne National Lab
  - ZeptoOS project and KTAU
  - Astrophysical thermonuclear flashes (Flash)
- Center for Simulation of Accidental Fires and Explosions (C-SAFE)
  - University of Utah, ASCI ASAP Center
  - Uintah Computational Framework (UCF)
- Oak Ridge National Lab
  - Contribution to the Joule Report (S3D, AORSA3D)

Project Affiliations (continued)
- Sandia National Lab
  - Simulation of turbulent reactive flows (S3D)
  - Combustion code (CFRFS)
- Los Alamos National Lab
  - Monte Carlo transport (MCNP)
  - SAIC's Adaptive Grid Eulerian (SAGE)
- CCSM / ESMF / WRF climate/earth/weather simulation
  - NSF, NOAA, DOE, NASA, …
  - Common Component Architecture (CCA) integration
- Performance Evaluation Research Center (PERC)
  - DOE SciDAC center
Support Acknowledgements
- Department of Energy (DOE), Office of Science
  - MICS, Argonne National Lab
- ASC/NNSA
  - University of Utah ASC/NNSA Level 1 center
  - ASC/NNSA, Lawrence Livermore National Lab
- Department of Defense (DoD), HPC Modernization Office (HPCMO)
  - Programming Environment and Training (PET)
- NSF Software and Tools for High-End Computing
- Research Centre Jülich
- Los Alamos National Laboratory
- ParaTools

Acknowledgements
- Dr. Sameer Shende, Senior Scientist
- Alan Morris, Senior Software Engineer
- Wyatt Spear, PRL staff
- Scott Biersdorff, PRL staff
- Kevin Huck, Ph.D. student
- Aroon Nataraj, Ph.D. student
- Kai Li, Ph.D. student
- Li Li, Ph.D. student
- Adnan Salman, Ph.D. student
- Suravee Suthikulpanit, M.S. student