Performance Technology for Productive, High-End Parallel Computing: the TAU Parallel Performance System
Allen D. Malony
Performance Research Laboratory (PRL), Neuroinformatics Center (NIC)
Department of Computer and Information Science, University of Oregon

HLRS 2006: Performance Technology for Productive, High-End Parallel Computing

Outline
- Research interests and motivation
- TAU performance system
  - Instrumentation
  - Measurement
- Analysis tools
  - Parallel profile analysis (ParaProf)
  - Performance data management (PerfDMF)
  - Performance data mining (PerfExplorer)
- Open Trace Format (OTF)
- Conclusions and future work

Research Motivation
- Tools for performance problem solving
- Empirical-based performance optimization process
- Performance technology concerns
[Diagram: cycle of performance tuning, diagnosis, experimentation, and observation (hypotheses, properties), characterized by instrumentation, measurement, analysis, visualization, experiment management, performance data storage, and model-based performance data mining]

Challenges in Performance Problem Solving
- How to make the process more effective (productive)?
  - Process likely to change as parallel systems evolve
- What are the important events and performance metrics?
  - Tied to application structure and computational model
  - Tied to application domain and algorithms
- What are the significant issues that will affect the technology used to support the process?
  - Enhance application development and optimization
  - Process and tools can/must be more application-aware
  - Tools have poor support for application-specific aspects
  - Integrate performance technology and process

Performance Process, Technology, and Scale
- How does our view of this process change when we consider very large-scale parallel systems?
- Scaling complicates observation and analysis
  - Performance data size: standard approaches deliver a lot of data with little value
  - Measurement overhead and intrusion: tradeoff with analysis accuracy; "noise" in the system
  - Analysis complexity increases
- What will enhance productive application development?
  - Process and technology evolution
  - The nature of application development may change

Role of Intelligence, Automation, and Knowledge
- Scale forces the process to become more intelligent
  - Even with intelligent and application-specific tools, deciding what to analyze is difficult and intractable
- More automation and knowledge-based decision making
  - Build automatic/autonomic capabilities into the tools
  - Support broader experimentation methods and refinement
  - Access and correlate data from several sources
  - Automate performance data analysis / mining / learning
  - Include predictive features and experiment refinement
  - Knowledge-driven adaptation and optimization guidance
  - Address scale issues through increased expertise

TAU Performance System
- Tuning and Analysis Utilities (14+ year project effort)
- Performance system framework for HPC systems
  - Integrated, scalable, flexible, and parallel
- Targets a general complex system computation model
  - Entities: nodes / contexts / threads
  - Multi-level: system / software / parallelism
  - Measurement and analysis abstraction
- Integrated toolkit for performance problem solving
  - Instrumentation, measurement, analysis, and visualization
  - Portable performance profiling and tracing facility
  - Performance data management and data mining
- Partners: LLNL, ANL, Research Center Jülich, LANL

TAU Parallel Performance System Goals
- Portable (open source) parallel performance system
  - Computer system architectures and operating systems
  - Different programming languages and compilers
- Multi-level, multi-language performance instrumentation
- Flexible and configurable performance measurement
- Support for multiple parallel programming paradigms
  - Multi-threading, message passing, mixed-mode, hybrid, object-oriented (generic), component-based
- Support for performance mapping
- Integration of leading performance technology
- Scalable (very large) parallel performance analysis

General Complex System Computation Model
- Node: physically distinct shared-memory machine
  - Message-passing node interconnection network
- Context: distinct virtual memory space within a node
- Thread: execution threads (user/system) within a context
[Diagram: physical view vs. model view: nodes with memory, contexts (VM spaces), and SMP threads, connected by an interconnection network for inter-node message communication]

TAU Performance System Architecture

TAU Performance System Architecture

TAU Instrumentation Approach
- Support for standard program events
  - Routines, classes, and templates
  - Statement-level blocks
- Support for user-defined events
  - Begin/end events ("user-defined timers")
  - Atomic events (e.g., size of memory allocated/freed)
  - Selection of event statistics
- Support definition of "semantic" entities for mapping
- Support for event groups (aggregation, selection)
- Instrumentation optimization
  - Eliminate instrumentation in lightweight routines

TAU Instrumentation Mechanisms
- Source code
  - Manual (TAU API, TAU component API)
  - Automatic (robust)
    - C, C++, F77/90/95 (Program Database Toolkit (PDT))
    - OpenMP (directive rewriting (Opari), POMP2 spec)
- Object code
  - Pre-instrumented libraries (e.g., MPI using PMPI)
  - Statically linked and dynamically linked
- Executable code
  - Dynamic instrumentation (pre-execution) (DynInstAPI)
  - Virtual machine instrumentation (e.g., Java using JVMPI)
- TAU_COMPILER to automate the instrumentation process

Multi-Level Instrumentation and Mapping
- Multiple interfaces
- Information sharing between interfaces
- Event selection within/between levels
- Mapping: associate performance data with high-level semantic abstractions
[Diagram: instrumentation at each stage from problem domain and source code through preprocessor, compiler, linker, libraries, executable, OS/VM, and runtime image, producing performance data]

Program Database Toolkit (PDT)
[Diagram: application/library sources pass through C/C++ and Fortran (F77/90/95) parsers to IL analyzers, producing program database (PDB) files accessed via DUCTAPE by tools such as PDBhtml (program documentation), SILOON (application component glue), CHASM (C++/F90/95 interoperability), and tau_instrumentor (automatic source instrumentation)]

Program Database Toolkit (PDT)
- Program code analysis framework
  - Develop source-based tools
- High-level interface to source code information
- Integrated toolkit for source code parsing, database creation, and database query
  - Commercial-grade front-end parsers
  - Portable IL analyzer, database format, and access API
  - Open software approach for tool development
  - Multiple source languages
- Implements automatic performance instrumentation tools
  - tau_instrumentor

TAU Measurement Approach
- Portable and scalable parallel profiling solution
  - Multiple profiling types and options
  - Event selection and control (enabling/disabling, throttling)
  - Online profile access and sampling
  - Online performance profile overhead compensation
- Portable and scalable parallel tracing solution
  - Trace translation to EPILOG, VTF3, and OTF
  - Trace streams (OTF) and hierarchical trace merging
- Robust timing and hardware performance support
  - Multiple counters (hardware, user-defined, system)
- Performance measurement for CCA component software

TAU Measurement Mechanisms
- Parallel profiling
  - Function-level, block-level, statement-level
  - Supports user-defined events and mapping events
  - TAU parallel profile stored (dumped) during execution
  - Support for flat, callgraph/callpath, and phase profiling
  - Support for memory profiling
- Tracing
  - All profile-level events
  - Inter-process communication events
  - Inclusion of multiple counter data in traced events

Types of Parallel Performance Profiling
- Flat profiles
  - Metric (e.g., time) spent in an event (callgraph nodes)
  - Exclusive/inclusive, number of calls, child calls
- Callpath profiles (calldepth profiles)
  - Time spent along a calling path (edges in the callgraph)
  - "main => f1 => f2 => MPI_Send" (event name)
  - TAU_CALLPATH_LENGTH environment variable
- Phase profiles
  - Flat profiles under a phase (nested phases are allowed)
  - Default "main" phase
  - Supports static or dynamic (per-iteration) phases

Performance Analysis and Visualization
- Analysis of parallel profile and trace measurement
- Parallel profile analysis
  - ParaProf: parallel profile analysis and presentation
  - ParaVis: parallel performance visualization package
  - Profile generation from trace data (tau2pprof)
- Performance data management framework (PerfDMF)
- Parallel trace analysis
  - Translation to VTF (v3.0), EPILOG, and OTF formats
  - Integration with VNG (Technical University of Dresden)
- Online parallel analysis and visualization
- Integration with CUBE browser (KOJAK, UTK, FZJ)

ParaProf Parallel Performance Profile Analysis
[Diagram: raw files from HPMToolkit, MpiP, and TAU feed a PerfDMF-managed database, organized by metadata into application / experiment / trial]

Example Applications
- sPPM: ASCI benchmark; Fortran, C, MPI, OpenMP or pthreads
- Miranda: research hydrodynamics code; Fortran, MPI
- GYRO: tokamak turbulence simulation; Fortran, MPI
- FLASH: physics simulation; Fortran, MPI
- WRF: weather research and forecasting; Fortran, MPI
- S3D: 3D combustion; Fortran, MPI

ParaProf – Flat Profile (Miranda, BG/L)
- 8K processors (node, context, thread); run to 64K
- Miranda: hydrodynamics; Fortran + MPI; LLNL

ParaProf – Stacked View (Miranda)

ParaProf – Callpath Profile (Flash)
- Flash: thermonuclear flashes; Fortran + MPI; Argonne

ParaProf – Histogram View (Miranda)
- 8K and 16K processors

NAS BT – Flat Profile
- How is MPI_Wait() distributed relative to solver direction?
- Application routine names reflect phase semantics

NAS BT – Phase Profile (Main and X, Y, Z)
- Main phase shows nested phases and immediate events

ParaProf – 3D Full Profile (Miranda), 16K processors

ParaProf – 3D Full Profile (Flash), 128 processors

ParaProf Bar Plot (zoom in/out with +/-)

ParaProf – 3D Scatterplot (Miranda)
- Each point is a "thread" of execution
- A total of four metrics shown in relation
- ParaVis 3D profile visualization library (JOGL)

ParaProf – Callgraph Zoom (Flash): zoom in (+), zoom out (-)

Performance Tracing on Miranda
- Use TAU to generate VTF3 traces for Vampir analysis
- MPI calls with hardware counter information (not shown)
- Detailed code behavior to focus optimization efforts

S3D on Lemieux (TAU-to-VTF3, Vampir)
- S3D: 3D combustion; Fortran + MPI; PSC

S3D on Lemieux (Zoomed)

Runtime MPI Shared Library Instrumentation
- The MPI wrapper library can now be interposed for applications that have already been compiled (no re-compilation or re-linking necessary!)
- Uses LD_PRELOAD on Linux
- Soon on AIX using MPI_EUILIB / MPI_EUILIBPATH
- Simply compile TAU with MPI support and prefix your MPI program with tau_load.sh
- Requires a shared-library MPI

  % mpirun -np 4 tau_load.sh a.out

Workload Characterization
- Idea: partition performance data for individual functions based on runtime parameters
- Enable by configuring with -PROFILEPARAM
- TAU_PROFILE_PARAM1L (value, "name")
- Simple example:

  void foo(int input) {
    TAU_PROFILE("foo", "", TAU_DEFAULT);
    TAU_PROFILE_PARAM1L(input, "input");
    ...
  }

Workload Characterization
- 5 seconds spent in function "foo" becomes
  - 2 seconds for "foo [ input = … ]"
  - 1 second for "foo [ input = … ]"
  - …
- Currently used in the MPI wrapper library
  - Allows partitioning of time spent in MPI routines based on parameters (message size, message tag, destination node)
- Can be extrapolated to infer specifics about the MPI subsystem and the system as a whole

Workload Characterization
- Simple example: send/receive with doubling message sizes (up to 32 MB)

  #include <mpi.h>

  int main(int argc, char **argv) {
    int rank, size, i, j;
    static int buffer[16*1024*1024];   /* static: too large for the stack */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < 1000; i++)
      for (j = 1; j < 16*1024*1024; j *= 2) {
        if (rank == 0) {
          MPI_Send(buffer, j, MPI_INT, 1, 42, MPI_COMM_WORLD);
        } else {
          MPI_Status status;
          MPI_Recv(buffer, j, MPI_INT, 0, 42, MPI_COMM_WORLD, &status);
        }
      }
    MPI_Finalize();
    return 0;
  }

Workload Characterization
- Two different message sizes (~3.3 MB and ~4 KB)

Hypothetical Mapping Example
- Particles distributed on the surfaces of a cube

  Particle* P[MAX];  /* array of particles */

  int GenerateParticles() {
    /* distribute particles over all faces of the cube */
    for (int face = 0, last = 0; face < 6; face++) {
      /* particles on this face */
      int particles_on_this_face = num(face);
      for (int i = last; i < last + particles_on_this_face; i++) {
        /* particle properties are a function of face */
        P[i] = ... f(face); ...
      }
      last += particles_on_this_face;
    }
  }

Hypothetical Mapping Example (continued)
- How much time (flops) is spent processing face i particles?
- What is the distribution of performance among faces?
- How is this determined if execution is parallel?

  int ProcessParticle(Particle *p) {
    /* perform some computation on p */
  }

  int main() {
    GenerateParticles();   /* create a list of particles */
    for (int i = 0; i < N; i++)   /* iterate over the list */
      ProcessParticle(P[i]);
  }

[Diagram: work packets flowing through an engine]

No Performance Mapping versus Mapping
- Typical performance tools report performance with respect to routines
- This does not provide support for mapping
- TAU's performance mapping can observe performance with respect to the scientist's programming and problem abstractions
[Screenshots: TAU without mapping vs. TAU with mapping]

Component-Based Scientific Applications
- How can the performance analysis and tuning process be supported consistently with the application development methodology?
- Common Component Architecture (CCA) applications
- Performance tools should integrate with the software
- Design a performance observation component
  - Measurement port and measurement interfaces
- Build support for application component instrumentation
  - Interpose a proxy component for each port
  - Inside the proxy, track caller/callee invocations and timings
- Automate the process of proxy component creation
  - Using PDT for static analysis of components
  - Includes support for selective instrumentation

Flame Reaction-Diffusion (Sandia), CCAFFEINE

Earth System Modeling Framework (ESMF)
- Coupled modeling with a modular software framework
- Instrumentation for the ESMF framework and applications
  - PDT automatic instrumentation
    - Fortran 95 code modules
    - C / C++ code modules
  - MPI wrapper library for MPI calls
- ESMF component instrumentation (using CCA)
  - CCA measurement port manual instrumentation
  - Proxy generation using PDT and runtime interposition
- Significant callpath profiling used by the ESMF team

Using the TAU Component in ESMF/CCA

Important Questions for Application Developers
- How does performance vary with different compilers?
- Is poor performance correlated with certain OS features?
- Has a recent change caused unanticipated performance?
- How does performance vary with MPI variants?
- Why is one application version faster than another?
- What is the reason for the observed scaling behavior?
- Did two runs exhibit similar performance?
- How are performance data related to application events?
- Which machines will run my code the fastest, and why?
- Which benchmarks predict my code's performance best?

Performance Problem Solving Goals
- Answer questions at multiple levels of interest
- Use data from low-level measurements and simulations to predict application performance
- Examine broad performance trends in high-level performance data spanning dimensions: machine, application, code revision, data set
- Discover general correlations between application performance and features of the external environment
- Develop methods to predict application performance from lower-level metrics
- Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system

Automatic Performance Analysis Tool (Concept)
[Diagram: build application → execute application → performance database → offline analysis, with build information and environment/performance data driving simple analysis feedback such as "105% faster!" and "72% faster!"]

Performance Data Management (PerfDMF)
K. Huck, A. Malony, R. Bell, A. Morris, "Design and Implementation of a Parallel Performance Data Management Framework," ICPP (awarded best paper).

Performance Data Mining (Objectives)
- Conduct parallel performance analysis in a systematic, collaborative, and reusable manner
  - Manage performance complexity
  - Discover performance relationships and properties
  - Automate the process
- Multi-experiment performance analysis
- Large-scale performance data reduction
  - Summarize characteristics of large processor runs
- Implement an extensible analysis framework
  - Abstraction / automation of data mining operations
  - Interface to existing analysis and data mining tools

Performance Data Mining (PerfExplorer)
- Performance knowledge discovery framework
  - Data mining analysis applied to parallel performance data: comparative, clustering, correlation, dimension reduction, …
  - Uses the existing TAU infrastructure: TAU performance profiles, PerfDMF
  - Client-server based system architecture
- Technology integration
  - Java API and toolkit for portability
  - PerfDMF
  - R-project/Omegahat, Octave/Matlab statistical analysis
  - WEKA data mining package
  - JFreeChart for visualization; vector output (EPS, SVG)

Performance Data Mining (PerfExplorer)
K. Huck and A. Malony, "PerfExplorer: A Performance Data Mining Framework For Large-Scale Parallel Computing," SC 2005.

PerfExplorer Analysis Methods
- Data summaries, distributions, scatterplots
- Clustering
  - k-means
  - Hierarchical
- Correlation analysis
- Dimension reduction
  - PCA
  - Random linear projection
  - Thresholds
- Comparative analysis
- Data management views

Cluster Analysis
- Performance data represented as vectors: each dimension is the cumulative time for an event
- k-means: k random centers are selected, and instances are grouped with the "closest" (Euclidean) center
- New centers are calculated and the process repeated until stabilization or a maximum number of iterations
- Dimension reduction is necessary for meaningful results
- Virtual topology and summaries are constructed

sPPM Cluster Analysis

Hierarchical and K-means Clustering (sPPM)

Miranda Clusters, Average Values (16K CPUs)
- Two primary clusters due to MPI_Alltoall behavior…
- …also an inverse relationship between MPI_Barrier and MPI_Group_translate_ranks

Miranda Modified
- After code modifications, the work distribution is even
- MPI_Barrier and MPI_Group_translate_ranks are no longer significant contributors to run time

Flash Clustering on 16K BG/L Processors
- Four significant events automatically selected
- Clusters and correlations are visible

Correlation Analysis
- Describes the strength and direction of a linear relationship between two variables (events) in the data

Comparative Analysis
- Relative speedup, efficiency
  - Total runtime, by event, one event, by phase
- Breakdown of total runtime
- Group fraction of total runtime
- Correlating events to total runtime
- Timesteps per second

User-Defined Views
- Reorganization of data for multiple parametric studies
- Construction of views / sub-views with simple operators
- Simple "wizard"-like interface for creating views
[Examples: views organized by application → processors → problem size, and by application → problem type → processors]

PerfExplorer Future Work
- Extensions to the PerfExplorer framework
- Examine properties of performance data
- Automated guidance of analysis
- Workflow scripting for repeatable analysis
- Dependency modeling (going beyond correlation)
- Time-series analysis of phase-based data

Open Trace Format (OTF)
- Features:
  - Hierarchical trace format
  - Replacement for proprietary formats such as STF (Pallas and Intel)
  - Efficient stream-based parallel access
  - Tracing library available on the IBM BG/L platform
- Development of OTF supported by LLNL
- Joint development effort: ZIH / Technical University of Dresden and ParaTools, Inc.

Open Trace Format (OTF)

Vampir and VNG
- Commercial trace-based tools developed at ZIH, T.U. Dresden (Wolfgang Nagel, Holger Brunst, and others)
- Vampir Trace Visualizer
  - Formerly also known as Intel(R) Trace Analyzer v4.0
  - Based on sequential trace analysis
- Vampir Next Generation (VNG)
  - Client (vng) runs on a desktop; server (vngd) runs on a cluster
  - Parallel trace analysis
  - Handles orders-of-magnitude bigger traces (more memory)
  - State of the art in parallel trace visualization

Vampir Next Generation (VNG) Architecture
[Architecture diagram: a parallel program's monitor system writes event streams to trace files (Trace 1..N) in the file system; a parallel analysis server (a master plus workers 1..m) reads the merged traces and serves a remote visualization client over the network. Classic analysis, by contrast, is monolithic and sequential. The timeline view shows 768 processes with a thumbnail segment indicator and 16 visible traces, distinguishing process activity, parallel I/O, and message passing.]

VNG Timeline Display (Miranda on BG/L)

VNG Timeline Zoomed In

VNG Grouping of Interprocess Communications

VNG Process Timeline with PAPI Counters

VNG Calltree Display

OTF/VNG Support for Counters

TAU Tracing Enhancements
- Configure TAU with the -TRACE and -vtf=<dir> or -otf=<dir> options (the <dir> placeholders stand for the VTF3/OTF library installation directories):
    % configure -TRACE -vtf=<dir> ...
    % configure -TRACE -otf=<dir> ...
  This generates the tau_merge, tau2vtf, and tau2otf tools in the TAU architecture bin directory
- Instrument and execute the application:
    % tau_f90.sh app.f90 -o app
    % mpirun -np 4 app
- Merge and convert the trace files to VTF3 or OTF format:
    % tau_treemerge.pl
    % tau2vtf tau.trc tau.edf app.vpt.gz
    % vampir app.vpt.gz
  or:
    % tau2otf tau.trc tau.edf app.otf -n <streams>
    % vampir app.otf
  or use VNG to analyze the OTF/VTF trace files

VNG Communication Matrix Display

VNG Process Activity Chart

TAU Eclipse Integration
- Eclipse GUI integration of existing TAU tools
- New Eclipse plug-in for code instrumentation
- Integration with CDT and FDT: Java, C/C++, and Fortran projects can be instrumented and run from within Eclipse
- Each project can be given multiple build configurations corresponding to the available TAU makefiles
- All TAU configuration options are available
- The ParaProf tool can be launched automatically

TAU Eclipse Integration
[Screenshots: TAU configuration and TAU experimentation dialogs]

TAU Eclipse Future Work
- Development of the TAU Eclipse plug-ins for Java and the CDT/FDT is ongoing
- Planned features include:
  - Full integration with the Eclipse Parallel Tools project
  - Database storage of project performance data
  - Refinement of the plug-in settings interface to allow easier selection of TAU runtime and compile-time options
  - Accessibility of TAU configuration and command-line tools via the Eclipse UI

ZeptoOS and TAU
- DOE OS/RTS for Extreme Scale Scientific Computation
- ZeptoOS: scalable components for petascale architectures (Argonne National Laboratory and University of Oregon)
- University of Oregon contributions:
  - Kernel-level performance monitoring
  - OS component performance assessment and tuning
  - KTAU (Kernel Tuning and Analysis Utilities)
    - Integration of the TAU infrastructure in the Linux kernel
    - Integration with ZeptoOS; installation on BG/L
    - Ports to 32-bit and 64-bit Linux platforms

Linux Kernel Profiling using TAU: Goals
- Fine-grained kernel-level performance measurement for parallel applications
- Support for both profiling and tracing
- Both process-centric and system-wide views
- Merge user-space performance data with kernel-space data
  - User-space: TAU profile/trace
  - Kernel-space: KTAU profile/trace
- Detailed program-OS interaction data, including interrupts (IRQs)
- Analysis and visualization compatible with TAU

TAU Performance System Status
- Computing platforms: IBM, SGI, Cray, HP, Sun, Hitachi, NEC, Linux clusters, Apple, Windows, ...
- Programming languages: C, C++, Fortran 90/95, UPC, HPF, Java, OpenMP, Python
- Thread libraries: pthreads, SGI sproc, Java, Windows, OpenMP
- Communication libraries: MPI-1/2, PVM, SHMEM, ...
- Compilers: IBM, Intel, PGI, GNU, Fujitsu, Sun, NAG, Microsoft, SGI, Cray, HP, NEC, Absoft, Lahey, PathScale, Open64

Papers at European Conferences 2006
- L. Li and A. Malony, "Model-based Performance Diagnosis for Master-Worker Parallel Computations," EuroPar
- A. Nataraj, A. Malony, A. Morris, and S. Shende, "Early Experiences with KTAU on the IBM BG/L," EuroPar
- L. Li and A. Malony, "Model-based Performance Diagnosis of Wavefront Parallel Computations," HPCC
- W. Spear, A. Malony, A. Morris, and S. Shende, "Integrating TAU with Eclipse: A Performance Analysis System in an Integrated Development Environment," HPCC
- K. Huck, A. Malony, S. Shende, and A. Morris, "TAUg: Runtime Global Performance Data Access using MPI," EuroPVM/MPI
- C. Hoge, A. Malony, and D. Keith, "Client-side Task Support in Matlab for Concurrent Distributed Execution," DAPSYS
- A. Nataraj, A. Malony, S. Shende, and A. Morris, "Kernel-level Measurement for Integrated Performance Views: the KTAU Project," Cluster 2006 (distinguished paper)

Project Affiliations (selected)
- Lawrence Livermore National Lab: hydrodynamics (Miranda), radiation diffusion (KULL); Open Trace Format (OTF) implementation on BG/L
- Argonne National Lab: ZeptoOS project and KTAU; astrophysical thermonuclear flashes (Flash)
- Center for Simulation of Accidental Fires and Explosions (C-SAFE): University of Utah, ASCI ASAP Center; Uintah Computational Framework (UCF)
- Oak Ridge National Lab: contributions to the Joule Report (S3D, AORSA3D)

Project Affiliations (continued)
- Sandia National Lab: simulation of turbulent reactive flows (S3D); combustion code (CFRFS)
- Los Alamos National Lab: Monte Carlo transport (MCNP); SAIC's Adaptive Grid Eulerian (SAGE)
- CCSM / ESMF / WRF climate/earth/weather simulation: NSF, NOAA, DOE, NASA, ...
- Common Component Architecture (CCA) integration
- Performance Evaluation Research Center (PERC): DOE SciDAC center

Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science: MICS, Argonne National Lab
  - ASC/NNSA: University of Utah ASC/NNSA Level 1; ASC/NNSA, Lawrence Livermore National Lab
- Department of Defense (DoD): HPC Modernization Office (HPCMO), Programming Environment and Training (PET)
- NSF Software and Tools for High-End Computing
- Research Centre Juelich
- Los Alamos National Laboratory
- ParaTools

Acknowledgements
- Dr. Sameer Shende, Senior Scientist
- Alan Morris, Senior Software Engineer
- Wyatt Spear, PRL staff
- Scott Biersdorff, PRL staff
- Kevin Huck, Ph.D. student
- Aroon Nataraj, Ph.D. student
- Kai Li, Ph.D. student
- Li Li, Ph.D. student
- Adnan Salman, Ph.D. student
- Suravee Suthikulpanit, M.S. student