Performance Technology for Productive, High-End Parallel Computing: the TAU Parallel Performance System
Allen D. Malony
Performance Research Laboratory (PRL), Neuroinformatics Center (NIC)
Department of Computer and Information Science, University of Oregon

Outline
 Research interests and motivation
 TAU performance system
 Instrumentation
 Measurement
 Analysis tools
 Parallel profile analysis (ParaProf)
 Performance data management (PerfDMF)
 Performance data mining (PerfExplorer)
 TAU on Solaris 10
 ZeptoOS and KTAU

Research Motivation
 Tools for performance problem solving
 Empirical-based performance optimization process
 Performance technology concerns
[diagram: iterative cycle of performance tuning, diagnosis (hypotheses), experimentation (properties), and observation (characterization), supported by performance technology for instrumentation, measurement, analysis, and visualization, plus experiment management and performance data storage]

Challenges in Performance Problem Solving
 How to make the process more effective (productive)?
 Process likely to change as parallel systems evolve
 What are the important events and performance metrics?
 Tied to application structure and computational model
 Tied to application domain and algorithms
 What are the significant issues that will affect the technology used to support the process?
 Enhance application development and optimization
 Process and tools can/must be more application-aware
 Tools have poor support for application-specific aspects
 Integrate performance technology and process

Performance Process, Technology, and Scale
 How does our view of this process change when we consider very large-scale parallel systems?
 Scaling complicates observation and analysis
 Performance data size
 standard approaches deliver a lot of data with little value
 Measurement overhead and intrusion
 tradeoff with analysis accuracy
 “noise” in the system
 Analysis complexity increases
 What will enhance productive application development?
 Process and technology evolution
 Nature of application development may change

Role of Intelligence, Automation, and Knowledge
 Scale forces the process to become more intelligent
 Even with intelligent and application-specific tools, deciding what to analyze is difficult and intractable
 More automation and knowledge-based decision making
 Build automatic/autonomic capabilities into the tools
 Support broader experimentation methods and refinement
 Access and correlate data from several sources
 Automate performance data analysis / mining / learning
 Include predictive features and experiment refinement
 Knowledge-driven adaptation and optimization guidance
 Address scale issues through increased expertise

TAU Performance System
 Tuning and Analysis Utilities (14+ year project effort)
 Performance system framework for HPC systems
 Integrated, scalable, flexible, and parallel
 Targets a general complex system computation model
 Entities: nodes / contexts / threads
 Multi-level: system / software / parallelism
 Measurement and analysis abstraction
 Integrated toolkit for performance problem solving
 Instrumentation, measurement, analysis, and visualization
 Portable performance profiling and tracing facility
 Performance data management and data mining
 Partners: LLNL, ANL, Research Center Jülich, LANL

TAU Parallel Performance System Goals
 Portable (open source) parallel performance system
 Computer system architectures and operating systems
 Different programming languages and compilers
 Multi-level, multi-language performance instrumentation
 Flexible and configurable performance measurement
 Support for multiple parallel programming paradigms
 Multi-threading, message passing, mixed-mode, hybrid, object-oriented (generic), component
 Support for performance mapping
 Integration of leading performance technology
 Scalable (very large) parallel performance analysis

General Complex System Computation Model
[diagram: physical view vs. model view — nodes with memory connected by an interconnection network carrying inter-node message communication; each node holds contexts (VM spaces) containing SMP threads]
 Node: physically distinct shared memory machine
 Message passing node interconnection network
 Context: distinct virtual memory space within node
 Thread: execution threads (user/system) in context

TAU Performance System Architecture

TAU Performance System Architecture

TAU Instrumentation Approach
 Support for standard program events
 Routines, classes, and templates
 Statement-level blocks
 Support for user-defined events
 Begin/end events (“user-defined timers”)
 Atomic events (e.g., size of memory allocated/freed)
 Selection of event statistics
 Support definition of “semantic” entities for mapping
 Support for event groups (aggregation, selection)
 Instrumentation optimization
 Eliminate instrumentation in lightweight routines

TAU Instrumentation Mechanisms
 Source code
 Manual (TAU API, TAU component API)
 Automatic (robust)
 C, C++, F77/90/95 (Program Database Toolkit (PDT))
 OpenMP (directive rewriting (Opari), POMP2 spec)
 Object code
 Pre-instrumented libraries (e.g., MPI using PMPI)
 Statically linked and dynamically linked
 Executable code
 Dynamic instrumentation (pre-execution) (DynInstAPI)
 Virtual machine instrumentation (e.g., Java using JVMPI)
 TAU_COMPILER to automate instrumentation process

Multi-Level Instrumentation and Mapping
[diagram: instrumentation points spanning user-level abstractions and problem domain, source code, preprocessor, compiler, linker, libraries, object code, executable, OS/VM, and the runtime image, all producing performance data at run time]
 Multiple interfaces
 Information sharing
 Between interfaces
 Event selection
 Within/between levels
 Mapping
 Associate performance data with high-level semantic abstractions

Program Database Toolkit (PDT)
[diagram: application/library sources pass through C/C++ and Fortran (F77/90/95) parsers to IL analyzers, producing program database (PDB) files accessed via DUCTAPE by tools such as PDBhtml (program documentation), SILOON (application component glue), CHASM (C++/F90/95 interoperability), and tau_instrumentor (automatic source instrumentation)]

Program Database Toolkit (PDT)
 Program code analysis framework
 Develop source-based tools
 High-level interface to source code information
 Integrated toolkit for source code parsing, database creation, and database query
 Commercial-grade front-end parsers
 Portable IL analyzer, database format, and access API
 Open software approach for tool development
 Multiple source languages
 Implement automatic performance instrumentation tools
 tau_instrumentor

TAU Measurement Approach
 Portable and scalable parallel profiling solution
 Multiple profiling types and options
 Event selection and control (enabling/disabling, throttling)
 Online profile access and sampling
 Online performance profile overhead compensation
 Portable and scalable parallel tracing solution
 Trace translation to Open Trace Format (OTF)
 Trace streams and hierarchical trace merging
 Robust timing and hardware performance support
 Multiple counters (hardware, user-defined, system)
 Performance measurement for CCA component software

TAU Measurement Mechanisms
 Parallel profiling
 Function-level, block-level, statement-level
 Supports user-defined events and mapping events
 TAU parallel profile stored (dumped) during execution
 Support for flat, callgraph/callpath, phase profiling
 Support for memory profiling
 Tracing
 All profile-level events
 Inter-process communication events
 Inclusion of multiple counter data in traced events

Types of Parallel Performance Profiling
 Flat profiles
 Metric (e.g., time) spent in an event (callgraph nodes)
 Exclusive/inclusive, # of calls, child calls
 Callpath profiles (calldepth profiles)
 Time spent along a calling path (edges in callgraph)
 “main => f1 => f2 => MPI_Send” (event name)
 TAU_CALLPATH_LENGTH environment variable
 Phase profiles
 Flat profiles under a phase (nested phases are allowed)
 Default “main” phase
 Supports static or dynamic (per-iteration) phases
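A callpath-profiling run is configured through the environment, using the TAU_CALLPATH_LENGTH variable named on this slide to bound the recorded path depth. The commands below are an illustrative sketch, not a definitive recipe; the application name and process count are placeholders.

```shell
# Illustrative only: bound recorded callpaths to depth 4, so events like
# "main => f1 => f2 => MPI_Send" appear in the profile.
export TAU_CALLPATH_LENGTH=4
mpirun -np 16 ./app     # profiles then contain callpath events per thread
```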

Performance Analysis and Visualization
 Analysis of parallel profile and trace measurement
 Parallel profile analysis
 ParaProf: parallel profile analysis and presentation
 ParaVis: parallel performance visualization package
 Profile generation from trace data (tau2pprof)
 Performance data management framework (PerfDMF)
 Parallel trace analysis
 Translation to VTF (v3.0), EPILOG, OTF formats
 Integration with VNG (Technical University of Dresden)
 Online parallel analysis and visualization
 Integration with CUBE browser (KOJAK, UTK, FZJ)

ParaProf Parallel Performance Profile Analysis
[diagram: raw profile files from TAU, MpiP, and HPMToolkit feed a PerfDMF-managed database, with metadata organized by application, experiment, and trial]

Example Applications
 sPPM
 ASCI benchmark; Fortran, C, MPI, OpenMP or pthreads
 Miranda
 research hydrodynamics code; Fortran, MPI
 GYRO
 tokamak turbulence simulation; Fortran, MPI
 FLASH
 physics simulation; Fortran, MPI
 WRF
 weather research and forecasting; Fortran, MPI
 S3D
 3D combustion; Fortran, MPI

ParaProf – Flat Profile (Miranda, BG/L)
[screenshot: flat profile on 8K processors, organized by node, context, thread; runs to 64K]
Miranda: hydrodynamics, Fortran + MPI, LLNL

ParaProf – Stacked View (Miranda)

ParaProf – Callpath Profile (Flash)
Flash: thermonuclear flashes, Fortran + MPI, Argonne

ParaProf – Histogram View (Miranda)
[screenshots: 8K-processor and 16K-processor runs]

NAS BT – Flat Profile
How is MPI_Wait() distributed relative to solver direction?
Application routine names reflect phase semantics

NAS BT – Phase Profile (Main and X, Y, Z)
Main phase shows nested phases and immediate events

ParaProf – 3D Full Profile (Miranda)
[screenshot: 16K processors]

ParaProf – 3D Full Profile (Flash)
[screenshot: 128 processors]

ParaProf – 3D Scatterplot (Miranda)
 Each point is a “thread” of execution
 A total of four metrics shown in relation
 ParaVis 3D profile visualization library
 JOGL

ParaProf – Callgraph Zoom (Flash)
[screenshot: callgraph with zoom in (+) / zoom out (−) controls]

Performance Tracing on Miranda
 Use TAU to generate VTF3 traces for Vampir analysis
 MPI calls with HW counter information (not shown)
 Detailed code behavior to focus optimization efforts

S3D on Lemieux (TAU-to-VTF3, Vampir)
S3D: 3D combustion, Fortran + MPI, PSC

S3D on Lemieux (Zoomed)

Hypothetical Mapping Example
 Particles distributed on surfaces of a cube

Particle* P[MAX]; /* Array of particles */
int GenerateParticles() {
  /* distribute particles over all faces of the cube */
  for (int face = 0, last = 0; face < 6; face++) {
    /* particles on this face */
    int particles_on_this_face = num(face);
    for (int i = last; i < last + particles_on_this_face; i++) {
      /* particle properties are a function of face */
      P[i] = ... f(face); ...
    }
    last += particles_on_this_face;
  }
}

Hypothetical Mapping Example (continued)
 How much time (flops) is spent processing face i particles?
 What is the distribution of performance among faces?
 How is this determined if execution is parallel?

int ProcessParticle(Particle *p) {
  /* perform some computation on p */
}
int main() {
  GenerateParticles(); /* create a list of particles */
  for (int i = 0; i < N; i++) /* iterate over the list */
    ProcessParticle(P[i]);
}
[diagram: particles flow as work packets through a processing engine]

No Performance Mapping versus Mapping
 Typical performance tools report performance with respect to routines
 Does not provide support for mapping
 TAU’s performance mapping can observe performance with respect to the scientist’s programming and problem abstractions
[screenshots: TAU without mapping vs. TAU with mapping]

Component-Based Scientific Applications
 How to support a performance analysis and tuning process consistent with the application development methodology?
 Common Component Architecture (CCA) applications
 Performance tools should integrate with software
 Design performance observation component
 Measurement port and measurement interfaces
 Build support for application component instrumentation
 Interpose a proxy component for each port
 Inside the proxy, track caller/callee invocations, timings
 Automate the process of proxy component creation
 using PDT for static analysis of components
 include support for selective instrumentation

Flame Reaction-Diffusion (Sandia)
[screenshot: CCAFFEINE component wiring]

Earth Systems Modeling Framework
 Coupled modeling with modular software framework
 Instrumentation for ESMF framework and applications
 PDT automatic instrumentation
 Fortran 95 code modules
 C / C++ code modules
 MPI wrapper library for MPI calls
 ESMF component instrumentation (using CCA)
 CCA measurement port manual instrumentation
 Proxy generation using PDT and runtime interposition
 Significant callpath profiling used by ESMF team

Using TAU Component in ESMF/CCA

Important Questions for Application Developers
 How does performance vary with different compilers?
 Is poor performance correlated with certain OS features?
 Has a recent change caused unanticipated performance?
 How does performance vary with MPI variants?
 Why is one application version faster than another?
 What is the reason for the observed scaling behavior?
 Did two runs exhibit similar performance?
 How are performance data related to application events?
 Which machines will run my code the fastest and why?
 Which benchmarks predict my code performance best?

Performance Problem Solving Goals
 Answer questions at multiple levels of interest
 Data from low-level measurements and simulations
 use to predict application performance
 High-level performance data spanning dimensions
 machine, applications, code revisions, data sets
 examine broad performance trends
 Discover general correlations between application performance and features of its external environment
 Develop methods to predict application performance from lower-level metrics
 Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system

Automatic Performance Analysis Tool (Concept)
[diagram: build application → execute application → performance database → offline analysis → simple analysis feedback (“105% faster!”, “72% faster!”), with build information and environment/performance data flowing into the database]

Performance Data Management (PerfDMF)
K. Huck, A. Malony, R. Bell, A. Morris, “Design and Implementation of a Parallel Performance Data Management Framework,” ICPP (awarded best paper).

Performance Data Mining (Objectives)
 Conduct parallel performance analysis in a systematic, collaborative, and reusable manner
 Manage performance complexity
 Discover performance relationships and properties
 Automate process
 Multi-experiment performance analysis
 Large-scale performance data reduction
 Summarize characteristics of large processor runs
 Implement extensible analysis framework
 Abstraction / automation of data mining operations
 Interface to existing analysis and data mining tools

Performance Data Mining (PerfExplorer)
 Performance knowledge discovery framework
 Data mining analysis applied to parallel performance data
 comparative, clustering, correlation, dimension reduction, …
 Use the existing TAU infrastructure
 TAU performance profiles, PerfDMF
 Client-server based system architecture
 Technology integration
 Java API and toolkit for portability
 PerfDMF
 R-project/Omegahat, Octave/Matlab statistical analysis
 WEKA data mining package
 JFreeChart for visualization, vector output (EPS, SVG)

Performance Data Mining (PerfExplorer)
K. Huck and A. Malony, “PerfExplorer: A Performance Data Mining Framework for Large-Scale Parallel Computing,” SC 2005.

PerfExplorer Analysis Methods
 Data summaries, distributions, scatterplots
 Clustering
 k-means
 Hierarchical
 Correlation analysis
 Dimension reduction
 PCA
 Random linear projection
 Thresholds
 Comparative analysis
 Data management views

Cluster Analysis
 Performance data represented as vectors; each dimension is the cumulative time for an event
 k-means: k random centers are selected and instances are grouped with the “closest” (Euclidean) center
 New centers are calculated and the process repeated until stabilization or max iterations
 Dimension reduction necessary for meaningful results
 Virtual topology, summaries constructed

sPPM Cluster Analysis

Hierarchical and K-means Clustering (sPPM)

Miranda Clusters, Average Values (16K CPUs)
 Two primary clusters due to MPI_Alltoall behavior…
 …also an inverse relationship between MPI_Barrier and MPI_Group_translate_ranks

Miranda Modified
 After code modifications, work distribution is even
 MPI_Barrier and MPI_Group_translate_ranks are no longer significant contributors to run time

Flash Clustering on 16K BG/L Processors
 Four significant events automatically selected
 Clusters and correlations are visible

Correlation Analysis
 Describes the strength and direction of a linear relationship between two variables (events) in the data

Comparative Analysis
 Relative speedup, efficiency
 total runtime, by event, one event, by phase
 Breakdown of total runtime
 Group fraction of total runtime
 Correlating events to total runtime
 Timesteps per second

User-Defined Views
 Reorganization of data for multiple parametric studies
 Construction of views / sub-views with simple operators
 Simple “wizard”-like interface for creating views
[examples: view hierarchies Application → Processors → Problem size, and Application → Problem type → Processors]

PerfExplorer Future Work
 Extensions to PerfExplorer framework
 Examine properties of performance data
 Automated guidance of analysis
 Workflow scripting for repeatable analysis
 Dependency modeling (going beyond correlation)
 Time-series analysis of phase-based data

TAU Eclipse Integration
 Eclipse GUI integration of existing TAU tools
 New Eclipse plug-in for code instrumentation
 Integration with CDT and FDT
 Java, C/C++, and Fortran projects
 can be instrumented and run from within Eclipse
 Each project can be given multiple build configurations corresponding to available TAU makefiles
 All TAU configuration options are available
 ParaProf tool can be launched automatically

TAU Eclipse Integration
[screenshots: TAU configuration and TAU experimentation]

TAU Eclipse Future Work
 Development of the TAU Eclipse plugins for Java and the CDT/FDT is ongoing
 Planned features include:
 Full integration with the Eclipse Parallel Tools project
 Database storage of project performance data
 Refinement of the plugin settings interface to allow easier selection of TAU runtime and compile-time options
 Accessibility of TAU configuration and command-line tools via the Eclipse UI

Porting TAU to Sun Solaris 10 / Opteron
 Already supported earlier versions of Solaris
 Already supported Linux and Windows
 Will work directly with Sun Opteron systems
 Now have full support for Solaris 10
 Intel-based systems
 Opteron systems
 Profiling and tracing with all options
 Compilers supported
 Sun Studio 10 compilers (C, C++, and F90)
 Parallel models supported
 MPI, OpenMP, hybrid

Porting Challenge – Nested OpenMP
 TAU did not previously support nested parallelism at all
 Opari could instrument, but does not distinguish nesting
 Lack of OpenMP support for performance tools
 determining thread information for nested threads
 Want to build a portable mechanism for OpenMP
 OpenMP may provide a runtime call in the future
 Explore what can be implemented now
 Now supports nested OpenMP parallelism
 OpenMP runtime system independent
 Depends on language feature

Sun66: Nested Parallelism Implementation in TAU
 TAU normally uses omp_get_thread_num()
  Identifies threads with a globally unique identifier (tid)
  Used to create per-thread measurement structures
   event callstack, profile objects, …
 This approach breaks down for nested parallelism
  omp_get_thread_num() returns the id within the current team, which is not globally unique
 Solution: use the #pragma omp threadprivate() directive
  Gives thread-local storage (TLS) for identifying threads
  TAU generates a tid for each thread on a first-come, first-served basis
  Supported in the Intel and Sun compilers (probably IBM as well)
  Tested on Sun (Sparc, x64) and SGI Prism (Itanium2)
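The first-come, first-served tid scheme above can be sketched in Python, using `threading.local()` as an analogue of the `#pragma omp threadprivate` TLS slot. This is an illustrative sketch of the mechanism, not TAU source code: each thread is handed a globally unique id the first time it is seen, independent of any team-local numbering.

```python
import threading

# Analogue of "#pragma omp threadprivate": per-thread storage slot.
_tls = threading.local()
_lock = threading.Lock()
_next_tid = 0                    # global first-come, first-served counter

def get_global_tid():
    """Return a globally unique thread id, assigned on first use."""
    global _next_tid
    if not hasattr(_tls, "tid"):         # first call from this thread
        with _lock:                      # serialize id assignment
            _tls.tid = _next_tid
            _next_tid += 1
    return _tls.tid

# Each thread keeps the same id across calls; distinct threads get
# distinct ids even though their team-local numbers may collide.
ids = []
def worker():
    ids.append((get_global_tid(), get_global_tid()))

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()

assert all(a == b for a, b in ids)                 # stable per thread
assert sorted(a for a, _ in ids) == [0, 1, 2, 3]   # globally unique
print("ok")
```

The id can then be used to index per-thread measurement structures (callstacks, profile objects), exactly the role `omp_get_thread_num()` played in the non-nested case.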

Sun67: ZeptoOS and TAU
 DOE OS/RTS for Extreme Scale Scientific Computation
  ZeptoOS: scalable components for petascale architectures
  Argonne National Laboratory and University of Oregon
 University of Oregon
  Kernel-level performance monitoring
  OS component performance assessment and tuning
 KTAU (Kernel Tuning and Analysis Utilities)
  Integration of the TAU infrastructure in the Linux kernel
  Integration with ZeptoOS
  Installation on BG/L
  Ported to 32-bit and 64-bit Linux platforms

Sun68: Linux Kernel Profiling using TAU – Goals
 Fine-grained kernel-level performance measurement
  For parallel applications
  Support both profiling and tracing
  Both process-centric and system-wide views
 Merge user-space performance data with kernel-space data
  User-space: (TAU) profile/trace
  Kernel-space: (KTAU) profile/trace
 Detailed program–OS interaction data
  Including interrupts (IRQs)
 Analysis and visualization compatible with TAU

Sun69: KTAU System Architecture

Sun70: KTAU on BG/L I/O Node

Sun71: KTAU Usage Models
 Daemon-based monitoring (KTAU-D)
  Use KTAU-D to monitor (profile/trace) a single process (e.g., CIOD) or the entire I/O-node kernel
  No access to the source code of the user-space program is required
  CIOD kernel activity is available even though the CIOD source is not
 “Self” monitoring
  A user-space program can be instrumented (e.g., with TAU) to access its OWN kernel-level trace/profile data
  The ZIOD (ZeptoOS I/O daemon) source, when available, can be instrumented
  Can produce a MERGED user/kernel trace/profile

Sun72: KTAU-D Profile Data
 KTAU-D can be used to access profile data (system-wide and per process) of the BG/L I/O node
  Data is sampled at the start and stop of KTAU-D, and the resulting profile is generated (work in progress)
 Currently, flat profiles with inclusive/exclusive times and function call counts are produced
  Future work: call-graph profiles
 Profile data is viewed using the ParaProf visualization tool
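The inclusive/exclusive split in a flat profile follows from a simple invariant: a function's exclusive time is its inclusive time minus the inclusive time of its callees. The sketch below (with a made-up, hypothetical event stream rather than real KTAU data) shows how such a flat profile can be derived from timestamped entry/exit events using an event callstack.

```python
# Hypothetical timestamped kernel events: (timestamp, kind, function).
events = [
    (0, "entry", "sys_write"),
    (2, "entry", "copy_from_user"),
    (5, "exit",  "copy_from_user"),
    (9, "exit",  "sys_write"),
]

profile = {}   # name -> (call count, inclusive time, exclusive time)
stack = []     # entries: [name, entry_time, time spent in callees]

for ts, kind, name in events:
    if kind == "entry":
        stack.append([name, ts, 0])
    else:
        top, t0, child_time = stack.pop()
        incl = ts - t0
        excl = incl - child_time          # exclusive = inclusive - callees
        calls, i, e = profile.get(top, (0, 0, 0))
        profile[top] = (calls + 1, i + incl, e + excl)
        if stack:
            stack[-1][2] += incl          # credit time to the caller

print(profile)
# -> {'copy_from_user': (1, 3, 3), 'sys_write': (1, 9, 6)}
```

Here `sys_write` runs for 9 time units inclusively but only 6 exclusively, since 3 units are spent inside `copy_from_user`.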

Sun73: KTAU-D Trace
 KTAU-D can be used to access system-wide and per-process trace data of the BG/L I/O node
 Traces from KTAU-D are converted into the TAU trace format, which can then be converted into other formats
  Vampir, Jumpshot
 Traces from KTAU-D can be merged with traces from TAU to monitor both user-space and kernel-space activities (work in progress)
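Since both the TAU (user-space) and KTAU (kernel-space) traces are time-ordered event streams, producing a single combined timeline amounts to a timestamp-ordered merge. A minimal sketch, with made-up records standing in for real trace files:

```python
from heapq import merge

# Hypothetical records: (timestamp, space, event description).
user_trace = [
    (10, "user",   "MPI_File_write enter"),
    (40, "user",   "MPI_File_write exit"),
]
kernel_trace = [
    (12, "kernel", "sys_write enter"),
    (38, "kernel", "sys_write exit"),
]

# Both inputs are already sorted by timestamp, so heapq.merge yields
# one interleaved timeline in a single linear pass.
merged = list(merge(user_trace, kernel_trace))
for ts, space, event in merged:
    print(ts, space, event)
```

The interleaved view makes the program/OS interaction visible: the kernel-level `sys_write` activity appears nested inside the user-level `MPI_File_write` call.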

Sun74: Experiment to Show KTAU in Use
 Workload
  “iotest” benchmark on BG/L
  2, 4, 16, 32, 48, and 64 compute nodes
 Use KTAU in the I/O-node ZeptoOS kernel
  Collect trace data
  KTAU-D on the I/O node periodically monitors system activities and dumps out trace data
 Visualize the activities in the trace using Vampir

Sun75: Experiment Setup (Parameters)
 KTAU
  Enable all instrumentation points
  Number of kernel trace entries per process = 10K
 KTAU-D
  System-wide tracing
  Access the trace every 1 second and dump the trace output to a file in the user’s home directory via NFS
 IOTEST
  An MPI-based benchmark (open/write/read/close)
  Run with default parameters (blocksize = 16 MB)
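The open/write/read/close pattern that iotest exercises can be sketched serially; this is an illustrative stand-in (not the actual MPI-based benchmark), and the small block size here is an assumption made so the example runs quickly, whereas the slides use a 16 MB default.

```python
import os
import tempfile
import time

BLOCK = 64 * 1024              # assumption: small block for illustration
data = b"x" * BLOCK

path = os.path.join(tempfile.mkdtemp(), "iotest.dat")

t0 = time.perf_counter()
with open(path, "wb") as f:    # open + write + close
    f.write(data)
with open(path, "rb") as f:    # open + read + close
    readback = f.read()
elapsed = time.perf_counter() - t0

assert readback == data
print(f"wrote/read {BLOCK} bytes in {elapsed:.6f} s")
```

On the BG/L system, each such call from a compute node is forwarded to CIOD on the I/O node, which is exactly the kernel-side activity (e.g., `sys_write`) that KTAU-D traces.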

Sun76: SYS_WRITE – KTAU Trace of CIOD Running with 2, 4, 8, 16, 32 Nodes
 As the number of compute nodes increases, CIOD has to handle a larger number of forwarded system calls:
  1,769 sys_write
  3,142 sys_write
  5,838 sys_write
  10,980 sys_write
  37,985 sys_write

Sun77: Zoomed View of CIOD Trace (8 compute nodes)

Sun78: TAU Performance System Status
 Computing platforms
  IBM, SGI, Cray, HP, Sun, Hitachi, NEC, Linux clusters, Apple, Windows, …
 Programming languages
  C, C++, Fortran 90/95, UPC, HPF, Java, OpenMP, Python
 Thread libraries
  pthreads, SGI sproc, Java, Windows, OpenMP
 Communication libraries
  MPI-1/2, PVM, shmem, …
 Compilers
  IBM, Intel, PGI, GNU, Fujitsu, Sun, NAG, Microsoft, SGI, Cray, HP, NEC, Absoft, Lahey, PathScale, Open64

Sun79: Project Affiliations (selected)
 Lawrence Livermore National Lab
  Hydrodynamics (Miranda), radiation diffusion (KULL)
  Open Trace Format (OTF) implementation on BG/L
 Argonne National Lab
  ZeptoOS project and KTAU
  Astrophysical thermonuclear flashes (Flash)
 Center for Simulation of Accidental Fires and Explosions
  University of Utah, ASCI ASAP Center, C-SAFE
  Uintah Computational Framework (UCF)
 Oak Ridge National Lab
  Contribution to the Joule Report (S3D, AORSA3D)

Sun80: Project Affiliations (continued)
 Sandia National Lab
  Simulation of turbulent reactive flows (S3D)
  Combustion code (CFRFS)
 Los Alamos National Lab
  Monte Carlo transport (MCNP)
  SAIC’s Adaptive Grid Eulerian (SAGE)
 CCSM / ESMF / WRF climate/earth/weather simulation
  NSF, NOAA, DOE, NASA, …
 Common Component Architecture (CCA) integration
 Performance Evaluation Research Center (PERC)
  DOE SciDAC center

Sun81: Support Acknowledgements
 Department of Energy (DOE)
  Office of Science
  MICS, Argonne National Lab
  ASC/NNSA
   University of Utah ASC/NNSA Level 1
   ASC/NNSA, Lawrence Livermore National Lab
 Department of Defense (DoD)
  HPC Modernization Office (HPCMO)
  Programming Environment and Training (PET)
 NSF Software and Tools for High-End Computing
 Research Centre Juelich
 Los Alamos National Laboratory
 ParaTools

Sun82: Acknowledgements
 Dr. Sameer Shende, Senior Scientist
 Alan Morris, Senior Software Engineer
 Wyatt Spear, PRL staff
 Scott Biersdorff, PRL staff
 Kevin Huck, Ph.D. student
 Aroon Nataraj, Ph.D. student
 Kai Li, Ph.D. student
 Li Li, Ph.D. student
 Adnan Salman, Ph.D. student
 Suravee Suthikulpanit, M.S. student