Allen D. Malony Department of Computer and Information Science Performance Research Laboratory University of Oregon Performance Technology.

Presentation transcript:

Allen D. Malony Department of Computer and Information Science Performance Research Laboratory University of Oregon Performance Technology for Productive, High-End Parallel Computing

LLNL, Oct

Outline of Talk
- Research motivation
- Scalability, productivity, and performance technology
- Application-specific and autonomic performance tools
- TAU parallel performance system developments
- Application performance case studies
- New project directions
- Performance data mining and knowledge discovery
- Concluding discussion

Research Motivation
- Tools for performance problem solving
- Empirical-based performance optimization process
- Performance technology concerns
[Figure: performance problem-solving process. Stages: Performance Observation, Performance Experimentation, Performance Diagnosis, Performance Tuning. Labels: characterization, properties, hypotheses. Supporting performance technology: instrumentation, measurement, analysis, visualization, experiment management, performance database.]

Large Scale Performance Problem Solving
- How does our view of this process change when we consider very large-scale parallel systems?
- What are the significant issues that will affect the technology used to support the process?
- Parallel performance observation is clearly needed
- In general, there is the concern for intrusion
- Seen as a tradeoff with performance diagnosis accuracy
- Scaling complicates observation and analysis
- Nature of application development may change
- Paradigm shift in performance process and technology?
- What will enhance productive application development?

Scaling and Performance Observation
- Consider "traditional" measurement methods
- Profiling: summary statistics calculated during execution
- Tracing: time-stamped sequence of execution events
- More parallelism means more performance data overall
- Performance specific to each thread of execution
- Possible increase in the number of interactions between threads
- Harder to manage the data (memory, transfer, storage)
- How does per-thread profile size grow?
- Instrumentation more difficult with greater parallelism?
- More parallelism / performance data means harder analysis
- More time consuming to analyze and difficult to visualize
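The difference in data volume between the two measurement methods can be illustrated with a minimal sketch. This is not TAU code; the `Profiler` and `Tracer` classes, the event name, and the synthetic clock are all assumptions made for illustration. The profile keeps one summary record per event, while the trace grows with every event occurrence.

```python
from collections import defaultdict


class Profiler:
    """Profiling: keep only summary statistics per event."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.total = defaultdict(float)

    def record(self, event, duration):
        self.calls[event] += 1
        self.total[event] += duration


class Tracer:
    """Tracing: keep every event occurrence with its timestamp."""

    def __init__(self):
        self.events = []

    def record(self, timestamp, event, kind):
        self.events.append((timestamp, event, kind))


def run(n_iterations):
    """Simulate n_iterations of one hypothetical 'compute' event."""
    profiler, tracer = Profiler(), Tracer()
    clock = 0.0  # synthetic clock so the sketch is deterministic
    for _ in range(n_iterations):
        tracer.record(clock, "compute", "enter")
        clock += 1.0
        tracer.record(clock, "compute", "exit")
        profiler.record("compute", 1.0)
    return profiler, tracer


profiler, tracer = run(1000)
print(len(profiler.calls), "profile record(s);", len(tracer.events), "trace records")
```

In this sketch the profile size depends on the number of distinct events, whereas the trace size depends on how long the program runs, which is one reason trace data becomes harder to manage at scale.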

Concern for Performance Measurement Intrusion
- Performance measurement can affect the execution
- Perturbation of "actual" performance behavior
- Minor intrusion can lead to major execution effects
- Problems exist even with small degree of parallelism
- Intrusion is an accepted consequence of standard practice
- Consider intrusion (perturbation) of trace buffer overflow
- Scale exacerbates the problem … or does it?
- Traditional measurement techniques tend to be localized
- Suggests scale may not compound local intrusion globally
- Measuring parallel interactions likely will be affected
- Use accepted measurement techniques intelligently

Role of Intelligence and Specificity
- How to make the process more effective (productive)?
- Scale forces performance observation to be intelligent
- Standard approaches deliver a lot of data with little value
- What are the important performance events and data?
- Tied to application structure and computational mode
- Tools have poor support for application-specific aspects
- Process and tools can be more application-aware
- Will allow scalability issues to be addressed in context
- More control and precision of performance observation
- More guided performance experimentation / exploration
- Better integration with application development

Role of Automation and Knowledge Discovery
- Even with intelligent and application-specific tools, the decisions of what to analyze may become intractable
- Scale forces the process to become more automated
- Performance extrapolation must be part of the process
- Build autonomic capabilities into the tools
- Support broader experimentation methods and refinement
- Access and correlate data from several sources
- Automate performance data analysis / mining / learning
- Include predictive features and experiment refinement
- Knowledge-driven adaptation and optimization guidance
- Address scale issues through increased expertise

TAU Parallel Performance System Goals
- Multi-level performance instrumentation
- Multi-language automatic source instrumentation
- Flexible and configurable performance measurement
- Widely-ported parallel performance profiling system
- Computer system architectures and operating systems
- Different programming languages and compilers
- Support for multiple parallel programming paradigms
- Multi-threading, message passing, mixed-mode, hybrid
- Support for performance mapping
- Support for object-oriented and generic programming
- Integration in complex software, systems, applications

TAU Parallel Performance System Architecture
[Figures: architecture diagrams, two slides]

Advances in TAU Instrumentation
- Source instrumentation
- Program Database Toolkit (PDT)
- automated Fortran 90/95 support (Flint parser, very robust)
- statement level support in C/C++ (Fortran soon)
- TAU_COMPILER to automate instrumentation process
- Automatic proxy generation for component applications
- automatic CCA component instrumentation
- Python instrumentation and automatic instrumentation
- Continued integration with dynamic instrumentation
- Update of OpenMP instrumentation (POMP2)
- Selective instrumentation and overhead reduction
- Improvements in performance mapping instrumentation

Advances in TAU Measurement
- Profiling
- Memory profiling
- global heap memory tracking (several options)
- Callpath profiling
- user-controllable calling depth
- Improved support for multiple counter profiling
- Online profile access and sampling
- Tracing
- Generation of VTF3 trace files (fully portable)
- Inclusion of hardware performance counts in trace files
- Hierarchical trace merging
- Online performance overhead compensation
- Component software proxy generation and monitoring
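Callpath profiling with a user-controllable calling depth can be sketched as follows. This is an illustrative model only, not TAU's implementation: the function names, the sample call stacks, and the `" => "` separator used to join frames are assumptions. Each event is attributed to the last `depth` frames of its call stack, so a larger depth distinguishes more calling contexts at the cost of more profile entries.

```python
from collections import defaultdict


def callpath_key(stack, depth):
    """Join the innermost `depth` frames of a call stack into one callpath name."""
    return " => ".join(stack[-depth:])


def profile_calls(call_stacks, depth):
    """Count events per callpath, truncated to the requested calling depth."""
    counts = defaultdict(int)
    for stack in call_stacks:
        counts[callpath_key(stack, depth)] += 1
    return dict(counts)


# Hypothetical call stacks, outermost frame first.
stacks = [
    ["main", "solve", "MPI_Send"],
    ["main", "io", "MPI_Send"],
    ["main", "solve", "MPI_Send"],
]

flat = profile_calls(stacks, depth=1)   # one entry: all MPI_Send calls merged
paths = profile_calls(stacks, depth=2)  # separate entries per immediate caller
print(flat)
print(paths)
```

With depth 1 all three calls collapse into a single `MPI_Send` entry; with depth 2 the calls from `solve` and `io` are reported separately.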

Advances in TAU Performance Analysis
- Enhanced parallel profile analysis (ParaProf)
- Callpath analysis integration in ParaProf
- Embedded Lisp interpreter
- Performance Data Management Framework (PerfDMF)
- First release of prototype
- In use by several groups
- S. Moore (UTK), P. Teller (UTEP), P. Hovland (ANL), …
- Integration with Vampir Next Generation (VNG)
- Online trace analysis
- Performance visualization (ParaVis) prototype
- Component performance modeling and QoS

TAU Performance System Status
- Computing platforms (selected): IBM SP / pSeries, SGI Origin 2K/3K, Cray T3E / SV-1 / X1, HP (Compaq) SC (Tru64), Sun, Hitachi SR8000, NEC SX-5/6, Linux clusters (IA-32/64, Alpha, PPC, PA-RISC, Power, Opteron), Apple (G4/5, OS X), Windows
- Programming languages: C, C++, Fortran 77/90/95, HPF, Java, OpenMP, Python
- Thread libraries: pthreads, SGI sproc, Java, Windows, OpenMP
- Compilers (selected): Intel KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray, IBM (xlc, xlf), HP, NEC, Absoft

Component-Based Scientific Applications
- How to support performance analysis and tuning process consistent with application development methodology?
- Common Component Architecture (CCA) applications
- Performance tools should integrate with software
- Design performance observation component
- Measurement port and measurement interfaces
- Build support for application component instrumentation
- Interpose a proxy component for each port
- Inside the proxy, track caller/callee invocations, timings
- Automate the process of proxy component creation
- using PDT for static analysis of components
- include support for selective instrumentation
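The idea of interposing a proxy that tracks invocations and timings can be sketched in a few lines. This is a generic illustration, not the CCA or TAU proxy generator: `TimingProxy` and the `Integrator` component are hypothetical names, and a real proxy would be generated per port from static analysis rather than written by hand.

```python
import time
from collections import defaultdict


class TimingProxy:
    """Wraps a component and records call counts and elapsed time per method."""

    def __init__(self, component):
        self._component = component
        self.calls = defaultdict(int)
        self.seconds = defaultdict(float)

    def __getattr__(self, name):
        # Only reached for names not defined on the proxy itself,
        # so every component method is forwarded through here.
        target = getattr(self._component, name)
        if not callable(target):
            return target

        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return target(*args, **kwargs)
            finally:
                self.calls[name] += 1
                self.seconds[name] += time.perf_counter() - start

        return wrapper


class Integrator:
    """Hypothetical component: midpoint-rule integration of x**2."""

    def integrate(self, lo, hi, steps):
        h = (hi - lo) / steps
        return sum((lo + (i + 0.5) * h) ** 2 for i in range(steps)) * h


proxy = TimingProxy(Integrator())
result = proxy.integrate(0.0, 1.0, 100)
print(result, dict(proxy.calls))
```

The caller uses the proxy exactly as it would use the component, which is what lets measurement stay at the component interface without modifying the component itself.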

Flame Reaction-Diffusion (Sandia, J. Ray)
[Figure label: CCAFFEINE]

Component Modeling and Optimization
- Given a set of components, where each component has multiple implementations, what is the optimal subset of implementations that solves a given problem?
- How to model a single component?
- How to model a composition of components?
- How to select the optimal subset of implementations?
- A component only has performance meaning in context
- Applications are dynamically composed at runtime
- Application developers use components from others
- Instrumentation may only be at component interfaces
- Performance measurements need to be non-intrusive
- Users are interested in coarse-grained performance

MasterMind Component (Trebon, IPDPS 2004)

Proxy Generator for other Applications
- TAU (PDT) proxy component for:
- QoS tracking [Boyana, ANL]
- Debugging Port Monitor for CCA (tracks arguments)
- SCIRun2 Perfume components [Venkat, U. Utah]
- Exploring Babel for auto-generation of proxies:
- Direct SIDL-to-proxy code generation
- Generating client component interface in C++
- Using PDT for generating proxies

Earth Systems Modeling Framework
- Coupled modeling with modular software framework
- Instrumentation for ESMF framework and applications
- PDT automatic instrumentation
- Fortran 95 code modules
- C / C++ code modules
- MPI wrapper library for MPI calls
- ESMF Component instrumentation (using CCA)
- CCA measurement port manual instrumentation
- Proxy generation using PDT and runtime interposition
- Significant callpath profiling used by ESMF team

Using TAU Component in ESMF/CCA

TAU’s ParaProf Profile Browser (ESMF Data)
[Figure label: Callpath profile]

CUBE Browser (UTK, FZJ) (ESMF Data)
[Figure: panes labeled metric, calltree, location. TAU profile data converted to CUBE form.]

TAU Traces with Counters (ESMF)

Visualizing TAU Traces with Counters/Samples

Uintah Computational Framework (UCF)
- University of Utah, Center for Simulation of Accidental Fires and Explosions (C-SAFE), DOE ASCI Center
- UCF analysis
- Scheduling
- MPI library
- Components
- Performance mapping
- Use for online and offline visualization
- ParaVis tools
[Figure label: 500 processes]

Scatterplot Displays (UCF, 500 processes)
- Each point coordinate determined by three values: MPI_Reduce, MPI_Recv, MPI_Waitsome
- Min/Max value range
- Effective for cluster analysis
[Figure caption: Relation between MPI_Recv and MPI_Waitsome]

Online Uintah Performance Profiling
- Demonstration of online profiling capability
- Multiple profile samples
- Each profile taken at major iteration (~ 60 seconds)
- Colliding elastic disks
- Test material point method (MPM) code
- Executed on 512 processors of ASCI Blue Pacific at LLNL
- Example
- 3D bargraph visualization
- MPI execution time
- Performance mapping
- Multiple time steps

Online Uintah Performance Profiling

Miranda Performance Analysis (Miller, LLNL)
- Miranda is a research hydrodynamics code
- Fortran 95, MPI
- Mostly synchronous
- MPI_ALLTOALL on Np x,y communicators
- Some MPI reductions and broadcasts for statistics
- Good communications scaling
- ACL and MCR Linux cluster
- Up to 1728 CPUs
- Fixed workload per CPU
- Ported to BlueGene/L
- Breaking News! (see next slide)

Profiling of Miranda on BG/L (Miller, LLNL)
[Figure panels: 128 Nodes, 512 Nodes, 1024 Nodes]
- Profile code performance (automatic instrumentation)
- Scaling studies (problem size, number of processors)
- Run on 8K and 16K processors this week!

Fine Grained Profiling via Tracing on Miranda
- Use TAU to generate VTF3 traces for Vampir analysis
- Combines MPI calls with HW counter information
- Detailed code behavior to focus optimization efforts

Memory Usage Analysis
[Figure caption: Max Heap Memory (KB) used for problem on 16 processors of ASC Frost at LLNL]
- BG/L will have limited memory per node (512 MB)
- Miranda uses TAU to profile memory usage
- Streamlines code
- Squeeze larger problems on the machine
- TAU’s footprint is small
- Approximately 100 bytes per event per thread
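The footprint claim can be turned into a quick estimate. The sketch below uses only the two figures quoted on the slide (about 100 bytes per event per thread, 512 MB per node); the event count of 1000 and the function name are hypothetical values chosen for illustration.

```python
BYTES_PER_EVENT_PER_THREAD = 100        # approximate figure quoted on the slide
NODE_MEMORY_BYTES = 512 * 1024 * 1024   # 512 MB per BG/L node, from the slide


def profile_footprint_bytes(n_events, n_threads=1):
    """Estimate profile memory as events x threads x bytes per event."""
    return BYTES_PER_EVENT_PER_THREAD * n_events * n_threads


# 1000 instrumented events on one thread is an assumed, illustrative count.
footprint = profile_footprint_bytes(n_events=1000)
fraction = footprint / NODE_MEMORY_BYTES
print(f"{footprint} bytes, {fraction:.4%} of node memory")
```

Under these assumptions the profile data occupies roughly 100 KB, about 0.02% of a 512 MB node, which is consistent with the slide's point that the measurement footprint is small relative to the application.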

Kull Performance Optimization (Miller, LLNL)
- Kull is a Lagrange hydrodynamics code
- Physics packages written in C++ and Fortran
- Parallel Python interpreter run-time environment!
- Scalar test problem analysis
- Serial execution to identify performance factors
- Original code profile indicated expensive functions
- CCSubzonalEffects member functions
- Examination revealed optimization opportunities
- Loop merging
- Amortizing geometric lookup over more calculations
- Apply to CSSubzonalEffects member functions

Kull Optimization
[Figure panels: Original Exclusive Profile, Optimized Exclusive Profile]
- CSSubzonalEffects member functions total time
- Reduced from 5.80 seconds to 0.82 seconds
- Overall run time reduced from 28.1 to [value missing in source] seconds

Important Questions for Application Developers
- How does performance vary with different compilers?
- Is poor performance correlated with certain OS features?
- Has a recent change caused unanticipated performance?
- How does performance vary with MPI variants?
- Why is one application version faster than another?
- What is the reason for the observed scaling behavior?
- Did two runs exhibit similar performance?
- How are performance data related to application events?
- Which machines will run my code the fastest and why?
- Which benchmarks predict my code performance best?

Multi-Level Performance Data Mining
- New (just forming) research project
- PSU: Karen L. Karavanic
- Cornell: Sally A. McKee
- UO: Allen D. Malony and Sameer Shende
- LLNL: John M. May and Bronis R. de Supinski
- Develop performance data mining technology
- Scientific applications, benchmarks, other measurements
- Systematic analysis for understanding and prediction
- Better foundation for evaluation of leadership-class computer systems
- “Scalable, Interoperable Tools to Support Autonomic Optimization of High-End Applications,” S. McKee, G. Tyson, A. Malony, begin Nov. 1, 2004.

General Goals
- Answer questions at multiple levels of interest
- Data from low-level measurements and simulations
- use to predict application performance
- data mining applied to optimize data gathering process
- High-level performance data spanning dimensions
- Machine, applications, code revisions
- Examine broad performance trends
- Need technology
- Performance instrumentation and measurement
- Performance data management
- Performance analysis and results presentation
- Automated performance experimentation and exploration

Specific Goals
- Design, develop, and populate a performance database
- Discover general correlations between application performance and features of the external environment
- Develop methods to predict application performance from lower-level metrics
- Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system
- Performance data mining infrastructure is important for all of these goals
- Establish a more rational basis for evaluating the performance of leadership-class computers

PerfTrack: Performance DB and Analysis Tool
- PSU: Kathryn Mohror, Karen Karavanic
- UO: Kevin Huck
- LLNL: John May, Brian Miller (CASC)

TAU Performance Data Management Framework

TAU Performance Regression (PerfRegress)
- Prototype developed by Alan Morris for Uintah
- Re-implement using PerfDMF

Background – Ahn & Vetter, 2002
- “Scalable Analysis Techniques for Microprocessor Performance Counter Metrics,” SC2002
- Applied multivariate statistical analysis techniques to large datasets of performance data (PAPI events)
- Cluster Analysis and F-Ratio
- Agglomerative Hierarchical Method: dendrogram identified groupings of master, slave threads in sPPM
- K-means clustering and F-ratio: differences between master, slave related to communication and management
- Factor Analysis
- shows highly correlated metrics fall into peer groups
- Combined techniques (applied recursively) lead to observations of application behavior that are hard to identify otherwise

Similarity Analysis
- Can we recreate Ahn and Vetter’s results?
- Apply techniques from the phase analysis (Sherwood)
- Threads of execution can be compared for similarity
- Threads with abnormal behavior show up as less similar
- Each thread is represented as a vector (V) of dimension n
- n is the number of functions in the application: $V = [f_1, f_2, \ldots, f_n]$ (represents the event mix)
- Each value is the percentage of time spent in that function
- normalized from 0.0 to 1.0
- Distance calculated between the vectors U and V: $\mathrm{ManhattanDistance}(U, V) = \sum_{i=1}^{n} |u_i - v_i|$
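A minimal sketch of this similarity measure follows. The helper names and the per-function times are hypothetical; the only parts taken from the slide are the normalization of each thread's event mix to fractions between 0.0 and 1.0 and the Manhattan distance between two such vectors.

```python
def event_mix(times):
    """Normalize per-function times to fractions of the thread's total time."""
    total = sum(times)
    if total == 0:
        return [0.0] * len(times)
    return [t / total for t in times]


def manhattan_distance(u, v):
    """Sum of absolute component-wise differences between two vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical time (seconds) spent in three functions by three threads.
thread_a = event_mix([6.0, 3.0, 1.0])
thread_b = event_mix([12.0, 6.0, 2.0])  # same mix as thread_a, twice the runtime
thread_c = event_mix([1.0, 1.0, 8.0])   # very different mix

print(manhattan_distance(thread_a, thread_b))
print(manhattan_distance(thread_a, thread_c))
```

Because the vectors are normalized, two threads with the same proportions but different total runtimes have distance zero, while a thread with an unusual mix stands out with a larger distance.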

sPPM on Blue Horizon (64x4, OpenMP+MPI)
[Figure labels: TAU profiles, 10 events, PerfDMF, threads 32-47]

sPPM on MCR (total instructions, 16x2)
[Figure labels: TAU/PerfDMF, 120 events, master (even), worker (odd)]

sPPM on MCR (PAPI_FP_INS, 16x2)
[Figure labels: TAU profiles, PerfDMF, master/worker, higher/lower]
- Same result as Ahn/Vetter

sPPM on Frost (PAPI_FP_INS, 256 threads)
- View of fewer than half of the threads of execution is possible on the screen at one time
- Three groups are obvious:
- Lower ranking threads
- One unique thread
- Higher ranking threads
- 3% more FP
- Finding subtle differences is difficult with this view

sPPM on Frost (PAPI_FP_INS, 256 threads)
- Dendrogram shows 5 natural clusters:
- Unique thread
- High ranking master threads
- Low ranking master threads
- High ranking worker threads
- Low ranking worker threads
[Figure labels: TAU profiles, PerfDMF, R direct access to DM, R routine, threads]

sPPM on MCR (PAPI_FP_INS, 16x2 threads)
[Figure labels: masters, slaves]

sPPM on Frost (PAPI_FP_INS, 256 threads)
- After K-means clustering into 5 clusters
- Similar clusters are formed (seed with group means)
- Each cluster’s performance characteristics analyzed
- Dimensionality reduction (256 threads to 5 clusters!)
[Figure labels: SPPM, INTERF, DIFUZE, DINTERF, Barrier [OpenMP:runhyd3.F ]]
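K-means clustering seeded with group means can be sketched in plain Python. This is an illustrative version, not the R-based analysis used in the talk: it assumes squared Euclidean distance, and the four two-dimensional "thread" vectors and the two seed centroids are made-up data. The seeds play the role of the group means mentioned on the slide.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, seeds, max_iter=100):
    """Lloyd's algorithm starting from the given seed centroids."""
    centroids = [list(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = mean(members)
    return assignment, centroids


# Hypothetical event-mix vectors for four threads, and two seed centroids.
threads = [[0.90, 0.10], [0.88, 0.12], [0.50, 0.50], [0.52, 0.48]]
seeds = [[0.90, 0.10], [0.50, 0.50]]

labels, centroids = kmeans(threads, seeds)
print(labels)
print(centroids)
```

After clustering, each cluster can be summarized by its centroid, which is the sense in which many threads are reduced to a handful of representative behaviors.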

PerfExplorer Design (K. Huck, UO)
- Performance knowledge discovery framework
- Use the existing TAU infrastructure
- TAU instrumentation data, PerfDMF
- Client-server based system architecture
- Data mining analysis applied to parallel performance data
- Technology integration
- Relational Database Management Systems (RDBMS)
- Java API and toolkit
- R-project / Omegahat statistical analysis
- Web-based client
- Jakarta web server and Struts (for a thin web-client)

PerfExplorer Architecture
- Client is a traditional Java application with GUI (Swing)
- Server accepts multiple client requests and returns results
- PerfDMF Java API used to access DBMS via JDBC
- Server supports R data mining operations built using RSJava
- Analyses can be scripted, parameterized, and monitored
- Browsing of analysis results via automatic web page creation and thumbnails

ZeptoOS: Extreme Performance Scalable OS’s
- DOE, Office of Science
- OS / RTS for Extreme Scale Scientific Computation
- Argonne National Lab and University of Oregon
- Investigate operating system and run-time (OS/R) functionality required for scalable components used in petascale architectures
- Flexible OS/R functionality
- Scalable OS/R system calls
- Performance tools, monitoring, and metrics
- Fault tolerance and resiliency
- Approach
- Specify OS/R requirements across scalable components
- Explore flexible functionality (Linux)
- Hierarchical designs optimized with collective OS/R interfaces
- Integrated (horizontal, vertical) performance measurement / analysis
- Fault scenarios and injection to observe behavior

ZeptoOS Plans
- Explore Linux functionality for BG/L
- Explore efficiency for ultra-small kernels
- Scheduler, memory, IO
- Construct kernel-level collective operations
- Support for dynamic library loading, …
- Build Faulty Towers Linux kernel and system for replaying fault scenarios
- Extend TAU
- Profiling OS suites
- Benchmarking collective OS calls
- Observing effects of faults

Concluding Discussion
- As high-end systems scale, it will be increasingly important that performance tools be used effectively
- Performance observation methods do not necessarily need to change in a fundamental sense
- Just need to be controlled and used efficiently
- More intelligent performance systems for productive use
- Evolve to application-specific performance technology
- Deal with scale by “full range” performance exploration
- Autonomic and integrated tools
- Knowledge-based and knowledge-driven process
- Deliver to community next-generation tools

Support Acknowledgements
- Department of Energy (DOE)
- Office of Science contracts
- University of Utah ASCI Level 1 sub-contract
- ASC/NNSA Level 3 contract
- NSF
- High-End Computing Grant
- Research Centre Juelich
- John von Neumann Institute
- Dr. Bernd Mohr
- Los Alamos National Laboratory