TAUoverSupermon (ToS): Low-Overhead Online Parallel Performance Monitoring
Aroon Nataraj, Matthew Sottile, Alan Morris, Allen D. Malony, Sameer Shende
Department of Computer and Information Science, Performance Research Laboratory, University of Oregon
EuroPar 2007, Rennes, France

Slide 2: Outline
- Problem, motivations & requirements
- Our approach: coupling TAU and Supermon
  - What is TAU?
  - What is Supermon?
  - How we coupled them
  - The rationale
- Experimentation
  - Online monitored data visualized
  - Performance and scalability results investigated
- Work in progress and future plans
- Related work
- Conclusion

Slide 3: Performance Transport Substrate: Motivations
- A transport substrate for performance measurement
  - Enables communication with (and between) the performance measurement subsystems
  - Enables movement of measurement data and control
- Modes of performance observation
  - Offline / post-mortem observation and analysis: places the fewest requirements on a specialized transport
  - Online observation: needed for long-running applications, especially at scale
  - Online observation with feedback into the application: additionally requires that the transport be bi-directional
- Performance observation problems and requirements are a function of the mode and are addressed by the substrate

Slide 4: Motivations continued
- Why doesn't offline/post-mortem observation suffice? Why online observation at all?
  - Long-running applications, especially at scale
- And why online observation with feedback?
  - In addition, it requires that the transport be bi-directional
- Is this becoming more important? Why? What is the evidence? (See the paper for details.)

Slide 5: TAU Performance System
- Tuning and Analysis Utilities (14+ year project effort)
- Performance system framework for HPC systems
  - Integrated, scalable, flexible, and parallel
- Multiple parallel programming paradigms
- Parallel performance mapping methodology
- Portable (open source) parallel performance system
  - Instrumentation, measurement, analysis, and visualization
  - Portable performance profiling and tracing facility
  - Performance data management and data mining
  - Scalable (very large) parallel performance analysis
- Partners: Research Center Jülich, LLNL, ANL, LANL, UTK

Slide 6: TAU Performance Observation Methodology
- Advocates event-based, direct performance observation
- Observes execution events
  - Types: control flow, state-based, user-defined
  - Modes: atomic, interval (enter/exit)
- Instruments program code directly (defines events)
  - Modifies program code at points of event occurrence
  - Different code forms (source, library, object, binary, VM)
- Measurement code inserted (instantiates events)
  - Makes events visible
  - Measures performance related to event occurrence
- Contrast with event-based sampling

Slide 7: TAU Performance System Architecture (figure: system architecture diagram)

Slide 8: TAU Performance System Architecture (figure: system architecture diagram, continued)

Slide 9: Multi-Level Instrumentation and Mapping
(figure: instrumentation levels from user-level abstractions in the problem domain through source code, preprocessor, compiler, linker, libraries, object code, executable, OS, VM, and runtime image, all producing performance data)
- Multiple interfaces
- Information sharing
  - Between interfaces
- Event selection
  - Within levels
  - Between levels
- Mapping
  - Performance data is associated with high-level semantic abstractions

Slide 10: TAU Instrumentation Approach
- Support for standard program events
  - Routines, classes, and templates
  - Statement-level blocks and loops
- Support for user-defined events
  - Begin/end events ("user-defined timers")
  - Atomic events (e.g., size of memory allocated/freed; see the sketch below)
  - Selection of event statistics
- Support for defining "semantic" entities for mapping
- Support for event groups (aggregation, selection)
- Instrumentation selection and optimization
- Instrumentation enabling/disabling and runtime throttling
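As a concrete illustration of the atomic-event support, here is a minimal C++ sketch, assuming a TAU installation; TAU_REGISTER_EVENT and TAU_EVENT are TAU's atomic-event macros, while the traced_malloc wrapper is hypothetical.

    // Sketch of a user-defined atomic event: tracking bytes allocated, as in
    // the "size of memory allocated/freed" example above. The macros are
    // TAU's; the malloc wrapper is a hypothetical illustration.
    #include <TAU.h>
    #include <cstdlib>

    void *traced_malloc(std::size_t bytes) {
      TAU_REGISTER_EVENT(alloc_ev, "heap bytes allocated");
      TAU_EVENT(alloc_ev, static_cast<double>(bytes));  // TAU keeps count/min/max/mean
      return std::malloc(bytes);
    }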

Slide 11: TAU Instrumentation Mechanisms
- Source code
  - Manual (TAU API, TAU component API; see the sketch below)
  - Automatic (robust)
    - C, C++, F77/90/95 (Program Database Toolkit (PDT))
    - OpenMP (directive rewriting (Opari), POMP2 spec)
- Object code
  - Pre-instrumented libraries (e.g., MPI using PMPI)
  - Statically linked and dynamically linked
- Executable code
  - Dynamic instrumentation (pre-execution) (DyninstAPI)
  - Virtual machine instrumentation (e.g., Java using JVMPI)
- TAU_COMPILER to automate the instrumentation process
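For the manual route, a minimal sketch of TAU's source-instrumentation API, assuming the file is built with TAU's compiler wrappers (e.g., tau_cxx.sh); the compute() routine is a hypothetical stand-in for application code.

    // Minimal sketch of manual source instrumentation via the TAU API.
    #include <TAU.h>

    void compute(int n) {
      TAU_PROFILE("compute", "void (int)", TAU_USER);  // interval (enter/exit) event
      // ... application work ...
    }

    int main(int argc, char **argv) {
      TAU_PROFILE("main", "int (int, char **)", TAU_DEFAULT);
      TAU_PROFILE_SET_NODE(0);   // node id (the MPI wrapper sets this in MPI codes)
      compute(1000);
      return 0;
    }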

Slide 12: TAU Measurement Approach
- Portable and scalable parallel profiling solution
  - Multiple profiling types and options
  - Event selection and control (enabling/disabling, throttling)
  - Online profile access and sampling
  - Online performance profile overhead compensation
- Portable and scalable parallel tracing solution
  - Trace translation to EPILOG, VTF3, and OTF
  - Trace streams (OTF) and hierarchical trace merging
- Robust timing and hardware performance support
- Multiple counters (hardware, user-defined, system)
- Measurement specification separate from instrumentation

Slide 13: TAU Measurement Mechanisms
- Parallel profiling
  - Function-level, block-level, statement-level
  - Supports user-defined events and mapping events
  - TAU parallel profile stored (dumped) during execution
  - Support for flat, callgraph/callpath, and phase profiling
  - Support for memory profiling (headroom, leaks)
- Tracing
  - All profile-level events
  - Inter-process communication events
  - Inclusion of multiple counter data in traced events
- Compile-time and runtime measurement selection

Slide 14: Primary Requirements
- Performance of the substrate
  - Must be low-overhead
- Robust and fault-tolerant
  - Must detect and repair failures
  - Must not adversely affect the application on failure
- Bi-directional transport (control)
  - Selection of events, measurement technique, target nodes
  - What data to output, how often, and in what form?
  - Feedback into the measurement system and application
  - Allows synchronization between sources and sinks
- Scalable
  - Must maintain the above properties at scale

Slide 15: Secondary Requirements
- Data reduction
  - At scale, the cost of moving all data is too high
  - Allow sampling/aggregation in different domains (node-wise, event-wise)
- Online, distributed processing of generated performance data
  - Use the compute resources of transport nodes
  - Global performance analyses within the topology
  - Distribute statistical analyses: some are easy (mean, variance, histogram), some challenging (clustering)

Slide 16: Approach
- Option 1: use NFS from within TAU and the monitors
  - A global shared filesystem must be available
  - File I/O overheads can be high
  - Control through file operations is costly (e.g., metadata ops)
  - Not all data is consumed, so persistent storage is wasteful
- Option 2: build a new custom, lightweight transport
  - Allows tailoring to TAU
  - Significant programming investment
  - Portability concerns across platforms
  - May be re-inventing the wheel
- Our approach: re-use existing transports
  - Transport plug-ins couple with and adapt to TAU

Slide 17: Approach continued
- Measurement and data transport separated
  - No such distinction existed in TAU before
- Created an abstraction, TauOutput, to separate and hide the transport
  - TauOutput exposes a subset of the POSIX file I/O API
  - Acts as a virtual transport layer (see the sketch below)
  - Most of the TAU code base is unaware of transport details
  - Very few changes in the code base outside the plug-in
- Supermon (Sottile and Minnich, LANL) plug-in
  - TAU instruments and measures
  - Supermon bridges monitors (sinks) to contexts (sources)
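To make the shape of that abstraction concrete, here is a small compilable sketch; the class and method names are assumptions for illustration, not TAU's actual internals.

    // Sketch of the TauOutput idea: profile writers see a POSIX-file-like
    // interface, and a plug-in decides where the bytes actually go.
    #include <cstdio>
    #include <cstring>

    class TauOutput {                 // virtual transport layer
    public:
      virtual ~TauOutput() {}
      virtual bool open(const char *name) = 0;
      virtual std::size_t write(const char *buf, std::size_t len) = 0;
      virtual void close() = 0;
    };

    class FileOutput : public TauOutput {   // classic file/NFS binding
      std::FILE *fp = nullptr;
    public:
      bool open(const char *name) override { fp = std::fopen(name, "w"); return fp != nullptr; }
      std::size_t write(const char *buf, std::size_t len) override { return std::fwrite(buf, 1, len, fp); }
      void close() override { if (fp) { std::fclose(fp); fp = nullptr; } }
    };

    // A Supermon binding would instead frame the buffer as an s-expression and
    // send it over the monitoring socket; the measurement code is unchanged.
    int main() {
      FileOutput out;
      if (out.open("profile.0.0.0")) {            // TAU-style profile file name
        const char *rec = "example profile record\n";
        out.write(rec, std::strlen(rec));
        out.close();
      }
      return 0;
    }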

Slide 18: Rationale
- Moved away from NFS
- Separation of concerns
  - Scalability, portability, and robustness addressed independently of TAU
- Re-use existing technologies where appropriate
- Multiple bindings
  - Use the solution best suited to a particular platform
- Implementation speed
  - Easy and fast to create an adapter that binds to an existing transport
- Performance correlation comes as a bonus

Slide 19: Performance & Scalability Investigated
- Moved away from NFS
(table: % overhead for LU-PM, LU-ToNFS, and LU-ToS at three scales N)

Slide 20: Performance & Scalability Investigated
- Where did the time go? Why?
(table: overhead breakdown at three scales N, with columns ToNFS - PM, ToNFS Dump, Gap?, and MPI_Recv Diff)

Slide 21: Performance & Scalability Investigated
- Moved away from NFS (figure: scalability results)

Slide 22: ToM Architecture
- TAU components (layered over the MRNet API)
  - Front-End (FE)
  - Filters
  - Back-End (BE)
- No-Backend-Instantiation mode
- Push-pull model of data retrieval
- No daemon: the instrumented application contains TAU and the Back-End
- Two channels (streams)
  - Data (BE to FE)
  - Control (FE to BE)

Slide 23: ToM Architecture continued
- Application calls into TAU
  - Per-iteration explicit call to the output routine
  - Periodic calls using an alarm
- TauOutput object invoked
  - Configuration-specific: compile time or runtime
  - One per thread
- TauOutput mimics a subset of filesystem-style operations
  - Avoids changes to TAU code
  - If required, the rest of TAU can be made aware of the output type
- Non-blocking receive for control
- Back-end pushes; sink pulls (see the sketch below)
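A minimal sketch of the push side, assuming an instrumented iterative solver; TAU_DB_DUMP is TAU's profile-dump call, while ssor_step() and the 5-iteration period are illustrative (matching the NPB LU example that follows).

    // Sketch of per-iteration online output. Every 5th iteration the
    // application asks TAU to dump its profile database; the configured
    // TauOutput plug-in then pushes the data to the transport (Supermon
    // or MRNet) instead of the filesystem.
    #include <TAU.h>

    void ssor_step(int iter);       // hypothetical application routine

    void solve(int niters) {
      TAU_PROFILE("solve", "void (int)", TAU_USER);
      for (int iter = 0; iter < niters; ++iter) {
        ssor_step(iter);            // application work
        if (iter % 5 == 0)
          TAU_DB_DUMP();            // push current profiles through TauOutput
      }
    }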

Slide 24: Simple Example (NPB LU, class A, per-5-iteration dumps) (figure: exclusive time)

Slide 25: Simple Example (NPB LU, class A, per-5-iteration dumps) (figure: number of calls)

Slide 26: Comparing ToM with NFS
- TAUoverNFS versus TAUoverMRNet
  - 250 ssor iterations; 251 TAU_DB_DUMP operations
- Significant advantages with a specialized transport substrate
  - Similar when using Supermon as the substrate
  - Expensive filesystem metadata operations are remoted to the Front-End
(table: NPB LU (A) on 32 processors, comparing total runtime and TAU_DB_DUMP time over NFS (secs) vs. over MRNet (secs), with % improvement over NFS)

Slide 27: Playing with Filters
- Downstream (FE to BE) multicast path
  - Even without filters, very useful for control
- Data-reduction filters are integral to the upstream path (BE to FE)
  - Without filters, lossless data is reproduced D-1 times
  - Unnecessarily large cost to the network
- Filter 1: random sampling filter (see the sketch below)
  - Very simple data reduction by node-wise sampling
  - Accepts or rejects packets probabilistically
  - The TAU Front-End can control the acceptance probability P(accept)
  - P(accept) = K/N (N = number of leaves, K a user constant)
  - Bounds the number of packets per round to K
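A minimal sketch of such a filter, under the assumption that it sits at a tree node and decides per packet; the class is illustrative and does not use MRNet's actual filter API.

    // Node-wise random sampling: each packet is forwarded with probability
    // P(accept) = K/N, where N is the number of leaf back-ends and K a
    // user constant set by the Front-End.
    #include <random>

    class SamplingFilter {
      double p_accept;                            // K/N, adjustable by the FE
      std::mt19937 rng{std::random_device{}()};
      std::uniform_real_distribution<double> coin{0.0, 1.0};
    public:
      SamplingFilter(int K, int N) : p_accept(static_cast<double>(K) / N) {}
      void set_p(double p) { p_accept = p; }      // control path: FE tunes P(accept)
      bool accept() { return coin(rng) < p_accept; }  // forward or drop one packet
    };

With N leaf back-ends each emitting one packet per round, the expected number of forwarded packets is N x (K/N) = K, which is what bounds the Front-End's ingest rate.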

Slide 28: Filter 1 in Action (ring application)
- Compare different P(accept) values: 1, 1/4, 1/16
- Front-End unable to keep up
  - Queuing delay propagated back

Slide 29: Other Filters
- Statistics filter
  - Reduces raw performance data to a smaller set of statistics
  - Distributes these statistical analyses from the Front-End to the filters
  - Simple measures: mean, standard deviation, histograms (see the sketch below)
  - More sophisticated measures: distributed clustering
- Controlling filters
  - No direct way to control upstream filters, since they are not on the control path
  - Recommended solution: place upstream filters that work in concert with downstream filters to share control information; this requires synchronizing state between upstream and downstream filters
  - Our "echo" hack: Back-Ends transparently echo filter-control packets back upstream, where the filters interpret them; easier to implement, though control response time may be greater
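A sketch of the simple-measures case, assuming each filter reduces its children's packets to partial sums that merge associatively up the tree; the struct is illustrative, not ToM's actual filter code.

    // Each filter keeps (count, sum, sum of squares) per event; merging
    // child summaries up the tree lets the Front-End recover the global
    // mean and variance without seeing per-node raw data.
    struct Stats {
      long   n  = 0;
      double s  = 0.0;    // sum of observed values
      double s2 = 0.0;    // sum of squared values

      void   add(double x)         { ++n; s += x; s2 += x * x; }
      void   merge(const Stats &o) { n += o.n; s += o.s; s2 += o.s2; }
      double mean() const          { return n ? s / n : 0.0; }
      double variance() const      { double m = mean(); return n ? s2 / n - m * m : 0.0; }
    };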

Slide 30: Feedback / Suggestions
- Easy to integrate with MRNet
  - Good examples and documentation, readable source code
- Setup phase
  - Make MRNet intermediate nodes listen on a pre-specified port
  - Allow arbitrary MRNet ranks to connect, then set the IDs in the topology
  - Relaxing strict a-priori ranks can make setup easier
  - Setup in job-queue environments is difficult
- Packetization API could be more flexible
  - The current API is usable and simple (var-arg, printf style)
  - Composing a packet over a series of staggered stages is difficult
- Allow control over how buffering is performed
  - Important in a push-pull model, since data injection points (and rates) are independent of data retrieval
  - Not a problem in a purely pull model

Slide 31: TAUoverMRNet Contrasted with TAUoverSupermon
- Supermon (cluster monitor) vs. MRNet (reduction network): both lightweight transport substrates
- Data format
  - Supermon: ASCII s-expressions
  - MRNet: packets with packed (binary?) data
- Supermon setup
  - Loose topology
  - No support/help in setting up intermediate nodes
  - Assumes Supermon is part of the environment
- MRNet setup
  - Strict topology
  - Better support for starting intermediate nodes
  - With/without Back-End instantiation (TAU uses the latter)
- Multiple Front-Ends (sinks) are possible with Supermon; with MRNet, the Front-End must program this functionality
- No existing pluggable "filter" support in Supermon, so performing aggregation is more difficult with Supermon
- Supermon allows buffer-policy specification; MRNet does not

Slide 32: Future Work
- Dyninst
  - Tighter integration of source and binary instrumentation
  - Conveying source information to the binary level
  - Enabling use of TAU's advanced measurement features
  - Leveraging TAU's performance mapping support
  - Want a robust and portable binary rewriting tool
- MRNet
  - Development of more performance filters
  - Evaluation of MRNet performance for different scenarios
  - Testing at large scale
  - Use in applications

Slide 33: Acknowledgements
- LANL: Ron?
- ANL: cluster access?
- LLNL: cluster access?
- Funding acknowledgements?