
1 TAUoverSupermon (ToS): Low-Overhead Online Parallel Performance Monitoring
Aroon Nataraj, Matthew Sottile, Alan Morris, Allen D. Malony, Sameer Shende
{anataraj, matt, amorris, malony, shende}@cs.uoregon.edu
http://www.cs.uoregon.edu/research/tau
Department of Computer and Information Science, Performance Research Laboratory, University of Oregon

2 Acknowledgements
- LANL - Ron?
- ANL - cluster access?
- LLNL - cluster access?
- Funding Ack?

3 Outline
- Problem, Motivations & Requirements
- Our Approach - Coupling TAU and Supermon
  - What is TAU?
  - What is Supermon?
  - How we coupled them
  - The Rationale
- Experimentation
  - Online monitored data visualized
  - Performance / Scalability results investigated
- Work in progress and Future plans
- Related Work
- Conclusion

4 TAU Transport Substrate - Motivations
- Transport Substrate
  - Enables movement of measurement-related data
  - TAU, in the past, has relied on a shared file-system
- Some Modes of Performance Observation
  - Offline / post-mortem observation and analysis
    - Least requirements for a specialized transport
  - Online observation
    - Long-running applications, especially at scale
    - Dumping to the file-system can be suboptimal
  - Online observation with feedback into the application
    - In addition, requires that the transport is bi-directional
- Performance observation problems and requirements are a function of the mode

5 TAU Performance System
- Tuning and Analysis Utilities (14+ year project effort)
- Performance system framework for HPC systems
  - Integrated, scalable, flexible, and parallel
- Multiple parallel programming paradigms
- Parallel performance mapping methodology
- Portable (open source) parallel performance system
  - Instrumentation, measurement, analysis, and visualization
  - Portable performance profiling and tracing facility
  - Performance data management and data mining
  - Scalable (very large) parallel performance analysis
- Partners
  - Research Center Jülich, LLNL, ANL, LANL, UTK

6 TAU Performance Observation Methodology
- Advocate event-based, direct performance observation
- Observe execution events
  - Types: control flow, state-based, user-defined
  - Modes: atomic, interval (enter/exit)
- Instrument program code directly (defines events)
  - Modify program code at points of event occurrence
  - Different code forms (source, library, object, binary, VM)
- Measurement code inserted (instantiates events)
  - Makes events visible
  - Measures performance related to event occurrence
- Contrast with event-based sampling

7 TAU Performance System Architecture

8 TAU Performance System Architecture

9 Multi-Level Instrumentation and Mapping
[Diagram: user-level abstractions and problem domain mapped through source code, preprocessor, compiler, linker, libraries, object code, executable, OS, and VM to the runtime image and performance data]
- Multiple interfaces
- Information sharing
  - Between interfaces
- Event selection
  - Within levels
  - Between levels
- Mapping
  - Performance data is associated with high-level semantic abstractions

10 TAU Instrumentation Approach
- Support for standard program events
  - Routines, classes and templates
  - Statement-level blocks and loops
- Support for user-defined events (see the API sketch after this slide)
  - Begin/End events ("user-defined timers")
  - Atomic events (e.g., size of memory allocated/freed)
  - Selection of event statistics
- Support definition of "semantic" entities for mapping
- Support for event groups (aggregation, selection)
- Instrumentation selection and optimization
- Instrumentation enabling/disabling and runtime throttling
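A minimal sketch of how these standard and user-defined events might be expressed with TAU's C/C++ instrumentation macros. The macro names (TAU_PROFILE, TAU_REGISTER_EVENT, TAU_EVENT, TAU_PROFILE_INIT, TAU_PROFILE_SET_NODE) are quoted from memory of the TAU API; exact spellings and required build flags can vary by TAU version, and the code assumes it is built with TAU's instrumentation enabled (e.g., via TAU's compiler wrappers).

    #include <TAU.h>      // TAU instrumentation API header
    #include <cstdlib>

    void compute(std::size_t n) {
      // Interval event ("user-defined timer") covering this routine.
      TAU_PROFILE("compute", "void (size_t)", TAU_USER);

      // Atomic event: record the size of each allocation made here.
      TAU_REGISTER_EVENT(alloc_event, "Bytes allocated in compute()");
      double *buf = (double *) std::malloc(n * sizeof(double));
      TAU_EVENT(alloc_event, (double)(n * sizeof(double)));

      for (std::size_t i = 0; i < n; ++i) buf[i] = 0.5 * i;
      std::free(buf);
    }

    int main(int argc, char **argv) {
      TAU_PROFILE_INIT(argc, argv);
      TAU_PROFILE_SET_NODE(0);               // single-process example
      TAU_PROFILE("main", "int (int, char **)", TAU_DEFAULT);
      compute(1u << 20);
      return 0;
    }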

11 TAU Instrumentation Mechanisms
- Source code
  - Manual (TAU API, TAU component API)
  - Automatic (robust)
    - C, C++, F77/90/95 (Program Database Toolkit (PDT))
    - OpenMP (directive rewriting (Opari), POMP2 spec)
- Object code
  - Pre-instrumented libraries (e.g., MPI using PMPI)
  - Statically-linked and dynamically-linked
- Executable code
  - Dynamic instrumentation (pre-execution) (DyninstAPI)
  - Virtual machine instrumentation (e.g., Java using JVMPI)
- TAU_COMPILER to automate instrumentation process

12 TAU Measurement Approach
- Portable and scalable parallel profiling solution
  - Multiple profiling types and options
  - Event selection and control (enabling/disabling, throttling)
  - Online profile access and sampling
  - Online performance profile overhead compensation
- Portable and scalable parallel tracing solution
  - Trace translation to EPILOG, VTF3, and OTF
  - Trace streams (OTF) and hierarchical trace merging
- Robust timing and hardware performance support
  - Multiple counters (hardware, user-defined, system)
- Measurement specification separate from instrumentation

13 TAU Measurement Mechanisms
- Parallel profiling
  - Function-level, block-level, statement-level
  - Supports user-defined events and mapping events
  - TAU parallel profile stored (dumped) during execution
  - Support for flat, callgraph/callpath, phase profiling
  - Support for memory profiling (headroom, leaks)
- Tracing
  - All profile-level events
  - Inter-process communication events
  - Inclusion of multiple counter data in traced events
- Compile-time and runtime measurement selection

14 TAU-over-MRNET (ToM) Project
MRNET as a Transport Substrate in TAU
(Reporting early work done in the last week.)

15 Requirements
- Improve performance of transport
  - NFS can be slow and variable
  - Specialization and remoting of FS-operations to front-end
- Data Reduction
  - At scale, cost of moving data too high
  - Sample in different domain (node-wise, event-wise)
- Control
  - Selection of events, measurement technique, target nodes
  - What data to output, how often, and in what form?
  - Feedback into the measurement system, feedback into the application
- Online, distributed processing of generated performance data
  - Use compute resource of transport nodes
  - Global performance analyses within the topology
  - Distribute statistical analyses
    - Easy (mean, variance, histogram), challenging (clustering)

16 Approach and First Prototype
- Measurement and measured data transport are separate
  - No such distinction in TAU
  - Created abstraction to separate and hide transport: TauOutput (see the sketch after this slide)
- Did not create a custom transport for TAU
  - Use existing monitoring/transport capabilities
  - Supermon (Sottile and Minnich, LANL)
- Piggy-backed TAU performance data on Supermon channels
- Correlate system-level metrics from Supermon with TAU application performance data
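The deck does not show the TauOutput interface itself; the following is a hypothetical C++ sketch of what an abstraction that "mimics a subset of FS-style operations" might look like, with a plain-file backend and a transport backend selected by configuration. All class and method names here are illustrative, not TAU's actual internals.

    #include <cstddef>
    #include <cstdio>
    #include <memory>
    #include <string>

    // Hypothetical sketch: an output abstraction with FS-style operations so the
    // rest of the measurement system does not need to know where profiles go.
    class TauOutput {
    public:
      virtual ~TauOutput() = default;
      virtual bool open(const std::string &name) = 0;
      virtual void write(const char *buf, std::size_t len) = 0;
      virtual void close() = 0;
    };

    // Backend 1: classic behavior, write profiles to the (shared) file system.
    class FileOutput : public TauOutput {
      FILE *fp_ = nullptr;
    public:
      bool open(const std::string &name) override { fp_ = std::fopen(name.c_str(), "w"); return fp_ != nullptr; }
      void write(const char *buf, std::size_t len) override { if (fp_) std::fwrite(buf, 1, len, fp_); }
      void close() override { if (fp_) { std::fclose(fp_); fp_ = nullptr; } }
    };

    // Backend 2: push the same bytes onto a monitoring transport instead
    // (a Supermon or MRNet channel); the send call is a placeholder.
    class TransportOutput : public TauOutput {
    public:
      bool open(const std::string & /*stream name*/) override { return true; }
      void write(const char *buf, std::size_t len) override { /* transport_send(buf, len); */ (void)buf; (void)len; }
      void close() override { /* flush / emit end-of-dump marker */ }
    };

    // Selected per thread, based on compile-time or runtime configuration.
    std::unique_ptr<TauOutput> makeOutput(bool useTransport) {
      if (useTransport) return std::make_unique<TransportOutput>();
      return std::make_unique<FileOutput>();
    }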

17 Rationale
- Moved away from NFS
- Separation of concerns
  - Scalability, portability, robustness
  - Addressed independently of TAU
- Re-use existing technologies where appropriate
- Multiple bindings
  - Use different solutions best suited to a particular platform
- Implementation speed
  - Easy and fast to create an adapter that binds to an existing transport
  - MRNET support was added in about a week
  - Says a lot about the usability of MRNET

18 ToM Architecture
- TAU Components* (*over the MRNet API)
  - Front-End (FE)
  - Filters
  - Back-End (BE)
- No-Backend-Instantiation mode
- Push-Pull model of data retrieval
  - No daemon
  - Instrumented application contains TAU and Back-End
- Two channels (streams)
  - Data (BE to FE)
  - Control (FE to BE)

19 ToM Architecture
- Application calls into TAU (see the sketch after this slide)
  - Per-iteration explicit call to output routine
  - Periodic calls using alarm
- TauOutput object invoked
  - Configuration specific: compile time or runtime
  - One per thread
- TauOutput mimics a subset of FS-style operations
  - Avoids changes to TAU code
  - If required, the rest of TAU can be made aware of the output type
- Non-blocking recv for control
- Back-end pushes, sink pulls
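A minimal sketch of the two triggering styles named above: an explicit dump every few iterations, and a periodic dump driven by an alarm signal. TAU_DB_DUMP() is the profile-dump call mentioned in the NFS comparison later in the deck; whichever output backend is configured decides where the dumped profile goes. The alarm wiring is an assumption about one way to drive periodic output, not a description of TAU's internals.

    #include <TAU.h>
    #include <csignal>
    #include <unistd.h>

    // Option 2: periodic output. The handler only sets a flag (signal-safe);
    // the dump itself happens in the main loop.
    static volatile sig_atomic_t dump_requested = 0;
    extern "C" void on_alarm(int) { dump_requested = 1; alarm(10); }

    int main(int argc, char **argv) {
      TAU_PROFILE_INIT(argc, argv);
      TAU_PROFILE_SET_NODE(0);
      TAU_PROFILE("main", "int (int, char **)", TAU_DEFAULT);

      std::signal(SIGALRM, on_alarm);
      alarm(10);   // request a dump roughly every 10 seconds

      for (int iter = 0; iter < 250; ++iter) {
        // ... one ssor-style iteration of the application ...

        // Option 1: explicit dump every 5 iterations (as in the NPB LU runs).
        if (iter % 5 == 0) TAU_DB_DUMP();

        // Option 2: honor a pending periodic request.
        if (dump_requested) { dump_requested = 0; TAU_DB_DUMP(); }
      }
      return 0;
    }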

20 Simple Example (NPB LU - A, per-5 iterations): Exclusive time

21 Simple Example (NPB LU - A, per-5 iterations): Number of calls

22 Comparing ToM with NFS
- TAUoverNFS versus TAUoverMRNET
  - 250 ssor iterations
  - 251 TAU_DB_DUMP operations
- Significant advantages with a specialized transport substrate
  - Similar when using Supermon as the substrate
  - Remoting of expensive FS meta-data operations to the Front-End

NPB LU (A), 32 processors   Over NFS (secs)   Over MRNET (secs)   % Improvement over NFS
Total Runtime               45.51             36.50               19.79
TAU_DB_DUMP                 8.58              0.366               95.73

23 Playing with Filters
- Downstream (FE to BE) multicast path
  - Even without filters, is very useful for control
- Data Reduction Filters are integral to the Upstream path (BE to FE)
  - Without filters, loss-less data is reproduced D-1 times
  - Unnecessarily large cost to the network
- Filter 1: Random Sampling Filter (see the sketch after this slide)
  - Very simplistic data reduction by node-wise sampling
  - Accepts or rejects packets probabilistically
  - TAU Front-End can control probability P(accept)
  - P(accept) = K/N (N = # leaves, K is a user constant)
  - Bounds number of packets per round to K
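A sketch of the node-wise sampling idea: each upstream packet is kept with probability P(accept) = K/N, so on average about K packets per round reach the front-end. This is written against an illustrative packet type and plain function, not MRNet's actual filter API; all names here are assumptions for illustration.

    #include <algorithm>
    #include <random>
    #include <vector>

    // Illustrative packet type standing in for a per-node profile dump.
    struct ProfilePacket { int node_rank; std::vector<char> payload; };

    // Random sampling filter: keep each incoming packet with probability K/N.
    // In a real reduction network this would run at an intermediate node;
    // here it is a plain function over one round's worth of packets.
    std::vector<ProfilePacket> randomSamplingFilter(const std::vector<ProfilePacket> &in,
                                                    double k, double n_leaves,
                                                    std::mt19937 &rng) {
      const double p_accept = (n_leaves > 0) ? std::min(1.0, k / n_leaves) : 1.0;
      std::bernoulli_distribution accept(p_accept);

      std::vector<ProfilePacket> out;
      for (const auto &pkt : in)
        if (accept(rng)) out.push_back(pkt);   // forward upstream
      // rejected packets are simply dropped at this level
      return out;
    }

The front-end can change P(accept) at run time by pushing a new K down the control stream.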

24 Filter 1 in Action (Ring application)
- Compare different P(accept) values: 1, 1/4, 1/16
- Front-End unable to keep up
- Queuing delay propagated back

25 Other Filters
- Statistics filter (see the sketch after this slide)
  - Reduce raw performance data to a smaller set of statistics
  - Distribute these statistical analyses from the Front-End to the filters
  - Simple measures - mean, std. dev., histograms
  - More sophisticated measures - distributed clustering
- Controlling filters
  - No direct way to control upstream filters (not on the control path)
  - Recommended solution
    - Place upstream filters that work in concert with downstream filters to share control information
    - Requires synchronization of state between upstream and downstream filters
  - Our Echo hack
    - Back-Ends transparently echo Filter-Control packets back upstream, where they are interpreted by the filters
    - Easier to implement
    - Control response time may be greater
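A sketch of the "simple measures" case: a filter that folds many per-node values for one event into a running mean and standard deviation (Welford's online update), forwarding a single summary instead of N raw packets. The surrounding types and the send call are illustrative, not MRNet's filter interface.

    #include <cmath>

    // Running summary for one performance event across back-end packets.
    struct EventStats {
      long   count = 0;
      double mean  = 0.0;
      double m2    = 0.0;   // sum of squared deviations (Welford)

      void add(double x) {
        ++count;
        const double delta = x - mean;
        mean += delta / count;
        m2   += delta * (x - mean);
      }
      double variance() const { return count > 1 ? m2 / (count - 1) : 0.0; }
      double stddev()   const { return std::sqrt(variance()); }
    };

    // At the end of a round, the filter forwards {count, mean, stddev} upstream
    // instead of the raw per-node values, e.g.:
    //   EventStats s; for (double t : exclusive_times) s.add(t);
    //   send_upstream(s.count, s.mean, s.stddev());   // hypothetical send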

26 Feedback / Suggestions
- Easy to integrate with MRNET
  - Good examples, documentation, readable source code
- Setup phase
  - Make MRNET intermediate nodes listen on a pre-specified port
  - Allow arbitrary mrnet-ranks to connect and then set the Ids in the topology
  - Relaxing strict a-priori ranks can make setup easier
  - Setup in Job-Q environments is difficult
- Packetization API could be more flexible
  - Current API is usable and simple (var-arg printf style)
  - Composing a packet over a series of staggered stages is difficult
- Allow control over how buffering is performed
  - Important in a push-pull model, as data injection points (rates) are independent of data retrieval
  - Not a problem in a purely pull model

27 TAUoverMRNET - Contrast with TAUoverSupermon
- Supermon (cluster monitor) vs. MRNet (reduction network)
  - Both light-weight transport substrates
- Data format (see the illustrative example after this slide)
  - Supermon: ASCII s-expressions
  - MRNET: packets with packed (binary?) data
- Supermon setup
  - Loose topology
  - No support/help in setting up intermediate nodes
  - Assumes Supermon is part of the environment
- MRNET setup
  - Strict topology
  - Better support for starting intermediate nodes
  - With/without Back-End instantiation (TAU uses the latter)
- Multiple Front-Ends (or sinks) possible with Supermon
  - With MRNET, the front-end needs to program this functionality
- No existing pluggable "filter" support in Supermon
  - Performing aggregation is more difficult with Supermon
- Supermon allows buffer-policy specification, MRNET does not
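To make the data-format contrast concrete, here is a toy illustration of the same two values carried as an ASCII s-expression (Supermon-style: self-describing, human-readable, larger) versus packed into a binary buffer (MRNet-style: compact, but both ends must agree on the layout). The s-expression layout shown is invented for illustration and is not Supermon's actual schema.

    #include <cstring>
    #include <iostream>
    #include <sstream>
    #include <vector>

    int main() {
      const int    calls     = 251;
      const double excl_secs = 8.58;

      // ASCII s-expression (Supermon-style).
      std::ostringstream sexpr;
      sexpr << "(tau (event \"TAU_DB_DUMP\") (calls " << calls
            << ") (exclusive " << excl_secs << "))";
      std::cout << sexpr.str() << "  [" << sexpr.str().size() << " bytes]\n";

      // Packed binary payload (MRNet-style).
      std::vector<char> buf(sizeof calls + sizeof excl_secs);
      std::memcpy(buf.data(), &calls, sizeof calls);
      std::memcpy(buf.data() + sizeof calls, &excl_secs, sizeof excl_secs);
      std::cout << "binary payload: " << buf.size() << " bytes\n";
      return 0;
    }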

28 Future Work
- Dyninst
  - Tighter integration of source and binary instrumentation
  - Conveying of source information to the binary level
  - Enabling use of TAU's advanced measurement features
  - Leveraging TAU's performance mapping support
  - Want a robust and portable binary rewriting tool
- MRNet
  - Development of more performance filters
  - Evaluation of MRNet performance for different scenarios
  - Testing at large scale
  - Use in applications

