Presentation is loading. Please wait.

Presentation is loading. Please wait.

Aroon Nataraj, Matthew Sottile, Alan Morris, Allen D. Malony, Sameer Shende { anataraj, matt, amorris, malony,

Similar presentations


Presentation on theme: "Aroon Nataraj, Matthew Sottile, Alan Morris, Allen D. Malony, Sameer Shende { anataraj, matt, amorris, malony,"— Presentation transcript:

1 Aroon Nataraj, Matthew Sottile, Alan Morris, Allen D. Malony, Sameer Shende { anataraj, matt, amorris, malony, shende}@cs.uoregon.edu http://www.cs.uoregon.edu/research/tau Department of Computer and Information Science Performance Research Laboratory University of Oregon TAUoverSupermon (ToS) Low-Overhead Online Parallel Performance Monitoring

2 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France2 Outline  Problem, Motivations & Requirements  Our Approach - Coupling TAU and Supermon  What is TAU?  What is Supermon?  And how we coupled them?  The Rationale  Experimentation  Online monitored data visualized  Performance / Scalability results investigated  Work in progress and Future plans  Related Work  Conclusion

3 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France3 Performance Transport Substrate - Motivations  Transport Substrate for Performance Measurement  Enables communication with (and between) the performance measurement subsystems  Enables movement of measurement data and control  Modes of Performance Observation  Offline / Post-mortem observation and analysis  least requirements for a specialized transport  Online observation  long running applications, especially at scale  Online observation with feedback into application  in addition, requires that the transport is bi-directional  Performance observation problems/requirements => Function of the mode => Addressed by substrate

4 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France4 Motivations continued …  So why doesn’t Offline/Postmortem Observation suffice?  Why online observation at all?  long running applications, especially at scale  And why online observation with feedback  in addition, requires that the transport is bi-directional  Is this becoming more important? Why? Evidence?  Need to take from Paper… - read it.

5 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France5 TAU Performance System  Tuning and Analysis Utilities (14+ year project effort)  Performance system framework for HPC systems  Integrated, scalable, flexible, and parallel  Multiple parallel programming paradigms  Parallel performance mapping methodology  Portable (open source) parallel performance system  Instrumentation, measurement, analysis, and visualization  Portable performance profiling and tracing facility  Performance data management and data mining  Scalable (very large) parallel performance analysis  Partners  Research Center Jülich, LLNL, ANL, LANL, UTK

6 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France6 TAU Performance Observation Methodology  Advocate event-based, direct performance observation  Observe execution events  Types: control flow, state-based, user-defined  Modes: atomic, interval (enter/exit)  Instrument program code directly (defines events)  Modify program code at points of event occurrence  Different code forms (source, library, object, binary, VM)  Measurement code inserted (instantiates events)  Make events visible  Measures performance related to event occurrence  Contrast with event-based sampling

7 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France7 TAU Performance System Architecture

8 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France8 TAU Performance System Architecture

9 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France9 User-level abstractions problem domain source code object codelibraries instrumentation executable runtime image compiler linkerOS VM instrumentation performance data run preprocessor Multi-Level Instrumentation and Mapping  Multiple interfaces  Information sharing  Between interfaces  Event selection  Within levels  Between levels  Mapping  Performance data is associated with high- level semantic abstractions

10 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France10 TAU Instrumentation Approach  Support for standard program events  Routines, classes and templates  Statement-level blocks and loops  Support for user-defined events  Begin/End events (“user-defined timers”)  Atomic events (e.g., size of memory allocated/freed)  Selection of event statistics  Support definition of “semantic” entities for mapping  Support for event groups (aggregation, selection)  Instrumentation selection and optimization  Instrumentation enabling/disabling and runtime throttling

11 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France11 TAU Instrumentation Mechanisms  Source code  Manual (TAU API, TAU component API)  Automatic (robust)  C, C++, F77/90/95 (Program Database Toolkit (PDT))  OpenMP (directive rewriting (Opari), POMP2 spec)  Object code  Pre-instrumented libraries (e.g., MPI using PMPI)  Statically-linked and dynamically-linked  Executable code  Dynamic instrumentation (pre-execution) (DyninstAPI)  Virtual machine instrumentation (e.g., Java using JVMPI)  TAU_COMPILER to automate instrumentation process

12 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France12 TAU Measurement Approach  Portable and scalable parallel profiling solution  Multiple profiling types and options  Event selection and control (enabling/disabling, throttling)  Online profile access and sampling  Online performance profile overhead compensation  Portable and scalable parallel tracing solution  Trace translation to EPILOG, VTF3, and OTF  Trace streams (OTF) and hierarchical trace merging  Robust timing and hardware performance support  Multiple counters (hardware, user-defined, system)  Measurement specification separate from instrumentation

13 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France13 TAU Measurement Mechanisms  Parallel profiling  Function-level, block-level, statement-level  Supports user-defined events and mapping events  TAU parallel profile stored (dumped) during execution  Support for flat, callgraph/callpath, phase profiling  Support for memory profiling (headroom, leaks)  Tracing  All profile-level events  Inter-process communication events  Inclusion of multiple counter data in traced events  Compile-time and runtime measurement selection

14 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France14 Primary Requirements  Performance of substrate  Must be low-overhead  Robust and Fault-Tolerant  Must detect and repair failures  Must not adversely affect the application on failures  Bi-directional Transport (Control)  Selection of events, measurement technique, target nodes  What data to output, how often and in what form?  Feedback into the measurement system & application  Allows synchronization between sources & sinks (??)  Scalable  Must maintain the above properties at scale

15 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France15 Secondary Requirements  Data Reduction  At scale, cost of moving data too high  Allow sampling/aggregation in different domains (node- wise, event-wise)  Online, Distributed Processing of generated performance data  Use compute resource of transport nodes  Global performance analyses within the topology  Distribute statistical analyses  easy (mean, variance, histogram), challenging (clustering)

16 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France16 Approach  Option 1: Use a NFS from within TAU and monitors  Global shared-FS must be available  File I/O overheads can be high  Control through file-ops costly (e.g. meta-data ops)  All data not consumed - persistent storage wasteful  Option 2: Build new custom, light-weight transport  Allows tailoring to TAU  Significant programming investment  Portability concerns across  Re-inventing the wheel, may be  Our approach: Re-use existing transports  Transport plug-ins couple with and adapt to TAU

17 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France17 Approach continued …  Measurement and data transport separated  No such distinction in TAU before  Created abstraction to separate and hide transport  TauOutput  TauOutput exposes subset of Posix File I/O API  Acts as a virtual transport layer  Most of TAU code-base unware of transport details  Very few changes in code-base outside plugin  Supermon (Sottile and Minnich, LANL) Plug-in  TAU instruments & measures  Supermon bridges monitors (sinks) to contexts (sources)

18 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France18 Rationale  Moved away from NFS  Separation of concerns  Scalability, portability, robustness  Addressed independent of TAU  Re-use existing technologies where appropriate  Multiple bindings  Use different solutions best suited to particular platform  Implementation speed  Easy, fast to create adapter that binds to existing transport  Performance correlation : Bonus

19 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France19 Performance & Scalability Investigated  Moved away from NFS %Overhead LU-PMLU-ToNFSLU-ToS N=1280.96270.0674.234 N=2564.314311.4759.318 N=5125.401946.39821.606

20 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France20 Performance & Scalability Investigated  Where did the time go?  Why? ToNFS - PMToNFS DumpGap?Mpi_Recv Diff N=12814.367.4286.9327.023 N=25635.615.83419.76519.886 N=51267.9432.36735.57335.745

21 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France21 Performance & Scalability Investigated  Moved away from NFS

22 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France22 ToM Architecture  TAU Components*  Front-End (FE)  Filters  Back-End (BE)  * Over MRNet API  No-Backend-Instantiation mode  Push-Pull model of data retrieval  No daemon  Instrumented application contains TAU and Back-End  Two channels (streams)  Data (BE to FE)  Control (FE to BE)

23 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France23 ToM Architecture  Applicaton calls into TAU  Per-Iteration explicit call to output routine  Periodic calls using alarm  TauOutput object invoked  Configuration specific: compile or runtime  One per thread  TauOutput mimics subset of FS-style operations  Avoids changes to TAU code  If required rest of TAU can be made aware of output type  Non-blocking recv for control  Back-end pushes  Sink pulls

24 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France24 Simple Example (NPB LU - A, Per-5 iterations) Exclusive time

25 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France25 Simple Example (NPB LU - A, Per-5 iterations) Number of calls

26 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France26 Comparing ToM with NFS  TAUoverNFS versus TAUoverMRNET  250 ssor iterations  251 TAU_DB_DUMP operations  Significant advantages with specialized transport substrate  Similar when using Supermon as the substrate  Remoting of expensive FS meta-data operations to Front-End NPB LU (A) 32 Processors Over NFS (secs) Over MRNET (secs) % Improvement over NFS Total Runtime45.5136.5019.79 TAU_DB_DUMP8.580.36695.73

27 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France27 Playing with Filters  Downstream (FE to BE) multicast path  Even without filters, is very useful for control  Data Reduction Filters are integral to Upstream path (BE to FE)  W/O filters loss-less data reproduced D-1 times  Unnecessary large cost to network  Filter 1: Random Sampling Filter  Very simplistic data reduction by node-wise sampling  Accepts or Rejects packets probabilistically  TAU Front-End can control probability P(accept)  P(accept)=K/N (N = # leafs, K is user constant)  Bounds number of packets per-round to K

28 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France28 Filter 1 in Action (Ring application)  Compare different P(accept) values  1, 1/4, 1/16  Front-End unable to keep up  Queuing delay propagated back

29 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France29 Other Filters  Statistics filter  Reduce raw performance data to smaller set of statistics  Distribute these statistical analyses from Front-End to the filters  Simple measures - mean, std.dev, histograms  More sophisticated measures - distributed clustering  Controlling filters  No direct way to control Upstream-filters  not on control path  Recommended solution  place upstream filters that work in concert with downstream filters to share control information  requires synchronization of state between upstream and downstream filters  Our Echo hack  Back-Ends transparently echo Filter-Control packets back upstream  this is then interpreted by the filters  easier to implement  control response time may be greater

30 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France30 Feedback / Suggestions  Easy to integrate with MRNET  Good examples documentation, readable source code  Setup phase  Make MRNET intermediate nodes listen on pre-specified port  Allow arbitrary mrnet-ranks to connect and then set the Ids in the topology  Relaxing strict apriori-ranks can make setup easier  Setup in Job-Q environments difficult  Packetization API can be more flexible  Current API is usable and simple (var-arg printf style)  Composing a packet over a series of staggered stages difficult  Allow control over how buffering is performed  Important in a push-pull model as data injection points (rates) independent of data retrieval  Is not a problem in a purely pull model

31 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France31 TAUoverMRNET - Contrast TAUoverSupermon  Supermon (cluster-monitor) vs. MRNet (reduction-network)  Both light-weight transport substrates  Data format  Supermon: ascii s-expressions  MRNET: packets with packed (binary?) data  Supermon Setup  Loose topology  No support/help in setting up intermediate nodes  Assume Supermon is part of the environment  MRNET Setup  Strict topology  Better support for starting intermediate nodes  With/Without Back-End instantiation (TAU uses latter)  Multiple Front-Ends (or sinks) possible with Supermon  MRNET, front-end needs to program this functionality  No exisiting pluggable “filter” support in Supermon  Performing aggregation is more difficult with Supermon.  Supermons allows buffer-policy specification, MRNET does not

32 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France32 Future Work  Dyninst  Tighter integration of source and binary instrumentation  Conveying of source information to binary level  Enabling use of TAU’s advanced measurement features  Leveraging TAU’s performance mapping support  Want robust and portable binary rewriting tool  MRNet  Development of more performance filters  Evaluation of MRNet performance for different scenarios  Testing at large scale  Use in applications

33 TAUoverSupermon (ToS)EuroPar 2007, Rennes, France33 Acknowledgements  LANL - Ron?  ANL - cluster access?  LLNL - cluster access?  Funding Ack?


Download ppt "Aroon Nataraj, Matthew Sottile, Alan Morris, Allen D. Malony, Sameer Shende { anataraj, matt, amorris, malony,"

Similar presentations


Ads by Google