
1 TAU Meets Dyninst and MRNet: A Long-term and Short-term Affair
Allen D. Malony, Aroon Nataraj
{malony,anataraj}@cs.uoregon.edu
http://www.cs.uoregon.edu/research/tau
Department of Computer and Information Science
Performance Research Laboratory, University of Oregon

2 Performance Research Lab (TAU Meets Dyninst and MRNet, ParaDyn/Condor Week 2007)
- Dr. Sameer Shende, Senior scientist
- Alan Morris, Senior software engineer
- Wyatt Spear, Software engineer
- Scott Biersdorff, Software engineer
- Li Li, Ph.D. student
  - "Model-based Automatic Performance Diagnosis"
  - Ph.D. thesis, January 2007
- Kevin Huck, Ph.D. student
- Aroon Nataraj, Ph.D. student
  - Integrated kernel / application performance analysis
  - Scalable performance monitoring

3 Outline
- What is TAU?
  - Observation methodology
  - Instrumentation, measurement, analysis tools
- Our affair with Dyninst
  - Perspective
  - MPI applications
  - Integrated instrumentation
- Courting MRNet
  - Initial results
- Future work

4 TAU Performance System
- Tuning and Analysis Utilities (14+ year project effort)
- Performance system framework for HPC systems
  - Integrated, scalable, flexible, and parallel
  - Multiple parallel programming paradigms
  - Parallel performance mapping methodology
- Portable (open source) parallel performance system
  - Instrumentation, measurement, analysis, and visualization
  - Portable performance profiling and tracing facility
  - Performance data management and data mining
  - Scalable (very large) parallel performance analysis
- Partners
  - Research Center Jülich, LLNL, ANL, LANL, UTK

5 TAU Performance Observation Methodology
- Advocate event-based, direct performance observation
- Observe execution events
  - Types: control flow, state-based, user-defined
  - Modes: atomic, interval (enter/exit)
- Instrument program code directly (defines events)
  - Modify program code at points of event occurrence
  - Different code forms (source, library, object, binary, VM)
- Measurement code inserted (instantiates events)
  - Makes events visible
  - Measures performance related to event occurrence
- Contrast with event-based sampling
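The atomic/interval distinction above can be sketched in plain C++. This is a minimal illustration of the two event modes, not the real TAU API; the `Profiler` class and its method names are invented for this sketch.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Illustrative sketch of the two event modes described above:
// an atomic event records one value at a single point; an interval
// event brackets a region with enter/exit measurement code.
struct EventStats {
    long   count = 0;    // number of occurrences
    double sum   = 0.0;  // accumulated value (bytes, seconds, ...)
};

class Profiler {
public:
    // Atomic event: one observation, e.g. size of a memory allocation.
    void atomic(const std::string& name, double value) {
        EventStats& s = stats_[name];
        s.count += 1;
        s.sum   += value;
    }
    // Interval event: measurement code at the enter point...
    void enter(const std::string& name) {
        start_[name] = clock_::now();
    }
    // ...and at the exit point, accumulating inclusive time.
    void exit(const std::string& name) {
        std::chrono::duration<double> d = clock_::now() - start_[name];
        atomic(name, d.count());
    }
    const EventStats& get(const std::string& name) { return stats_[name]; }
private:
    using clock_ = std::chrono::steady_clock;
    std::map<std::string, clock_::time_point> start_;
    std::map<std::string, EventStats> stats_;
};
```

The key point of the slide is that the instrumented program itself makes the events visible at known code locations, in contrast with sampling, where the measurement system interrupts the program asynchronously.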

6 TAU Performance System Architecture (figure)

7 TAU Performance System Architecture (figure)

8 Multi-Level Instrumentation and Mapping
(figure: user-level abstractions and problem domain mapped through source code, preprocessor, compiler, linker, libraries, object code, executable, OS/VM, and runtime image, with instrumentation and performance data at each level)
- Multiple interfaces
- Information sharing
  - Between interfaces
- Event selection
  - Within levels
  - Between levels
- Mapping
  - Performance data is associated with high-level semantic abstractions

9 TAU Instrumentation Approach
- Support for standard program events
  - Routines, classes, and templates
  - Statement-level blocks and loops
- Support for user-defined events
  - Begin/end events ("user-defined timers")
  - Atomic events (e.g., size of memory allocated/freed)
  - Selection of event statistics
- Support definition of "semantic" entities for mapping
- Support for event groups (aggregation, selection)
- Instrumentation selection and optimization
- Instrumentation enabling/disabling and runtime throttling
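Runtime throttling, mentioned in the last bullet, can be sketched as follows: an event that fires very often but costs little per call is disabled to cap measurement overhead. The thresholds and names here are illustrative assumptions, not TAU's actual defaults or API.

```cpp
#include <cassert>

// Illustrative sketch of runtime event throttling: once an event has
// been called many times with a small average inclusive cost, further
// measurement of it is disabled.
struct ThrottledEvent {
    long   calls = 0;
    double inclusive_usec = 0.0;
    bool   enabled = true;

    void record(double usec, long call_threshold, double min_usec_per_call) {
        if (!enabled) return;            // throttled: skip measurement
        calls += 1;
        inclusive_usec += usec;
        if (calls > call_threshold &&
            inclusive_usec / calls < min_usec_per_call) {
            enabled = false;             // frequent + cheap => disable
        }
    }
};
```

The design trade-off is that a throttled event's totals freeze at the point it was disabled, which is usually acceptable for events whose cost is dominated by instrumentation overhead.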

10 TAU Instrumentation Mechanisms
- Source code
  - Manual (TAU API, TAU component API)
  - Automatic (robust)
    - C, C++, F77/90/95 (Program Database Toolkit (PDT))
    - OpenMP (directive rewriting (Opari), POMP2 spec)
- Object code
  - Pre-instrumented libraries (e.g., MPI using PMPI)
  - Statically linked and dynamically linked
- Executable code
  - Dynamic instrumentation (pre-execution) (DyninstAPI)
  - Virtual machine instrumentation (e.g., Java using JVMPI)
- TAU_COMPILER to automate the instrumentation process

11 TAU Measurement Approach
- Portable and scalable parallel profiling solution
  - Multiple profiling types and options
  - Event selection and control (enabling/disabling, throttling)
  - Online profile access and sampling
  - Online performance profile overhead compensation
- Portable and scalable parallel tracing solution
  - Trace translation to EPILOG, VTF3, and OTF
  - Trace streams (OTF) and hierarchical trace merging
- Robust timing and hardware performance support
  - Multiple counters (hardware, user-defined, system)
- Measurement specification separate from instrumentation

12 TAU Measurement Mechanisms
- Parallel profiling
  - Function-level, block-level, statement-level
  - Supports user-defined events and mapping events
  - TAU parallel profile stored (dumped) during execution
  - Support for flat, callgraph/callpath, phase profiling
  - Support for memory profiling (headroom, leaks)
- Tracing
  - All profile-level events
  - Inter-process communication events
  - Inclusion of multiple counter data in traced events
- Compile-time and runtime measurement selection

13 Performance Analysis and Visualization
- Analysis of parallel profile and trace measurements
- Parallel profile analysis
  - ParaProf: parallel profile analysis and presentation
  - ParaVis: parallel performance visualization package
  - Profile generation from trace data (tau2pprof)
- Performance data management framework (PerfDMF)
- Parallel trace analysis
  - Translation to VTF (V3.0), EPILOG, OTF formats
  - Integration with VNG (Technical University of Dresden)
- Online parallel analysis and visualization
- Integration with CUBE browser (KOJAK, UTK, FZJ)

14 TAU and DyninstAPI
- TAU has had a long-term affair with Dyninst technology
  - Dyninst offered a binary-level instrumentation tool
  - Could help in cases where the source code is unavailable
  - Could allow instrumentation without recompilation
- TAU requirements
  - Instrument HPC applications with TAU measurements
  - Multiple paradigms, languages, compilers, platforms
  - Portability
- Tested Dyninst features as they were released
- Issues
  - MPI, threading, availability, binary rewriting
- It has been an on/off, open relationship

15 Using DyninstAPI
- TAU uses DyninstAPI for binary code patching
  - Pre-execution, versus at any point during execution
- Methods
  - Runtime instrumentation before the application begins
  - Binary rewriting
- tau_run (mutator)
  - Loads TAU measurement library
  - Uses DyninstAPI to instrument mutatee
  - Can apply instrumentation selection

16 Using DyninstAPI with TAU
- Configure TAU with Dyninst and build <taudir>/<arch>/bin/tau_run:
  % configure -dyninst=/usr/local/dyninstAPI-5.0.1
  % make clean; make install
- tau_run command:
  % tau_run [<options>] [-Xrun<library>] [-f <selectfile>] [-v] <program>
- Instrument all events with the TAU measurement library and execute:
  % tau_run klargest
- Instrument all events with TAU+PAPI measurements (libTAUsh-papi.so) and execute:
  % tau_run -XrunTAUsh-papi a.out
- Instrument only events specified in the select.tau instrumentation specification file and execute:
  % tau_run -f select.tau a.out
- Binary rewriting:
  % tau_run -o a.inst.out a.out

17 Runtime Instrumentation with DyninstAPI
- tau_run loads TAU's shared object into the mutatee's address space
- Selects routines to be instrumented
- Calls DyninstAPI OneTimeCode
  - Registers a startup routine
  - Passes a string of routine (event) names, e.g. "main | foo | bar"
  - IDs assigned to events
- TAU's hooks for entry/exit used for instrumentation
  - Invoked during execution
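The startup handoff above — a single '|'-separated string of routine names turned into numbered events — can be sketched like this. The function name and exact format are assumptions for illustration; the real TAU/Dyninst interfaces differ.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Illustrative sketch: parse a routine-name string such as
// "main | foo | bar" (as passed through a DyninstAPI one-time code
// snippet) and assign each event an integer ID in order of appearance.
std::map<std::string, int> register_events(const std::string& spec) {
    std::map<std::string, int> ids;
    std::stringstream ss(spec);
    std::string name;
    int next_id = 0;
    while (std::getline(ss, name, '|')) {
        // trim the spaces left around each " | " separator
        size_t b = name.find_first_not_of(' ');
        size_t e = name.find_last_not_of(' ');
        if (b == std::string::npos) continue;   // skip empty tokens
        ids[name.substr(b, e - b + 1)] = next_id++;
    }
    return ids;
}
```

Once IDs are assigned, the entry/exit hooks only need to carry the small integer ID at runtime rather than the routine name string.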

18 Using DyninstAPI with MPI
- One mutator per mutatee
- Each mutator instruments its mutatee prior to execution
- No central control
- Each mutatee writes its own performance data to disk

% mpirun -np 4 ./run.sh
% cat run.sh
#!/bin/sh
/usr/local/tau-2.x/x86_64/bin/tau_run /a.out

19 Binary Rewriting with TAU
- Rewrite the binary ("save the world") before executing
- No central control
- No need to re-instrument the code on all backend nodes
- Each mutatee writes its own performance data to disk

% tau_run -o a.inst.out a.out
% cd _dyninstsaved0
% mpirun -np 4 ./a.inst.out

20 Example
- EK-SIMPLE benchmark
  - CFD benchmark by Andy Shaw, Kofi Fynn
  - Adapted by Brad Chamberlain
- Experimentation
  - Run on 4 CPUs
  - Runtime instrumentation using DyninstAPI and tau_run
  - Wallclock-time and CPU-time experiments
  - Profiling and tracing modes of measurement
  - Performance data examined with ParaProf and Vampir

21 ParaProf - Main Window (4 CPUs)

22 ParaProf - Individual Profile (n,c,t 0,0,0)

23 ParaProf - Statistics Table (Mean)

24 ParaProf - net_recv (MPI rank 1)

25 Integrated Instrumentation (Source + Dyninst)
- Use source instrumentation for some events
- Use Dyninst for other events
- Access the same TAU measurement infrastructure
- Demonstrated on matrix multiplication example
  - Compare regular versus strip-mining versions
(figures: source-instrumented profile; source + binary instrumented profile)

26 TAU-over-MRNet (ToM) Project
MRNet as a Transport Substrate in TAU
(Reporting early work done in the last week.)

27 TAU Transport Substrate - Motivations
- Transport substrate
  - Enables movement of measurement-related data
  - TAU, in the past, has relied on a shared file system
- Some modes of performance observation
  - Offline / post-mortem observation and analysis
    - Least requirements for a specialized transport
  - Online observation
    - Long-running applications, especially at scale
    - Dumping to the file system can be suboptimal
  - Online observation with feedback into the application
    - In addition, requires that the transport is bi-directional
- Performance observation problems and requirements are a function of the mode

28 Requirements
- Improve performance of transport
  - NFS can be slow and variable
  - Specialization and remoting of FS operations to the front-end
- Data reduction
  - At scale, the cost of moving all data is too high
  - Sample in different domains (node-wise, event-wise)
- Control
  - Selection of events, measurement technique, target nodes
  - What data to output, how often, and in what form?
  - Feedback into the measurement system, feedback into the application
- Online, distributed processing of generated performance data
  - Use compute resources of transport nodes
  - Global performance analyses within the topology
  - Distributed statistical analyses
    - Easy (mean, variance, histogram); challenging (clustering)

29 Approach and First Prototype
- Measurement and measured-data transport are separate concerns
  - No such distinction in TAU previously
  - Created an abstraction to separate and hide the transport: TauOutput
- Did not create a custom transport for TAU
  - Use existing monitoring/transport capabilities
  - Supermon (Sottile and Minnich, LANL)
  - Piggy-backed TAU performance data on Supermon channels
  - Correlate system-level metrics from Supermon with TAU application performance data

30 Rationale
- Moved away from NFS
- Separation of concerns
  - Scalability, portability, robustness addressed independently of TAU
  - Re-use existing technologies where appropriate
- Multiple bindings
  - Use different solutions best suited to a particular platform
- Implementation speed
  - Easy and fast to create an adapter that binds to an existing transport
  - MRNet support was added in about a week
    - Says a lot about the usability of MRNet

31 ToM Architecture
- TAU components (over the MRNet API)
  - Front-end (FE)
  - Filters
  - Back-end (BE)
- No-back-end-instantiation mode
- Push-pull model of data retrieval
  - No daemon
  - Instrumented application contains TAU and the back-end
- Two channels (streams)
  - Data (BE to FE)
  - Control (FE to BE)

32 ToM Architecture (continued)
- Application calls into TAU
  - Per-iteration explicit call to output routine
  - Periodic calls using alarm
- TauOutput object invoked
  - Configuration-specific: compile time or runtime
  - One per thread
- TauOutput mimics a subset of FS-style operations
  - Avoids changes to TAU code
  - If required, the rest of TAU can be made aware of the output type
- Non-blocking recv for control
  - Back-end pushes
  - Sink pulls
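The TauOutput idea above — the measurement layer keeps making file-system-style calls while the transport behind them is swapped out — can be sketched with a small abstract interface. Class and method names are illustrative assumptions, not TAU's actual code.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Illustrative sketch of the TauOutput abstraction: measurement code
// sees only FS-style operations (open/write/close); the concrete
// backend may be a file, Supermon, or an MRNet stream.
class TauOutput {
public:
    virtual ~TauOutput() = default;
    virtual void open(const std::string& name) = 0;
    virtual void write(const std::string& data) = 0;
    virtual void close() = 0;
};

// Stand-in for a transport backend (e.g. an MRNet back-end that
// buffers data until the sink pulls it).
class BufferOutput : public TauOutput {
public:
    void open(const std::string& name) override { buf << name << ":"; }
    void write(const std::string& data) override { buf << data; }
    void close() override { closed = true; }
    std::string contents() const { return buf.str(); }
    bool closed = false;
private:
    std::ostringstream buf;
};

// Measurement code depends only on the abstract interface, so no TAU
// code changes are needed when the transport changes.
void dump_profile(TauOutput& out) {
    out.open("profile.0.0.0");
    out.write("MPI_Send 42");
    out.close();
}
```

This is the "avoids changes to TAU code" point: only the object bound behind the interface differs between the NFS and ToM configurations.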

33 Simple Example (NPB LU - A, per 5 iterations): exclusive time (figure)

34 Simple Example (NPB LU - A, per 5 iterations): number of calls (figure)

35 Comparing ToM with NFS
- TAUoverNFS versus TAUoverMRNet
  - 250 ssor iterations
  - 251 TAU_DB_DUMP operations
- Significant advantages with a specialized transport substrate
  - Similar results when using Supermon as the substrate
  - Remoting of expensive FS metadata operations to the front-end

NPB LU (A), 32 processors | Over NFS (secs) | Over MRNet (secs) | % Improvement over NFS
Total Runtime             | 45.51           | 36.50             | 19.79
TAU_DB_DUMP               | 8.58            | 0.366             | 95.73

36 Playing with Filters
- Downstream (FE to BE) multicast path
  - Even without filters, very useful for control
- Data-reduction filters are integral to the upstream path (BE to FE)
  - Without filters, loss-less data is reproduced D-1 times
  - Unnecessarily large cost to the network
- Filter 1: Random Sampling Filter
  - Very simplistic data reduction by node-wise sampling
  - Accepts or rejects packets probabilistically
  - TAU front-end can control the probability P(accept)
    - P(accept) = K/N (N = number of leaves, K a user constant)
    - Bounds the number of packets per round to K
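The sampling filter's core decision can be sketched as follows: each upstream packet is kept with probability P(accept) = K/N, so on average about K of the N back-end packets survive each round. This is an illustrative stand-in; real MRNet filters are compiled plugins with a different interface.

```cpp
#include <random>
#include <vector>

// Illustrative sketch of the random sampling filter: keep each packet
// with probability K/N, where N is the number of incoming packets
// (one per leaf) and K is a user constant chosen by the front-end.
std::vector<int> sample_round(const std::vector<int>& packets,
                              int K, std::mt19937& rng) {
    if (packets.empty()) return {};
    double p_accept = static_cast<double>(K) / packets.size();
    std::bernoulli_distribution keep(p_accept);
    std::vector<int> out;
    for (int pkt : packets)
        if (keep(rng)) out.push_back(pkt);   // accept or reject
    return out;
}
```

Because acceptance is independent per packet, the bound of K packets per round holds only in expectation, which matches the slide's "very simplistic" characterization.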

37 Filter 1 in Action (Ring application)
- Compare different P(accept) values: 1, 1/4, 1/16
- Front-end unable to keep up
  - Queuing delay propagated back

38 Other Filters
- Statistics filter
  - Reduce raw performance data to a smaller set of statistics
  - Distribute these statistical analyses from the front-end to the filters
  - Simple measures: mean, std. dev., histograms
  - More sophisticated measures: distributed clustering
- Controlling filters
  - No direct way to control upstream filters (not on the control path)
  - Recommended solution
    - Place upstream filters that work in concert with downstream filters to share control information
    - Requires synchronization of state between upstream and downstream filters
  - Our Echo hack
    - Back-ends transparently echo filter-control packets back upstream, where they are interpreted by the filters
    - Easier to implement
    - Control response time may be greater
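The "simple measures" a statistics filter can distribute are sketched below: each filter node reduces its children's raw data to a (count, mean, M2) triple, and triples merge pairwise up the tree, so the front-end recovers the global mean and variance without ever seeing the raw data. The struct and function names are assumptions for this sketch; the merge formula is the standard parallel combination of partial moments.

```cpp
#include <cmath>

// Partial statistics carried up the reduction tree.
struct Stats {
    long   n    = 0;
    double mean = 0.0;
    double m2   = 0.0;   // sum of squared deviations from the mean
};

// Per-node reduction of raw samples (Welford's online update).
Stats reduce(const double* x, long n) {
    Stats s;
    for (long i = 0; i < n; ++i) {
        s.n += 1;
        double delta = x[i] - s.mean;
        s.mean += delta / s.n;
        s.m2 += delta * (x[i] - s.mean);
    }
    return s;
}

// Merge two children's partial statistics at an interior filter node.
Stats merge(const Stats& a, const Stats& b) {
    if (a.n == 0) return b;
    if (b.n == 0) return a;
    Stats r;
    r.n = a.n + b.n;
    double delta = b.mean - a.mean;
    r.mean = a.mean + delta * b.n / r.n;
    r.m2 = a.m2 + b.m2
         + delta * delta * static_cast<double>(a.n) * b.n / r.n;
    return r;
}
```

Mean and variance merge exactly this way at any tree shape, which is why the slide calls them "easy"; clustering has no such closed-form pairwise merge, which is why it is "challenging".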

39 Feedback / Suggestions
- Easy to integrate with MRNet
  - Good examples and documentation, readable source code
- Setup phase
  - Make MRNet intermediate nodes listen on a pre-specified port
  - Allow arbitrary MRNet ranks to connect and then set the IDs in the topology
  - Relaxing strict a-priori ranks can make setup easier
  - Setup in job-queue environments is difficult
- Packetization API could be more flexible
  - Current API is usable and simple (var-arg printf style)
  - Composing a packet over a series of staggered stages is difficult
- Allow control over how buffering is performed
  - Important in a push-pull model, as data injection points (rates) are independent of data retrieval
  - Not a problem in a purely pull model

40 TAUoverMRNet - Contrast with TAUoverSupermon
- Supermon (cluster monitor) vs. MRNet (reduction network)
  - Both are light-weight transport substrates
- Data format
  - Supermon: ASCII s-expressions
  - MRNet: packets with packed (binary?) data
- Supermon setup
  - Loose topology
  - No support/help in setting up intermediate nodes
  - Assumes Supermon is part of the environment
- MRNet setup
  - Strict topology
  - Better support for starting intermediate nodes
  - With/without back-end instantiation (TAU uses the latter)
- Multiple front-ends (or sinks) possible with Supermon
  - With MRNet, the front-end needs to program this functionality
- No existing pluggable "filter" support in Supermon
  - Performing aggregation is more difficult with Supermon
- Supermon allows buffer-policy specification; MRNet does not

41 Future Work
- Dyninst
  - Tighter integration of source and binary instrumentation
  - Conveying source information to the binary level
  - Enabling use of TAU's advanced measurement features
  - Leveraging TAU's performance mapping support
  - Want a robust and portable binary rewriting tool
- MRNet
  - Development of more performance filters
  - Evaluation of MRNet performance in different scenarios
  - Testing at large scale
  - Use in applications

