Download presentation
Presentation is loading. Please wait.
Published byKristian Atkinson Modified over 9 years ago
1
Allen D. Malony, Aroon Nataraj {malony,anataraj}@cs.uoregon.edu http://www.cs.uoregon.edu/research/tau Department of Computer and Information Science Performance Research Laboratory University of Oregon TAU Meets Dyninst and MRNet: A Long-term and Short-term Affair
2
TAU Meets Dyninst and MRNetParaDyn/Condor Week 20072 Performance Research Lab Dr. Sameer Shende, Senior scientist Alan Morris, Senior software engineer Wyatt Spear, Software engineer Scott Biersdorff, Software engineer Li Li, Ph.D. student “Model-based Automatic Performance Diagnosis” Ph.D. thesis, January 2007 Kevin Huck, Ph.D. student Aroon Nataraj, Ph.D. student Integrated kernel / application performance analysis Scalable performance monitoring
3
TAU Meets Dyninst and MRNetParaDyn/Condor Week 20073 Outline What is TAU? Observation methodology Instrumentation, measurement, analysis tools Our affair with Dyninst Perspective MPI applications Integrated instrumentation Courting MRNet Initial results Future work
4
TAU Meets Dyninst and MRNetParaDyn/Condor Week 20074 TAU Performance System Tuning and Analysis Utilities (14+ year project effort) Performance system framework for HPC systems Integrated, scalable, flexible, and parallel Multiple parallel programming paradigms Parallel performance mapping methodology Portable (open source) parallel performance system Instrumentation, measurement, analysis, and visualization Portable performance profiling and tracing facility Performance data management and data mining Scalable (very large) parallel performance analysis Partners Research Center Jülich, LLNL, ANL, LANL, UTK
5
TAU Meets Dyninst and MRNetParaDyn/Condor Week 20075 TAU Performance Observation Methodology Advocate event-based, direct performance observation Observe execution events Types: control flow, state-based, user-defined Modes: atomic, interval (enter/exit) Instrument program code directly (defines events) Modify program code at points of event occurrence Different code forms (source, library, object, binary, VM) Measurement code inserted (instantiates events) Make events visible Measures performance related to event occurrence Contrast with event-based sampling
6
TAU Meets Dyninst and MRNetParaDyn/Condor Week 20076 TAU Performance System Architecture
7
TAU Meets Dyninst and MRNetParaDyn/Condor Week 20077 TAU Performance System Architecture
8
TAU Meets Dyninst and MRNetParaDyn/Condor Week 20078 User-level abstractions problem domain source code object codelibraries instrumentation executable runtime image compiler linkerOS VM instrumentation performance data run preprocessor Multi-Level Instrumentation and Mapping Multiple interfaces Information sharing Between interfaces Event selection Within levels Between levels Mapping Performance data is associated with high- level semantic abstractions
9
TAU Meets Dyninst and MRNetParaDyn/Condor Week 20079 TAU Instrumentation Approach Support for standard program events Routines, classes and templates Statement-level blocks and loops Support for user-defined events Begin/End events (“user-defined timers”) Atomic events (e.g., size of memory allocated/freed) Selection of event statistics Support definition of “semantic” entities for mapping Support for event groups (aggregation, selection) Instrumentation selection and optimization Instrumentation enabling/disabling and runtime throttling
10
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200710 TAU Instrumentation Mechanisms Source code Manual (TAU API, TAU component API) Automatic (robust) C, C++, F77/90/95 (Program Database Toolkit (PDT)) OpenMP (directive rewriting (Opari), POMP2 spec) Object code Pre-instrumented libraries (e.g., MPI using PMPI) Statically-linked and dynamically-linked Executable code Dynamic instrumentation (pre-execution) (DyninstAPI) Virtual machine instrumentation (e.g., Java using JVMPI) TAU_COMPILER to automate instrumentation process
11
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200711 TAU Measurement Approach Portable and scalable parallel profiling solution Multiple profiling types and options Event selection and control (enabling/disabling, throttling) Online profile access and sampling Online performance profile overhead compensation Portable and scalable parallel tracing solution Trace translation to EPILOG, VTF3, and OTF Trace streams (OTF) and hierarchical trace merging Robust timing and hardware performance support Multiple counters (hardware, user-defined, system) Measurement specification separate from instrumentation
12
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200712 TAU Measurement Mechanisms Parallel profiling Function-level, block-level, statement-level Supports user-defined events and mapping events TAU parallel profile stored (dumped) during execution Support for flat, callgraph/callpath, phase profiling Support for memory profiling (headroom, leaks) Tracing All profile-level events Inter-process communication events Inclusion of multiple counter data in traced events Compile-time and runtime measurement selection
13
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200713 Performance Analysis and Visualization Analysis of parallel profile and trace measurement Parallel profile analysis ParaProf: parallel profile analysis and presentation ParaVis: parallel performance visualization package Profile generation from trace data (tau2pprof) Performance data management framework (PerfDMF) Parallel trace analysis Translation to VTF (V3.0), EPILOG, OTF formats Integration with VNG (Technical University of Dresden) Online parallel analysis and visualization Integration with CUBE browser (KOJAK, UTK, FZJ)
14
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200714 TAU and DyninstAPI TAU has had a long-term affair Dyninst technology Dyninst offered a binary-level instrumentation tool Could help in cases when the source code is unavailable Could allow instrumentation without recompilation TAU requirements Instrument HPC applications with TAU measurements Multiple paradigms, languages, compilers, platforms Portability Tested Dyninst features as they were released Issues MPI, threading, availability, binary rewriting It been on/off open relationship
15
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200715 Using DyninstAPI TAU uses DyninstAPI for binary code patching Pre-execution versus at any point during execution Methods runtime before the application begins binary rewriting tau_run (mutator) Loads TAU measurement library Uses DyninstAPI to instrument mutatee Can apply instrumentation selection
16
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200716 Using DyninstAPI with TAU Configure TAU with Dyninst and build / /bin/tau_run % configure –dyninst=/usr/local/dyninstAPI-5.0.1 % make clean; make install tau_run command % tau_run [ ] [-Xrun ][-f ] [-v] Instrument all events with TAU measurement library and execute: % tau_run klargest Instrument all events with TAU+PAPI measurements (libTAUsh-papi.so) and execute: % tau_run -XrunTAUsh-papi a.out Instruments only events specified in select.tau instrumentation specification file and execute: % tau_run -f select.tau a.out Binary rewriting: % tau_run –o a.inst.out a.out
17
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200717 Runtime Instrumentation with DyninstAPI tau_run loads TAU’s shared object in the address space Selects routines to be instrumented Calls DyninstAPI OneTimeCode Register a startup routine Pass a string of routine (event) names “main | foo | bar” IDs assigned to events TAU’s hooks for entry/exit used for instrumentation Invoked during execution
18
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200718 Using DyninstAPI with MPI One mutator per mutatee Each mutator instruments mutatee prior to execution No central control Each mutatee writes its own performance data to disk % mpirun -np 4./run.sh % cat run.sh #!/bin/sh /usr/local/tau-2.x/x86_64/bin/tau_run /a.out
19
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200719 Binary Rewriting with TAU Rewrite binary (Save the world) before executing No central control No need to re-instrument the code on all backend nodes Each mutatee writes its own performance data to disk % tau_run -o a.inst.out a.out % cd _dyninstsaved0 % mpirun -np 4./a.inst.out
20
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200720 Example EK-SIMPLE benchmark CFD benchmark Andy Shaw, Kofi Fynn Adapted by Brad Chamberlain Experimentation Run on 4 cpus Runtime instrumentation using DyninstAPI and tau_run Measure wallclock time and CPU time experiments Profiling and tracing modes of measurement Look at performance data with Paraprof and Vampir
21
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200721 ParaProf - Main Window (4 cpus)
22
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200722 ParaProf - Indivdual Profile (n,c,t 0,0,0)
23
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200723 ParaProf - Statistics Table (Mean)
24
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200724 ParaProf - net_recv (MPI rank 1)
25
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200725 Integrated Instrumentation (Source + Dyninst) Use source instrumentation for some events Use Dyninst for other events Access same TAU measurement infrastructure Demonstrate on matrix multiplication example Compare regular versus strip-mining versions Source instrumented Source + binary instrumented
26
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200726 TAU-over-MRNET (ToM) Project MRNET as a Transport Substrate in TAU (Reporting early work done in the last week.)
27
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200727 TAU Transport Substrate - Motivations Transport Substrate Enables movement of measurement-related data TAU, in the past, has relied on shared file-system Some Modes of Performance Observation Offline / Post-mortem observation and analysis least requirements for a specialized transport Online observation long running applications, especially at scale dumping to file-system can be suboptimal Online observation with feedback into application in addition, requires that the transport is bi-directional Performance observation problems and requirements are a function of the mode
28
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200728 Requirements Improve performance of transport NFS can be slow and variable Specialization and remoting of FS-operations to front-end Data Reduction At scale, cost of moving data too high Sample in different domain (node-wise, event-wise) Control Selection of events, measurement technique, target nodes What data to output, how often and in what form? Feedback into the measurement system, feedback into application Online, distributed processing of generated performance data Use compute resource of transport nodes Global performance analyses within the topology Distribute statistical analyses easy (mean, variance, histogram), challenging (clustering)
29
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200729 Approach and First Prototype Measurement and measured data transport are separate No such distinction in TAU Created abstraction to separate and hide transport TauOutput Did not create a custom transport for TAU Use existing monitoring/transport capabilities Supermon (Sottile and Minnich, LANL) Piggy-backed TAU performance data on Supermon channels Correlate system-level metrics from Supermon with TAU application performance data
30
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200730 Rationale Moved away from NFS Separation of concerns Scalability, portability, robustness Addressed independent of TAU Re-use existing technologies where appropriate Multiple bindings Use different solutions best suited to particular platform Implementation speed Easy, fast to create adapter that binds to existing transport MRNET support was added in about a week Says a lot about usability of MRNET
31
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200731 ToM Architecture TAU Components* Front-End (FE) Filters Back-End (BE) * Over MRNet API No-Backend-Instantiation mode Push-Pull model of data retrieval No daemon Instrumented application contains TAU and Back-End Two channels (streams) Data (BE to FE) Control (FE to BE)
32
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200732 ToM Architecture Applicaton calls into TAU Per-Iteration explicit call to output routine Periodic calls using alarm TauOutput object invoked Configuration specific: compile or runtime One per thread TauOutput mimics subset of FS-style operations Avoids changes to TAU code If required rest of TAU can be made aware of output type Non-blocking recv for control Back-end pushes Sink pulls
33
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200733 Simple Example (NPB LU - A, Per-5 iterations) Exclusive time
34
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200734 Simple Example (NPB LU - A, Per-5 iterations) Number of calls
35
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200735 Comparing ToM with NFS TAUoverNFS versus TAUoverMRNET 250 ssor iterations 251 TAU_DB_DUMP operations Significant advantages with specialized transport substrate Similar when using Supermon as the substrate Remoting of expensive FS meta-data operations to Front-End NPB LU (A) 32 Processors Over NFS (secs) Over MRNET (secs) % Improvement over NFS Total Runtime45.5136.5019.79 TAU_DB_DUMP8.580.36695.73
36
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200736 Playing with Filters Downstream (FE to BE) multicast path Even without filters, is very useful for control Data Reduction Filters are integral to Upstream path (BE to FE) W/O filters loss-less data reproduced D-1 times Unnecessary large cost to network Filter 1: Random Sampling Filter Very simplistic data reduction by node-wise sampling Accepts or Rejects packets probabilistically TAU Front-End can control probability P(accept) P(accept)=K/N (N = # leafs, K is user constant) Bounds number of packets per-round to K
37
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200737 Filter 1 in Action (Ring application) Compare different P(accept) values 1, 1/4, 1/16 Front-End unable to keep up Queuing delay propagated back
38
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200738 Other Filters Statistics filter Reduce raw performance data to smaller set of statistics Distribute these statistical analyses from Front-End to the filters Simple measures - mean, std.dev, histograms More sophisticated measures - distributed clustering Controlling filters No direct way to control Upstream-filters not on control path Recommended solution place upstream filters that work in concert with downstream filters to share control information requires synchronization of state between upstream and downstream filters Our Echo hack Back-Ends transparently echo Filter-Control packets back upstream this is then interpreted by the filters easier to implement control response time may be greater
39
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200739 Feedback / Suggestions Easy to integrate with MRNET Good examples documentation, readable source code Setup phase Make MRNET intermediate nodes listen on pre-specified port Allow arbitrary mrnet-ranks to connect and then set the Ids in the topology Relaxing strict apriori-ranks can make setup easier Setup in Job-Q environments difficult Packetization API can be more flexible Current API is usable and simple (var-arg printf style) Composing a packet over a series of staggered stages difficult Allow control over how buffering is performed Important in a push-pull model as data injection points (rates) independent of data retrieval Is not a problem in a purely pull model
40
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200740 TAUoverMRNET - Contrast TAUoverSupermon Supermon (cluster-monitor) vs. MRNet (reduction-network) Both light-weight transport substrates Data format Supermon: ascii s-expressions MRNET: packets with packed (binary?) data Supermon Setup Loose topology No support/help in setting up intermediate nodes Assume Supermon is part of the environment MRNET Setup Strict topology Better support for starting intermediate nodes With/Without Back-End instantiation (TAU uses latter) Multiple Front-Ends (or sinks) possible with Supermon MRNET, front-end needs to program this functionality No exisiting pluggable “filter” support in Supermon Performing aggregation is more difficult with Supermon. Supermons allows buffer-policy specification, MRNET does not
41
TAU Meets Dyninst and MRNetParaDyn/Condor Week 200741 Future Work Dyninst Tighter integration of source and binary instrumentation Conveying of source information to binary level Enabling use of TAU’s advanced measurement features Leveraging TAU’s performance mapping support Want robust and portable binary rewriting tool MRNet Development of more performance filters Evaluation of MRNet performance for different scenarios Testing at large scale Use in applications
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.