
TAU Meets Dyninst and MRNet: A Long-term and Short-term Affair
Allen D. Malony, Aroon Nataraj
Department of Computer and Information Science
Performance Research Laboratory, University of Oregon

Performance Research Lab
- Dr. Sameer Shende, Senior scientist
- Alan Morris, Senior software engineer
- Wyatt Spear, Software engineer
- Scott Biersdorff, Software engineer
- Li Li, Ph.D. student
  - "Model-based Automatic Performance Diagnosis" (Ph.D. thesis, January 2007)
- Kevin Huck, Ph.D. student
- Aroon Nataraj, Ph.D. student
  - Integrated kernel / application performance analysis
  - Scalable performance monitoring

Outline
- What is TAU?
  - Observation methodology
  - Instrumentation, measurement, analysis tools
- Our affair with Dyninst
  - Perspective
  - MPI applications
  - Integrated instrumentation
- Courting MRNet
  - Initial results
- Future work

TAU Performance System
- Tuning and Analysis Utilities (14+ year project effort)
- Performance system framework for HPC systems
  - Integrated, scalable, flexible, and parallel
  - Multiple parallel programming paradigms
  - Parallel performance mapping methodology
- Portable (open source) parallel performance system
  - Instrumentation, measurement, analysis, and visualization
  - Portable performance profiling and tracing facility
  - Performance data management and data mining
  - Scalable (very large) parallel performance analysis
- Partners
  - Research Center Jülich, LLNL, ANL, LANL, UTK

TAU Performance Observation Methodology
- Advocate event-based, direct performance observation
- Observe execution events
  - Types: control flow, state-based, user-defined
  - Modes: atomic, interval (enter/exit)
- Instrument program code directly (defines events)
  - Modify program code at points of event occurrence
  - Different code forms (source, library, object, binary, VM)
- Measurement code inserted (instantiates events)
  - Makes events visible
  - Measures performance related to event occurrence
- Contrast with event-based sampling

TAU Performance System Architecture

TAU Performance System Architecture

Multi-Level Instrumentation and Mapping
[Diagram: instrumentation levels from user-level problem-domain abstractions through source code (preprocessor), object code and libraries (compiler), executable (linker), OS, VM, and runtime image, with performance data produced from a run]
- Multiple interfaces
- Information sharing
  - Between interfaces
- Event selection
  - Within levels
  - Between levels
- Mapping
  - Performance data is associated with high-level semantic abstractions

TAU Instrumentation Approach
- Support for standard program events
  - Routines, classes, and templates
  - Statement-level blocks and loops
- Support for user-defined events
  - Begin/end events ("user-defined timers")
  - Atomic events (e.g., size of memory allocated/freed)
  - Selection of event statistics
- Support definition of "semantic" entities for mapping
- Support for event groups (aggregation, selection)
- Instrumentation selection and optimization
- Instrumentation enabling/disabling and runtime throttling
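For illustration, the manual source API for an interval timer and an atomic event looks roughly like this (a minimal C++ sketch; the routine and event names are invented):

  #include <TAU.h>   // TAU measurement API

  void compute(int n) {
      // Interval event: scoped user-defined timer (stopped at end of scope)
      TAU_PROFILE("compute", "void (int)", TAU_USER);

      // Atomic event: record a value at a point, e.g., bytes allocated
      TAU_REGISTER_EVENT(ev, "bytes allocated");
      TAU_EVENT(ev, n * sizeof(double));

      // ... actual work ...
  }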

TAU Instrumentation Mechanisms
- Source code
  - Manual (TAU API, TAU component API)
  - Automatic (robust)
    - C, C++, F77/90/95 (Program Database Toolkit (PDT))
    - OpenMP (directive rewriting (Opari), POMP2 spec)
- Object code
  - Pre-instrumented libraries (e.g., MPI using PMPI)
  - Statically linked and dynamically linked
- Executable code
  - Dynamic instrumentation (pre-execution) (DyninstAPI)
  - Virtual machine instrumentation (e.g., Java using JVMPI)
- TAU_COMPILER to automate the instrumentation process
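The TAU_COMPILER wrapper can be dropped into an existing build; a sketch assuming a TAU installation under /usr/local/tau with an MPI+PDT configuration (the stub makefile name and paths are illustrative):

  # Include a TAU stub makefile, which defines $(TAU_COMPILER)
  include /usr/local/tau/x86_64/lib/Makefile.tau-mpi-pdt

  # Prefixing the real compiler runs PDT-based source instrumentation
  # before compilation, then links the TAU measurement library
  CXX = $(TAU_COMPILER) mpicxx

  app: app.cpp
          $(CXX) -o app app.cpp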

TAU Measurement Approach
- Portable and scalable parallel profiling solution
  - Multiple profiling types and options
  - Event selection and control (enabling/disabling, throttling)
  - Online profile access and sampling
  - Online performance profile overhead compensation
- Portable and scalable parallel tracing solution
  - Trace translation to EPILOG, VTF3, and OTF
  - Trace streams (OTF) and hierarchical trace merging
- Robust timing and hardware performance support
- Multiple counters (hardware, user-defined, system)
- Measurement specification separate from instrumentation
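Because measurement specification is separate from instrumentation, measurement options were chosen when building TAU; a sketch with TAU 2.x-era configure flags (paths and flag set illustrative):

  % ./configure -papi=/usr/local/papi -MULTIPLECOUNTERS -PROFILECALLPATH
  % make install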

TAU Measurement Mechanisms
- Parallel profiling
  - Function-level, block-level, statement-level
  - Supports user-defined events and mapping events
  - TAU parallel profile stored (dumped) during execution
  - Support for flat, callgraph/callpath, phase profiling
  - Support for memory profiling (headroom, leaks)
- Tracing
  - All profile-level events
  - Inter-process communication events
  - Inclusion of multiple counter data in traced events
- Compile-time and runtime measurement selection

Performance Analysis and Visualization
- Analysis of parallel profile and trace measurement
- Parallel profile analysis
  - ParaProf: parallel profile analysis and presentation
  - ParaVis: parallel performance visualization package
  - Profile generation from trace data (tau2pprof)
- Performance data management framework (PerfDMF)
- Parallel trace analysis
  - Translation to VTF (V3.0), EPILOG, OTF formats
  - Integration with VNG (Technical University of Dresden)
- Online parallel analysis and visualization
- Integration with CUBE browser (KOJAK, UTK, FZJ)

TAU and DyninstAPI
- TAU has had a long-term affair with Dyninst technology
  - Dyninst offered a binary-level instrumentation tool
  - Could help in cases where the source code is unavailable
  - Could allow instrumentation without recompilation
- TAU requirements
  - Instrument HPC applications with TAU measurements
  - Multiple paradigms, languages, compilers, platforms
  - Portability
- Tested Dyninst features as they were released
- Issues
  - MPI, threading, availability, binary rewriting
- It has been an on/off, open relationship

Using DyninstAPI
- TAU uses DyninstAPI for binary code patching
  - Pre-execution, versus at any point during execution
- Methods
  - Runtime, before the application begins
  - Binary rewriting
- tau_run (mutator)
  - Loads the TAU measurement library
  - Uses DyninstAPI to instrument the mutatee
  - Can apply instrumentation selection

Using DyninstAPI with TAU
Configure TAU with Dyninst and build <taudir>/<arch>/bin/tau_run:
  % configure -dyninst=/usr/local/dyninstAPI
  % make clean; make install
tau_run command:
  % tau_run [<application>] [-Xrun<library>] [-f <instrumentation file>] [-v]
Instrument all events with the TAU measurement library and execute:
  % tau_run klargest
Instrument all events with TAU+PAPI measurements (libTAUsh-papi.so) and execute:
  % tau_run -XrunTAUsh-papi a.out
Instrument only events specified in the select.tau instrumentation specification file and execute:
  % tau_run -f select.tau a.out
Binary rewriting:
  % tau_run -o a.inst.out a.out
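A selective instrumentation file lists events to include or exclude by name; a minimal sketch of what select.tau might contain (the routine signatures are invented for illustration):

  BEGIN_EXCLUDE_LIST
  void quicksort(int *, int, int)
  void interchange(int *, int *)
  END_EXCLUDE_LIST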

Runtime Instrumentation with DyninstAPI
- tau_run loads TAU's shared object into the address space
- Selects routines to be instrumented
- Calls DyninstAPI oneTimeCode
  - Registers a startup routine
  - Passes a string of routine (event) names, e.g., "main | foo | bar"
  - IDs assigned to events
- TAU's hooks for entry/exit used for instrumentation
  - Invoked during execution
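In DyninstAPI terms, the mutator's job looks roughly like the following sketch (the measurement-library and hook names, libTAU.so, TauInitCode, and TauRoutineEntry, are assumptions standing in for TAU's actual symbols):

  #include "BPatch.h"
  #include "BPatch_process.h"
  #include "BPatch_image.h"
  #include "BPatch_function.h"
  #include "BPatch_point.h"
  #include "BPatch_snippet.h"

  BPatch bpatch;

  int main(int argc, const char *argv[]) {
      // Create the stopped mutatee and load the measurement library into it
      BPatch_process *proc = bpatch.processCreate(argv[1], argv + 1);
      proc->loadLibrary("libTAU.so");                    // assumed name
      BPatch_image *image = proc->getImage();

      // One-time startup call: pass the selected routine (event) names so
      // the measurement library can assign event IDs
      BPatch_Vector<BPatch_function *> init;
      image->findFunction("TauInitCode", init);          // assumed hook
      BPatch_Vector<BPatch_snippet *> initArgs;
      BPatch_constExpr names("main | foo | bar");
      initArgs.push_back(&names);
      BPatch_funcCallExpr initCall(*init[0], initArgs);
      proc->oneTimeCode(initCall);

      // Instrument the entry of one selected routine with the entry hook
      BPatch_Vector<BPatch_function *> targets, hooks;
      image->findFunction("foo", targets);
      image->findFunction("TauRoutineEntry", hooks);     // assumed hook
      BPatch_Vector<BPatch_snippet *> args;
      BPatch_constExpr id(0);                            // event ID from startup
      args.push_back(&id);
      BPatch_funcCallExpr entryCall(*hooks[0], args);
      proc->insertSnippet(entryCall, *targets[0]->findPoint(BPatch_entry));
      // Exit instrumentation via findPoint(BPatch_exit) is analogous

      proc->continueExecution();
      while (!proc->isTerminated())
          bpatch.waitForStatusChange();
      return 0;
  }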

Using DyninstAPI with MPI
- One mutator per mutatee
- Each mutator instruments its mutatee prior to execution
- No central control
- Each mutatee writes its own performance data to disk

  % mpirun -np 4 ./run.sh
  % cat run.sh
  #!/bin/sh
  /usr/local/tau-2.x/x86_64/bin/tau_run /a.out

Binary Rewriting with TAU
- Rewrite the binary ("save the world") before executing
- No central control
- No need to re-instrument the code on all backend nodes
- Each mutatee writes its own performance data to disk

  % tau_run -o a.inst.out a.out
  % cd _dyninstsaved0
  % mpirun -np 4 ./a.inst.out

Example
- EK-SIMPLE benchmark
  - CFD benchmark
  - Andy Shaw, Kofi Fynn; adapted by Brad Chamberlain
- Experimentation
  - Run on 4 CPUs
  - Runtime instrumentation using DyninstAPI and tau_run
  - Measure wallclock time and CPU time experiments
  - Profiling and tracing modes of measurement
  - Look at performance data with ParaProf and Vampir

ParaProf - Main Window (4 CPUs)

ParaProf - Individual Profile (n,c,t 0,0,0)

ParaProf - Statistics Table (Mean)

ParaProf - net_recv (MPI rank 1)

Integrated Instrumentation (Source + Dyninst)
- Use source instrumentation for some events
- Use Dyninst for other events
- Access the same TAU measurement infrastructure
- Demonstrate on matrix multiplication example
- Compare regular versus strip-mining versions
[Screenshots: source instrumented; source + binary instrumented]

TAU-over-MRNet (ToM) Project
MRNet as a Transport Substrate in TAU
(Reporting early work done in the last week.)

TAU Transport Substrate - Motivations
- Transport substrate
  - Enables movement of measurement-related data
  - TAU, in the past, has relied on a shared file system
- Some modes of performance observation
  - Offline / post-mortem observation and analysis
    - Least requirements for a specialized transport
  - Online observation
    - Long-running applications, especially at scale
    - Dumping to the file system can be suboptimal
  - Online observation with feedback into the application
    - In addition, requires that the transport is bi-directional
- Performance observation problems and requirements are a function of the mode

Requirements
- Improve performance of transport
  - NFS can be slow and variable
  - Specialization and remoting of FS operations to the front-end
- Data reduction
  - At scale, the cost of moving data is too high
  - Sample in a different domain (node-wise, event-wise)
- Control
  - Selection of events, measurement technique, target nodes
  - What data to output, how often, and in what form?
  - Feedback into the measurement system, feedback into the application
- Online, distributed processing of generated performance data
  - Use the compute resources of transport nodes
  - Global performance analyses within the topology
  - Distributed statistical analyses
    - Easy (mean, variance, histogram); challenging (clustering)

Approach and First Prototype
- Measurement and measured-data transport are separate
  - No such distinction in TAU previously
  - Created an abstraction to separate and hide the transport: TauOutput
- Did not create a custom transport for TAU
  - Use existing monitoring/transport capabilities
  - Supermon (Sottile and Minnich, LANL)
- Piggy-backed TAU performance data on Supermon channels
  - Correlate system-level metrics from Supermon with TAU application performance data

Rationale
- Moved away from NFS
- Separation of concerns
  - Scalability, portability, robustness
  - Addressed independently of TAU
  - Re-use existing technologies where appropriate
- Multiple bindings
  - Use different solutions best suited to a particular platform
- Implementation speed
  - Easy and fast to create an adapter that binds to an existing transport
  - MRNet support was added in about a week
    - Says a lot about the usability of MRNet

ToM Architecture
- TAU components (over the MRNet API)
  - Front-End (FE)
  - Filters
  - Back-End (BE)
- No-Backend-Instantiation mode
- Push-pull model of data retrieval
  - No daemon
  - Instrumented application contains TAU and the Back-End
- Two channels (streams)
  - Data (BE to FE)
  - Control (FE to BE)

ToM Architecture
- Application calls into TAU
  - Per-iteration explicit call to the output routine
  - Periodic calls using alarm
- TauOutput object invoked
  - Configuration-specific: compile time or runtime
  - One per thread
- TauOutput mimics a subset of FS-style operations
  - Avoids changes to TAU code
  - If required, the rest of TAU can be made aware of the output type
- Non-blocking recv for control
  - Back-end pushes
  - Sink pulls
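A minimal sketch of what such an output abstraction can look like (class and method names are illustrative, not TAU's actual TauOutput interface):

  #include <cstddef>

  // FS-style operations behind which the transport is hidden; the rest of
  // the measurement system writes profile data without knowing whether it
  // goes to NFS, Supermon, or MRNet
  class TauOutput {
  public:
      virtual ~TauOutput() {}
      virtual int  open(const char *name) = 0;                // create stream/file
      virtual int  write(const void *buf, std::size_t n) = 0; // push profile data
      virtual void flush() = 0;
      virtual void close() = 0;
  };

  // One binding per transport, selected at compile time or runtime
  class NFSOutput   : public TauOutput { /* fopen/fwrite/fclose ... */ };
  class MRNetOutput : public TauOutput { /* pack and send on the data stream ... */ };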

Simple Example (NPB LU - A, per 5 iterations): exclusive time

Simple Example (NPB LU - A, per 5 iterations): number of calls

Comparing ToM with NFS
- TAUoverNFS versus TAUoverMRNet
  - 250 ssor iterations
  - 251 TAU_DB_DUMP operations
- Significant advantages with a specialized transport substrate
  - Similar when using Supermon as the substrate
  - Remoting of expensive FS metadata operations to the Front-End
[Table: NPB LU (A), 32 processors: Total Runtime and TAU_DB_DUMP time over NFS (secs) vs. over MRNet (secs), with % improvement over NFS]

Playing with Filters
- Downstream (FE to BE) multicast path
  - Even without filters, very useful for control
- Data-reduction filters are integral to the upstream path (BE to FE)
  - Without filters, lossless data is reproduced D-1 times
  - Unnecessarily large cost to the network
- Filter 1: Random Sampling Filter
  - Very simplistic data reduction by node-wise sampling
  - Accepts or rejects packets probabilistically
  - TAU Front-End can control the probability P(accept)
  - P(accept) = K/N (N = number of leaves, K a user constant)
  - Bounds the number of packets per round to K
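The sampling policy itself is tiny; a sketch of the node-wise accept/reject logic (generic code, not the actual MRNet filter signature):

  #include <cstdlib>

  // Accept each performance packet with probability P(accept) = K/N so that
  // the expected number of packets per round reaching the front-end is
  // bounded by K, regardless of the number of leaf back-ends N
  bool accept_packet(int K, int N) {
      double p_accept = (K >= N) ? 1.0 : static_cast<double>(K) / N;
      return (static_cast<double>(std::rand()) / RAND_MAX) < p_accept;
  }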

Filter 1 in Action (Ring application)
- Compare different P(accept) values: 1, 1/4, 1/16
- Front-End unable to keep up
  - Queuing delay propagated back

Other Filters
- Statistics filter
  - Reduce raw performance data to a smaller set of statistics
  - Distribute these statistical analyses from the Front-End to the filters
  - Simple measures: mean, std. dev., histograms
  - More sophisticated measures: distributed clustering
- Controlling filters
  - No direct way to control upstream filters (not on the control path)
  - Recommended solution
    - Place upstream filters that work in concert with downstream filters to share control information
    - Requires synchronization of state between upstream and downstream filters
  - Our Echo hack
    - Back-Ends transparently echo Filter-Control packets back upstream
    - These are then interpreted by the filters
    - Easier to implement, but control response time may be greater
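For the simple measures, per-subtree summaries combine associatively, so each filter can merge its children's partial results on the way up the tree; a sketch of the standard count/mean/variance merge (not TAU's actual filter code):

  #include <cstdint>

  // Per-subtree summary of one performance metric
  struct Summary {
      std::int64_t n;   // number of samples
      double mean;      // running mean
      double m2;        // sum of squared deviations (variance = m2 / n)
  };

  // Merge two children's summaries at an intermediate node; assumes
  // a.n + b.n > 0 (Chan et al.'s parallel variance combination)
  Summary merge(const Summary &a, const Summary &b) {
      Summary r;
      r.n = a.n + b.n;
      double delta = b.mean - a.mean;
      r.mean = a.mean + delta * b.n / r.n;
      r.m2 = a.m2 + b.m2
             + delta * delta * static_cast<double>(a.n) * b.n / r.n;
      return r;
  }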

Feedback / Suggestions
- Easy to integrate with MRNet
  - Good examples, documentation, readable source code
- Setup phase
  - Make MRNet intermediate nodes listen on a pre-specified port
  - Allow arbitrary MRNet ranks to connect and then set the IDs in the topology
  - Relaxing strict a-priori ranks can make setup easier
  - Setup in job-queue environments is difficult
- Packetization API could be more flexible
  - Current API is usable and simple (var-arg printf style)
  - Composing a packet over a series of staggered stages is difficult
- Allow control over how buffering is performed
  - Important in a push-pull model, as data injection points (rates) are independent of data retrieval
  - Not a problem in a purely pull model

TAUoverMRNet - Contrast with TAUoverSupermon
- Supermon (cluster monitor) vs. MRNet (reduction network)
  - Both lightweight transport substrates
- Data format
  - Supermon: ASCII s-expressions
  - MRNet: packets with packed (binary?) data
- Supermon setup
  - Loose topology
  - No support/help in setting up intermediate nodes
  - Assumes Supermon is part of the environment
- MRNet setup
  - Strict topology
  - Better support for starting intermediate nodes
  - With/without Back-End instantiation (TAU uses the latter)
- Multiple Front-Ends (or sinks) possible with Supermon
  - With MRNet, the front-end needs to program this functionality
- No existing pluggable "filter" support in Supermon
  - Performing aggregation is more difficult with Supermon
- Supermon allows buffer-policy specification; MRNet does not

Future Work
- Dyninst
  - Tighter integration of source and binary instrumentation
  - Conveying source information to the binary level
  - Enabling use of TAU's advanced measurement features
  - Leveraging TAU's performance mapping support
  - Want a robust and portable binary rewriting tool
- MRNet
  - Development of more performance filters
  - Evaluation of MRNet performance for different scenarios
  - Testing at large scale
  - Use in applications