Simplifying the Usage of Performance Evaluation Tools: Experiences with TAU and DyninstAPI Paradyn/Condor Week 2010, Rm 221, Fluno Center, U. of Wisconsin,

Simplifying the Usage of Performance Evaluation Tools: Experiences with TAU and DyninstAPI
Paradyn/Condor Week 2010, Rm 221, Fluno Center, U. of Wisconsin, Madison
10:45am – 11:30am, Tuesday, 14th April, 2010
Sameer Shende, Allen D. Malony, Alan Morris
Performance Research Laboratory, University of Oregon, Eugene, OR
{sameer, malony,

Acknowledgements: University of Oregon
- Dr. Allen D. Malony, Professor, CIS Dept, and Director, NeuroInformatics Center
- Alan Morris, Senior software engineer
- Dr. Chee Wai Lee, Research faculty
- Wyatt Spear, Software engineer
- Scott Biersdorff, Software engineer
- Dr. Robert Yelle, Research faculty
- Suzanne Millstein, Ph.D. student
And
- Matt Legendre and Dan McNulty, University of Wisconsin at Madison

Motivation
- We have made great advances in instrumentation, measurement and analysis techniques
- Tools are rich in features and have complex tool dependencies
- Tools are getting more complex to use and to install
- We need to simplify the usage of our performance evaluation tools!

TAU Performance System ®
- Integrated toolkit for performance problem solving
  - Instrumentation, measurement, analysis, visualization
  - Portable performance profiling and tracing facility
  - Performance data management and data mining
- Based on direct performance measurement approach
- Open source
- Available on all HPC platforms
- Partners
  - LLNL, ANL, ORNL, LANL, PNNL, LBL
  - Research Centre Jülich, TU Dresden
(Figure: TAU Architecture)

TAU Parallel Performance System Goals
- Portable (open source) parallel performance system
  - Computer system architectures and operating systems
  - Different programming languages and compilers
- Multi-level, multi-language performance instrumentation
- Flexible and configurable performance measurement
- Support for multiple parallel programming paradigms
  - Multi-threading, message passing, mixed-mode, hybrid, object oriented (generic), component-based
- Support for performance mapping
- Integration of leading performance technology
- Scalable (very large) parallel performance analysis

TAU Performance System Components
(Figure labels: TAU Architecture)
- Program Analysis: PDT
- Parallel Profile Analysis: PerfDMF, ParaProf
- Performance Data Mining: PerfExplorer
- Performance Monitoring: TAUoverMRNet (ToM)

TAU Performance System Architecture

Parallel Profile Visualization: ParaProf

Scalable Visualization: ParaProf (128k cores)

Scatter Plot: ParaProf (128k cores)

ParaProf: Communication Matrix Display

Comparing Effects of Multi-Core Processors
- AORSA2D
  - magnetized plasma simulation
  - Automatic loop level instrumentation
- Blue is single node
- Red is dual core
- Cray XT3 (4K cores)

ParaProf: Mflops Sorted by Exclusive Time
(Figure annotation: low mflops?)

Performance Regression Testing

Usage Scenarios: Evaluate Scalability

Scaling NAMD with CUDA (Jumpshot with TAU)
(Figure annotation: Data transfer)

Measuring Performance of PGI Accelerated Code

TAU and Eclipse
- Provide an interface for configuring TAU’s automatic instrumentation within Eclipse’s build system
- Manage runtime configuration settings and environment variables for execution of TAU instrumented programs
(Figure labels, in order:)
- C/C++/Fortran Project in Eclipse
- Add or modify an Eclipse build configuration w/ TAU
- Temporary copy of instrumented code
- Compilation/linking with TAU libraries
- TAU instrumented libraries
- Program execution
- Performance data
- Program output

TAU and Eclipse PerfDMF

Choosing PAPI Counters with TAU in Eclipse

TAU Performance System Architecture

TAU Instrumentation Approach
- Support for standard program events
  - Routines, classes and templates
  - Statement-level blocks
  - Begin/End events (Interval events)
- Support for user-defined events
  - Begin/End events specified by user
  - Atomic events (e.g., size of memory allocated/freed)
  - Selection of event statistics
- Support definition of “semantic” entities for mapping
- Support for event groups (aggregation, selection)
- Instrumentation optimization
  - Eliminate instrumentation in lightweight routines
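
The difference between interval (begin/end) events and atomic events can be sketched in a few lines of Python. This is only an illustrative model, not TAU's API: the class names `IntervalEvent` and `AtomicEvent` and their methods are hypothetical. An interval event accumulates a call count and inclusive time between a start and a stop; an atomic event records a value at a single point and keeps summary statistics.

```python
import time


class IntervalEvent:
    """Begin/end event: accumulates call count and inclusive time (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.calls = 0
        self.inclusive = 0.0
        self._started = None

    def start(self, now=None):
        # 'now' lets a caller supply a timestamp; otherwise read the clock.
        self._started = time.perf_counter() if now is None else now

    def stop(self, now=None):
        end = time.perf_counter() if now is None else now
        self.inclusive += end - self._started
        self.calls += 1
        self._started = None


class AtomicEvent:
    """Atomic event: a value observed at one point, e.g. bytes allocated."""

    def __init__(self, name):
        self.name = name
        self.samples = 0
        self.total = 0.0
        self.minimum = None
        self.maximum = None

    def trigger(self, value):
        self.samples += 1
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)

    def mean(self):
        return self.total / self.samples if self.samples else 0.0


# Usage with explicit timestamps so the numbers are reproducible.
solve = IntervalEvent("solve")
solve.start(now=0.0)
solve.stop(now=2.5)

alloc = AtomicEvent("malloc size")
for nbytes in (64, 256, 1024):
    alloc.trigger(nbytes)
```

Keeping the two kinds of event separate reflects the slide's distinction: interval events answer "how long and how often", atomic events answer "what values were seen".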

TAU Instrumentation Mechanisms
- Source code
  - Manual (TAU API, TAU component API)
  - Automatic (robust)
    - C, C++, F77/90/95 (Program Database Toolkit (PDT))
    - OpenMP (directive rewriting (Opari), POMP2 spec)
- Object code
  - Compiler-based instrumentation (-optCompInst)
  - Pre-instrumented libraries (e.g., MPI using PMPI)
  - Statically-linked and dynamically-linked (tau_wrap)
- Executable code
  - Binary re-writing and dynamic instrumentation (DyninstAPI, U. Wisconsin, U. Maryland)
  - Virtual machine instrumentation (e.g., Java using JVMPI)
  - Interpreter based instrumentation (Python)
  - Kernel based instrumentation (KTAU)
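
The idea behind pre-instrumented and wrapper libraries is interposition: calls to a library entry point are routed through a wrapper that measures the call and then forwards it. The sketch below models that idea in Python only; it is not how PMPI or tau_wrap are implemented. `FakeLibrary`, `wrap_with_timer` and `instrument_library` are hypothetical names introduced for illustration.

```python
import functools
import time


def wrap_with_timer(name, func, profile):
    """Return a wrapper that times each call to func and records it under name."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)  # forward to the original entry point
        finally:
            calls, total = profile.get(name, (0, 0.0))
            profile[name] = (calls + 1, total + time.perf_counter() - start)

    return wrapper


class FakeLibrary:
    """Hypothetical stand-in for a library whose source we do not modify."""

    @staticmethod
    def send(data):
        return len(data)

    @staticmethod
    def recv(nbytes):
        return b"x" * nbytes


def instrument_library(lib, names, profile):
    """Replace the named entry points on lib with timing wrappers."""
    for name in names:
        setattr(lib, name, wrap_with_timer(name, getattr(lib, name), profile))


profile = {}
instrument_library(FakeLibrary, ["send", "recv"], profile)
FakeLibrary.send(b"hello")
FakeLibrary.send(b"hi")
FakeLibrary.recv(4)
```

The application code that calls `send` and `recv` is unchanged, which is the property that makes wrapper-based instrumentation attractive when source is unavailable.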

Program Database Toolkit (PDT)
(Figure labels:)
- Application / Library
- C / C++ parser; Fortran parser (F77/90/95)
- IL
- C / C++ IL analyzer; Fortran IL analyzer
- Program Database Files
- DUCTAPE
  - PDBhtml: Program documentation
  - SILOON: Application component glue
  - CHASM: C++ / F90/95 interoperability
  - TAU_instr: Automatic source instrumentation

Automatic Source-Level Instrumentation in TAU
TAU v : If source-based instrumentation fails, compiler-based instrumentation is used automatically

Using TAU with Source Code Instrumentation
- TAU supports several measurement options (profiling, tracing, profiling with hardware counters, etc.)
- Each measurement configuration of TAU corresponds to a unique stub makefile that is generated when you configure it
- To instrument source code using PDT
  - Choose an appropriate TAU stub makefile in /lib:
      % export TAU_MAKEFILE=/usr/local/packages/tau/x86_64/lib/Makefile.tau-mpi-pdt
      % export TAU_OPTIONS='-optVerbose …' (see tau_compiler.sh -help)
  - And use tau_f90.sh, tau_cxx.sh or tau_cc.sh as Fortran, C++ or C compilers:
      % mpif90 foo.f90
    changes to
      % tau_f90.sh foo.f90
- Execute application and analyze performance data:
    % pprof (for text based profile display)
    % paraprof (for GUI)

TAU Measurement Configuration – Examples
% cd /usr/local/packages/tau/x86_64/lib; ls Makefile.*
Makefile.tau-pdt
Makefile.tau-mpi-pdt
Makefile.tau-papi-mpi-pdt
Makefile.tau-pthread-pdt
Makefile.tau-pthread-mpi-pdt
Makefile.tau-openmp-opari-pdt
Makefile.tau-openmp-opari-mpi-pdt
Makefile.tau-papi-openmp-opari-mpi-pdt
…
- For an MPI+F90 application, you may want to start with: Makefile.tau-mpi-pdt
  - Supports MPI instrumentation & PDT for automatic source instrumentation
  - % setenv TAU_MAKEFILE /usr/local/packages/tau/x86_64/lib/Makefile.tau-mpi-pdt
  - % tau_f90.sh application.f90; mpirun -np 256 ./a.out
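
Because each stub makefile name lists the options it was configured with, picking a starting point can be treated as a small filtering problem. The sketch below is a hypothetical helper, not part of TAU: it assumes, based only on the listing above, that the name after `Makefile.tau-` is a hyphen-separated list of option tokens, and it returns the makefiles that contain every required token, with the fewest extras first.

```python
PREFIX = "Makefile.tau-"


def features(makefile_name):
    """Split a stub makefile name into the option tokens encoded in it."""
    if not makefile_name.startswith(PREFIX):
        raise ValueError(f"not a TAU stub makefile: {makefile_name}")
    return makefile_name[len(PREFIX):].split("-")


def candidates(makefile_names, required):
    """Return makefiles containing every required token, fewest extras first."""
    required = set(required)
    matches = [m for m in makefile_names if required <= set(features(m))]
    return sorted(matches, key=lambda m: len(features(m)))


# The names from the directory listing on the slide.
available = [
    "Makefile.tau-pdt",
    "Makefile.tau-mpi-pdt",
    "Makefile.tau-papi-mpi-pdt",
    "Makefile.tau-pthread-pdt",
    "Makefile.tau-pthread-mpi-pdt",
    "Makefile.tau-openmp-opari-pdt",
    "Makefile.tau-openmp-opari-mpi-pdt",
    "Makefile.tau-papi-openmp-opari-mpi-pdt",
]

# An MPI application instrumented with PDT needs at least these two tokens.
best = candidates(available, {"mpi", "pdt"})[0]
print(best)
```

For the MPI+F90 case this selects the same makefile the slide recommends starting with.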

Compile-Time Environment Variables
Optional parameters for TAU_OPTIONS: [tau_compiler.sh -help]
Option | Description
-optVerbose | Turn on verbose debugging messages
-optCompInst | Use compiler based instrumentation
-optNoCompInst | Do not revert to compiler instrumentation if source instrumentation fails.
-optDetectMemoryLeaks | Turn on debugging memory allocations/de-allocations to track leaks
-optKeepFiles | Does not remove intermediate .pdb and .inst.* files
-optPreProcess | Preprocess Fortran sources before instrumentation
-optTauSelectFile="" | Specify selective instrumentation file for tau_instrumentor
-optLinking="" | Options passed to the linker. Typically $(TAU_MPI_FLIBS) $(TAU_LIBS) $(TAU_CXXLIBS)
-optCompile="" | Options passed to the compiler. Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
-optPdtF95Opts="" | Add options for Fortran parser in PDT (f95parse/gfparse)
-optPdtF95Reset="" | Reset options for Fortran parser in PDT (f95parse/gfparse)
-optPdtCOpts="" | Options for C parser in PDT (cparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
-optPdtCxxOpts="" | Options for C++ parser in PDT (cxxparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
...
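
When builds are driven from a script, the two compile-time variables can be assembled programmatically. The following is a minimal sketch under stated assumptions: `tau_build_env` and `export_lines` are hypothetical helpers, the option names are taken from the list above, and the makefile path is the example path used elsewhere in these slides.

```python
import os
import shlex


def tau_build_env(makefile, options, base_env=None):
    """Return a copy of the environment with TAU_MAKEFILE and TAU_OPTIONS set."""
    env = dict(os.environ if base_env is None else base_env)
    env["TAU_MAKEFILE"] = makefile
    env["TAU_OPTIONS"] = " ".join(options)
    return env


def export_lines(env, keys=("TAU_MAKEFILE", "TAU_OPTIONS")):
    """Render the chosen variables as sh-style export commands."""
    return [f"export {key}={shlex.quote(env[key])}" for key in keys]


env = tau_build_env(
    "/usr/local/packages/tau/x86_64/lib/Makefile.tau-mpi-pdt",
    ["-optVerbose", "-optKeepFiles", "-optPreProcess"],
    base_env={},  # start from an empty environment so the example is reproducible
)
for line in export_lines(env):
    print(line)
```

The resulting dictionary could be passed as the `env` argument of `subprocess.run` when invoking a compiler wrapper such as tau_f90.sh; that step is omitted here because it requires a TAU installation.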

Runtime Environment Variables in TAU
Environment Variable | Default | Description
TAU_TRACE | 0 | Setting to 1 turns on tracing
TAU_CALLPATH | 0 | Setting to 1 turns on callpath profiling
TAU_TRACK_HEAP or TAU_TRACK_HEADROOM | 0 | Setting to 1 turns on tracking heap memory/headroom at routine entry & exit using context events (e.g., Heap at Entry: main=>foo=>bar)
TAU_CALLPATH_DEPTH | 2 | Specifies depth of callpath. Setting to 0 generates no callpath or routine information, setting to 1 generates flat profile and context events have just parent information (e.g., Heap Entry: foo)
TAU_SYNCHRONIZE_CLOCKS | 1 | Synchronize clocks across nodes to correct timestamps in traces
TAU_COMM_MATRIX | 0 | Setting to 1 generates communication matrix display using context events
TAU_THROTTLE | 1 | Setting to 0 turns off throttling. Enabled by default to remove instrumentation in lightweight routines that are called frequently
TAU_THROTTLE_NUMCALLS | 100000 | Specifies the number of calls before testing for throttling
TAU_THROTTLE_PERCALL | 10 | Specifies value in microseconds. Throttle a routine if it is called more than TAU_THROTTLE_NUMCALLS times and takes less than 10 usec of inclusive time per call
TAU_COMPENSATE | 0 | Setting to 1 enables runtime compensation of instrumentation overhead
TAU_PROFILE_FORMAT | Profile | Setting to “merged” generates a single file. “snapshot” generates xml format
TAU_METRICS | TIME | Setting to a comma-separated list generates other metrics. (e.g., TIME:linuxtimers:PAPI_FP_OPS:PAPI_NATIVE_ )
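
The throttling rule described by TAU_THROTTLE, TAU_THROTTLE_NUMCALLS and TAU_THROTTLE_PERCALL can be written as a small predicate. This is a sketch of the rule as stated on the slide, not TAU's internal code; the function names are hypothetical, and the defaults simply mirror the values listed above.

```python
def throttle_settings(env):
    """Read the three throttling variables from a mapping, using the listed defaults."""
    return {
        "enabled": env.get("TAU_THROTTLE", "1") != "0",
        "numcalls": int(env.get("TAU_THROTTLE_NUMCALLS", "100000")),
        "percall_usec": float(env.get("TAU_THROTTLE_PERCALL", "10")),
    }


def should_throttle(calls, inclusive_usec, settings):
    """Return True if a routine is called often and is cheap per call.

    A routine qualifies when it has been called more than 'numcalls' times
    and its inclusive time per call is below 'percall_usec' microseconds.
    """
    if not settings["enabled"] or calls <= settings["numcalls"]:
        return False
    return inclusive_usec / calls < settings["percall_usec"]


defaults = throttle_settings({})

# 200000 calls taking 400000 usec in total: 2 usec per call, so throttle.
tiny_getter = should_throttle(200000, 400000.0, defaults)

# 200000 calls taking 4000000 usec in total: 20 usec per call, so keep measuring.
real_work = should_throttle(200000, 4000000.0, defaults)

# Throttling switched off through the environment.
disabled = should_throttle(200000, 400000.0, throttle_settings({"TAU_THROTTLE": "0"}))
```

Both conditions must hold: a routine that is cheap but rarely called contributes little overhead, and a routine that is called often but does real work is worth measuring.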

Simplifying Instrumentation using DyninstAPI
- TAU uses DyninstAPI to create a binary re-writer (tau_run)
- TAU’s measurement library (DSO) is loaded by tau_run
- Both runtime instrumentation and binary re-writing are supported
- Selection of files and routines based on exclude/include lists
- Simplifies tool usage greatly!
- Available on POINT LiveDVD [
- Usage:
    % tau_run a.out -o a.inst.out
    % mpirun -np 4 a.inst.out
    % paraprof
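
Selecting routines through exclude/include lists amounts to a simple filter. The sketch below shows one plausible set of semantics and is not TAU's selective instrumentation file format: `is_instrumented` is a hypothetical function, shell-style wildcards via `fnmatch` are an assumption, and so is the rule that a non-empty include list takes precedence over the exclude list.

```python
from fnmatch import fnmatchcase


def is_instrumented(routine, include=(), exclude=()):
    """Decide whether a routine is instrumented.

    Assumed semantics for this sketch: a non-empty include list is exclusive
    (only matching routines are kept); otherwise everything is kept except
    routines matching the exclude list.
    """
    if include:
        return any(fnmatchcase(routine, pattern) for pattern in include)
    return not any(fnmatchcase(routine, pattern) for pattern in exclude)


routines = ["main", "solve", "get_index", "get_value", "MPI_Send"]

# Drop small accessor routines, keep everything else.
without_getters = [r for r in routines if is_instrumented(r, exclude=["get_*"])]

# Instrument only the solver and MPI calls.
only_selected = [r for r in routines if is_instrumented(r, include=["solve", "MPI_*"])]
```

An exclude list is the usual way to remove lightweight routines whose instrumentation overhead would distort the profile.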

Issues
- Re-writing static executables limited to gcc, limited platforms in beta
- Currently, we support dynamic executables (v6.1)
- We are working on supporting both static and dynamic executables
- We hope to support more platforms, compilers and runtime systems in the future
- Rewriting shared libraries used by the application
  - LD_PRELOAD’able wrapper libraries can be created using tau_wrap
  - requires interface information in header file

Binary Rewriting in TAU using DyninstAPI

Wish List for tau_run
- Support for more platforms
  - Apple Mac OS X, Windows, IBM BG/P, AIX, …
- Support for more compilers
- Support for rewriting shared objects
- Support for static binary rewriting with validation for compilers other than gcc
  - XLC, PathScale, Cray CCE, Intel, PGI,…

Other Tools…
- Other TAU tools that use technologies from the ParaDyn/DyninstAPI group
  - TAU over MRNet (ToM) for runtime
  - Stackwalker API for accessing callstack

StackWalkerAPI in TAU
- Requirements overview:
  - Minimal information required (PC is enough)
  - Threaded support necessary
  - Low overhead (for high sample rates)
- Stack unwinding from a signal handler
  - Malloc could be interrupted
  - Need to walk through signal handler frame
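
Sampling-based stack walking can be illustrated with Python's own frame objects: a timer signal interrupts the program, and the handler follows caller links from the interrupted frame to build a call path. This is an analogy only. It does not use StackWalkerAPI, and because Python runs signal handlers between bytecodes, the malloc and signal-frame concerns listed above do not arise here as they do in a native sampler. All function names are hypothetical, and the timer calls are Unix-only.

```python
import signal
import sys
from collections import Counter

samples = Counter()


def walk_stack(frame, max_depth=64):
    """Follow caller links from a frame; return function names, innermost first."""
    names = []
    while frame is not None and len(names) < max_depth:
        names.append(frame.f_code.co_name)
        frame = frame.f_back
    return names


def on_sample(signum, frame):
    """Signal handler: record the call path that was interrupted."""
    path = " => ".join(reversed(walk_stack(frame)))
    samples[path] += 1


def start_sampling(interval_sec=0.01):
    """Deliver SIGPROF periodically while the process uses CPU time (Unix only)."""
    signal.signal(signal.SIGPROF, on_sample)
    signal.setitimer(signal.ITIMER_PROF, interval_sec, interval_sec)


def stop_sampling():
    """Cancel the profiling timer."""
    signal.setitimer(signal.ITIMER_PROF, 0, 0)
```

A typical use is to call `start_sampling()`, run the workload, call `stop_sampling()`, and then inspect `samples.most_common()` to see which call paths were interrupted most often.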

Issues encountered with StackWalkerAPI
- StackWalkerAPI:
  - Isn’t thread safe (and locking to use it can cause significant overhead)
  - Uses malloc/new (and so do dependent libraries such as libdwarf)
  - C++ (we would prefer C)
  - Issues walking certain kinds of stack frames
  - Matt Legendre was able to help us out a lot though!
- Alternatives:
  - TAU is currently using stack walking constructs from HPCToolkit

Online Monitoring using TAU over MRNet (ToM)
- Back-End (BE) TAU adapter offloads performance data
- Filters
  - reduction
  - distributed analysis
  - upstream / downstream
- Front-End (FE) unpacks, interprets, stores
- Paths
  - reverse data reduction path
  - multicast control path
- Push-Pull model
  - source pushes, sink pulls
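
The reduction path can be pictured as a tree in which each intermediate node merges the profiles of its children before passing one combined profile upstream. The sketch below models only that idea in plain Python; it does not use MRNet, and `merge_profiles`, `reduce_tree` and the sample event names are hypothetical. Summation is assumed as the reduction filter.

```python
def merge_profiles(profiles):
    """Reduction filter: sum per-event values from several child profiles."""
    merged = {}
    for profile in profiles:
        for event, value in profile.items():
            merged[event] = merged.get(event, 0.0) + value
    return merged


def reduce_tree(leaves, fanout=2):
    """Push back-end profiles up a tree, merging at each level until one remains."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = list(leaves)
    while len(level) > 1:
        level = [
            merge_profiles(level[i:i + fanout])
            for i in range(0, len(level), fanout)
        ]
    return level[0] if level else {}


# Four back-ends, each reporting time per event in seconds.
back_ends = [
    {"MPI_Send": 1.0, "compute": 4.0},
    {"MPI_Send": 2.0, "compute": 3.0},
    {"MPI_Send": 0.5, "compute": 5.0},
    {"compute": 1.0},
]
front_end_view = reduce_tree(back_ends, fanout=2)
```

With a fanout of 2 the four back-end profiles are merged in two levels, so the front-end receives a single profile instead of four messages.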

Conclusions
- TAU and DyninstAPI represent mature technologies for performance instrumentation, measurement and analysis
- Using DyninstAPI’s binary re-writing capabilities, we have produced a tool that simplifies code instrumentation
- We hope to collaborate on other projects and include support for an enhanced stack walker API
Questions?

Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science
  - MICS, Argonne National Lab
  - ASC/NNSA
  - University of Utah ASC/NNSA Level 1
  - ASC/NNSA, LLNL
- Department of Defense (DoD)
- NSF SDCI
- Partners:
  - Research Centre Juelich
  - LBL, ORNL, ANL, LANL, PNNL, LLNL
  - TU Dresden
  - ParaTools, Inc.