Optimization of Instrumentation in Parallel Performance Evaluation Tools Sameer Shende, Allen D. Malony, Alan Morris University of Oregon {sameer, malony,amorris}@cs.uoregon.edu PARA’06: MS8: Tools for Parallel Performance Analysis, 2:40pm – 3pm, Mon 6/19/06

Outline
- Overview of features
  - Instrumentation
  - Measurement (Profiling, Tracing)
  - Analysis tools
- Tools and techniques for optimizing instrumentation
- Conclusions

TAU Performance System
- Tuning and Analysis Utilities (14+ year project effort)
- Performance system framework for HPC systems
  - Integrated, scalable, portable, flexible, and parallel
- Integrated toolkit for performance problem solving
  - Automatic instrumentation
  - Highly configurable measurement system with support for many flavors of profiling and tracing
  - Portable analysis and visualization tools
  - Performance data management and data mining
- http://www.cs.uoregon.edu/research/tau

TAU Performance System Architecture
[figure: architecture diagram; callout: event selection]

TAU Performance System Architecture
[figure: architecture diagram]

Program Database Toolkit (PDT)
[figure: PDT data flow]
- Application / Library -> C / C++ parser, Fortran parser (F77/90/95) -> IL
- IL -> C / C++ IL analyzer, Fortran IL analyzer -> Program Database Files
- Program Database Files -> DUCTAPE -> tools
  - PDBhtml: program documentation
  - SILOON: application component glue
  - CHASM: C++ / F90/95 interoperability
  - TAU_instr: automatic source instrumentation

ParaProf – Manager Window
[screenshot; callouts: performance database, derived performance metrics]

ParaProf – Full Profile (Miranda)
[screenshot; 8K processors]

ParaProf – Statistics Table (Uintah)
[screenshot]

ParaProf – 3D Full Profile (Miranda)
[screenshot; 16K processors]

ParaProf – 3D Scatterplot (Miranda)
- Each point is a “thread” of execution
- Relation between four routines shown at once

TAU Instrumentation Approach
- Support for standard program events
  - Routines
  - Classes and templates
  - Statement-level blocks
- Support for user-defined events
  - Begin/End events (“user-defined timers”)
  - Atomic events (e.g., size of memory allocated/freed)
- Support for definition of “semantic” entities for mapping
- Support for event groups
- Instrumentation optimization (eliminate instrumentation in lightweight routines)

Sampling vs Measured Profiling
- Sampling
  - At each sample, the PC or callstack is examined
  - Performance of the program is estimated from the samples taken in code regions
  - Overhead is fixed for a given inter-sample interval
  - Typically used in gprof, prof and other system profilers
- Measured Profiling
  - Instrumentation calls inserted at code regions
  - Entry/exit from routine, outer loops, “events”
  - Accurate measurements; compensation for timer overheads possible
  - Accuracy decreases as instrumentation becomes finer grained; coarse-grained instrumentation is more accurate
  - Overhead of instrumentation depends on event frequency
  - Optimize instrumentation to capture necessary detail; eliminate instrumentation in frequently executing lightweight routines
  - Used in TAU
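The contrast between the two overhead models can be made concrete with a little arithmetic. The sketch below is only an illustrative model, not a measurement of TAU or gprof: the function names, the per-sample cost and the per-probe cost are all hypothetical values chosen to show the trend.

```python
def sampling_overhead(runtime_s, interval_s, cost_per_sample_s):
    """Sampling cost scales with the number of samples taken during the run."""
    num_samples = runtime_s / interval_s
    return num_samples * cost_per_sample_s


def instrumentation_overhead(num_calls, cost_per_probe_s):
    """Measured profiling pays one entry probe and one exit probe per call."""
    return 2 * num_calls * cost_per_probe_s


if __name__ == "__main__":
    # All numbers below are hypothetical, chosen only to show the trend.
    print("sampling:", sampling_overhead(100.0, 0.01, 1e-5), "s")
    for calls in (1_000, 1_000_000, 100_000_000):
        print(calls, "calls:", instrumentation_overhead(calls, 5e-7), "s")
```

Under this model the sampling cost stays the same no matter how the program is structured, while the instrumentation cost grows linearly with the call count, which is why frequently executing lightweight routines are the first candidates for exclusion.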

TAU Instrumentation
- Flexible instrumentation mechanisms at multiple levels
  - Source code
    - manual (TAU API, TAU Component API)
    - automatic
      - C, C++, F77/90/95 (Program Database Toolkit (PDT))
      - OpenMP (directive rewriting (Opari), POMP spec)
  - Object code
    - pre-instrumented libraries (e.g., MPI using PMPI)
    - statically-linked and dynamically-linked
  - Executable code
    - dynamic instrumentation (pre-execution) (DynInstAPI)
    - virtual machine instrumentation (e.g., Java using JVMPI)
  - Runtime Linking (LD_PRELOAD)

PAPI [UTK]
- Performance Application Programming Interface
  - The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
- Parallel Tools Consortium project
- University of Tennessee, Knoxville
- http://icl.cs.utk.edu/papi

KOJAK
- KOJAK Toolkit [ICL, UTK and FZJ, Germany]
  - Epilog tracing library
  - Opari OpenMP re-writing tool
  - Expert automatic bottleneck detection trace analyzer
  - CUBE performance data browser
- http://icl.cs.utk.edu/kojak

Automatic Instrumentation
- We now provide compiler wrapper scripts
  - Simply replace mpxlf90 with tau_f90.sh
  - Automatically instruments Fortran source code and links with the TAU MPI wrapper libraries
- Use tau_cc.sh and tau_cxx.sh for C/C++

Before:

    CXX = mpCC
    F90 = mpxlf90_r
    CFLAGS =
    LIBS = -lm
    OBJS = f1.o f2.o f3.o … fn.o

    app: $(OBJS)
        $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)

    .cpp.o:
        $(CXX) $(CFLAGS) -c $<

After:

    CXX = tau_cxx.sh
    F90 = tau_f90.sh
    CFLAGS =
    LIBS = -lm
    OBJS = f1.o f2.o f3.o … fn.o

    app: $(OBJS)
        $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)

    .cpp.o:
        $(CXX) $(CFLAGS) -c $<

AutoInstrumentation using TAU_COMPILER
- $(TAU_COMPILER) stub Makefile variable in 2.14+ release
- Invokes PDT parser, TAU instrumentor, compiler through tau_compiler.sh shell script
- Requires minimal changes to application Makefile
  - Compilation rules are not changed
  - User sets TAU_MAKEFILE and TAU_OPTIONS environment variables
  - User renames the compilers
    - F90=xlf90 to F90=tau_f90.sh
- Passes options from TAU stub Makefile to the four compilation stages
- Uses original compilation command if an error occurs

TAU_COMPILER Options
Optional parameters for $(TAU_COMPILER) [tau_compiler.sh -help]:

  -optVerbose           Turn on verbose debugging messages
  -optPdtDir=""         PDT architecture directory. Typically $(PDTDIR)/$(PDTARCHDIR)
  -optPdtF95Opts=""     Options for Fortran parser in PDT (f95parse)
  -optPdtCOpts=""       Options for C parser in PDT (cparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
  -optPdtCxxOpts=""     Options for C++ parser in PDT (cxxparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
  -optPdtF90Parser=""   Specify a different Fortran parser, e.g., f90parse instead of f95parse
  -optPdtUser=""        Optional arguments for parsing source code
  -optPDBFile=""        Specify [merged] PDB file. Skips parsing phase.
  -optTauInstr=""       Specify location of tau_instrumentor. Typically $(TAUROOT)/$(CONFIG_ARCH)/bin/tau_instrumentor
  -optTauSelectFile=""  Specify selective instrumentation file for tau_instrumentor
  -optTau=""            Specify options for tau_instrumentor
  -optCompile=""        Options passed to the compiler. Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
  -optLinking=""        Options passed to the linker. Typically $(TAU_MPI_FLIBS) $(TAU_LIBS) $(TAU_CXXLIBS)
  -optNoMpi             Removes -l*mpi* libraries during linking (default)
  -optKeepFiles         Does not remove intermediate .pdb and .inst.* files

e.g.,

  % setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optVerbose -optPdtCOpts="-I/home -DFOO"'
  % setenv TAU_MAKEFILE /usr/local/tau-2.15.4/ia64/lib/Makefile.tau-icpc-mpi-pdt
  % tau_cxx.sh matrix.cpp -o matrix -lm
  % tau_f90.sh foo.o bar.o -o app -lm

Optimization of Instrumentation Overhead
- Group routines into profile groups, runtime selection of profiling groups
- Instrument sections of code selectively
- Exclude or include list of routines fed to the instrumentor – controlled manually or automatically
- Rule based control of instrumentation
- Generate selective instrumentation file by examining performance data from a previous run

tau_reduce: Rule-Based Overhead Analysis
- Analyze the performance data to determine events whose measurements carry high (relative) overhead
- Create a select list for excluding those events
- Rule grammar (used in tau_reduce tool)
    [GroupName:] Field Operator Number
  - GroupName indicates that the rule applies to events in that group
  - Field is an event metric attribute (from profile statistics)
    - numcalls, numsubs, percent, usec, cumusec, count [PAPI], totalcount, stdev, usecs/call, counts/call
  - Operator is one of >, <, or =
  - Number is any number
  - Compound rules possible using & between simple rules
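The rule grammar is small enough to sketch a parser and evaluator for it. The code below is a minimal illustration written for this transcript, not the tau_reduce implementation: the function names `parse_rule` and `matches`, the dictionary layout of an event, and the way a group prefix is attached to a single clause are all assumptions.

```python
import operator
import re

# Comparison operators allowed by the rule grammar on the slide.
OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}

# One simple rule: [GroupName:] Field Operator Number
RULE_RE = re.compile(
    r"^\s*(?:(?P<group>\w+)\s*:)?\s*"
    r"(?P<field>[\w/]+)\s*(?P<op>[<>=])\s*(?P<num>\S+)\s*$"
)


def parse_rule(text):
    """Parse one rule line into a list of (group, field, op, number) clauses."""
    clauses = []
    for part in text.split("&"):  # '&' joins simple rules into a compound rule
        m = RULE_RE.match(part)
        if m is None:
            raise ValueError(f"cannot parse rule clause: {part!r}")
        # float() also accepts scientific notation such as 4E5.
        clauses.append((m["group"], m["field"], m["op"], float(m["num"])))
    return clauses


def matches(clauses, event):
    """Return True if the event satisfies every clause of the rule."""
    for group, field, op, number in clauses:
        if group is not None and group not in event.get("groups", ()):
            return False
        if not OPS[op](event[field], number):
            return False
    return True
```

For example, `matches(parse_rule("usec < 1000 & numcalls = 1"), {"usec": 10, "numcalls": 1})` evaluates both clauses against a hypothetical event and returns `True` only if both hold.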

Optimizing Instrumentation Overhead: Examples

  # Exclude all events that are members of TAU_USER
  # and use less than 1000 microseconds
  TAU_USER:usec < 1000

  # Exclude all events that have less than 1000
  # microseconds and are called only once
  usec < 1000 & numcalls = 1

  # Exclude all events that have less than 1000 usecs per
  # call OR have a (total inclusive) percent less than 5
  usecs/call < 1000
  percent < 5

- Scientific notation can be used
  usec>1000 & numcalls>400000 & usecs/call 25

TAU_REDUCE
- Reads profile files and rules
- Creates selective instrumentation file
  - Specifies which routines should be excluded from instrumentation
[figure: rules and profile are inputs to tau_reduce; output is the selective instrumentation file]
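The data flow of this step (profile plus rules in, exclude list out) can be sketched in a few lines. This is an assumption-laden illustration rather than tau_reduce itself: `build_exclude_list` is a hypothetical helper, rules are passed as plain Python predicates instead of parsed rule text, the profile numbers are invented, and separate rules are treated as alternatives (any match excludes the event). The output uses the BEGIN_EXCLUDE_LIST / END_EXCLUDE_LIST markers of the selective instrumentation file.

```python
def build_exclude_list(profile, rules):
    """Return selective-instrumentation text excluding events matched by any rule.

    profile: dict mapping a routine signature to a dict of its metrics
    rules:   list of predicates over a metrics dict (hypothetical stand-ins
             for parsed tau_reduce rules)
    """
    excluded = [
        name for name, metrics in profile.items()
        if any(rule(metrics) for rule in rules)
    ]
    lines = ["BEGIN_EXCLUDE_LIST", *excluded, "END_EXCLUDE_LIST"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Invented profile data; only the routine signature comes from the slides.
    profile = {
        "void quicksort(int *, int, int)": {"usec": 500.0, "numcalls": 400000},
        "int main(int, char **)": {"usec": 9.0e6, "numcalls": 1},
    }
    rules = [lambda m: m["usec"] < 1000]
    print(build_exclude_list(profile, rules), end="")
```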

Instrumentation Specification

  % tau_instrumentor
  Usage : tau_instrumentor [-o ] [-noinline] [-g groupname] [-i headerfile] [-c|-c++|-fortran] [-f ]
  For selective instrumentation, use -f option
  % tau_instrumentor foo.pdb foo.cpp -o foo.inst.cpp -f selective.dat
  % cat selective.dat
  # Selective instrumentation: Specify an exclude/include list of routines/files.
  BEGIN_EXCLUDE_LIST
  void quicksort(int *, int, int)
  void sort_5elements(int *)
  void interchange(int *, int *)
  END_EXCLUDE_LIST
  BEGIN_FILE_INCLUDE_LIST
  Main.cpp
  Foo?.c
  *.C
  END_FILE_INCLUDE_LIST
  # Instruments routines in Main.cpp, Foo?.c and *.C files only
  # Use BEGIN_[FILE]_INCLUDE_LIST with END_[FILE]_INCLUDE_LIST
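To show how the file include list narrows instrumentation, the sketch below reads a selective instrumentation file and applies its wildcards to file names. It is not tau_instrumentor's parser: `parse_selective_file` and `file_is_instrumented` are hypothetical helpers, and treating `?` and `*` as shell-style, case-sensitive wildcards is an assumption based on the comment in the example file.

```python
from fnmatch import fnmatchcase


def parse_selective_file(text):
    """Collect the entries of each BEGIN_<NAME> ... END_<NAME> section."""
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if line.startswith("BEGIN_"):
            current = line[len("BEGIN_"):]
            sections[current] = []
        elif line.startswith("END_"):
            current = None
        elif current is not None:
            sections[current].append(line)
    return sections


def file_is_instrumented(filename, sections):
    """Apply FILE_INCLUDE_LIST wildcards; no such list means every file passes."""
    patterns = sections.get("FILE_INCLUDE_LIST")
    if patterns is None:
        return True
    return any(fnmatchcase(filename, pattern) for pattern in patterns)
```

With the `selective.dat` contents from the slide, `Main.cpp`, `Foo1.c` and `bar.C` would pass the file filter, while `Foo12.c` and `bar.c` would not.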

Optimization of Instrumentation Overhead (contd.)
- Runtime throttling of events based on rule
  - Numcalls > ThresholdA and TimePerCall < ThresholdB
  - setenv TAU_THROTTLE 1
  - setenv TAU_THROTTLE_NUMCALLS ThresholdA
  - setenv TAU_THROTTLE_PERCALL ThresholdB
- Default values:
  - ThresholdA = 100000 calls
  - ThresholdB = 10 microseconds per call
- An event that meets these conditions is disabled at runtime and put in the TAU_DISABLE group
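The throttling rule amounts to a per-event check made as calls accumulate. The class below is a minimal sketch of that check under stated assumptions: `ThrottledTimer` and its attribute names are hypothetical, the per-call time is taken as the running mean, and the defaults mirror the values on the slide. It does not reproduce how TAU implements throttling internally.

```python
class ThrottledTimer:
    """Tracks one event and disables it once it is frequent and lightweight."""

    def __init__(self, name, numcalls_threshold=100000, percall_threshold_usec=10.0):
        self.name = name
        self.numcalls_threshold = numcalls_threshold
        self.percall_threshold_usec = percall_threshold_usec
        self.numcalls = 0
        self.total_usec = 0.0
        self.disabled = False

    def record(self, elapsed_usec):
        """Record one completed call; return False if the event is throttled."""
        if self.disabled:
            return False  # no further measurements are taken for this event
        self.numcalls += 1
        self.total_usec += elapsed_usec
        per_call = self.total_usec / self.numcalls
        if (self.numcalls > self.numcalls_threshold
                and per_call < self.percall_threshold_usec):
            self.disabled = True  # analogous to moving the event to TAU_DISABLE
        return True
```

A routine that runs for 1 microsecond per call is measured until its call count exceeds the call threshold and is ignored afterwards, while a routine averaging 50 microseconds per call is never throttled regardless of how often it runs.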

EPILOG Tracing Optimization
- TAU and Epilog Tracing Package
  - TAU can generate Epilog trace files
    - configure -epilog= -TRACE …
  - Epilog uses its own MPI wrapper library
  - Events are analyzed by Expert to detect performance bottlenecks automatically
  - Output is a CUBE profile file with callpath information
  - CUBE output read by CUBE GUI and TAU’s ParaProf profile browser
- Expert discards all events that do not call an MPI routine directly or indirectly
  - Optimization opportunity for instrumentation

Runtime Instrumentation Control
- When TAU is configured with the -MPITRACE configuration option (without EPILOG support)
  - TAU stores events and wallclock time in a buffer
  - Defers writing buffer to disk until an MPI call takes place
  - Events directly in callstack are enabled and written to disk
  - Other events are discarded
- TAU traces are converted to Epilog traces (tau2elg)
- Expert has minimal set of events
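One way to read the buffering scheme is sketched below: entry events stay pending on a stack and are only committed when an MPI call occurs underneath them. This is an interpretation for illustration, not TAU's -MPITRACE code; the class `DeferredTraceBuffer`, its method names and the tuple layout of a trace record are all hypothetical.

```python
class DeferredTraceBuffer:
    """Keeps routine events pending until an MPI call shows they are needed."""

    def __init__(self):
        self.stack = []    # (routine, entry_time) for active routines, outermost first
        self.written = []  # records that would be written to the trace file
        self.flushed = 0   # number of outermost frames whose entry is already written

    def enter(self, routine, t):
        self.stack.append((routine, t))  # buffered, not yet written

    def exit(self, routine, t):
        name, _ = self.stack.pop()
        if name != routine:
            raise ValueError(f"unbalanced exit: expected {name}, got {routine}")
        if len(self.stack) < self.flushed:
            # The entry was written earlier, so the matching exit is written too.
            self.flushed = len(self.stack)
            self.written.append(("exit", routine, t))
        # Otherwise the routine never led to MPI and both events are dropped.

    def mpi_call(self, name, t):
        # Commit pending entries of every routine currently in the callstack.
        for routine, t_entry in self.stack[self.flushed:]:
            self.written.append(("enter", routine, t_entry))
        self.flushed = len(self.stack)
        self.written.append(("mpi", name, t))
```

In this sketch a routine that enters and exits without an MPI call beneath it leaves no trace records at all, which is the reduction that keeps the resulting event set minimal for later analysis.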

Callpath Profiling Based Selective Instrumentation
- TAU is configured with -PROFILECALLPATH
  - Env. variable TAU_CALLPATH_DEPTH set to a large value
  - Callpaths rooted at “main”
- TAU profiles analyzed to produce an “include list”
  - list of routines that should be instrumented (tauinc.sh) [F. Wolf]
  - Events that call an MPI routine directly/indirectly
- TAU generates EPILOG traces
- Expert analyzes EPILOG traces to produce CUBE profiles
- ParaProf and CUBE browsers read CUBE files
- PerfDMF performance database stores bottleneck results
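The derivation of the include list from callpath profiles can be sketched as follows. This is not the tauinc.sh script: the helper names are hypothetical, and the `" => "` separator between routines in a callpath name and the `MPI_` prefix test are assumptions about how callpaths and MPI routines are identified.

```python
def routines_reaching_mpi(callpaths, separator=" => ", mpi_prefix="MPI_"):
    """Return routines that call an MPI routine directly or indirectly."""
    include = []
    for path in callpaths:
        routines = [name.strip() for name in path.split(separator)]
        first_mpi = next(
            (i for i, name in enumerate(routines) if name.startswith(mpi_prefix)),
            None,
        )
        if first_mpi is None:
            continue  # this callpath never reaches MPI
        for ancestor in routines[:first_mpi]:
            if ancestor not in include:
                include.append(ancestor)  # keep first-seen order, no duplicates
    return include


def format_include_list(routines):
    """Render the routines as an include list for selective instrumentation."""
    return "\n".join(["BEGIN_INCLUDE_LIST", *routines, "END_INCLUDE_LIST"]) + "\n"
```

Given the hypothetical callpaths `main => solve => exchange => MPI_Send` and `main => solve => kernel`, only `main`, `solve` and `exchange` end up in the include list; `kernel` is left uninstrumented because nothing beneath it reaches MPI.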

Conclusions
- Optimization of instrumentation is critical for balancing the volume of performance data generated
- Several techniques for reducing the amount of instrumentation

Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science contracts
  - University of Utah ASC Level 1 sub-contract
  - LLNL ASC/NNSA Level 3 contract
  - LLNL ParaTools/GWT contract
- NSF
  - High-End Computing Grant
- T.U. Dresden, GWT
  - Dr. Wolfgang Nagel and Holger Brunst
- Research Centre Juelich
  - Dr. Bernd Mohr, Dr. Felix Wolf
- Los Alamos National Laboratory contracts
