Download presentation
Presentation is loading. Please wait.
Published byShanon Lamb Modified over 9 years ago
1
Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer, malony}@cs.uoregon.edu
2
TAU Performance System2 Outline What is TAU? Instrumentation Measurement Invoking on SiCortex - tauex Demo sweep3D Visualizing performance results
3
TAU Performance System3 Tuning and Analysis Utilities (13+ year project effort) Performance measurement framework for HPC systems Portable, scalable, flexible, and parallel Integrated toolkit for performance problem solving Automatic instrumentation Highly configurable measurement system with support for many flavors of profiling and tracing Portable analysis and visualization tools Performance data management and data mining http://tau.uoregon.edu
4
TAU Performance System4 TAU Instrumentation Approach Support for standard program events Routines Classes and templates Finer-grain -- loop-level Support for user-defined events Begin/End events (“user-defined timers”) Atomic events (e.g., size of memory allocated/freed) Support for event groups Selective examination of performance data Runtime disabling of groups Instrumentation optimization Selective instrumentation of events (only instrument needed) tau_reduce - generate selective instrumentation file
5
TAU Performance System5 TAU Instrumentation Flexible instrumentation mechanisms at multiple levels Source code manual (TAU API, CCA TAU Component API) automatic C, C++, F77/90/95 (Program Database Toolkit (PDT)) OpenMP (directive rewriting (Opari), POMP spec) Library level pre-instrumented libraries (e.g., MPI using PMPI) statically-linked and dynamically-linked Executable code dynamic instrumentation (pre-execution) (DynInstAPI) virtual machine instrumentation (e.g., Java using JVMPI) Runtime Linking (LD_PRELOAD)
6
TAU Performance System6 Automatic Instrumentation We provide compiler wrapper scripts Simply replace mpif90 with tauf90 Automatically instruments Fortran source code, links with TAU MPI Wrapper libraries. Use taucc and taucxx for C/C++ Before CXX = mpicxx F90 = mpif90 CFLAGS = LIBS = -lm OBJS = f1.o f2.o f3.o … fn.o app: $(OBJS) $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS).cpp.o: $(CC) $(CFLAGS) -c $< After CXX = taucxx F90 = tauf90 CFLAGS = LIBS = -lm OBJS = f1.o f2.o f3.o … fn.o app: $(OBJS) $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS).cpp.o: $(CC) $(CFLAGS) -c $<
7
TAU Performance System7 Measurement Options Flat profiles Time (or counts) spent in each routine (nodes in callgraph). Exclusive/inclusive time, no. of calls, child calls Support for hardware counters (PAPI), multiple counters. Callpath Profiles Flat profiles, plus Time spent along a calling path (edges in callgraph) E.g., “ main=> f1 => f2 => MPI_Send ” shows the time spent in MPI_Send when called by f2, when f2 is called by f1, when it is called by main. Configurable callpath depth limit ( TAU_CALLPATH_DEPTH environment variable) Tracing VAMPIRTRACE TAU Trace format; Converters: tau2slog2, tau2vtf, tau2otf
8
TAU Performance System8 Running Applications with TAU on SiCortex New tool, tauex Usage: tauex [options] [--] Options: -d: Enable debugging output, use repeatedly for more output. -h: Print this message. -i: Print information about the host machine. -s: Dump the shell environment variables and exit. -U: User mode counts -K: Kernel mode counts -S: Supervisor mode counts -I: Interrupt mode counts -l: List events -L : Describe event -a: Count all native events (implies -m) -m: Multiple runs (enough runs of exe to gather all events) -e : Specify PAPI preset or native event -T : specify TAU option -v: Debug/Verbose mode -XrunTAU- : specify TAU library directly
9
TAU Performance System9 Demo Application Sweep3D Build standard sweep3d application (un-instrumented) Run standard (un-instrumented) sweep3d using tauex Provides MPI-only profiles Build sweep3d with automatic TAU instrumentor Run instrumented sweep3d with tauex Provides application-level and MPI events
10
TAU Performance System10 Using tauex Use with uninstrumented executable for MPI profiling and tracing $ mpif90 ring.f90 –o ring $ srun -pscx -n4 tauex./ring $ cd ring.tau.1669/MULTI__P_WALL_CLOCK_TIME $ pprof … FUNCTION SUMMARY (mean): --------------------------------------------------------------------------------------- %Time Exclusive Inclusive #Call #Subrs Inclusive Name msec total msec usec/call --------------------------------------------------------------------------------------- 100.0 3 338 1 8 338086.TAU application 59.5 201 201 1 0 201180 MPI_Init() 37.2 125 125 1 0 125889 MPI_Finalize() 2.1 6 6 1 0 6971 MPI_Barrier() 0.2 0.626 0.626 1 0 626 MPI_Recv() 0.1 0.247 0.247 1 0 247 MPI_Bcast() 0.0 0.102 0.102 1 0 102 MPI_Send() 0.0 0.00875 0.00875 1 0 9 MPI_Comm_size() 0.0 0.00525 0.00525 1 0 5 MPI_Comm_rank()
11
TAU Performance System11 Using TAU compiler wrappers $ tauf90 ring.f90 –o ring # verbose output shows each step Debug: Parsing with PDT Parser Executing> /usr/share/PDT/mips/bin/f95parse ring.f90 -I/home/amorris/usr/include Debug: Instrumenting with TAU Executing> /usr/bin/tau_instrumentor ring.pdb ring.f90 -o ring.inst.f90 Debug: Compiling (Individually) with Instrumented Code Executing> pathf95 -mabi=64 -I. -c ring.inst.f90 -o ring.o Debug: Linking (Together) object files Executing> pathf95 ring.o -mabi=64 -lTAU -Wl,-rpath -lpfm -lpapi -lpfm -lpthread - L/usr/lib/gcc/mips64el-gentoo-linux-gnu/4.1.2/ -lstdc++ -lgcc_s -lscmpi -o ring Debug: cleaning inst file Executing> /bin/rm -f ring.inst.f90 Debug: cleaning PDB file Executing> /bin/rm -f ring.pdb
12
TAU Performance System12 Using tauex Use with instrumented executables for Application+MPI profiling and tracing $ tau_f90.sh ring.f90 –o ring $ srun -pscx -n4 tauex –e CPU_CYCLES./ring $ cd ring.tau.1674/MULTI__CPU_CYCLES $ pprof FUNCTION SUMMARY (mean): --------------------------------------------------------------------------------------- %Time Exclusive Inclusive #Call #Subrs Count/Call Name counts total counts --------------------------------------------------------------------------------------- 100.0 4.552E+05 2.557E+07 1 5 25568675 MAIN 81.1 2.074E+07 2.074E+07 1 0 20742874 MPI_Init() 16.3 1.419E+05 4.176E+06 1 4 4176389 FUNC 14.2 3.639E+06 3.639E+06 1 0 3638847 MPI_Barrier() 0.8 2.091E+05 2.091E+05 1 0 209063 MPI_Recv() 0.7 1.875E+05 1.875E+05 1 0 187542 MPI_Finalize() 0.6 1.539E+05 1.539E+05 1 0 153880 MPI_Bcast() 0.1 3.27E+04 3.27E+04 1 0 32702 MPI_Send() 0.0 4337 4337 1 0 4337 MPI_Comm_size() 0.0 2308 2308 1 0 2308 MPI_Comm_rank()
13
TAU Performance System13 Using tauex Other typical usage scenarios # floating point instruction counts and time (compute derived # FLOPS in ParaProf) $ tauex –e P_WALL_CLOCK_TIME –e PAPI_FP_INS # Generate callpath profiles $ tauex –e PAPI_FP_INS –T callpath # Generate OTF traces for Vampir $ tauex –e PAPI_FP_INS –T vampirtrace # Generate Epilog traces for Kojak $ tauex –T epilog
14
TAU Performance System14 Using TAU with FLASH on SiCortex To use TAU with FLASH on SiCortex platforms, simply specify –tau= in setup # On Full Disclosure: $./setup Sedov -2d -auto -site=sicortex -objdir=tau - tau=/home/amorris/usr/share/TAU/64/Makefile.tau- multiplecounters-pathcc-mpi-papi-pdt # Build $ cd tau ; make # Run (using tauex) # This will generate callpath profiles with time and floating point instruction metrics $ srun -pscx -n16 tauex -T callpath –e P_WALL_CLOCK_TIME –e PAPI_FP_INS./flash3
15
TAU Performance System15 Using ParaProf Not yet available on SiCortex mips nodes (requires Java) For now, tar up.tau. directory and copy to another machine ParaProf can be run on Linux, Windows, or Mac If you don’t have TAU/paraprof, but have Java Web Start, visit http://tau.uoregon.edu/paraprof to run it scx-m23-n6> tar czf flash3.tau.1578.tar.gz flash3.tau.1578 scx-m32-n6> scp flash3.tau.1578.tar.gz somewhere-else: scx-m32-n6> ssh somewhere-else se> tar –xzf flash3.tau.1578.tar.gz se> paraprof flash3.tau.1578
16
TAU Performance System16 ParaProf – Full Profile (FLASH) MPI_Barrier IO routines
17
TAU Performance System17 ParaProf - Statistics Table (Uintah)
18
TAU Performance System18 ParaProf –Callgraph View (MFIX)
19
TAU Performance System19 ParaProf – Histogram View (Miranda) 8k processors 16k processors Scalable 2D displays
20
TAU Performance System20 ParaProf – 3D Full Profile (Miranda) 16k processors
21
TAU Performance System21 ParaProf – 3D Scatterplot (Miranda) Each point is a “thread” of execution Relation between four routines shown at once
22
TAU Performance System22 Tracing (Vampir) - Uintah Trace analysis provides in-depth understanding of temporal event and message passing relationships Traces can even store hardware counters
23
TAU Performance System23 VNG Timeline Display (Miranda on BGL)
24
TAU Performance System24 Thank You TAU should soon be part of the SiCortex standard install Check out: http://tau.uoregon.edu
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.