Using TAU on SiCortex

Alan Morris, Aroon Nataraj, Sameer Shende, Allen D. Malony
University of Oregon
{amorris, anataraj, sameer, malony}@cs.uoregon.edu

Outline

- What is TAU?
- Instrumentation
- Measurement
- Invoking on SiCortex - tauex
- Demo
  - sweep3D
- Visualizing performance results

TAU Performance System

- Tuning and Analysis Utilities (13+ year project effort)
- Performance measurement framework for HPC systems
  - Portable, scalable, flexible, and parallel
- Integrated toolkit for performance problem solving
  - Automatic instrumentation
  - Highly configurable measurement system with support for many flavors of profiling and tracing
  - Portable analysis and visualization tools
  - Performance data management and data mining
- http://tau.uoregon.edu

TAU Instrumentation Approach

- Support for standard program events
  - Routines
  - Classes and templates
  - Finer-grain: loop-level
- Support for user-defined events
  - Begin/End events ("user-defined timers")
  - Atomic events (e.g., size of memory allocated/freed)
- Support for event groups
  - Selective examination of performance data
  - Runtime disabling of groups
- Instrumentation optimization
  - Selective instrumentation of events (instrument only what is needed)
  - tau_reduce - generate selective instrumentation file
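The difference between the two kinds of user-defined events can be sketched in a few lines of Python. This is an illustration of the concepts only, not the TAU API: the names `user_timer`, `trigger_atomic_event`, and `atomic_summary`, and the in-memory dictionaries, are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory stores, used only for this illustration.
timers = defaultdict(lambda: {"calls": 0, "seconds": 0.0})
atomic_events = defaultdict(list)


@contextmanager
def user_timer(name):
    """Begin/End event: accumulate wall-clock time between entry and exit."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timers[name]["calls"] += 1
        timers[name]["seconds"] += time.perf_counter() - start


def trigger_atomic_event(name, value):
    """Atomic event: record a single sample value; it has no duration."""
    atomic_events[name].append(value)


def atomic_summary(name):
    """Summarize the samples recorded for one atomic event."""
    samples = atomic_events[name]
    return {
        "count": len(samples),
        "min": min(samples),
        "max": max(samples),
        "mean": sum(samples) / len(samples),
    }


# One timed region that triggers an atomic event for every allocation.
with user_timer("allocate_buffers"):
    for size in (1024, 4096, 512):
        buffer = bytearray(size)
        trigger_atomic_event("bytes allocated", size)

print(timers["allocate_buffers"]["calls"])
print(atomic_summary("bytes allocated"))
```

A timer answers "how long did this region take, and how often did it run", while an atomic event answers "what values were observed at this point", which is why the two are summarized differently.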

TAU Instrumentation

- Flexible instrumentation mechanisms at multiple levels
  - Source code
    - manual (TAU API, CCA TAU Component API)
    - automatic
      - C, C++, F77/90/95 (Program Database Toolkit (PDT))
      - OpenMP (directive rewriting (Opari), POMP spec)
  - Library level
    - pre-instrumented libraries (e.g., MPI using PMPI)
    - statically-linked and dynamically-linked
  - Executable code
    - dynamic instrumentation (pre-execution) (DynInstAPI)
    - virtual machine instrumentation (e.g., Java using JVMPI)
    - Runtime Linking (LD_PRELOAD)

Automatic Instrumentation

- We provide compiler wrapper scripts
  - Simply replace mpif90 with tauf90
  - Automatically instruments Fortran source code and links with the TAU MPI wrapper libraries
  - Use taucc and taucxx for C/C++

Before:

    CXX = mpicxx
    F90 = mpif90
    CFLAGS =
    LIBS = -lm
    OBJS = f1.o f2.o f3.o … fn.o

    app: $(OBJS)
        $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)

    # Compile each C++ source with $(CXX)
    .cpp.o:
        $(CXX) $(CFLAGS) -c $<

After:

    CXX = taucxx
    F90 = tauf90
    CFLAGS =
    LIBS = -lm
    OBJS = f1.o f2.o f3.o … fn.o

    app: $(OBJS)
        $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)

    # Compile each C++ source with $(CXX) so the TAU wrapper instruments it
    .cpp.o:
        $(CXX) $(CFLAGS) -c $<

Measurement Options

- Flat profiles
  - Time (or counts) spent in each routine (nodes in callgraph)
  - Exclusive/inclusive time, number of calls, child calls
  - Support for hardware counters (PAPI), multiple counters
- Callpath profiles
  - Flat profiles, plus
  - Time spent along a calling path (edges in callgraph)
  - E.g., "main => f1 => f2 => MPI_Send" shows the time spent in MPI_Send when called by f2, when f2 is called by f1, when f1 is called by main
  - Configurable callpath depth limit (TAU_CALLPATH_DEPTH environment variable)
- Tracing
  - VampirTrace
  - TAU trace format; converters: tau2slog2, tau2vtf, tau2otf
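How flat and callpath profiles relate can be sketched with a toy event stream. The code below is a minimal illustration, not TAU's implementation: the function `build_profiles`, the `(kind, routine, timestamp)` event format, and the timestamps are all assumptions, and the `callpath_depth` parameter only mimics the idea behind TAU_CALLPATH_DEPTH. Recursive routines are not handled.

```python
from collections import defaultdict


def build_profiles(events, callpath_depth=2):
    """Build flat and callpath profiles from enter/exit events.

    events: iterable of (kind, routine, timestamp), kind is "enter" or "exit".
    Returns (flat, callpath). flat maps routine -> calls/inclusive/exclusive;
    callpath maps "a => b" style keys -> inclusive time along that path.
    """
    flat = defaultdict(lambda: {"calls": 0, "inclusive": 0, "exclusive": 0})
    callpath = defaultdict(int)
    stack = []  # each frame: [routine, enter_time, time_spent_in_children]

    for kind, routine, timestamp in events:
        if kind == "enter":
            stack.append([routine, timestamp, 0])
            continue
        name, start, child_time = stack.pop()
        if name != routine:
            raise ValueError(f"unbalanced events: exit {routine}, expected {name}")
        inclusive = timestamp - start
        flat[name]["calls"] += 1
        flat[name]["inclusive"] += inclusive
        # Exclusive time is what remains after subtracting the children.
        flat[name]["exclusive"] += inclusive - child_time
        if stack:
            stack[-1][2] += inclusive
        # Keep only the last `callpath_depth` routines of the calling path.
        path = [frame[0] for frame in stack] + [name]
        callpath[" => ".join(path[-callpath_depth:])] += inclusive

    return dict(flat), dict(callpath)


# Toy trace for main => f1 => f2 => MPI_Send (timestamps in arbitrary units).
trace = [
    ("enter", "main", 0), ("enter", "f1", 1), ("enter", "f2", 2),
    ("enter", "MPI_Send", 3), ("exit", "MPI_Send", 7),
    ("exit", "f2", 8), ("exit", "f1", 9), ("exit", "main", 10),
]

flat, paths = build_profiles(trace, callpath_depth=2)
print(flat["f2"])               # inclusive 6, exclusive 2
print(paths["f2 => MPI_Send"])  # 4
```

With `callpath_depth=4` the same trace yields the full key "main => f1 => f2 => MPI_Send"; a smaller depth merges paths that share the same tail, which keeps the profile compact.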

Running Applications with TAU on SiCortex

- New tool, tauex

    Usage: tauex [options] [--]
    Options:
      -d: Enable debugging output, use repeatedly for more output.
      -h: Print this message.
      -i: Print information about the host machine.
      -s: Dump the shell environment variables and exit.
      -U: User mode counts
      -K: Kernel mode counts
      -S: Supervisor mode counts
      -I: Interrupt mode counts
      -l: List events
      -L : Describe event
      -a: Count all native events (implies -m)
      -m: Multiple runs (enough runs of exe to gather all events)
      -e : Specify PAPI preset or native event
      -T : Specify TAU option
      -v: Debug/Verbose mode
      -XrunTAU- : Specify TAU library directly

Demo

- Application: Sweep3D
- Build standard sweep3d application (un-instrumented)
- Run standard (un-instrumented) sweep3d using tauex
  - Provides MPI-only profiles
- Build sweep3d with automatic TAU instrumentor
- Run instrumented sweep3d with tauex
  - Provides application-level and MPI events

Using tauex

- Use with uninstrumented executable for MPI profiling and tracing

    $ mpif90 ring.f90 -o ring
    $ srun -pscx -n4 tauex ./ring
    $ cd ring.tau.1669/MULTI__P_WALL_CLOCK_TIME
    $ pprof
    …
    FUNCTION SUMMARY (mean):
    ---------------------------------------------------------------------------------------
    %Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
                  msec   total msec                          usec/call
    ---------------------------------------------------------------------------------------
    100.0            3          338           1           8     338086 .TAU application
     59.5          201          201           1           0     201180 MPI_Init()
     37.2          125          125           1           0     125889 MPI_Finalize()
      2.1            6            6           1           0       6971 MPI_Barrier()
      0.2        0.626        0.626           1           0        626 MPI_Recv()
      0.1        0.247        0.247           1           0        247 MPI_Bcast()
      0.0        0.102        0.102           1           0        102 MPI_Send()
      0.0      0.00875      0.00875           1           0          9 MPI_Comm_size()
      0.0      0.00525      0.00525           1           0          5 MPI_Comm_rank()
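The relation between the exclusive and inclusive columns can be checked from the numbers in this summary. The sketch below parses the rows shown on the slide and recomputes the exclusive time of ".TAU application" as its inclusive time minus the inclusive time of the eight MPI routines. It assumes that those eight routines are exactly the eight subroutines of ".TAU application", which the #Subrs column suggests but the slide does not state; `parse_row` is a hypothetical helper that handles only the plain number formats appearing here.

```python
# Rows copied from the pprof function summary for the ring example.
ROWS = """\
100.0 3 338 1 8 338086 .TAU application
59.5 201 201 1 0 201180 MPI_Init()
37.2 125 125 1 0 125889 MPI_Finalize()
2.1 6 6 1 0 6971 MPI_Barrier()
0.2 0.626 0.626 1 0 626 MPI_Recv()
0.1 0.247 0.247 1 0 247 MPI_Bcast()
0.0 0.102 0.102 1 0 102 MPI_Send()
0.0 0.00875 0.00875 1 0 9 MPI_Comm_size()
0.0 0.00525 0.00525 1 0 5 MPI_Comm_rank()
"""


def parse_row(line):
    """Split one summary row into six numeric columns and the routine name."""
    pct, excl, incl, calls, subrs, usec_per_call, name = line.split(None, 6)
    return {
        "percent": float(pct),
        "exclusive_msec": float(excl),
        "inclusive_msec": float(incl),
        "calls": int(calls),
        "subrs": int(subrs),
        "usec_per_call": int(usec_per_call),
        "name": name.strip(),
    }


rows = [parse_row(line) for line in ROWS.splitlines()]
top, children = rows[0], rows[1:]

# Inclusive time in microseconds = usec/call * number of calls.
children_usec = sum(r["usec_per_call"] * r["calls"] for r in children)
top_exclusive_usec = top["usec_per_call"] * top["calls"] - children_usec

print(top["name"], top_exclusive_usec)  # .TAU application 3057
```

The result, 3057 microseconds, matches the 3 msec shown in the Exclusive column for ".TAU application" after rounding.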

Using TAU compiler wrappers

    $ tauf90 ring.f90 -o ring
    # verbose output shows each step

    Debug: Parsing with PDT Parser
    Executing> /usr/share/PDT/mips/bin/f95parse ring.f90 -I/home/amorris/usr/include

    Debug: Instrumenting with TAU
    Executing> /usr/bin/tau_instrumentor ring.pdb ring.f90 -o ring.inst.f90

    Debug: Compiling (Individually) with Instrumented Code
    Executing> pathf95 -mabi=64 -I. -c ring.inst.f90 -o ring.o

    Debug: Linking (Together) object files
    Executing> pathf95 ring.o -mabi=64 -lTAU -Wl,-rpath -lpfm -lpapi -lpfm -lpthread -L/usr/lib/gcc/mips64el-gentoo-linux-gnu/4.1.2/ -lstdc++ -lgcc_s -lscmpi -o ring

    Debug: cleaning inst file
    Executing> /bin/rm -f ring.inst.f90

    Debug: cleaning PDB file
    Executing> /bin/rm -f ring.pdb

Using tauex

- Use with instrumented executables for Application+MPI profiling and tracing

    $ tau_f90.sh ring.f90 -o ring
    $ srun -pscx -n4 tauex -e CPU_CYCLES ./ring
    $ cd ring.tau.1674/MULTI__CPU_CYCLES
    $ pprof
    FUNCTION SUMMARY (mean):
    ---------------------------------------------------------------------------------------
    %Time    Exclusive    Inclusive       #Call      #Subrs Count/Call Name
                counts total counts
    ---------------------------------------------------------------------------------------
    100.0    4.552E+05    2.557E+07           1           5   25568675 MAIN
     81.1    2.074E+07    2.074E+07           1           0   20742874 MPI_Init()
     16.3    1.419E+05    4.176E+06           1           4    4176389 FUNC
     14.2    3.639E+06    3.639E+06           1           0    3638847 MPI_Barrier()
      0.8    2.091E+05    2.091E+05           1           0     209063 MPI_Recv()
      0.7    1.875E+05    1.875E+05           1           0     187542 MPI_Finalize()
      0.6    1.539E+05    1.539E+05           1           0     153880 MPI_Bcast()
      0.1     3.27E+04     3.27E+04           1           0      32702 MPI_Send()
      0.0         4337         4337           1           0       4337 MPI_Comm_size()
      0.0         2308         2308           1           0       2308 MPI_Comm_rank()
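The %Time column in this summary adds up to more than 100 because it appears to be computed from inclusive counts: FUNC's 16.3% already contains the MPI calls made inside it. The short sketch below recomputes the column from the Count/Call values on the slide (every routine is called once, so Count/Call equals the inclusive count). The interpretation of %Time as "inclusive count divided by the inclusive count of MAIN" is an assumption that the numbers happen to confirm.

```python
# (routine, inclusive count) pairs taken from the Count/Call column;
# each routine is called once, so Count/Call equals the inclusive count.
INCLUSIVE_COUNTS = [
    ("MAIN", 25568675),
    ("MPI_Init()", 20742874),
    ("FUNC", 4176389),
    ("MPI_Barrier()", 3638847),
    ("MPI_Recv()", 209063),
    ("MPI_Finalize()", 187542),
    ("MPI_Bcast()", 153880),
    ("MPI_Send()", 32702),
    ("MPI_Comm_size()", 4337),
    ("MPI_Comm_rank()", 2308),
]


def percent_of_total(counts):
    """Return routine -> inclusive share of the top-level routine, in percent."""
    total = counts[0][1]  # the first entry is the top-level routine
    return {name: round(100.0 * value / total, 1) for name, value in counts}


percent = percent_of_total(INCLUSIVE_COUNTS)
for name, value in percent.items():
    print(f"{value:5.1f}  {name}")
```

The recomputed values match the slide (for example 81.1 for MPI_Init(), 16.3 for FUNC, and 14.2 for MPI_Barrier()).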

Using tauex

- Other typical usage scenarios

    # floating point instruction counts and time (compute derived
    # FLOPS in ParaProf)
    $ tauex -e P_WALL_CLOCK_TIME -e PAPI_FP_INS

    # Generate callpath profiles
    $ tauex -e PAPI_FP_INS -T callpath

    # Generate OTF traces for Vampir
    $ tauex -e PAPI_FP_INS -T vampirtrace

    # Generate Epilog traces for Kojak
    $ tauex -T epilog
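The derived FLOPS metric mentioned in the first scenario is a ratio of the two measured metrics. The sketch below shows the arithmetic under two assumptions that the slides do not state: that the time metric is reported in microseconds, and that the input numbers (which are invented for illustration) come from the same routine and thread. Since PAPI_FP_INS counts floating point instructions, the result is strictly a rate of floating point instructions per second. The function name `derived_flops` is hypothetical.

```python
def derived_flops(fp_ins, wall_clock_usec):
    """Floating point instructions per second.

    fp_ins: PAPI_FP_INS count for a routine.
    wall_clock_usec: time for the same routine, assumed to be in microseconds.
    """
    if wall_clock_usec <= 0:
        raise ValueError("wall-clock time must be positive")
    return fp_ins / wall_clock_usec * 1e6


# Invented example values, not measurements from the slides.
example_fp_ins = 2.0e9
example_time_usec = 4.0e6  # 4 seconds

rate = derived_flops(example_fp_ins, example_time_usec)
print(f"{rate / 1e6:.1f} MFLOPS")
```

Collecting both metrics in a single run, as the command line above does, matters because the ratio is only meaningful when the count and the time describe the same execution.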

Using TAU with FLASH on SiCortex

- To use TAU with FLASH on SiCortex platforms, simply specify -tau= in setup

    # On Full Disclosure:
    $ ./setup Sedov -2d -auto -site=sicortex -objdir=tau -tau=/home/amorris/usr/share/TAU/64/Makefile.tau-multiplecounters-pathcc-mpi-papi-pdt

    # Build
    $ cd tau ; make

    # Run (using tauex)
    # This will generate callpath profiles with time and floating point instruction metrics
    $ srun -pscx -n16 tauex -T callpath -e P_WALL_CLOCK_TIME -e PAPI_FP_INS ./flash3

Using ParaProf

- Not yet available on SiCortex MIPS nodes (requires Java)
- For now, tar up the .tau. directory (e.g., flash3.tau.1578) and copy it to another machine
- ParaProf can be run on Linux, Windows, or Mac
- If you don't have TAU/paraprof, but have Java Web Start, visit http://tau.uoregon.edu/paraprof to run it

    scx-m23-n6> tar czf flash3.tau.1578.tar.gz flash3.tau.1578
    scx-m32-n6> scp flash3.tau.1578.tar.gz somewhere-else:
    scx-m32-n6> ssh somewhere-else
    se> tar -xzf flash3.tau.1578.tar.gz
    se> paraprof flash3.tau.1578

ParaProf - Full Profile (FLASH)

[Figure: full profile display, with callouts marking MPI_Barrier and IO routines]

ParaProf - Statistics Table (Uintah)

ParaProf - Callgraph View (MFIX)

ParaProf - Histogram View (Miranda)

- Scalable 2D displays

[Figure: histogram views for 8k processors and 16k processors]

ParaProf - 3D Full Profile (Miranda)

[Figure: 3D full profile, 16k processors]

ParaProf - 3D Scatterplot (Miranda)

- Each point is a "thread" of execution
- Relation between four routines shown at once

Tracing (Vampir) - Uintah

- Trace analysis provides in-depth understanding of temporal event and message passing relationships
- Traces can even store hardware counters

VNG Timeline Display (Miranda on BGL)

Thank You

- TAU should soon be part of the SiCortex standard install
- Check out: http://tau.uoregon.edu
