Grouping Performance Data in TAU
  Profile Groups
    A group of related routines forms a profile group
    Statically defined
      TAU_DEFAULT, TAU_USER[1-5], TAU_MESSAGE, TAU_IO, …
    Dynamically defined
      Group name based on a string: "adlib", "particles"
      Runtime lookup in a map to get a unique group identifier
      tau_instrumentor file.pdb file.cpp -o file.i.cpp -g "particles"
        Assigns all routines in file.cpp to group "particles"
    Ability to change group names at runtime
    Instrumentation control based on profile groups
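The runtime lookup for dynamically defined groups can be pictured with a small sketch. This is not TAU's implementation: the `GroupRegistry` class, the `GroupId` type and the one-bit-per-group encoding are assumptions made only to illustrate how a group name could map to a unique identifier on first use.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for a profile-group registry (not TAU's actual code).
using GroupId = std::uint64_t;

class GroupRegistry {
public:
    // Return the identifier for `name`, creating a new one on first use.
    GroupId lookup(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        // Assumption: one bit per group, so this sketch holds at most 64 groups.
        assert(next_bit_ < 64 && "sketch supports at most 64 groups");
        const GroupId id = GroupId{1} << next_bit_++;
        ids_.emplace(name, id);
        return id;
    }

private:
    std::map<std::string, GroupId> ids_;  // group name -> unique identifier
    unsigned next_bit_ = 0;               // next unused bit position
};
```

Looking up "particles" twice returns the same identifier, while "adlib" and "particles" receive distinct ones.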
TAU Instrumentation Control API
  Enabling Profile Groups
    TAU_ENABLE_INSTRUMENTATION();             // global control
    TAU_ENABLE_GROUP(TAU_GROUP);              // statically defined
    TAU_ENABLE_GROUP_NAME("group name");      // dynamic
    TAU_ENABLE_ALL_GROUPS();                  // for all groups
  Disabling Profile Groups
    TAU_DISABLE_INSTRUMENTATION();
    TAU_DISABLE_GROUP(TAU_GROUP);
    TAU_DISABLE_GROUP_NAME("group name");
    TAU_DISABLE_ALL_GROUPS();
  Obtaining Profile Group Identifier
    TAU_GET_PROFILE_GROUP("group name");
  Runtime Switching of Profile Groups
    TAU_PROFILE_SET_GROUP(TAU_GROUP);
    TAU_PROFILE_SET_GROUP_NAME("group name");
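One way to read the enable and disable calls is as operations on a global switch plus a mask of active groups. The sketch below models that reading in plain C++. The `InstrumentationControl` struct, its member names, and the rule that a routine is measured when any of its groups is active are all assumptions for illustration, not TAU's documented behavior.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of group-based instrumentation control (names are illustrative).
using GroupMask = std::uint64_t;

struct InstrumentationControl {
    bool enabled = true;               // global switch
    GroupMask active = ~GroupMask{0};  // initially every group is enabled

    void enable_group(GroupMask g) {
        assert(g != 0);  // a group must set at least one bit
        active |= g;
    }
    void disable_group(GroupMask g) {
        assert(g != 0);
        active &= ~g;
    }
    void enable_all() { active = ~GroupMask{0}; }
    void disable_all() { active = 0; }

    // Assumption: a routine is measured only if the global switch is on
    // and at least one of the groups it belongs to is active.
    bool should_measure(GroupMask routine_groups) const {
        return enabled && (routine_groups & active) != 0;
    }
};
```

Under this model, disabling one group leaves routines in other groups measured, and turning off the global switch suppresses all measurement regardless of the group mask.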
Disabling Dynamic Profile Group -- Example
  int main(int argc, char **argv) {
    /* Invoke program with --profile field+particles */
    TAU_INIT(&argc, &argv);
    …
  }
  void foo(void) {
    TAU_PROFILE("void foo(void)", " ", TAU_DEFAULT);
    Field f;
    TAU_DISABLE_GROUP_NAME("field");
    // other routines in the "field" dynamic group are affected
    for (int i=0; i<N; i++)
      f.applyrules(i);
  }
TAU Pre-execution Control
  Dynamic groups defined at file scope
  Group names and group associations may be modified at runtime
  Controlling groups at pre-execution time using the --profile option
    % tau_instrumentor app.pdb app.cpp -o app.i.cpp -g "particles"
    % mpirun -np 4 application --profile particles+field+mesh+io
    Enables instrumentation for TAU_DEFAULT and the particles, field, mesh and io groups
  Examples:
    POOMA v1 (LANL)
      Static groups used
    VTF (ASAP Caltech)
      Dynamic execution instrumentation control by a Python-based controller
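The group list given to --profile is a set of names joined by "+". A minimal sketch of splitting such a list is shown here. The helper `parse_profile_groups` is hypothetical, and the slides do not show how TAU itself parses the option.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

// Split a "+"-separated group list such as "particles+field+mesh+io".
// Hypothetical helper for illustration; TAU's own option parsing is not shown in the slides.
std::set<std::string> parse_profile_groups(const std::string& spec) {
    std::set<std::string> groups;
    std::istringstream in(spec);
    std::string name;
    while (std::getline(in, name, '+')) {
        if (!name.empty()) groups.insert(name);  // skip empty fields, e.g. from "a++b"
    }
    return groups;
}
```

For the command line above, the result would hold the four names particles, field, mesh and io; each could then be passed to a call such as TAU_ENABLE_GROUP_NAME.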
Applications of TAU
  POOMA
  PETSc
  SAMRAI
Performance Mapping in TAU: Motivation
  Complexity
    Layered software
    Multi-level instrumentation
    Entities not directly in source
  Mapping
    User-level abstractions
Hypothetical Mapping Example Engine Particles distributed on surfaces of a cube Work packets
Hypothetical Mapping Example Source
  Particle* P[MAX]; /* Array of particles */
  int GenerateParticles() {
    /* distribute particles over all faces of the cube */
    for (int face=0, last=0; face < 6; face++) {
      /* number of particles on this face */
      int particles_on_this_face = num(face);
      /* fill the next particles_on_this_face slots, starting at index last */
      for (int i=last; i < last + particles_on_this_face; i++) {
        /* particle properties are a function of face */
        P[i] = ... f(face); ...
      }
      last += particles_on_this_face;
    }
  }
Hypothetical Mapping Example (continued)
  How much time is spent processing face i particles?
  What is the distribution of performance among faces?
  int ProcessParticle(Particle *p) {
    /* perform some computation on p */
  }
  int main() {
    GenerateParticles(); /* create a list of particles */
    for (int i = 0; i < N; i++) /* iterates over the list */
      ProcessParticle(P[i]);
  }
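To answer the two questions, the cost of each ProcessParticle call has to be attributed to the face the particle belongs to rather than to the routine alone. The sketch below shows the idea with standard C++ timers instead of TAU. The `face` field on `Particle`, the `FaceProfile` struct and the `ProcessAll` function are assumptions introduced for illustration.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <vector>

// Assumption: each particle records the cube face it was generated on.
struct Particle {
    int face;      // 0..5
    double value;  // placeholder particle state
};

// One measurement slot per face, so cost is reported per face
// instead of being lumped into a single ProcessParticle entry.
struct FaceProfile {
    std::array<long, 6> calls{};
    std::array<double, 6> seconds{};
};

void ProcessParticle(Particle& p) {
    p.value = p.value * 2.0 + 1.0;  // stand-in for the real computation
}

FaceProfile ProcessAll(std::vector<Particle>& particles) {
    using Clock = std::chrono::steady_clock;
    FaceProfile prof;
    for (Particle& p : particles) {
        assert(p.face >= 0 && p.face < 6);
        const auto start = Clock::now();
        ProcessParticle(p);
        const std::chrono::duration<double> dt = Clock::now() - start;
        // Map the measurement to the particle's face, not to the routine.
        prof.calls[p.face] += 1;
        prof.seconds[p.face] += dt.count();
    }
    return prof;
}
```

The per-face arrays play the role of the higher-level entities that the mapping targets; a routine-level profile would report only one aggregate figure for ProcessParticle.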
No Performance Mapping versus Mapping
  Typical performance tools report performance with respect to routines
    They do not provide support for mapping
  Performance tools with SEAA mapping can observe performance with respect to the scientist’s programming and problem abstractions
  [Figure panels: without mapping; with mapping]
Semantic Entities/Attributes/Associations
  New dynamic mapping scheme - SEAA
    Entities defined at any level of abstraction
    Attribute entity with semantic information
    Entity-to-entity associations
  Two association types:
    Embedded – extends the data structure of the associated object to store the performance measurement entity
    External – creates an external look-up table using the address of the object as the key to locate the performance measurement entity
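The two association types can be contrasted in a few lines of C++. This is a sketch under stated assumptions: `PerfEntity`, `EmbeddedObject`, `PlainObject` and `ExternalMap` are hypothetical names, and `PerfEntity` merely stands in for a real performance measurement entity such as a timer.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical measurement entity; stands in for a real timer object.
struct PerfEntity {
    std::string name;
    long calls = 0;
};

// Embedded association: the object's data structure is extended
// to carry its measurement entity directly.
struct EmbeddedObject {
    int payload = 0;
    PerfEntity perf;
};

// External association: the object is left unchanged and a look-up table,
// keyed by the object's address, locates the measurement entity.
struct PlainObject {
    int payload = 0;
};

class ExternalMap {
public:
    // Return the entity for `obj`, creating it with `name` on first use.
    PerfEntity& entity_for(const PlainObject* obj, const std::string& name) {
        assert(obj != nullptr);
        auto it = table_.find(obj);
        if (it == table_.end()) {
            PerfEntity fresh;
            fresh.name = name;
            it = table_.emplace(obj, fresh).first;
        }
        return it->second;
    }

private:
    std::unordered_map<const PlainObject*, PerfEntity> table_;
};
```

The embedded form avoids a table look-up on every measurement but requires changing the object's type; the external form works with types that cannot be modified, at the cost of a hash look-up.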
Mapping in POOMA II
  POOMA [LANL] is a C++ framework for computational physics
  Provides high-level abstractions:
    Fields (Arrays), Particles, FFT, etc.
  Encapsulates details of parallelism and data distribution
  Uses custom computation kernels for efficient expression evaluation [PETE]
  Uses vertical execution of array statements to re-use cache [SMARTS]
POOMA II Array Example
  Multi-dimensional array statements
    A=B+C+D;
POOMA, PETE and SMARTS
Using Synchronous Timers
Form of Expression Templates in POOMA
Mapping Problem
  One-to-many upward mapping
  Traditional methods of mapping (amortization/aggregation) lack resolution and accuracy!
  template <class LHS, class RHS, class Op, class EvalTag>
  void ExpressionKernel<LHS,RHS,Op,EvalTag>::run() { /* iterate execution */ }
  A=1.0; B=2.0; …
  A=B+C+D;
  C=E-A+2.0*D;
  ...
POOMA II Mappings
  Each work packet belongs to an ExpressionKernel object
  Each statement’s form is associated with a timer in the constructor of ExpressionKernel
  ExpressionKernel class extended with an embedded timer
  Timing calls at entry and exit of the run() method start and stop the per-object timer
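The scheme on this slide can be sketched without POOMA or TAU. In the sketch below, `ExpressionKernel` is a simplified, non-template stand-in for the POOMA class, and `StatementTimer` and `timer_table` are hypothetical names. The assumption being illustrated is that the constructor binds each kernel object to the timer for its statement form, and that run() charges its time to that timer.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Hypothetical per-statement timer; stands in for a TAU timer.
struct StatementTimer {
    long calls = 0;
    double seconds = 0.0;
};

// One timer per distinct statement form, shared by every kernel
// object created for that statement.
std::map<std::string, StatementTimer>& timer_table() {
    static std::map<std::string, StatementTimer> table;
    return table;
}

// Simplified, non-template stand-in for POOMA's ExpressionKernel.
class ExpressionKernel {
public:
    // The statement's form is associated with a timer in the constructor.
    explicit ExpressionKernel(const std::string& statement_form)
        : timer_(&timer_table()[statement_form]) {}

    void run() {
        assert(timer_ != nullptr);
        using Clock = std::chrono::steady_clock;
        const auto start = Clock::now();  // timing call at entry
        // ... evaluate this work packet ...
        const std::chrono::duration<double> dt = Clock::now() - start;  // timing call at exit
        timer_->calls += 1;
        timer_->seconds += dt.count();
    }

private:
    StatementTimer* timer_;  // embedded association: the object carries its timer
};
```

Two kernels built for "A=B+C+D;" accumulate into the same entry, so the profile reports time per statement even though every kernel executes the same run() routine.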
Results of TAU Mappings Per-statement profile!
POOMA Traces Helps bridge the semantic-gap!
PETSc (ANL)
  Portable, Extensible Toolkit for Scientific Computation
  Suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations
  Uses MPI for inter-process communication
  Instrumentation
    PDT for C/C++ source instrumentation
    MPI wrapper library layer instrumentation
  Example:
    Solves a set of linear equations (Ax=b) in parallel (SLES)
PETSc Linear Equation Solver Profile
PETSc Traces
PETSc Calltree and Communication Matrix
SAMRAI (LLNL)
  Structured Adaptive Mesh Refinement Application Infrastructure
  Instrumentation for TAU:
    PDT-based C++ instrumentation
    MPI wrapper interposition library-based instrumentation
    SAMRAI timers mapped to TAU timers
      TAU’s Mapping API
      Embedded association
Mapping in TAU
  Embedded association vs. external association
  [Figure labels: SAMRAI Timer, Performance Data, Hash Table, TAU Timer]
SAMRAI Euler Gas Dynamics Application (2D)
  Adaptive Mesh Refinement (AMR) application
  A single overarching algorithm object drives the time integration and adaptive gridding processes
  Discrete Euler equations are solved on each patch in the AMR hierarchy
Euler Profile (ComputeFluxesOnPatch)
Euler Profile (Summary)
Euler Profile (Inclusive Time)
Euler Profile (Contribution of Flux computation)
Euler Traces
Euler CallTree (ComputeFluxesOnPatch)
Hands-on session
  On mcurie.nersc.gov, copy files from /usr/local/pkg/acts/tau/tau2/tau-2.9/training
    See the README file
  Set the correct path, e.g.,
    % set path=($path /usr/local/pkg/acts/tau/tau2/tau2.9/t3e/bin)
  Examine the Makefile.
  Type "make" in each directory; then execute the program
  Type "racy" or "vampir"
  Type a project name, e.g., "matrix.pmf", and click OK to see the performance data.
Examples
  The training directory contains example programs that illustrate the use of TAU instrumentation and measurement options.
  instrument - A simple C++ example that shows how TAU's API can be used for manually instrumenting a C++ program. It highlights instrumentation for templates and user-defined events.
  threads - A simple multi-threaded program that shows how the main function of a thread is instrumented. Performance data is generated for each thread of execution. Configure with -pthread.
  cthreads - Same as threads above, but for a C program. An instrumented C program may be compiled with a C compiler, but needs to be linked with a C++ linker. Configure with -pthread.
  pi - An MPI program that calculates the value of pi and e. It highlights the use of TAU's MPI wrapper library. TAU needs to be configured with -mpiinc= and -mpilib=. Run using mpirun -np cpi.
  papi - A matrix multiply example that shows how to use TAU statement-level timers for comparing the performance of two algorithms for matrix multiplication. When used with PAPI or PCL, this can highlight the cache behaviors of these algorithms. TAU should be configured with -papi= or -pcl=, and the user should set the respective PAPI_EVENT or PCL_EVENT environment variable to use this.
Examples - (cont.)
  papithreads - Same as papi, but uses threads to highlight how hardware performance counters may be used in a multi-threaded application. When used with PAPI, TAU should be configured with -papi= -pthread.
  autoinstrument - Shows the use of the Program Database Toolkit (PDT) for automating the insertion of TAU macros in the source code. It requires configuring TAU with the -pdt= option. The Makefile is modified to illustrate the use of a source-to-source translator (tau_instrumentor).
  NPB2.3 - The NAS Parallel Benchmark 2.3 [from NASA Ames]. It shows how to use TAU's MPI wrapper with a manually instrumented Fortran program. LU and SP are the two benchmarks. LU is instrumented completely, while only parts of the SP program are instrumented to contrast the coverage of routines. In both cases MPI-level instrumentation is complete. TAU needs to be configured with -mpiinc= and -mpilib= to use this.