Recent Advances in the TAU Performance System Sameer Shende, Allen D. Malony University of Oregon

TAU Performance System Framework  Tuning and Analysis Utilities  Performance system framework for scalable parallel and distributed high-performance computing  Targets a general complex system computation model  nodes / contexts / threads  Multi-level: system / software / parallelism  Measurement and analysis abstraction  Integrated toolkit for performance instrumentation, measurement, analysis, and visualization  Portable, configurable performance profiling/tracing facility  Open software approach  University of Oregon, LANL, FZJ Germany

TAU Performance System Architecture (architecture diagram; trace output supports the EPILOG and Paraver formats)

Program Database Toolkit (PDT)  Program code analysis framework for developing source-based tools  High-level interface to source code information  Integrated toolkit for source code parsing, database creation, and database query  commercial-grade front-end parsers  portable IL analyzer, database format, and access API  open software approach for tool development  Target and integrate multiple source languages  Used in TAU to build automated performance instrumentation tools

PDT Architecture and Tools (diagram: C/C++ and Fortran 77/90 front ends feeding the IL analyzer and the program database)

PDT Components  Language front end  Edison Design Group (EDG): C (C99), C++  Mutek Solutions Ltd.: F77, F90  creates an intermediate-language (IL) tree  IL Analyzer  processes the intermediate language (IL) tree  creates “program database” (PDB) formatted file  DUCTAPE  C++ program Database Utilities and Conversion Tools APplication Environment  processes and merges PDB files  C++ library to access the PDB for PDT applications
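PDT-based tools query the PDB through the DUCTAPE C++ library. The following is a minimal sketch of that pattern; the class and accessor names used here (PDB, getCRoutineVec, fullName) are assumptions about DUCTAPE's interface and should be checked against pdbAll.h in the PDT distribution.

// Minimal sketch of a PDT/DUCTAPE client that lists the C/C++ routines recorded
// in a PDB file. Class and accessor names are assumptions; see pdbAll.h for the
// exact interface shipped with PDT.
#include "pdbAll.h"
#include <iostream>

int main(int argc, char** argv) {
  if (argc < 2) { std::cerr << "usage: listroutines file.pdb\n"; return 1; }
  PDB pdb(argv[1]);                 // parse the program database file
  if (!pdb) return 1;               // bail out on a parse error

  const PDB::croutinevec& routines = pdb.getCRoutineVec();
  for (PDB::croutinevec::const_iterator it = routines.begin();
       it != routines.end(); ++it) {
    std::cout << (*it)->fullName() << std::endl;   // full signature of each routine
  }
  return 0;
}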

New Features in TAU  Instrumentation  OPARI – OpenMP directive rewriting approach [POMP, FZJ]  Selective instrumentation – grouping, include/exclude lists  tau_reduce – rule-based detection of high-overhead lightweight routines  Measurement  PAPI [UTK] – support for multiple hardware counters/time  Callpath profiling (1-level)  Native generation of EPILOG traces [EXPERT, FZJ]  Analysis  Support for the Paraver [CEPBA] trace visualizer  jracy – new Java-based profile browser in TAU  Availability  New platforms and compilers supported (NEC, Hitachi, Intel)

TAU Instrumentation  Flexible instrumentation mechanisms at multiple levels  Source code  manual  automatic using Program Database Toolkit (PDT), OPARI (for OpenMP programs)  Object code  pre-instrumented libraries (e.g., MPI using PMPI)  statically linked  dynamically linked (e.g., Virtual machine instrumentation)  fast breakpoints (compiler generated)  Executable code  dynamic instrumentation (pre-execution) using DynInstAPI
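As a concrete illustration of manual source-level instrumentation, here is a minimal sketch using TAU's C/C++ profiling macros; the routine, loop, and timer names are illustrative only.

// Minimal manual instrumentation sketch with TAU's profiling macros; compile with
// a TAU-configured compiler or with TAU's include/library paths. The work loop is
// a placeholder.
#include <TAU.h>

double compute(int n) {
  TAU_PROFILE("compute", "double (int)", TAU_USER);   // scope timer in group TAU_USER
  double s = 0.0;
  for (int i = 0; i < n; ++i) s += 0.5 * i;
  return s;
}

int main(int argc, char** argv) {
  TAU_PROFILE("main", "int (int, char **)", TAU_DEFAULT);
  TAU_PROFILE_SET_NODE(0);                             // required for non-MPI programs
  compute(1000000);
  return 0;                                            // profile files written at exit
}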

Instrumentation of OpenMP Constructs  OPARI  OpenMP Pragma And Region Instrumentor  Source-to-Source translator to insert POMP calls around OpenMP constructs and API functions  Done: Supports  Fortran77 and Fortran90, OpenMP 2.0  C and C++, OpenMP 1.0  POMP Extensions  EPILOG and TAU POMP implementations  Preserves source code information ( #line line file )  Work in Progress: Investigating standardization through OpenMP Forum

OpenMP API Instrumentation  Transform  omp_#_lock() → pomp_#_lock()  omp_#_nest_lock() → pomp_#_nest_lock()  [# = init | destroy | set | unset | test]  POMP version  calls the omp version internally  can perform additional measurement work before and after the call
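A minimal sketch of what such a wrapper can look like; start_timer/stop_timer are placeholders for the measurement hooks, not actual POMP library calls.

// Sketch of a POMP-style lock wrapper: OPARI rewrites omp_set_lock() calls to
// pomp_set_lock(), which brackets the real OpenMP routine with measurement hooks.
#include <omp.h>

static void start_timer(const char*) { /* e.g., record an entry timestamp */ }
static void stop_timer(const char*)  { /* e.g., record an exit timestamp  */ }

void pomp_set_lock(omp_lock_t* lock) {
  start_timer("omp_set_lock");   // extra work before the call
  omp_set_lock(lock);            // the real OpenMP runtime routine
  stop_timer("omp_set_lock");    // extra work after the call
}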

Example: !$OMP PARALLEL DO Instrumentation

Original construct:
!$OMP PARALLEL DO clauses...
      do loop
!$OMP END PARALLEL DO

After OPARI rewriting (POMP calls inserted):
      call pomp_parallel_fork(d)
!$OMP PARALLEL other-clauses...
      call pomp_parallel_begin(d)
      call pomp_do_enter(d)
!$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
      do loop
!$OMP END DO NOWAIT
      call pomp_do_exit(d)
      call pomp_barrier_enter(d)
!$OMP BARRIER
      call pomp_barrier_exit(d)
      call pomp_parallel_end(d)
!$OMP END PARALLEL
      call pomp_parallel_join(d)

Opari Instrumentation: Example  OpenMP directive instrumentation  Ocean Current Circulation [Tim Kaiser, SDSC]

pomp_for_enter(&omp_rd_2);
#line 252 "stommel.c"
#pragma omp for schedule(static) reduction(+: diff) private(j) firstprivate (a1,a2,a3,a4,a5) nowait
for( i=i1;i<=i2;i++) {
  for(j=j1;j<=j2;j++){
    new_psi[i][j]=a1*psi[i+1][j] + a2*psi[i-1][j]
                + a3*psi[i][j+1] + a4*psi[i][j-1]
                - a5*the_for[i][j];
    diff=diff+fabs(new_psi[i][j]-psi[i][j]);
  }
}
pomp_barrier_enter(&omp_rd_2);
#pragma omp barrier
pomp_barrier_exit(&omp_rd_2);
pomp_for_exit(&omp_rd_2);
#line 261 "stommel.c"

OPARI: Basic Usage (f90)
 Reset OPARI state information
    rm -f opari.rc
 Call OPARI for each input source file
    opari file1.f90
    ...
    opari fileN.f90
 Generate the OPARI runtime table and compile it with an ANSI C compiler
    opari -table opari.tab.c
    cc -c opari.tab.c
 Compile the modified files *.mod.f90 using OpenMP
 Link the resulting object files, the OPARI runtime table opari.tab.o, and the TAU POMP RTL

OPARI: Makefile Template (C/C++)

OMPCC  = ...   # insert C OpenMP compiler here
OMPCXX = ...   # insert C++ OpenMP compiler here

.c.o:
        opari $<
        $(OMPCC) $(CFLAGS) -c $*.mod.c

.cc.o:
        opari $<
        $(OMPCXX) $(CXXFLAGS) -c $*.mod.cc

opari.init:
        rm -rf opari.rc

opari.tab.o:
        opari -table opari.tab.c
        $(CC) -c opari.tab.c

myprog: opari.init myfile*.o ... opari.tab.o
        $(OMPCC) -o myprog myfile*.o opari.tab.o -lpomp

myfile1.o: myfile1.c myheader.h
myfile2.o: ...

Tracing Hybrid Executions – TAU and Vampir

Profiling Hybrid Executions

Instrumentation Control  Selection of which performance events to observe  Could depend on scope, type, level of interest  Could depend on instrumentation overhead  How is selection supported in instrumentation system?  No choice  Include / exclude lists (TAU)  Environment variables  Static vs. dynamic  Problem: Controlling instrumentation of small routines  High relative measurement overhead  Significant intrusion and possible perturbation

Instrumentation Control: Grouping  Profile Groups  A group of related routines forms a profile group  Statically defined  TAU_DEFAULT, TAU_USER[1-5], TAU_MESSAGE, TAU_IO, …  Dynamically defined  group name based on a string (“integrator”, “particles”)  runtime lookup in a map to get a unique group identifier  tau_instrumentor file.pdb file.cpp -o file.i.cpp -g “particles” assigns all routines in file.cpp to the group “particles”  Ability to change group names at runtime  Instrumentation control based on profile groups

TAU Instrumentation Control API  Enabling Profile Groups  TAU_ENABLE_INSTRUMENTATION(); // Global control  TAU_ENABLE_GROUP(TAU_GROUP); // statically defined  TAU_ENABLE_GROUP_NAME(“group name”); // dynamic  TAU_ENABLE_ALL_GROUPS(); // for all groups  Disabling Profile Groups  TAU_DISABLE_INSTRUMENTATION();  TAU_DISABLE_GROUP(TAU_GROUP);  TAU_DISABLE_GROUP_NAME();  TAU_DISABLE_ALL_GROUPS();  Obtaining Profile Group Identifier  TAU_GET_PROFILE_GROUP(“group name”);  Runtime Switching of Profile Groups  TAU_PROFILE_SET_GROUP(TAU_GROUP);  TAU_PROFILE_SET_GROUP_NAME(“group name”);
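A short sketch of how these macros might be combined in an application; the routines and the group name "particles" are illustrative.

// Sketch combining the group-control macros listed above: disable everything,
// then re-enable only the statically defined I/O group and one dynamically
// named group before entering the phase of interest.
#include <TAU.h>

void io_phase() {
  TAU_PROFILE("io_phase", "void ()", TAU_IO);
  /* ... file I/O ... */
}

int main(int argc, char** argv) {
  TAU_PROFILE("main", "int (int, char **)", TAU_DEFAULT);
  TAU_PROFILE_SET_NODE(0);

  TAU_DISABLE_ALL_GROUPS();             // turn off all profile groups
  TAU_ENABLE_GROUP(TAU_IO);             // statically defined group
  TAU_ENABLE_GROUP_NAME("particles");   // dynamically defined group

  io_phase();                           // profiled, since TAU_IO is enabled
  return 0;
}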

TAU Pre-execution Instrumentation Control  Dynamic groups defined at file scope  Group names and group associations may be modified at runtime  Controlling groups at pre-execution time using the --profile option
% tau_instrumentor app.pdb app.cpp -o app.i.cpp -g "particles"
% mpirun -np 4 application --profile particles+field+mesh+io
 Enables instrumentation for TAU_DEFAULT and the particles, field, mesh, and io groups  Examples:  POOMA v1 (LANL)  static groups used  VTF (ASAP Caltech)  dynamic execution instrumentation control by a Python-based controller

Selective Instrumentation: Include/Exclude Lists

% tau_instrumentor
Usage : tau_instrumentor <pdbfile> <sourcefile> [-o <outputfile>] [-noinline] [-g groupname] [-i headerfile] [-c|-c++|-fortran] [-f <selective instrumentation file>]

For selective instrumentation, use the -f option:

% cat selective.dat
# Selective instrumentation: Specify an exclude/include list.
BEGIN_EXCLUDE_LIST
void quicksort(int *, int, int)
void sort_5elements(int *)
void interchange(int *, int *)
END_EXCLUDE_LIST
# If an include list is specified, the routines in the list will be the only
# routines that are instrumented.
# To specify an include list (a list of routines that will be instrumented)
# remove the leading # to uncomment the following lines
#BEGIN_INCLUDE_LIST
#int main(int, char **)
#int select_
#END_INCLUDE_LIST

Rule-Based Overhead Analysis (N. Trebon, UO)  Analyze the performance data to determine which events have high relative measurement overhead  Create a select list that excludes those events from instrumentation  Rule grammar (used in the tau_reduce tool): [GroupName:] Field Operator Number  GroupName indicates the rule applies only to events in that group  Field is an event metric attribute (from profile statistics)  numcalls, numsubs, percent, usec, cumusec, count [PAPI], totalcount, stdev, usecs/call, counts/call  Operator is one of >, <, or =  Number is any number  Compound rules are possible using & between simple rules

Example Rules
# Exclude all events that are members of TAU_USER
# and use less than 1000 microseconds
TAU_USER:usec < 1000

# Exclude all events that have less than 1000
# microseconds and are called only once
usec < 1000 & numcalls = 1

# Exclude all events that have less than 1000 usecs per
# call OR have a (total inclusive) percent less than 5
usecs/call < 1000
percent < 5

# Scientific notation can be used
usec > 1000 & numcalls > 400000 & usecs/call < 30 & percent > 25
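To make the rule semantics concrete, here is a small illustrative sketch (not the tau_reduce implementation) of evaluating one simple rule against an event's profile statistics; the struct and function names are hypothetical.

// Illustrative sketch of evaluating a single "Field Operator Number" rule against
// one event's profile statistics.
#include <map>
#include <string>

struct EventStats {
  std::map<std::string, double> fields;   // e.g., {"usec": 850.0, "numcalls": 1}
};

bool matchesRule(const EventStats& e, const std::string& field,
                 char op, double number) {
  double value = e.fields.at(field);
  switch (op) {
    case '<': return value < number;
    case '>': return value > number;
    case '=': return value == number;
    default:  return false;
  }
}
// Events matching an exclude rule would be written to a select list that is then
// passed to tau_instrumentor with -f.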

TAU Measurement: Integration with PAPI  Uniform access to hardware performance counters, wall-clock time, and process virtual time  PAPI (Performance API) [UTK, Ptools Consortium]  consistent, portable API  Support for measuring multiple counter/timing metrics (up to 25) in a single experiment  -MULTIPLECOUNTERS TAU configuration option
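For reference, a minimal stand-alone sketch of PAPI's classic high-level counter interface, the kind of access TAU's -papi configuration wraps; TAU users normally choose counters through environment variables (see the setup slide below) rather than calling PAPI directly.

// Stand-alone sketch using PAPI's classic high-level API; error checking is
// omitted for brevity and the counter choice is arbitrary.
#include <papi.h>
#include <cstdio>

int main() {
  int events[2] = { PAPI_FP_INS, PAPI_L1_DCM };   // FP instructions, L1 D-cache misses
  long long values[2];

  PAPI_library_init(PAPI_VER_CURRENT);
  PAPI_start_counters(events, 2);                 // start counting both events
  /* ... code region of interest ... */
  PAPI_stop_counters(values, 2);                  // stop and read the counts

  std::printf("PAPI_FP_INS = %lld, PAPI_L1_DCM = %lld\n", values[0], values[1]);
  return 0;
}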

TAU Measurement System Configuration
configure [OPTIONS]
  {-c++=<CC>, -cc=<cc>}             Specify C++ and C compilers
  {-pthread, -sproc}                Use pthread or SGI sproc threads
  -openmp                           Use OpenMP threads
  -opari=<dir>                      Specify location of the Opari OpenMP tool
  -papi=<dir>                       Specify location of PAPI
  -pdt=<dir>                        Specify location of PDT
  {-mpiinc=<dir>, -mpilib=<dir>}    Specify MPI library instrumentation
  -TRACE                            Generate TAU event traces
  -PROFILE                          Generate TAU profiles
  -PROFILECALLPATH                  Generate callpath profiles (1-level)
  -MULTIPLECOUNTERS                 Use more than one hardware counter
  -CPUTIME                          Use user + system time
  -PAPIWALLCLOCK                    Use PAPI to access wallclock time
  -PAPIVIRTUAL                      Use PAPI for virtual (user) time
  …

TAU Measurement Configuration – Examples
./configure -c++=xlC -cc=xlc -pdt=/usr/packages/pdtoolkit-2.1 -papi=/usr/packages/papi -mpi
 Use TAU with IBM's xlC compiler, PDT, PAPI, and MPI
 Enable TAU profiling (default)
./configure -TRACE -PROFILE
 Enable both TAU profiling and tracing
./configure -c++=CC -cc=cc -MULTIPLECOUNTERS -papi=/usr/local/packages/papi -opari=/usr/local/opari-pomp-1.1 -mpiinc=/usr/packages/mpich/include -mpilib=/usr/packages/mpich/lib -PAPIWALLCLOCK -PAPIVIRTUAL -useropt=-O2
 Use OpenMP+MPI with SGI's compiler suite and Opari; use PAPI to access hardware performance counters, wallclock time, and process virtual time for measurements
 Typically configure multiple measurement libraries

Setup: Running Applications
% setenv PROFILEDIR /home/data/experiments/profile/01
% setenv TRACEDIR /home/data/experiments/trace/01      (optional)
% set path=($path <taudir>/<arch>/bin)
% setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH\:<taudir>/<arch>/lib
For PAPI (1 counter):
% setenv PAPI_EVENT PAPI_FP_INS
For PAPI (multiple counters):
% setenv COUNTER1 PAPI_FP_INS         (PAPI's floating point instructions)
% setenv COUNTER2 PAPI_L1_DCM         (PAPI's L1 data cache misses)
% setenv COUNTER3 P_VIRTUAL_TIME      (PAPI's virtual time)
% setenv COUNTER4 P_WALL_CLOCK_TIME   (PAPI's wallclock time)
% mpirun -np <n> <application>
% llsubmit job.sh

Profiling with PAPI (PAPI_L1_DCM, Matrix)

Profiling with PAPI (PAPI_FP_INS, NPB - LU)

Performance Mapping  Associate performance with “significant” entities (events)  Source code points are important  Functions, regions, control flow events, user events  Execution process and thread entities are important  Some entities are more abstract, harder to measure  Consider callgraph (callpath) profiling  Measure time (metric) along an edge (path) of callgraph  Incident edge gives parent / child view  Edge sequence (path) gives parent / descendant view  Problem: Callpath profiling when callgraph is unknown  Determine callgraph dynamically at runtime  Map performance measurement to dynamic call path state

1-Level Callpath Implementation in TAU  TAU maintains a performance event (routine) callstack  A profiled routine (child) looks in the callstack for its parent  The previously profiled performance event is the parent  A callpath profile structure is created the first time the parent calls the child  TAU records the parent in a callgraph map for the child  A string representing the 1-level callpath is used as its key  “a( )=>b( )” : name for time spent in “b” when called by “a”  The map returns a pointer to the callpath profile structure  The 1-level callpath is profiled using this profile data  Built upon TAU's performance mapping technology  Measurement is independent of instrumentation  Use -PROFILECALLPATH to configure TAU
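A minimal sketch of the idea, not TAU's actual data structures: the measurement layer keeps a stack of active events and a map keyed by "parent => child" strings that points to per-callpath profile records.

// Illustrative 1-level callpath bookkeeping; names and types are hypothetical.
#include <map>
#include <stack>
#include <string>

struct Profile { double inclusiveTime = 0.0; long calls = 0; };

static std::stack<std::string> callstack;                // currently active events
static std::map<std::string, Profile> callpathProfiles;  // keyed by "a => b"

Profile& enterEvent(const std::string& name) {
  std::string key = callstack.empty()
      ? name                                   // no parent: top-level event
      : callstack.top() + " => " + name;       // 1-level callpath key
  callstack.push(name);
  Profile& p = callpathProfiles[key];          // created on first use
  p.calls++;
  return p;
}

void exitEvent(Profile& p, double elapsedUsec) {
  p.inclusiveTime += elapsedUsec;              // accumulate time along this edge
  callstack.pop();
}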

Callpath Profiling Example (NAS LU v2.3) % configure -PROFILECALLPATH -SGITIMERS -arch=sgi64 -mpiinc=/usr/include -mpilib=/usr/lib64 -useropt=-O2

Vampir Trace Visualization Tool  Visualization and Analysis of MPI Programs  Originally developed by Forschungszentrum Jülich  Current development by Technical University Dresden  Distributed by PALLAS, Germany 

Vampir (NAS Parallel Benchmark – LU): timeline, callgraph, communications, and parallelism displays

Applications: EVH1

Applications: VTF (ASCI ASAP Caltech)  C++, C, F90, Python  PDT, MPI

Applications: SAMRAI (LLNL)  C++  PDT, MPI  SAMRAI timers (groups)

Applications: Uintah (U. Utah) (500 cpus) TAU uses SCIRun [U. Utah] for visualization of performance data (online/offline)

Applications: Uintah (contd.) Scalability analysis

TAU Performance System Status  Computing platforms  IBM SP, SGI Origin, ASCI Red, Cray T3E, HP Tru64 (Compaq) SC, HP Superdome (HP-UX), Sun, Apple, Windows, Linux (IA-32, IA-64, Alpha, PPC), Hitachi, NEC  Programming languages  C, C++, Fortran 77/90, HPF, Java  Communication libraries  MPI, PVM, Nexus, Tulip, ACLMPL, MPIJava  Thread libraries  pthread, Java, Windows, SGI sproc, Tulip, SMARTS, OpenMP  Compilers  Intel KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, HP, Sun, Microsoft, SGI, Cray, IBM, Compaq, Hitachi, NEC, Intel

Information  TAU (  PDT (  PAPI (  OPARI (

Support Acknowledgement  TAU and PDT support:  Department of Energy (DOE)  DOE 2000 ACTS contract  DOE MICS contract  DOE ASCI Level 3 (LANL, LLNL)  U. of Utah DOE ASCI Level 1 subcontract  DARPA  NSF National Young Investigator (NYI) award