Performance Technology for Complex Parallel Systems
Sameer Shende, Allen D. Malony
University of Oregon

Overview
- Introduction
  - Definitions, general problem
- Tuning and Analysis Utilities (TAU)
  - Instrumentation
  - Measurement
  - Analysis
- Work in progress:
  - Visualization: Vampir
  - Performance Monitoring and Steering
  - Performance Database Framework
- Case Study: Uintah
- Conclusions

General Problems
How do we create robust and ubiquitous performance technology for the analysis and tuning of parallel and distributed software and systems in the presence of (evolving) complexity challenges?
How do we apply performance technology effectively to the variety and diversity of performance problems that arise in complex parallel and distributed computer systems?

Computation Model for Performance Technology
- How to address dual performance technology goals?
  - Robust capabilities + widely available methodologies
  - Contend with problems of system diversity
  - Flexible tool composition/configuration/integration
- Approaches
  - Restrict computation types / performance problems
    - limited performance technology coverage
  - Base technology on abstract computation model
    - general architecture and software execution features
    - map features/methods to existing complex system types
    - develop capabilities that can adapt and be optimized

General Complex System Computation Model
- Node: physically distinct shared memory machine
  - Message passing node interconnection network
- Context: distinct virtual memory space within node
- Thread: execution threads (user/system) in context
[Figure: physical view (SMP nodes with node memory, joined by an interconnection network) beside the model view (node, context with VM space, threads, inter-node message communication)]

Definitions – Profiling
- Profiling
  - Recording of summary information during execution
    - inclusive, exclusive time, # calls, hardware statistics, …
  - Reflects performance behavior of program entities
    - functions, loops, basic blocks
    - user-defined “semantic” entities
  - Very good for low-cost performance assessment
  - Helps to expose performance bottlenecks and hotspots
  - Implemented through
    - sampling: periodic OS interrupts or hardware counter traps
    - instrumentation: direct insertion of measurement code

Definitions – Tracing
- Tracing
  - Recording of information about significant points (events) during program execution
    - entering/exiting code region (function, loop, block, …)
    - thread/process interactions (e.g., send/receive message)
  - Save information in event record
    - timestamp
    - CPU identifier, thread identifier
    - event type and event-specific information
  - Event trace is a time-sequenced stream of event records
  - Can be used to reconstruct dynamic program behavior
  - Typically requires code instrumentation

Event Tracing: Instrumentation, Monitor, Trace
Event definitions:
    1 master
    2 slave
    3 ...
CPU A:
    void master {
        trace(ENTER, 1);
        ...
        trace(SEND, B);
        send(B, tag, buf);
        ...
        trace(EXIT, 1);
    }
CPU B:
    void slave {
        trace(ENTER, 2);
        ...
        recv(A, tag, buf);
        trace(RECV, A);
        ...
        trace(EXIT, 2);
    }
MONITOR output (timestamp, CPU, event, data):
    58 A ENTER 1
    60 B ENTER 2
    62 A SEND  B
    64 A EXIT  1
    68 B RECV  A
    ...
    69 B EXIT  2
    ...
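
To make the idea of reconstructing program behavior from an event trace concrete, here is a minimal Python sketch. It is not TAU code: the `Event` tuple and the helpers `region_durations` and `message_latencies` are hypothetical names introduced for illustration, and the records are the six shown in the MONITOR box above.

```python
from collections import namedtuple

# One event record: timestamp, CPU identifier, event type, event-specific data.
Event = namedtuple("Event", "timestamp cpu kind data")

# Records copied from the MONITOR box on the slide.
TRACE = [
    Event(58, "A", "ENTER", 1),
    Event(60, "B", "ENTER", 2),
    Event(62, "A", "SEND", "B"),
    Event(64, "A", "EXIT", 1),
    Event(68, "B", "RECV", "A"),
    Event(69, "B", "EXIT", 2),
]
REGION_NAMES = {1: "master", 2: "slave"}  # the event definition table


def region_durations(trace):
    """Pair ENTER/EXIT records per CPU; return {(cpu, region name): elapsed}."""
    open_regions = {}  # (cpu, region id) -> entry timestamp
    durations = {}
    for ev in sorted(trace, key=lambda e: e.timestamp):
        key = (ev.cpu, ev.data)
        if ev.kind == "ENTER":
            open_regions[key] = ev.timestamp
        elif ev.kind == "EXIT":
            start = open_regions.pop(key)
            name = (ev.cpu, REGION_NAMES[ev.data])
            durations[name] = durations.get(name, 0) + ev.timestamp - start
    return durations


def message_latencies(trace):
    """Match each SEND with the next RECV on the destination CPU."""
    pending = {}  # (sender, receiver) -> timestamps of unmatched sends
    latencies = []
    for ev in sorted(trace, key=lambda e: e.timestamp):
        if ev.kind == "SEND":
            pending.setdefault((ev.cpu, ev.data), []).append(ev.timestamp)
        elif ev.kind == "RECV":
            sent = pending[(ev.data, ev.cpu)].pop(0)
            latencies.append((ev.data, ev.cpu, ev.timestamp - sent))
    return latencies
```

A timeline display draws the same pairings: ENTER/EXIT pairs become bars per CPU and SEND/RECV pairs become message lines between them.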

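Event Tracing: “Timeline” Visualization
Event definitions:
    1 master
    2 slave
Trace:
    58 A ENTER 1
    60 B ENTER 2
    62 A SEND  B
    64 A EXIT  1
    68 B RECV  A
    ...
    69 B EXIT  2
    ...
[Figure: timeline with one row per CPU (A, B) showing the main, master, and slave regions]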

TAU Performance System Framework
- Tuning and Analysis Utilities
- Performance system framework for scalable parallel and distributed high-performance computing
- Targets a general complex system computation model
  - nodes / contexts / threads
  - Multi-level: system / software / parallelism
  - Measurement and analysis abstraction
- Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
  - Portable, configurable performance profiling/tracing facility
  - Open software approach
- University of Oregon, LANL, FZJ Germany

Strategies for Empirical Performance Evaluation
- Empirical performance evaluation as a series of performance experiments
  - Experiment trials describing instrumentation and measurement requirements
  - Where/When/How axes of empirical performance space
    - where are performance measurements made in program
    - when is performance instrumentation done
    - how are performance measurement/instrumentation chosen
- Strategies for achieving flexibility and portability goals
  - Limited performance methods restrict evaluation scope
  - Non-portable methods force use of different techniques
  - Integration and combination of strategies

TAU Performance System Architecture
[Figure: architecture diagram; surviving labels include EPILOG and Paraver]

TAU Instrumentation Options
- Manual instrumentation
  - TAU Profiling API
- Automatic instrumentation approaches
  - PDT - Source-to-source translation
  - MPI - Wrapper interposition library
  - Opari - OpenMP directive rewriting
- Binary:
  - JVMPI - Java virtual machine instrumentation
  - DyninstAPI - Runtime code patching

TAU Instrumentation
- Targets common measurement interface (TAU API)
- Object-based design and implementation
  - Macro-based, using constructor/destructor techniques
  - Program units: function, classes, templates, blocks
  - Uniquely identify functions and templates
    - name and type signature (name registration)
    - static object creates performance entry
    - dynamic object receives static object pointer
    - runtime type identification for template instantiations
  - C and Fortran instrumentation variants
- Instrumentation and measurement optimization

Multi-Level Instrumentation
- Uses multiple instrumentation interfaces
- Shares information: cooperation between interfaces
- Taps information at multiple levels
- Provides selective instrumentation at each level
- Targets a common performance model
- Presents a unified view of execution

Manual Instrumentation – Using TAU
- Install TAU
    % configure ; make clean install
- Instrument application
  - TAU Profiling API
- Modify application makefile
  - include TAU’s stub makefile, modify variables
- Execute application
    % mpirun -np a.out
- Analyze performance data
  - jracy, vampir, pprof, paraver …

TAU Manual Instrumentation API
- Initialization and runtime configuration
    TAU_PROFILE_INIT(argc, argv);
    TAU_PROFILE_SET_NODE(myNode);
    TAU_PROFILE_SET_CONTEXT(myContext);
    TAU_PROFILE_EXIT(message);
    TAU_REGISTER_THREAD();
- Function and class methods
    TAU_PROFILE(name, type, group);
- Template
    TAU_TYPE_STRING(variable, type);
    TAU_PROFILE(name, type, group);
    CT(variable);
- User-defined timing
    TAU_PROFILE_TIMER(timer, name, type, group);
    TAU_PROFILE_START(timer);
    TAU_PROFILE_STOP(timer);
    …

Manual Instrumentation – C++ Example
    #include <TAU.h>

    int foo(void);   // declared before use in main()

    int main(int argc, char **argv)
    {
      TAU_PROFILE("int main(int, char **)", " ", TAU_DEFAULT);
      TAU_PROFILE_INIT(argc, argv);
      TAU_PROFILE_SET_NODE(0); /* for sequential programs */
      foo();
      return 0;
    }

    int foo(void)
    {
      TAU_PROFILE("int foo(void)", " ", TAU_DEFAULT); // measures entire foo()
      TAU_PROFILE_TIMER(t, "foo(): for loop", "[23:45 file.cpp]", TAU_USER);
      TAU_PROFILE_START(t);   // timer t covers only the loop
      for(int i = 0; i < N ; i++){
        work(i);
      }
      TAU_PROFILE_STOP(t);
      // other statements in foo …
    }
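
The constructor/destructor technique behind scoped timing (start the timer when the scope is entered, stop it on every exit path) can be sketched outside TAU. The Python below is an analogy, not the TAU API: `Profile` and its `timer` method are hypothetical names, and a context manager stands in for a C++ object whose destructor stops the timer.

```python
import time
from contextlib import contextmanager


class Profile:
    """Toy profile store: call counts and inclusive time per timer name."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock  # injectable so a fake clock can be used in tests
        self.calls = {}
        self.inclusive = {}

    @contextmanager
    def timer(self, name):
        start = self.clock()  # "constructor": start on scope entry
        try:
            yield
        finally:  # "destructor": runs on every scope exit, including return
            elapsed = self.clock() - start
            self.calls[name] = self.calls.get(name, 0) + 1
            self.inclusive[name] = self.inclusive.get(name, 0) + elapsed


def foo(profile, n=3):
    with profile.timer("int foo(void)"):        # measures all of foo()
        with profile.timer("foo(): for loop"):  # measures only the loop
            total = 0
            for i in range(n):
                total += i
        return total
```

The nested scopes mirror the slide: one timer covers the whole routine and a second, user-defined timer covers just the loop.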

Manual Instrumentation – C Example
    #include <TAU.h>

    int foo(void);

    int main(int argc, char **argv)
    {
      TAU_PROFILE_TIMER(tmain, "int main(int, char **)", " ", TAU_DEFAULT);
      TAU_PROFILE_INIT(argc, argv);
      TAU_PROFILE_SET_NODE(0); /* for sequential programs */
      TAU_PROFILE_START(tmain);
      foo();
      …
      TAU_PROFILE_STOP(tmain);
      return 0;
    }

    int foo(void)
    {
      TAU_PROFILE_TIMER(t, "foo()", " ", TAU_USER);
      TAU_PROFILE_START(t);
      for(int i = 0; i < N ; i++){
        work(i);
      }
      TAU_PROFILE_STOP(t);
    }

Manual Instrumentation – F90 Example
cc34567 Cubes program – comment line
      PROGRAM SUM_OF_CUBES
      integer profiler(2)
      save profiler
      INTEGER :: H, T, U
      call TAU_PROFILE_INIT()
      call TAU_PROFILE_TIMER(profiler, 'PROGRAM SUM_OF_CUBES')
      call TAU_PROFILE_START(profiler)
      call TAU_PROFILE_SET_NODE(0)
      ! This program prints all 3-digit numbers that
      ! equal the sum of the cubes of their digits.
      DO H = 1, 9
        DO T = 0, 9
          DO U = 0, 9
            IF (100*H + 10*T + U == H**3 + T**3 + U**3) THEN
              PRINT "(3I1)", H, T, U
            ENDIF
          END DO
        END DO
      END DO
      call TAU_PROFILE_STOP(profiler)
      END PROGRAM SUM_OF_CUBES

Instrumenting Multithreaded Applications
    #include <TAU.h>
    #include <pthread.h>

    void * threaded_function(void *data)
    {
      TAU_REGISTER_THREAD(); // Before any other TAU calls
      TAU_PROFILE("void * threaded_function", " ", TAU_DEFAULT);
      work();
      return NULL;
    }

    int main(int argc, char **argv)
    {
      TAU_PROFILE("int main(int, char **)", " ", TAU_DEFAULT);
      TAU_PROFILE_INIT(argc, argv);
      TAU_PROFILE_SET_NODE(0); /* for sequential programs */
      pthread_attr_t attr;
      pthread_t tid;
      pthread_attr_init(&attr);
      pthread_create(&tid, NULL, threaded_function, NULL);
      pthread_join(tid, NULL); /* wait for the thread before exiting */
      return 0;
    }

Compiling: TAU Makefiles
- Include TAU Stub Makefile ( /lib) in the user’s Makefile.
- Variables:
    TAU_CXX          Specify the C++ compiler used by TAU
    TAU_CC, TAU_F90  Specify the C, F90 compilers
    TAU_DEFS         Defines used by TAU. Add to CFLAGS
    TAU_LDFLAGS      Linker options. Add to LDFLAGS
    TAU_INCLUDE      Header files include path. Add to CFLAGS
    TAU_LIBS         Statically linked TAU library. Add to LIBS
    TAU_SHLIBS       Dynamically linked TAU library
    TAU_MPI_LIBS     TAU’s MPI wrapper library for C/C++
    TAU_MPI_FLIBS    TAU’s MPI wrapper library for F90
    TAU_FORTRANLIBS  Must be linked in with C++ linker for F90
    TAU_DISABLE      TAU’s dummy F90 stub library
- Note: Not including TAU_DEFS in CFLAGS disables instrumentation in C/C++ programs (TAU_DISABLE for f90).

Including TAU’s stub Makefile
    include /usr/tau/sgi64/lib/Makefile.tau-pthread-kcc
    CXX = $(TAU_CXX)
    CC = $(TAU_CC)
    CFLAGS = $(TAU_DEFS)
    LIBS = $(TAU_LIBS)
    OBJS = ...
    TARGET = a.out
    $(TARGET): $(OBJS)
            $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
    .cpp.o:
            $(CC) $(CFLAGS) -c $< -o $@

TAU Instrumentation Options
- Manual instrumentation
  - TAU Profiling API
- Automatic instrumentation approaches
  - PDT - Source-to-source translation
  - MPI - Wrapper interposition library
  - Opari - OpenMP directive rewriting

Program Database Toolkit (PDT)
- Program code analysis framework for developing source-based tools
- High-level interface to source code information
- Integrated toolkit for source code parsing, database creation, and database query
  - commercial grade front end parsers
  - portable IL analyzer, database format, and access API
  - open software approach for tool development
- Target and integrate multiple source languages
- Use in TAU to build automated performance instrumentation tools

Program Database Toolkit
[Figure: PDT data flow. Application / Library, then C / C++ parser and Fortran 77/90 parser, then IL, then C / C++ IL analyzer and Fortran 77/90 IL analyzer, then Program Database Files, then DUCTAPE, which feeds the tools below]
- PDBhtml: Program documentation
- SILOON: Application component glue
- CHASM: C++ / F90 interoperability
- TAU_instr: Automatic source instrumentation

PDT Components
- Language front end
  - Edison Design Group (EDG): C, C++
  - Mutek Solutions Ltd.: F77, F90
  - creates an intermediate-language (IL) tree
- IL Analyzer
  - processes the intermediate language (IL) tree
  - creates “program database” (PDB) formatted file
- DUCTAPE (Bernd Mohr, ZAM, Germany)
  - C++ program Database Utilities and Conversion Tools APplication Environment
  - processes and merges PDB files
  - C++ library to access the PDB for PDT applications

TAU Makefile for PDT – C++ Example
    include /usr/tau/include/Makefile
    CXX = $(TAU_CXX)
    CC = $(TAU_CC)
    PDTPARSE = $(PDTDIR)/$(CONFIG_ARCH)/bin/cxxparse
    TAUINSTR = $(TAUROOT)/$(CONFIG_ARCH)/bin/tau_instrumentor
    CFLAGS = $(TAU_DEFS)
    LIBS = $(TAU_LIBS)
    OBJS = ...
    TARGET = a.out
    $(TARGET): $(OBJS)
            $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
    .cpp.o:
            $(PDTPARSE) $<
            $(TAUINSTR) $*.pdb $< -o $*.inst.cpp
            $(CC) $(CFLAGS) -c $*.inst.cpp -o $@

Instrumentation Control
- Selection of which performance events to observe
  - Could depend on scope, type, level of interest
  - Could depend on instrumentation overhead
- How is selection supported in instrumentation system?
  - No choice
  - Include / exclude lists (TAU)
  - Environment variables
  - Static vs. dynamic
- Problem: Controlling instrumentation of small routines
  - High relative measurement overhead
  - Significant intrusion and possible perturbation

Using PDT: tau_instrumentor
    % tau_instrumentor
    Usage : tau_instrumentor [-o ] [-noinline] [-g groupname] [-i headerfile] [-c|-c++|-fortran] [-f ]
    For selective instrumentation, use -f option
    % cat selective.dat
    # Selective instrumentation: Specify an exclude/include list.
    BEGIN_EXCLUDE_LIST
    void quicksort(int *, int, int)
    void sort_5elements(int *)
    void interchange(int *, int *)
    END_EXCLUDE_LIST
    # If an include list is specified, the routines in the list will be the only
    # routines that are instrumented.
    # To specify an include list (a list of routines that will be instrumented)
    # remove the leading # to uncomment the following lines
    #BEGIN_INCLUDE_LIST
    #int main(int, char **)
    #int select_
    #END_INCLUDE_LIST
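
The include/exclude semantics described in the comments of selective.dat can be sketched in a few lines. This Python is an illustration, not how tau_instrumentor is implemented: `parse_selective` and `is_instrumented` are hypothetical helpers, matching is by exact signature string only, and the sample text is a shortened copy of the file above.

```python
SAMPLE = """\
# Selective instrumentation: Specify an exclude/include list.
BEGIN_EXCLUDE_LIST
void quicksort(int *, int, int)
void sort_5elements(int *)
void interchange(int *, int *)
END_EXCLUDE_LIST
#BEGIN_INCLUDE_LIST
#int main(int, char **)
#END_INCLUDE_LIST
"""


def parse_selective(text):
    """Return (exclude, include) lists of routine signatures."""
    exclude, include = [], []
    current = None  # list currently being filled, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment (including a commented-out list)
        if line == "BEGIN_EXCLUDE_LIST":
            current = exclude
        elif line == "BEGIN_INCLUDE_LIST":
            current = include
        elif line in ("END_EXCLUDE_LIST", "END_INCLUDE_LIST"):
            current = None
        elif current is not None:
            current.append(line)
    return exclude, include


def is_instrumented(routine, exclude, include):
    """Assumed rule: a non-empty include list means only listed routines are instrumented."""
    if include:
        return routine in include
    return routine not in exclude
```

With the file as shown (include list commented out), every routine except the three excluded ones would be instrumented.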

Rule-Based Overhead Analysis (N. Trebon, UO)
- Analyze the performance data to determine events with high (relative) measurement overhead
- Create a select list for excluding those events
- Rule grammar (used in TAUreduce tool)
    [GroupName:] Field Operator Number
  - GroupName indicates rule applies to events in group
  - Field is an event metric attribute (from profile statistics)
    - numcalls, numsubs, percent, usec, cumusec, count [PAPI], totalcount, stdev, usecs/call, counts/call
  - Operator is one of >, <, or =
  - Number is any number
  - Compound rules possible using & between simple rules

Example Rules
- #Exclude all events that are members of TAU_USER
  #and use less than 1000 microseconds
    TAU_USER:usec < 1000
- #Exclude all events that have less than 100
  #microseconds and are called only once
    usec < 100 & numcalls = 1
- #Exclude all events that have less than 1000 usecs per
  #call OR have a (total inclusive) percent less than 5
    usecs/call < 1000
    percent < 5
- Scientific notation can be used
    usec>1000 & numcalls> & usecs/call 25
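
The rule grammar `[GroupName:] Field Operator Number` is small enough to evaluate with a regular expression. The sketch below is not the TAUreduce implementation: `matches` and `exclude_list` are hypothetical names, events are assumed to be plain dictionaries with a `groups` entry, and rules on separate lines are treated as alternatives (OR) while `&` joins conditions that must all hold.

```python
import operator
import re

OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq}
# Optional "Group:", then a field such as usec or usecs/call, an operator, a number.
RULE = re.compile(r"^\s*(?:(\w+)\s*:)?\s*([\w/]+)\s*([<>=])\s*(\S+)\s*$")


def matches(rule, event):
    """True if `event` satisfies every simple rule joined by '&'."""
    for simple in rule.split("&"):
        m = RULE.match(simple)
        if m is None:
            raise ValueError(f"cannot parse rule: {simple!r}")
        group, field, op, number = m.groups()
        if group is not None and group not in event.get("groups", ()):
            return False
        # float() also accepts scientific notation such as 1e3.
        if not OPS[op](event[field], float(number)):
            return False
    return True


def exclude_list(rules, events):
    """Names of events matched by at least one rule line."""
    return [name for name, ev in events.items()
            if any(matches(rule, ev) for rule in rules)]
```

For example, with one small TAU_USER event and one large TAU_DEFAULT event, the rule `TAU_USER:usec < 1000` selects only the first for exclusion.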

TAU Instrumentation Options
- Manual instrumentation
  - TAU Profiling API
- Automatic instrumentation approaches
  - PDT - Source-to-source translation
  - MPI - Wrapper interposition library
  - Opari - OpenMP directive rewriting

TAU’s MPI Wrapper Interposition Library
- Uses standard MPI Profiling Interface
  - Provides name shifted interface
    - MPI_Send = PMPI_Send
    - Weak bindings
- Interpose TAU’s MPI wrapper library between MPI and TAU
  - -lmpi replaced by -lTauMpi -lpmpi -lmpi

MPI Library Instrumentation (MPI_Send)
    int MPI_Send(…) /* TAU redefines MPI_Send */
    ...
    {
      int returnVal, typesize;
      TAU_PROFILE_TIMER(tautimer, "MPI_Send()", " ", TAU_MESSAGE);
      TAU_PROFILE_START(tautimer);
      if (dest != MPI_PROC_NULL) {
        PMPI_Type_size(datatype, &typesize);
        TAU_TRACE_SENDMSG(tag, dest, typesize*count);
      }
      /* Wrapper calls PMPI_Send */
      returnVal = PMPI_Send(buf, count, datatype, dest, tag, comm);
      TAU_PROFILE_STOP(tautimer);
      return returnVal;
    }

Including TAU’s stub Makefile
    include /usr/tau/sgi64/lib/Makefile.tau-mpi
    CXX = $(TAU_CXX)
    CC = $(TAU_CC)
    CFLAGS = $(TAU_DEFS)
    LIBS = $(TAU_MPI_LIBS) $(TAU_LIBS)
    LDFLAGS = $(USER_OPT) $(TAU_LDFLAGS)
    OBJS = ...
    TARGET = a.out
    $(TARGET): $(OBJS)
            $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
    .cpp.o:
            $(CC) $(CFLAGS) -c $< -o $@

TAU Instrumentation Options
- Manual instrumentation
  - TAU Profiling API
- Automatic instrumentation approaches
  - PDT - Source-to-source translation
  - MPI - Wrapper interposition library
  - Opari - OpenMP directive rewriting [FZJ, Germany]

Instrumentation of OpenMP Constructs
- OPARI
  - OpenMP Pragma And Region Instrumentor
  - Source-to-Source translator to insert POMP calls around OpenMP constructs and API functions
- Done: Supports
  - Fortran77 and Fortran90, OpenMP 2.0
  - C and C++, OpenMP 1.0
  - POMP Extensions
  - EPILOG and TAU POMP implementations
  - Preserves source code information ( #line line file )
- Work in Progress: Investigating standardization through OpenMP Forum

OpenMP API Instrumentation
- Transform
  - omp_#_lock() → pomp_#_lock()
  - omp_#_nest_lock() → pomp_#_nest_lock()
  [ # = init | destroy | set | unset | test ]
- POMP version
  - Calls omp version internally
  - Can do extra stuff before and after call
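
The name transformation above amounts to prefixing the ten lock routines with `p`. As a rough illustration only (OPARI is a real source-to-source translator, and this regular-expression rewrite is not how it works), the mapping can be sketched in Python; `rewrite_lock_calls` is a hypothetical helper.

```python
import re

# omp_{init,destroy,set,unset,test}_lock and the matching _nest_lock variants.
LOCK_CALL = re.compile(r"\bomp_(init|destroy|set|unset|test)_(nest_)?lock\b")


def rewrite_lock_calls(source):
    """Replace each OpenMP lock routine name with its POMP counterpart."""
    return LOCK_CALL.sub(lambda m: "p" + m.group(0), source)
```

Other OpenMP API calls, such as `omp_get_num_threads()`, are left untouched, and names that already carry the `pomp_` prefix are not rewritten a second time because of the word-boundary anchor.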

Example: !$OMP PARALLEL DO Instrumentation
Original:
    !$OMP PARALLEL DO clauses...
    do loop
    !$OMP END PARALLEL DO
Instrumented:
    call pomp_parallel_fork(d)
    !$OMP PARALLEL other-clauses...
    call pomp_parallel_begin(d)
    call pomp_do_enter(d)
    !$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
    do loop
    !$OMP END DO NOWAIT
    call pomp_barrier_enter(d)
    !$OMP BARRIER
    call pomp_barrier_exit(d)
    call pomp_do_exit(d)
    call pomp_parallel_end(d)
    !$OMP END PARALLEL
    call pomp_parallel_join(d)

Opari Instrumentation: Example
- OpenMP directive instrumentation
    pomp_for_enter(&omp_rd_2);
    #line 252 "stommel.c"
    #pragma omp for schedule(static) reduction(+: diff) private(j) firstprivate (a1,a2,a3,a4,a5) nowait
    for( i=i1;i<=i2;i++) {
      for(j=j1;j<=j2;j++){
        new_psi[i][j]=a1*psi[i+1][j] + a2*psi[i-1][j] + a3*psi[i][j+1]
                      + a4*psi[i][j-1] - a5*the_for[i][j];
        diff=diff+fabs(new_psi[i][j]-psi[i][j]);
      }
    }
    pomp_barrier_enter(&omp_rd_2);
    #pragma omp barrier
    pomp_barrier_exit(&omp_rd_2);
    pomp_for_exit(&omp_rd_2);
    #line 261 "stommel.c"

OPARI: Basic Usage (f90)
- Reset OPARI state information
    rm -f opari.rc
- Call OPARI for each input source file
    opari file1.f90
    ...
    opari fileN.f90
- Generate OPARI runtime table, compile it with ANSI C
    opari -table opari.tab.c
    cc -c opari.tab.c
- Compile modified files *.mod.f90 using OpenMP
- Link the resulting object files, the OPARI runtime table opari.tab.o and the TAU POMP RTL

OPARI: Makefile Template (C/C++)
    OMPCC = ...   # insert C OpenMP compiler here
    OMPCXX = ...  # insert C++ OpenMP compiler here
    .c.o:
            opari $<
            $(OMPCC) $(CFLAGS) -c $*.mod.c
    .cc.o:
            opari $<
            $(OMPCXX) $(CXXFLAGS) -c $*.mod.cc
    opari.init:
            rm -rf opari.rc
    opari.tab.o:
            opari -table opari.tab.c
            $(CC) -c opari.tab.c
    myprog: opari.init myfile*.o ... opari.tab.o
            $(OMPCC) -o myprog myfile*.o opari.tab.o -lpomp
    myfile1.o: myfile1.c myheader.h
    myfile2.o: ...

OPARI: Makefile Template (Fortran)
    OMPF77 = ...  # insert f77 OpenMP compiler here
    OMPF90 = ...  # insert f90 OpenMP compiler here
    .f.o:
            opari $<
            $(OMPF77) $(CFLAGS) -c $*.mod.F
    .f90.o:
            opari $<
            $(OMPF90) $(CXXFLAGS) -c $*.mod.F90
    opari.init:
            rm -rf opari.rc
    opari.tab.o:
            opari -table opari.tab.c
            $(CC) -c opari.tab.c
    myprog: opari.init myfile*.o ... opari.tab.o
            $(OMPF90) -o myprog myfile*.o opari.tab.o -lpomp
    myfile1.o: myfile1.f90
    myfile2.o: ...

TAU Measurement
- Performance information
  - High-resolution timer library (real-time / virtual clocks)
  - General software counter library (user-defined events)
  - Hardware performance counters
    - PAPI (Performance API) (UTK, Ptools Consortium)
    - consistent, portable API
- Organization
  - Node, context, thread levels
  - Profile groups for collective events (runtime selective)
  - Performance data mapping between software levels

TAU Measurement (continued)
- Parallel profiling
  - Function-level, block-level, statement-level
  - Supports user-defined events
  - TAU parallel profile database
  - Callpath profiles
  - Hardware counts values
- Tracing
  - All profile-level events
  - Inter-process communication events
  - Timestamp synchronization
- User-configurable measurement library (user controlled)

TAU Measurement System Configuration
- configure [OPTIONS]
    {-c++=, -cc= }          Specify C++ and C compilers
    {-pthread, -sproc}      Use pthread or SGI sproc threads
    -openmp                 Use OpenMP threads
    -opari=                 Specify location of Opari OpenMP tool
    -papi=                  Specify location of PAPI
    -pdt=                   Specify location of PDT
    {-mpiinc=, -mpilib= }   Specify MPI library instrumentation
    -TRACE                  Generate TAU event traces
    -PROFILE                Generate TAU profiles
    -PROFILECALLPATH        Generate Callpath profiles (1-level)
    -MULTIPLECOUNTERS       Use more than one hardware counter
    -CPUTIME                Use usertime+system time
    -PAPIWALLCLOCK          Use PAPI to access wallclock time
    -PAPIVIRTUAL            Use PAPI for virtual (user) time
    …

TAU Measurement Configuration – Examples
- ./configure -c++=xlC -cc=xlc -pdt=/usr/packages/pdtoolkit-2.1 -pthread
  - Use TAU with IBM’s xlC compiler, PDT and the pthread library
  - Enable TAU profiling (default)
- ./configure -TRACE -PROFILE
  - Enable both TAU profiling and tracing
- ./configure -c++=CC -cc=cc -MULTIPLECOUNTERS -papi=/usr/local/packages/papi -opari=/usr/local/opari-pomp-1.1 -mpiinc=/usr/packages/mpich/include -mpilib=/usr/packages/mpich/lib -SGITIMERS -PAPIVIRTUAL
  - Use OpenMP+MPI using SGI’s compiler suite, Opari and use PAPI for accessing hardware performance counters & virtual time for measurements
- Typically configure multiple measurement libraries

Setup: Running Applications
    % setenv PROFILEDIR /home/data/experiments/profile/01
    % setenv TRACEDIR /home/data/experiments/trace/01 (optional)
    % set path=($path / /bin)
    % setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH\: / /lib
For PAPI (1 counter):
    % setenv PAPI_EVENT PAPI_FP_INS
For PAPI (multiplecounters):
    % setenv COUNTER1 PAPI_FP_INS (PAPI’s Floating point ins)
    % setenv COUNTER2 PAPI_L1_DCM (PAPI’s L1 Data cache misses)
    % setenv COUNTER3 P_VIRTUAL_TIME (PAPI’s virtual time)
    % setenv COUNTER4 SGI_TIMERS (Wallclock time)
    % mpirun -np
    % llsubmit job.sh

Performance Mapping
- Associate performance with “significant” entities (events)
- Source code points are important
  - Functions, regions, control flow events, user events
- Execution process and thread entities are important
- Some entities are more abstract, harder to measure
- Consider callgraph (callpath) profiling
  - Measure time (metric) along an edge (path) of callgraph
    - Incident edge gives parent / child view
    - Edge sequence (path) gives parent / descendant view
- Problem: Callpath profiling when callgraph is unknown
  - Determine callgraph dynamically at runtime
  - Map performance measurement to dynamic call path state

1-Level Callpath Implementation in TAU
- TAU maintains a performance event (routine) callstack
- Profiled routine (child) looks in callstack for parent
  - Previous profiled performance event is the parent
  - A callpath profile structure created first time parent calls
  - TAU records parent in a callgraph map for child
  - String representing 1-level callpath used as its key
    - “a( )=>b( )” : name for time spent in “b” when called by “a”
- Map returns pointer to callpath profile structure
  - 1-level callpath is profiled using this profiling data
- Build upon TAU’s performance mapping technology
- Measurement is independent of instrumentation
- Use -PROFILECALLPATH to configure TAU
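
The callstack-plus-map scheme described above can be modeled in a few lines. This is a sketch under stated assumptions, not TAU's implementation: `CallpathProfiler` is a hypothetical class, timestamps are passed in explicitly rather than read from a timer, and only call counts and inclusive time are kept.

```python
class CallpathProfiler:
    """Toy 1-level callpath profiler keyed by "parent=>child" strings."""

    def __init__(self):
        self.callstack = []  # active routines: (routine, map key, entry time)
        self.profiles = {}   # map key -> {"calls": n, "inclusive": t}

    def enter(self, routine, now):
        # The previously entered, still active routine is the parent.
        parent = self.callstack[-1][0] if self.callstack else None
        key = f"{parent}=>{routine}" if parent else routine
        # The profile structure is created the first time this parent calls.
        entry = self.profiles.setdefault(key, {"calls": 0, "inclusive": 0})
        entry["calls"] += 1
        self.callstack.append((routine, key, now))

    def exit(self, now):
        routine, key, start = self.callstack.pop()
        self.profiles[key]["inclusive"] += now - start
```

With this model, time spent in `b()` is attributed separately to `a()=>b()` and `c()=>b()`, which a flat profile would merge into a single `b()` entry.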

TAU Analysis
- Profile analysis
  - pprof
    - parallel profiler with text-based display
  - racy
    - graphical interface to pprof (Tcl/Tk)
  - jracy
    - Java implementation of Racy
- Trace analysis and visualization
  - Trace merging and clock adjustment (if necessary)
  - Trace format conversion (ALOG, SDDF, Vampir)
  - Vampir (Pallas) trace visualization
  - Paraver (CEPBA) trace visualization

Pprof Command
- pprof [-c|-b|-m|-t|-e|-i] [-r] [-s] [-n num] [-f file] [-l] [nodes]
    -c       Sort according to number of calls
    -b       Sort according to number of subroutines called
    -m       Sort according to msecs (exclusive time total)
    -t       Sort according to total msecs (inclusive time total)
    -e       Sort according to exclusive time per call
    -i       Sort according to inclusive time per call
    -v       Sort according to standard deviation (exclusive usec)
    -r       Reverse sorting order
    -s       Print only summary profile information
    -n num   Print only first number of functions
    -f file  Specify full path and filename without node ids
    -l       List all functions and exit

TAU Parallel Performance Profiles

Terminology – Example
- For routine “int main( )”:
  - Exclusive time: 100 - 20 - 50 - 20 = 10 secs
  - Inclusive time: 100 secs
  - Calls: 1 call
  - Subrs (no. of child routines called): 3
  - Inclusive time/call: 100 secs
    int main( )
    { /* takes 100 secs */
      f1(); /* takes 20 secs */
      f2(); /* takes 50 secs */
      f1(); /* takes 20 secs */
      /* other work */
    }
    /* Time can be replaced by counts */
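
The arithmetic behind these terms is simple enough to write down. A minimal sketch, with `summarize` as a hypothetical helper and the numbers taken from the example above: exclusive time is inclusive time minus the time spent in directly called child routines.

```python
def summarize(inclusive, calls, child_times):
    """Derive profile columns for one routine from its measured times."""
    return {
        "exclusive": inclusive - sum(child_times),  # time in the routine itself
        "inclusive": inclusive,                     # routine plus its children
        "calls": calls,
        "subrs": len(child_times),                  # child routine calls made
        "inclusive_per_call": inclusive / calls,
    }


# main() takes 100 secs and calls f1 (20), f2 (50), f1 (20).
MAIN = summarize(inclusive=100, calls=1, child_times=[20, 50, 20])
```

The same computation applies when the metric is a hardware counter value instead of time.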

jracy (NAS Parallel Benchmark – LU) n: node c: context t: thread Global profiles Individual profile Routine profile across all nodes

jracy (Callpath Profiles) (R. A. Bell, UO) Callpath profile across all nodes

Vampir Trace Visualization Tool
- Visualization and Analysis of MPI Programs
- Originally developed by Forschungszentrum Jülich
- Current development by Technical University Dresden
- Distributed by PALLAS, Germany

Using TAU with Vampir
- Configure TAU with -TRACE option
    % configure -TRACE -SGITIMERS …
- Execute application
    % mpirun -np 4 a.out
- This generates TAU traces and event descriptors
- Merge all traces using tau_merge
    % tau_merge *.trc app.trc
- Convert traces to Vampir Trace format using tau_convert
    % tau_convert -pv app.trc tau.edf app.pv
  Note: Use -vampir instead of -pv for multi-threaded traces
- Load generated trace file in Vampir
    % vampir app.pv
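
Conceptually, the merge step combines one time-ordered event stream per process into a single time-ordered stream. The Python below sketches that idea only: it does not read TAU's .trc format or perform clock adjustment, and `merge_traces` and the tuple layout are assumptions made for illustration.

```python
import heapq


def merge_traces(*per_process_traces):
    """Merge event lists, each already sorted by timestamp, into one stream.

    Every record is a tuple whose first element is the timestamp.
    """
    return list(heapq.merge(*per_process_traces, key=lambda rec: rec[0]))


# Two per-process streams using the timestamps from the tracing example.
CPU_A = [(58, "A", "ENTER"), (62, "A", "SEND"), (64, "A", "EXIT")]
CPU_B = [(60, "B", "ENTER"), (68, "B", "RECV"), (69, "B", "EXIT")]
MERGED = merge_traces(CPU_A, CPU_B)
```

Because each input is already sorted, the merge is a single linear pass rather than a full sort of all events.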

Vampir: Main Window  Trace file loading can be  Interrupted at any time  Resumed  Started at a specified time offset  Provides main menu  Access to global and process local displays  Preferences  Help  Trace file can be re-written (re-grouped symbols)

Vampir: Timeline Diagram  Functions organized into groups  Coloring by group  Message lines can be colored by tag or size  Information about states, messages, collective, and I/O operations available by clicking on the representation

Vampir: Timeline Diagram (Message Info)  Source-code references are displayed if recorded in trace

Vampir: Execution Statistics Displays  Aggregated profiling information: execution time, # calls, inclusive/exclusive  Available for all/any group (activity)  Available for all routines (symbols)  Available for any trace part (select in timeline diagram)

Vampir: Communication Statistics Displays  Bytes sent/received for collective operations  Message length statistics  Available for any trace part  Byte and message count, min/max/avg message length and min/max/avg bandwidth for each process pair

Vampir: Other Features  Parallelism display  Powerful filtering and trace comparison features  All diagrams highly customizable (through context menus)  Dynamic global call graph tree

Vampir: Process Displays  Activity chart  Call tree  Timeline  For all selected processes in the global displays

Vampir (NAS Parallel Benchmark – LU) Timeline display Callgraph display Communications display Parallelism display

TAU Performance System Status  Computing platforms  IBM SP, SGI Origin, ASCI Red, Cray T3E, Compaq SC, HP, Sun, Apple, Windows, IA-32, IA-64 (Linux), Hitachi, NEC  Programming languages  C, C++, Fortran 77/90, HPF, Java  Communication libraries  MPI, PVM, Nexus, Tulip, ACLMPL, MPIJava  Thread libraries  pthread, Java, Windows, SGI sproc, Tulip, SMARTS, OpenMP  Compilers  KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, HP, Sun, Microsoft, SGI, Cray, IBM, Compaq, Hitachi, NEC, Intel

PDT Status  Program Database Toolkit (Version 2.1, web download)  EDG C++ front end (Version )  Mutek Fortran 90 front end (Version 2.4.1)  C++ and Fortran 90 IL Analyzer  DUCTAPE library  Standard C++ system header files (KCC Version 4.0f)  PDT-constructed tools  TAU instrumentor (C/C++/F90)  Program analysis support for SILOON and CHASM  Platforms  SGI, IBM, Compaq, SUN, HP, Linux (IA32/IA64), Apple, Windows, Cray T3E, Hitachi

Work in Progress  Visualization:  TAU will generate event-traces with PAPI performance data. Vampir (v3.0) will support visualization of this data  Performance Monitoring and Steering  Performance Database Framework

Vampir v3.x: HPM Counter  Counter Timeline Display  Process Timeline Display

Performance Monitoring and Steering  Desirable to monitor performance during execution  Long-running applications  Steering computations for improved performance  Large-scale parallel applications complicate solutions  More parallel threads of execution producing data  Large amount of performance data (relative) to access  Analysis and visualization more difficult  Problem: Online performance data access and analysis  Incremental profile sampling (based on files)  Integration in computational steering system  Dynamic performance measurement and access
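
One possible reading of incremental profile sampling is that each sample is a cumulative profile written while the application runs, so the activity in an interval is the difference between two successive samples. The Python sketch below illustrates only that subtraction. The dictionary layout, the `interval_profile` helper, and the sample values are assumptions made for illustration and do not reflect TAU's file format.

```python
def interval_profile(previous, current):
    """Subtract two cumulative profiles of the form {routine: exclusive_secs}.

    Routines that first appear in `current` are treated as starting from zero.
    """
    return {name: current[name] - previous.get(name, 0.0) for name in current}


# Placeholder cumulative samples, not measured data.
sample_1 = {"solve": 2.0, "exchange": 0.5}
sample_2 = {"solve": 5.0, "exchange": 1.5, "io": 0.25}

print(interval_profile(sample_1, sample_2))
# {'solve': 3.0, 'exchange': 1.0, 'io': 0.25}
```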

Online Performance Analysis (K. Li, UO) [Diagram. Components: Application; TAU Performance System; Performance Data Reader; Performance Data Integrator; Performance Analyzer; Performance Visualizer; Performance Steering; SCIRun (Univ. of Utah). Labels: // performance data streams; // performance data output; file system; sample sequencing; reader synchronization; accumulated samples]

2D Field Performance Visualization in SCIRun SCIRun program

Uintah Computational Framework (UCF)  University of Utah  UCF analysis  Scheduling  MPI library  Components  500 processes  Use for online and offline visualization  Apply SCIRun steering

Empirical-Based Performance Optimization Process [Diagram. Stages: Performance Observation; Performance Experimentation; Performance Diagnosis; Performance Tuning. Labels: observability requirements; characterization; properties; hypotheses; Experiment Schemas; Experiment Trials]

TAU Performance Database Framework Performance analysis programs Performance analysis and query toolkit  profile data only  XML representation  project / experiment / trial PerfDML translators... ORDB PostgreSQL PerfDB Performance data description Raw performance data

PerfDBF Architecture (L. Li, R. Bell, UO) [Diagram. Components: App. profiled with TAU; Standard TAU Output Data; TAU to XML Converter; TAU XML Format; Database Loader; SQL Database; Analysis Tool]

Scalability Analysis Process
  Scalability study on LU
    % suite.def # of procs -> 1, 2, 4, and 8
    % mpirun -np 1 lu.W1
    % mpirun -np 2 lu.W2
    % mpirun -np 4 lu.W4
    % mpirun -np 8 lu.W8
  populateDatabase.sh
    run Java translator to translate profiles into XML
    run Java XML reader to write XML profiles to database
  Read times for routines and program from experiments
  Calculate scalability metrics
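
The last step, calculating scalability metrics, can be sketched as follows. This is a hedged illustration, not the PerfDBF analysis code: it assumes the metric is the usual speedup (baseline time divided by time on p processors) with parallel efficiency derived from it, the function names are hypothetical, and the timings are placeholders rather than results from the LU runs.

```python
def speedup(t_base: float, t_p: float) -> float:
    """Speedup relative to the baseline run."""
    return t_base / t_p


def efficiency(t_base: float, t_p: float, procs: int, base_procs: int = 1) -> float:
    """Speedup divided by the increase in processor count."""
    return speedup(t_base, t_p) * base_procs / procs


def scalability_table(secs_by_procs):
    """Return (procs, speedup, efficiency) rows; the smallest run is the baseline."""
    base_procs = min(secs_by_procs)
    t_base = secs_by_procs[base_procs]
    return [
        (p, speedup(t_base, t), efficiency(t_base, t, p, base_procs))
        for p, t in sorted(secs_by_procs.items())
    ]


# Placeholder mean times in seconds for one routine, NOT measured values.
placeholder_secs = {1: 80.0, 2: 40.0, 4: 25.0, 8: 20.0}

for procs, s, e in scalability_table(placeholder_secs):
    print(f"{procs:>2} procs  speedup {s:.2f}  efficiency {e:.2f}")
```

Applying the same function per routine would give one row per routine and processor count, which is the shape of a per-function mean speedup listing.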

Contents of Performance Database

Scalability Analysis Results
  Scalability of LU performance experiments
  Four trial runs
  Funname | processors | meanspeedup
  ….
  applu   | 2          |
  applu   | 4          |
  applu   | 8          |
  …
  exact   | 2          |
  exact   | 4          |
  exact   | 8          |

Current Status and Future  PerfDBF prototype  TAU profile to XML translator  XML to PerfDB populator  PostgreSQL database  Java-based PostgreSQL query module  Use as a layer to support performance analysis tools  Make accessing the Performance Database quicker  Continue development  XML parallel profile representation  Basic specification

Overview  Introduction  Definitions, general problem  Tuning and Analysis Utilities (TAU)  Instrumentation  Measurement  Analysis  Work in progress:  Visualization: Vampir  Performance Monitoring and Steering  Performance Database Framework  Case Study: Uintah  Conclusions

Case Study: Utah ASCI/ASAP Level 1 Center  C-SAFE was established to build a problem-solving environment (PSE) for the numerical simulation of accidental fires and explosions  Fundamental chemistry and engineering physics models  Coupled with non-linear solvers, optimization, computational steering, visualization, and experimental data verification  Very large-scale simulations  Computer science problems:  Coupling of multiple simulation codes  Software engineering across diverse expert teams  Achieving high performance on large-scale systems

Example C-SAFE Simulation Problems  Heptane fire simulation  Material stress simulation  Typical C-SAFE simulation with a billion degrees of freedom and non-linear time dynamics

Uintah High-Level Component View

Uintah Computational Framework  Execution model based on software (macro) dataflow  Exposes parallelism and hides data transport latency  Computations expressed as directed acyclic graphs of tasks  Each task consumes input and produces output (input to future task)  Inputs/outputs specified for each patch in a structured grid  Abstraction of global single-assignment memory  DataWarehouse  Directory mapping names to values (array structured)  Write value once then communicate to awaiting tasks  Task graph gets mapped to processing resources  Communications schedule approximates global optimal
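
The dataflow model described here can be illustrated with a small Python sketch. It is not Uintah code: the `DataWarehouse`, `Task`, and `run` names below are hypothetical stand-ins, there is no parallelism or communication, and the variable names and values are placeholders. The sketch shows only the two ideas on the slide, a write-once directory of named values and tasks that become runnable when their inputs exist.

```python
class DataWarehouse:
    """Directory mapping names to values; each name may be written only once."""

    def __init__(self):
        self._values = {}

    def has(self, name):
        return name in self._values

    def put(self, name, value):
        if name in self._values:
            raise ValueError(f"{name!r} was already written (single assignment)")
        self._values[name] = value

    def get(self, name):
        return self._values[name]


class Task:
    def __init__(self, name, requires, computes, fn):
        self.name = name
        self.requires = requires  # names read from the warehouse
        self.computes = computes  # names written to the warehouse
        self.fn = fn              # maps input values to a tuple of output values


def run(tasks, dw):
    """Run each task once all of its inputs exist; return the execution order."""
    pending = list(tasks)
    order = []
    while pending:
        ready = [t for t in pending if all(dw.has(r) for r in t.requires)]
        if not ready:
            raise RuntimeError("remaining tasks have unsatisfied inputs")
        for task in ready:
            outputs = task.fn(*(dw.get(r) for r in task.requires))
            for name, value in zip(task.computes, outputs):
                dw.put(name, value)
            order.append(task.name)
            pending.remove(task)
    return order


dw = DataWarehouse()
dw.put("particle_mass", [1.0, 2.0, 3.0])
tasks = [  # deliberately listed out of dependency order
    Task("scale_grid", ["grid_mass"], ["grid_mass_new"], lambda g: (2.0 * g,)),
    Task("interpolate", ["particle_mass"], ["grid_mass"], lambda p: (sum(p),)),
]
print(run(tasks, dw))           # ['interpolate', 'scale_grid']
print(dw.get("grid_mass_new"))  # 12.0
```

Because every name is written once, the order in which tasks are listed does not matter: the dependencies alone determine a valid execution order.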

Uintah Task Graph (Material Point Method)  Diagram of named tasks (ovals) and data (edges)  Imminent computation  Dataflow-constrained  MPM  Newtonian material point motion time step  Solid: values defined at material point (particle)  Dashed: values defined at vertex (grid)  Prime ('): values updated during time step

Uintah PSE  UCF automatically sets up:  Domain decomposition  Inter-processor communication with aggregation/reduction  Parallel I/O  Checkpoint and restart  Performance measurement and analysis (stay tuned)  Software engineering  Coding standards  CVS (Commits: Y files/day, Y files/day)  Correctness regression testing with bugzilla bug tracking  Nightly build (parallel compiles)  170,000 lines of code (Fortran and C++ tasks supported)

Performance Technology Integration  Uintah presents challenges to performance integration  Software diversity and structure  UCF middleware, simulation code modules  component-based hierarchy  Portability objectives  cross-language and cross-platform  multi-parallelism: thread, message passing, mixed  Scalability objectives  High-level programming and execution abstractions  Requires flexible and robust performance technology  Requires support for performance mapping

Task Execution in Uintah Parallel Scheduler  Profile methods and functions in scheduler and in MPI library  Task execution time dominates (what task?)  MPI communication overheads (where?)  Task execution time distribution  Need to map performance data!

Semantics-Based Performance Mapping  Associate performance measurements with high-level semantic abstractions  Need mapping support in the performance measurement system to assign data correctly

Semantic Entities/Attributes/Associations (SEAA)  New dynamic mapping scheme  Entities defined at any level of abstraction  Attribute entity with semantic information  Entity-to-entity associations  Two association types (implemented in TAU API)  Embedded – extends data structure of associated object to store performance measurement entity  External – creates an external look-up table using address of object as the key to locate performance measurement entity

Uintah Task Performance Mapping  Uintah partitions individual particles across processing elements (processes or threads)  Simulation tasks in task graph work on particles  Tasks have domain-specific character in the computation  “interpolate particles to grid” in Material Point Method  Task instances generated for each partitioned particle set  Execution scheduled with respect to task dependencies  How to attribute execution time among different tasks  Assign semantic name (task type) to a task instance  SerialMPM::interpolateParticleToGrid  Map TAU timer object to (abstract) task (semantic entity)  Look up timer object using task type (semantic attribute)  Further partition along different domain-specific axes

Using External Associations  Two level mappings:  Level 1:  Level 2:  Embedded association vs External association Data (object) Performance Data... Hash Table
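
The difference between the two association types can be sketched in Python. This is an illustration under stated assumptions, not the TAU API: `Timer`, `TaskInstance`, and the `link_*`/`lookup_*` helpers are hypothetical, the durations are placeholders, and the external table is keyed by the task type string, where a C++ implementation would typically use an object address as the key.

```python
class Timer:
    """Stand-in for a performance measurement entity (not TAU's timer type)."""

    def __init__(self, name):
        self.name = name
        self.calls = 0
        self.total_secs = 0.0

    def record(self, secs):
        self.calls += 1
        self.total_secs += secs


class TaskInstance:
    def __init__(self, task_type, patch):
        self.task_type = task_type
        self.patch = patch
        self.timer = None  # extra slot used only by the embedded association


# Embedded association: the measured object itself carries its timer.
def link_embedded(obj, timer):
    obj.timer = timer


def lookup_embedded(obj):
    return obj.timer


# External association: a hash table outside the object maps a key to the timer.
external_table = {}


def link_external(key, timer):
    external_table[key] = timer


def lookup_external(key):
    return external_table[key]


task_type = "SerialMPM::interpolateParticleToGrid"
a = TaskInstance(task_type, patch=0)
b = TaskInstance(task_type, patch=1)

# Both instances share one timer, found through their task type.
link_external(task_type, Timer(task_type))
lookup_external(a.task_type).record(1.5)  # placeholder durations in seconds
lookup_external(b.task_type).record(2.0)
shared = lookup_external(task_type)
print(shared.calls, shared.total_secs)  # 2 3.5

# Embedded: only the object that was extended carries a timer.
link_embedded(a, Timer("patch 0 only"))
lookup_embedded(a).record(0.25)
print(lookup_embedded(a).calls, lookup_embedded(b))  # 1 None
```

The external form leaves the measured class untouched at the cost of a table look-up per access, while the embedded form avoids the look-up but requires changing the object's data structure.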

Task Performance Mapping Instrumentation
  void MPIScheduler::execute(const ProcessorGroup * pc,
                             DataWarehouseP & old_dw,
                             DataWarehouseP & dw )
  {
    ...
    // Create a mapping keyed by the task name
    TAU_MAPPING_CREATE( task->getName(), "[MPIScheduler::execute()]",
                        (TauGroup_t)(void*)task->getName(), task->getName(), 0);
    ...
    TAU_MAPPING_OBJECT(tautimer)
    TAU_MAPPING_LINK(tautimer,(TauGroup_t)(void*)task->getName()); // EXTERNAL ASSOCIATION
    ...
    // Time the task body with the timer found through the mapping
    TAU_MAPPING_PROFILE_TIMER(doitprofiler, tautimer, 0)
    TAU_MAPPING_PROFILE_START(doitprofiler,0);
    task->doit(pc);
    TAU_MAPPING_PROFILE_STOP(0);
    ...
  }

Task Performance Mapping (Profile) Performance mapping for different tasks Mapped task performance across processes

Task Performance Mapping (Trace)  Work packet computation events colored by task type  Distinct phases of computation can be identified based on task

Task Performance Mapping (Trace - Zoom) Startup communication imbalance

Task Performance Mapping (Trace - Parallelism) Communication / load imbalance

Comparing Uintah Traces for Scalability Analysis 8 processes 32 processes

Scaling Performance Optimizations  Last year: initial “correct” scheduler  Reduce communication by 10x  Reduce task graph overhead by 20x  ASCI Nirvana SGI Origin 2000, Los Alamos National Laboratory

Scalability to 2000 Processors (Fall 2001) ASCI Nirvana SGI Origin 2000 Los Alamos National Laboratory

Concluding Remarks  Complex software and parallel computing systems pose challenging performance analysis problems that require robust methodologies and tools  To build more sophisticated performance tools, existing proven performance technology must be utilized  Performance tools must be integrated with software and systems models and technology  Performance engineered software  Function consistently and coherently in software and system environments  PAPI and TAU performance systems offer robust performance technology that can be broadly integrated

Information  TAU (  PDT (  PAPI (  OPARI (

Support Acknowledgement  TAU and PDT support:  Department of Energy (DOE)  DOE 2000 ACTS contract  DOE MICS contract  DOE ASCI Level 3 (LANL, LLNL)  U. of Utah DOE ASCI Level 1 subcontract  DARPA  NSF National Young Investigator (NYI) award