Performance Evaluation of Adaptive Scientific Applications using TAU
Sameer Shende, Allen D. Malony, and Alan Morris (University of Oregon)
Steven Parker and J. Davison de St. Germain (University of Utah)

Parallel CFD Performance Evaluation of Adaptive Scientific Applications using TAU

TAU Performance System Framework
- Tuning and Analysis Utilities
- Performance system framework for scalable parallel and distributed high-performance computing
- Targets a general complex system computation model
  - nodes / contexts / threads
  - Multi-level: system / software / parallelism
  - Measurement and analysis abstraction
- Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
  - Portable, configurable performance profiling/tracing facility
  - Open software approach
- University of Oregon, LANL, FZJ Germany

TAU Performance System Architecture
[Figure: architecture diagram; tools labelled in the figure include Jumpshot, Paraver, and paraprof]

Program Database Toolkit (PDT)
- Program code analysis framework
  - Develop source-based tools
- High-level interface to source code information
- Integrated toolkit for source code parsing, database creation, and database query
  - Commercial grade front-end parsers
  - Portable IL analyzer, database format, and access API
  - Open software approach for tool development
- Multiple source languages
- Implement automatic performance instrumentation tools
  - tau_instrumentor

Program Database Toolkit (PDT)
[Figure: PDT data flow. Application / library source passes through the C / C++ parser or the Fortran parser (F77/90/95) to produce IL; the C / C++ IL analyzer or the Fortran IL analyzer then writes program database files, which tools read through DUCTAPE: PDBhtml (program documentation), SILOON (application component glue), CHASM (C++ / F90/95 interoperability), and TAU_instr (automatic source instrumentation).]

AutoInstrumentation using TAU_COMPILER
- $(TAU_COMPILER) stub Makefile variable (v )
- Invokes PDT parser, TAU instrumentor, and compiler through the tau_compiler.sh shell script
- Requires minimal changes to application Makefile
  - Compilation rules are not changed
  - User adds $(TAU_COMPILER) before compiler name
  - F90 = mpxlf90 changes to F90 = $(TAU_COMPILER) mpxlf90
- Passes options from TAU stub Makefile to the four compilation stages
- Uses original compilation command if an error occurs
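The last bullet above (fall back to the original compilation command if an error occurs) can be sketched as a small wrapper. This is only an illustration of the fallback idea, not the logic of tau_compiler.sh itself; the function name `run_with_fallback` and the stand-in commands are hypothetical.

```python
import subprocess
import sys


def run_with_fallback(instrumented_steps, original_cmd):
    """Run each instrumented step in order; on the first failure, run the
    original command instead.

    Returns "instrumented" or "original" to say which path was taken.
    """
    for step in instrumented_steps:
        if subprocess.run(step).returncode != 0:
            # Assumption: any failed stage abandons instrumentation entirely.
            subprocess.run(original_cmd, check=True)
            return "original"
    return "instrumented"


def demo():
    # Stand-ins for real parse/instrument/compile commands (hypothetical).
    ok = [sys.executable, "-c", "pass"]
    bad = [sys.executable, "-c", "import sys; sys.exit(1)"]
    return run_with_fallback([ok, ok], ok), run_with_fallback([ok, bad], ok)
```

The point of the design is that a build never breaks because instrumentation failed: the worst case is an uninstrumented object file.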

TAU_COMPILER: Improving Integration in Makefiles

    include /usr/tau /rs6000/Makefile.tau-mpi-pdt
    CXX = $(TAU_COMPILER) mpCC
    F90 = $(TAU_COMPILER) mpxlf90_r
    CFLAGS =
    LIBS = -lm
    OBJS = f1.o f2.o f3.o … fn.o

    app: $(OBJS)
            $(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)

    .cpp.o:
            $(CXX) $(CFLAGS) -c $<

TAU_COMPILER Commandline Options
- See / /bin/tau_compiler.sh -help
- Compilation:

      % mpxlf90 -c foo.f90

  changes to

      % f95parse foo.f90 $(OPT1)
      % tau_instrumentor foo.pdb foo.f90 -o foo.inst.f90 $(OPT2)
      % mpxlf90 -c foo.f90 $(OPT3)

- Linking:

      % mpxlf90 foo.o bar.o -o app

  changes to

      % mpxlf90 foo.o bar.o -o app $(OPT4)

- The default values of options OPT[1-4] may be overridden by the user:

      F90 = $(TAU_COMPILER) $(MYOPTIONS) mpxlf90
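The compile-time expansion listed above can be written down as a small function that turns one compile command into the three commands from the slide. This is a sketch that only builds strings; it does not call TAU or PDT, and the function name `expand_compile` and its `opt1`..`opt3` parameters are illustrative assumptions, not part of tau_compiler.sh. The third command mirrors the slide verbatim.

```python
import os
import shlex


def expand_compile(compile_cmd, opt1="", opt2="", opt3=""):
    """Expand 'compiler -c file.f90' into the three commands shown on the slide."""
    args = shlex.split(compile_cmd)
    compiler = args[0]
    # Assumption: the first non-option argument is the source file.
    source = next(a for a in args[1:] if not a.startswith("-"))
    stem, ext = os.path.splitext(source)
    steps = [
        f"f95parse {source} {opt1}",
        f"tau_instrumentor {stem}.pdb {source} -o {stem}.inst{ext} {opt2}",
        f"{compiler} -c {source} {opt3}",  # copied as written on the slide
    ]
    return [s.strip() for s in steps]
```

For example, `expand_compile("mpxlf90 -c foo.f90")` yields the parse, instrument, and compile commands in that order.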

Overriding Default Options: TAU_COMPILER

    include /usr/common/acts/TAU/tau /rs6000/lib/Makefile.tau-mpi-pdt-trace
    MYOPTIONS = -optVerbose -optKeepFiles
    F90 = $(TAU_COMPILER) $(MYOPTIONS) mpxlf90
    OBJS = f1.o f2.o f3.o …
    LIBS = -Lappdir -lapplib1 -lapplib2 …

    app: $(OBJS)
            $(F90) $(OBJS) -o app $(LIBS)

    .f90.o:
            $(F90) -c $<

Using TAU
- Configuration
- Instrumentation
  - Manual
  - MPI: wrapper interposition library
  - PDT: source rewriting for C, C++, F77/90/95
  - OpenMP: directive rewriting
  - Component based instrumentation: proxy components
  - Binary instrumentation
    - DyninstAPI: runtime instrumentation / rewriting binary
  - Java: runtime instrumentation
  - Python: runtime instrumentation
- Measurement
- Performance Analysis

Profile Measurement: Three Flavors
- Flat profiles
  - Time (or counts) spent in each routine (nodes in callgraph)
  - Exclusive/inclusive time, no. of calls, child calls
  - E.g., MPI_Send, foo, …
- Callpath profiles
  - Flat profiles, plus
  - Sequence of actions that led to poor performance
  - Time spent along a calling path (edges in callgraph)
  - E.g., "main => f1 => f2 => MPI_Send" shows the time spent in MPI_Send when called by f2, when f2 is called by f1, when f1 is called by main. Depth of this callpath = 4 (TAU_CALLPATH_DEPTH environment variable)
- Phase based profiles
  - Flat profiles, plus
  - Flat profiles under a phase (nested phases are allowed)
  - Default "main" phase has all phases and routines invoked outside phases
  - Supports static or dynamic (per-iteration) phases
  - E.g., "IO => MPI_Send" is time spent in MPI_Send in the IO phase
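The difference between a flat profile and a callpath profile can be made concrete with a toy profiler. The sketch below is not TAU's implementation: the class `ToyProfiler`, its `callpath_depth` argument (standing in for the role of TAU_CALLPATH_DEPTH), and the hand-advanced clock are all assumptions chosen to keep the example deterministic. It also ignores recursion, where inclusive time would be double-counted.

```python
class ToyProfiler:
    """Toy flat + callpath profiler driven by an injected clock."""

    def __init__(self, clock, callpath_depth=2):
        self.clock = clock            # callable returning the current time
        self.depth = callpath_depth   # max number of routines kept in a callpath key
        self.stack = []               # entries: [routine name, start time, child time]
        self.flat = {}                # name -> {"calls", "incl", "excl"}
        self.callpath = {}            # "a => b" -> inclusive time

    def start(self, name):
        self.stack.append([name, self.clock(), 0.0])

    def stop(self):
        name, t0, child = self.stack.pop()
        incl = self.clock() - t0
        entry = self.flat.setdefault(name, {"calls": 0, "incl": 0.0, "excl": 0.0})
        entry["calls"] += 1
        entry["incl"] += incl
        entry["excl"] += incl - child          # exclusive = inclusive minus children
        if self.stack:
            self.stack[-1][2] += incl          # charge this time to the caller
        frames = [f[0] for f in self.stack] + [name]
        key = " => ".join(frames[-self.depth:])  # keep only the last `depth` routines
        self.callpath[key] = self.callpath.get(key, 0.0) + incl


def demo(depth):
    now = [0.0]
    p = ToyProfiler(lambda: now[0], callpath_depth=depth)
    for routine in ("main", "f1", "f2"):
        p.start(routine)
        now[0] += 1.0                 # one second of work before the next call
    p.start("MPI_Send")
    now[0] += 5.0                     # five seconds inside MPI_Send
    for _ in range(4):
        p.stop()
    return p
```

With `demo(4)` the key "main => f1 => f2 => MPI_Send" holds the 5 seconds spent in MPI_Send on that path; with `demo(2)` the same time is recorded under the shorter key "f2 => MPI_Send".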

TAU Timers and Phases
- Static timer
  - Shows time spent in all invocations of a routine (foo)
  - E.g., "foo()" 100 secs, 100 calls
- Dynamic timer
  - Shows time spent in each invocation of a routine
  - E.g., "foo() 3" 4.5 secs, "foo() 10" 2 secs (invocations 3 and 10 respectively)
- Static phase
  - Shows time spent in all routines called (directly/indirectly) by a given routine (foo)
  - E.g., "foo() => MPI_Send()" 100 secs, 10 calls shows that a total of 100 secs were spent in MPI_Send() when it was called by foo.
- Dynamic phase
  - Shows time spent in all routines called by a given invocation of a routine
  - E.g., "foo() 4 => MPI_Send()" 12 secs shows that 12 secs were spent in MPI_Send when it was called by the 4th invocation of foo.
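The four naming schemes above can be illustrated with a toy model. This sketch only mimics the names that appear in the profile ("foo()", "foo() 3", "foo() 4 => MPI_Send()"); it is not how TAU stores its data. The class `ToyTimers`, its `dynamic` and `phase` flags, and the 1-based invocation numbering are assumptions made for the example.

```python
from collections import defaultdict


class ToyTimers:
    """Toy model of static/dynamic timers and phases, driven by an injected clock."""

    def __init__(self, clock):
        self.clock = clock
        self.seconds = defaultdict(float)   # profile entry name -> seconds
        self.invocation = defaultdict(int)  # routine -> dynamic invocations so far
        self.stack = []                     # entries: (entry name, start time, is_phase)

    def start(self, routine, dynamic=False, phase=False):
        name = routine
        if dynamic:
            # A dynamic timer/phase gets one entry per invocation: "foo() 1", "foo() 2", ...
            self.invocation[routine] += 1
            name = f"{routine} {self.invocation[routine]}"
        self.stack.append((name, self.clock(), phase))

    def stop(self):
        name, t0, _ = self.stack.pop()
        elapsed = self.clock() - t0
        self.seconds[name] += elapsed
        # Also attribute the time to the nearest enclosing phase, if any.
        for parent, _, is_phase in reversed(self.stack):
            if is_phase:
                self.seconds[f"{parent} => {name}"] += elapsed
                break


def demo():
    now = [0.0]
    t = ToyTimers(lambda: now[0])
    for send_cost in (2.0, 3.0):
        t.start("foo()", dynamic=True, phase=True)  # dynamic phase, one per call
        t.start("MPI_Send()")                       # static timer
        now[0] += send_cost
        t.stop()
        now[0] += 1.0                               # foo's own work
        t.stop()
    return dict(t.seconds)
```

Passing `dynamic=False, phase=True` for foo would instead accumulate both calls under a single static phase entry, "foo() => MPI_Send()".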

Static Timers in TAU

    SUBROUTINE SUM_OF_CUBES
      integer profiler(2)
      save profiler
      INTEGER :: H, T, U
      call TAU_PROFILE_TIMER(profiler, 'SUM_OF_CUBES')
      call TAU_PROFILE_START(profiler)
      ! This program prints all 3-digit numbers that
      ! equal the sum of the cubes of their digits.
      DO H = 1, 9
        DO T = 0, 9
          DO U = 0, 9
            IF (100*H + 10*T + U == H**3 + T**3 + U**3) THEN
              PRINT "(3I1)", H, T, U
            ENDIF
          END DO
        END DO
      END DO
      call TAU_PROFILE_STOP(profiler)
    END SUBROUTINE SUM_OF_CUBES

Static Phases and Timers

    SUBROUTINE FOO
      integer profiler(2)
      save profiler
      call TAU_PHASE_CREATE_STATIC(profiler, 'foo')
      call TAU_PHASE_START(profiler)
      call bar()
      ! Here bar calls MPI_Barrier and we evaluate
      ! foo=>MPI_Barrier and foo=>bar
      call TAU_PHASE_STOP(profiler)
    END SUBROUTINE FOO

    SUBROUTINE BAR
      include 'mpif.h'
      integer profiler(2)
      save profiler
      integer ierr
      call TAU_PROFILE_TIMER(profiler, 'bar')
      call TAU_PROFILE_START(profiler)
      call MPI_Barrier(MPI_COMM_WORLD, ierr)
      call TAU_PROFILE_STOP(profiler)
    END SUBROUTINE BAR

Dynamic Phases

    SUBROUTINE ITERATE(IER, NIT)
      IMPLICIT NONE
      INTEGER IER, NIT
      character(11) taucharary
      integer tauiteration / 0 /
      integer profiler(2) / 0, 0 /
      save profiler, tauiteration
      write (taucharary, '(a8,i3)') 'ITERATE ', tauiteration
      ! taucharary is the name of the phase, e.g., 'ITERATE  23'
      tauiteration = tauiteration + 1
      call TAU_PHASE_CREATE_DYNAMIC(profiler, taucharary)
      call TAU_PHASE_START(profiler)
      IER = 0
      call SOLVE_K_EPSILON_EQ(IER)
      ! Other work
      call TAU_PHASE_STOP(profiler)
    END SUBROUTINE ITERATE
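The phase name in the Fortran code above is built with the edit descriptors (a8,i3), which always produce 11 characters and so fit the character(11) buffer. The sketch below reproduces that formatting to show what the names look like; the function name `dynamic_phase_name` is hypothetical, and the handling of overflow as asterisks follows standard Fortran I-descriptor behaviour.

```python
def dynamic_phase_name(prefix: str, iteration: int) -> str:
    """Mimic Fortran's '(a8,i3)': an 8-character text field, then a 3-wide integer."""
    text = f"{prefix:>8.8}"            # A8: right-justify short text, keep leftmost 8 chars
    digits = str(iteration)
    # I3: right-justified in 3 columns; a value that does not fit prints as asterisks.
    number = digits.rjust(3) if len(digits) <= 3 else "***"
    return text + number
```

For example, `dynamic_phase_name("ITERATE ", 23)` gives `"ITERATE  23"`, so each iteration gets its own entry in the profile, and names stop being unique once the counter needs more than three digits.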

TAU’s ParaProf Profile Browser: Static Timers

Dynamic Timers

Static Phases
MPI_Barrier took 4.85 secs out of secs in the DTM Phase

Dynamic Phases
The first iteration was expensive for INT_RTE. It took secs. Other iterations took less time: 14.2, 10.5, 10.3, 10.5 seconds

Dynamic Phases
- Time spent in MPI_Barrier, MPI_Recv, … in DTM ITERATION 1
- Breakdown of time spent in MPI_Isend based on its static and dynamic parent phases

Case Study: Uintah Computational Framework
[Figure panels: Heptane fire simulation; Material stress simulation]
Typical C-SAFE simulation with a billion degrees of freedom and non-linear time dynamics

Uintah High-Level Component View

High Level Architecture
[Figure: Uintah Parallel Component Architecture (C-SAFE). Labels recovered from the diagram: Implicitly Connected to All Components; UCF Data; Control / Light Data; Checkpointing; Mixing Model; Fluid Model; Subgrid Model; Chemistry Database Controller; Chemistry Databases; High Energy Simulations; Numerical Solvers; Non-PSE Components; Performance Analysis; Simulation Controller; Problem Specification; MPM; Material Properties Database; Blazer; Database; Visualization; Data Manager; Post Processing And Analysis; Parallel Services; Resource Management; PSE Components; Scheduler.]

Performance Evaluation of Uintah using Patches

Patch 1 Phase Profile

Node View

Callgraph View

Parallel CFD Frameworks using TAU
- Virtual Test Facility (VTF) [Caltech, ASC Center]
- MFIX [NETL]
- Earth System Modeling Framework (ESMF) [UCAR, NASA, ANL, …]
- SAMRAI [LLNL]
- S3D [Sandia, UMD, U. Michigan, …]
- GrACE [Rutgers]
- Miranda [LLNL]
- SAGE [SAIC]
- FLASH2 [U. Chicago, ASC Flash Center]
- Uintah Computational Framework [ASC C-SAFE Center, U. Utah]
- …

TAU Performance System Status
- Computing platforms (selected)
  - IBM SP / pSeries / BGL, SGI Origin 2K/3K, Cray T3E / SV-1 / X1, HP (Compaq) SC (Tru64), Sun, Hitachi SR8000, NEC SX-5/6, Linux clusters (IA-32/64, Alpha, PPC, PA-RISC, Power, Opteron), Apple (G4/5, OS X), Windows
- Programming languages
  - C, C++, Fortran 77/90/95, HPF, Java, OpenMP, Python
- Thread libraries
  - pthreads, SGI sproc, Java, Windows, OpenMP, Charm++
- Compilers (selected)
  - Intel KAI (KCC, KAP/Pro), PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray, IBM (xlc, xlf), Compaq, NEC, Intel

Concluding Remarks
- Complex parallel systems and software pose challenging performance analysis problems that require robust methodologies and tools
- Introduced new measurement techniques in TAU for evaluating performance of adaptive scientific applications
  - Support for static and dynamic timers and phases
  - Application to the Uintah Computational Framework (U. Utah)
- To build more sophisticated performance tools, existing proven performance technology must be utilized
- TAU performance system offers robust performance technology that can be broadly integrated

Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science contracts
  - University of Utah DOE ASC Level 1 sub-contract
  - DOE ASC/NNSA Level 3 contract
- NSF Software and Tools for High-End Computing Grant
- Research Centre Juelich
  - John von Neumann Institute for Computing
  - Dr. Bernd Mohr
- Los Alamos National Laboratory