Workload Characterization using the TAU Performance System Sameer Shende, Allen D. Malony, Alan Morris University of Oregon {sameer,

Slides:

Advertisements

Similar presentations

Machine Learning-based Autotuning with TAU and Active Harmony Nicholas Chaimov University of Oregon Paradyn Week 2013 April 29, 2013.

Advertisements

K T A U Kernel Tuning and Analysis Utilities Department of Computer and Information Science Performance Research Laboratory University of Oregon.

Automated Instrumentation and Monitoring System (AIMS)

Dynamic performance measurement control Dynamic event grouping Multiple configurable counters Selective instrumentation Application-Level Performance Access.

Sameer Shende, Allen D. Malony, and Alan Morris {sameer, malony, Steven Parker, and J. Davison de St. Germain {sparker,

Robert Bell, Allen D. Malony, Sameer Shende Department of Computer and Information Science Computational Science.

Scalability Study of S3D using TAU Sameer Shende

Sameer Shende Department of Computer and Information Science Neuro Informatics Center University of Oregon Tool Interoperability.

Profiling S3D on Cray XT3 using TAU Sameer Shende

TAU: Tuning and Analysis Utilities. TAU Performance System Framework  Tuning and Analysis Utilities  Performance system framework for scalable parallel.

TAU Parallel Performance System DOD UGC 2004 Tutorial Allen D. Malony, Sameer Shende, Robert Bell Univesity of Oregon.

The TAU Performance Technology for Complex Parallel Systems (Performance Analysis Bring Your Own Code Workshop, NRL Washington D.C.) Sameer Shende, Allen.

Nick Trebon, Alan Morris, Jaideep Ray, Sameer Shende, Allen Malony {ntrebon, amorris, Department of.

TAU Performance System

On the Integration and Use of OpenMP Performance Tools in the SPEC OMP2001 Benchmarks Bernd Mohr 1, Allen D. Malony 2, Rudi Eigenmann 3 1 Forschungszentrum.

The TAU Performance System: Advances in Performance Mapping Sameer Shende University of Oregon.

Workshop on Performance Tools for Petascale Computing 9:30 – 10:30am, Tuesday, July 17, 2007, Snowbird, UT Sameer S. Shende

TAU Performance System Alan Morris, Sameer Shende, Allen D. Malony University of Oregon {amorris, sameer,

Performance Tools BOF, SC’07 5:30pm – 7pm, Tuesday, A9 Sameer S. Shende Performance Research Laboratory University.

Performance Instrumentation and Measurement for Terascale Systems Jack Dongarra, Shirley Moore, Philip Mucci University of Tennessee Sameer Shende, and.

Allen D. Malony Department of Computer and Information Science Computational Science Institute University of Oregon TAU Performance.

June 2, 2003ICCS Performance Instrumentation and Measurement for Terascale Systems Jack Dongarra, Shirley Moore, Philip Mucci University of Tennessee.

Performance Evaluation of S3D using TAU Sameer Shende

TAU: Performance Regression Testing Harness for FLASH Sameer Shende

Sameer Shende, Allen D. Malony and Alan Morris {sameer, malony, Department of Computer and Information Science Performance Research.

Scalability Study of S3D using TAU Sameer Shende

Optimization of Instrumentation in Parallel Performance Evaluation Tools Sameer Shende, Allen D. Malony, Alan Morris University of Oregon {sameer,

Allen D. Malony, Sameer Shende, Robert Bell Department of Computer and Information Science Computational Science Institute, NeuroInformatics.

Kai Li, Allen D. Malony, Robert Bell, Sameer Shende Department of Computer and Information Science Computational.

The TAU Performance System Sameer Shende, Allen D. Malony, Robert Bell University of Oregon.

Sameer Shende, Allen D. Malony Computer & Information Science Department Computational Science Institute University of Oregon.

Performance Observation Sameer Shende and Allen D. Malony cs.uoregon.edu.

1 Performance Analysis with Vampir DKRZ Tutorial – 7 August, Hamburg Matthias Weber, Frank Winkler, Andreas Knüpfer ZIH, Technische Universität.

MpiP Evaluation Report Hans Sherburne, Adam Leko UPC Group HCS Research Laboratory University of Florida.

Paradyn Week – April 14, 2004 – Madison, WI DPOMP: A DPCL Based Infrastructure for Performance Monitoring of OpenMP Applications Bernd Mohr Forschungszentrum.

Allen D. Malony Performance Research Laboratory (PRL) Neuroinformatics Center (NIC) Department.

Score-P – A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir Alexandru Calotoiu German Research School for.

Integrated Performance Views in Charm++: Projections meets TAU Scott Biersdorff Allen D. Malony Department Computer and Information Science University.

Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer,

Profile Analysis with ParaProf Sameer Shende Performance Reseaerch Lab, University of Oregon

1 Performance Analysis with Vampir ZIH, Technische Universität Dresden.

Message Passing Programming Model AMANO, Hideharu Textbook pp. １４０－１４７.

Dynamic performance measurement control Dynamic event grouping Multiple configurable counters Selective instrumentation Application-Level Performance Access.

ASC Tri-Lab Code Development Tools Workshop Thursday, July 29, 2010 Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA This work.

Early Experiences with KTAU on the Blue Gene / L A. Nataraj, A. Malony, A. Morris, S. Shende Performance Research Lab University of Oregon.

Allen D. Malony, Sameer S. Shende, Alan Morris, Robert Bell, Kevin Huck, Nick Trebon, Suravee Suthikulpanit, Kai Li, Li Li

Allen D. Malony Performance Research Laboratory (PRL) Neuroinformatics Center (NIC) Department.

Allen D. Malony Department of Computer and Information Science TAU Performance Research Laboratory University of Oregon Discussion:

Tool Visualizations, Metrics, and Profiled Entities Overview [Brief Version] Adam Leko HCS Research Laboratory University of Florida.

Shangkar Mayanglambam, Allen D. Malony, Matthew J. Sottile Computer and Information Science Department Performance.

Integrated Performance Views in Charm++: Projections meets TAU Scott Biersdorff Allen D. Malony Department Computer and Information Science University.

Allen D. Malony Department of Computer and Information Science Performance Research Laboratory.

Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs Allen D. Malony, Scott Biersdorff, Sameer Shende, Heike Jagode†, Stanimire.

Performane Analyzer Performance Analysis and Visualization of Large-Scale Uintah Simulations Kai Li, Allen D. Malony, Sameer Shende, Robert Bell Performance.

3/12/2013Computer Engg, IIT(BHU)1 MPI-1. MESSAGE PASSING INTERFACE A message passing library specification Extended message-passing model Not a language.

TAU Performance System ® TAU is a profiling and tracing toolkit that supports programs written in C, C++, Fortran, Java, Python,

Allen D. Malony Performance Research Laboratory (PRL) Neuroinformatics Center (NIC) Department.

Performance Tool Integration in Programming Environments for GPU Acceleration: Experiences with TAU and HMPP Allen D. Malony1,2, Shangkar Mayanglambam1.

Kai Li, Allen D. Malony, Sameer Shende, Robert Bell

Introduction to the TAU Performance System®

Performance Technology for Scalable Parallel Systems

TAU integration with Score-P

Allen D. Malony, Sameer Shende

TAU Parallel Performance System

Advanced TAU Commander

A configurable binary instrumenter

Introduction to parallelism and the Message Passing Interface

Outline Introduction Motivation for performance mapping SEAA model

Parallel Program Analysis Framework for the DOE ACTS Toolkit

TAU Performance DataBase Framework (PerfDBF)

Presentation transcript:

Workload Characterization using the TAU Performance System Sameer Shende, Allen D. Malony, Alan Morris University of Oregon {sameer, PARA’06: MS25: Tools, Frameworks and Applications for High Performance Computing, 12:10 – 12:30 pm,Wed 6/21/06

2 Workload characterization using the TAU performance system Acknowledgements  David Skinner, NERSC (IPM project)

3 Workload characterization using the TAU performance system Outline  Overview of features  Instrumentation  Measurement (Profiling, Tracing)  Analysis tools  Workload characterization  Conclusions

4 Workload characterization using the TAU performance system TAU Performance System  Tuning and Analysis Utilities (14+ year project effort)  Performance system framework for HPC systems  Integrated, scalable, portable, flexible, and parallel  Integrated toolkit for performance problem solving  Automatic instrumentation  Highly configurable measurement system with support for many flavors of profiling and tracing  Portable analysis and visualization tools  Performance data management and data mining  Open-source, FreeBSD style license 

5 Workload characterization using the TAU performance system TAU Instrumentation Approach  Support for standard program events  Routines  Classes and templates  Statement-level blocks  Support for user-defined events  Begin/End events (“user-defined timers”)  Atomic events (e.g., size of memory allocated/freed)  Support definition of “semantic” entities for mapping  Support for event groups  Instrumentation optimization (eliminate instrumentation in lightweight routines)

6 Workload characterization using the TAU performance system TAU Instrumentation  Flexible instrumentation mechanisms at multiple levels  Source code  manual (TAU API, TAU Component API)  automatic C, C++, F77/90/95 (Program Database Toolkit (PDT)) OpenMP (directive rewriting (Opari), POMP spec)  Object code  pre-instrumented libraries (e.g., MPI using PMPI)  statically-linked and dynamically-linked  Executable code  dynamic instrumentation (pre-execution) (DynInstAPI)  virtual machine instrumentation (e.g., Java using JVMPI)  Runtime Linking (LD_PRELOAD)

7 Workload characterization using the TAU performance system Automatic Instrumentation  We now provide compiler wrapper scripts  Simply replace mpxlf90 with tau_f90.sh  Automatically instruments Fortran source code, links with TAU MPI Wrapper libraries.  Use tau_cc.sh and tau_cxx.sh for C/C++ Before CXX = mpCC F90 = mpxlf90_r CFLAGS = LIBS = -lm OBJS = f1.o f2.o f3.o … fn.o app: $(OBJS) $(CXX) $(LDFLAGS) $(OBJS) -o $(LIBS).cpp.o: $(CC) $(CFLAGS) -c $< After CXX = tau_cxx.sh F90 = tau_f90.sh CFLAGS = LIBS = -lm OBJS = f1.o f2.o f3.o … fn.o app: $(OBJS) $(CXX) $(LDFLAGS) $(OBJS) -o $(LIBS).cpp.o: $(CC) $(CFLAGS) -c $<

8 Workload characterization using the TAU performance system Profiling Options  Flat profiles  Time (or counts) spent in each routine (nodes in callgraph).  Exclusive/inclusive time, no. of calls, child calls  Support for hardware counters (PAPI, PCL), multiple counters.  Callpath Profiles  Flat profiles, plus  Time spent along a calling path (edges in callgraph)  E.g., “ main=> f1 => f2 => MPI_Send ” shows the time spent in MPI_Send when called by f2, when f2 is called by f1, when it is called by main.  Configurable callpath depth limit ( TAU_CALLPATH_DEPTH environment variable)  Phase based profiles  Flat profiles under a phase (nested phases are allowed)  Default “main” phase has all phases and routines invoked outside phases  Supports static or dynamic (per-iteration) phases  E.g., “ IO => MPI_Send ” is time spent in MPI_Send during “ IO ” phase

9 Workload characterization using the TAU performance system ParaProf – Manager Window performance database derived performance metrics

10 Workload characterization using the TAU performance system ParaProf – Full Profile (Miranda) 8K processors!

11 Workload characterization using the TAU performance system ParaProf - Statistics Table (Uintah)

12 Workload characterization using the TAU performance system ParaProf –Callgraph View (MFIX)

13 Workload characterization using the TAU performance system ParaProf – Histogram View (Miranda) 8k processors 16k processors  Scalable 2D displays

14 Workload characterization using the TAU performance system ParaProf – 3D Full Profile (Miranda) 16k processors

15 Workload characterization using the TAU performance system ParaProf – 3D Scatterplot (Miranda)  Each point is a “thread” of execution  Relation between four routines shown at once

16 Workload characterization using the TAU performance system Tracing (Vampir)  Trace analysis provides in-depth understanding of temporal event and message passing relationships  Traces can even store hardware counters

17 Workload characterization using the TAU performance system Runtime MPI shared library instrumentation  We can now interpose the MPI wrapper library for applications that have already been compiled (no re- compilation or re-linking necessary!)  Uses LD_PRELOAD for Linux  On AIX it uses MPI_EUILIB/MPI_EUILIBPATH  Simply compile TAU with MPI support and prefix your MPI program with tau_load.sh (use tau_poe under AIX)  Requires shared library MPI % mpirun –np 4 tau_load.sh a.out % tau_poe a.out –procs 8

18 Workload characterization using the TAU performance system Workload Characterization  Idea: partition performance data for individual functions based on runtime parameters  Enable by configuring with –PROFILEPARAM  TAU call: TAU_PROFILE_PARAM1L (value, “name”)  Simple example: void foo(int input) { TAU_PROFILE("foo", "", TAU_DEFAULT); TAU_PROFILE_PARAM1L(input, "input");... }

19 Workload characterization using the TAU performance system Workload Characterization  5 seconds spent in function “ foo ” becomes  2 seconds for “ foo [ = ] ”  1 seconds for “ foo [ = ] ”  …  Currently used in MPI wrapper library  Allows for partitioning of time spent in MPI routines based on parameters (message size, message tag, destination node)  Can be extrapolated to infer specifics about the MPI subsystem and system as a whole

20 Workload characterization using the TAU performance system Workload Characterization  Simple example, send/receive squared message sizes (0-32MB) #include int main(int argc, char **argv) { int rank, size, i, j; int buffer[16*1024*1024]; MPI_Init(&argc, &argv); MPI_Comm_size( MPI_COMM_WORLD, &size ); MPI_Comm_rank( MPI_COMM_WORLD, &rank ); for (i=0;i<1000;i++) for (j=1;j<16*1024*1024;j*=2) { if (rank == 0) { MPI_Send(buffer,j,MPI_INT,1,42,MPI_COMM_WORLD); } else { MPI_Status status; MPI_Recv(buffer,j,MPI_INT,0,42,MPI_COMM_WORLD,&status); } MPI_Finalize(); }

21 Workload characterization using the TAU performance system Workload Characterization  Use tau_load.sh to instrument MPI routines (SGI Altix) % icc mpi.c –lmpi % mpirun –np 2 tau_load.sh –XrunTAU-icpc-mpi-pdt a.out SGI MPI (SGI Altix) Intel MPI (SGI Altix)

22 Workload characterization using the TAU performance system Workload Characterization  MPI Results (NAS Parallel Benchmark 3.1, LU class D on 16 processors of SGI Altix)

23 Workload characterization using the TAU performance system Workload Characterization  Two different message sizes (~3.3MB and ~4K)

24 Workload characterization using the TAU performance system Workload Characterization  Two different message sizes (~3.3MB and ~4K)

25 Workload characterization using the TAU performance system Job Tracking: ParaProf profile browser LU spent seconds sending messages of size It got Mflops!

26 Workload characterization using the TAU performance system Memory Consumption Tracking  At every 10 second interval TAU can track the heap memory used  For LU.A.4, the max heap memory used was 44.9MB on node 0.

27 Workload characterization using the TAU performance system Vampir, VNG, and OTF  Commercial trace based tools developed at ZiH, T.U. Dresden  Wolfgang Nagel, Holger Brunst and others…  Vampir Trace Visualizer (aka Intel ® Trace Analyzer v4.0)  Sequential program  Vampir Next Generation (VNG)  Client (vng) runs on a desktop, server (vngd) on a cluster  Parallel trace analysis  Orders of magnitude bigger traces (more memory)  Open Trace Format (OTF)  Hierarchical trace format, efficient streams based parallel access with VNGD  Replacement for proprietary formats such as STF  Tracing library available on IBM BG/L platform  Open Source release of OTF by SC06  Development of OTF supported by LLNL contract

28 Workload characterization using the TAU performance system VNG Timeline Display (Miranda on BGL)

29 Workload characterization using the TAU performance system VNG Timeline Zoomed In

30 Workload characterization using the TAU performance system VNG Process Timeline with PAPI Counters

31 Workload characterization using the TAU performance system KTAU on BG/L  KTAU designed for Linux Kernel profiling  Provides merged application/system profile  Runs on I/O-Node of BG/L

32 Workload characterization using the TAU performance system KTAU on BG/L  Current status  Detailed I/O Node kernel profiling/tracing  KTAU integrated into ZeptoOS build system  KTAU-Daemon (KTAU-D) on I/O Node  Monitors system-wide and/or individual processes  Visualization of trace/profile of ZeptoOS and CIOD  Vampir/JumpShot (trace), and Paraprof (profile)

33 Workload characterization using the TAU performance system KTAU on BG/L  Example of I/O Node profile data  Numbers in microseconds, inclusive left, exclusive right

34 Workload characterization using the TAU performance system KTAU on BG/L, Trace Data

35 Workload characterization using the TAU performance system Support Acknowledgements  Department of Energy (DOE)  Office of Science contracts  University of Utah ASC Level 1 sub-contract  LLNL ASC/NNSA Level 3 contract  LLNL ParaTools/GWT contract  NSF  High-End Computing Grant  T.U. Dresden, GWT  Dr. Wolfgang Nagel and Holger Brunst  Research Centre Juelich  Dr. Bernd Mohr  Los Alamos National Laboratory contracts