Grouping Performance Data in TAU
- Profile Groups
  - A group of related routines forms a profile group
  - Statically defined
    - TAU_DEFAULT, TAU_USER[1-5], TAU_MESSAGE, TAU_IO, …
  - Dynamically defined
    - Group name based on a string, e.g. "adlib", "particles"
    - Runtime lookup in a map to get a unique group identifier
    - Assigning all routines in file.cpp to group "particles":
        tau_instrumentor file.pdb file.cpp -o file.i.cpp -g "particles"
  - Ability to change group names at runtime
  - Instrumentation control based on profile groups
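The "runtime lookup in a map" step can be pictured with a small stand-alone sketch. This is not TAU's implementation: `GroupRegistry`, `GroupId`, and the one-bit-per-group layout are assumptions made only to illustrate how a string name might be turned into a unique identifier on first use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical identifier type: one bit per group (an assumption, not TAU's layout).
using GroupId = std::uint64_t;

// Hypothetical registry that hands out a unique identifier per group name.
class GroupRegistry {
public:
    // Returns the identifier for `name`, allocating the next free bit on
    // first use. Later lookups of the same name return the same identifier.
    GroupId lookup(const std::string& name) {
        const auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        assert(next_bit_ < 64 && "sketch supports at most 64 groups");
        const GroupId id = GroupId{1} << next_bit_++;
        ids_.emplace(name, id);
        return id;
    }

    // Number of distinct group names seen so far.
    std::size_t size() const { return ids_.size(); }

private:
    std::map<std::string, GroupId> ids_;
    unsigned next_bit_ = 0;
};
```

With this sketch, `lookup("particles")` called twice yields the same identifier, while `lookup("adlib")` yields a different one.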

TAU Instrumentation Control API
- Enabling Profile Groups
    TAU_ENABLE_INSTRUMENTATION();          // global control
    TAU_ENABLE_GROUP(TAU_GROUP);           // statically defined
    TAU_ENABLE_GROUP_NAME("group name");   // dynamic
    TAU_ENABLE_ALL_GROUPS();               // for all groups
- Disabling Profile Groups
    TAU_DISABLE_INSTRUMENTATION();
    TAU_DISABLE_GROUP(TAU_GROUP);
    TAU_DISABLE_GROUP_NAME("group name");
    TAU_DISABLE_ALL_GROUPS();
- Obtaining Profile Group Identifier
    TAU_GET_PROFILE_GROUP("group name");
- Runtime Switching of Profile Groups
    TAU_PROFILE_SET_GROUP(TAU_GROUP);
    TAU_PROFILE_SET_GROUP_NAME("group name");
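One way to reason about how the global switch and the per-group switches interact is a bit mask checked when a timer starts. The sketch below is a stand-alone model under that assumption; `InstrumentationControl` and its member names are hypothetical and are not the code behind the TAU macros.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical group mask: each group owns one bit (an assumption).
using GroupMask = std::uint64_t;

// Hypothetical model of group-based instrumentation control.
class InstrumentationControl {
public:
    void enableInstrumentation()  { global_on_ = true; }
    void disableInstrumentation() { global_on_ = false; }

    void enableGroup(GroupMask g) {
        assert(g != 0);  // an empty mask names no group
        enabled_ |= g;
    }
    void disableGroup(GroupMask g) {
        assert(g != 0);
        enabled_ &= ~g;
    }

    void enableAllGroups()  { enabled_ = ~GroupMask{0}; }
    void disableAllGroups() { enabled_ = 0; }

    // A routine whose group mask is `g` is measured only if the global
    // switch is on and at least one of its groups is enabled.
    bool shouldMeasure(GroupMask g) const {
        return global_on_ && (enabled_ & g) != 0;
    }

private:
    bool global_on_ = true;
    GroupMask enabled_ = ~GroupMask{0};  // assumption: all groups start enabled
};
```

In this model, disabling one group leaves routines in other groups measured, while the global switch overrides every group setting.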

Disabling Dynamic Profile Group -- Example
    int main(int argc, char **argv)
    {
      /* Invoke program with --profile field+particles */
      TAU_INIT(&argc, &argv);
      …
    }
    void foo(void)
    {
      TAU_PROFILE("void foo(void)", " ", TAU_DEFAULT);
      Field f;
      TAU_DISABLE_GROUP_NAME("field");
      // other routines in "field" dynamic group are affected
      for (int i=0; i<N; i++)
        f.applyrules(i);
    }

TAU Pre-execution Control
- Dynamic groups defined at file scope
- Group names and group associations may be modified at runtime
- Controlling groups at pre-execution time using the --profile option
    % tau_instrumentor app.pdb app.cpp -o app.i.cpp -g "particles"
    % mpirun -np 4 application --profile particles+field+mesh+io
- Enables instrumentation for TAU_DEFAULT and the particles, field, mesh and io groups
- Examples:
  - POOMA v1 (LANL)
    - Static groups used
  - VTF (ASAP Caltech)
    - Dynamic execution instrumentation control by a Python-based controller
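The --profile argument is a list of group names joined by '+'. A minimal sketch of splitting such a list is shown here; `splitGroups` is a hypothetical helper written for illustration and is not part of TAU, whose own parsing is not shown in the slides.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits a '+'-separated group list such as "particles+field+mesh+io"
// into individual group names. Empty segments are skipped.
// Hypothetical helper for illustration only.
std::vector<std::string> splitGroups(const std::string& spec) {
    std::vector<std::string> groups;
    std::string::size_type start = 0;
    while (start <= spec.size()) {
        std::string::size_type end = spec.find('+', start);
        if (end == std::string::npos) end = spec.size();
        assert(end >= start);  // find() never returns a position before start
        if (end > start) groups.push_back(spec.substr(start, end - start));
        start = end + 1;
    }
    return groups;
}
```

Each resulting name could then be handed to a per-group enable call, with everything not listed left disabled.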

Applications of TAU
- POOMA
- PETSc
- SAMRAI

Performance Mapping in TAU: Motivation
- Complexity
  - Layered software
  - Multi-level instrumentation
  - Entities not directly in source
- Mapping
  - User-level abstractions

Hypothetical Mapping Example
- Particles distributed on surfaces of a cube
[Figure labels: Engine, Work packets]

Hypothetical Mapping Example Source
    Particle* P[MAX]; /* Array of particles */
    int GenerateParticles()
    {
      /* distribute particles over all faces of the cube */
      for (int face=0, last=0; face < 6; face++) {
        /* particles on this face */
        int particles_on_this_face = num(face);
        /* this face fills slots last .. last + particles_on_this_face - 1 */
        for (int i=last; i < last + particles_on_this_face; i++) {
          /* particle properties are a function of face */
          P[i] = ... f(face);
          ...
        }
        last += particles_on_this_face;
      }
    }

Hypothetical Mapping Example (continued)
- How much time is spent processing face i particles?
- What is the distribution of performance among faces?
    int ProcessParticle(Particle *p)
    {
      /* perform some computation on p */
    }
    int main()
    {
      GenerateParticles();          /* create a list of particles */
      for (int i = 0; i < N; i++)   /* iterates over the list */
        ProcessParticle(P[i]);
    }
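A routine-level profile reports one number for ProcessParticle and cannot answer either question. One way to answer them is to charge each call to the face its particle came from. The sketch below assumes every particle records its face index at creation; the slides do not show the Particle type, so `Particle::face`, `FaceStats`, and `FaceProfile` are hypothetical, and the timing uses std::chrono rather than TAU's mapping API.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>

// Assumption: the face index (0..5) is stored when the particle is generated.
struct Particle {
    int face;
    double value;
};

// Accumulated cost of processing the particles of one face.
struct FaceStats {
    std::size_t calls = 0;
    std::chrono::nanoseconds time{0};
};

// Hypothetical per-face profile: one accumulator per cube face.
class FaceProfile {
public:
    // Runs `work` and charges its elapsed time to the face of `p`.
    template <class Work>
    void measure(const Particle& p, Work work) {
        assert(p.face >= 0 && p.face < 6);
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        FaceStats& s = stats_[static_cast<std::size_t>(p.face)];
        ++s.calls;
        s.time += std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
    }

    const FaceStats& face(int f) const {
        assert(f >= 0 && f < 6);
        return stats_[static_cast<std::size_t>(f)];
    }

private:
    std::array<FaceStats, 6> stats_{};
};
```

Wrapping each ProcessParticle call in `measure` gives six accumulators, so time per face and the distribution across faces can be read directly.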

No Performance Mapping versus Mapping
- Typical performance tools report performance with respect to routines
  - Do not provide support for mapping
- Performance tools with SEAA mapping can observe performance with respect to the scientist's programming and problem abstractions
[Figure panels: without mapping, with mapping]

Semantic Entities/Attributes/Associations
- New dynamic mapping scheme - SEAA
  - Entities defined at any level of abstraction
  - Attribute entity with semantic information
  - Entity-to-entity associations
- Two association types:
  - Embedded – extends data structure of associated object to store performance measurement entity
  - External – creates an external look-up table using address of object as the key to locate performance measurement entity
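The difference between the two association types can be sketched in a few lines. Everything here is hypothetical (`Timer`, `KernelEmbedded`, `KernelPlain`, `ExternalMap`); it only illustrates the two lookup paths described above and is not TAU's mapping API.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

// Stand-in for a performance measurement entity.
struct Timer {
    explicit Timer(std::string n) : name(std::move(n)) {}
    std::string name;
    long calls = 0;
};

// Embedded association: the object's data structure is extended with a
// pointer to its timer, so the lookup is a single member access.
struct KernelEmbedded {
    Timer* timer = nullptr;
};

// External association: the object is left unchanged ...
struct KernelPlain {
    int data = 0;
};

// ... and a look-up table keyed by the object's address locates the timer.
class ExternalMap {
public:
    void associate(const void* obj, Timer* t) {
        assert(obj != nullptr && t != nullptr);
        table_[obj] = t;
    }
    // Returns the associated timer, or nullptr if none was registered.
    Timer* find(const void* obj) const {
        const auto it = table_.find(obj);
        return it == table_.end() ? nullptr : it->second;
    }

private:
    std::unordered_map<const void*, Timer*> table_;
};
```

The embedded form needs access to the class definition but costs nothing to look up; the external form works for classes that cannot be modified, at the price of a hash lookup per access.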

Mapping in POOMA II
- POOMA [LANL] is a C++ framework for computational physics
- Provides high-level abstractions:
  - Fields (Arrays), Particles, FFT, etc.
- Encapsulates details of parallelism and data distribution
- Uses custom computation kernels for efficient expression evaluation [PETE]
- Uses vertical execution of array statements to re-use cache [SMARTS]

POOMA II Array Example
- Multi-dimensional array statements
    A=B+C+D;

POOMA, PETE and SMARTS

Using Synchronous Timers

Form of Expression Templates in POOMA

Mapping Problem
- One-to-many upward mapping
- Traditional methods of mapping (amortization/aggregation) lack resolution and accuracy!
    template <class LHS, class RHS, class Op, class EvalTag>
    void ExpressionKernel<LHS,RHS,Op,EvalTag>::run()
    { /* iterate execution */ }
    A=1.0; B=2.0; …
    A=B+C+D;
    C=E-A+2.0*D; ...

POOMA II Mappings
- Each work packet belongs to an ExpressionKernel object
- Each statement's form is associated with a timer in the constructor of ExpressionKernel
- ExpressionKernel class extended with an embedded timer
- Timing calls at entry and exit of the run() method start and stop the per-object timer
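The scheme above can be sketched without POOMA. In this stand-alone model, `KernelSketch` plays the role of a kernel object: its constructor looks up the timer for the statement's textual form, and run() starts and stops that timer. All names (`StatementStats`, `StatementRegistry`, `KernelSketch`) are hypothetical, and std::chrono stands in for TAU's timers.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Accumulated measurements for one source statement.
struct StatementStats {
    long runs = 0;
    std::chrono::nanoseconds elapsed{0};
};

// Hypothetical registry: one stats record per statement form.
class StatementRegistry {
public:
    // std::map keeps references valid as new statements are added.
    StatementStats& forStatement(const std::string& form) { return stats_[form]; }
    const std::map<std::string, StatementStats>& all() const { return stats_; }

private:
    std::map<std::string, StatementStats> stats_;
};

// Hypothetical kernel with an embedded pointer to its statement's stats.
class KernelSketch {
public:
    KernelSketch(const std::string& form, StatementRegistry& reg,
                 std::function<void()> body)
        : stats_(&reg.forStatement(form)), body_(std::move(body)) {
        assert(body_ != nullptr);
    }

    // Timing at entry and exit charges the work to the owning statement.
    void run() {
        const auto t0 = std::chrono::steady_clock::now();
        body_();
        const auto t1 = std::chrono::steady_clock::now();
        ++stats_->runs;
        stats_->elapsed +=
            std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
    }

private:
    StatementStats* stats_;
    std::function<void()> body_;
};
```

Several kernel objects built from the same statement share one record, so the shared run() routine is split into per-statement totals.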

Results of TAU Mappings
- Per-statement profile!

POOMA Traces
- Helps bridge the semantic gap!

PETSc (ANL)
- Portable, Extensible Toolkit for Scientific Computation
- Suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations
- Uses MPI for inter-process communication
- Instrumentation
  - PDT for C/C++ source instrumentation
  - MPI wrapper library layer instrumentation
- Example:
  - Solves a set of linear equations (Ax=b) in parallel (SLES)

PETSc Linear Equation Solver Profile

PETSc Traces

PETSc Calltree and Communication Matrix

SAMRAI (LLNL)
- Structured Adaptive Mesh Refinement Application Infrastructure
- Instrumentation for TAU:
  - PDT-based C++ instrumentation
  - MPI wrapper interposition library-based instrumentation
  - SAMRAI timers mapped to TAU timers
    - TAU's Mapping API
    - Embedded association

Mapping in TAU
- Embedded association vs. external association
[Figure labels: SAMRAI Timer, Performance Data, Hash Table, TAU Timer]

SAMRAI Euler Gas Dynamics Application (2D)
- Adaptive Mesh Refinement (AMR) application
- A single overarching algorithm object drives the time integration and adaptive gridding processes
- Discrete Euler equations are solved on each patch in the AMR hierarchy

Euler Profile (ComputeFluxesOnPatch)

Euler Profile (Summary)

Euler Profile (Inclusive Time)

Euler Profile (Contribution of Flux computation)

Euler Traces

Euler CallTree (ComputeFluxesOnPatch)

Hands-on session
- On mcurie.nersc.gov, copy files from /usr/local/pkg/acts/tau/tau2/tau-2.9/training
  - See README file
- Set correct path, e.g.,
    % set path=($path /usr/local/pkg/acts/tau/tau2/tau2.9/t3e/bin)
- Examine the Makefile.
- Type "make" in each directory; then execute the program
- Type "racy" or "vampir"
- Type a project name, e.g., "matrix.pmf", and click OK to see the performance data.

Examples
The training directory contains example programs that illustrate the use of TAU instrumentation and measurement options.
- instrument - A simple C++ example that shows how TAU's API can be used for manually instrumenting a C++ program. It highlights instrumentation for templates and user-defined events.
- threads - A simple multi-threaded program that shows how the main function of a thread is instrumented. Performance data is generated for each thread of execution. Configure with -pthread.
- cthreads - Same as threads above, but for a C program. An instrumented C program may be compiled with a C compiler, but needs to be linked with a C++ linker. Configure with -pthread.
- pi - An MPI program that calculates the value of pi and e. It highlights the use of TAU's MPI wrapper library. TAU needs to be configured with -mpiinc= and -mpilib=. Run using mpirun -np cpi.
- papi - A matrix multiply example that shows how to use TAU statement-level timers for comparing the performance of two algorithms for matrix multiplication. When used with PAPI or PCL, this can highlight the cache behaviors of these algorithms. TAU should be configured with -papi= or -pcl=, and the user should set the respective PAPI_EVENT or PCL_EVENT environment variable to use this.

Examples - (cont.)
- papithreads - Same as papi, but uses threads to highlight how hardware performance counters may be used in a multi-threaded application. When it is used with PAPI, TAU should be configured with -papi= -pthread.
- autoinstrument - Shows the use of the Program Database Toolkit (PDT) for automating the insertion of TAU macros in the source code. It requires configuring TAU with the -pdt= option. The Makefile is modified to illustrate the use of a source-to-source translator (tau_instrumentor).
- NPB2.3 - The NAS Parallel Benchmark 2.3 [from NASA Ames]. It shows how to use TAU's MPI wrapper with a manually instrumented Fortran program. LU and SP are the two benchmarks. LU is instrumented completely, while only parts of the SP program are instrumented to contrast the coverage of routines. In both cases MPI-level instrumentation is complete. TAU needs to be configured with -mpiinc= and -mpilib= to use this.