Phase-Based Parallel Performance Profiling

Allen D. Malony, Sameer Shende, Alan Morris
Department of Computer and Information Science
Performance Research Laboratory
NeuroInformatics Center
University of Oregon

Outline of Talk
- Motivation
  - Models in parallel scientific applications
  - Phases and performance mapping
- Problem description
  - Motivating example
- Profiling techniques
  - Flat, callpath, phase profiling
- Approach and implementation
- Applications
- Future work and concluding remarks

Motivation
- Scientific applications are designed based on models
  - Computational: structural, logical, numerical models, …
  - Correctness: execution order, data consistency, …
  - Performance: expected, factors, parallelism/scalability, …
- Computational models form the developer's "mental" model
  - How the program is intended to behave and perform
- Want to relate the performance model to the computation model
  - View performance data with respect to the "mental" model
  - Better identify problems and guide tuning decisions
- Must link computational abstractions to performance
  - Bridge the semantic gap: measurements → "mental" model

Computational Models
- Structural models
  - Program organization and code relationships
  - Language used, layout of application parts, …
  - Constructed generally and unfolds during execution
- Logical and numerical models
  - Capture algorithmic characteristics of the application
  - "Semantic" properties of the computation
    - correct flow of operation and assertions on application state
- Numerical models
  - Algorithms for simulating physical phenomena
  - Accuracy properties from numerical calculations
  - Structural and logical models implicit

Performance Mapping
- General problem of linking performance to computation
- Performance mapping (Irvin and Miller, '96; Shende, '01)
  - Associate (map) measured performance data
  - to higher-level, semantic representations
  - those with model significance to the user
- What is the difficulty of making the association?
  - Depends on the performance information
    - performance events/state visible from instrumentation
    - what performance data can be measured
  - How the performance information is used in mapping
- Difficulty in how performance information is presented
  - Model-based views (LeBlanc et al., '90)

Phases and Performance Mapping
- We would like to support the association between model and data
- The concept of "phases" is common in scientific applications
  - How developers think about structure, logic, numerics
  - How performance can be interpreted (Worley, '92)
- Worthwhile to consider support for phases
  - in performance measurement
  - Bridge the semantic gap in parallel performance mapping?
    - tracing has long demonstrated the benefits! (Heath, '91)
    - phase-based analysis and interpretation
- Main contribution
  - Support for phases in parallel performance profiling

Problem Description
- Performance is measured as a consequence of events
  - Events represent actions that occur during execution
  - Events of interest determine the performance information
- Events have semantics and context (pragmatics)
- Semantics
  - defines what the event represents
  - example: subroutine entry
- Context
  - properties of the state in which the event occurred
  - example: subroutine's calling parent
- Interrogate context to map event performance data

Motivating Example – Multi-Physics Application
[Figure: multi-physics code with routines heat(), stress(), MPIrecv(), MPIsend(), other routines]
- Assembly of physical objects
  - Different shapes
  - Different materials
- Calculate physics
  - Heat transfer
  - Mechanical stress
  - Within / between objects
  - Iterate to error tolerance
- How is performance attributed?
  - Between events (e.g., routines) and execution components
  - With respect to computational objects (e.g., data objects)

Context and Standard Profiling
- Flat profiles
  - Context is the whole program (i.e., program code)
  - Performance distribution across (static) program structure
  - Cannot differentiate dynamics (e.g., callpath or objects)
- Callgraph / callpath profiles
  - Identify parent-child calling relationships at execution
  - Context is the calling (event) parent / calling (event) path
  - Extend event semantics to encode context
    - create a new event with the callpath name
    - requires dynamic event creation for complex callpaths
    - burdens event mechanisms for context identification
    - simple performance associations require many events

Context and Phase Profiling
- View the program execution as a collection of phases
  - Transition between phases (sequenced, nested)
    - easiest to think of as a phase hierarchy (or phase graph)
  - Phases are not events
    - phase boundaries can mark entry/exit events
- Context is the current phase
  - How do we know what phase we are in?
- Phases are identified separately from events
  - phases are not encoded in event names
  - event mechanisms are not overloaded
- A phase profile is event performance attributed to phases
  - Phase-specific performance profiles (flat or callpath)

Approach (Flat Profile)
- Create a profile object for each entry/exit event
  - Each profile object has a name
  - Static profile object (static event)
    - event has a single instance (single name)
  - Dynamic profile object (dynamic event)
    - event can have multiple instances (created dynamically)
  - Inclusive and exclusive performance statistics
  - Must maintain an event stack (or callstack)
- Contexts are generally thought of as code locations
- Dynamic events do allow for dynamic context awareness
  - User code can check "state" and create new events
  - BUT only see one level of event!
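A minimal sketch (not from the talk) of flat-profile instrumentation with TAU's standard C++ timer macros; the routine name compute and the use of the TAU_DEFAULT group are illustrative assumptions.

    #include <TAU.h>

    void compute()
    {
      // Creates and starts a static profile object for this event;
      // it is stopped when the scope exits.
      TAU_PROFILE("compute", "void ()", TAU_DEFAULT);
      // ... work attributed to the "compute" event ...
    }

    int main(int argc, char** argv)
    {
      TAU_PROFILE_INIT(argc, argv);
      TAU_PROFILE_SET_NODE(0);   // single-process example
      TAU_PROFILE("main", "int (int, char**)", TAU_DEFAULT);
      compute();
      return 0;
    }

In a flat profile, all time measured for "compute" is attributed to that single event name, regardless of where it was called from.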

Approach (Callpath Profile)
- Show event calling (nesting) relationships
- Create a profile object for each event calling context
  - Each profile object has a name that encodes the callpath
  - Static profile object
    - callpath has a single instance (single name)
  - Dynamic profile object
    - callpath can have multiple instances (created dynamically)
- Reuse event mechanisms
  - Interrogate the event stack to form event names
    - "main => f1 => f2 => MPI_Send"
- Inclusive and exclusive performance statistics
- Callpath length and callgraph depth options
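An illustrative sketch (not TAU's internal code) of how a callpath name can be formed by interrogating the event stack at event entry, in the spirit of the "main => f1 => f2 => MPI_Send" naming above; the data structure and function names are hypothetical.

    #include <string>
    #include <vector>

    // Event stack maintained by the measurement system at event entry/exit.
    static std::vector<std::string> eventStack;

    // Build the callpath name for a newly entered event by walking the stack.
    std::string callpathName(const std::string& event)
    {
      std::string name;
      for (const std::string& parent : eventStack)
        name += parent + " => ";
      return name + event;   // e.g. "main => f1 => f2 => MPI_Send"
    }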

Approach (Phase Profile)
- A phase is an execution abstraction
- Two questions
  - How to inform the measurement system about phases?
  - How to collect the performance data?
- Create a phase object when a new phase is created
  - Each phase object has a name
  - Static and dynamic phase objects
- Phase relationships
  - Phases may be nested (cannot overlap)
  - The "active" phase object follows scoping rules
  - The default (top-level) phase is the outermost event (e.g., main)

Approach (Phase Profile - API)
- Phase creation
    TAU_PHASE_CREATE_STATIC(var, name, type, group)
    TAU_PHASE_CREATE_DYNAMIC(var, name, type, group)
    TAU_GLOBAL_PHASE(var, name, type, group)
    TAU_GLOBAL_PHASE_EXTERNAL(var)
  - Global phases have global scope (accessible anywhere)
  - External declarations for phases defined outside file scope
- Phase control
    TAU_PHASE_START(var)
    TAU_PHASE_STOP(var)
    TAU_GLOBAL_PHASE_START(var)
    TAU_GLOBAL_PHASE_STOP(var)
- Collects a callgraph profile (depth 2) PER PHASE!
- Phases default to standard events (when phase profiling is disabled)
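A hedged usage sketch of the macros listed above in their C++ form; the region names, the empty type strings, and the TAU_USER group are illustrative choices, not taken from the talk.

    #include <TAU.h>

    void timestep()
    {
      // Static phases: one profile instance per phase for the whole run.
      TAU_PHASE_CREATE_STATIC(heat, "heat phase", "", TAU_USER);
      TAU_PHASE_START(heat);
      // ... heat transfer computation; events here are attributed to "heat phase"
      TAU_PHASE_STOP(heat);

      TAU_PHASE_CREATE_STATIC(stress, "stress phase", "", TAU_USER);
      TAU_PHASE_START(stress);
      // ... mechanical stress computation; attributed to "stress phase"
      TAU_PHASE_STOP(stress);
    }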

Approach (Phase Profile - Data Collection)
- Leverages performance mapping and callpath profiling
- Phase entry
  - The phase object is pushed onto the measurement (event) callstack
- Phase / event entry
  - Need to determine the (event, phase) tuple
    - traverse the callstack to find the enclosing phase
    - construct a key for the (event, phase) tuple
- Maintain a global map
  - new keys for new (event, phase) tuples are put into the global map
  - create a new profile object for every new (event, phase) tuple
  - search the global map to determine if a tuple has occurred before
- Use mapping support to store performance data on exit
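A simplified sketch of the keying scheme described above (not TAU source code): event performance is stored per (event, phase) tuple in a global map, and a new profile object is created the first time a tuple occurs. The structure and names are hypothetical.

    #include <map>
    #include <string>
    #include <utility>

    struct Profile { double inclusive = 0.0, exclusive = 0.0; long calls = 0; };

    // Global map from (event, phase) tuple to its profile object.
    static std::map<std::pair<std::string, std::string>, Profile> phaseProfiles;

    Profile& profileFor(const std::string& event, const std::string& phase)
    {
      // operator[] creates a new profile object on the first occurrence
      // of an (event, phase) tuple; later lookups reuse it.
      return phaseProfiles[{event, phase}];
    }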

Multi-Physics Example
[Figure: instrumentation of the multi-physics example; routine labels: heat(), stress(), MPIrecv(), MPIsend(), other routines; annotations: events, only two events!, phases, Instrumentation, iterate phase, heat phase, stress phase]

Implementation
- Parallel profiling in the TAU performance system
  - Flat profiling
  - Callpath and callgraph (2-level callpath) profiling
  - Phase profiling
- Multiple performance metrics
  - Execution time
  - Hardware performance counters (using PAPI)
- Scalable to tens of thousands of processors
- Profile analysis and data management tools
  - ParaProf parallel profile analyzer / visualizer
  - PerfDMF parallel profile database

Application – NAS Parallel Benchmarks
- Phase profiling can provide more refined profile results
  - Specific to phase localities
- Defining phases is an application-specific issue
  - Apply understanding of computational models
  - Unfortunately, we were not the application developers
  - How to decide on phases and phase instrumentation?
  - Informed by application documentation and code
- Look at the NAS parallel benchmark application suite
  - Identify benchmarks with phase behavior
  - SP, BT, LU (simulated CFD codes) and CG
  - Focus on BT

NAS BT – Phase Analysis
- Emulates a CFD application
  - System of linear equations
    - Implicit finite-difference discretization of Navier-Stokes
  - Solves three sets of uncoupled systems of equations
    - in the X, Y, and Z directions
  - Block tridiagonal with 5x5 blocks
  - Square number of processors
- Phase analysis
  - Highlight performance for each solution direction
  - Identified in code by three main functions
    - x_solve, y_solve, z_solve
  - Static phases

NAS BT – Instrumentation

    call TAU_PHASE_CREATE_STATIC(xsolvephase, 'x_solve phase')
    call TAU_PHASE_START(xsolvephase)
    call x_solve
    call TAU_PHASE_STOP(xsolvephase)

    call TAU_PHASE_CREATE_STATIC(ysolvephase, 'y_solve phase')
    call TAU_PHASE_START(ysolvephase)
    call y_solve
    call TAU_PHASE_STOP(ysolvephase)

    call TAU_PHASE_CREATE_STATIC(zsolvephase, 'z_solve phase')
    call TAU_PHASE_START(zsolvephase)
    call z_solve
    call TAU_PHASE_STOP(zsolvephase)

NAS BT – Flat Profile
- How is MPI_Wait() distributed relative to the solver direction?
- Application routine names reflect phase semantics

NAS BT – Phase Profile (Main and X, Y, Z)
- The main phase shows nested phases and immediate events

Application – MFIX
- Multiphase Flow with Interphase eXchanges (MFIX)
  - National Energy Technology Laboratory (NETL)
- Studies physical and chemical properties in fluid-solid systems
  - hydrodynamics, heat transfer, chemical reactions
- Characteristic of large-scale iterative simulations
  - a major loop is executed as the simulation advances in time
- Testcase
  - Models ozone decomposition in a bubbling fluidized bed
  - Flat profile
  - Iterate phase profile
  - Demonstrates dynamic phases

MFIX – Phase Instrumentation (ITERATE)

    SUBROUTINE ITERATE(IER, NIT)
      character(11) taucharary
      integer tauiteration / 0 /
      integer profiler(2) / 0, 0 /
      save profiler, tauiteration

      write (taucharary, '(a8,i3)') 'ITERATE ', tauiteration
      tauiteration = tauiteration + 1
      call TAU_PHASE_CREATE_DYNAMIC(profiler, taucharary)
      call TAU_PHASE_START(profiler)
      ! WORK
      call TAU_PHASE_STOP(profiler)
    END SUBROUTINE ITERATE
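For comparison, a hedged C++ analogue of the Fortran instrumentation above: one dynamic phase is created per call, named by iteration number, using the TAU_PHASE_CREATE_DYNAMIC macro from the API slide. The simplified signature, the empty type string, and the TAU_USER group are assumptions.

    #include <TAU.h>
    #include <cstdio>

    void iterate()
    {
      static int iteration = 0;
      char name[32];
      std::snprintf(name, sizeof(name), "ITERATE %d", iteration++);

      // One dynamic phase instance per call, so each iteration
      // gets its own phase profile.
      TAU_PHASE_CREATE_DYNAMIC(iter, name, "", TAU_USER);
      TAU_PHASE_START(iter);
      // ... WORK ...
      TAU_PHASE_STOP(iter);
    }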

MFIX – Phase Profile (MPI_Waitall)
[Figure: MPI_Waitall time per dynamic phase, one phase per iteration]
- In the 51st iteration, time spent in MPI_Waitall was … secs
- Total time spent in MPI_Waitall was … secs across all 92 iterations

MFIX – Iterate Phase Behavior

Concluding Discussion and Future Work
- Phase-based profiling can help to bridge the semantic gap
  - Computational models → performance measurements
  - Application-specific performance analysis
- Implemented phase profiling in TAU
- Demonstrated phase profiling
  - NAS BT benchmark and MFIX application
  - Also used in S3D, Uintah, Flash on large-scale platforms
- Requires application-specific knowledge
  - Might be possible to link to automatic phase identification
    - Based on memory tracing or application state change
- Can this idea be extended to global parallel phases?
- Working on better ways to present phase performance

Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science contracts
  - University of Utah ASCI Level 1 sub-contract
  - ASC/NNSA Level 3 contract
- Department of Defense (DoD)
  - HPC Modernization Office (HPCMO)
  - Programming Environment and Training (PET)
- NSF
- Research Centre Juelich
- Los Alamos National Laboratory