Phase-Based Parallel Performance Profiling

Allen D. Malony, Sameer Shende, Alan Morris
{malony,sameer,amorris}
Department of Computer and Information Science
Performance Research Laboratory
NeuroInformatics Center
University of Oregon

ParCo 2005

Outline of Talk
- Motivation
  - Models in parallel scientific applications
  - Phases and performance mapping
- Problem description
  - Motivating example
- Profiling techniques
  - Flat, callpath, phase profiling
- Approach and implementation
- Applications
- Future work and concluding remarks

Motivation
- Scientific applications are designed based on models
  - Computational: structural, logical, numerical models, …
  - Correctness: execution order, data consistency, …
  - Performance: expected, factors, parallelism/scalability, …
- Computational models form the developer’s “mental” model
  - How the program is intended to behave and perform
- Want to relate performance model to computation model
  - View performance data with respect to “mental” model
  - Better identify problems and guide tuning decisions
- Must link computational abstractions to performance
  - Bridge semantic gap: measurements ↔ “mental” model

Computational Models
- Structural models
  - Program organization and code relationships
  - Language used, layout of application parts, …
  - Constructed generally, unfolds during execution
- Logical and numerical models
  - Capture algorithmic characteristics of the application
  - “Semantic” properties of the computation
    - correct flow of operation and assertions on application state
- Numerical models
  - Algorithms for simulating physical phenomena
  - Accuracy properties from numerical calculations
  - Structural and logical models implicit

Performance Mapping
- General problem of linking performance to computation
- Performance mapping (Irvin and Miller, ’96; Shende, ’01)
  - Associate (map) measured performance data
  - To higher-level, semantic representations
  - Those with model significance to the user
- What is the difficulty of making the association?
  - Depends on performance information
    - performance events/state visible from instrumentation
    - what performance data can be measured
  - How the performance information is used in mapping
- Difficulty in how performance information is presented
  - Model-based views (LeBlanc et al., ’90)

Phases and Performance Mapping
- Would like to support the association between model and data
- Concept of “phases” is common in scientific applications
  - How developers think about structure, logic, numerics
  - How performance can be interpreted (Worley, ’92)
- Worthwhile to consider support for phases
  - In performance measurement
  - Bridge semantic gap in parallel performance mapping?
    - tracing has long demonstrated the benefits! (Heath, ’91)
    - phase-based analysis and interpretation
- Main contribution
  - Support for phases in parallel performance profiling

Problem Description
- Performance measured as a consequence of events
  - Events represent actions that occur during execution
  - Events of interest determine performance information
- Events have semantics and context (pragmatics)
- Semantics
  - Defines what the event represents
  - Example: subroutine entry
- Context
  - Properties of the state in which event occurred
  - Example: subroutine’s calling parent
- Interrogate context to map event performance data

Motivating Example – Multi-Physics Application
[Figure: application routines heat(), stress(), MPIrecv(), MPIsend(), other routines]
- Assembly of physical objects
  - Different shapes
  - Different materials
- Calculate physics
  - Heat transfer
  - Mechanical stress
  - Within / between objects
- Iterate to error tolerance
- How is performance attributed?
  - Between events (e.g., routines) and execution components
  - With respect to computational objects (e.g., data objects)

Context and Standard Profiling
- Flat profiles
  - Context is whole program (i.e., program code)
  - Performance distribution across (static) program structure
  - Cannot differentiate dynamics (e.g., callpath or objects)
- Callgraph / callpath profiles
  - Identify parent-child calling relationships at execution
  - Context is calling (event) parent / calling (event) path
  - Extend event semantics to encode context
    - create new event with callpath name
    - requires dynamic event creation for complex callpaths
    - burdens event mechanisms for context identification
    - simple performance associations require many events

Context and Phase Profiling
- View the program execution as a collection of phases
  - Transition between phases (sequenced, nested)
    - easiest to think of as phase hierarchy (or phase graph)
  - Phases are not events
    - phase boundaries can mark entry/exit events
- Context is the current phase
  - How do we know what phase we are in?
  - Phases are identified separately from events
    - phases are not encoded in event names
    - event mechanisms are not overloaded
- A phase profile is event performance attributed to phases
  - Phase-specific performance profiles (flat or callpath)

Approach (Flat Profile)
- Create a profile object for each entry/exit event
  - Each profile object has a name
  - Static profile object (static event)
    - event has a single instance (single name)
  - Dynamic profile object (dynamic event)
    - event can have multiple instances (created dynamically)
  - Inclusive and exclusive performance statistics
  - Must maintain an event stack (or callstack)
- Contexts are generally thought of as code locations
- Dynamic events do allow for dynamic context awareness
  - User code can check “state” and create new events
  - BUT only see one level of event!
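
The event-stack bookkeeping behind inclusive and exclusive statistics can be sketched in a few lines. This is a toy model under stated assumptions, not TAU's implementation: the class `FlatProfiler`, its methods, and the explicit timestamps are all hypothetical, and the routine names are borrowed from the multi-physics example.

```python
from collections import defaultdict


class FlatProfiler:
    """Toy flat profiler (hypothetical): one profile object per event name."""

    def __init__(self):
        # event name -> accumulated statistics
        self.profiles = defaultdict(
            lambda: {"calls": 0, "inclusive": 0.0, "exclusive": 0.0}
        )
        # event stack entries: [name, start_time, time_spent_in_children]
        self.stack = []

    def enter(self, name, now):
        self.stack.append([name, now, 0.0])

    def exit(self, name, now):
        top_name, start, child_time = self.stack.pop()
        assert top_name == name, "entry/exit events must nest"
        elapsed = now - start
        profile = self.profiles[name]
        profile["calls"] += 1
        profile["inclusive"] += elapsed               # includes children
        profile["exclusive"] += elapsed - child_time  # excludes children
        if self.stack:
            # Charge this event's elapsed time to the parent's child total.
            self.stack[-1][2] += elapsed


prof = FlatProfiler()
prof.enter("main", 0.0)
prof.enter("heat", 1.0)
prof.enter("MPI_Send", 2.0)
prof.exit("MPI_Send", 3.0)
prof.exit("heat", 5.0)
prof.enter("stress", 5.0)
prof.enter("MPI_Send", 6.0)
prof.exit("MPI_Send", 8.0)
prof.exit("stress", 9.0)
prof.exit("main", 10.0)
```

In this run the single `MPI_Send` profile object merges the time spent under `heat` and under `stress`, which is the "only one level of event" limitation: the flat profile cannot say how the communication time divides between the two.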

Approach (Callpath Profile)
- Show event calling (nesting) relationships
- Create a profile object for each event calling context
  - Each profile object has a name that encodes the callpath
  - Static profile object
    - callpath has a single instance (single name)
  - Dynamic profile object
    - callpath can have multiple instances (created dynamically)
- Reuse event mechanisms
  - Interrogate the event stack to form event names
    - “main => f1 => f2 => MPI_Send”
  - Inclusive and exclusive performance statistics
- Callpath length and callgraph depth options
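
Forming a callpath event name from the event stack, with an optional limit on callpath length, might look like the following. The function `callpath_name` and its `depth` parameter are illustrative assumptions; only the " => " naming style comes from the slide.

```python
def callpath_name(event_stack, depth=None):
    """Join the innermost `depth` events on the stack into a callpath name.

    `event_stack` lists event names from outermost to innermost.
    `depth=None` keeps the full path (hypothetical convention).
    """
    if depth is not None and depth < 1:
        raise ValueError("depth must be at least 1")
    path = event_stack if depth is None else event_stack[-depth:]
    return " => ".join(path)


stack = ["main", "f1", "f2", "MPI_Send"]
full_path = callpath_name(stack)            # whole calling context
two_level = callpath_name(stack, depth=2)   # parent and child only
```

Each distinct name becomes its own profile object, which is why deep or varied call chains multiply the number of events the measurement system has to manage.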

Approach (Phase Profile)
- A phase is an execution abstraction
- Two questions
  - How to inform the measurement system about phases?
  - How to collect the performance data?
- Create a phase object when a new phase is created
  - Each phase object has a name
  - Static and dynamic phase objects
- Phase relationships
  - Phases may be nested (cannot overlap)
  - “Active” phase object follows scoping rules
  - Default (top-level) phase is outermost event (e.g., main)
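
The nesting and scoping rules for phases can be illustrated with a small stack of open phases. This is only a sketch of the rules as stated on the slide; `PhaseStack` and its methods are hypothetical names, not part of TAU.

```python
class PhaseStack:
    """Toy tracker for nested phases (illustrative, not TAU's data structure)."""

    def __init__(self, default_phase="main"):
        # The outermost event acts as the default (top-level) phase.
        self._active = [default_phase]

    @property
    def current(self):
        """The active phase is the innermost phase that is still open."""
        return self._active[-1]

    def start(self, name):
        self._active.append(name)

    def stop(self, name):
        # Phases may nest but not overlap: only the innermost phase can stop,
        # and the default phase is never stopped.
        if len(self._active) == 1 or self._active[-1] != name:
            raise RuntimeError(
                f"cannot stop {name!r}: active phase is {self.current!r}"
            )
        self._active.pop()


phases = PhaseStack()
phases.start("iterate")
phases.start("heat")
inner = phases.current    # innermost open phase
phases.stop("heat")
outer = phases.current    # back in the enclosing phase
phases.stop("iterate")
top = phases.current      # default phase again
```

Stopping `iterate` while `heat` is still open would raise an error here, which is one way to enforce the "cannot overlap" rule.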

Approach (Phase Profile - API)
- Phase creation
    TAU_PHASE_CREATE_STATIC(var, name, type, group)
    TAU_PHASE_CREATE_DYNAMIC(var, name, type, group)
    TAU_GLOBAL_PHASE(var, name, type, group)
    TAU_GLOBAL_PHASE_EXTERNAL(var)
  - Global phases have global scope (accessible anywhere)
  - External declarations for phases defined outside file scope
- Phase control
    TAU_PHASE_START(var)
    TAU_PHASE_STOP(var)
    TAU_GLOBAL_PHASE_START(var)
    TAU_GLOBAL_PHASE_STOP(var)
- Collects a callgraph profile (depth 2) PER PHASE!
- Phases default to standard events (when phase profiling is disabled)

Approach (Phase Profile - Data Collection)
- Leverages performance mapping and callpath profiling
- Phase entry
  - Phase object pushed to measurement (event) callstack
- Phase / event entry
  - Need to determine (event, phase) tuple
    - traverse callstack to find enclosing phase
    - construct key for (event, phase) tuple
  - Maintain global map
    - new keys for new (event, phase) tuples put into global map
    - create new profile object for every (event, phase) tuple
    - search global map to determine if tuple occurred before
- Use mapping support to store performance data on exit
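
The data-collection steps above can be sketched as follows: on entry, walk the callstack to find the enclosing phase, build the (event, phase) key, and look it up in a global map; on exit, store the measurement in the mapped profile object. This is a simplified model under assumptions, not TAU's code: `PhaseProfiler`, the `is_phase` flag, and the explicit timestamps are hypothetical, and only inclusive time is tracked.

```python
class PhaseProfiler:
    """Toy phase profiler keyed by (event, phase) tuples (illustrative only)."""

    def __init__(self, default_phase="main"):
        self.default_phase = default_phase
        self.callstack = []   # entries: (name, is_phase, start_time, key)
        self.profiles = {}    # global map: (event, phase) -> statistics

    def _enclosing_phase(self):
        # Traverse the callstack from the top to find the nearest phase.
        for name, is_phase, _start, _key in reversed(self.callstack):
            if is_phase:
                return name
        return self.default_phase

    def enter(self, name, now, is_phase=False):
        key = (name, self._enclosing_phase())
        if key not in self.profiles:
            # First occurrence of this tuple: create its profile object.
            self.profiles[key] = {"calls": 0, "inclusive": 0.0}
        self.callstack.append((name, is_phase, now, key))

    def exit(self, now):
        _name, _is_phase, start, key = self.callstack.pop()
        profile = self.profiles[key]
        profile["calls"] += 1
        profile["inclusive"] += now - start


prof = PhaseProfiler()
prof.enter("iterate", 0.0, is_phase=True)
prof.enter("heat", 1.0, is_phase=True)
prof.enter("MPI_Send", 2.0)
prof.exit(3.0)            # MPI_Send inside the heat phase
prof.exit(5.0)            # heat
prof.enter("stress", 5.0, is_phase=True)
prof.enter("MPI_Send", 6.0)
prof.exit(8.0)            # MPI_Send inside the stress phase
prof.exit(9.0)            # stress
prof.exit(10.0)           # iterate
```

Unlike a flat profile, `MPI_Send` now has two profile objects, keyed `("MPI_Send", "heat")` and `("MPI_Send", "stress")`, while the event name itself is unchanged. A nested phase is recorded against the phase that encloses it, for example `("heat", "iterate")`.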

Multi-Physics Example
[Figure: instrumentation of the multi-physics example. Events: heat(), stress(), MPIrecv(), MPIsend(), other routines (“only two events!”). Phases: iterate phase, heat phase, stress phase.]

Implementation
- Parallel profiling in the TAU performance system
  - Flat profiling
  - Callpath and callgraph (2-level callpath) profiling
  - Phase profiling
- Multiple performance metrics
  - Execution time
  - Hardware performance counters (using PAPI)
- Scalable to tens of thousands of processors
- Profile analysis and data management tools
  - ParaProf parallel profile analyzer / visualizer
  - PerfDMF parallel profile database

Application – NAS Parallel Benchmarks
- Phase profiling can provide more refined profile results
  - Specific to phase localities
- Defining phases is an application-specific issue
  - Apply understanding of computational models
  - Unfortunately, we were not the application developers
  - How to decide on phases and phase instrumentation?
    - Informed by application documentation and code
- Look at NAS parallel benchmark application suite
  - Identify benchmarks with phase behavior
    - SP, BT, LU (simulated CFD codes) and CG
  - Focus on BT

NAS BT – Phase Analysis
- Emulates a CFD application
  - System of linear equations
  - Implicit finite-difference discretization of Navier-Stokes
  - Solve three sets of uncoupled systems of equations
    - in X, Y, Z directions
    - Block tridiagonal with 5x5 blocks
  - Square number of processors
- Phase analysis
  - Highlight performance for each solution direction
  - Identified in code by three main functions
    - x_solve, y_solve, z_solve
  - Static phases

NAS BT – Instrumentation

    ! Phase handles (xsolvephase, ysolvephase, zsolvephase) are assumed to be
    ! declared and saved elsewhere; the declarations are not shown on the slide.

    ! Solve in the X direction inside its own static phase
    call TAU_PHASE_CREATE_STATIC(xsolvephase,'x_solve phase')
    call TAU_PHASE_START(xsolvephase)
    call x_solve
    call TAU_PHASE_STOP(xsolvephase)

    ! Solve in the Y direction
    call TAU_PHASE_CREATE_STATIC(ysolvephase,'y_solve phase')
    call TAU_PHASE_START(ysolvephase)
    call y_solve
    call TAU_PHASE_STOP(ysolvephase)

    ! Solve in the Z direction
    call TAU_PHASE_CREATE_STATIC(zsolvephase,'z_solve phase')
    call TAU_PHASE_START(zsolvephase)
    call z_solve
    call TAU_PHASE_STOP(zsolvephase)

NAS BT – Flat Profile
- How is MPI_Wait() distributed relative to solver direction?
- Application routine names reflect phase semantics

NAS BT – Phase Profile (Main and X, Y, Z)
- Main phase shows nested phases and immediate events

Application – MFIX
- Multiphase Flow with Interphase eXchanges (MFIX)
  - National Energy Technology Laboratory (NETL)
  - Study physical/chemical properties in fluid-solid systems
    - hydrodynamics, heat transfer, chemical reactions
- Characteristic of large-scale iterative simulations
  - major loop executed as simulation advances in time
- Test case
  - Models ozone decomposition in a bubbling fluidized bed
- Flat profile
- Iterate phase profile
  - Demonstrate dynamic phases

MFIX – Phase Instrumentation (ITERATE)

    SUBROUTINE ITERATE(IER, NIT)
      character(11) taucharary
      integer tauiteration / 0 /
      integer profiler(2) / 0, 0 /
      save profiler, tauiteration

      ! Build a unique phase name for this call, e.g. 'ITERATE   0'
      write (taucharary, '(a8,i3)') 'ITERATE ', tauiteration
      tauiteration = tauiteration + 1

      ! A dynamic phase gets a new name (and profile) on every call
      call TAU_PHASE_CREATE_DYNAMIC(profiler,taucharary)
      call TAU_PHASE_START(profiler)
      ! WORK
      call TAU_PHASE_STOP(profiler)
    END SUBROUTINE ITERATE
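
To see which phase names the ITERATE instrumentation generates, the Fortran format `(a8,i3)` can be mimicked in a few lines. This is a rough analogue for illustration only: `phase_name` is a hypothetical helper, and Python's width specifier does not reproduce Fortran's overflow behavior for counters wider than three digits.

```python
def phase_name(iteration):
    """Mimic the Fortran format (a8,i3): 8-character label, 3-wide integer."""
    return f"{'ITERATE ':8s}{iteration:3d}"


# The counter starts at 0 and is incremented after the name is written,
# so the first three calls produce phases numbered 0, 1 and 2.
names = [phase_name(i) for i in range(3)]
```

Each name is 11 characters long, matching the `character(11)` buffer, and each distinct name yields a separate dynamic phase in the profile.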

MFIX – Phase Profile (MPI_Waitall)
- Dynamic phases, one per iteration
- In the 51st iteration, time spent in MPI_Waitall was 85.81 secs
- Total time spent in MPI_Waitall was 4137.9 secs across all 92 iterations
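
A quick calculation puts the 51st iteration in context. It assumes the per-iteration figure and the total are measured on the same basis (for example, the same process or the same aggregate), which the slide does not state; the variable names are illustrative.

```python
total_waitall_secs = 4137.9   # MPI_Waitall time over all iterations (from the slide)
iterations = 92               # number of dynamic phases (from the slide)
iter51_secs = 85.81           # MPI_Waitall time in the 51st iteration (from the slide)

mean_per_iteration = total_waitall_secs / iterations
ratio_to_mean = iter51_secs / mean_per_iteration

print(f"mean per iteration: {mean_per_iteration:.2f} s")
print(f"iteration 51 vs. mean: {ratio_to_mean:.2f}x")
```

Under that assumption the mean is roughly 45 seconds per iteration, so the 51st iteration spends about 1.9 times the average in MPI_Waitall. A flat profile reports only the total and cannot expose this kind of per-iteration variation.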

MFIX Iterate Phase Behavior

Concluding Discussion and Future Work
- Phase-based profiling can help to bridge semantic gap
  - Computational models ↔ performance measurements
  - Application-specific performance analysis
- Implemented phase profiling in TAU
- Demonstrated phase profiling
  - NAS BT benchmark and MFIX application
  - Also used in S3D, Uintah, Flash on large-scale platforms
- Requires application-specific knowledge
  - Might be possible to link to auto phase identification
    - Based on memory tracing or application state change
- Can this idea be extended to global parallel phases?
- Working on better ways to present phase performance

Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science contracts
  - University of Utah ASCI Level 1 sub-contract
  - ASC/NNSA Level 3 contract
- Department of Defense (DoD)
  - HPC Modernization Office (HPCMO)
  - Programming Environment and Training (PET)
- NSF
- Research Centre Juelich
- Los Alamos National Laboratory

