Phase-Based Parallel Performance Profiling

Allen D. Malony, Sameer Shende, Alan Morris
{malony,sameer,amorris}
Department of Computer and Information Science
Performance Research Laboratory
NeuroInformatics Center
University of Oregon

ParCo 2005

Outline of Talk
- Motivation
  - Models in parallel scientific applications
  - Phases and performance mapping
- Problem description
  - Motivating example
- Profiling techniques
  - Flat, callpath, phase profiling
- Approach and implementation
- Applications
- Future work and concluding remarks

Motivation
- Scientific applications are designed based on models
  - Computational: structural, logical, numerical models, …
  - Correctness: execution order, data consistency, …
  - Performance: expected, factors, parallelism/scalability, …
- Computational models form the developer’s “mental” model
  - How the program is intended to behave and perform
- Want to relate performance model to computation model
  - View performance data with respect to “mental” model
  - Better identify problems and guide tuning decisions
- Must link computational abstractions to performance
  - Bridge the semantic gap between measurements and the “mental” model

Computational Models
- Structural models
  - Program organization and code relationships
  - Language used, layout of application parts, …
  - Constructed generally and unfolds during execution
- Logical and numerical models
  - Capture algorithmic characteristics of the application
  - “Semantic” properties of the computation
    - correct flow of operation and assertions on application state
- Numerical models
  - Algorithms for simulating physical phenomena
  - Accuracy properties from numerical calculations
- Structural and logical models implicit

Performance Mapping
- General problem of linking performance to computation
- Performance mapping (Irvin and Miller, ‘96; Shende, ‘01)
  - Associate (map) measured performance data
  - To higher level, semantic representations
  - Those with model significance to the user
- What is the difficulty of making the association?
  - Depends on performance information
    - performance events/state visible from instrumentation
    - what performance data can be measured
  - How the performance information is used in mapping
- Difficulty in how performance information is presented
  - Model-based views (LeBlanc et al., ‘90)

Phases and Performance Mapping
- Like to support the association between model and data
- Concept of “phases” is common in scientific applications
  - How developers think about structure, logic, numerics
  - How performance can be interpreted (Worley, ‘92)
- Worthwhile to consider support for phases
  - In performance measurement
  - Bridge semantic gap in parallel performance mapping?
    - tracing has long demonstrated the benefits! (Heath, ‘91)
    - phase-based analysis and interpretation
- Main contribution
  - Support for phases in parallel performance profiling

Problem Description
- Performance measured as a consequence of events
  - Events represent actions that occur during execution
  - Events of interest determine performance information
- Events have semantics and context (pragmatics)
- Semantics
  - Defines what the event represents
  - Example: subroutine entry
- Context
  - Properties of the state in which event occurred
  - Example: subroutine’s calling parent
- Interrogate context to map event performance data

Motivating Example – Multi-Physics Application
[Figure labels: heat(), stress(), MPIrecv(), MPIsend(), other routines]
- Assembly of physical objects
  - Different shapes
  - Different materials
- Calculate physics
  - Heat transfer
  - Mechanical stress
  - Within / between objects
- Iterate to error tolerance
- How is performance attributed?
  - Between events (e.g., routines) and execution components
  - With respect to computational objects (e.g., data objects)

Context and Standard Profiling
- Flat profiles
  - Context is whole program (i.e., program code)
  - Performance distribution across (static) program structure
  - Cannot differentiate dynamics (e.g., callpath or objects)
- Callgraph / callpath profiles
  - Identify parent-child calling relationships at execution
  - Context is calling (event) parent / calling (event) path
  - Extend event semantics to encode context
    - create new event with callpath name
    - requires dynamic event creation for complex callpaths
    - burdens event mechanisms for context identification
    - simple performance associations require many events

Context and Phase Profiling
- View the program execution as collection of phases
  - Transition between phases (sequenced, nested)
    - easiest to think of as phase hierarchy (or phase graph)
  - Phases are not events
    - phase boundaries can mark entry/exit events
- Context is the current phase
  - How do we know what phase we are in?
  - Phases are identified separately from events
    - phases are not encoded in event names
    - event mechanisms are not overloaded
- A phase profile is event performance attributed to phases
  - Phase-specific performance profiles (flat or callpath)

Approach (Flat Profile)
- Create a profile object for each entry/exit event
  - Each profile object has a name
  - Static profile object (static event)
    - event has a single instance (single name)
  - Dynamic profile object (dynamic event)
    - event can have multiple instances (created dynamically)
- Inclusive and exclusive performance statistics
  - Must maintain an event stack (or callstack)
- Contexts are generally thought of as code locations
- Dynamic events do allow for dynamic context awareness
  - User code can check “state” and create new events
  - BUT only see one level of event!
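
The inclusive/exclusive bookkeeping on this slide can be sketched with a small event stack. This is a toy model, not TAU's implementation: the class name `FlatProfiler`, the explicit timestamps passed to `enter`/`exit`, and the timings in `demo` are all assumptions chosen to keep the example deterministic.

```python
from collections import defaultdict


class FlatProfiler:
    """Toy flat profiler: one profile entry per event name (hypothetical)."""

    def __init__(self):
        self.inclusive = defaultdict(float)
        self.exclusive = defaultdict(float)
        # Each frame: [event name, start time, time spent in child events]
        self._stack = []

    def enter(self, name, now):
        self._stack.append([name, now, 0.0])

    def exit(self, name, now):
        top_name, start, child_time = self._stack.pop()
        if top_name != name:
            raise ValueError(f"exit of {name!r} does not match {top_name!r}")
        elapsed = now - start
        self.inclusive[name] += elapsed
        # Exclusive time excludes everything measured in nested events.
        self.exclusive[name] += elapsed - child_time
        if self._stack:
            self._stack[-1][2] += elapsed


def demo():
    """Made-up timings: MPI_Send is called from both heat and stress."""
    prof = FlatProfiler()
    prof.enter("main", 0.0)
    prof.enter("heat", 1.0)
    prof.enter("MPI_Send", 2.0)
    prof.exit("MPI_Send", 3.0)
    prof.exit("heat", 5.0)
    prof.enter("stress", 5.0)
    prof.enter("MPI_Send", 6.0)
    prof.exit("MPI_Send", 8.0)
    prof.exit("stress", 9.0)
    prof.exit("main", 10.0)
    return prof
```

In `demo`, the two `MPI_Send` calls collapse into a single entry, so the flat profile cannot say how much of that time belongs to `heat` versus `stress`. That is the "one level of event" limitation noted above.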

Approach (Callpath Profile)
- Show event calling (nesting) relationships
- Create a profile object for each event calling context
  - Each profile object has a name that encodes the callpath
  - Static profile object
    - callpath has a single instance (single name)
  - Dynamic profile object
    - callpath can have multiple instances (created dynamically)
- Reuse event mechanisms
  - Interrogate the event stack to form event names
    - “main => f1 => f2 => MPI_Send”
- Inclusive and exclusive performance statistics
- Callpath length and callgraph depth options
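
As a rough illustration of forming event names from the event stack, the sketch below joins the innermost events into a callpath name and uses that name as the key for a dynamically created profile object. The function names, the `depth` parameter, and the dictionary layout are assumptions for illustration only, not TAU's interface.

```python
def callpath_name(event_stack, depth=None):
    """Join the innermost `depth` events of the stack into a callpath name.

    depth=None keeps the full path; depth=1 degenerates to a flat profile.
    """
    if not event_stack:
        raise ValueError("event stack is empty")
    if depth is not None and depth < 1:
        raise ValueError("depth must be at least 1")
    path = event_stack if depth is None else event_stack[-depth:]
    return " => ".join(path)


def profile_for(profiles, event_stack, depth=None):
    """Look up, or create on first use, the profile object for a callpath."""
    name = callpath_name(event_stack, depth)
    return profiles.setdefault(name, {"calls": 0})
```

With a depth of 2, the stacks `main, f1, f2, MPI_Send` and `main, g, f2, MPI_Send` map to the same name, `f2 => MPI_Send`. Longer depths separate them at the cost of more profile objects.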

Approach (Phase Profile)
- A phase is an execution abstraction
- Two questions
  - How to inform the measurement systems about phases?
  - How to collect the performance data?
- Create a phase object when new phase is created
  - Each phase object has a name
  - Static and dynamic phase objects
- Phase relationships
  - Phases may be nested (cannot overlap)
  - “Active” phase object follows scoping rules
  - Default (top-level) phase is outermost event (e.g., main)
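
The nesting and scoping rules above can be modeled with a stack of phase names. This is a minimal sketch under the assumption that phases must stop in the reverse order they started; `PhaseStack` and its methods are hypothetical names, not part of TAU.

```python
class PhaseStack:
    """Toy model of nested, non-overlapping phases (hypothetical)."""

    def __init__(self, default="main"):
        # The default (top-level) phase is always present.
        self._active = [default]

    @property
    def active(self):
        """The innermost phase that has started but not yet stopped."""
        return self._active[-1]

    def start(self, name):
        self._active.append(name)

    def stop(self, name):
        if len(self._active) == 1:
            raise RuntimeError("cannot stop the default phase")
        if self._active[-1] != name:
            # Stopping an outer phase first would make phases overlap.
            raise RuntimeError(
                f"phase {name!r} is not the active phase {self._active[-1]!r}"
            )
        self._active.pop()
```

Starting `iterate` and then `heat` makes `heat` the active phase. Stopping `iterate` before `heat` is rejected, because it would create overlapping rather than nested phases.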

Approach (Phase Profile - API)
- Phase creation
  - TAU_PHASE_CREATE_STATIC(var, name, type, group)
  - TAU_PHASE_CREATE_DYNAMIC(var, name, type, group)
  - TAU_GLOBAL_PHASE(var, name, type, group)
  - TAU_GLOBAL_PHASE_EXTERNAL(var)
  - Global phases have global scope (accessible anywhere)
  - External declarations for defined phases outside file scope
- Phase control
  - TAU_PHASE_START(var)
  - TAU_PHASE_STOP(var)
  - TAU_GLOBAL_PHASE_START(var)
  - TAU_GLOBAL_PHASE_STOP(var)
- Collects a callgraph profile (depth 2) PER PHASE!
- Phases default to standard events when phase profiling is disabled

Approach (Phase Profile - Data Collection)
- Leverages performance mapping and callpath profiling
- Phase entry
  - Phase object pushed to measurement (event) callstack
- Phase / event entry
  - Need to determine (event, phase) tuple
    - traverse callstack to find enclosing phase
    - construct key for (event, phase) tuple
  - Maintain global map
    - new keys for new (event, phase) tuples put into global map
    - create new profile object for every (event, phase) tuple
    - search global map to determine if tuple occurred before
- Use mapping support to store performance data on exit
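
The data-collection steps above can be sketched as follows. This is an illustrative model only: the class `PhaseProfiler`, the `is_phase` flag, the explicit timestamps, and the choice to record only inclusive time are assumptions, and the timings in `demo` are made up.

```python
class PhaseProfiler:
    """Toy (event, phase) profile lookup; not TAU's implementation."""

    def __init__(self, default_phase="main"):
        # Each frame: (name, is_phase, map key, start time).
        self._stack = [(default_phase, True, None, 0.0)]
        # Global map: (event, phase) tuple -> inclusive time.
        self.profiles = {}

    def _enclosing_phase(self):
        # Traverse the callstack from the top to find the enclosing phase.
        for name, is_phase, _key, _start in reversed(self._stack):
            if is_phase:
                return name
        raise RuntimeError("no enclosing phase on the stack")

    def enter(self, name, now, is_phase=False):
        key = (name, self._enclosing_phase())
        # Create a profile object only if this tuple has not occurred before.
        self.profiles.setdefault(key, 0.0)
        self._stack.append((name, is_phase, key, now))

    def exit(self, name, now):
        if len(self._stack) == 1:
            raise RuntimeError("cannot exit the default phase")
        top, _is_phase, key, start = self._stack[-1]
        if top != name:
            raise ValueError(f"exit of {name!r} does not match {top!r}")
        self._stack.pop()
        # Store the measured data against the (event, phase) key on exit.
        self.profiles[key] += now - start


def demo():
    p = PhaseProfiler()
    p.enter("iterate", 0.0, is_phase=True)
    p.enter("heat", 1.0, is_phase=True)
    p.enter("MPI_Send", 2.0)
    p.exit("MPI_Send", 3.0)
    p.exit("heat", 5.0)
    p.enter("stress", 5.0, is_phase=True)
    p.enter("MPI_Send", 6.0)
    p.exit("MPI_Send", 8.0)
    p.exit("stress", 9.0)
    p.exit("iterate", 10.0)
    return p.profiles
```

In `demo`, `MPI_Send` ends up with two separate entries, `("MPI_Send", "heat")` and `("MPI_Send", "stress")`, without the phase being encoded in the event name itself.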

Multi-Physics Example
[Figure labels: heat(), stress(), MPIrecv(), MPIsend(), other routines; events (only two events!); phases; instrumentation: iterate phase, heat phase, stress phase]

Implementation
- Parallel profiling in the TAU performance system
  - Flat profiling
  - Callpath and callgraph (2-level callpath) profiling
  - Phase profiling
- Multiple performance metrics
  - Execution time
  - Hardware performance counters (using PAPI)
- Scalable to tens of thousands of processors
- Profile analysis and data management tools
  - ParaProf parallel profile analyzer / visualizer
  - PerfDMF parallel profile database

Application – NAS Parallel Benchmarks
- Phase profiling can provide more refined profile results
  - Specific to phase localities
- Defining phases is an application-specific issue
  - Apply understanding of computational models
- Unfortunately, we were not the application developers
  - How to decide on phases and phase instrumentation?
  - Informed by application documentation and code
- Look at NAS parallel benchmark application suite
  - Identify benchmarks with phase behavior
  - SP, BT, LU (simulated CFD codes) and CG
  - Focus on BT

NAS BT – Phase Analysis
- Emulates a CFD application
  - System of linear equations
  - Implicit finite-difference discretization of Navier-Stokes
- Solve three sets of uncoupled systems of equations in X, Y, Z directions
  - Block tridiagonal with 5x5 blocks
  - Square number of processors
- Phase analysis
  - Highlight performance for each solution direction
  - Identified in code by three main functions
    - x_solve, y_solve, z_solve
  - Static phases

NAS BT – Instrumentation

    ! One static phase per solver direction
    call TAU_PHASE_CREATE_STATIC(xsolvephase, 'x_solve phase')
    call TAU_PHASE_START(xsolvephase)
    call x_solve
    call TAU_PHASE_STOP(xsolvephase)

    call TAU_PHASE_CREATE_STATIC(ysolvephase, 'y_solve phase')
    call TAU_PHASE_START(ysolvephase)
    call y_solve
    call TAU_PHASE_STOP(ysolvephase)

    call TAU_PHASE_CREATE_STATIC(zsolvephase, 'z_solve phase')
    call TAU_PHASE_START(zsolvephase)
    call z_solve
    call TAU_PHASE_STOP(zsolvephase)

NAS BT – Flat Profile
- How is MPI_Wait() distributed relative to solver direction?
- Application routine names reflect phase semantics

NAS BT – Phase Profile (Main and X, Y, Z)
- Main phase shows nested phases and immediate events

Application – MFIX
- Multiphase Flow with Interphase eXchanges (MFIX)
  - National Energy Technology Laboratory (NETL)
- Study physical/chemistry properties in fluid-solid systems
  - hydrodynamics, heat transfer, chemical reactions
- Characteristic of large-scale iterative simulations
  - major loop executed as simulation advances in time
- Testcase
  - Models ozone decomposition in a bubbling fluidized bed
  - Flat profile
  - Iterate phase profile
  - Demonstrate dynamic phases

MFIX – Phase Instrumentation (ITERATE)

    SUBROUTINE ITERATE(IER, NIT)
      character(11) taucharary
      integer tauiteration / 0 /
      integer profiler(2) / 0, 0 /
      save profiler, tauiteration

      ! Build a distinct phase name for this call, e.g. 'ITERATE   0'
      write (taucharary, '(a8,i3)') 'ITERATE ', tauiteration
      tauiteration = tauiteration + 1

      ! A dynamic phase is created on every call, one per iteration
      call TAU_PHASE_CREATE_DYNAMIC(profiler, taucharary)
      call TAU_PHASE_START(profiler)
      ! WORK
      call TAU_PHASE_STOP(profiler)
    END SUBROUTINE ITERATE

MFIX – Phase Profile (MPI_Waitall)
- In the 51st iteration, time spent in MPI_Waitall was 85.81 secs
- Total time spent in MPI_Waitall was 4137.9 secs across all 92 iterations
- Dynamic phases, one per iteration
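
To make the relationship between per-iteration dynamic phases and the overall total concrete, the sketch below builds phase names in the same layout as the Fortran format `(a8,i3)` used in the ITERATE instrumentation and sums one event's time over all phases. The helper names and the nested-dictionary layout are assumptions for illustration; no measured values from the talk are reproduced.

```python
def dynamic_phase_name(iteration):
    """Mimic Fortran '(a8,i3)': 8-character label, then a width-3 integer."""
    return f"{'ITERATE ':<8}{iteration:3d}"


def total_over_phases(phase_profiles, event):
    """Sum one event's time across all phases.

    phase_profiles maps phase name -> {event name -> seconds} (hypothetical).
    """
    return sum(events.get(event, 0.0) for events in phase_profiles.values())


def example_profiles():
    """Three dynamic phases with made-up MPI_Waitall times."""
    times = [1.5, 2.0, 0.5]
    return {
        dynamic_phase_name(i): {"MPI_Waitall": t} for i, t in enumerate(times)
    }
```

Each per-phase entry answers "how long in this iteration?", while the sum over all dynamic phases recovers the figure a flat profile would report for the same event.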

MFIX Iterate Phase Behavior

Concluding Discussion and Future Work
- Phase-based profiling can help to bridge the semantic gap
  - Between computational models and performance measurements
  - Application-specific performance analysis
- Implemented phase profiling in TAU
- Demonstrated phase profiling
  - NAS BT benchmark and MFIX application
  - Also used in S3D, Uintah, Flash on large-scale platforms
- Requires application-specific knowledge
  - Might be possible to link to auto phase identification
  - Based on memory tracing or application state change
- Can this idea be extended to global parallel phases?
- Working on better ways to present phase performance

Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science contracts
  - University of Utah ASCI Level 1 sub-contract
  - ASC/NNSA Level 3 contract
- Department of Defense (DoD)
  - HPC Modernization Office (HPCMO)
  - Programming Environment and Training (PET)
- NSF
- Research Centre Juelich
- Los Alamos National Laboratory