Published by Bartholomew Peters. Modified over 9 years ago.
An Open Framework for Scalable, Reconfigurable Performance Analysis

Todd Gamblin ‡, Prasun Ratn *†, Bronis R. de Supinski *, Martin Schulz *, Frank Mueller †, Robert J. Fowler ‡, Daniel Reed ‡
* Lawrence Livermore National Laboratory, † North Carolina State University, ‡ Renaissance Computing Institute

Lawrence Livermore National Laboratory, Center for Applied Scientific Computing
NC State University, Department of Computer Science

This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. This work was also supported in part by NSF grants CNS-0410203, CCF-0429653 and CAREER CCR-0237570, as well as the SciDAC Performance Engineering Research Institute, DE-FC02-06ER25764.
UCRL-POST-236200

[Figure captions: Histograms detect imbalances. Variable sizes capture variance.]

ScalaTrace: Reconfigurable, Scalable Performance Analysis

Size of machines is rapidly increasing (130,000+ processors)
Tools will be overwhelmed with data
Need scalable, online measurement and analysis

[Panel titles and diagram labels: Future Directions; Large Scale MPI Applications; Implicit Program Behavior; Interactive User Tools and Visualization; Understanding of Program Behavior and Optimizations; Feedback, Steering & Global Configuration; Data Collection & Analysis]

[Diagram labels, reduction tree: … MPI tasks; User Interaction & Storage; Reduction Nodes; Local Configuration; Data Input: from Tracer, Profiler, Child Nodes; Data Output: to Parent Node, Tool; Communication paths]

Record format: OPS, Attr.1, Attr.2, Attr.N, …
Extensible, self-describing format
Attribute examples: PRSDs, timing, …

Types of tool components
Reducer
Transformer
Annotator
Analyzer

[Diagram labels, local node instantiation and global reduction: Compressor (PRSDs, Timing); Loop Detection; Annotator; records of the form MPI, PRSD, Time, Loop and MPI, PRSD, Time; Loop Information; Time; PRSD]

Trace representation using Power Regular Section Descriptors (PRSDs)
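The PRSD trace representation named above is not spelled out on the poster. As a rough illustration of the underlying idea only, the sketch below folds immediately repeated runs of MPI events into (count, body) records, so that a loop of identical iterations occupies constant space. It handles a single loop level; real PRSDs describe nested loops recursively. The function name `compress_loops` and the `max_body` parameter are hypothetical and do not come from ScalaTrace.

```python
def compress_loops(events, max_body=4):
    """Fold immediately repeated subsequences into (count, body) records.

    Illustrative single-level sketch, not ScalaTrace's algorithm.
    `max_body` (hypothetical) bounds the loop body length that is searched.
    """
    out = []
    i = 0
    n = len(events)
    while i < n:
        best = None  # (repetitions, body size) covering the most events
        for size in range(1, max_body + 1):
            body = events[i:i + size]
            if len(body) < size:
                break  # not enough events left for a body of this size
            reps = 1
            while events[i + reps * size:i + (reps + 1) * size] == body:
                reps += 1
            if reps > 1 and (best is None or reps * size > best[0] * best[1]):
                best = (reps, size)
        if best:
            reps, size = best
            out.append((reps, tuple(events[i:i + size])))
            i += reps * size
        else:
            # No repetition starts here: emit the event as a one-shot record.
            out.append((1, (events[i],)))
            i += 1
    return out
```

For a trace made of three send/recv iterations followed by a barrier, this yields two records regardless of how many iterations the loop ran, which is the property that keeps trace size near-constant for regular codes.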
Idea: preserve time in compressed traces
Encode time deltas instead of timestamps
Create delta histograms automatically
Dynamically balance histograms

Time depends on path taken
Distinguish histograms by path

Description of problem
[Figure captions: Bins generated for synthetic input span entire range with similar sample counts. Number of histograms per record depends on number of possible call paths. Sample bimodal distribution from UMT2K collectives.]

ScalaReplay: Replay Using Histogram Timing Annotations

Replay accuracy (NAS Benchmarks and UMT2K)
[Plot panels: IS, LU, FT, IS, FT]
Accurate replay: DT, EP, FT, LU, IS, UMT2K
Replay inaccurate in MPI time: CG, MG
Replay inaccurate in compute time: BT

Trace sizes (NAS Benchmarks and UMT2K)
[Plot panels: MG, BT]
Near-constant trace size: DT, EP, LU
Sub-linear trace size: CG, MG, FT
Non-scalable trace size: BT, IS, UMT2K

Evolutionary Load-Balance Analysis with Scalable Data Collection

Idea: Normalize measurements and models based on application semantics.

Progress loops
Typically outer loops in SPMD codes
Indicate absolute progress towards some domain-specific goal
Basis for comparison of load over time

Effort loops
Variable-time loops, represent load
Data-dependent execution

Load Modeling Framework
[Diagram labels: Visualization; Effort Region Analysis; Progress Event Annotator; MPI PRSD Compressor; Trace merge; Wavelet Compression; MPI PRSD Progress; Local Effort Records; PRSDs; MPI PRSD; MPI Tracer; Load-annotated Trace]

Progress instrumentation
User marks progress loop explicitly

Effort modeled with code regions
Dynamically detected at runtime
Split MPI-op trace at collectives and wait operations
User can further divide code into phases with instrumentation

Example application
Models dislocation dynamics in crystals
Dislocations discretized as nodes & arms
Recursive spatial domain decomposition
Balancer subdivides nodes/arms along x, then y, then z

Compress...
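The time-preservation idea earlier in this section (encode deltas rather than timestamps, keep them in balanced histograms, and keep one histogram per call path) can be sketched as follows. This is a minimal offline illustration, assuming equal-count binning as the balancing rule; the poster describes dynamic balancing at runtime, which is not reproduced here. All function names and the (path, delta) sample layout are hypothetical.

```python
from collections import defaultdict


def time_deltas(timestamps):
    """Convert absolute timestamps into deltas between consecutive events."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]


def balanced_histogram(deltas, num_bins):
    """Build bins holding similar sample counts (equal-count binning).

    Returns a list of (min_delta, max_delta, count) tuples. Bin widths
    vary with the data, so dense regions get narrow bins.
    Assumption: offline construction over all samples at once.
    """
    ordered = sorted(deltas)
    n = len(ordered)
    bins = []
    for k in range(num_bins):
        chunk = ordered[k * n // num_bins:(k + 1) * n // num_bins]
        if chunk:
            bins.append((chunk[0], chunk[-1], len(chunk)))
    return bins


def histograms_by_path(samples, num_bins):
    """Keep a separate histogram per call path.

    `samples` is an iterable of (path, delta) pairs; `path` is any
    hashable identifier of the route taken to the event (hypothetical).
    """
    grouped = defaultdict(list)
    for path, delta in samples:
        grouped[path].append(delta)
    return {p: balanced_histogram(d, num_bins) for p, d in grouped.items()}
```

Because each bin stores its own range, the bins span the whole observed range while holding similar sample counts, which matches the behavior the figure captions describe for synthetic input.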
Scalable Compression Using Wavelet Transform
Distributed per-process effort measurements are consolidated onto fewer processes
Parallel wavelet transform
Transformed data is locally compressed by thresholding, run-length and entropy coding
Compressed representation is merged in parallel
Load measurements are reconstructed on client
Visualization tool allows intuitive data browsing

Compress
Near-constant low-overhead MPI traces

Annotate with additional reconfigurable data, e.g.
Time using adaptive histograms (left)
Progress rates, load imbalance (right)

[Plots: Compression Ratio; Merge Time; Error]
[Figure: Force Computation Effort for 64-node ParaDiS Run, exact and reconstructed. Axes: Process Id, Progress. Labels: Checkpoint, Re-partition]
[Figure: Load Balance in ParaDiS. Axes: x, y, z]

Flexible framework for application-specific tools
Ability to select compression schemes for different fields, data types
Foster combination, interaction between data collection, analysis mechanisms
Adaptive, self-tuning runtime systems
Near term: Adaptive wavelet transform topology, equivalence class detection
Full time-sequence compression
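The transform, thresholding and run-length steps of the wavelet compression pipeline listed earlier in this section can be illustrated with a small sequential sketch. It assumes a 1-D Haar wavelet and a fixed threshold; the poster does not say which wavelet is used, the real transform runs in parallel, and entropy coding is omitted. All function names are hypothetical.

```python
def haar_forward(values):
    """Full Haar decomposition; len(values) must be a power of two."""
    data = list(values)
    n = len(data)
    while n > 1:
        half = n // 2
        avgs = [(data[2 * i] + data[2 * i + 1]) / 2 for i in range(half)]
        diffs = [(data[2 * i] - data[2 * i + 1]) / 2 for i in range(half)]
        data[:n] = avgs + diffs  # averages first, detail coefficients after
        n = half
    return data


def haar_inverse(coeffs):
    """Invert haar_forward, rebuilding the signal level by level."""
    data = list(coeffs)
    n = 1
    while n < len(data):
        merged = []
        for a, d in zip(data[:n], data[n:2 * n]):
            merged.extend([a + d, a - d])
        data[:2 * n] = merged
        n *= 2
    return data


def threshold(coeffs, eps):
    """Zero out coefficients smaller in magnitude than eps (lossy step)."""
    return [c if abs(c) >= eps else 0.0 for c in coeffs]


def run_length_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, count) for v, count in runs]
```

Smooth or piecewise-constant effort data produces many near-zero detail coefficients, so thresholding followed by run-length coding shrinks it sharply; the client side would decode the runs and apply `haar_inverse` to reconstruct approximate load measurements.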