1
Stack Trace Analysis for Large Scale Debugging using MRNet
UCRL-PRES
Stack Trace Analysis for Large Scale Debugging using MRNet
Dorian C. Arnold, Barton P. Miller (University of Wisconsin)
Dong Ahn, Bronis R. de Supinski, Gregory L. Lee, Martin Schulz (Lawrence Livermore National Laboratory)
2
Scaling Tools
Machine sizes are increasing:
- New clusters are close to or above 10,000 cores
- Blue Gene/L: over 131,000 cores
It is not only applications that need to scale, but also the support environment and tools.
Challenges:
- Data collection, storage, and analysis
- Scalable process management and control
- Visualization
3
Debugging on BlueGene/L
TotalView on BG/L with 4,096 processes:

Operation | Latency
Single step | ~15-20 secs.
Breakpoint insertion | ~30 secs.
Stack trace sampling | ~120 secs.

We work with teams like the TotalView developers to reduce these latencies, but a typical debug session includes many such interactions, and 4,096 processes is only about 3% of BG/L (over 131,000 cores).
4
Scalability Limitations
- Large volumes of debug data
- Single frontend for all node connections
- Centralized data analysis
- Vendor licensing limitations
Approach: a scalable, lightweight debugger
- Reduce the exploration space to a small subset
- Online aggregation using a TBŌN
- Full-featured debugger for deeper digging
5
Outline
- Case study: CCSM
- STAT Approach
  - Concept of Stack Traces
  - Identification of Equivalence Classes
- Implementation
  - Using Tree-based Overlay Networks
  - Data and Work Flow in STAT
- Evaluation
- Conclusions
6
Case Study: CCSM
Community Climate System Model (CCSM):
- Used to make climate predictions
- Coupled models for atmosphere, ocean, sea ice, and land surface
Implementation:
- Multiple Program Multiple Data (MPMD) model
- MPI-based application
- Distinct components for each model, executed concurrently
- Typically requires a significant node count: several hundred tasks
7
Observations
The application intermittently hangs with 472 tasks:
- Non-deterministic, and only at large scale
- Appears at seemingly random code locations
- Hard to reproduce: 2 hangs over the next 10 days (~50 runs)
Current approach:
- Attach to the job using TotalView
- Collect stack traces from all 472 tasks
- Visualize the cross-node callgraph
8
CCSM Callgraph (figure)
9
Lessons Learned
- Some bugs only occur at large scale
- Non-deterministic and hard to reproduce
- Stack traces can provide useful insight
- Many bugs are temporal in nature
Need tools that:
- Combine spatial and temporal observations
- Discover application behavior
- Run effectively at scale
10
STAT Approach
Sample application stack traces:
- Across time and space
- Through a third-party interface, using a DynInst-based daemon
Merge/analyze traces:
- Discover equivalent process behavior
- Group similar processes
- Facilitate scalable analysis and data presentation
Leverage the TBŌN model (MRNet):
- Communicate traces back to a frontend
- Merge on the fly within MRNet filters
Trace counts and frequencies yield a representative application profile, and SPMD codes compress especially well under this merging; a sketch of the grouping idea follows below.
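As a rough illustration of the grouping step (a sketch only, with made-up traces and task ranks rather than anything STAT's DynInst-based daemons actually collect), the snippet below bins MPI tasks into equivalence classes by treating tasks with identical stack traces as behaviorally equivalent:

```cpp
// Hypothetical sketch: group tasks by identical sampled stack traces.
#include <iostream>
#include <map>
#include <string>
#include <vector>

using StackTrace = std::vector<std::string>;  // frames, outermost first

int main() {
    // Pretend these traces came back from four tasks of an MPI job.
    std::map<int, StackTrace> traces = {
        {0, {"main", "MPI_Waitall"}},
        {1, {"main", "MPI_Waitall"}},
        {2, {"main", "compute", "do_step"}},
        {3, {"main", "MPI_Waitall"}},
    };

    // Tasks with identical traces form one behavioral equivalence class.
    std::map<StackTrace, std::vector<int>> classes;
    for (const auto& [rank, trace] : traces)
        classes[trace].push_back(rank);

    for (const auto& [trace, ranks] : classes) {
        std::cout << ranks.size() << " task(s) at: ";
        for (const auto& frame : trace) std::cout << frame << " > ";
        std::cout << "\n";
    }
}
```

For an SPMD code where most tasks sit in the same few call paths, the number of classes stays small even when the task count is large, which is what makes the aggregated view scalable.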
11
Singleton Stack Trace
(figure: a single application process and its sampled stack trace)
12
Merging Stack Traces
Multiple traces over space or time:
- Taken independently
- Stored in a graph representation
Create a call graph prefix tree:
- Only merge nodes with an identical stack backtrace
- Retains context information
Advantages:
- Compressed representation
- Scalable visualization
- Scalable analysis
A sketch of this prefix-tree merge follows below.
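To make the merge concrete, here is a minimal C++ sketch of a call graph prefix tree. The Node layout and merge routine are illustrative assumptions, not STAT's actual data structures: each node records which task ranks reached it, and two traces share a node only when their whole backtrace prefix, i.e. the calling context, is identical.

```cpp
// Hypothetical sketch of a call graph prefix tree built from stack traces.
#include <map>
#include <memory>
#include <set>
#include <string>
#include <vector>

struct Node {
    std::set<int> ranks;                                    // tasks whose trace passes through this frame
    std::map<std::string, std::unique_ptr<Node>> children;  // callees, keyed by frame name
};

// Merge one task's trace into the tree; nodes are shared only when the
// entire backtrace prefix (the calling context) is identical.
void merge(Node& root, const std::vector<std::string>& trace, int rank) {
    Node* cur = &root;
    for (const std::string& frame : trace) {
        auto& child = cur->children[frame];
        if (!child) child = std::make_unique<Node>();
        child->ranks.insert(rank);
        cur = child.get();
    }
}

int main() {
    Node root;
    merge(root, {"main", "MPI_Waitall"}, 0);
    merge(root, {"main", "MPI_Waitall"}, 1);
    merge(root, {"main", "compute"},     2);
    // root now has one child "main" (ranks {0,1,2}), which in turn has
    // children "MPI_Waitall" (ranks {0,1}) and "compute" (rank {2}).
}
```

Because identical prefixes are stored only once, the tree is a compressed yet context-preserving summary of all traces.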
13
Merging Stack Traces (figure)
14
2D-Trace/Space Analysis
(figure: several application processes, each contributing one stack trace sample)
Single sample, multiple processes:
- Loosely synchronized distributed snapshot
- Inconclusive information
15
Prefix Tree vs. DAG
(figure: the same traces shown as TotalView's DAG and as STAT's prefix tree)
- Equivalence classes are color coded
- STAT produces prefix trees; TotalView shows a DAG
- Without temporal information we cannot tell whether the application is actually stuck
16
2D-Trace/Time Analysis
(figure: one application process sampled repeatedly over time)
Multiple samples, single process:
- Tracks process behavior over time
- Useful for a sequential application
- Not a scalable approach for a parallel program
17
Time & Space Analysis
Both 2D techniques are insufficient on their own:
- Spatial aggregation misses the temporal component
- Temporal aggregation misses the parallel aspects
Multiple samples, multiple processes:
- Track global program behavior over time
- Merge into a single, 3D prefix tree
Challenges:
- Scalable data representation
- Scalable analysis
- Scalable and useful visualization/results
A sketch of the 3D merge appears below.
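Continuing the earlier sketch, one way to picture the 3D merge is to widen each node's annotation from a set of task ranks to a set of (sample, rank) pairs. The Node3D name and layout below are hypothetical, not STAT's actual representation.

```cpp
// Hypothetical sketch: prefix tree annotated with (sample index, task rank) pairs.
#include <map>
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Node3D {
    std::set<std::pair<int, int>> hits;                       // (sample index, task rank) pairs
    std::map<std::string, std::unique_ptr<Node3D>> children;  // callees by frame name
};

void merge(Node3D& root, const std::vector<std::string>& trace, int sample, int rank) {
    Node3D* cur = &root;
    for (const std::string& frame : trace) {
        auto& child = cur->children[frame];
        if (!child) child = std::make_unique<Node3D>();
        child->hits.insert({sample, rank});
        cur = child.get();
    }
}

int main() {
    Node3D root;
    for (int sample = 0; sample < 3; ++sample)   // e.g. three snapshots of the same two tasks
        for (int rank = 0; rank < 2; ++rank)
            merge(root, {"main", "MPI_Waitall"}, sample, rank);
}
```

With that annotation, a node that keeps accumulating the same ranks sample after sample is a natural candidate for "tasks possibly stuck here".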
18
3D-Trace/Space/Time Analysis
4 Nodes / 10 Snapshots
(figure: traces from 4 nodes across 10 snapshots merged into one prefix tree)
19
3D-Trace/Space/Time Analysis
288 Nodes / 10 Snapshots
(figure: the merged tree at this scale)
At this size the merged view itself raises a scalability problem; possible solutions include pruning and level-of-detail rendering.
20
Implementation Details
Communication through MRNet:
- Single data stream from BE to FE
- Filters implement the tree merge
- Tree depth can be configured
Three major components:
- Backend (BE) daemons gathering traces
- Communication processes merging prefix trees
- Frontend (FE) tool storing the final graph
Final result saved as a GML or DOT file:
- Node classes are color coded
- Viewed with external visualization tools
A sketch of DOT output is shown below.
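As an illustration of the output stage only (STAT's real GML/DOT writer and its color scheme are not reproduced here), the sketch below walks a merged tree and emits Graphviz DOT text, coloring each node by a crude class derived from how many tasks reached it:

```cpp
// Hypothetical sketch: dump a merged prefix tree as Graphviz DOT.
#include <cstdio>
#include <map>
#include <memory>
#include <set>
#include <string>

struct Node {
    std::set<int> ranks;                                    // tasks that reached this frame
    std::map<std::string, std::unique_ptr<Node>> children;  // callees by frame name
};

static const char* kColors[] = {"lightblue", "lightgreen", "salmon", "gold", "plum"};
static int next_id = 0;

// Recursively print one node and its out-edges in DOT syntax.
void emit_dot(const Node& n, const std::string& label, int id) {
    std::printf("  n%d [label=\"%s\\n%zu task(s)\", style=filled, fillcolor=\"%s\"];\n",
                id, label.c_str(), n.ranks.size(), kColors[n.ranks.size() % 5]);
    for (const auto& [frame, child] : n.children) {
        int cid = ++next_id;
        std::printf("  n%d -> n%d;\n", id, cid);
        emit_dot(*child, frame, cid);
    }
}

int main() {
    Node root;
    root.ranks = {0, 1, 2};
    auto& wait = root.children["MPI_Waitall"];
    wait = std::make_unique<Node>();
    wait->ranks = {0, 1};

    std::printf("digraph stat {\n");
    emit_dot(root, "main", 0);
    std::printf("}\n");
}
```

The resulting text can then be rendered with standard tools such as dot or an external graph viewer.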
21
Work and Data Flow
(figure: the STAT frontend (FE) sits atop an MRNet tree of communication processes (CP) running the tree-merge filter; STAT tool daemons (BE) on Node 1 through Node N collect traces, annotated with counts and frequencies, from the MPI application processes; an example topology specification follows below)
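For context on how the FE/CP/BE tree is wired up: MRNet instantiates its communication processes from a topology specification. The sketch below shows the general parent => children style of such a file as I understand it from the MRNet documentation; treat the exact syntax, host names, and ranks as assumptions to be checked against your MRNet version.

```
# Hypothetical 1-2-4 topology: the frontend host fans out to two
# communication processes, each of which serves two backend daemons.
fe-host:0   => cp-host1:1 cp-host2:2 ;
cp-host1:1  => node1:3 node2:4 ;
cp-host2:2  => node3:5 node4:6 ;
```

Deepening the tree (the configurable tree depth mentioned above) is then a matter of adding more levels of parent => children rules.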
22
STAT Performance
Testbed: 1024-node, 4-way cluster; 1.4 GHz Itanium2; Quadrics QsNetII interconnect
(figure: time in seconds versus number of processors, up to 3,844 processors)
23
Conclusions
Scaling tools poses challenges:
- Data management and process control
- New strategies for tools are needed
STAT: Scalable Stack Trace Analysis
- Lightweight tool to identify process classes
- Based on merged callgraph prefix trees
- Aggregation in time and space
- Orthogonal to full-featured debuggers
Implementation based on TBŌNs:
- Scalable data collection and aggregation
- Enables significant speedup
24
More Information
- Paper published at IPDPS: "Stack Trace Analysis for Large Scale Debugging," D. Arnold, D.H. Ahn, B.R. de Supinski, G. Lee, B.P. Miller, and M. Schulz
- Project website & demo tomorrow
- TBŌN computing papers & the open-source prototype, MRNet, are available from the project website