Slide 1: On-line Automated Performance Diagnosis on Thousands of Processors
Philip C. Roth
Future Technologies Group, Computer Science and Mathematics Division, Oak Ridge National Laboratory
Paradyn Research Group, Computer Sciences Department, University of Wisconsin-Madison
Slide 2: High Performance Computing Today
- Large parallel computing resources:
  - Tightly coupled systems (Earth Simulator, BlueGene/L, XT3)
  - Clusters (LANL Lightning, LLNL Thunder)
  - The Grid
- Large, complex applications
- ASCI Blue Mountain job sizes (2001): 512 CPUs: 17.8%; 1024 CPUs: 34.9%; 2048 CPUs: 19.9%
- Achieving only a small fraction of peak performance is the rule, not the exception
Slide 3: Achieving Good Performance
- Need to know what and where to tune
- Diagnosis and tuning tools are critical for realizing the potential of large-scale systems
- On-line automated tools are especially desirable:
  - Manual tuning is difficult: finding interesting data in a large data volume; understanding application, OS, and hardware interactions
  - Automated tools require minimal user involvement; expertise is built into the tool
  - On-line automated tools can adapt dynamically: dynamic control over data volume; useful results from a single run
- But: tools that work well in small-scale environments often don't scale
Slide 4: Barriers to Large-Scale Performance Diagnosis
[Figure: a tool front end connected directly to daemons d0 ... dP-1, each monitoring an application process a0 ... aP-1]
- Managing the performance data volume
- Communicating efficiently between distributed tool components
- Presenting data and analysis results scalably
Slide 5: Our Approach for Addressing These Scalability Barriers
- MRNet: a multicast/reduction infrastructure for scalable tools
- Distributed Performance Consultant: a strategy for efficiently finding performance bottlenecks in large-scale applications
- Sub-Graph Folding Algorithm: an algorithm for effectively presenting bottleneck diagnosis results for large-scale applications
Slide 6: Outline
- Performance Consultant
- MRNet
- Distributed Performance Consultant
- Sub-Graph Folding Algorithm
- Evaluation
- Summary
Slide 7: Performance Consultant
- Automated performance diagnosis
- Searches for application performance problems:
  - Start with global, general experiments (e.g., test CPUbound across all processes)
  - Collect performance data using dynamic instrumentation: collect only the data desired; remove the instrumentation when no longer needed
  - Make a decision about the truth of each experiment
  - Refine the search: create more specific experiments based on "true" experiments (those whose data is above a user-configurable threshold); a sketch of this loop follows below
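To make the refinement loop concrete, here is a minimal sketch, assuming a work queue of (hypothesis, focus) experiments. Every name in it (Experiment, run_experiment, refine, the threshold value) is invented for illustration; Paradyn's actual search engine and instrumentation cost model are far richer.

```cpp
// Hypothetical sketch of the Performance Consultant's refinement loop.
#include <iostream>
#include <queue>
#include <string>
#include <vector>

struct Experiment {
    std::string hypothesis;  // e.g., "CPUbound"
    std::string focus;       // e.g., "/Code" or one function/process
};

// Stub: a real tool inserts dynamic instrumentation, samples it, then
// removes it. Here we fake the measured metric value.
double run_experiment(const Experiment& e) {
    return (e.focus == "/Code" || e.focus == "/Code/Do_mult") ? 0.9 : 0.1;
}

// Stub: a real tool enumerates narrower foci (modules, functions,
// hosts, processes) under an experiment that tested true.
std::vector<Experiment> refine(const Experiment& e) {
    if (e.focus == "/Code")
        return {{e.hypothesis, "/Code/Do_row"},
                {e.hypothesis, "/Code/Do_col"},
                {e.hypothesis, "/Code/Do_mult"}};
    return {};
}

int main() {
    const double threshold = 0.2;        // user-configurable
    std::queue<Experiment> pending;
    pending.push({"CPUbound", "/Code"}); // global, general starting point

    while (!pending.empty()) {
        Experiment e = pending.front();
        pending.pop();
        // Instrumentation exists only while this experiment is active.
        if (run_experiment(e) > threshold) {      // experiment is "true"
            std::cout << e.hypothesis << " true at " << e.focus << '\n';
            for (const Experiment& child : refine(e))
                pending.push(child);              // search deeper
        }
        // False experiments are pruned: no refinement, no lingering cost.
    }
}
```

Run as-is, the sketch reports CPUbound as true for /Code and then for the narrower focus /Code/Do_mult, mirroring how "true" experiments spawn more specific ones while false ones are pruned.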
Slide 8: Performance Consultant
[Figure: tool organization with daemons on hosts c001, c002, ..., c128.cs.wisc.edu, each monitoring an application process (myapp{367}, myapp{4287}, ..., myapp{27549})]
Slide 9: Performance Consultant
[Figure: search history graph. The CPUbound hypothesis is refined to hosts c001, c002, ..., c128.cs.wisc.edu, then to processes myapp{367}, myapp{4287}, ..., myapp{27549}, and within each process to the call path main refined into Do_row, Do_col, and Do_mult]
Slide 10: Performance Consultant
[Figure: the same search history graph, with the front-end host cham.cs.wisc.edu added to the picture; the entire search runs centrally in the tool front end]
Slide 11: Outline
- Performance Consultant
- MRNet
- Distributed Performance Consultant
- Sub-Graph Folding Algorithm
- Evaluation
- Summary
Slide 12: MRNet: Multicast/Reduction Overlay Network
- Parallel tool infrastructure providing scalable multicast and scalable data synchronization and transformation
- A network of processes between the tool front end and back-ends
- Useful for parallelizing and distributing tool activities: reduces latency; reduces the computation and communication load at the tool front end (a front-end usage sketch follows below)
- Joint work with Dorian Arnold (University of Wisconsin-Madison)
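A minimal front-end sketch follows, loosely modeled on MRNet's published C++ API (Network, Communicator, Stream, and built-in filters); exact class and method signatures vary across MRNet versions, so treat this as illustrative rather than definitive.

```cpp
// Sketch of an MRNet front end: multicast a request to all daemons and
// receive their replies pre-reduced by the overlay network.
#include "mrnet/MRNet.h"
using namespace MRN;

int main() {
    // Instantiate the overlay from a topology file; MRNet launches the
    // internal processes and the tool back-ends ("tool_backend").
    Network* net = Network::CreateNetworkFE("topology.txt",
                                            "tool_backend", nullptr);

    // One stream to all back-ends: multicast downstream, and sum the
    // per-daemon values upstream with a built-in reduction filter.
    Communicator* comm = net->get_BroadcastCommunicator();
    Stream* stream = net->new_Stream(comm, TFILTER_SUM,
                                     SFILTER_WAITFORALL);

    const int TAG_SAMPLE = FirstApplicationTag;
    stream->send(TAG_SAMPLE, "%d", 1);   // ask every daemon for a sample
    stream->flush();

    PacketPtr pkt;
    int tag = 0, total = 0;
    stream->recv(&tag, pkt);             // one packet: the reduced result
    pkt->unpack("%d", &total);           // sum over all daemons

    delete net;                          // tears down the overlay
    return 0;
}
```

The design point to notice is that the sum arrives at the front end as a single packet: the TFILTER_SUM reduction runs inside the internal processes, so front-end work no longer grows linearly with the number of daemons.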
Slide 13: Typical Parallel Tool Organization
[Figure: tool front end connected directly to daemons d0 ... dP-1, one per application process a0 ... aP-1]
Slide 14: MRNet-based Parallel Tool Organization
[Figure: the same front end and daemons, now connected through a multicast/reduction network of internal processes; filters in the internal processes transform data as it flows]
Slide 15: Outline
- Performance Consultant
- MRNet
- Distributed Performance Consultant
- Sub-Graph Folding Algorithm
- Evaluation
- Summary
Slide 16: Performance Consultant: Scalability Barriers
- MRNet can alleviate the scalability problem for global performance data (e.g., CPU utilization across all processes)
- But the front end still processes all local performance data (e.g., the utilization of process 5247 on host mcr398.llnl.gov)
Slide 17: Performance Consultant
[Figure: the centralized search history graph again: every per-host and per-process refinement is created and evaluated in the front end on cham.cs.wisc.edu]
Slide 18: Distributed Performance Consultant
[Figure: the same search history graph, used as the starting point for distributing the search across the tool's components]
Slide 19: Distributed Performance Consultant: Variants
- Natural steps beyond the traditional centralized approach (CA):
  - Partially Distributed Approach (PDA): distributed local searches with a centralized global search; requires complex instrumentation management
  - Truly Distributed Approach (TDA): distributed local searches only; insight into global behavior comes from combining the local search results (e.g., using the Sub-Graph Folding Algorithm); simpler tool design than the PDA
Slide 20: Distributed Performance Consultant: PDA
[Figure: the per-host local searches run in the daemons on c001 ... c128.cs.wisc.edu, while the front end still runs the global CPUbound search across hosts]
Slide 21: Distributed Performance Consultant: TDA
[Figure: only the local searches remain, one per daemon on c001 ... c128.cs.wisc.edu; there is no centralized global search]
Slide 22: Distributed Performance Consultant: TDA
[Figure: the per-host local search results are combined by the Sub-Graph Folding Algorithm as they flow toward the front end]
Slide 23: Outline
- Paradyn and the Performance Consultant
- MRNet
- Distributed Performance Consultant
- Sub-Graph Folding Algorithm
- Evaluation
- Summary
Slide 24: Search History Graph Example
[Figure: search history graph rooted at CPUbound, refined to hosts c33 and c34.cs.wisc.edu and to processes myapp{1272}, myapp{1273}, myapp{7624}, and myapp{7625}; each process sub-graph repeats the refinement main into A, B, C, D, with an extra node E under myapp{1273}]
Slide 25: Search History Graphs
- The Search History Graph is effective for presenting search-based performance diagnosis results...
- ...but it does not scale to large numbers of processes, because it shows one sub-graph per process
Slide 26: Sub-Graph Folding Algorithm
- Combines host-specific sub-graphs into composite sub-graphs
- Each composite sub-graph represents one behavioral category among the application processes
- In effect, a dynamic clustering of processes by qualitative behavior (a folding sketch follows below)
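A minimal folding sketch, assuming each process's search results arrive as a labeled tree; the types are invented. The real SGFA additionally tracks each node's truth state, folds host names into patterns such as c*.cs.wisc.edu, and keeps processes whose sub-graphs differ qualitatively in separate composite sub-graphs rather than merging everything into one.

```cpp
// Hypothetical folding sketch: nodes with the same label at the same
// position in two trees share one composite node.
#include <map>
#include <memory>
#include <set>
#include <string>

struct Node {
    std::set<std::string> members;                         // folded processes
    std::map<std::string, std::unique_ptr<Node>> children; // keyed by label
};

// Fold one process's result tree (src) into a composite tree (dst).
void fold(Node& dst, const Node& src, const std::string& process) {
    dst.members.insert(process);
    for (const auto& [label, child] : src.children) {
        std::unique_ptr<Node>& slot = dst.children[label];
        if (!slot) slot = std::make_unique<Node>();
        fold(*slot, *child, process);
    }
}

int main() {
    // Two processes whose searches refined identically fold into one
    // composite sub-graph carrying both members.
    Node p1, p2, composite;
    p1.children["main"] = std::make_unique<Node>();
    p1.children["main"]->children["Do_mult"] = std::make_unique<Node>();
    p2.children["main"] = std::make_unique<Node>();
    p2.children["main"]->children["Do_mult"] = std::make_unique<Node>();

    fold(composite, p1, "myapp{7624}");
    fold(composite, p2, "myapp{7625}");
    // composite.children["main"]->members now holds both processes.
}
```

In the Search History Graph example above, myapp{7624}, myapp{7625}, and myapp{1272} would share one composite sub-graph, while myapp{1273}, whose sub-graph also refines to E, represents a distinct behavioral category.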
Slide 27: SGFA: Example
[Figure: the four per-process sub-graphs from the Search History Graph example fold into a composite sub-graph for myapp{*} on c*.cs.wisc.edu, with the node E that was unique to myapp{1273} marked within the composite]
Slide 28: SGFA: Implementation
- Implemented as a custom MRNet filter (a skeleton follows below)
- The filter in each MRNet process keeps a folded graph of the search results from all daemons reachable below it
- Updates are periodically sent upstream
- By induction, the filter in the front end holds the entire folded graph
- Optimization: unchanged graphs are not retransmitted
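The filter might be shaped as follows. The signature here is generic and hypothetical: MRNet's real filters are C functions loaded from a shared object, and their exact prototype (vectors of PacketPtr plus persistent filter state) differs across MRNet versions.

```cpp
// Hypothetical shape of a folding filter; FoldedGraph stands in for
// the folded search graph of the previous sketch.
#include <set>
#include <string>
#include <vector>

struct FoldedGraph {
    std::set<std::string> edges;  // serialized "parent->child" edges
    bool dirty = false;
    void merge(const FoldedGraph& g) {
        for (const std::string& e : g.edges)
            dirty |= edges.insert(e).second;  // new info marks a change
    }
};

// Invoked in every MRNet process as packets arrive from its children.
// 'state' persists across calls, so each internal process accumulates
// the folded graph for all daemons reachable below it; by induction,
// the front end's instance holds the entire folded graph.
std::vector<FoldedGraph> sgfa_filter(const std::vector<FoldedGraph>& in,
                                     FoldedGraph& state) {
    for (const FoldedGraph& g : in)
        state.merge(g);
    if (!state.dirty)
        return {};        // unchanged graph: nothing sent upstream
    state.dirty = false;
    return { state };     // periodic update toward the front end
}
```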
Slide 29: Outline
- Performance Consultant
- MRNet
- Distributed Performance Consultant
- Sub-Graph Folding Algorithm
- Evaluation
- Summary
Slide 30: DPC + SGFA: Evaluation
- Modified Paradyn to perform bottleneck searches using the CA, PDA, or TDA approach
- Modified instrumentation cost tracking to support the PDA: global and per-process instrumentation costs are tracked separately, with a simple fixed-partition policy for scheduling global and local instrumentation (a sketch follows below)
- Implemented the Sub-Graph Folding Algorithm as a custom MRNet filter to support the TDA (used by all three approaches)
- Instrumented the front end, daemons, and MRNet internal processes to collect CPU and I/O load information
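The fixed-partition policy can be sketched as a pair of budgets; all names and values here are invented for illustration, since Paradyn's actual cost tracking is more elaborate.

```cpp
// Hypothetical fixed-partition instrumentation-cost policy: the
// tolerable overhead is split into a global share and a per-process
// share, and an instrumentation request is admitted only if its
// share still has room.
struct CostTracker {
    double global_budget = 0.05;  // e.g., 5% overhead for global metrics
    double local_budget  = 0.05;  // e.g., 5% per process for local metrics
    double global_cost   = 0.0;   // predicted cost of active global instr.
    double local_cost    = 0.0;   // predicted cost for one process

    bool admit_global(double predicted) {
        if (global_cost + predicted > global_budget) return false;
        global_cost += predicted;
        return true;
    }
    bool admit_local(double predicted) {
        if (local_cost + predicted > local_budget) return false;
        local_cost += predicted;
        return true;
    }
};
```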
Slide 31: DPC + SGFA: Evaluation
- su3_rmd: a QCD pure lattice gauge theory code (C, MPI)
- Weak scaling scalability study
- LLNL MCR cluster: 1152 nodes (1048 compute nodes); two 2.4 GHz Intel Xeons per node; 4 GB memory per node; Quadrics Elan3 interconnect (fat tree); Lustre parallel file system
Slide 32: DPC + SGFA: Evaluation
- PDA and TDA: bottleneck searches with up to 1024 processes so far (limited by partition size)
- CA: reaches its scalability limit below 64 processes
- All three approaches produce qualitatively similar results
Slides 33-41: DPC: Evaluation
[Figures: nine slides of evaluation charts; only the title "DPC: Evaluation" survives extraction]
Slide 42: SGFA: Evaluation
[Figure: SGFA evaluation chart; no textual content survives extraction]
Slide 43: Summary
- Tool scalability is critical for the effective use of large-scale computing resources
- On-line automated performance tools are especially important at large scale
- Our approach: MRNet, the Distributed Performance Consultant (TDA), and the Sub-Graph Folding Algorithm
Slide 44: References
- P.C. Roth, D.C. Arnold, and B.P. Miller, "MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools," SC 2003, Phoenix, Arizona, November 2003.
- P.C. Roth and B.P. Miller, "The Distributed Performance Consultant and the Sub-Graph Folding Algorithm: On-line Automated Performance Diagnosis on Thousands of Processes," in submission.
- Publications available from http://www.paradyn.org
- MRNet software available from http://www.paradyn.org/mrnet