
Slide 1: On-line Automated Performance Diagnosis on Thousands of Processors
Philip C. Roth
Future Technologies Group, Computer Science and Mathematics Division, Oak Ridge National Laboratory
Paradyn Research Group, Computer Sciences Department, University of Wisconsin-Madison

Slide 2: High Performance Computing Today
- Large parallel computing resources
  - Tightly coupled systems (Earth Simulator, BlueGene/L, XT3)
  - Clusters (LANL Lightning, LLNL Thunder)
  - Grid
- Large, complex applications
  - ASCI Blue Mountain job sizes (2001): 512 CPUs 17.8%, 1024 CPUs 34.9%, 2048 CPUs 19.9%
- A small fraction of peak performance is the rule

Slide 3: Achieving Good Performance
- Need to know what and where to tune
- Diagnosis and tuning tools are critical for realizing the potential of large-scale systems
- On-line automated tools are especially desirable
  - Manual tuning is difficult: finding interesting data in a large data volume; understanding application, OS, and hardware interactions
  - Automated tools require minimal user involvement; expertise is built into the tool
  - On-line automated tools can adapt dynamically: dynamic control over data volume; useful results from a single run
- But: tools that work well in small-scale environments often don't scale

Slide 4: Barriers to Large-Scale Performance Diagnosis
[Diagram: tool front-end connected directly to tool daemons d0 ... dP-1, one per application process a0 ... aP-1]
- Managing performance data volume
- Communicating efficiently between distributed tool components
- Making scalable presentation of data and analysis results

Slide 5: Our Approach for Addressing These Scalability Barriers
- MRNet: multicast/reduction infrastructure for scalable tools
- Distributed Performance Consultant: strategy for efficiently finding performance bottlenecks in large-scale applications
- Sub-Graph Folding Algorithm: algorithm for effectively presenting bottleneck diagnosis results for large-scale applications

Slide 6: Outline
- Performance Consultant
- MRNet
- Distributed Performance Consultant
- Sub-Graph Folding Algorithm
- Evaluation
- Summary

Slide 7: Performance Consultant
- Automated performance diagnosis
- Search for application performance problems
  - Start with global, general experiments (e.g., test CPUbound across all processes)
- Collect performance data using dynamic instrumentation
  - Collect only the data desired
  - Remove the instrumentation when no longer needed
- Make decisions about the truth of each experiment
- Refine the search: create more specific experiments based on "true" experiments (those whose data is above a user-configurable threshold)
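
The refinement loop on this slide can be read as a work-list search over (hypothesis, focus) experiments: measure, compare against the threshold, and refine "true" experiments into more specific ones. Below is a minimal, self-contained C++ sketch of that loop; the Experiment/measure/refine names and the canned measurements are illustrative stand-ins, not Paradyn's implementation.

```cpp
#include <deque>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// One experiment: a hypothesis (e.g., "CPUbound") tested at a focus
// (e.g., the whole program, one function, one process).
struct Experiment {
    std::string hypothesis;
    std::string focus;
};

// Stand-ins for dynamic instrumentation: canned measurements and a
// fixed refinement tree, purely for illustration.
static const std::map<std::string, double> observed = {
    {"/Code", 0.92}, {"/Code/main", 0.90},
    {"/Code/main/Do_mult", 0.85}, {"/Code/main/Do_row", 0.10}};

double measure(const Experiment& e) {
    auto it = observed.find(e.focus);
    return it == observed.end() ? 0.0 : it->second;
}

std::vector<Experiment> refine(const Experiment& e) {
    static const std::map<std::string, std::vector<std::string>> children = {
        {"/Code", {"/Code/main"}},
        {"/Code/main", {"/Code/main/Do_mult", "/Code/main/Do_row"}}};
    std::vector<Experiment> out;
    auto it = children.find(e.focus);
    if (it != children.end())
        for (const std::string& f : it->second)
            out.push_back({e.hypothesis, f});
    return out;
}

int main() {
    const double threshold = 0.20;                           // user-configurable
    std::deque<Experiment> worklist{{"CPUbound", "/Code"}};  // global, general start
    while (!worklist.empty()) {
        Experiment e = worklist.front();
        worklist.pop_front();
        if (measure(e) > threshold) {          // experiment tested "true"
            std::cout << e.hypothesis << " true at " << e.focus << "\n";
            for (const Experiment& c : refine(e))  // create more specific experiments
                worklist.push_back(c);
        }
    }
}
```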

Slide 8: Performance Consultant
[Diagram: application processes myapp{367}, myapp{4287}, and myapp{27549} running on hosts c001.cs.wisc.edu, c002.cs.wisc.edu, ..., c128.cs.wisc.edu]

Slide 9: Performance Consultant
[Diagram: a search tree rooted at CPUbound, refined to hosts c001.cs.wisc.edu, c002.cs.wisc.edu, ..., c128.cs.wisc.edu, then to processes myapp{367}, myapp{4287}, and myapp{27549}, and within each process to main and its callees Do_row, Do_col, and Do_mult]

Slide 10: Performance Consultant
[Diagram: the same search tree, now also including host cham.cs.wisc.edu]

Slide 11: Outline
- Performance Consultant
- MRNet
- Distributed Performance Consultant
- Sub-Graph Folding Algorithm
- Evaluation
- Summary

Slide 12: MRNet: Multicast/Reduction Overlay Network
- Parallel tool infrastructure providing:
  - Scalable multicast
  - Scalable data synchronization and transformation
- Network of processes between tool front-end and back-ends
- Useful for parallelizing and distributing tool activities
  - Reduce latency
  - Reduce computation and communication load at the tool front-end
- Joint work with Dorian Arnold (University of Wisconsin-Madison)
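
A rough way to see the front-end load argument: with P daemons reporting directly, the front-end combines P packets per update; with an overlay tree of fanout k, every process (front-end included) combines at most k. The plain C++ sketch below simulates a max reduction up such a tree; it illustrates the idea only and does not use the MRNet API.

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

// Reduce one tree level: each internal process combines the partial
// results of up to `fanout` children into a single upstream packet.
std::vector<double> reduce_level(const std::vector<double>& level, int fanout) {
    std::vector<double> parents;
    for (size_t i = 0; i < level.size(); i += fanout) {
        size_t end = std::min(level.size(), i + fanout);
        parents.push_back(*std::max_element(level.begin() + i, level.begin() + end));
    }
    return parents;
}

int main() {
    const int fanout = 16;
    std::vector<double> values(1024);  // one metric value per daemon
    for (size_t i = 0; i < values.size(); ++i)
        values[i] = static_cast<double>(i % 97);  // arbitrary sample data

    // Walk the reduction up the overlay tree until one value reaches the
    // front-end; no process ever combines more than `fanout` packets.
    std::vector<double> level = values;
    while (level.size() > 1)
        level = reduce_level(level, fanout);
    std::cout << "max = " << level[0] << "\n";  // prints 96
}
```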

Slide 13: Typical Parallel Tool Organization
[Diagram: tool front-end connected directly to tool daemons d0 ... dP-1, each attached to an application process a0 ... aP-1]

Slide 14: MRNet-based Parallel Tool Organization
[Diagram: the same front-end, daemons d0 ... dP-1, and application processes a0 ... aP-1, now connected through a multicast/reduction network of internal processes, each running filters]

Slide 15: Outline
- Performance Consultant
- MRNet
- Distributed Performance Consultant
- Sub-Graph Folding Algorithm
- Evaluation
- Summary

Slide 16: Performance Consultant: Scalability Barriers
- MRNet can alleviate the scalability problem for global performance data (e.g., CPU utilization across all processes)
- But the front-end still processes local performance data (e.g., utilization of process 5247 on host mcr398.llnl.gov)

Slide 17: Performance Consultant
[Diagram: the search tree from Slide 10, repeated; the entire search, including every per-host sub-search, runs in the tool front-end]

Slide 18: Distributed Performance Consultant
[Diagram: the same search tree; the per-host sub-searches are candidates for distribution away from the front-end]

Slide 19: Distributed Performance Consultant: Variants
- Natural steps from the traditional centralized approach (CA)
- Partially Distributed Approach (PDA)
  - Distributed local searches, centralized global search
  - Requires complex instrumentation management
- Truly Distributed Approach (TDA)
  - Distributed local searches only
  - Insight into global behavior comes from combining local search results (e.g., using the Sub-Graph Folding Algorithm)
  - Simpler tool design than PDA

Slide 20: Distributed Performance Consultant: PDA
[Diagram: the search tree partitioned for PDA; the centralized global search remains in the front-end while each per-host local search runs in that host's daemon]

Slide 21: Distributed Performance Consultant: TDA
[Diagram: under TDA only the per-host local searches remain, each running in its host's daemon; there is no centralized global search]

Slide 22: Distributed Performance Consultant: TDA
[Diagram: the per-host local search results are combined by the Sub-Graph Folding Algorithm]

Slide 23: Outline
- Paradyn and the Performance Consultant
- MRNet
- Distributed Performance Consultant
- Sub-Graph Folding Algorithm
- Evaluation
- Summary

Slide 24: Search History Graph Example
[Diagram: a search history graph rooted at CPUbound, refined to hosts c33.cs.wisc.edu and c34.cs.wisc.edu, then to processes myapp{1272}, myapp{1273}, myapp{7624}, and myapp{7625}, and within each process to main and its callees A, B, C, D (and E in myapp{1273})]

Slide 25: Search History Graphs
- The Search History Graph is effective for presenting search-based performance diagnosis results...
- ...but it does not scale to a large number of processes, because it shows one sub-graph per process

Slide 26: Sub-Graph Folding Algorithm
- Combines host-specific sub-graphs into composite sub-graphs
- Each composite sub-graph represents a behavioral category among application processes
- Dynamic clustering of processes by qualitative behavior
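
In spirit, folding walks same-labeled nodes of per-process sub-graphs in lock-step and records, on each composite node, which processes contributed it; processes whose sub-graphs diverge show up as nodes with different member sets. The following is a minimal C++ sketch under assumed data structures, not the paper's actual representation:

```cpp
#include <iostream>
#include <map>
#include <memory>
#include <set>
#include <string>

// A node in one process's search sub-graph: a label (e.g., "main")
// plus its refined children.
struct Node {
    std::string label;
    std::map<std::string, std::unique_ptr<Node>> children;
};

// A node in the folded (composite) graph: same shape, plus the set of
// processes whose sub-graphs contributed this node.
struct Folded {
    std::string label;
    std::set<std::string> members;
    std::map<std::string, std::unique_ptr<Folded>> children;
};

// Merge one process's sub-graph into the composite graph by walking
// both trees in lock-step and unioning member sets at matching labels.
void fold(const Node& in, Folded& out, const std::string& process) {
    out.members.insert(process);
    for (const auto& [label, child] : in.children) {
        auto& slot = out.children[label];
        if (!slot) slot = std::make_unique<Folded>(Folded{label, {}, {}});
        fold(*child, *slot, process);
    }
}

void print(const Folded& f, int depth = 0) {
    std::cout << std::string(2 * depth, ' ') << f.label
              << "  [" << f.members.size() << " processes]\n";
    for (const auto& [l, c] : f.children) print(*c, depth + 1);
}

int main() {
    // Two processes agree on main -> Do_mult; one also refined Do_row.
    Node p1{"main", {}};
    p1.children["Do_mult"] = std::make_unique<Node>(Node{"Do_mult", {}});
    Node p2{"main", {}};
    p2.children["Do_mult"] = std::make_unique<Node>(Node{"Do_mult", {}});
    p2.children["Do_row"]  = std::make_unique<Node>(Node{"Do_row", {}});

    Folded root{"main", {}, {}};
    fold(p1, root, "myapp{367}");
    fold(p2, root, "myapp{4287}");
    print(root);  // shared nodes list both processes; Do_row lists one
}
```

The real algorithm keeps separate composites for qualitatively different processes; this sketch simply unions everything to show the member-set bookkeeping that makes wildcard labels like myapp{*} possible.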

Slide 27: SGFA: Example
[Diagram: the four per-process sub-graphs from Slide 24 fold into a composite sub-graph over hosts c*.cs.wisc.edu and processes myapp{*}; nodes D and E mark where process behavior differed]

Slide 28: SGFA: Implementation
- Custom MRNet filter
- The filter in each MRNet process keeps a folded graph of search results from all reachable daemons
- Updates are periodically sent upstream
- By induction, the filter in the front-end holds the entire folded graph
- Optimization for unchanged graphs
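
One plausible reading of "updates periodically sent upstream" plus the unchanged-graph optimization is a dirty flag per filter: merge incoming results immediately, but forward the folded graph on the next period only if it actually changed. A hypothetical C++ sketch (the on_update/on_timer hooks and string-serialized graph are stand-ins, not the MRNet filter API):

```cpp
#include <iostream>
#include <string>

// Hypothetical per-filter state: the current folded graph (serialized
// here as a string for brevity) and a dirty flag.
struct FilterState {
    std::string folded_graph;
    bool dirty = false;

    // Called when a downstream daemon or child filter reports results.
    void on_update(const std::string& merged) {
        if (merged != folded_graph) {  // graph actually changed
            folded_graph = merged;
            dirty = true;
        }
    }

    // Called periodically; forwards upstream only when needed.
    void on_timer() {
        if (!dirty) return;            // unchanged-graph optimization
        std::cout << "send upstream: " << folded_graph << "\n";
        dirty = false;
    }
};

int main() {
    FilterState f;
    f.on_update("CPUbound/c*/myapp{*}");
    f.on_timer();  // sends once
    f.on_timer();  // nothing changed: send suppressed
}
```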

Slide 29: Outline
- Performance Consultant
- MRNet
- Distributed Performance Consultant
- Sub-Graph Folding Algorithm
- Evaluation
- Summary

Slide 30: DPC + SGFA: Evaluation
- Modified Paradyn to perform bottleneck searches using the CA, PDA, or TDA approach
- Modified instrumentation cost tracking to support PDA
  - Tracks global and per-process instrumentation cost separately
  - Simple fixed-partition policy for scheduling global and local instrumentation
- Implemented the Sub-Graph Folding Algorithm as a custom MRNet filter to support TDA (used by all approaches)
- Instrumented the front-end, daemons, and MRNet internal processes to collect CPU and I/O load information
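
The "simple fixed-partition policy" bullet suggests budgeting: reserve a fixed share of the total allowed instrumentation overhead for global experiments and leave the rest for per-process ones, deferring any experiment that would exceed its share. A C++ sketch under that assumption; the fractions and the admit rule are illustrative, not Paradyn's actual policy:

```cpp
#include <iostream>

// Hypothetical fixed-partition instrumentation cost budget: a fixed
// fraction of the total allowed overhead is reserved for global
// experiments; the remainder is for per-process (local) experiments.
struct CostBudget {
    double total_limit;   // e.g., 0.10 = at most 10% perturbation
    double global_share;  // fixed fraction of the limit for global work
    double global_used = 0.0;
    double local_used = 0.0;

    bool admit(bool is_global, double cost) {
        double limit = is_global ? total_limit * global_share
                                 : total_limit * (1.0 - global_share);
        double& used = is_global ? global_used : local_used;
        if (used + cost > limit) return false;  // defer this experiment
        used += cost;                           // account for its overhead
        return true;
    }
};

int main() {
    CostBudget budget{0.10, 0.5};
    std::cout << budget.admit(true, 0.03) << "\n";   // global: fits (1)
    std::cout << budget.admit(true, 0.04) << "\n";   // global: exceeds share (0)
    std::cout << budget.admit(false, 0.04) << "\n";  // local: fits (1)
}
```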

Slide 31: DPC + SGFA: Evaluation
- su3_rmd: QCD pure lattice gauge theory code (C, MPI)
- Weak scaling scalability study
- LLNL MCR cluster
  - 1152 nodes (1048 compute nodes)
  - Two 2.4 GHz Intel Xeons per node
  - 4 GB memory per node
  - Quadrics Elan3 interconnect (fat tree)
  - Lustre parallel file system

Slide 32: DPC + SGFA: Evaluation
- PDA and TDA: bottleneck searches with up to 1024 processes so far, limited by partition size
- CA: scalability limit at fewer than 64 processes
- Similar qualitative results from all approaches

Slides 33-41: DPC: Evaluation
[Results charts only; no slide text survives in the transcript]

Slide 42: SGFA: Evaluation
[Results chart only; no slide text survives in the transcript]

Slide 43: Summary
- Tool scalability is critical for effective use of large-scale computing resources
- On-line automated performance tools are especially important at large scale
- Our approach: MRNet, plus the Distributed Performance Consultant (TDA) with the Sub-Graph Folding Algorithm

Slide 44: References
- P.C. Roth, D.C. Arnold, and B.P. Miller, "MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools," SC 2003, Phoenix, Arizona, November 2003.
- P.C. Roth and B.P. Miller, "The Distributed Performance Consultant and the Sub-Graph Folding Algorithm: On-line Automated Performance Diagnosis on Thousands of Processes," in submission.
- Publications available from http://www.paradyn.org
- MRNet software available from http://www.paradyn.org/mrnet

