April 14, 2004 The Distributed Performance Consultant: Automated Performance Diagnosis on 1000s of Processors Philip C. Roth Computer.

April 14, 2004 The Distributed Performance Consultant: Automated Performance Diagnosis on 1000s of Processors Philip C. Roth pcroth@cs.wisc.edu Computer Sciences Department University of Wisconsin-Madison Madison, WI 53706 USA

Distributed Performance Consultant © Philip C. Roth 2004-2- High Performance Computing Today Large parallel computing resources –Tightly coupled systems Earth Simulator (Japan, 5120 CPUs) ASCI Purple (LLNL, 12K CPUs planned) BlueGene/L (LLNL, 128K CPUs planned) –Clusters Lightning (LANL, 2816 CPUs) Aspen Systems (Forecast Systems Lab, 1536 CPUs) –Grid Large applications –ASCI Q job sizes (1/1 to 1/24 2004) 257–512 cpus: 19.7% 513–1024 cpus: 24.6% 1024–2048 cpus: 16.4%

Distributed Performance Consultant © Philip C. Roth 2004-3- Tool Front End d0d0 d1d1 d2d2 d3d3 d P-4 d P-3 d P-2 d P-1 Barriers to Large-Scale Performance Diagnosis a0a0 a1a1 a2a2 a3a3 a P-4 a P-3 a P-2 a P-1 Tool Daemons App Processes Managing performance data volume Communicating efficiently between distributed tool components Making scalable presentation of data and analysis results

Distributed Performance Consultant © Philip C. Roth 2004-4- Overcoming Scalability Barriers: Our Approach 1.MRNet: multicast/reduction infrastructure for building scalable tools 2.SStart: strategy for improving tool start-up latency 3.Distributed Performance Consultant: strategy for efficiently finding performance bottlenecks in large-scale applications 4.Distributed Data Manager: for efficient, distributed performance data management 5.Sub-Graph Folding Algorithm: algorithm for effectively presenting bottleneck search results for large-scale applications

Distributed Performance Consultant © Philip C. Roth 2004-5- Overcoming Scalability Barriers: Our Approach 1.MRNet: multicast/reduction infrastructure for building scalable tools 2.SStart: strategy for improving tool start-up latency 3.Distributed Performance Consultant: strategy for efficiently finding performance bottlenecks in large-scale applications 4.Distributed Data Manager: for efficient, distributed performance data management 5.Sub-Graph Folding Algorithm: algorithm for effectively presenting bottleneck search results for large-scale applications

Distributed Performance Consultant © Philip C. Roth 2004-6- The Performance Consultant Automated performance diagnosis Search for application performance problems –Start with global, general experiments (e.g., test CPUbound across all processes) –Collect performance data using dynamic instrumentation –Make decisions about truth of each experiment –Refine search: create more specific experiments based on “true” experiments –Repeat CPUbound main Do_rowDo_col … … … c002.cs.wisc.educ001.cs.wisc.educ128.cs.wisc.edu … main … … … … … …

Distributed Performance Consultant © Philip C. Roth 2004-7- The Performance Consultant Works well for sequential and small-scale parallel applications… …but front-end is a bottleneck when looking for performance problems in large-scale applications –High data processing load –Limited network bandwidth

Distributed Performance Consultant © Philip C. Roth 2004-8- Our Approach MRNet to reduce load for processing global performance data (e.g., average CPU utilization across all processes) But front-end still processes local performance data (e.g., CPU utilization in process 5247 on host blue26.pacific.llnl.gov)  Examine behavior of a specific process on the host that runs the process

Distributed Performance Consultant © Philip C. Roth 2004-9- c002.cs.wisc.educ001.cs.wisc.edu CPUbound c001.cs.wisc.edu main myapp{367} Do_rowDo_col Do_mult c128.cs.wisc.edu main myapp{27549} Do_rowDo_col Do_mult main Do_rowDo_col Do_mult c002.cs.wisc.edu main myapp{4287} Do_rowDo_col Do_mult … …… … … … … … cham.cs.wisc.educ128.cs.wisc.edu myapp367myapp4287myapp27549 Our Approach: Example

Distributed Performance Consultant © Philip C. Roth 2004-10- The Distributed Performance Consultant (DPC) Distributed performance problem search strategy Distributed performance data management DPC-aware Dynamic Instrumentation management MRNet Front-End Data Manager Local Performance Consultant Performance Consultant Daemon Data Collector Local Instrumentation Manager Local Data Manager Data Control

Distributed Performance Consultant © Philip C. Roth 2004-12- DPC: Search Strategy Local Performance Consultant agents (LPCs) in each daemon, Global PC in front-end GPC controls overall search, manage global experiments LPCs control portion of search specific to their local host MRNet Front-End Data Manager Local Performance Consultant Performance Consultant Daemon Data Collector Local Instrumentation Manager Local Data Manager Data Control

Distributed Performance Consultant © Philip C. Roth 2004-13- DPC: Search Strategy Like traditional PC, DPC begins search with global, general experiments (e.g., CPUbound across all processes) When search is refined to specific host, GPC delegates that portion of the search to LPC on that host LPC examines local behavior of all processes on the local host, returns results to GPC

Distributed Performance Consultant © Philip C. Roth 2004-15- DPC: Performance Data Management Local Data Managers (LDMs) in each daemon, Global Data Manager in front-end Publish/subscribe model for data flow MRNet for global data aggregation Possible caching at LDMs (no caching in initial implementation) MRNet Front-End Data Manager Local Performance Consultant Performance Consultant Daemon Data Collector Local Instrumentation Manager Local Data Manager Data Control

Distributed Performance Consultant © Philip C. Roth 2004-17- DPC: Instrumentation Management Local Instrumentation Manager (LIM) services requests from GPC and LPC without starving either Instrumentation management tailored to distributed nature of DPC –New model for instrumentation cost –New policy for scheduling instrumentation MRNet Front-End Data Manager Local Performance Consultant Performance Consultant Daemon Data Collector Local Instrumentation Manager Local Data Manager Data Control

Distributed Performance Consultant © Philip C. Roth 2004-18- DPC: Instrumentation Management Mismatch between centralized control of instrumentation cost and distributed nature of DPC DPC’s instrumentation management –New model for cost of Dynamic Instrumentation –New policy for scheduling Dynamic Instrumentation requests

Distributed Performance Consultant © Philip C. Roth 2004-19- DPC: Instrumentation Cost Model User-specified cost threshold, broadcast to each Local Instrumentation Manager (LIM) Each LIM restricts active instrumentation to cost threshold GPC throttles instrumentation requests when it does not receive timely responses to instrumentation requests (avoids need to gather instrumentation cost from all LIMs to front-end)

Distributed Performance Consultant © Philip C. Roth 2004-20- DPC: Instrumentation Scheduling Policy LIM needs to schedule instrumentation requests from both LPC and GPC –GPC requests must be satisfied by LIM in all daemons to be useful –LPC requests are restricted to a single daemon Prototype LIM’s instrumentation scheduling policy chosen to ensure both local and global searches make progress –Distinguishes instrumentation cost by instrumentation type (local vs. global) –If both types are present, guarantee a percentage of total instrumentation cost threshold for each type (percentage is user-configurable)

Distributed Performance Consultant © Philip C. Roth 2004-21- DPC: Status Implementation well underway, but prototype implementation not yet complete –Daemons have been multithreaded –Local Data Managers in place –Local Performance Consultant agents in progress Planning experiments to evaluate overall scalability with various instrumentation cost models and scheduling policies

Distributed Performance Consultant © Philip C. Roth 2004-22- Summary The DPC is a major part of our approach for overcoming barriers that limit the scalability of automated performance diagnosis tools. Using a distributed search strategy, distributed performance data manager, and instrumentation management techniques tailored to its distributed nature, the DPC will greatly increase the scalability of Paradyn’s automated performance diagnosis functionality. http://www.paradyn.org paradyn@cs.wisc.edu

April 14, 2004 The Distributed Performance Consultant: Automated Performance Diagnosis on 1000s of Processors Philip C. Roth Computer.

Similar presentations

Presentation on theme: "April 14, 2004 The Distributed Performance Consultant: Automated Performance Diagnosis on 1000s of Processors Philip C. Roth Computer."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

April 14, 2004 The Distributed Performance Consultant: Automated Performance Diagnosis on 1000s of Processors Philip C. Roth Computer.

Similar presentations

Presentation on theme: "April 14, 2004 The Distributed Performance Consultant: Automated Performance Diagnosis on 1000s of Processors Philip C. Roth Computer."— Presentation transcript:

Similar presentations

About project

Feedback