On-line Automated Performance Diagnosis on Thousands of Processors
Philip C. Roth, Future Technologies Group, Oak Ridge National Laboratory

Presentation transcript:

Slide 1: On-line Automated Performance Diagnosis on Thousands of Processors
Philip C. Roth
Future Technologies Group, Computer Science and Mathematics Division, Oak Ridge National Laboratory
Paradyn Research Group, Computer Sciences Department, University of Wisconsin-Madison

Slide 2: High Performance Computing Today
- Large parallel computing resources
  - Tightly coupled systems (Earth Simulator, BlueGene/L, XT3)
  - Clusters (LANL Lightning, LLNL Thunder)
  - Grid
- Large, complex applications
  - ASCI Blue Mountain job sizes (2001)
    - 512 CPUs: 17.8%
    - 1024 CPUs: 34.9%
    - 2048 CPUs: 19.9%
- Small fraction of peak performance is the rule

Slide 3: Achieving Good Performance
- Need to know what and where to tune
- Diagnosis and tuning tools are critical for realizing the potential of large-scale systems
- On-line automated tools are especially desirable
  - Manual tuning is difficult
    - Finding interesting data in a large data volume
    - Understanding application, OS, and hardware interactions
  - Automated tools require minimal user involvement; expertise is built into the tool
  - On-line automated tools can adapt dynamically
    - Dynamic control over data volume
    - Useful results from a single run
- But: tools that work well in small-scale environments often don’t scale

Slide 4: Barriers to Large-Scale Performance Diagnosis
[Figure: typical tool organization, with the tool front end connected directly to tool daemons d0 … dP-1, each monitoring an application process a0 … aP-1]
- Managing performance data volume
- Communicating efficiently between distributed tool components
- Making scalable presentation of data and analysis results

Slide 5: Our Approach for Addressing These Scalability Barriers
- MRNet: multicast/reduction infrastructure for scalable tools
- Distributed Performance Consultant: strategy for efficiently finding performance bottlenecks in large-scale applications
- Sub-Graph Folding Algorithm: algorithm for effectively presenting bottleneck diagnosis results for large-scale applications

Slide 6: Outline
- Performance Consultant
- MRNet
- Distributed Performance Consultant
- Sub-Graph Folding Algorithm
- Evaluation
- Summary

Slide 7: Performance Consultant
- Automated performance diagnosis
  - Search for application performance problems
  - Start with global, general experiments (e.g., test CPUbound across all processes)
- Collect performance data using dynamic instrumentation
  - Collect only the data desired
  - Remove the instrumentation when no longer needed
- Make decisions about the truth of each experiment
- Refine the search: create more specific experiments based on “true” experiments (those whose data is above a user-configurable threshold)
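
To make the refinement loop above concrete, here is a minimal Python sketch of a Performance-Consultant-style search. It is illustrative only, not Paradyn's code: Experiment, consultant_search, and the callback parameters (instrument, read_metric, remove, children_of) are hypothetical names, and the 20% threshold merely stands in for the user-configurable setting.

```python
# Illustrative sketch of the Performance Consultant's refinement loop, not Paradyn's
# actual code: Experiment, consultant_search, and the callback parameters are
# hypothetical names, and THRESHOLD stands in for the user-configurable setting.

THRESHOLD = 0.20  # fraction of run time above which an experiment is considered "true"

class Experiment:
    def __init__(self, hypothesis, focus):
        self.hypothesis = hypothesis  # e.g. "CPUbound"
        self.focus = focus            # e.g. whole program, one process, one function

def consultant_search(roots, instrument, read_metric, remove, children_of):
    """Breadth-first search: instrument, measure, decide, then refine true experiments."""
    frontier = list(roots)            # start with global, general experiments
    findings = []
    while frontier:
        exp = frontier.pop(0)
        probe = instrument(exp)       # dynamic instrumentation: collect only the data desired
        value = read_metric(probe)
        remove(probe)                 # remove the instrumentation when no longer needed
        if value >= THRESHOLD:        # the experiment is "true": refine it
            findings.append((exp.hypothesis, exp.focus, value))
            frontier.extend(Experiment(exp.hypothesis, f) for f in children_of(exp.focus))
    return findings
```

In a real tool the callbacks would be backed by dynamic instrumentation; the sketch only captures the insert-measure-remove-refine cycle that the slide describes.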

Slide 8: Performance Consultant
[Figure: the machine hierarchy used by the search: hosts c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu, each running an application process (myapp{367}, myapp{4287}, …, myapp{27549})]

Slide 9: Performance Consultant
[Figure: search in progress: the CPUbound hypothesis refined to hosts c001.cs.wisc.edu, c002.cs.wisc.edu, …, c128.cs.wisc.edu, then to their processes myapp{367}, myapp{4287}, …, myapp{27549}, and within each process to main and the functions Do_row, Do_col, and Do_mult]

Slide 10: Performance Consultant
[Figure: the same search history graph as the previous slide, with the host list now also including cham.cs.wisc.edu]

Slide 11: Outline
- Performance Consultant
- MRNet
- Distributed Performance Consultant
- Sub-Graph Folding Algorithm
- Evaluation
- Summary

Slide 12: MRNet: Multicast/Reduction Overlay Network
- Parallel tool infrastructure providing:
  - Scalable multicast
  - Scalable data synchronization and transformation
- Network of processes between tool front-end and back-ends
- Useful for parallelizing and distributing tool activities
  - Reduce latency
  - Reduce computation and communication load at the tool front-end
- Joint work with Dorian Arnold (University of Wisconsin-Madison)
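
The slide above describes MRNet's role abstractly; the following Python sketch shows the general multicast/reduction pattern it implements. This is not the MRNet C++ API: InternalNode, Daemon, and average_filter are invented names used only to show how a tree of internal processes with filters reduces per-daemon data before it reaches the front end.

```python
# Generic sketch of the multicast/reduction pattern behind MRNet; this is NOT the MRNet
# C++ API. InternalNode, Daemon, and average_filter are invented names.

def average_filter(packets):
    """Transformation filter: fold per-daemon CPU utilization into one weighted average."""
    count = sum(p["count"] for p in packets)
    total = sum(p["cpu_util"] * p["count"] for p in packets)
    return {"cpu_util": total / count, "count": count}

class Daemon:
    def __init__(self, cpu_util):
        self.cpu_util = cpu_util
    def multicast(self, request):
        pass                          # a real daemon would start/stop instrumentation here
    def gather(self):
        return {"cpu_util": self.cpu_util, "count": 1}

class InternalNode:
    def __init__(self, children, filter_fn):
        self.children = children      # daemons or other internal nodes
        self.filter_fn = filter_fn
    def multicast(self, request):
        for child in self.children:   # requests fan out down the tree to every daemon
            child.multicast(request)
    def gather(self):
        # Data fans in: each level reduces before sending upstream, so the front end
        # receives one packet instead of one packet per daemon.
        return self.filter_fn([child.gather() for child in self.children])

# A two-level tree over four daemons: the front end sees a single aggregated packet.
root = InternalNode([InternalNode([Daemon(0.9), Daemon(0.7)], average_filter),
                     InternalNode([Daemon(0.5), Daemon(0.3)], average_filter)],
                    average_filter)
summary = root.gather()               # roughly {'cpu_util': 0.6, 'count': 4}
```

The point of the design is that each level of the tree both fans requests out and folds results in, so the front end's communication and computation load stays roughly constant as the daemon count grows.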

Slide 13: Typical Parallel Tool Organization
[Figure: the tool front end connected directly to tool daemons d0 … dP-1, each attached to an application process a0 … aP-1]

Slide 14: MRNet-based Parallel Tool Organization
[Figure: a multicast/reduction network of internal processes with filters placed between the tool front end and the tool daemons d0 … dP-1 and their application processes a0 … aP-1]

Slide 15: Outline
- Performance Consultant
- MRNet
- Distributed Performance Consultant
- Sub-Graph Folding Algorithm
- Evaluation
- Summary

Slide 16: Performance Consultant: Scalability Barriers
- MRNet can alleviate the scalability problem for global performance data (e.g., CPU utilization across all processes)
- But the front-end still processes local performance data (e.g., utilization of process 5247 on host mcr398.llnl.gov)

Slide 17: Performance Consultant
[Figure: the full search history graph from slide 10 again; the front end maintains one sub-graph per host and process, each refined through main, Do_row, Do_col, and Do_mult]

Slide 18: Distributed Performance Consultant
[Figure: the same search history graph, used to introduce distributing the search across the tool daemons]

Slide 19: Distributed Performance Consultant: Variants
- Natural steps from the traditional centralized approach (CA)
- Partially Distributed Approach (PDA)
  - Distributed local searches, centralized global search
  - Requires complex instrumentation management
- Truly Distributed Approach (TDA)
  - Distributed local searches only
  - Insight into global behavior from combining local search results (e.g., using the Sub-Graph Folding Algorithm)
  - Simpler tool design than PDA
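
Continuing the hypothetical sketch started at slide 7, the TDA side of this comparison can be shown in a few lines: each daemon runs the search locally and ships only its resulting sub-graph upstream, with no centralized global search. The function below reuses consultant_search from the earlier sketch; tda_daemon and send_upstream are invented names, not Paradyn's implementation.

```python
# Sketch of a TDA daemon's role, reusing consultant_search from the earlier sketch;
# tda_daemon and send_upstream are hypothetical names, not Paradyn's implementation.

def tda_daemon(local_roots, instrument, read_metric, remove, children_of, send_upstream):
    # Truly Distributed Approach: the entire search for this process runs in the daemon.
    local_findings = consultant_search(local_roots, instrument, read_metric,
                                       remove, children_of)
    # Only the resulting local search sub-graph travels upstream, where the
    # Sub-Graph Folding Algorithm combines it with the other daemons' results.
    send_upstream(local_findings)
```

Under PDA, by contrast, the front end would still run a global search and coordinate instrumentation budgets across daemons, which is the extra management complexity the slide refers to.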

Slide 20: Distributed Performance Consultant: PDA
[Figure: the search history graph under PDA: the centralized global search (CPUbound) remains at the front end, while the per-host, per-process refinements (main, Do_row, Do_col, Do_mult) are searched locally]

Slide 21: Distributed Performance Consultant: TDA
[Figure: under TDA each daemon searches only its own process (myapp{367}, myapp{4287}, …, myapp{27549}), producing independent per-host sub-graphs with no centralized global search]

Slide 22: Distributed Performance Consultant: TDA
[Figure: the per-host sub-graphs produced by the local TDA searches are combined by the Sub-Graph Folding Algorithm]

Slide 23: Outline
- Paradyn and the Performance Consultant
- MRNet
- Distributed Performance Consultant
- Sub-Graph Folding Algorithm
- Evaluation
- Summary

Slide 24: Search History Graph Example
[Figure: an example search history graph rooted at CPUbound with one sub-graph per process: myapp{1272} and myapp{1273} on c33.cs.wisc.edu, and myapp{7624} and myapp{7625} on c34.cs.wisc.edu, each refining main into functions A, B, C, D (and E in myapp{1273})]

Slide 25: Search History Graphs
- The Search History Graph is effective for presenting search-based performance diagnosis results…
- …but it does not scale to a large number of processes because it shows one sub-graph per process

Slide 26: Sub-Graph Folding Algorithm
- Combines host-specific sub-graphs into composite sub-graphs
- Each composite sub-graph represents a behavioral category among application processes
- Dynamic clustering of processes by qualitative behavior
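
A minimal sketch of the folding idea, assuming a per-process sub-graph can be summarized by its set of refinement edges; fold() and this edge representation are simplifications for illustration, not Paradyn's data structures.

```python
# Minimal sketch of the folding step: processes whose search sub-graphs are structurally
# identical are folded into one composite labeled with a wildcard process name.

from collections import defaultdict

def fold(subgraphs):
    """subgraphs maps 'host/process' -> set of (parent, child) refinement edges."""
    categories = defaultdict(list)
    for name, edges in subgraphs.items():
        categories[frozenset(edges)].append(name)   # identical edges => same behavior
    composites = []
    for edges, members in categories.items():
        composites.append({"label": "myapp{*}" if len(members) > 1 else members[0],
                           "members": sorted(members),
                           "edges": sorted(edges)})
    return composites

# Two processes that stop refining at D fold together; the one that also refines to E
# forms a second behavioral category.
folded = fold({
    "c33.cs.wisc.edu/myapp{1272}": {("main", "C"), ("C", "D")},
    "c34.cs.wisc.edu/myapp{7624}": {("main", "C"), ("C", "D")},
    "c33.cs.wisc.edu/myapp{1273}": {("main", "C"), ("C", "D"), ("C", "E")},
})
```

Each composite in the result corresponds to one behavioral category of processes, which is the clustering effect illustrated on the next slide.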

Slide 27: SGFA: Example
[Figure: the example graph from slide 24 folded into composite sub-graphs for c*.cs.wisc.edu / myapp{*}: processes that agree through D form one category, and the process that also refines to E forms another]

Slide 28: SGFA: Implementation
- Custom MRNet filter
  - Filter in each MRNet process keeps a folded graph of search results from all reachable daemons
  - Updates periodically sent upstream
  - By induction, the filter in the front-end holds the entire folded graph
  - Optimization for unchanged graphs
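
The filter behavior described above might look roughly like the following, reusing fold() from the previous sketch; SgfaFilter, on_update, and forward_upstream are invented names, not the MRNet filter API.

```python
# Rough sketch of the per-process filter behavior described above, reusing fold() from
# the previous sketch; SgfaFilter, on_update, and forward_upstream are invented names.

class SgfaFilter:
    def __init__(self, forward_upstream):
        self.per_child = {}             # latest per-process sub-graphs from each child
        self.last_sent = None           # last folded graph forwarded upstream
        self.forward_upstream = forward_upstream

    def on_update(self, child_id, child_subgraphs):
        # Merge this child's report into the view of all reachable daemons below us.
        self.per_child[child_id] = child_subgraphs
        merged = {}
        for graphs in self.per_child.values():
            merged.update(graphs)
        folded = fold(merged)
        # Optimization for unchanged graphs: forward upstream only when something
        # actually changed, so an idle subtree adds no upstream traffic.
        if folded != self.last_sent:
            self.last_sent = folded
            self.forward_upstream(folded)
```

Because every level folds its children's graphs before forwarding, the front end's filter ends up holding the folded graph for the whole tree, which is the inductive argument on the slide.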

Slide 29: Outline
- Performance Consultant
- MRNet
- Distributed Performance Consultant
- Sub-Graph Folding Algorithm
- Evaluation
- Summary

Slide 30: DPC + SGFA: Evaluation
- Modified Paradyn to perform bottleneck searches using the CA, PDA, or TDA approach
- Modified instrumentation cost tracking to support PDA
  - Track global and per-process instrumentation cost separately
  - Simple fixed-partition policy for scheduling global and local instrumentation
- Implemented the Sub-Graph Folding Algorithm as a custom MRNet filter to support TDA (used by all)
- Instrumented front-end, daemons, and MRNet internal processes to collect CPU and I/O load information
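
As a rough illustration of the fixed-partition policy mentioned above (the names and budget values here are hypothetical, not Paradyn's actual settings), each new probe is admitted only if its own partition, global or local, stays within its fixed share of the instrumentation budget:

```python
# Illustrative sketch of a fixed-partition instrumentation budget; the names and the
# budget values are hypothetical, not Paradyn's actual policy or thresholds.

GLOBAL_BUDGET = 0.05   # share of run time reserved for global (whole-application) probes
LOCAL_BUDGET = 0.05    # separate share reserved for this process's local probes

def may_insert(kind, global_cost, local_cost, probe_cost):
    """Admit a new probe only if its partition stays within its fixed share."""
    if kind == "global":
        return global_cost + probe_cost <= GLOBAL_BUDGET
    return local_cost + probe_cost <= LOCAL_BUDGET
```

Keeping the two partitions separate is what lets the PDA variant schedule global and local instrumentation independently without one starving the other.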

Slide 31: DPC + SGFA: Evaluation
- su3_rmd
  - QCD pure lattice gauge theory code
  - C, MPI
- Weak scaling scalability study
- LLNL MCR cluster
  - 1152 nodes (1048 compute nodes)
  - Two 2.4 GHz Intel Xeons per node
  - 4 GB memory per node
  - Quadrics Elan3 interconnect (fat tree)
  - Lustre parallel file system

Slide 32: DPC + SGFA: Evaluation
- PDA and TDA: bottleneck searches with up to 1024 processes so far, limited by partition size
- CA: scalability limit at less than 64 processes
- Similar qualitative results from all approaches

Slides 33-41: DPC: Evaluation
[Figures: evaluation result charts for the Distributed Performance Consultant; the chart content is not included in the transcript]

Slide 42: SGFA: Evaluation
[Figure: SGFA evaluation result; the figure content is not included in the transcript]

Slide 43: Summary
- Tool scalability is critical for effective use of large-scale computing resources
- On-line automated performance tools are especially important at large scale
- Our approach:
  - MRNet
  - Distributed Performance Consultant (TDA) plus Sub-Graph Folding Algorithm

Slide 44: References
- P.C. Roth, D.C. Arnold, and B.P. Miller, “MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools,” SC 2003, Phoenix, Arizona, November 2003
- P.C. Roth and B.P. Miller, “The Distributed Performance Consultant and the Sub-Graph Folding Algorithm: On-line Automated Performance Diagnosis on Thousands of Processes,” in submission
- Publications available from
- MRNet software available from