Analysis of Cluster Failures on Blue Gene Supercomputers


Analysis of Cluster Failures on Blue Gene Supercomputers
Tom Hacker*, Fabian Romero+, Chris Carothers
Scientific Computation Research Center, Department of Computer Science, Rensselaer Polytechnic Institute
Dept. of Computer & Information Tech.*, Discovery Cyber Center+, Purdue University

Outline
- Update on NSF PetaApps CFD Project
  - PetaApps Project Components
  - Current Scaling Results
  - Challenges for Fault Tolerance
- Analysis of Clustered Failures*
  - EPFL – 8K Blue Gene/L
  - RPI – 32K Blue Gene/L
  - Analysis Approach
  - Findings
- Summary
*To appear in an upcoming issue of JPDC

NSF PetaApps: Parallel Adaptive CFD
Ken Jansen (PD), Onkar Sahni, Chris Carothers, Mark S. Shephard
Scientific Computation Research Center, Rensselaer Polytechnic Institute
PetaApps Components: CFD Solver, Adaptivity, Petascale Perf. Sim., Fault Recovery
Demonstration Apps: Cardiovascular Flow, Flow Control, Two-phase Flow
Acknowledgments:
- Partners: Simmetrix, Acusim, Kitware, IBM
- NSF: PetaApps, ITR, CTS; DOE: SciDAC-ITAPS, NERI; AFOSR
- Industry: IBM, Northrop Grumman, Boeing, Lockheed Martin, Motorola
- Computer Resources: TeraGrid, ANL, NERSC, RPI-CCNI

PHASTA Flow Solver Parallel Paradigm
Time-accurate, stabilized FEM flow solver. Two types of work:
1. Equation formation
- O(40) peer-to-peer non-blocking comms
- Overlapping comms with computation
- Scales well on many machines
2. Implicit, iterative equation solution
- Matrix assembled on processor ONLY
- Each Krylov vector is q = Ap (matrix-vector product)
- Same peer-to-peer comm of q, PLUS orthogonalization against prior vectors, which REQUIRES norms => MPI_Allreduce
- This sets up a cycle of global comms separated by a modest amount of work; we are not currently able to overlap these comms
- Even if work is balanced perfectly, OS jitter can imbalance it, and the imbalance WILL show up in MPI_Allreduce
- Scales well on machines with low noise (like Blue Gene)

Parallel Implicit Flow Solver – Incompressible Abdominal Aorta Aneurysm (AAA)
IBM BG/L, RPI-CCNI

Cores (avg. elems./core) | t (secs.) | scale factor
512 (204800)             | 2119.7    | 1 (base)
1024 (102400)            | 1052.4    | 1.01
2048 (51200)             | 529.1     | 1.00
4096 (25600)             | 267.0     | 0.99
8192 (12800)             | 130.5     | 1.02
16384 (6400)             | 64.5      | 1.03
32768 (3200)             | 35.6      | 0.93

32K parts show modest degradation due to 15% node imbalance (with only about 600 mesh-nodes/part).
Rgn./elem. ratio_i = rgns_i / avg_rgns
Node ratio_i = nodes_i / avg_nodes
(Min Zhou)
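The scale factors in the table follow from strong scaling against the 512-core baseline: ideal scaling halves the runtime each time the core count doubles. A small sketch, using the timings copied from the slide:

```python
# Strong-scaling factor from the AAA table: s = (t_base * n_base) / (t * n),
# where the 512-core run is the baseline. Timings copied from the slide.
runs = [
    (512, 2119.7), (1024, 1052.4), (2048, 529.1), (4096, 267.0),
    (8192, 130.5), (16384, 64.5), (32768, 35.6),
]

base_cores, base_time = runs[0]

def scale_factor(cores, time):
    """Ideal scaling gives 1.0; >1.0 is super-linear, <1.0 sub-linear."""
    return (base_time * base_cores) / (time * cores)

for cores, t in runs:
    print(cores, round(scale_factor(cores, t), 2))
```

Running this reproduces the table's scale-factor column, including the 0.93 dip at 32K cores.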

Scaling of the “AAA” 105M-element case: the scaling loss is due to OS jitter in MPI_Allreduce.

AAA Adapted to 10^9 Elements: Scaling on Blue Gene/P

# of cores | Rgn imb | Vtx imb | Time (s) | Scaling
32k        | 1.72%   | 8.11%   | 112.43   | 0.987
128k       | 5.49%   | 17.85%  | 31.35    | 0.885

ROSS: Massively Parallel Discrete-Event Simulation
Local Control Mechanism: error detection and rollback
- (1) undo state Δ's
- (2) cancel “sent” events
Global Control Mechanism: compute Global Virtual Time (GVT)
- collect versions of state/events & perform I/O operations that are < GVT
(Figure: virtual-time axes for LP 1–3 showing processed, unprocessed, “straggler,” and “committed” events relative to GVT.)

PHOLD is a stress test. On BG/L @ CCNI:
- 1 million LPs (note: the DES of BG/L comms had 6M LPs)
- 10 events per LP
- Up to 100% probability that any event is scheduled to any other LP; other events are scheduled to self

PHOLD on Blue Gene/P
At 64K cores, only 16 LPs per core with 10 events per LP. At 128K cores, only 8 LPs per core == MAX parallelism, and performance drops off significantly. Peak performance of 12.26 billion events/sec for the 10% remote case.

Challenges for Petascale Fault Tolerance
Good news with caveats: our applications are scaling well. But scaling runs are relatively short (i.e., < 5 mins) and so don't experience failures.
One early example: PHASTA could only run for at most 5 hours using 32K nodes before Blue Gene/L lost at least one node and the whole program died.
Systems are I/O bound and cannot checkpoint fast enough: BG/P “Intrepid” has 550+ TFlops of compute but only 55 to 60 GB/sec of disk I/O bandwidth using 4 MB data blocks. At 10% of peak flops, it can only do ~1 “double” of I/O per 8000 double-precision FLOPS.
So we need to understand how systems fail in order to build efficient fault-tolerant supercomputer systems.
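The ~1-double-per-8000-FLOPS figure follows directly from the slide's numbers. A back-of-envelope check, assuming 10% of peak sustained and the 55 GB/sec end of the bandwidth range:

```python
# Back-of-envelope check of the I/O-to-FLOP ratio for BG/P "Intrepid",
# using the figures from the slide.
peak_flops = 550e12                  # 550+ TFlops peak compute
sustained_flops = 0.10 * peak_flops  # assuming 10% of peak is sustained
io_bw_bytes = 55e9                   # ~55 GB/sec disk I/O bandwidth
doubles_per_sec = io_bw_bytes / 8    # 8 bytes per double-precision value

flops_per_double_io = sustained_flops / doubles_per_sec
print(flops_per_double_io)           # ~8000 FLOPS per double of I/O
```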

Assessing Reliability on Petascale Systems
Systems containing a large number of components experience a low mean time to failure. The usual methods of failure analysis assume:
- Independent failure events
- Exponentially distributed time between events
(Both are necessary for homogeneous Markov modeling and Poisson processes.)
Practical experience with systems of this scale shows that:
- Failures are frequently cascading
- Analyzing the time between events is difficult in reality
- It is difficult to put knowledge about reliability to practical use
A better understanding of reliability from a practical perspective would:
- Increase the reliability of large and long-running jobs
- Decrease checkpoint frequency
- Squeeze more efficiency out of large-scale systems
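The contrast between the exponential assumption and observed behavior shows up in the hazard rate: a Weibull distribution with shape k < 1 has a decreasing hazard (failures bunch up early, as in cascading failures), while k = 1 reduces to the exponential's constant, memoryless hazard. A minimal sketch with illustrative parameters, not values fitted to the Blue Gene logs:

```python
# Weibull hazard rate h(t) = (k/lam) * (t/lam)**(k-1).
# k and lam below are illustrative, not fitted to any real failure data.
def weibull_hazard(t, k, lam):
    """Instantaneous failure rate at time t for shape k, scale lam."""
    return (k / lam) * (t / lam) ** (k - 1)

# k < 1: hazard decreases over time -> clustered, non-Poisson failures.
early = weibull_hazard(1.0, 0.5, 10.0)
late = weibull_hazard(100.0, 0.5, 10.0)
print(early > late)  # True

# k = 1: constant hazard, i.e., the memoryless exponential special case.
print(weibull_hazard(1.0, 1.0, 10.0) == weibull_hazard(100.0, 1.0, 10.0))  # True
```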

Our Approach
- Understand the underlying statistics of failure on a large system in the field
- Use this understanding to attempt to predict node reliability
- Put this knowledge to work to improve job reliability

Characteristics of Failure
Failures in large-scale systems are rarely independent singular events. A set of failures can arise from a single underlying cause:
- Network problems
- Rack-level problems
- Software subsystem problems
Failures manifest as a cluster of failures, grouped spatially (e.g., in a rack) and grouped temporally.

Blue Gene
We gathered RAS (Reliability, Availability, and Serviceability) logs from two large Blue Gene systems (EPFL and RPI).

Blue Gene RAS Data
Events include a level of severity — INFO (least severe), WARNING, SEVERE, ERROR, FATAL, FAILURE — plus a location and time.
We mapped these events into a 3D space to understand what was happening over time on the system:
- Node address -> X axis
- Time of event -> Y axis
- Severity level -> Z axis
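The mapping above can be sketched as a simple event-to-coordinate transform. The event tuples and integer severity scale here are illustrative stand-ins, not the actual Blue Gene log format:

```python
# Illustrative 3D mapping of RAS events: (x, y, z) = (node, time, severity).
# The severity ordering matches the slide (INFO least severe .. FAILURE most).
SEVERITY = {"INFO": 0, "WARNING": 1, "SEVERE": 2,
            "ERROR": 3, "FATAL": 4, "FAILURE": 5}

def to_point(node_address, timestamp, severity):
    """Map one RAS event to a 3D point for plotting/inspection."""
    return (node_address, timestamp, SEVERITY[severity])

# Hypothetical events: (node address, timestamp in seconds, severity label).
events = [
    (101, 3600, "INFO"),
    (101, 3605, "FATAL"),
    (512, 9999, "WARNING"),
]
points = [to_point(*e) for e in events]
print(points)
```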

RPI Blue Gene Event Graph

EPFL Blue Gene Event Graph

Assessing Events
The event data show significant spatial and temporal clustering. We used cluster analysis in R to collapse the clustered events; the time between the resulting events follows a Weibull distribution.
We then needed a model to predict node reliability. In practice, nodes are either:
- Healthy and operating normally
- Degraded and suspect
- Down
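The slides do the clustering in R; as a rough stand-in for the temporal part, a gap-based grouping treats events closer together than some threshold as one underlying failure. The threshold and timestamps below are illustrative:

```python
# Gap-based temporal clustering: events within max_gap seconds of the
# previous event join its cluster. A crude stand-in for the R analysis.
def cluster_by_gap(timestamps, max_gap=300):
    """Group sorted event times into clusters separated by > max_gap seconds."""
    clusters = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)   # continues the current burst
        else:
            clusters.append([t])     # starts a new cluster
    return clusters

# A burst of events around t=1000 collapses into a single cluster.
times = [1000, 1010, 1040, 5000, 5100, 20000]
clusters = cluster_by_gap(times)
print(len(clusters))  # 3 clusters instead of 6 raw events
```

Inter-arrival times would then be computed between cluster start times rather than raw events, which is what removes the artificial short gaps inside a burst.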

Node Reliability Model

Two Markov Models
- Accurate, but slow: a continuous-time Markov model; the cluster analysis takes over 10 hours.
- Less accurate, but much faster: a discrete-time Markov model with an adjustable time step; computed in minutes.

Predicted Reliability RPI & EPFL

Practical Application
We can use this information to guide the scheduler:
- Rank nodes by predicted reliability (high reliability = least likely to fail; low reliability = most likely to fail)
- Assign the most reliable nodes to the largest queues
- Assign the least reliable nodes to the smallest queues
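The ranking-and-assignment idea can be sketched in a few lines. The node names, reliability scores, and queue sizes here are hypothetical:

```python
# Rank nodes by predicted reliability and serve the largest queues first,
# so big (most failure-exposed) jobs get the most reliable nodes.
def assign_nodes(reliability, queue_sizes):
    """reliability: {node: predicted probability of surviving the job window}.
    queue_sizes: {queue_name: number of nodes needed}."""
    ranked = sorted(reliability, key=reliability.get, reverse=True)
    assignment = {}
    for queue, need in sorted(queue_sizes.items(),
                              key=lambda kv: kv[1], reverse=True):
        assignment[queue] = ranked[:need]   # take the best remaining nodes
        ranked = ranked[need:]
    return assignment

# Hypothetical nodes and queues.
nodes = {"n1": 0.99, "n2": 0.80, "n3": 0.95, "n4": 0.60}
queues = {"large": 2, "small": 2}
print(assign_nodes(nodes, queues))  # large gets n1, n3; small gets n2, n4
```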

Summary
Our apps are scaling well on balanced hardware, but we have a strong need to understand how failures impact performance, especially at petascale levels.
Analysis of the failure logs suggests failures follow a Weibull distribution, and semi-Markov models are able to assess the reliability of nodes on these systems.
Nodes that log a large number of RAS events (i.e., are noisy) are less reliable than nodes that log few events (i.e., < 3).
Grouping less “noisy” nodes together creates a partition that is much less likely to fail, which significantly improves overall job completion rates and reduces the need for checkpointing.