Reliability, Availability, and Serviceability (RAS) for High-Performance Computing
Presented by Stephen L. Scott and Christian Engelmann
Computer Science and Mathematics Division
Oak Ridge National Laboratory, Oak Ridge, TN, USA

2 Scott_RAS_0614 Research and development goals
 Develop techniques to enable HPC systems to run computational jobs 24x7
 Develop proof-of-concept prototypes and production-type RAS solutions
 Provide high-level RAS capabilities for current terascale and next-generation petascale high-performance computing (HPC) systems
 Eliminate many of the numerous single points of failure and control in today’s HPC systems

3 Scott_RAS_0614 MOLAR: Adaptive runtime support for high-end computing operating and runtime systems
 Addresses the challenges for operating and runtime systems to run large applications efficiently on future ultra-scale high-end computers
 Part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)
 MOLAR is a collaborative research effort (

4 Scott_RAS_0614 Active/standby with shared storage
 Single active head node
 Backup to shared storage
 Simple checkpoint/restart
 Fail-over to standby node
Drawbacks:
– Possible corruption of backup state when failing during backup
– Introduction of a new single point of failure
– No guarantee of correctness and availability
Examples: Simple Linux Utility for Resource Management (SLURM); metadata servers of the Parallel Virtual File System and Lustre
[Figure: Active/Standby Head Nodes with Shared Storage]
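The fail-over behavior sketched on this slide can be illustrated in a few lines: the standby node watches a heartbeat on the shared storage and, once the active head node stops refreshing it, restores the service from the last checkpoint. This is only a hedged sketch; the paths, the timeout, and the restore_service hook are hypothetical and not taken from the slides.

    import json, os, time

    HEARTBEAT = "/shared/head-node/heartbeat"        # hypothetical shared-storage paths
    CHECKPOINT = "/shared/head-node/checkpoint.json"
    TIMEOUT = 30.0                                   # seconds without a heartbeat before fail-over

    def restore_service(state):
        """Hypothetical hook: restart head-node services from the checkpointed state."""
        print("restoring service with state:", state)

    def standby_loop():
        while True:
            time.sleep(5)
            # If the active node stopped refreshing the heartbeat, assume it failed.
            if time.time() - os.path.getmtime(HEARTBEAT) > TIMEOUT:
                with open(CHECKPOINT) as f:
                    state = json.load(f)             # simple checkpoint/restart from shared storage
                restore_service(state)
                break                                # this node now acts as the active head node

As the slide notes, the shared checkpoint itself becomes a single point of failure: if the active node dies while writing it, the standby may restore corrupted state.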

5 Scott_RAS_0614 Active/standby redundancy
 Single active head node
 Backup to standby node
 Simple checkpoint/restart
 Fail-over to standby node
 Idle standby head node
 Rollback to backup
 Service interruption for fail-over and restore-over
Examples: HA-OSCAR; Torque on the Cray XT
[Figure: Active/Standby Head Nodes]

6 Scott_RAS_0614 Asymmetric active/active redundancy
 Many active head nodes
 Workload distribution
 Optional fail-over to standby head node(s) (n+1 or n+m)
 No coordination between active head nodes
 Service interruption for fail-over and restore-over
 Loss of state without standby
 Limited use cases, such as high-throughput computing
Prototype based on HA-OSCAR
[Figure: Asymmetric Active/Active Head Nodes]

7 Scott_RAS_0614 Symmetric active/active redundancy
 Many active head nodes
 Workload distribution
 Symmetric replication between head nodes
 Continuous service
 Always up to date
 No fail-over necessary
 No restore-over necessary
 Virtual synchrony model
 Complex algorithms
 JOSHUA prototype for Torque
[Figure: Active/Active Head Nodes]

8 Scott_RAS_0614 Symmetric active/active replication
[Figure: input replication → virtually synchronous processing → output unification]
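A minimal sketch of the three stages named on this slide: every replica receives the same totally ordered input stream, applies it deterministically, and identical outputs are collapsed into a single reply. The Replica class and the job-submission commands below are illustrative assumptions; a real system would rely on a group-communication layer to deliver inputs in the same total order everywhere.

    class Replica:
        """One active head node holding a deterministic copy of the service state."""
        def __init__(self, name):
            self.name, self.log = name, []

        def process(self, command):
            # Virtually synchronous processing: every replica applies the same
            # commands in the same order, so their states stay identical.
            self.log.append(command)
            return f"ack:{command}"

    replicas = [Replica("head0"), Replica("head1"), Replica("head2")]
    ordered_inputs = ["submit job 1", "submit job 2"]     # input replication (total order)

    for cmd in ordered_inputs:
        outputs = {r.process(cmd) for r in replicas}      # output unification:
        assert len(outputs) == 1                          # identical replies collapse to one
        print(outputs.pop())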

9 Scott_RAS_0614 Symmetric active/active high availability for head and service nodes
 A_component = MTTF / (MTTF + MTTR)
 A_system = 1 − (1 − A_component)^n
 T_down = 8760 hours × (1 − A_system)
 Single-node MTTF: 5,000 hours
 Single-node MTTR: 72 hours

Nodes | Availability | Est. annual downtime
1     | 98.58%       | 5d 4h 21m
2     | 99.98%       | 1h 45m
3     | 99.9997%     | 1m 30s
4     | 99.999996%   | 1s

Single-site redundancy for 7 nines does not mask catastrophic events.
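The table above follows directly from the formulas and the single-node MTTF/MTTR given on the slide; a minimal sketch of the arithmetic:

    # Availability arithmetic for n redundant head nodes, each with
    # MTTF = 5000 h and MTTR = 72 h, as stated on the slide.
    MTTF, MTTR = 5000.0, 72.0          # hours
    HOURS_PER_YEAR = 8760.0

    a_component = MTTF / (MTTF + MTTR)  # availability of a single node

    for n in range(1, 5):
        a_system = 1.0 - (1.0 - a_component) ** n    # at least one of n nodes is up
        downtime_h = HOURS_PER_YEAR * (1.0 - a_system)
        print(f"{n} node(s): availability {a_system * 100:.6f}%, "
              f"annual downtime {downtime_h:.4f} h")

For one node this gives about 124.3 hours of downtime per year (5d 4h 21m); four nodes bring it down to roughly one second.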

10 Scott_RAS_0614 High-availability framework for HPC
 Pluggable component framework: communication drivers, group communication, virtual synchrony, applications
 Interchangeable components
 Adaptation to application needs, such as level of consistency
 Adaptation to system properties, such as network and system scale
[Figure: framework layers, top to bottom]
– Applications: scheduler, MPI runtime, file system, SSI
– Virtual synchrony: replicated memory, replicated file, replicated state machine, replicated database, replicated RPC/RMI, distributed control
– Group communication: membership management, failure detection, reliable multicast, atomic multicast
– Communication driver: singlecast, failure detection, multicast
– Network: Ethernet, Myrinet, Elan, InfiniBand, …
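One way to read the component stack above is as a set of narrow, swappable interfaces, each layer built only on the one below it. The sketch below is an assumption about how such a pluggable framework could be expressed, not the project's actual API; the class and method names are hypothetical.

    from abc import ABC, abstractmethod

    class CommunicationDriver(ABC):
        """Bottom layer: singlecast/multicast over a concrete network (Ethernet, Myrinet, ...)."""
        @abstractmethod
        def send(self, node, message): ...

    class GroupCommunication(ABC):
        """Middle layer: membership management, failure detection, reliable/atomic multicast."""
        def __init__(self, driver: CommunicationDriver):
            self.driver = driver
        @abstractmethod
        def atomic_multicast(self, message): ...

    class VirtualSynchrony(ABC):
        """Upper layer: replicated memory, state machine, database, RPC/RMI, ..."""
        def __init__(self, group: GroupCommunication):
            self.group = group
        @abstractmethod
        def replicate(self, update): ...

Because each layer only depends on the abstract interface beneath it, a component can be exchanged, for example to trade consistency for speed, without touching the layers above.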

11 Scott_RAS_0614 Scalable, fault-tolerant membership for MPI tasks on HPC systems
 Scalable approach to reconfiguring the communication infrastructure
 Decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults
 Resilience against multiple node failures, even during reconfiguration
 Response time:
– Hundreds of microseconds over MPI on a 1024-node Blue Gene/L
– Single-digit milliseconds over TCP on a 64-node Gigabit Ethernet Linux cluster (XTORC)
 Integration with the Berkeley Lab Checkpoint/Restart (BLCR) mechanism to handle node failures without restarting an entire MPI job
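The slides do not spell out the protocol itself; the toy sketch below only illustrates the general idea of a decentralized membership view, in which each live node monitors a neighbor and the reports are merged into an agreed set of survivors. The ring-monitoring scheme and all names here are assumptions for illustration, not the published algorithm.

    # Toy simulation of a decentralized membership view: nodes sit on a ring,
    # each live node checks its successor, and nodes found dead are removed
    # from the stabilized view. Purely illustrative.
    def stabilize(nodes, failed):
        alive = [n for n in nodes if n not in failed]
        suspected = set()
        for n in alive:
            # Each live node pings the next node in the ring and reports failures.
            successor = nodes[(nodes.index(n) + 1) % len(nodes)]
            while successor in failed:                 # skip over failed neighbors
                suspected.add(successor)
                successor = nodes[(nodes.index(successor) + 1) % len(nodes)]
        return [n for n in nodes if n not in suspected and n not in failed]

    print(stabilize(list(range(8)), failed={2, 3}))    # -> [0, 1, 4, 5, 6, 7]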

12 Scott_RAS_0614 Stabilization time over MPI on BG/L
[Plot: time for stabilization (microseconds) vs. number of nodes (log scale); series: experimental results, distance model, base model]

13 Scott_RAS_0614 Stabilization time over TCP on XTORC
[Plot: time for stabilization (microseconds) vs. number of nodes; series: experimental results, distance model, base model]

14 Scott_RAS_0614 ORNL contacts
Stephen L. Scott, Network and Cluster Computing, Computer Science and Mathematics, (865)
Christian Engelmann, Network and Cluster Computing, Computer Science and Mathematics, (865)