Reliability, Availability, and Serviceability (RAS) for High-Performance Computing
Presented by Stephen L. Scott and Christian Engelmann
Computer Science Research Group, Computer Science and Mathematics Division

Slide 2: Research and development goals
- Provide high-level RAS capabilities for current terascale and next-generation petascale high-performance computing (HPC) systems
- Eliminate many of the numerous single points of failure and control in today's HPC systems
- Develop techniques that enable HPC systems to run computational jobs 24/7
- Develop proof-of-concept prototypes and production-type RAS solutions

Slide 3: MOLAR: Adaptive runtime support for high-end computing operating and runtime systems
- Addresses the challenges for operating and runtime systems to run large applications efficiently on future ultrascale high-end computers
- Part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS)
- MOLAR is a collaborative research effort

Slide 4: Symmetric active/active redundancy
- Many active head nodes
- Workload distribution
- Symmetric replication between head nodes
- Continuous service
- Always up to date
- No fail-over necessary
- No restore-over necessary
- Virtual synchrony model (replication scheme sketched below)
- Complex algorithms
- Prototypes for Torque and the Parallel Virtual File System (PVFS) metadata server
[Diagram: a group of active/active head nodes in front of the compute nodes]
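To make the virtual-synchrony idea concrete, here is a minimal sketch of state-machine replication across the active head nodes. It assumes a hypothetical `group` object standing in for a group-communication layer that delivers every client request to all surviving head nodes in the same total order; the `MetadataService` class and its operations are illustrative, not the actual Torque or PVFS prototype code.

```python
# Illustrative sketch only: each active head node applies the same totally
# ordered request stream, so all replicas stay up to date and no fail-over
# or restore-over is needed when one of them dies.

class MetadataService:
    """Deterministic state machine replicated on every active head node."""

    def __init__(self):
        self.table = {}  # e.g., file name -> metadata record

    def apply(self, request):
        op, name, payload = request
        if op == "create":
            self.table[name] = payload
        elif op == "remove":
            self.table.pop(name, None)
        return self.table.get(name)


def head_node_main(group, service):
    """Event loop run identically on each head node.

    `group` is a placeholder for a virtual-synchrony group-communication
    layer: deliver() hands out client requests in the same total order on
    every surviving member, and reply_once() deduplicates the responses.
    """
    while True:
        request, client = group.deliver()   # same order on all replicas
        result = service.apply(request)     # deterministic -> identical state
        group.reply_once(client, result)    # client sees a single answer
```

Because every replica executes the identical request sequence, any surviving head node can keep serving immediately after a peer fails, which is why no fail-over or restore-over step is required.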

Slide 5: Symmetric active/active Parallel Virtual File System metadata server
[Charts: writing and reading throughput (requests/sec) versus number of clients, comparing plain PVFS with 1, 2, and 4 active/active metadata servers]
[Table: availability and estimated annual downtime by number of metadata server nodes; estimated annual downtime drops from 5 days 4 hours 21 minutes (1 node) to 1 hour 45 minutes (2 nodes) to 1 minute 30 seconds (4 nodes); see the worked conversion below]
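The estimated-downtime column is a direct conversion from availability. As a worked example, the sketch below turns the slide's downtime figures back into approximate availability percentages; the 1-, 2-, and 4-node labels follow the table as given above, and the derived percentages are only as precise as the rounded downtimes.

```python
# Worked example: converting between availability and estimated annual
# downtime for the table above. The derived availability percentages are
# approximate because the downtime figures are rounded.

MINUTES_PER_YEAR = 365 * 24 * 60


def annual_downtime_minutes(availability: float) -> float:
    """Expected annual downtime (minutes) for a given availability in [0, 1]."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_from_downtime(minutes_per_year: float) -> float:
    """Availability implied by an expected annual downtime in minutes."""
    return 1.0 - minutes_per_year / MINUTES_PER_YEAR


# Estimated annual downtimes from the slide, in minutes.
downtimes = {
    "1 node":  5 * 24 * 60 + 4 * 60 + 21,  # 5d 4h 21m
    "2 nodes": 1 * 60 + 45,                # 1h 45m
    "4 nodes": 1.5,                        # 1m 30s
}

for label, minutes in downtimes.items():
    print(f"{label}: ~{availability_from_downtime(minutes):.4%} available")
```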

Slide 6: Reactive fault tolerance for HPC with the LAM/MPI+BLCR job-pause mechanism
- Operational nodes: pause
  - BLCR reuses existing processes
  - LAM/MPI reuses existing connections
  - Restore partial process state from checkpoint
- Failed nodes: migrate
  - Restart process on new node from checkpoint
  - Reconnect with paused processes
- Scalable MPI membership management for low overhead
- Efficient, transparent, and automatic failure recovery (control flow sketched below)
[Diagram: paused MPI processes on live nodes keep their existing connections; the MPI process from the failed node is restarted on a spare node from the checkpoint in shared storage and reconnected to the paused processes]
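The recovery sequence described above can be summarized in the following hypothetical control-flow sketch. The `runtime` object and its methods are placeholders for the MPI run-time's membership and checkpoint services (BLCR process checkpointing plus LAM/MPI connection management); this is not the actual LAM/MPI+BLCR implementation.

```python
# Hypothetical control flow for the job-pause mechanism; `runtime` stands in
# for the MPI run-time's membership and checkpoint services.

def recover_job(runtime, checkpoint_dir, spare_nodes):
    failed_nodes = runtime.detect_failed_nodes()

    # 1. Pause surviving processes in place: reuse the existing OS processes
    #    (BLCR) and network connections (LAM/MPI), then roll their state back
    #    to the last checkpoint.
    survivors = runtime.processes_on_live_nodes()
    for proc in survivors:
        proc.pause()
        proc.restore_partial_state(checkpoint_dir)

    # 2. Migrate processes of the failed nodes: restart them on spare nodes
    #    from the checkpoint in shared storage and reconnect them to the
    #    paused peers.
    for proc, spare in zip(runtime.processes_on(failed_nodes), spare_nodes):
        migrated = runtime.restart_from_checkpoint(proc, checkpoint_dir, spare)
        migrated.reconnect(survivors)

    # 3. Update the scalable MPI membership and resume: the job continues
    #    without a LAM reboot and without being requeued.
    runtime.update_membership()
    runtime.resume_all()
```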

Slide 7: LAM/MPI+BLCR job-pause performance
- 3.4% overhead over job restart, but:
  - No LAM reboot overhead
  - Transparent continuation of execution
  - No requeue penalty
  - Less staging overhead
[Chart: run time in seconds for the NAS Parallel Benchmarks BT, CG, EP, FT, LU, MG, and SP under job pause and migrate, LAM reboot, and job restart]

Slide 8: Proactive fault tolerance for HPC using Xen virtualization
- Standby Xen host (spare node without a guest VM)
- On deteriorating health, migrate the guest VM to the spare node
- The new host generates an unsolicited ARP reply
  - Indicates that the guest VM has moved
  - Tells peers to resend to the new host
- A novel fault-tolerance scheme that acts before a failure impacts the system (daemon loop sketched below)
[Diagram: each compute node runs the Xen VMM with a privileged VM hosting Ganglia and the PFT daemon on top of the hardware BMC, plus a guest VM running MPI tasks; the guest VM migrates to a standby Xen host]
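As a hedged illustration of what a per-node PFT daemon's decision loop might look like: `read_health()` is a hypothetical helper that would gather BMC/Ganglia sensor data, the temperature threshold is arbitrary, and the migration command assumes Xen's classic `xm` toolstack (`xm migrate --live`); none of this is the deployed daemon.

```python
# Illustrative PFT-daemon decision loop, not the actual implementation.
# read_health() is a hypothetical helper that would gather BMC/Ganglia
# sensor data (temperatures, fan speeds, ECC error counts, ...).

import subprocess
import time

TEMP_LIMIT_C = 80.0  # illustrative health threshold


def read_health() -> dict:
    """Placeholder: return a dict of health metrics for this node."""
    raise NotImplementedError("wire up IPMI/lm_sensors/Ganglia here")


def migrate_guest(guest_vm: str, spare_host: str) -> None:
    # Xen live migration: the guest VM (and the MPI tasks inside it) moves to
    # the spare host; the unsolicited ARP reply from the new host redirects
    # in-flight traffic from the peers.
    subprocess.check_call(["xm", "migrate", "--live", guest_vm, spare_host])


def pft_loop(guest_vm: str, spare_host: str, interval_s: float = 10.0) -> None:
    while True:
        health = read_health()
        if health.get("cpu_temp_c", 0.0) > TEMP_LIMIT_C:
            migrate_guest(guest_vm, spare_host)  # act before the node fails
            break
        time.sleep(interval_s)
```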

Slide 9: VM migration performance impact
- Single node failure: 0.5–5% additional cost over total wall clock time
- Double node failure: 2–8% additional cost over total wall clock time
[Charts: run time in seconds for the NAS Parallel Benchmarks BT, CG, EP, LU, and SP; the single-node-failure case compares no migration against one migration, the double-node-failure case compares no migration against one and two migrations]

Slide 10: HPC reliability analysis and modeling
- Programming paradigm and system scale impact reliability
- Reliability analysis:
  - Estimate mean time to failure (MTTF)
  - Obtain the failure distribution: exponential, Weibull, gamma, etc.
- Feedback into fault-tolerance schemes for adaptation (fitting sketched below)
[Charts: cumulative probability versus time between failure (TBF) and negative likelihood values for exponential, Weibull, lognormal, and gamma fits]
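A small sketch of this kind of analysis, estimating MTTF and comparing candidate time-between-failure distributions by their negative log-likelihood with SciPy; the TBF sample is made up for demonstration and the candidate set simply mirrors the distributions named on the slide.

```python
# Illustrative reliability analysis: estimate MTTF and compare candidate
# time-between-failure (TBF) distributions by negative log-likelihood.
# The TBF sample below is made up for demonstration.

import numpy as np
from scipy import stats

tbf_hours = np.array([12.0, 30.5, 7.2, 55.1, 21.8, 9.9, 40.3, 16.4])

mttf = tbf_hours.mean()  # simple MTTF estimate from observed TBF values
print(f"Estimated MTTF: {mttf:.1f} hours")

candidates = {
    "exponential": stats.expon,
    "Weibull": stats.weibull_min,
    "lognormal": stats.lognorm,
    "gamma": stats.gamma,
}

for name, dist in candidates.items():
    params = dist.fit(tbf_hours, floc=0)            # MLE fit, location fixed at 0
    nll = -np.sum(dist.logpdf(tbf_hours, *params))  # lower means a better fit
    print(f"{name:12s} negative log-likelihood: {nll:.2f}")
```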

Slide 11: Contacts regarding RAS research
Stephen L. Scott
Computer Science Research Group, Computer Science and Mathematics Division
(865)
Christian Engelmann
Computer Science Research Group, Computer Science and Mathematics Division
(865)