Chapter 11 Fault Tolerance. Topics Introduction Process Resilience Reliable Group Communication Recovery.

Chapter 11 Fault Tolerance

Topics Introduction Process Resilience Reliable Group Communication Recovery

Basic Concepts
- Failure: state of a system in which it fails to meet its contract
- Error: part of the system state that leads to a failure (e.g. a value differing from its intended value)
- Fault: cause of an error, resulting e.g. from
  - Design errors
  - Manufacturing faults
  - Deterioration
  - External disturbance

Fault → Error → Failure
Remark: the presence of a fault does not guarantee that an error will occur, e.g. a memory cell stuck at 0 causes no error as long as only 0 is stored in it.

Characteristics of Faults: Duration
- Permanent fault
  - Once a component fails, it never works correctly again
  - Easiest to diagnose
- Transient fault
  - Occurs once only
  - 10 times as likely as permanent faults
- Intermittent fault
  - Recurring
  - May appear to be transient (if the period is long)
  - Hard and expensive to detect

Fault Tolerance Includes…
- Fault tolerance: the system can avoid a failure despite the occurrence of faults
- Availability: probability that the system works correctly at a given instant in time
- Reliability: ability to run continuously without failure, typically measured as the expected time between failures
- Safety: absence of catastrophic consequences of a fault
- Maintainability: ease of recovering from a failure (incl. automatic recognition of faults)
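The difference between availability and reliability can be made concrete with a small calculation. The sketch below assumes the common steady-state definition availability = uptime / (uptime + downtime), which the slide itself does not state; the function name and the two example systems are illustrative only.

```python
def availability(uptime: float, downtime: float) -> float:
    """Fraction of time the system is working (both arguments in the same unit)."""
    return uptime / (uptime + downtime)

# System A fails for 1 ms in every hour: very high availability,
# but it never runs for more than an hour without a failure (low reliability).
flaky = availability(uptime=3600.0 - 0.001, downtime=0.001)  # seconds

# System B never crashes but is switched off 2 weeks per year:
# long failure-free runs (high reliability), but lower availability.
seasonal = availability(uptime=50.0, downtime=2.0)  # weeks
```

Under these assumptions system A is the more available one, even though system B runs far longer between interruptions.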

Failure Models
Type of failure: Description
- Crash failure: A server halts, but is working correctly until it halts
- Omission failure: A server fails to respond to incoming requests
  - Receive omission: A server fails to receive incoming messages
  - Send omission: A server fails to send messages
- Timing failure: A server's response lies outside the specified time interval
- Response failure: The server's response is incorrect
  - Value failure: The value of the response is wrong
  - State transition failure: The server deviates from the correct flow of control
- Arbitrary failure (Byzantine failure): A server may produce arbitrary responses at arbitrary times

How to Overcome Failures?
- Design servers that are able to announce that they might fail in the near future?
- Design a DS that is able to detect that a server is down and/or that a server no longer works correctly?
- Design a DS that is able to mask faults via redundancy?
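One common way to approach the second question is a timeout-based failure detector. The following is a minimal sketch under the assumption that servers send periodic heartbeat messages; the class and method names are hypothetical and not taken from the slides. Note that such a detector can only suspect a server: it cannot distinguish a crashed server from a slow one.

```python
class HeartbeatDetector:
    """Suspects a server once no heartbeat has arrived within `timeout` time units."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen = {}  # server name -> time of the most recent heartbeat

    def heartbeat(self, server: str, now: float) -> None:
        """Record a heartbeat received from `server` at time `now`."""
        self.last_seen[server] = now

    def suspected(self, now: float) -> list:
        """Return the servers whose last heartbeat is older than the timeout."""
        return sorted(s for s, t in self.last_seen.items() if now - t > self.timeout)


detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("a", now=0.0)
detector.heartbeat("b", now=0.0)
detector.heartbeat("a", now=2.0)          # only "a" keeps reporting
down = detector.suspected(now=4.0)         # "b" has been silent for 4 > 3 time units
```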

Failure Masking
Hide the occurrence of faults using redundancy
- Information (e.g. additional bits, i.e. error-correcting codes such as the Hamming code)
- Time (e.g. retry an operation; an aborted transaction may be repeated without any side effects)
- Physical
  - Hardware (replicated equipment)
  - Software (replicated server processes/threads)
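As an illustration of information redundancy, the sketch below implements the standard Hamming(7,4) code, which adds three parity bits to four data bits and can correct any single flipped bit. The slide only names the Hamming code; the choice of the (7,4) variant, the bit layout and the function names are assumptions made for this example.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] as the codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the flipped bit, 0 if none
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


sent = hamming74_encode([1, 0, 1, 1])
corrupted = list(sent)
corrupted[4] ^= 1                      # a single-bit fault on the channel
recovered = hamming74_decode(corrupted)
```

The fault is masked: the receiver obtains the original data bits without any retransmission.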

Hardware Redundancy
- Passive (static)
  - Uses fault masking to hide the occurrence of faults
  - No action from the system is required, e.g. voting (see next slide)
- Active (dynamic)
  - Uses comparison for detection and/or diagnosis
  - Removes faulty hardware from the system → reconfiguration
- Hybrid
  - Combines both approaches
  - Masking until diagnosis is complete
  - Expensive, but achieves higher reliability

Failure Masking by Redundancy Triple modular redundancy.
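In triple modular redundancy each component is replicated three times and a voter forwards the majority output, so one faulty replica is masked. A minimal sketch of such a voter follows; the function name is hypothetical, and raising an error when no majority exists is a design choice of this example rather than something the slide prescribes.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value


# Three replicas compute the same function; the second one is faulty.
masked = majority_vote([42, 17, 42])
```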

Stand-by Sparing
- Only one module is driving the outputs
- Other modules are
  - Idle → hot spares
  - Shut down → cold spares
- In case of error detection → switch to a new module
- Hot spares
  - No power-up delays
  - But possibly significant power consumption
- Cold spares
  - The reverse of hot spares: power-up delay, but no power consumption while shut down
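The switching logic of stand-by sparing can be sketched as follows. This is a simplified model in which a module is a Python callable and "error detection" is represented by the module raising an exception; both are assumptions of the example, as are all names used.

```python
class StandbySparing:
    """Drives the output from one active module and switches to a spare on error."""

    def __init__(self, modules):
        self.modules = list(modules)
        self.active = 0  # index of the module currently driving the outputs

    def run(self, x):
        while self.active < len(self.modules):
            try:
                return self.modules[self.active](x)
            except Exception:
                self.active += 1  # error detected: reconfigure to the next spare
        raise RuntimeError("all modules have failed")


def broken_module(x):
    raise RuntimeError("hardware fault")


def spare_module(x):
    return 2 * x


system = StandbySparing([broken_module, spare_module])
result = system.run(3)  # the primary fails, the spare takes over
```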

Failure Masking by Software Redundancy How to improve reliability? What can we do to mask thread/process faults?

Process Resilience
- Protection against process failures
- A group of identical processes provides redundancy
  - Software: multiple processes on the same machine
  - Hardware: processes on different machines
- Multicast communication ensures all members receive all messages (often atomic and ordered)
- Processes can join and leave groups dynamically, e.g. to replace failed processes
- A membership protocol ensures agreement on group membership at any given time

Flat Groups versus Hierarchical Groups a) Communication in a flat group. b) Communication in a simple hierarchical group

Group Management
1. Use a single group server with a single database → typical single point of failure
2. Use a single database but several group servers (standby solution)
3. Manage groups in a distributed way, i.e. every outsider wanting to enter a group sends a corresponding enter_group message via reliable multicast to every current group member, but
   - When does a new group member get all the group-internal messages?
   - When leaving the group, what about messages already sent but not yet received?

Agreement in Faulty Systems (1)
Ensure all non-faulty processes reach consensus in a finite number of steps
1. Reliable processes, faulty communication (omission faults): two-army problem
2. Reliable communication, faulty processes (Byzantine faults)

Agreement in Faulty Systems (2) The Byzantine generals problem for 3 loyal generals and 1 traitor. a) The generals announce their troop strengths (in units of 1 kilosoldier). b) The vectors that each general assembles based on (a). c) The vectors that each general receives in step 3.

Agreement in Faulty Systems (3) The same as in the previous slide, except now with 2 loyal generals and 1 traitor. With m faulty processes, at least 2m+1 correctly functioning processes (i.e. 3m+1 processes in total) are required to reach agreement.
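The vector-exchange procedure behind these two slides can be simulated in a few lines. The sketch below assumes the usual four steps (announce values, assemble vectors, exchange vectors, take a per-element majority); the troop strengths, the function names and the deterministic lie function are illustrative assumptions, not values from the original figures.

```python
from collections import Counter

UNKNOWN = None


def byzantine_agreement(values, traitors, lie):
    """Return the result vector of each loyal process.

    values:   dict process id -> true value (troop strength)
    traitors: set of faulty process ids
    lie:      lie(sender, receiver, slot) -> arbitrary value sent by a traitor
    """
    pids = sorted(values)
    # Steps 1-2: everyone announces a value; each process assembles a vector.
    vector = {
        p: {q: values[q] if (q == p or q not in traitors) else lie(q, p, q)
            for q in pids}
        for p in pids
    }
    results = {}
    for p in pids:
        if p in traitors:
            continue
        # Step 3: p receives the vector of every other process.
        received = [
            {s: lie(q, p, s) for s in pids} if q in traitors else vector[q]
            for q in pids if q != p
        ]
        # Step 4: per element, keep the value reported by a strict majority.
        decided = {}
        for s in pids:
            value, count = Counter(v[s] for v in received).most_common(1)[0]
            decided[s] = value if count > len(received) / 2 else UNKNOWN
        results[p] = decided
    return results


def inconsistent(sender, receiver, slot):
    """The traitor tells every receiver something different."""
    return f"x{receiver}{slot}"


four = byzantine_agreement({1: 1, 2: 2, 3: 3, 4: 4}, {3}, inconsistent)
three = byzantine_agreement({1: 1, 2: 2, 3: 3}, {3}, inconsistent)
```

With 3 loyal generals and 1 traitor, all loyal generals end up with the same vector, in which only the traitor's entry is unknown. With 2 loyal generals and 1 traitor, no element obtains a majority, so no agreement is reached.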

Reliable Group Communication

Basic Reliable-Multicasting Schemes A simple solution to reliable multicasting when all receivers are known and are assumed not to fail. a) Message transmission. b) Reporting feedback.
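A possible reading of this scheme: the sender numbers its messages and keeps each one in a history buffer until every receiver has acknowledged it, while a receiver that detects a gap in the sequence numbers asks for retransmission. The sketch below simulates this in a single process; the class names and the `lost_by` parameter (which models message loss) are hypothetical. A loss of the very last message would go unnoticed in this simplified version, since gaps are only detected when a later message arrives.

```python
class Receiver:
    def __init__(self, name):
        self.name = name
        self.delivered = []
        self.next_seq = 0  # sequence number expected next

    def receive(self, seq, payload):
        """Return ("ACK", seq), or ("NACK", first missing seq) when a gap is detected."""
        if seq == self.next_seq:
            self.delivered.append(payload)
            self.next_seq += 1
            return ("ACK", seq)
        if seq < self.next_seq:
            return ("ACK", seq)  # duplicate, already delivered
        return ("NACK", self.next_seq)


class Sender:
    def __init__(self, receivers):
        self.receivers = receivers
        self.history = {}   # seq -> payload, kept until all receivers have ACKed
        self.unacked = {}   # seq -> names of receivers that have not ACKed yet
        self.next_seq = 0

    def multicast(self, payload, lost_by=()):
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = payload
        self.unacked[seq] = {r.name for r in self.receivers}
        for r in self.receivers:
            if r.name not in lost_by:  # simulate loss on the way to some receivers
                self._handle(r, r.receive(seq, payload))

    def _handle(self, receiver, feedback):
        kind, seq = feedback
        if kind == "ACK":
            pending = self.unacked.get(seq)
            if pending is not None:
                pending.discard(receiver.name)
                if not pending:  # everyone has it: drop it from the history buffer
                    del self.unacked[seq]
                    del self.history[seq]
        else:
            # NACK: retransmit everything from the missing message onwards.
            for s in range(seq, self.next_seq):
                self._handle(receiver, receiver.receive(s, self.history[s]))


a, b = Receiver("a"), Receiver("b")
sender = Sender([a, b])
sender.multicast("m0", lost_by={"b"})  # b misses m0
sender.multicast("m1")                 # b sees the gap and sends a NACK
```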

Scalability
- Feedback implosion: the sender is swamped with feedback messages
- Nonhierarchical multicast: use NACKs
- Feedback suppression: NACKs are multicast to everyone
  - Prevents other receivers from sending a NACK if they have already seen one
  - Reduces the (N)ACK load on the server
  - Receivers have to be coordinated so that they do not all multicast NACKs at the same time
  - Multicasting feedback also interrupts processes that have successfully received the message

Nonhierarchical Feedback Control Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.
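The effect of suppression can be estimated with a small model. In the sketch below every receiver that missed a message schedules its NACK after some delay (random in practice, fixed here to keep the example deterministic), and a receiver cancels its own NACK if it hears another one first. The function name and the single `propagation` delay shared by all receivers are simplifying assumptions.

```python
def nacks_sent(scheduled, propagation=0.0):
    """Return the receivers that actually multicast a NACK.

    scheduled:   dict receiver -> time at which its NACK timer fires
    propagation: time a multicast NACK needs to reach the other receivers
    """
    first = min(scheduled.values())
    # A receiver is suppressed if the earliest NACK reaches it before its timer fires.
    return sorted(
        r for r, t in scheduled.items()
        if t == first or t < first + propagation
    )


timers = {"a": 0.30, "b": 0.10, "c": 0.12}
ideal = nacks_sent(timers)                        # instant propagation
realistic = nacks_sent(timers, propagation=0.05)  # "c" fires before hearing "b"
```

The closer the timers are to each other relative to the propagation delay, the more redundant NACKs are sent, which is why the receivers' timers need to be spread out.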

Hierarchical Feedback Control The essence of hierarchical reliable multicasting. a) Each local coordinator forwards the message to its children. b) A local coordinator handles retransmission requests.

Atomic Multicast: Virtual Synchrony
- Deliver a message either to all group members (in the same order) or to none
- Requires agreement about group membership (what if a replica crashes?)
- Process group
  - Group view: the list of processes the sender has when a message is sent
  - Each message is uniquely associated with a group view
  - View changes need to be ordered with respect to message transmissions: a message is delivered either to the old view or to the new view
- Special case: sender failure

Virtual Synchrony (2) The principle of virtually synchronous multicast. If the sender crashes during the multicast, the message is either delivered to all remaining processes or ignored by each of them.

Implementing Virtual Synchrony
- A message m sent in view G_i is stable if it has been received by all members of G_i
- Only stable messages are delivered
- View changes are announced
  - By the arriving/departing node or the failure-detecting node
  - Via a view change message, followed by any unstable messages in the old view, followed by a flush message
- The view is changed after the flush message has arrived from all members of the old view

Implementing Virtual Synchrony (2) a) Process 4 notices that process 7 has crashed and sends a view change. b) Process 6 sends out all its unstable messages, followed by a flush message. c) Process 6 installs the new view when it has received a flush message from everyone else.
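The three steps can be sketched as a small single-process simulation. Several simplifications are assumed here and are not part of the slides: members call each other's methods directly instead of sending network messages, flush messages are expected only from the surviving members (those in the new view), message ordering is ignored, and no further failure occurs during the view change. All class and method names are hypothetical.

```python
class Member:
    def __init__(self, pid, view):
        self.pid = pid
        self.view = frozenset(view)
        self.unstable = []     # received in the current view, not yet known to be stable
        self.delivered = []    # messages handed to the application
        self.next_view = None  # proposed view while a view change is in progress
        self.flushed = set()   # members whose flush message has arrived

    def accept(self, msg):
        """Receive a message sent in the current view, ignoring duplicates."""
        if msg not in self.unstable and msg not in self.delivered:
            self.unstable.append(msg)

    def on_view_change(self, next_view, group):
        """Forward all unstable messages to the survivors, then send a flush."""
        self.next_view = frozenset(next_view)
        for pid in self.next_view - {self.pid}:
            for msg in self.unstable:
                group[pid].accept(msg)
            group[pid].on_flush(self.pid)
        self._maybe_install()

    def on_flush(self, sender):
        self.flushed.add(sender)
        self._maybe_install()

    def _maybe_install(self):
        """Install the new view once every other survivor has flushed."""
        if self.next_view is not None and self.flushed >= self.next_view - {self.pid}:
            self.delivered.extend(self.unstable)  # these messages are now stable
            self.unstable = []
            self.view, self.next_view, self.flushed = self.next_view, None, set()


# Process 3 crashes after its message "m" has reached process 1 only.
group = {1: Member(1, {1, 2, 3}), 2: Member(2, {1, 2, 3})}
group[1].accept("m")
group[1].on_view_change({1, 2}, group)  # 1 notices the crash and forwards "m"
group[2].on_view_change({1, 2}, group)  # 2 has nothing new to forward and flushes
```

After the exchange both survivors have delivered "m" and installed the new view, so the crashed sender's message is delivered to all remaining members rather than to only one of them.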