Fault Tolerance

Presentation transcript:

Fault Tolerance
A partial failure occurs when a component in a distributed system fails while the rest of the system keeps running.
Approach: build the system in such a way that it continues to operate despite a fault.
Objective: provide what are known as dependable distributed systems.

Features of Dependable Distributed Systems
Dependability entails:
Availability: the system is ready to function well at all times.
Reliability: the system continues to run without failure.
Safety: if the system fails to operate correctly at some point, nothing catastrophic happens.
Maintainability: after a failure, the system is easy to repair.

Factors/Nature of Faulty Behavior
Definition: a system fails when it cannot meet its requirements.
An error is a part of the system's state that may lead to a failure.
A fault is the cause of an error.
A system is fault tolerant if it continues to provide its services in the presence of faults.
Transient faults appear once and then disappear; repeating the operation usually succeeds.
Intermittent faults occur, vanish, then appear again, and so on.
Permanent faults continue to exist until the faulty component is fixed.

Failure Models [Cristian91]
Type of failure: Description
Crash failure: A server halts, but is working correctly until it halts.
Omission failure: A server fails to respond to incoming requests.
    Receive omission: A server fails to receive incoming messages.
    Send omission: A server fails to send messages.
Timing failure: A server's response lies outside the specified time interval.
Response failure: The server's response is incorrect.
    Value failure: The value of the response is wrong.
    State transition failure: The server deviates from the correct flow of control.
Arbitrary failure: A server may produce arbitrary responses at arbitrary times.
Different types of failures. Arbitrary failures are also known as Byzantine failures.

Failure Masking by Redundancy
The key mechanism for masking failures is redundancy. There are three types (or dimensions):
Information redundancy: extra bits are added, for example a Hamming code.
Time redundancy: an action is performed and then, if needed, performed again (example: the transaction model).
Physical redundancy: extra equipment or processes are added, for example triple modular redundancy (replication of devices/equipment).
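Triple modular redundancy can be illustrated with a small voter. This is only a sketch: the replica functions `good` and `faulty` and the helper names `vote` and `tmr` are hypothetical and do not come from the slides. Three replicas compute the same result and a majority voter masks a single faulty one.

```python
from collections import Counter

def vote(outputs):
    """Return the value produced by a strict majority of replicas, or None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

def tmr(replicas, x):
    """Run every replica on the same input and vote on the outputs."""
    return vote([replica(x) for replica in replicas])

def good(x):
    return x * x

def faulty(x):
    # Hypothetical fault: this replica is stuck at a wrong output.
    return -1

# One faulty replica out of three is masked by the voter.
print(tmr([good, good, faulty], 7))
# Two faulty replicas out of three are not masked.
print(tmr([good, faulty, faulty], 7))
```

With two of the three replicas faulty in the same way, the voter returns the wrong value, which is why this arrangement masks only one fault per stage.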

Process Resilience
Issue: what happens when processes fail, and how can this be overcome?
Main solution: organize replicated processes into groups, so that if one process fails another takes over.
Issues:
Design of groups.
Reaching agreement within a group when one or more parties cannot be trusted.

Group Process Organization
Figure: communication in a flat group; communication in a simple hierarchical group.
A method is needed to create and delete groups, and to allow processes to join and leave groups. One option: a group server.

Group Server
A group server maintains a complete database of all groups and their relationships.
This approach suffers from being a single point of failure.
Otherwise, some distributed technique has to be used.
If (reliable) multicast is available, an outside process can send a request to all group members asking to join. The same applies to a process departing from a group.
Trouble arises when a site has crashed (or is very slow): it cannot announce that it is leaving.
Leaving and joining a group have to be synchronous with data transmission.

Agreement among Processes
Main problem: have all non-faulty processes reach consensus on some issue, and do so within a finite number of steps.
System parameters are important in determining solutions:
Reliable or unreliable communication channels.
Crash/failure semantics.

Distributed Problem of the Two Armies
The Red Army is in the valley (5000 soldiers).
Two Blue Armies are on the hills (4000 soldiers each).
If the two blue armies can coordinate a combined assault, they emerge victorious; otherwise they do not.
Messengers go through the valley (i.e., an unreliable channel) to pass messages back and forth between the two blue armies.
Whichever general sent the most recent messenger cannot know whether it arrived, so yet another messenger is always needed from one blue army to the other. The protocol may never terminate.

Byzantine Generals Problem
The red army is still in the valley.
The n blue armies are on the hills.
Communication between the blue armies is pairwise, instantaneous, and perfect.
m of the blue generals are traitors, who try to prevent the loyal generals from reaching agreement.
Each general is assumed to know how many troops he has.
Approach: the blue generals exchange information about their own troop strengths, and at the end of a (distributed) algorithm each general holds a vector of length n covering all the armies.
If general i is loyal, then element i is his troop strength.

Sketch of the Byzantine Generals Algorithm
Assumption: general i has i kilosoldiers.
Figure: the Byzantine generals problem for 3 loyal generals and 1 traitor (process 3).
(a) The generals announce their troop strengths (in units of 1 kilosoldier).
(b) The vectors that each general assembles based on (a).
(c) The vectors that each general receives in step 3.
The result is reached by taking, for each element, the majority of the received values.

The Algorithm Does Not Always Work
Figure: the same as in the previous slide, except now with 2 loyal generals and 1 traitor. No value obtains a majority, so the loyal generals cannot agree.
Lamport showed that if there are m traitors, there must be at least 2m+1 loyal generals (3m+1 generals in total) for the algorithm to work properly.
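The two cases above (3 loyal generals plus 1 traitor, and 2 loyal generals plus 1 traitor) can be replayed with a small simulation. This is a minimal sketch of the simplified exchange-and-majority scheme described on the slides, not a full Byzantine agreement implementation. The function names `lie` and `agree` are hypothetical, and the traitor's behaviour (a distinct bogus negative number for every receiver and element) is an assumption chosen to make the run deterministic.

```python
from collections import Counter

UNKNOWN = None

def lie(sender, receiver, slot):
    # Assumed traitor behaviour: a different bogus value for every
    # receiver and every vector element.
    return -(100 * sender + 10 * receiver + slot)

def agree(n, traitors):
    """Generals are numbered 1..n; loyal general i has i kilosoldiers."""
    generals = range(1, n + 1)
    loyal = [g for g in generals if g not in traitors]
    # Announce strengths; each loyal general assembles a vector of what it heard.
    heard = {
        g: {s: (lie(s, g, s) if s in traitors else s) for s in generals}
        for g in loyal
    }
    # Every general forwards its vector to all others; each loyal general
    # then takes the majority per element over the vectors it received.
    result = {}
    for g in loyal:
        vector = []
        for slot in generals:
            reports = [
                lie(k, g, slot) if k in traitors else heard[k][slot]
                for k in generals if k != g
            ]
            value, count = Counter(reports).most_common(1)[0]
            vector.append(value if count > len(reports) / 2 else UNKNOWN)
        result[g] = vector
    return result

# 3 loyal generals, traitor is general 3: loyal elements are agreed upon.
print(agree(4, {3}))
# 2 loyal generals, traitor is general 3: nothing reaches a majority.
print(agree(3, {3}))
```

In the four-general run every loyal general ends with the same vector, with only the traitor's element left unknown. In the three-general run each loyal general receives just two reports per element, so a single bogus report is enough to prevent a majority.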

Reliable Communication among Systems
Point to point: TCP mainly delivers the reliability (it masks lost messages).
RPC semantics in the presence of failures; five cases:
The client is unable to locate the server.
The request message from the client to the server is lost.
The server crashes after receiving the request.
The reply message from the server to the client is lost.
The client crashes after sending a request.

RPC Semantics in the Presence of Failures
Client is unable to locate the server
Possible solution: raise an exception.
Two drawbacks:
Writing an exception handler is not always easy (it is a big problem if the language used does not support exception handling or signaling of some sort).
Using an exception handler may violate the overall requirement of transparency in the distributed system.
Lost request message
Use timers to detect that a message has been lost, and retransmit it.

RPC Semantics in the Presence of Failures
Server crashes
Figure: a server in client-server communication. (a) Normal case. (b) Crash after execution. (c) Crash before execution.
The main problem is the correct treatment of cases (b) and (c): the client's operating system cannot differentiate between the two.
Three approaches exist:
At-least-once semantics: wait until the server reboots and try the operation again. This guarantees that the RPC has been carried out at least one time, but possibly more.
At-most-once semantics: the RPC gives up immediately and reports failure. This guarantees that the RPC has been carried out at most one time, but possibly not at all.
Guarantee nothing: the RPC may have been executed anywhere from zero to many times.

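RPC Semantics in the Presence of Failures
Table: different combinations of client and server strategies in the presence of server crashes.
Client reissue strategies (rows): Always; Never; Only when ACKed; Only when not ACKed.
Server strategy M -> P (columns): MPC, MC(P), C(MP).
Server strategy P -> M (columns): PMC, PC(M), C(PM).
Cell values: DUP (the operation is carried out twice), OK (carried out once), ZERO (not carried out at all).
Notation: M is sending the completion message, P is performing the operation, C is the crash; events in parentheses never happen because of the crash.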
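The individual cells of this table can be derived mechanically. The sketch below assumes the usual textbook reading of the abbreviations (M = the server sends its completion message, P = the server performs the operation, C = the crash, parenthesized events never happen) and assumes that a reissued request is executed exactly once by the recovered server. The function name `outcome` is hypothetical.

```python
SCENARIOS = ["MPC", "MC(P)", "C(MP)", "PMC", "PC(M)", "C(PM)"]
CLIENT_STRATEGIES = ["Always", "Never", "Only when ACKed", "Only when not ACKed"]

def outcome(scenario, client_strategy):
    """Classify one table cell as DUP, OK, or ZERO."""
    # Events written before the 'C' actually happened before the crash.
    done = scenario.split("C")[0]
    acked = "M" in done        # the client saw the completion message
    reissue = {
        "Always": True,
        "Never": False,
        "Only when ACKed": acked,
        "Only when not ACKed": not acked,
    }[client_strategy]
    # One execution if P happened before the crash, one more if reissued.
    executions = int("P" in done) + int(reissue)
    return ("ZERO", "OK", "DUP")[executions]

for strategy in CLIENT_STRATEGIES:
    cells = [f"{outcome(s, strategy):5}" for s in SCENARIOS]
    print(f"{strategy:20}", *cells)
```

No row consists only of OK entries, which is the point of the table: no combination of client and server strategy works correctly for every possible crash sequence.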

RPC Semantics in the Presence of Failures
Lost reply messages
Use timeouts (but the client cannot be certain whether a timeout is due to a lost reply or merely a slow server).
Idempotent operations help: they can safely be repeated.
Transactional requests cannot be dealt with this way (choose another model).
Client crashes
A client crash creates orphan processes; orphans waste CPU cycles for nothing.
What can be done about orphans?
Extermination: before an RPC is sent out, create a disk-log entry; after a reboot, the log is checked and the orphans are killed.
Reincarnation: divide time into epochs; when a client reboots, it broadcasts a new epoch, and obsolete remote computations made on behalf of that client are killed.
Gentle reincarnation: when an epoch broadcast arrives, each machine checks whether it has any remote computations; if so, it tries to locate their owner. If the owner cannot be found, the computation is killed.
Expiration: each RPC is given an amount of time T to complete. If it has not completed, it must explicitly ask for another T seconds, and so on.
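The remark about idempotent operations under lost replies can be made concrete with a toy example. Everything here is hypothetical (the `Server` class, its two operations, and the `call_with_retry` helper); the lost reply is simulated by simply discarding the first few return values, which stands in for a client timeout followed by retransmission.

```python
class Server:
    def __init__(self):
        self.balance = 100

    def set_balance(self, value):
        # Idempotent: repeating the request leaves the same state.
        self.balance = value
        return self.balance

    def deposit(self, amount):
        # Not idempotent: every repetition changes the state again.
        self.balance += amount
        return self.balance

def call_with_retry(operation, arg, replies_lost):
    """Invoke operation(arg); the first `replies_lost` replies are dropped,
    so the client times out and retransmits the same request."""
    for attempt in range(replies_lost + 1):
        reply = operation(arg)      # the server executes the request
        if attempt < replies_lost:
            continue                # reply lost on the way back
        return reply

a = Server()
print(call_with_retry(a.set_balance, 110, replies_lost=1))

b = Server()
print(call_with_retry(b.deposit, 10, replies_lost=1))
```

Retransmitting `set_balance` is harmless, while retransmitting `deposit` applies the deposit twice, which is why a lost reply cannot be handled by blind retransmission for non-idempotent requests.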

Two-Phase Commit
Figure: the finite state machine for the coordinator in 2PC; the finite state machine for a participant.

Two-Phase Commit
State of Q    Action by P
COMMIT        Make transition to COMMIT
ABORT         Make transition to ABORT
INIT          Make transition to ABORT
READY         Contact another participant
Actions taken by a participant P when residing in state READY and having contacted another participant Q.

Two-Phase Commit
Actions by coordinator:
    write START_2PC to local log;
    multicast VOTE_REQUEST to all participants;
    while not all votes have been collected {
        wait for any incoming vote;
        if timeout {
            write GLOBAL_ABORT to local log;
            multicast GLOBAL_ABORT to all participants;
            exit;
        }
        record vote;
    }
    if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
        write GLOBAL_COMMIT to local log;
        multicast GLOBAL_COMMIT to all participants;
    } else {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
    }
Outline of the steps taken by the coordinator in a two-phase commit protocol.

Two-Phase Commit
Actions by participant:
    write INIT to local log;
    wait for VOTE_REQUEST from coordinator;
    if timeout {
        write VOTE_ABORT to local log;
        exit;
    }
    if participant votes COMMIT {
        write VOTE_COMMIT to local log;
        send VOTE_COMMIT to coordinator;
        wait for DECISION from coordinator;
        if timeout {
            multicast DECISION_REQUEST to other participants;
            wait until DECISION is received;  /* remain blocked */
            write DECISION to local log;
        }
        if DECISION == GLOBAL_COMMIT
            write GLOBAL_COMMIT to local log;
        else if DECISION == GLOBAL_ABORT
            write GLOBAL_ABORT to local log;
    } else {
        write VOTE_ABORT to local log;
        send VOTE_ABORT to coordinator;
    }
Steps taken by a participant process in 2PC.

Two-Phase Commit
Actions for handling decision requests:  /* executed by separate thread */
    while true {
        wait until any incoming DECISION_REQUEST is received;  /* remain blocked */
        read most recently recorded STATE from the local log;
        if STATE == GLOBAL_COMMIT
            send GLOBAL_COMMIT to requesting participant;
        else if STATE == INIT or STATE == GLOBAL_ABORT
            send GLOBAL_ABORT to requesting participant;
        else
            skip;  /* participant remains blocked */
    }
Steps taken for handling incoming decision requests.
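The three 2PC outlines (coordinator, participant, decision-request handler) can be condensed into a single-process simulation. This is a sketch under simplifying assumptions: there is no real messaging or threading, a timeout is modelled by passing `None`, and a blocked participant simply returns with no decision in its log. The function names and argument conventions are hypothetical.

```python
def coordinator(votes, coordinator_vote="COMMIT"):
    """votes maps participant name -> 'VOTE_COMMIT', 'VOTE_ABORT',
    or None (None models a timeout while collecting votes)."""
    log = ["START_2PC"]
    if any(v is None for v in votes.values()):
        log.append("GLOBAL_ABORT")          # timeout: abort immediately
        return "GLOBAL_ABORT", log
    all_commit = all(v == "VOTE_COMMIT" for v in votes.values())
    if all_commit and coordinator_vote == "COMMIT":
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"
    log.append(decision)
    return decision, log

def answer_decision_request(state):
    """What a participant in `state` replies to a DECISION_REQUEST."""
    if state == "GLOBAL_COMMIT":
        return "GLOBAL_COMMIT"
    if state in ("INIT", "GLOBAL_ABORT"):
        return "GLOBAL_ABORT"
    return None                             # e.g. READY: cannot help

def participant(vote, decision, peer_states=()):
    """decision=None models a timeout waiting for the coordinator;
    the participant then asks its peers. Returns the local log."""
    log = ["INIT"]
    if vote != "COMMIT":
        log.append("VOTE_ABORT")
        return log
    log.append("VOTE_COMMIT")
    if decision is None:
        for state in peer_states:
            decision = answer_decision_request(state)
            if decision is not None:
                break
    if decision is not None:
        log.append(decision)                # otherwise: remains blocked
    return log

print(coordinator({"p1": "VOTE_COMMIT", "p2": "VOTE_COMMIT"}))
print(participant("COMMIT", None, peer_states=["READY", "INIT"]))
print(participant("COMMIT", None, peer_states=["READY"]))
```

The last call shows the blocking behaviour of 2PC: if the coordinator is unreachable and every reachable peer is itself in READY, the participant has voted to commit but cannot learn the outcome.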

Three-Phase Commit
Figure: finite state machine for the coordinator in 3PC; finite state machine for a participant.

Recovery: Stable Storage
Figure panels: crash after drive 1 is updated; bad spot.
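The slide keeps only the figure captions, so the following is a sketch of the commonly described two-drive scheme rather than the presenter's own material. It assumes that every block is written to drive 1 first and then to drive 2, that each block carries a checksum, and that recovery copies drive 1 over drive 2 when the two differ and copies from the intact drive when a checksum fails. The class and helper names are hypothetical.

```python
import zlib

def make_block(data):
    return {"data": data, "crc": zlib.crc32(data)}

def is_valid(block):
    return zlib.crc32(block["data"]) == block["crc"]

class StableStorage:
    def __init__(self, nblocks):
        self.drive1 = [make_block(b"") for _ in range(nblocks)]
        self.drive2 = [make_block(b"") for _ in range(nblocks)]

    def write(self, index, data, crash_after_drive1=False):
        self.drive1[index] = make_block(data)   # always update drive 1 first
        if crash_after_drive1:
            return                              # simulated crash
        self.drive2[index] = make_block(data)

    def recover(self):
        for i in range(len(self.drive1)):
            b1, b2 = self.drive1[i], self.drive2[i]
            if is_valid(b1) and (not is_valid(b2) or b1 != b2):
                self.drive2[i] = dict(b1)       # drive 1 holds the newer copy
            elif is_valid(b2) and not is_valid(b1):
                self.drive1[i] = dict(b2)       # bad spot on drive 1

    def read(self, index):
        block = self.drive1[index]
        if not is_valid(block):
            block = self.drive2[index]
        return block["data"]

store = StableStorage(2)
store.write(0, b"old")
store.write(0, b"new", crash_after_drive1=True)  # crash after drive 1 is updated
store.write(1, b"x")
store.drive1[1]["crc"] ^= 1                      # simulate a bad spot
store.recover()
print(store.drive2[0]["data"], store.read(1))
```

After recovery both drives agree again: the interrupted write is completed from drive 1, and the damaged block is restored from drive 2.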

Checkpointing
Figure: a recovery line.

Independent Checkpointing
Figure: the domino effect.

Message Logging
Figure: incorrect replay of messages after recovery, leading to an orphan process.