Fault Tolerance

Basic Concepts
- Availability: the system is ready to be used immediately.
- Reliability: the system can run continuously without failure.
- Safety: when the system fails, nothing catastrophic happens.
- Maintainability: a failed system can be easily repaired.
- Fault types: transient, intermittent, permanent.

Failure Models
Different types of failures:

| Type of failure | Description |
|---|---|
| Crash failure | A server halts, but is working correctly until it halts |
| Omission failure | A server fails to respond to requests |
| - Receive omission | A server fails to receive incoming messages |
| - Send omission | A server fails to send messages |
| Timing failure | A server's response lies outside the specified time interval |
| Response failure | The server's response is incorrect |
| - Value failure | The value of the response is wrong |
| - State transition failure | The server deviates from the correct flow of control |
| Arbitrary failure | A server may produce arbitrary responses at arbitrary times |

Failure Masking by Redundancy
- Information redundancy (extra bits)
- Time redundancy (extra operations)
- Physical redundancy (extra equipment or processes)

Failure Masking by Redundancy
Figure: triple modular redundancy (TMR), an electronic circuit example.
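In TMR each component is replicated three times and a voter passes on the value that at least two replicas agree on, so a single faulty replica is masked. The sketch below illustrates the voting step only; the function names and the example replicas are assumptions made for illustration, not part of the slides.

```python
from collections import Counter

def majority_vote(a, b, c):
    """Return the value produced by at least two of the three replicas."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three outputs differ: the voter cannot mask this failure.
        raise ValueError("no majority: all three replicas disagree")
    return value

def tmr_stage(replicas, x):
    """Run three replicas of one stage on input x and vote on the result."""
    outputs = [replica(x) for replica in replicas]
    return majority_vote(*outputs)

# Hypothetical replicas: two healthy ones and one that returns garbage.
def healthy(x):
    return x + 1

def faulty(x):
    return -999

print(tmr_stage([healthy, faulty, healthy], 41))
```

With one faulty replica the stage still produces the healthy result; with two faulty replicas that agree on a wrong value, the voter would pass the wrong value on.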

Process Failures
- To tolerate a faulty process, identical processes are organized into a group.
- When one process of the group fails, another process in the group takes over its work.
- Process groups may be dynamic, so mechanisms are needed for managing group membership:
  - A group server maintains the membership information (centralized).
  - Distributed management (more complex and more time consuming).

Flat Groups versus Hierarchical Groups
a) Communication in a flat group: decisions are taken by voting, which makes them slow. Used by replicated-write protocols.
b) Communication in a simple hierarchical group: the coordinator is a single point of failure. Used by primary-based protocols.

Client-Server Communication Failures
Using a reliable transport protocol (TCP) masks omission failures, but many other failures are not masked.
Classes of failure:
1. The client is unable to locate the server: raising an exception is a solution, but we lose transparency.
2. The request message from the client to the server is lost: retransmission.
3. The server crashes after receiving a request.
4. The reply message from the server to the client is lost: retransmission, but…
5. The client crashes after sending a request: an orphan is generated (remedies: extermination, reincarnation with epoch numbers, gentle reincarnation, expiration…).

Server Crashes (1)
A server in client-server communication: a) normal case, b) crash after execution, c) crash before execution.
- At-least-once semantics: after the server reboots, keep retrying until a reply is received.
- At-most-once semantics: give up immediately and report the failure.
- Exactly-once semantics: cannot be guaranteed in general.
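The difference between the first two semantics can be sketched with a toy server whose first few replies never reach the client. Everything here (the `Server` class, the `lost_replies` counter, the function names) is a hypothetical model for illustration; a lost reply stands in for "crash after execution".

```python
class Server:
    """Toy server: executes every request, but some replies never arrive."""

    def __init__(self, lost_replies):
        self.executions = 0               # how many times the operation ran
        self.lost_replies = lost_replies  # number of initial replies that are lost

    def handle(self, request):
        self.executions += 1              # the operation is carried out
        if self.lost_replies > 0:
            self.lost_replies -= 1
            return None                   # the client sees no reply
        return "done"

def at_least_once(server, request, max_tries=10):
    """Keep retrying until a reply is received; the operation may run several times."""
    for _ in range(max_tries):
        reply = server.handle(request)
        if reply is not None:
            return reply
    raise TimeoutError("no reply after retries")

def at_most_once(server, request):
    """Try once and report failure immediately; the operation runs zero or one time."""
    reply = server.handle(request)
    if reply is None:
        raise TimeoutError("failure reported to the caller, no retry")
    return reply
```

With two lost replies, `at_least_once` executes the operation three times before it succeeds, while `at_most_once` executes it once and reports a failure even though the work was done. Neither gives exactly-once behaviour.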

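Server Crashes (2)
Example: a client sends a message to a server asking it to print (P) some text, and the server sends a completion message (M) back. The server can crash (C).
Different combinations of client and server strategies in the presence of server crashes. Under strategy M -> P the server sends the completion message before printing; under strategy P -> M it prints first. Events in parentheses never happen because the server has already crashed. OK = the text is printed once, DUP = the text is printed twice, ZERO = the text is not printed.

| Client reissue strategy | M -> P: MPC | M -> P: MC(P) | M -> P: C(MP) | P -> M: PMC | P -> M: PC(M) | P -> M: C(PM) |
|---|---|---|---|---|---|---|
| Always | DUP | OK | OK | DUP | DUP | OK |
| Never | OK | ZERO | ZERO | OK | OK | ZERO |
| Only when ACKed | DUP | OK | ZERO | DUP | OK | ZERO |
| Only when not ACKed | OK | ZERO | OK | OK | DUP | OK |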
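The outcomes in this table follow mechanically from two facts per scenario: whether the text was printed before the crash, and whether the completion message was sent before the crash. The sketch below enumerates them under the assumption that a reissued request is always printed by the recovered server; the names `STRATEGIES`, `outcome` and `crash_table` are made up for illustration.

```python
# Client reissue strategies: each decides, from whether the completion
# message (M) was received, whether to send the print request again.
STRATEGIES = {
    "Always": lambda acked: True,
    "Never": lambda acked: False,
    "Only when ACKed": lambda acked: acked,
    "Only when not ACKed": lambda acked: not acked,
}

def outcome(done, reissue):
    """done: the events completed before the crash, e.g. "MP", "M", "" or "P"."""
    printed = "P" in done
    acked = "M" in done
    # Assumption: a reissued request is printed once after the server recovers.
    copies = int(printed) + int(reissue(acked))
    return ["ZERO", "OK", "DUP"][copies]

def crash_table():
    """Return {client strategy: six outcomes}, columns MPC, MC(P), C(MP), PMC, PC(M), C(PM)."""
    rows = {}
    for name, reissue in STRATEGIES.items():
        row = []
        for order in ("MP", "PM"):        # server strategy M -> P, then P -> M
            for completed in (2, 1, 0):   # crash after both, after one, before any event
                row.append(outcome(order[:completed], reissue))
        rows[name] = row
    return rows

for name, row in crash_table().items():
    print(f"{name:20s}", " ".join(f"{cell:5s}" for cell in row))
```

No row is OK in every column, which is the point of the slide: no combination of client and server strategy works correctly for all crash orderings.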

Group Communication: Basic Reliable-Multicasting Schemes
- Reliable multicasting is important for messaging within a process group.
- A simple solution to reliable multicasting when all receivers are known and are assumed not to fail: a) message transmission, b) reporting feedback.
- Efficient only for a small number of receivers (possible refinements: negative acknowledgements only, timers, etc.).
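The simple scheme can be sketched as a sender that keeps a message until every known receiver has acknowledged it and retransmits to those that have not. This is a minimal model, assuming a hypothetical `send` callback that returns True when an acknowledgement comes back; real protocols add sequence numbers, history buffers and timers.

```python
def reliable_multicast(message, receivers, send, max_rounds=5):
    """Send message to all receivers, retransmitting until each one has ACKed.

    send(receiver, message) is a hypothetical transport call that returns
    True if an acknowledgement was received and False otherwise.
    Returns (rounds used, receivers still unacknowledged).
    """
    pending = set(receivers)
    rounds = 0
    while pending and rounds < max_rounds:
        rounds += 1
        acked = {r for r in pending if send(r, message)}
        pending -= acked                  # only unacknowledged receivers are retried
    return rounds, pending

# Scripted lossy transport: the first transmission to r2 is lost.
losses = {"r2": 1}

def lossy_send(receiver, message):
    if losses.get(receiver, 0) > 0:
        losses[receiver] -= 1
        return False
    return True

print(reliable_multicast("m1", ["r1", "r2", "r3"], lossy_send))
```

Every receiver sends one acknowledgement per message, so the feedback traffic at the sender grows with the group size, which is why the scheme only works for small groups.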

Nonhierarchical Feedback Control
- To scale, we need to reduce the number of feedback messages, using feedback suppression.
- Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of the others (Scalable Reliable Multicasting protocol).
- Drawbacks: timing problems, useless retransmissions, or a complicated organization of the group membership.
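Feedback suppression can be sketched as follows: every receiver that missed a message schedules its retransmission request at some delay, and cancels it if it hears another receiver's request first. The model below is an assumption-laden toy (a single shared propagation delay, timers given in milliseconds, hypothetical function name), meant only to show why timing matters.

```python
def requests_sent(timers_ms, propagation_delay_ms):
    """Return the receivers that actually multicast a retransmission request.

    timers_ms maps each receiver that missed the message to the time at which
    its request is scheduled. The earliest request reaches everyone after
    propagation_delay_ms; receivers whose timer fires later suppress theirs.
    """
    first = min(timers_ms.values())
    heard_at = first + propagation_delay_ms
    return sorted(r for r, t in timers_ms.items() if t == first or t < heard_at)

# Four receivers missed the message; r2's timer fires first.
timers = {"r1": 300, "r2": 100, "r3": 120, "r4": 550}
print(requests_sent(timers, propagation_delay_ms=50))
```

Here r3's timer expires before r2's request arrives, so two requests are sent instead of one. Spreading the timers further apart reduces such duplicates at the cost of slower recovery.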

Hierarchical Feedback Control
The essence of hierarchical reliable multicasting:
- The receivers are partitioned into subgroups that are organized as a tree.
- Each local coordinator forwards the message to its children.
- A local coordinator handles retransmission requests.
- Acknowledgements are exchanged between coordinators.

Atomic Multicast
- In the presence of process failures, we need the guarantee that a message is delivered either to all receivers or to none. This leads to the atomic multicast problem.
- Atomic multicasting ensures that group members remain consistent.
- The logical organization of a distributed system distinguishes between message receipt and message delivery.
- In atomic multicasting, a multicast message is uniquely associated with a list of receiving processes (the group view).
- A view change takes place when a process joins or leaves the group.

Virtual Synchrony
- The principle of virtually synchronous multicast: a view change plays a role similar to a synchronization variable.
- We need ordered, reliable multicasting.
- Virtual synchrony guarantees that a message sent to a group view is delivered to each non-faulty member of the group.
- If the sender crashes, the message is either delivered to all the other processes or ignored by each of them.

Message Ordering
Four different types of multicast ordering:
- Reliable, unordered multicast: no guarantee is given on the order in which messages are delivered.
- FIFO-ordered multicast: messages from the same process are delivered in the order in which they were sent.
- Causally ordered multicast: causality between messages is preserved.
- Totally ordered multicast: messages are delivered in the same order to all members of the group.
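FIFO ordering is commonly implemented with per-sender sequence numbers and a hold-back queue: a message that is received early is held until all earlier messages from the same sender have been delivered. The sketch below is one such implementation under that assumption; the class and attribute names are illustrative and not taken from the slides.

```python
from collections import defaultdict

class FifoReceiver:
    """Delivers messages from each sender in the order they were sent."""

    def __init__(self):
        self.next_seq = defaultdict(int)  # next expected sequence number per sender
        self.held = defaultdict(dict)     # sender -> {seq: message} received too early
        self.delivered = []               # messages handed to the application

    def receive(self, sender, seq, message):
        if seq < self.next_seq[sender]:
            return                        # duplicate of an already delivered message
        self.held[sender][seq] = message  # received, but not necessarily delivered yet
        # Deliver as many consecutive messages from this sender as possible.
        while self.next_seq[sender] in self.held[sender]:
            self.delivered.append(self.held[sender].pop(self.next_seq[sender]))
            self.next_seq[sender] += 1

r = FifoReceiver()
r.receive("P1", 1, "m2")   # arrives before m1, so it is held back
r.receive("P4", 0, "m3")   # other sender: delivered at once
r.receive("P1", 0, "m1")   # releases m1 and then m2
print(r.delivered)
```

Messages from different senders may still interleave differently at different receivers; only the per-sender order is fixed.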

Message Ordering
Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting:

| Process P1 | Process P2 | Process P3 | Process P4 |
|---|---|---|---|
| sends m1 | receives m1 | receives m3 | sends m3 |
| sends m2 | receives m3 | receives m1 | sends m4 |
| | receives m2 | receives m2 | |
| | receives m4 | receives m4 | |

Unordered multicast: three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.

| Process P1 | Process P2 | Process P3 |
|---|---|---|
| sends m1 | receives m1 | receives m2 |
| sends m2 | receives m2 | receives m1 |

Message Ordering
Six different versions of virtually synchronous reliable multicasting:

| Multicast | Basic message ordering | Total-ordered delivery? |
|---|---|---|
| Reliable multicast | None | No |
| FIFO multicast | FIFO-ordered delivery | No |
| Causal multicast | Causal-ordered delivery | No |
| Atomic multicast | None | Yes |
| FIFO atomic multicast | FIFO-ordered delivery | Yes |
| Causal atomic multicast | Causal-ordered delivery | Yes |

Virtually synchronous reliable multicasting that offers totally ordered delivery is called atomic multicasting.

Distributed Commit
- Distributed commit means that an operation has to be performed by every member of a group or by none at all.
- One-phase distributed commit is performed using a coordinator, but if a participant cannot perform the operation, it has no means of telling the coordinator.
- In two-phase commit, the first phase is the voting phase and the second is the decision phase.
- Timeout mechanisms are necessary, because the coordinator can crash.
Figure: a) the finite state machine for the coordinator in two-phase commit; b) the finite state machine for a participant.

Two-Phase Commit
Phase 1 (voting phase):
1. The coordinator sends a vote_request to all participants.
2. Each participant returns a vote_commit (it is ready to commit its part of the transaction) or a vote_abort.
Phase 2 (decision phase):
3. The coordinator collects the votes and sends a global_commit, or a global_abort if at least one participant has sent a vote_abort.
4. A participant that receives a global_commit locally commits the transaction; a participant that receives a global_abort locally aborts it.
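The two phases can be sketched in a few lines. This is a failure-free model only: it assumes every message arrives and nobody crashes, so the timeouts and recovery logic that a real implementation needs are left out. The `Participant` class and its state names are illustrative choices.

```python
class Participant:
    """Toy participant; can_commit says whether its part of the transaction can commit."""

    def __init__(self, name, can_commit):
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def on_vote_request(self):
        if self.can_commit:
            self.state = "READY"          # promised to commit if told to
            return "vote_commit"
        self.state = "ABORT"
        return "vote_abort"

    def on_decision(self, decision):
        self.state = "COMMIT" if decision == "global_commit" else "ABORT"

def run_two_phase_commit(participants):
    # Phase 1 (voting): send vote_request to everyone and collect the votes.
    votes = [p.on_vote_request() for p in participants]
    # Phase 2 (decision): commit only if every single vote is vote_commit.
    if all(v == "vote_commit" for v in votes):
        decision = "global_commit"
    else:
        decision = "global_abort"
    for p in participants:
        p.on_decision(decision)
    return decision

group = [Participant("A", True), Participant("B", False), Participant("C", True)]
print(run_two_phase_commit(group), [p.state for p in group])
```

A single vote_abort is enough to abort the transaction at every participant, including those that voted to commit.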

Three-Phase Commit
- It avoids blocking processes when the coordinator crashes.
- There is no single state from which it is possible to make a transition directly to either COMMIT or ABORT.
- There is no state in which it is not possible to make a final decision and from which a transition to COMMIT can be made.

Recovery
- Backward recovery brings the system back to a previous correct state. This requires recording the state from time to time (checkpointing).
- Forward recovery attempts to bring the system into a correct new state from which execution can continue.
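Backward recovery with checkpointing can be sketched as saving a copy of the state and restoring it after an error is detected. The sketch assumes the whole state fits in one in-memory object and that a single checkpoint is kept; the `Recoverable` class is hypothetical, and real systems write checkpoints to stable storage.

```python
import copy

class Recoverable:
    """Keeps one checkpoint of its state and can roll back to it."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Record the current (assumed correct) state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Backward recovery: return to the last recorded state.
        self.state = copy.deepcopy(self._checkpoint)

account = Recoverable({"balance": 100})
account.state["balance"] -= 30
account.checkpoint()                 # state {"balance": 70} is saved
account.state["balance"] = -1        # an error corrupts the state
account.rollback()
print(account.state)
```

Everything done after the last checkpoint is lost on rollback, so frequent checkpoints lose less work but cost more during normal operation.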