Dr. Kalpakis. CMSC 621, Advanced Operating Systems, Fall 2003. Fault Tolerance in Distributed Systems

Introduction
- A fault-tolerant system either
  - masks failures, i.e., it continues to operate despite failures, or
  - exhibits well-defined failure behavior, i.e., it facilitates recovery but may not continue operating
- Failure types: process deaths, machine crashes, network failures (link failures, network partitions, message losses)
- Atomic actions (transactions): a process executing an atomic action neither communicates with nor is affected by other processes while executing it
- Global atomicity

Basic Mechanisms for Recovery
Two protocols to restore/recover updated objects to a consistent state:
- Write-ahead-log protocol
  - before updating an object, write an UNDO log record to stable storage
  - update the object
  - before committing the update, write a REDO log record to stable storage
- Shadow page protocol
  - make a "shadow" copy of the object in stable storage
  - update the object
  - before committing, update the shadow copy of the object
- On recovery, either use the UNDO/REDO logs or the shadow page to restore an object
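Below is a minimal, hypothetical Python sketch of the write-ahead-log discipline: an UNDO record is appended before each in-place update and a REDO record before commit, and recovery replays REDO records of committed transactions and UNDO records of uncommitted ones. The names stable_log, database, update, commit, and recover are illustrative, not from the slides.

# Write-ahead-log sketch (illustrative only; not a real storage engine).
stable_log = []          # stands in for append-only stable storage
database = {"X": 100}    # the recoverable object store

def update(txn, key, new_value):
    stable_log.append(("UNDO", txn, key, database[key]))  # UNDO record forced first
    database[key] = new_value                             # then the in-place update
    stable_log.append(("REDO", txn, key, new_value))      # REDO record before commit

def commit(txn):
    stable_log.append(("COMMIT", txn))

def recover():
    committed = {rec[1] for rec in stable_log if rec[0] == "COMMIT"}
    # Redo committed transactions in log order, undo the rest in reverse order.
    for _, txn, key, value in [r for r in stable_log if r[0] == "REDO"]:
        if txn in committed:
            database[key] = value
    for _, txn, key, value in reversed([r for r in stable_log if r[0] == "UNDO"]):
        if txn not in committed:
            database[key] = value

update("T1", "X", 150); commit("T1")
update("T2", "X", 999)            # T2 never commits
recover()
print(database)                   # {'X': 150}: T1's effects survive, T2's are erased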

Distributed Commit & the Generals Paradox
- The generals paradox: two generals communicating via messengers must coordinate to capture a hill; to succeed, both must attack
- There is no protocol that solves this problem with a bounded number of messages
- Proof (by contradiction)
  - consider a message-exchange protocol P that solves the problem with the minimum number of messages
  - consider the last message M exchanged, and suppose M does not reach one general
  - is M really necessary for the generals to capture the hill?
  - if yes, the generals fail to capture the hill, so P does not solve the problem
  - if no, then P is not a protocol with the minimum number of messages
  - either way we reach a contradiction

2-Phase Commit
- The system consists of a coordinator and cohorts
- The coordinator sends Commit_Request, Commit, and Abort messages; cohorts send Agreed (to commit) and Abort messages
- Coordinator
  - sends Commit_Request messages to all cohorts and waits for their replies
  - if all cohorts replied Agreed, it writes a COMMIT log record, sends Commit messages to all cohorts, and waits for acknowledgements
  - if any cohort replied Abort, it sends Abort to all cohorts and aborts
  - when all cohorts acknowledge the Commit message, it writes a COMPLETE log record
  - if it does not receive a reply, it resends the request message

2-Phase Commit (cont.)
- Cohorts
  - upon receiving Commit_Request, reply with an Agreed or Abort message based on their local state, and wait for a reply from the coordinator
  - if the coordinator replies with Commit, the cohort commits
  - if the coordinator replies with Abort, the cohort undoes its updates and aborts
  - in either case, it sends an acknowledgement (ACK) to the coordinator
- 2PC is a blocking protocol
  - if the coordinator or a cohort fails, or a message is lost, the transaction may neither abort nor commit
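The following is a minimal, hypothetical sketch of both 2PC roles in Python. Message passing is reduced to direct method calls (which is exactly what hides the blocking behavior noted above), and the names two_phase_commit, Cohort, vote, commit, and abort are illustrative rather than taken from the slides.

def two_phase_commit(coordinator_log, cohorts):
    """cohorts: objects with vote()/commit()/abort(); the log stands in for stable storage."""
    votes = [c.vote() for c in cohorts]          # phase 1: Commit_Request, collect Agreed/Abort
    if all(votes):
        coordinator_log.append("COMMIT")         # force COMMIT record before telling cohorts
        for c in cohorts:
            c.commit()                           # a real coordinator also waits for ACKs
        coordinator_log.append("COMPLETE")
        return "committed"
    coordinator_log.append("ABORT")
    for c in cohorts:
        c.abort()                                # cohorts undo their updates
    return "aborted"

class Cohort:
    def __init__(self, can_commit=True):
        self.can_commit, self.state = can_commit, "initial"
    def vote(self):
        self.state = "ready" if self.can_commit else "aborted"
        return self.can_commit                   # True = Agreed, False = Abort
    def commit(self):
        self.state = "committed"
    def abort(self):
        self.state = "aborted"

log = []
print(two_phase_commit(log, [Cohort(), Cohort(), Cohort(can_commit=False)]))  # aborted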

Non-Blocking Commit
- Independent recovery: recovering sites can determine the outcome of a transaction based solely on their local state
- Protocols are modeled by finite state machines whose arcs are labeled with conditions/actions
- A protocol is synchronous within one state transition if the number of transitions taken by any two sites differs by at most 1
- Concurrency set C(x) of a state x at a node: the set of states any other node can be in while that node is in state x
- Sender set S(x) of a state x at a node: the set of nodes that can send the messages that can be received in state x

Non-Blocking Commit (cont.)
- Theorem: under the independent recovery assumption, a commit protocol is resilient (i.e., does not block) under an arbitrary single failure if the concurrency set of no state contains conflicting final states
- To make a protocol resilient we may need to introduce new states so that the theorem applies
- Theorem: under independent recovery
  - there is no commit protocol resilient to more than one site failure
  - no commit protocol is resilient to network partitioning with message losses
  - no commit protocol is resilient under multiple network partitions
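To make the single-failure theorem concrete, here is a small, hypothetical worked example comparing the concurrency set of a cohort's wait state in 2PC and in the 3-phase commit protocol of the following slides (coordinator state names q1/w1/p1/a1/c1 as in the diagrams further below; other cohorts' states are omitted for brevity). In 2PC the cohort's wait state can coexist with both final coordinator states, which is why 2PC can block; the extra prepared state of 3PC removes that conflict.

# Coordinator states that can coexist with a cohort's wait state (sketch, single failure model).
C_wait_2pc = {"w1", "a1", "c1"}   # 2PC: the coordinator may already have committed or aborted
C_wait_3pc = {"w1", "a1", "p1"}   # 3PC: commit (c1) is unreachable until every cohort acks Prepare

def has_conflicting_finals(concurrency_set):
    return "a1" in concurrency_set and "c1" in concurrency_set

print(has_conflicting_finals(C_wait_2pc), has_conflicting_finals(C_wait_3pc))  # True False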

Non-Blocking Commit: 3-Phase Commit
- Add failure and timeout transitions to a commit protocol
  - Rule 1: for every non-final state x with concurrency set C(x), if C(x) contains the Commit final state, add a failure transition from x to Commit; otherwise add a failure transition from x to Abort
  - Rule 2: for every non-final state x, if a node j in the sender set S(x) has a failure transition to a Commit (Abort) final state, add a timeout transition from x to the Commit (Abort) final state
- These two rules are sufficient to obtain commit protocols resilient to single site failures
- Starting with two-phase commit and adding "buffer" states between the waiting and commit states, we obtain the three-phase commit protocol; failure/timeout transitions are added according to Rules 1 and 2 above

3-Phase Commit Protocol: Coordinator (state diagram)
- States: q1 (initial), w1 (wait), p1 (prepared), a1 (abort), c1 (commit)
- q1 -> w1: Commit_Request msg sent to all cohorts
- w1 -> a1: one or more cohorts replied Abort / Abort msg sent to all cohorts
- w1 -> p1: all cohorts agreed / Prepare msg sent to all cohorts
- p1 -> c1: all cohorts sent Ack msg / Commit msg sent to all cohorts
- Failure (F) and timeout (T) transitions: from q1 and w1 to a1, and from p1 to c1 (per Rules 1 and 2)

3-Phase Commit Protocol: Cohort i (i = 2, 3, ..., n) (state diagram)
- States: qi (initial), wi (wait), pi (prepared), ai (abort), ci (commit)
- qi -> wi: Commit_Request msg received / Agreed msg sent to coordinator
- qi -> ai: Commit_Request msg received / Abort msg sent to coordinator
- wi -> ai: Abort msg received from coordinator
- wi -> pi: Prepare msg received / Ack msg sent to coordinator
- pi -> ci: Commit msg received from coordinator
- Failure (F) and timeout (T) transitions: from qi and wi to ai, and from pi to ci (per Rules 1 and 2)
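A compact, hypothetical control-flow sketch of the coordinator side follows, using the state names above. The cohort methods (vote, prepare, ack, commit, abort) and failure signaling via TimeoutError are assumptions made for illustration; a real implementation also needs logging and the cohort-side state machine.

def three_phase_commit(cohorts):
    state = "w1"                                    # Commit_Request sent to all cohorts
    try:
        votes = [c.vote() for c in cohorts]         # collect Agreed/Abort replies
        if not all(votes):
            for c in cohorts: c.abort()
            return "aborted"                        # final state a1
        for c in cohorts: c.prepare()               # send Prepare to all cohorts
        state = "p1"
        for c in cohorts: c.ack()                   # wait for all Acks
        for c in cohorts: c.commit()                # send Commit to all cohorts
        return "committed"                          # final state c1
    except TimeoutError:
        # Failure/timeout transitions per Rules 1 and 2: abort from w1, commit
        # from p1, whose concurrency set already contains the commit state.
        if state == "p1":
            for c in cohorts: c.commit()
            return "committed"
        for c in cohorts: c.abort()
        return "aborted"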

Gifford's Static Voting Protocol
- Voting protocols provide a method for managing replicated objects in a way that is resilient to a richer set of failures
- Gifford's static voting protocol
  - there is a read quorum and a write quorum
  - every site has a lock manager that is used to lock an object X; in order for a site to cast its votes, it needs to obtain a lock from its lock manager
  - every site, for each replica of X it holds, maintains a version# for X and its own number of votes

Gifford's Static Voting Protocol (cont.)
A site i wishing to read/write X
- requests a lock from its local lock manager
- sends vote requests to all other sites
- collects the (#votes, version#) pairs from all the replying sites
- determines whether it has a read/write quorum
- ensures its local copy is current
- performs all its local operations on X and then, if it obtained a write quorum, updates all (current) replicas that replied
- releases its local lock and sends release_lock msgs to all the nodes that replied

Gifford's Static Voting Protocol (cont.)
- A site j, upon receiving a vote_request, obtains a local lock and sends its #votes and version# to the requestor; upon receiving a release_lock, it releases its local lock
- Determining read/write quorums
  - read quorum: #votes received >= read quorum R
  - write quorum: #votes received from nodes with the most current replica >= write quorum W
  - R + W > total #votes and W > total #votes / 2
- The fault tolerance and performance obtained with Gifford's static voting protocol depend on the assignment of votes to nodes, the availability of those nodes, and the sizes R and W of the read/write quorums
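A hypothetical sketch of the quorum tests above, following the slide's rule that only votes from sites holding the most current replica count toward the write quorum. The reply format and function names are illustrative.

def has_read_quorum(replies, R):
    """replies: site -> (#votes, version#) from the sites that answered."""
    return sum(votes for votes, _ in replies.values()) >= R

def has_write_quorum(replies, W):
    current = max(version for _, version in replies.values())
    # Only votes from sites holding the most current replica count toward W.
    return sum(votes for votes, version in replies.values() if version == current) >= W

# Example: three sites with one vote each; R = 2, W = 2 satisfies R + W > 3 and W > 3/2.
replies = {"s1": (1, 7), "s2": (1, 7)}
print(has_read_quorum(replies, R=2), has_write_quorum(replies, W=2))  # True True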

Dynamic Voting Protocols
- Network partitions and other failures can hurt the fault tolerance of static voting
- Dynamic voting can boost fault tolerance by adapting
  - the number of votes assigned to the various nodes
  - the set of nodes that can form read/write quorums
- Partition graph: a directed graph whose vertices denote the various partitions of the nodes of the system and whose edges denote how partitions are formed (via merging and splitting)

Majority-Based Dynamic Voting Protocol
- Idea of the Jajodia-Mutchler protocol: modify the read/write quorums so that sites in a distinguished network partition can still operate
- A network partition is distinguished if it contains a majority of the nodes that participated in the last update operation
- Every site i maintains (with respect to its view of the world)
  - version number VNi = #updates performed on its local copy of the object
  - #replicas updated RUi in the most recent update, according to site i
  - distinguished sites list DSi, a set of node IDs
    - if RUi = 3, then DSi = the set of all nodes participating in the most recent update
    - if RUi is even, then DSi = the ID of the highest-priority node among those participating in the most recent update
    - otherwise DSi = null

Majority-Based Dynamic Voting Protocol (cont.)
Nodes that want to read/write a replica
- attempt to obtain a read/write quorum as in Gifford's protocol, except that if a node determines it is not in the distinguished partition, it aborts and releases the locks/votes
- otherwise, if the node updates the replicated object, it updates its VNi, RUi, and DSi, and then sends a Commit message together with these updated values to all nodes that participated in the update

Majority-Based Dynamic Voting Protocol (cont.)
Consider a site i requesting votes
- let P be the set of nodes responding to the vote request
- let M be the highest version number among the responding nodes
- let Q be the set of responding nodes that have the most current replica
- let N be the RU value of any node in Q
Site i is in the distinguished partition if
- the size of Q > N/2, or
- DSj is in Q for some (any) j in Q, or
- N = 3 and P contains at least two of the nodes in DSj, for j in Q
Site i updates VNi, RUi, DSi as follows
- sets VNi to M + 1
- sets RUi to the size of P
- sets DSi to P if RUi = 3, to the ID of the highest-priority node in Q if RUi is even, and to null otherwise
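The checks above translate into a short, hypothetical sketch. DS is represented as a (possibly empty) set of node IDs and priority as the largest ID, both simplifications made for illustration; the function names and state layout are not from the slides.

def is_distinguished(P, state):
    """P: responding sites; state[s] = (VN, RU, DS) as reported by site s."""
    M = max(state[s][0] for s in P)              # highest version number seen
    Q = {s for s in P if state[s][0] == M}       # sites holding the current replica
    j = next(iter(Q))                            # any current site (they agree on RU and DS)
    N, DS = state[j][1], state[j][2]
    return (len(Q) > N / 2                       # majority of the last update's participants
            or bool(DS & Q)                      # a distinguished site holds a current replica
            or (N == 3 and len(DS & P) >= 2))    # three-replica special case

def new_metadata(P, Q, M):
    """(VN, RU, DS) that each participant installs after a successful update."""
    VN, RU = M + 1, len(P)
    if RU == 3:
        DS = set(P)
    elif RU % 2 == 0:
        DS = {max(Q)}                            # highest-priority site with a current replica
    else:
        DS = set()
    return VN, RU, DS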

Autonomous Vote Reassignment
- Vote reassignment is another method for increasing the fault tolerance of voting protocols: the #votes of a select group of nodes is changed in anticipation of future failures
- It can be done in a consensus or an autonomous manner
- We consider an autonomous vote reassignment protocol; each node i maintains three vectors
  - Vi[j] = #votes of node j according to node i
  - Ni[j] = version# of the #votes of node j according to node i
  - vi[j] = #votes node j reports it has in response to a request by node i
- The protocol has three sub-protocols: vote increasing, vote decreasing, and vote collecting

Vote Increasing Sub-Protocol
When node i wishes to increase its #votes, it
- selects its new #votes X
- sends its Vi and Ni vectors, together with its new #votes X, to all other nodes it can reach
- waits until it gets a majority using the vote collecting protocol
- if it gets a majority, then it sets Vi[i] = X and Ni[i] = Ni[i] + 1
Each node j, upon receiving Vi, Ni, and the new #votes X for node i
- sets Vj[i] = X
- sets Nj[i] = Ni[i] + 1

Vote Decreasing Sub-Protocol
When a node i wishes to decrease its #votes, it
- selects its new #votes X
- sets Vi[i] = X and Ni[i] = Ni[i] + 1
- then sends the vectors Vi and Ni to all other nodes
A node j receiving Vi and Ni from node i
- sets Vj[i] = Vi[i]
- sets Nj[i] = Ni[i]
Note that vote reduction does not endanger mutual exclusion, so vote collection to determine a majority is not needed.

Vote Collecting Protocol
- Vote collection is done any time agreement is needed, i.e., when a node i wants to read, write, or reassign its #votes
- Node i, upon receiving a reply (Vj, Nj) to a vote request
  - sets vi[j] = Vj[j], and
  - if Vj[j] > Vi[j] (node i missed a vote increase for node j), or Vj[j] < Vi[j] and Nj[j] > Ni[j] (node i missed a vote decrease by node j), then sets Vi[j] = Vj[j] and Ni[j] = Nj[j]
- If node i does not receive a reply from a node j, it updates its vi, Vi, and Ni entries for node j using the information from the vectors Vk and Nk of the node k that has the most current information for node j
- Node i collects a majority if the #votes it received is > half the sum of the entries of vi over all nodes
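The three sub-protocols fit in one small, hypothetical class. V and N play the role of the Vi and Ni vectors above; the merge rule is simplified to "adopt the entry with the newer version number", which covers both the missed-increase and missed-decrease cases, and the majority test uses the merged vector as the best local estimate of unreachable nodes' votes. All names are illustrative.

class Node:
    def __init__(self, node_id, all_ids, votes=1):
        self.id = node_id
        self.V = {j: votes for j in all_ids}        # Vi[j]: #votes of node j, per node i
        self.N = {j: 0 for j in all_ids}            # Ni[j]: version of that entry

    def merge(self, Vj, Nj):
        # Adopt any entry the replier knows more recently than we do (simplified rule).
        for k in Vj:
            if Nj[k] > self.N[k]:
                self.V[k], self.N[k] = Vj[k], Nj[k]

    def collect_majority(self, replies):
        """replies: node_id -> (Vj, Nj) from every reachable node, including self."""
        for Vj, Nj in replies.values():
            self.merge(Vj, Nj)
        received = sum(self.V[j] for j in replies)   # votes actually gathered
        return received > sum(self.V.values()) / 2   # vs. best local estimate of the total

    def increase_votes(self, new_votes, replies):    # vote increasing sub-protocol
        if not self.collect_majority(replies):
            return False
        self.V[self.id] = new_votes
        self.N[self.id] += 1
        return True                                  # then broadcast (V, N) to the others

    def decrease_votes(self, new_votes):             # vote decreasing sub-protocol
        self.V[self.id] = new_votes                  # no majority needed: reducing votes
        self.N[self.id] += 1                         # cannot endanger mutual exclusion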

Vote Increasing Policies
- The overthrow technique
  - upon detecting the failure of a node i, a node j is selected to increase its voting power
  - node j increases its voting power by at least twice the voting power of the failed node i; otherwise mutual exclusion is in jeopardy
- The alliance technique
  - the nodes in the majority collectively increase their #votes by some collective mechanism (all get some extra votes)
  - the increase in #votes is always at least twice the #votes of the failed node(s)
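As a small, hypothetical numeric illustration of the overthrow technique: with five nodes holding one vote each, the surviving nodes can still form a majority after one node fails once a chosen survivor has raised its votes by twice the failed node's voting power.

votes = {1: 1, 2: 1, 3: 1, 4: 1, 5: 1}          # five nodes, one vote each (total 5)
failed = 5
votes[1] += 2 * votes[failed]                   # overthrow: raise by at least 2x the failed votes
total = sum(votes.values())                     # new total is 7
surviving = sum(v for n, v in votes.items() if n != failed)
print(surviving, total, surviving > total / 2)  # 6 7 True: the survivors still hold a majority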

Failure-Resilient Processes
- A process is resilient if it masks failures and guarantees progress despite a certain number of failures
- Approaches for building resilient processes: backup processes and replicated execution
- Backup processes
  - a primary process and one or more backup processes
  - the state of the primary is checkpointed periodically and made accessible to the backup processes
  - a backup process needs to detect the failure of the primary, and before it starts execution it may need to perform some re-computation since the last checkpoint

Replicated Execution
- Several processes execute the same program concurrently
- Replicated execution increases
  - reliability, by taking a majority consensus among the results of all the processes
  - availability, since the service is available as long as one process continues executing
- It requires extra resources
- It has problems with non-idempotent operations, such as the use of random number generators, and with cooperative computations via message exchange
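A minimal, hypothetical sketch of the majority-consensus step: the client accepts a result only if more than half of the replicas agree on it. The function name and return convention are illustrative.

from collections import Counter

def majority_result(results):
    """results: outputs returned by the replicas; None if there is no strict majority."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None

print(majority_result([42, 42, 41]))   # 42: the faulty replica's answer is outvoted
print(majority_result([42, 41, 40]))   # None: no majority, so the result is not trusted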

Reliable Communication
- Consider a system of cooperating processes that communicate by exchanging messages, such as processes maintaining replicated data, where all processes must have the same view of the system
- This requirement can be met if atomic multicast is supported
  - the messages are received in the same order at each process
  - each message is either received at every process or at none of them
- Note that atomic multicast gives a total ordering of all messages at all processes, but this ordering may not be causal

Birman-Joseph Atomic Broadcast Protocol
- Each process has an associated message priority queue; messages are buffered in the queue before being delivered to the process
- Messages have a priority# and can be in one of two states: deliverable and undeliverable
- A message is delivered to the process when it is deliverable and has the smallest priority# among all buffered messages
- Messages are retained even after they are delivered
- The protocol has two phases and a fault handler

Birman-Joseph Atomic Broadcast Protocol: Phase I
- The sender sends a message, together with the IDs of the receivers, to all the processes in the multicast group
- A receiver, upon receiving the message
  - assigns it a priority# (higher than that of all its buffered messages)
  - marks it undeliverable and buffers it in the queue
  - informs the sender of the assigned message priority#

Birman-Joseph Atomic Broadcast Protocol: Phase II
- The sender, upon receiving the responses from everybody in the multicast group that is still operational
  - sets the final priority# of the message to the maximum among all the priority#s assigned by the receivers
  - multicasts the final priority# to the multicast group
- A receiver, upon receiving the final priority# of a message
  - assigns that priority# to the corresponding message
  - marks the message as deliverable
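The two phases can be simulated in a single process with a hypothetical sketch like the one below: each receiver proposes a priority larger than anything it has buffered, the sender's final priority is the maximum proposal, and delivery proceeds only while the smallest-priority buffered message is deliverable. Tie-breaking between concurrent senders, retaining delivered messages, and real networking are omitted; all names are illustrative.

class Receiver:
    def __init__(self):
        self.next_prio = 0
        self.queue = {}                  # msg_id -> [priority, deliverable, payload]
        self.delivered = []

    def receive(self, msg_id, payload):              # Phase I at the receiver
        self.next_prio += 1
        self.queue[msg_id] = [self.next_prio, False, payload]
        return self.next_prio                        # proposed priority, returned to the sender

    def finalize(self, msg_id, final_prio):          # Phase II at the receiver
        self.queue[msg_id][0] = final_prio
        self.queue[msg_id][1] = True
        self.next_prio = max(self.next_prio, final_prio)
        self._deliver()

    def _deliver(self):
        # Deliver while the lowest-priority buffered message is deliverable.
        while self.queue:
            m = min(self.queue, key=lambda k: self.queue[k][0])
            prio, ok, payload = self.queue[m]
            if not ok:
                return
            self.delivered.append(payload)
            del self.queue[m]

def abcast(msg_id, payload, group):                  # the sender's side
    proposals = [r.receive(msg_id, payload) for r in group]       # Phase I
    final = max(proposals)                                        # Phase II
    for r in group:
        r.finalize(msg_id, final)

group = [Receiver(), Receiver()]
abcast("m1", "hello", group)
abcast("m2", "world", group)
print([r.delivered for r in group])    # both receivers deliver ['hello', 'world']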

Birman-Joseph Atomic Broadcast Protocol: Fault Handler
If a receiver detects that the sender of an undeliverable message has failed, it attempts to complete the protocol as coordinator by performing the following steps:
- Step 1
  - it asks the nodes in the message's multicast group about the status of the message
  - a receiver can respond either with the status and priority# of the message, or with the fact that it has not received the message
- Step 2
  - if the message was marked deliverable at any node in the multicast group, Phase II is executed again, with the "coordinator" multicasting the message's final priority#
  - otherwise, the "coordinator" reinitiates the protocol from Phase I
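Building on the sketch after Phase II, a hypothetical fault handler could look as follows. Passing the payload explicitly and polling the receivers' queues directly are simulation shortcuts, not part of the protocol as described.

def handle_sender_failure(msg_id, payload, group):
    # Step 1: poll the group for the message's status (entry is [priority, deliverable, payload]).
    entries = [r.queue.get(msg_id) for r in group]
    finals = [e[0] for e in entries if e is not None and e[1]]   # priorities already made final
    if finals:
        # Step 2a: some node already marked it deliverable; replay Phase II with that priority.
        for r in group:
            if msg_id not in r.queue:
                r.receive(msg_id, payload)          # retransmit to nodes that missed Phase I
            r.finalize(msg_id, finals[0])
    else:
        # Step 2b: no node reached Phase II; reinitiate the protocol from Phase I.
        abcast(msg_id, payload, group)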