Distributed Systems CS 15-440 Fault Tolerance- Part III Lecture 19, Nov 21, 2012 Majd F. Sakr and Mohammad Hammoud
Today... Last session: Fault Tolerance - Part II (reliable request-reply communication). Today's session: Fault Tolerance - Part III (reliable group communication, atomicity, recovery). Announcement: Project 3 is due tomorrow by 11:59PM.
Objectives: Discussion on Fault Tolerance. Recovery from failures. Atomicity and distributed commit protocols. Process resilience, failure detection and reliable communication. General background on fault tolerance.
Reliable Communication: Reliable communication covers reliable request-reply communication and reliable group communication. A communication channel may exhibit crash, omission, timing, and Byzantine failures. In practice, the focus is on masking crash and omission failures.
Reliable Group Communication: Just as we considered reliable request-reply communication, we also need to consider reliable multicasting services. E.g., election algorithms use multicasting schemes. Multicasting services guarantee that messages are delivered to all members in a process group. [Figure: a process group with members 1 to 7.]
Reliable Group Communication A Basic Reliable-Multicasting Scheme Scalability in Reliable Multicasting Atomic Multicast
Reliable Multicasting: Reliable multicasting means that a message sent to a process group should be delivered to each member of that group. A distinction should be made between reliable communication in the presence of faulty processes and reliable communication when processes are assumed to operate correctly. In the presence of faulty processes, multicasting is considered reliable when it can be guaranteed that all non-faulty group members receive the message. Reliable multicasting turns out to be surprisingly tricky.
Basic Reliable Multicasting Questions: What happens if, during communication (i.e., while a message is being delivered), a process P joins a group? Should P also receive the message? What happens if a (sending) process crashes during communication? What about message ordering? The situation becomes simpler if we assume that an agreement exists on who is a member of the group and who is not, that processes do not fail, and that processes do not join or leave the group while communication is going on. Reliable multicasting can then be redefined so that every message should be delivered to each current group member.
Reliable Multicasting with Feedback Messages: Consider the case when a single sender S wants to multicast a message to multiple receivers. A multicast message from S may be lost part way and delivered to some, but not all, of the intended receivers. Assume that messages are received in the same order as they are sent.
Reliable Multicasting with Feedback Messages [Figure: a sender with a history buffer multicasts message M25 over the network to four receivers. Three receivers have Last = 24 and return ACK25; the fourth has Last = 23 and reports that it missed message 24.] We assume that messages are received in the order they are sent. An extensive and detailed survey of total-order broadcasts can be found in Defago et al. (2004).
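The feedback scheme in the figure can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class names `Sender` and `Receiver`, the zero-based sequence numbers, and the choice to drop (rather than buffer) an out-of-order message are all assumptions made for the example.

```python
class Sender:
    """Keeps every multicast message in a history buffer for retransmission."""

    def __init__(self):
        self.history = {}   # sequence number -> payload
        self.next_seq = 0

    def multicast(self, payload):
        seq = self.next_seq
        self.history[seq] = payload
        self.next_seq += 1
        return seq, payload

    def retransmit(self, seq):
        return seq, self.history[seq]


class Receiver:
    """Delivers messages in sequence order and reports the first gap it sees."""

    def __init__(self):
        self.last = -1      # highest sequence number delivered so far
        self.delivered = []

    def receive(self, seq, payload):
        if seq == self.last + 1:
            self.last = seq
            self.delivered.append(payload)
            return ("ACK", seq)
        if seq > self.last + 1:
            # A message was lost: drop this one and report the missing number.
            return ("MISSED", self.last + 1)
        return ("ACK", seq)  # duplicate of an already delivered message
```

A receiver that gets message 1 before message 0 answers `("MISSED", 0)`; the sender then retransmits from its history buffer until the receiver has caught up.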
Scalability Issues with a Feedback-Based Scheme: If there are N receivers in a multicasting process, the sender must be prepared to accept at least N ACKs. This might cause a feedback implosion. Instead, we can let a receiver return only a NACK. Limitations: no hard guarantees can be given that a feedback implosion will not happen, and it is not clear for how long the sender should keep a message in its history buffer.
Nonhierarchical Feedback Control: How can we control the number of NACKs sent back to the sender? A receiver that misses a message multicasts a NACK to all group members after some random delay. A group member suppresses its own feedback concerning a missing message after receiving a NACK about the same message.
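The effect of feedback suppression can be illustrated with a small calculation. The sketch below is an assumption-laden model, not part of the lecture: `nacks_sent` and its `propagation` parameter are hypothetical names, every receiver is assumed to have missed the same message, and a NACK is assumed to reach all other members after a fixed propagation time.

```python
import random


def nacks_sent(delays, propagation=0.0):
    """Return the indices of receivers that actually send a NACK.

    delays[i] is the random back-off chosen by receiver i. A receiver
    suppresses its own NACK if the earliest NACK reaches it before its
    own timer fires.
    """
    first = min(delays)
    return [i for i, d in enumerate(delays) if d <= first + propagation]


# 100 receivers that all missed the same message.
delays = [random.uniform(0.0, 1.0) for _ in range(100)]
senders = nacks_sent(delays, propagation=0.01)
```

With zero propagation time only the receiver with the smallest back-off sends a NACK; with a nonzero propagation time, every receiver whose timer fires before the first NACK arrives also sends one, which is why suppression reduces but does not eliminate duplicate feedback.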
Hierarchical Feedback Control: Feedback suppression is basically a nonhierarchical solution. Achieving scalability for very large groups of receivers requires hierarchical approaches. The group of receivers is partitioned into a number of subgroups, which are organized into a tree. The main problem here is the construction of the tree.
Hierarchical Feedback Control: The subgroup containing the sender S forms the root of the tree. Within a subgroup, any reliable multicasting scheme can be used. Each subgroup appoints a local coordinator C responsible for handling retransmission requests in its subgroup. If C misses a message m, it asks the coordinator of the parent subgroup to retransmit m. [Figure: a tree of subgroups, with the sender S in the root subgroup and a coordinator C in each subgroup of receivers R.]
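The escalation of retransmission requests up the tree can be sketched as follows. The `Coordinator` class and its `fetch` method are hypothetical names introduced for this example; the sketch assumes each coordinator keeps a local buffer of messages and that the root coordinator (in the sender's subgroup) holds every message.

```python
class Coordinator:
    """Local coordinator of one subgroup in the feedback tree."""

    def __init__(self, parent=None):
        self.parent = parent  # coordinator of the parent subgroup, None at the root
        self.buffer = {}      # sequence number -> message

    def fetch(self, seq):
        """Serve a retransmission request, asking the parent if m was missed."""
        if seq not in self.buffer:
            if self.parent is None:
                raise KeyError(seq)  # even the root does not have the message
            self.buffer[seq] = self.parent.fetch(seq)
        return self.buffer[seq]


root = Coordinator()
root.buffer[7] = "m"
middle = Coordinator(parent=root)
leaf = Coordinator(parent=middle)
message = leaf.fetch(7)  # travels leaf -> middle -> root, cached on the way back
```

Each coordinator only ever talks to its own subgroup and its parent, so the sender handles requests from a bounded number of children regardless of the total group size.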
Atomic Multicast: P1: What is often needed in a distributed system is the guarantee that a message is delivered to either all processes or to none at all. P2: It is also generally required that all messages are delivered in the same order to all processes. Satisfying P1 and P2 results in an atomic multicast. Atomic multicast ensures that non-faulty processes maintain a consistent view, and forces reconciliation when a process recovers and rejoins the group.
Virtual Synchrony (1): A multicast message m is uniquely associated with a list of processes to which it should be delivered. This delivery list corresponds to a group view (G). There is only one case in which delivery of m is allowed to fail: when a group-membership change is the result of the sender of m crashing. In this case, m may either be delivered to all remaining processes, or ignored by each of them. A reliable multicast with this property is said to be virtually synchronous.
Virtual Synchrony (2): The Principle of Virtual Synchronous Multicast. [Figure: timeline of processes P1 to P4, with reliable multicast implemented by multiple point-to-point messages. Initially G = {P1, P2, P3, P4}. P3 crashes, giving G = {P1, P2, P4}, and the partial multicast from P3 is discarded. P3 later rejoins, giving G = {P1, P2, P3, P4}.] A view change acts as a barrier across which no multicast can pass.
Message Ordering: Four different virtually synchronous multicast orderings are distinguished: (1) unordered multicasts, (2) FIFO-ordered multicasts, (3) causally-ordered multicasts, and (4) totally-ordered multicasts.
1. Unordered Multicasts: A reliable, unordered multicast is a virtually synchronous multicast in which no guarantees are given concerning the order in which received messages are delivered by different processes. [Table: three communicating processes in the same group (P1, P2, P3), showing the send and receive events for m1 and m2.]
2. FIFO-Ordered Multicasts: With FIFO-ordered multicasts, the communication layer is forced to deliver incoming messages from the same process in the same order as they have been sent. [Table: four processes in the same group (P1 to P4) with two different senders, showing the send and receive events for m1 to m4.]
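One common way to enforce FIFO ordering is a per-sender hold-back queue driven by sequence numbers. The sketch below assumes that each sender numbers its own messages from 0; the class name `FifoDelivery` is hypothetical.

```python
from collections import defaultdict


class FifoDelivery:
    """Delivers messages from each sender in the order they were sent."""

    def __init__(self):
        self.expected = defaultdict(int)  # sender -> next sequence number to deliver
        self.held = defaultdict(dict)     # sender -> {sequence number: payload}
        self.delivered = []

    def receive(self, sender, seq, payload):
        self.held[sender][seq] = payload
        # Deliver as many consecutive messages from this sender as possible.
        while self.expected[sender] in self.held[sender]:
            nxt = self.expected[sender]
            self.delivered.append(self.held[sender].pop(nxt))
            self.expected[sender] += 1


layer = FifoDelivery()
layer.receive("P1", 1, "m2")  # arrives early, held back
layer.receive("P4", 0, "m3")
layer.receive("P1", 0, "m1")  # releases m1, then m2
layer.receive("P4", 1, "m4")
```

Here `layer.delivered` ends up as `["m3", "m1", "m2", "m4"]`: messages from the same sender keep their order, while nothing is promised about the interleaving of messages from different senders.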
3-4. Causally-Ordered and Total-Ordered Multicasts: Causally-ordered multicast preserves potential causality between different messages. If message m1 causally precedes another message m2, regardless of whether they were multicast by the same sender or not, the communication layer at each receiver will always deliver m1 before m2. Total-ordered multicast requires that when messages are delivered, they are delivered in the same order to all group members (regardless of whether message delivery is unordered, FIFO-ordered, or causally-ordered).
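Causal ordering is often implemented with vector timestamps. The slide does not prescribe a mechanism, so the following is a sketch under that assumption: each message carries the sender's vector clock `ts`, and the hypothetical helper `can_deliver` checks the usual delivery condition against the receiver's local vector.

```python
def can_deliver(ts, sender, local):
    """True if a message stamped `ts` from process `sender` may be delivered.

    The message must be the next one expected from the sender, and the
    receiver must already have delivered everything the sender had seen
    from the other processes.
    """
    return all(
        ts[k] == local[k] + 1 if k == sender else ts[k] <= local[k]
        for k in range(len(ts))
    )


# P0 multicasts m1 with timestamp [1, 0, 0]; P1 delivers m1 and then
# multicasts m2 with timestamp [1, 1, 0]. P2 has delivered nothing yet.
m2_first = can_deliver([1, 1, 0], sender=1, local=[0, 0, 0])     # False: m1 missing
m1_ok = can_deliver([1, 0, 0], sender=0, local=[0, 0, 0])        # True
m2_after_m1 = can_deliver([1, 1, 0], sender=1, local=[1, 0, 0])  # True
```

A message that fails the check is held back until the causally preceding messages have been delivered.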
Virtually Synchronous Reliable Multicasting: Virtually synchronous reliable multicasting that offers total-ordered delivery of messages is what we refer to as atomic multicasting.

Multicast                 Basic Message Ordering     Total-Ordered Delivery?
Reliable multicast        None                       No
FIFO multicast            FIFO-ordered delivery      No
Causal multicast          Causal-ordered delivery    No
Atomic multicast          None                       Yes
FIFO atomic multicast     FIFO-ordered delivery      Yes
Causal atomic multicast   Causal-ordered delivery    Yes

Table: Six different versions of virtually synchronous reliable multicasting.
Implementing Virtual Synchrony (1): We will consider a possible implementation of virtual synchrony that appeared in Isis [Birman et al. 1991]. Isis assumes a FIFO-ordered multicast. Isis makes use of TCP; hence, each transmission is guaranteed to succeed. However, using TCP does not guarantee that all messages sent to a view G are delivered to all non-faulty processes in G before any view change.
Implementing Virtual Synchrony (2): The solution adopted by Isis is to let every process in G keep a message m until it knows for sure that all members in G have received it. If m has been received by all members in G, m is said to be stable. Only stable messages are allowed to be delivered.
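The notion of a stable message can be made concrete with a small bookkeeping structure. This is only an illustrative sketch and not the Isis implementation: `StabilityTracker`, `record`, `is_stable`, and `unstable` are hypothetical names, and how a process learns that another member has received m is left out.

```python
class StabilityTracker:
    """Tracks which members of the current view are known to have each message."""

    def __init__(self, view):
        self.view = set(view)
        self.known = {}  # message id -> set of members known to have received it

    def record(self, msg_id, member):
        self.known.setdefault(msg_id, set()).add(member)

    def is_stable(self, msg_id):
        # Stable once every member of the view is known to have the message.
        return self.known.get(msg_id, set()) >= self.view

    def unstable(self):
        return sorted(m for m in self.known if not self.is_stable(m))


tracker = StabilityTracker(view={"P1", "P2", "P3"})
for member in ("P1", "P2", "P3"):
    tracker.record("m1", member)
tracker.record("m2", "P1")
```

In this run `m1` is stable while `m2` is not, so `tracker.unstable()` returns `["m2"]`; these are the messages a process would still have to keep.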
Implementing Virtual Synchrony (3): [Figure: three panels showing processes 1 to 7, with flush messages and unstable messages marked. (a) Process 4 notices that process 7 has crashed and sends a view change. (b) Process 6 sends out all its unstable messages, followed by a flush message. (c) Process 6 installs the new view when it has received a flush message from everyone else.] The major flaw in the protocol described so far is that it cannot deal with process failures while a new view change is being announced. In particular, it assumes that until the new view Gi+1 has been installed by each member in Gi+1, no process in Gi+1 will fail (which would lead to a next view Gi+2). This problem is solved by announcing view changes for any view Gi+k even while previous changes have not yet been installed by all processes. The details are left as an exercise for the students.
Distributed Commit: The atomic multicasting problem is an example of a more general problem, known as distributed commit. The distributed commit problem involves having an operation performed by each member of a process group, or by none at all. With reliable multicasting, the operation is the delivery of a message. With distributed transactions, the operation may be the commit of a transaction at a single site that takes part in the transaction. Distributed commit is often established by means of a coordinator and participants.
One-Phase Commit Protocol: In a simple scheme, a coordinator tells all participants whether or not to (locally) perform the operation in question. This scheme is referred to as a one-phase commit protocol. Its main drawback is that if one of the participants cannot actually perform the operation, there is no way to tell the coordinator. In practice, more sophisticated schemes are needed. The most commonly used one is the two-phase commit protocol.
Two-Phase Commit Protocol: Assuming that no failures occur, the two-phase commit protocol (2PC) consists of the following two phases, each consisting of two steps. Phase I: Voting Phase. Step 1: The coordinator sends a VOTE_REQUEST message to all participants. Step 2: When a participant receives a VOTE_REQUEST message, it returns either a VOTE_COMMIT message to the coordinator, indicating that it is prepared to locally commit its part of the transaction, or otherwise a VOTE_ABORT message.
Two-Phase Commit Protocol: Phase II: Decision Phase. Step 1: The coordinator collects all votes from the participants. If all participants have voted to commit the transaction, then so will the coordinator. In that case, it sends a GLOBAL_COMMIT message to all participants. However, if one participant had voted to abort the transaction, the coordinator will also decide to abort the transaction and multicasts a GLOBAL_ABORT message. Step 2: Each participant that voted for a commit waits for the final reaction by the coordinator. If a participant receives a GLOBAL_COMMIT message, it locally commits the transaction. Otherwise, when receiving a GLOBAL_ABORT message, the transaction is locally aborted as well.
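The coordinator's decision rule in Phase II can be written as a single function. The sketch below is illustrative: the function name `coordinator_decide` is hypothetical, and a participant whose vote never arrived is represented by `None`, which is an assumption of this example.

```python
def coordinator_decide(votes, coordinator_vote="VOTE_COMMIT"):
    """Return the global decision given the collected votes.

    `votes` maps each participant to "VOTE_COMMIT", "VOTE_ABORT",
    or None if no vote arrived before the timeout.
    """
    everyone_commits = all(v == "VOTE_COMMIT" for v in votes.values())
    if coordinator_vote == "VOTE_COMMIT" and everyone_commits:
        return "GLOBAL_COMMIT"
    return "GLOBAL_ABORT"


all_yes = coordinator_decide({"A": "VOTE_COMMIT", "B": "VOTE_COMMIT"})
one_no = coordinator_decide({"A": "VOTE_COMMIT", "B": "VOTE_ABORT"})
timed_out = coordinator_decide({"A": "VOTE_COMMIT", "B": None})
```

Only the unanimous case yields `GLOBAL_COMMIT`; a single abort vote or a missing vote leads to `GLOBAL_ABORT`.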
2PC Finite State Machines: [Figure: the finite state machine for the coordinator in 2PC and the finite state machine for a participant in 2PC, each with states INIT, WAIT, ABORT, and COMMIT. Transitions are labeled with the message received and the message sent: Commit / Vote-request, Vote-abort / Global-abort, and Vote-commit / Global-commit for the coordinator; Vote-request / Vote-abort, Vote-request / Vote-commit, Global-abort / ACK, and Global-commit / ACK for a participant.] The coordinator as well as the participants have states in which they block waiting for incoming messages. Consequently, the protocol can easily fail when a process crashes, because other processes may wait indefinitely for a message from that process. For this reason, a timeout mechanism is used.
2PC Algorithm
Actions by coordinator:
    write START_2PC to local log;
    multicast VOTE_REQUEST to all participants;
    while not all votes have been collected {
        wait for any incoming vote;
        if timeout {
            write GLOBAL_ABORT to local log;
            multicast GLOBAL_ABORT to all participants;
            exit;
        }
        record vote;
    }
    if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
        write GLOBAL_COMMIT to local log;
        multicast GLOBAL_COMMIT to all participants;
    } else {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
    }
Two-Phase Commit Protocol
Actions by participants:
    write INIT to local log;
    wait for VOTE_REQUEST from coordinator;
    if timeout {
        write VOTE_ABORT to local log;
        exit;
    }
    if participant votes COMMIT {
        write VOTE_COMMIT to local log;
        send VOTE_COMMIT to coordinator;
        wait for DECISION from coordinator;
        if timeout {
            multicast DECISION_REQUEST to other participants;
            wait until DECISION is received; /* remain blocked */
            write DECISION to local log;
        }
        if DECISION == GLOBAL_COMMIT {
            write GLOBAL_COMMIT to local log;
        } else if DECISION == GLOBAL_ABORT {
            write GLOBAL_ABORT to local log;
        }
    } else {
        send VOTE_ABORT to coordinator;
    }
Two-Phase Commit Protocol
Actions for handling decision requests: /* executed by separate thread */
    while true {
        wait until any incoming DECISION_REQUEST is received; /* remain blocked */
        read most recently recorded STATE from the local log;
        if STATE == GLOBAL_COMMIT
            send GLOBAL_COMMIT to requesting participant;
        else if STATE == INIT or STATE == GLOBAL_ABORT
            send GLOBAL_ABORT to requesting participant;
        else
            skip; /* participant remains blocked */
    }
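The decision-request handling above can be mirrored in Python to see when a blocked participant can make progress. This is a sketch with hypothetical function names (`answer_decision_request`, `resolve`); it assumes that a peer which has itself logged VOTE_COMMIT, and nothing later, cannot help and therefore stays silent.

```python
def answer_decision_request(state):
    """Reply of a participant whose most recent log record is `state`."""
    if state == "GLOBAL_COMMIT":
        return "GLOBAL_COMMIT"
    if state in ("INIT", "GLOBAL_ABORT"):
        return "GLOBAL_ABORT"
    return None  # peer is also waiting for the decision


def resolve(peer_states):
    """Decision a blocked participant learns from its peers, or None if still blocked."""
    for state in peer_states:
        reply = answer_decision_request(state)
        if reply is not None:
            return reply
    return None


learned_commit = resolve(["VOTE_COMMIT", "GLOBAL_COMMIT"])
learned_abort = resolve(["INIT", "VOTE_COMMIT"])
still_blocked = resolve(["VOTE_COMMIT", "VOTE_COMMIT"])
```

The last case shows why 2PC is a blocking protocol: if every reachable participant has voted to commit and the coordinator is down, nobody can decide until the coordinator recovers.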
Recovery: So far, we have mainly concentrated on algorithms that allow us to tolerate faults. However, once a failure has occurred, it is essential that the process where the failure has happened can recover to a correct state. In what follows we focus on what it actually means to recover to a correct state, and on when and how the state of a distributed system can be recorded and recovered by means of checkpointing and message logging.
Recovery Error Recovery Checkpointing Message Logging
Error Recovery: Once a failure has occurred, it is essential that the process where the failure has happened can recover to a correct state. Fundamental to fault tolerance is the recovery from an error. The idea of error recovery is to replace an erroneous state with an error-free state. There are essentially two forms of error recovery: backward recovery and forward recovery.
1. Backward Recovery (1): In backward recovery, the main issue is to bring the system from its present erroneous state back to a previously correct state. It is necessary to record the system's state from time to time onto a stable storage, and to restore such a recorded state when things go wrong. [Figure: stable storage built from two drives, showing a crash after drive 1 is updated and a bad spot on one drive.] Dust particles or general wear and tear can give a previously valid block a sudden checksum error, without cause or warning. When such an error is detected, the bad block can be regenerated from the corresponding block on the other drive.
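The two failure cases in the stable-storage figure can be sketched with a pair of in-memory "drives". Everything here is an assumption made for illustration: blocks are dictionaries with a CRC-32 checksum, the helper names are hypothetical, and the rule that drive 1 wins when both blocks are valid but differ reflects the convention that drive 1 is always updated first.

```python
import zlib


def make_block(data):
    return {"data": data, "crc": zlib.crc32(data)}


def is_valid(block):
    return zlib.crc32(block["data"]) == block["crc"]


def recover_drives(drive1, drive2):
    """Compare the drives block by block and repair the stale or bad copy."""
    for i in range(len(drive1)):
        b1, b2 = drive1[i], drive2[i]
        if not is_valid(b1):
            drive1[i] = dict(b2)   # bad spot on drive 1: regenerate from drive 2
        elif not is_valid(b2) or b1 != b2:
            drive2[i] = dict(b1)   # bad spot on drive 2, or crash after drive 1 was updated


# Crash after drive 1 is updated: drive 2 still holds the old block.
d1, d2 = [make_block(b"new")], [make_block(b"old")]
recover_drives(d1, d2)

# Bad spot: the data on drive 1 no longer matches its checksum.
bad = make_block(b"good")
bad["data"] = b"gooe"
e1, e2 = [bad], [make_block(b"good")]
recover_drives(e1, e2)
```

After recovery both drives agree again: `d2` holds the new block, and the damaged block on `e1` has been regenerated from `e2`.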
1. Backward Recovery (2): Each time (part of) the system's present state is recorded, a checkpoint is said to be made. Problems with backward recovery: restoring a system or a process to a previous state is generally expensive in terms of performance, and some states can never be rolled back (e.g., after typing rm -fr * in UNIX).
2. Forward Recovery: When the system detects that it has made an error, forward recovery takes the system state at the time of the error and corrects it, so that the system can move forward. Backward recovery, in contrast, reverts the system state back to some earlier, correct version. Forward recovery is typically faster than backward recovery, but it requires knowing in advance which errors may occur. Some systems make use of both forward and backward recovery for different errors or different parts of one error.
Why Checkpointing? In a fault-tolerant distributed system, backward recovery requires that the system regularly saves its state onto a stable storage. This process is referred to as checkpointing. In particular, checkpointing consists of storing a distributed snapshot of the current application state (i.e., a consistent global state) and later using it to restart the execution in case of a failure.
Recovery Line: In a distributed snapshot, if a process P has recorded the receipt of a message, then there should also be a process Q that has recorded the sending of that message. We are able to identify both senders and receivers. [Figure: timelines of P and Q from the initial state to a failure of P, with a message sent from Q to P. Checkpoints that jointly form a distributed snapshot make up a recovery line; a pair of checkpoints that does not form a snapshot is not a recovery line.]
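The consistency condition for a distributed snapshot translates directly into a set check. In this sketch each checkpoint is assumed to record the ids of all messages its process has sent and received so far; the function name `is_recovery_line` and the dictionary layout are hypothetical.

```python
def is_recovery_line(checkpoints):
    """True if every recorded receipt has a matching recorded send.

    `checkpoints` maps a process name to a dict with the sets
    'sent' and 'received' of message ids recorded in its checkpoint.
    """
    sent = set().union(*(c["sent"] for c in checkpoints.values()))
    received = set().union(*(c["received"] for c in checkpoints.values()))
    return received <= sent


# Q checkpointed after sending m1, P after receiving it: consistent.
consistent = is_recovery_line({
    "P": {"sent": set(), "received": {"m1"}},
    "Q": {"sent": {"m1"}, "received": set()},
})

# Q checkpointed before sending m1, yet P recorded its receipt: inconsistent.
inconsistent = is_recovery_line({
    "P": {"sent": set(), "received": {"m1"}},
    "Q": {"sent": set(), "received": set()},
})
```

The first combination is a recovery line; the second is not, because restarting from it would leave P having received a message that Q never sent.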
Checkpointing: Checkpointing can be of two types. Independent checkpointing: each process simply records its local state from time to time in an uncoordinated fashion. Coordinated checkpointing: all processes synchronize to jointly write their states to local stable storage.
Domino Effect: Independent checkpointing may make it difficult to find a recovery line, potentially leading to a domino effect resulting from cascaded rollbacks. With coordinated checkpointing, the saved state is automatically globally consistent; hence, the domino effect is inherently avoided. [Figure: after a failure of P, processes P and Q roll back through successive combinations of checkpoints, none of which is a recovery line.]
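The cascaded rollback can be reproduced with a short search over independently taken checkpoints. This is a sketch under stated assumptions: each process's history starts with its initial state (empty sets), each later checkpoint records all message ids sent and received so far, and `find_recovery_line` is a hypothetical name.

```python
def find_recovery_line(history):
    """Roll processes back until no checkpoint records an unsent message.

    history[p] is a list of checkpoints for process p, oldest first,
    where history[p][0] is the initial state with empty sets.
    Returns the index of the checkpoint each process restarts from.
    """
    idx = {p: len(cps) - 1 for p, cps in history.items()}
    while True:
        sent = set()
        for p, i in idx.items():
            sent |= history[p][i]["sent"]
        orphaned = [p for p, i in idx.items()
                    if not history[p][i]["received"] <= sent]
        if not orphaned:
            return idx
        for p in orphaned:
            idx[p] -= 1  # roll this process back one checkpoint


initial = {"sent": set(), "received": set()}
history = {
    "P": [initial, {"sent": {"m1"}, "received": {"m2"}}],
    "Q": [initial,
          {"sent": set(), "received": {"m1"}},
          {"sent": {"m2"}, "received": {"m1", "m3"}}],
}
line = find_recovery_line(history)
```

In this example Q's latest checkpoint records the receipt of m3, which no checkpoint of P records as sent, so Q rolls back; that in turn invalidates P's checkpoint, and the cascade continues until both processes are at their initial state (`line == {"P": 0, "Q": 0}`).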
Why Message Logging? Considering that checkpointing is an expensive operation, techniques have been sought to reduce the number of checkpoints but still enable recovery. An important technique in distributed systems is message logging. The basic idea is that if the transmission of messages can be replayed, we can still reach a globally consistent state without having to restore that state from stable storage. In practice, the combination of fewer checkpoints and message logging is more efficient than having to take many checkpoints.
Message Logging: Message logging can be of two types. Sender-based logging: a process logs its messages before sending them off. Receiver-based logging: a receiving process first logs an incoming message before delivering it to the application. When a sending or a receiving process crashes, it can restore the most recently checkpointed state, and from there on replay the logged messages (important for non-deterministic behaviors).
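Recovery by replay can be sketched as folding the logged messages over the last checkpointed state. The example below assumes receiver-based logging and a process whose state changes only in response to messages (so that replay is deterministic); `replay` and `apply_message` are hypothetical names.

```python
def replay(checkpoint_state, logged_messages, apply_message):
    """Rebuild the pre-crash state from a checkpoint and the message log."""
    state = checkpoint_state
    for message in logged_messages:
        state = apply_message(state, message)
    return state


def apply_message(balance, amount):
    # Toy application: each message adjusts an account balance.
    return balance + amount


# Checkpoint taken at balance 100; two messages were logged afterwards.
recovered = replay(100, [5, -20], apply_message)
```

Here `recovered` is 85, the state the process had reached just before the crash, without a checkpoint having been taken after every message.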
Replay of Messages and Orphan Processes: Incorrect replay of messages after recovery can lead to orphan processes. This should be avoided. [Figure: processes P, Q, and R exchanging messages M1, M2, and M3, with logged and unlogged messages marked. Q crashes and recovers; M1 is replayed; M2 can never be replayed; M3 becomes an orphan.]
Next Class: Distributed File Systems - Part I. Thank You!