Distributed Systems CS 15-440
Fault Tolerance - Part II
Lecture 21, Nov 30, 2015
Mohammad Hammoud

Today…
Last Session: Fault Tolerance - Part I (Process Resilience)
Today’s Session: Fault Tolerance - Part II (Reliable communication)
Announcements:
  P4 is due on Nov 30 by midnight
  Final exam is on Thursday, Dec 8th from 1:30PM to 4:30PM at Room 1190 (all topics are included; it will be open book, open notes)

Objectives
Discussion on Fault Tolerance:
  Recovery from failures
  Atomicity and distributed commit protocols
  Process resilience, failure detection and reliable communication
  General background on fault tolerance

Reliable Communication
Fault tolerance in distributed systems typically concentrates on faulty processes.
However, we also need to consider communication failures.
Two types of reliable communication:
  Reliable request-reply communication (e.g., RPC)
  Reliable group communication (e.g., multicasting schemes)
A communication channel may exhibit crash, omission, timing, and byzantine failures. In practice, the focus is on masking crash and omission failures.

Reliable Request-Reply Communication
Reliable Group Communication

Request-Reply Communication
The request-reply (RR) communication is designed to support the roles and message exchanges in typical client-server interactions.
This sort of communication is mainly based on a trio of communication primitives: doOperation, getRequest and sendReply.
[Figure: the client calls doOperation, which sends a Request Message and waits; the server calls getRequest, selects and executes the operation, then calls sendReply; the Reply Message lets the client continue.]
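The trio of primitives can be sketched in a few lines of Python. This is only an in-process illustration, not the lecture's implementation: the two queue.Queue channels, the OPERATIONS table, the helper call() and the snake_case names are assumptions made for the example.

```python
import queue
import threading

# Hypothetical in-process "network": one queue per direction.
requests = queue.Queue()
replies = queue.Queue()

# Hypothetical operations the server can select from.
OPERATIONS = {"add": lambda a, b: a + b, "upper": lambda s: s.upper()}

def do_operation(op, *args):
    """Client side: send the request message, then block until the reply."""
    requests.put((op, args))
    return replies.get()            # (wait) ... (continuation)

def get_request():
    """Server side: block until a request message arrives."""
    return requests.get()

def send_reply(result):
    """Server side: send the reply message back to the client."""
    replies.put(result)

def serve_once():
    op, args = get_request()        # getRequest
    result = OPERATIONS[op](*args)  # select and execute the operation
    send_reply(result)              # sendReply

def call(op, *args):
    """Run one request-reply exchange with the server in its own thread."""
    server = threading.Thread(target=serve_once)
    server.start()
    result = do_operation(op, *args)
    server.join()
    return result
```

For example, call("add", 2, 3) performs one full exchange and returns the server's result to the client.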

Timeout Mechanisms
Request-reply communication may suffer from crash, omission, timing, and byzantine failures.
To allow for occasions where a request or a reply message is not delivered (e.g., lost), doOperation uses a timeout mechanism.
There are various options as to what doOperation can do after a timeout:
  Return immediately with an indication to the client that the request has failed
  Send the request message repeatedly until either a reply is received or the server is assumed to have failed
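The second option (retransmit until a reply arrives or the server is assumed to have failed) might look as follows. This is a sketch under stated assumptions: a lost message is modelled by the hypothetical stub returning None instead of by a real timer, and make_lossy_server, RequestFailed and max_attempts are names invented for the example.

```python
class RequestFailed(Exception):
    """Raised when the server is assumed to have failed."""

def make_lossy_server(drops):
    """Hypothetical server stub that loses the first `drops` exchanges."""
    state = {"remaining": drops}
    def send(request):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            return None             # request or reply lost: models a timeout
        return request * 2          # the "operation": double the argument
    return send

def do_operation(send, request, max_attempts=5):
    """Retransmit until a reply arrives or the attempts are exhausted."""
    for attempt in range(1, max_attempts + 1):
        reply = send(request)
        if reply is not None:
            return reply, attempt   # reply plus the attempt that succeeded
    raise RequestFailed(f"no reply after {max_attempts} attempts")
```

The first option corresponds to max_attempts=1: the failure is reported to the client after a single timeout.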

Idempotent Operations
When the request message is retransmitted, the server may receive it more than once.
This can cause the server to execute an operation more than once for the same request.
Not every operation can be executed more than once and still produce the same results each time.
Operations that can be executed repeatedly with the same effect are called idempotent operations.
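A small example makes the distinction concrete. The account dictionary and the two operations are hypothetical; the point is only that executing a retransmitted request twice is harmless for one operation and harmful for the other.

```python
def set_balance(account, amount):
    """Idempotent: repeating it leaves the same state and the same result."""
    account["balance"] = amount
    return account["balance"]

def deposit(account, amount):
    """Not idempotent: every execution changes the state again."""
    account["balance"] += amount
    return account["balance"]

def execute_twice(operation, amount):
    """Simulate a retransmitted request being executed twice by the server."""
    account = {"balance": 100}
    first = operation(account, amount)
    second = operation(account, amount)
    return first, second
```

Here execute_twice(set_balance, 50) yields the same result both times, whereas execute_twice(deposit, 50) credits the account twice.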

Duplicate Filtering
To avoid problems with operations that are executed more than once, the server should:
  Recognize successive messages from the “same” client
  Filter out duplicates
Upon receiving a “duplicate” request, the server can either:
  Re-execute the operation and reply (possible only for idempotent operations)
  Or avoid re-executing the operation if it retained the outcome of the first request
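One way to recognize duplicates is to have each client tag its requests with an identifier, and to keep a history of replies at the server. The sketch below assumes exactly that; the (client_id, request_id) key, the Server class and the deposit-style operation are illustrative choices, not something prescribed by the slides.

```python
class Server:
    """Server that filters duplicates and retransmits saved replies."""

    def __init__(self):
        self.balance = 0
        self.history = {}     # (client_id, request_id) -> saved reply
        self.executions = 0   # how many times the operation really ran

    def handle(self, client_id, request_id, amount):
        key = (client_id, request_id)
        if key in self.history:
            # Duplicate request: return the retained outcome, do not re-execute.
            return self.history[key]
        self.balance += amount        # a non-idempotent operation
        self.executions += 1
        self.history[key] = self.balance
        return self.balance
```

A retransmitted request with the same identifiers gets the same reply as the first one, and the deposit is applied only once.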

Implementation Choices
The RR protocol can be implemented in different ways to provide different delivery guarantees. The main choices are:
  Retry request message (client side): Controls whether to retransmit the request message until either a reply is received or the server is assumed to have failed
  Duplicate filtering (server side): Controls when retransmissions are used and whether to filter out duplicate requests at the server
  Retransmission of results (server side): Controls whether to keep a history of result messages so as to enable lost replies to be retransmitted without re-executing the operations at the server

Request-Reply Call Semantics
Combinations of RR protocol choices lead to a variety of possible semantics for the reliability of remote invocations.

Fault Tolerance Measures and Call Semantics (Pertaining to Remote Procedures):

Retransmit Request Message   Duplicate Filtering   Re-execute Procedure or Retransmit Reply   Call Semantics
No                           N/A                   N/A                                        Maybe
Yes                          No                    Re-execute Procedure                       At-least-once
Yes                          Yes                   Retransmit Reply                           At-most-once
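The three semantics can be compared with a tiny simulation. It is a simplified sketch: it assumes that only reply messages are lost (requests always reach the server), and run_call with its parameters is a hypothetical helper written for this illustration.

```python
def run_call(retransmit, filter_duplicates, replies_lost):
    """Simulate one remote call.

    replies_lost is the number of initial replies the channel drops.
    Returns (client received a reply?, number of server executions).
    """
    executions = 0
    saved_reply = None
    attempts = replies_lost + 1 if retransmit else 1
    for attempt in range(attempts):
        if filter_duplicates and saved_reply is not None:
            reply = saved_reply       # duplicate: retransmit the saved reply
        else:
            executions += 1           # (re-)execute the procedure
            reply = saved_reply = "done"
        if attempt >= replies_lost:   # this reply is not dropped
            return True, executions
    return False, executions
```

With two lost replies, the "maybe" configuration leaves the client without an answer although the procedure ran once, at-least-once runs the procedure three times, and at-most-once runs it exactly once.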

Reliable Request-Reply Communication
Reliable Group Communication

Reliable Group Communication
As we considered reliable request-reply communication, we also need to consider reliable multicasting services.
E.g., election algorithms use multicasting schemes.
Multicasting services guarantee that messages are delivered to all members in a process group.
[Figure: a process group with members numbered 1 to 7.]

A Basic Reliable-Multicasting Scheme
Atomic Multicasting

Reliable Multicasting
Reliable multicasting indicates that a message that is sent to a group of processes should be delivered to each member of that group.
A distinction should be made between:
  Reliable communication in the presence of faulty processes
  Reliable communication when all processes are non-faulty
How can we achieve reliable multicasting? Reliable multicasting turns out to be surprisingly tricky.

Reliable Multicasting with Feedback Messages
Consider the case when a single sender, S, wants to multicast a message to multiple receivers.
S’s message may be lost part way and delivered to some, but not to all, of the intended receivers.
For now, let us assume that messages are received in the same order as they were sent.

Reliable Multicasting with Feedback Messages
[Figure: the sender keeps M25 in its history buffer and multicasts it over the network to four receivers, whose last received messages are 24, 24, 23 and 24. Three receivers return ACK25; the receiver with Last = 23 reports that it missed message 24.]
We assume that messages are received in the order they are sent.
An extensive and detailed survey of total-order broadcasts can be found in Defago et al. (2004).
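The feedback scheme in the figure can be sketched as follows. The Sender and Receiver classes, the "ACK"/"MISSED" tuples and the direct method calls that stand in for network messages are all assumptions for the illustration; a real sender would also trim its history buffer once every receiver has acknowledged a message.

```python
class Receiver:
    def __init__(self, last):
        self.last = last              # sequence number of the last message received
        self.delivered = []

    def receive(self, seq, payload):
        """Return ('ACK', seq) or ('MISSED', first missing sequence number)."""
        if seq == self.last + 1:
            self.last = seq
            self.delivered.append(payload)
            return ("ACK", seq)
        if seq <= self.last:
            return ("ACK", seq)       # duplicate of something already received
        return ("MISSED", self.last + 1)

class Sender:
    def __init__(self):
        self.history = {}             # seq -> payload, retained for retransmission

    def multicast(self, seq, payload, receivers):
        """Send to every receiver; return the first feedback from each."""
        self.history[seq] = payload
        feedback = []
        for r in receivers:
            kind, n = r.receive(seq, payload)
            feedback.append((kind, n))
            while kind == "MISSED":
                r.receive(n, self.history[n])      # retransmit the missed message
                kind, n = r.receive(seq, payload)  # then resend the current one
        return feedback
```

Replaying the figure's scenario (receivers at 24, 24, 23 and 24, message 25 multicast) produces three ACKs and one report of a missed message 24, after which all receivers have reached 25.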

A Basic Reliable-Multicasting Scheme
Atomic Multicasting

Atomic Multicast
Atomic multicast entails that:
  A message is delivered to either all or none of the processes
  All messages are delivered in the same order to all processes
As a result:
  Non-faulty processes can maintain a “consistent view”
  Reconciliation is enforced when a process recovers and rejoins a group

Virtual Synchrony
A multicast message, m, is uniquely associated with a list of processes to which it should be delivered.
This delivery list corresponds to what is called a group view.
In principle, the delivery of m is allowed to fail:
  When a change in group membership is the result of the sender of m crashing. Accordingly, m may either be delivered to all remaining processes, or ignored by each of them.
  When a change in group membership is the result of a receiver of m crashing. Accordingly, m may be ignored by every other receiver.
A reliable multicast with these properties is said to be “virtually synchronous”.

The Principle of Virtual Synchrony
[Figure: timeline of processes P1 to P4 performing reliable multicast by multiple point-to-point messages. The group view is G = {P1, P2, P3, P4} until P3 crashes, then G = {P1, P2, P4}, and G = {P1, P2, P3, P4} again after P3 rejoins. The partial multicast from P3 is discarded.]
A view change acts as a barrier across which no multicast can pass.

Message Ordering
Four different virtually synchronous multicast orderings are distinguished:
  1. Unordered multicasts
  2. FIFO-ordered multicasts
  3. Causally-ordered multicasts
  4. Totally-ordered multicasts

1. Unordered multicasts
A reliable, unordered multicast is a virtually synchronous multicast in which no guarantees are given concerning the order in which received messages are delivered by different processes.
[Table: three communicating processes in the same group. P1 sends m1 and then m2; P2 receives m1 first, whereas P3 receives m2 first.]

2. FIFO-Ordered Multicasts
With FIFO-ordered multicasts, the communication layer is forced to deliver incoming messages from the same process in the same order as they have been sent.
[Table: four processes in the same group with two different senders. P1 sends m1 and then m2; P4 sends m3 and then m4. P2 receives m1 first and P3 receives m3 first, but every receiver gets m1 before m2 and m3 before m4.]
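A common way to enforce FIFO ordering is to number each sender's messages and hold back anything that arrives early. The sketch below assumes per-sender sequence numbers starting at 0; FifoReceiver and its fields are hypothetical names for this illustration.

```python
from collections import defaultdict

class FifoReceiver:
    """Delivers each sender's messages in the order they were sent."""

    def __init__(self):
        self.next_seq = defaultdict(int)  # sender -> next expected sequence number
        self.held = defaultdict(dict)     # sender -> {seq: message} held back
        self.delivered = []

    def receive(self, sender, seq, message):
        if seq < self.next_seq[sender]:
            return                        # duplicate of a delivered message
        self.held[sender][seq] = message
        # Deliver as many in-order messages from this sender as possible.
        while self.next_seq[sender] in self.held[sender]:
            self.delivered.append(self.held[sender].pop(self.next_seq[sender]))
            self.next_seq[sender] += 1
```

Messages from different senders may still interleave in any order; only the relative order of messages from the same sender is fixed.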

3-4. Causally-Ordered and Total-Ordered Multicasts
Causally-ordered multicasts preserve potential causality between different messages.
If message m1 causally precedes another message m2, the communication layer at each receiver will always deliver m1 before m2.
Total-ordered multicasts require that when messages are delivered, they are delivered in the same order to all group members.
This is regardless of whether message delivery is unordered, FIFO-ordered, or causally-ordered.
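Causal delivery is often implemented with vector timestamps. The slides do not prescribe a mechanism, so the following is one possible sketch under that assumption: each message carries a vector stamp, and CausalReceiver, its clock list and the integer process indices are names chosen for the example.

```python
class CausalReceiver:
    """Holds back a message until everything it causally depends on is delivered."""

    def __init__(self, n):
        self.clock = [0] * n   # clock[k] = number of messages delivered from process k
        self.held = []
        self.delivered = []

    def _deliverable(self, sender, stamp):
        # It must be the next message from its sender, and the sender must not
        # have seen any message from another process that we have not delivered.
        return (stamp[sender] == self.clock[sender] + 1 and
                all(stamp[k] <= self.clock[k]
                    for k in range(len(stamp)) if k != sender))

    def receive(self, sender, stamp, message):
        self.held.append((sender, stamp, message))
        progress = True
        while progress:        # keep going while a delivery unblocks others
            progress = False
            for item in list(self.held):
                s, st, msg = item
                if self._deliverable(s, st):
                    self.held.remove(item)
                    self.clock[s] += 1
                    self.delivered.append(msg)
                    progress = True
```

If process 0 multicasts m1 with stamp [1, 0, 0] and process 1, after delivering m1, multicasts m2 with stamp [1, 1, 0], a receiver that gets m2 first holds it back until m1 has been delivered.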

Virtually Synchronous Reliable Multicasting
A virtually synchronous reliable multicasting that offers total-ordered delivery of messages is what we refer to as atomic multicasting.

Multicast                 Basic Message Ordering     Total-Ordered Delivery?
Reliable multicast        None                       No
FIFO multicast            FIFO-ordered delivery      No
Causal multicast          Causal-ordered delivery    No
Atomic multicast          None                       Yes
FIFO atomic multicast     FIFO-ordered delivery      Yes
Causal atomic multicast   Causal-ordered delivery    Yes

Six different versions of virtually synchronous reliable multicasting

Distributed Commit
The atomic multicasting problem is an example of a more general problem, known as distributed commit.
The distributed commit problem involves having an operation performed by each member of a process group, or by none at all.
With reliable multicasting, the operation is the delivery of a message.
With distributed transactions, the operation may be the commit of a transaction at a single site that takes part in the transaction.
Distributed commit is often established by means of a coordinator and participants.

One-Phase Commit Protocol
In a simple scheme, a coordinator can tell all participants whether or not to (locally) perform the operation in question.
This scheme is referred to as a one-phase commit protocol.
The one-phase commit protocol has a main drawback: if one of the participants cannot actually perform the operation, there is no way to tell the coordinator.
In practice, more sophisticated schemes are needed.
The most commonly used one is the two-phase commit protocol.

Two-Phase Commit Protocol
Assuming that no failures occur, the two-phase commit protocol (2PC) consists of the following two phases, each consisting of two steps:

Phase I: Voting Phase
  Step 1: The coordinator sends a VOTE_REQUEST message to all participants.
  Step 2: When a participant receives a VOTE_REQUEST message, it returns either a VOTE_COMMIT message to the coordinator, indicating that it is prepared to locally commit its part of the transaction, or otherwise a VOTE_ABORT message.

Phase II: Decision Phase
  Step 1: The coordinator collects all votes from the participants. If all participants have voted to commit the transaction, then so will the coordinator. In that case, it sends a GLOBAL_COMMIT message to all participants. However, if one participant had voted to abort the transaction, the coordinator will also decide to abort the transaction and multicast a GLOBAL_ABORT message.
  Step 2: Each participant that voted for a commit waits for the final reaction by the coordinator. If a participant receives a GLOBAL_COMMIT message, it locally commits the transaction. Otherwise, when receiving a GLOBAL_ABORT message, the transaction is locally aborted as well.
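The failure-free case of the two phases can be condensed into a short function. It is a sketch that collapses message passing into direct calls; two_phase_commit, the participants dictionary and the "committed"/"aborted" outcome strings are assumptions for the illustration, while the message names are those used in the slides.

```python
def two_phase_commit(participants):
    """Run failure-free 2PC.

    participants maps a name to a callable that returns True if that
    participant is prepared to commit its part of the transaction.
    Returns (decision, votes, outcome per participant).
    """
    # Phase I (voting): send VOTE_REQUEST and collect one vote per participant.
    votes = {}
    for name, can_commit in participants.items():
        votes[name] = "VOTE_COMMIT" if can_commit() else "VOTE_ABORT"

    # Phase II (decision): commit only if every participant voted to commit.
    if all(v == "VOTE_COMMIT" for v in votes.values()):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"

    # Every participant applies the coordinator's decision locally.
    result = "committed" if decision == "GLOBAL_COMMIT" else "aborted"
    outcome = {name: result for name in participants}
    return decision, votes, outcome
```

A single VOTE_ABORT is enough to abort the transaction at every site, including those that voted to commit.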

2PC Finite State Machines
[Figure: the finite state machine for the coordinator in 2PC. INIT to WAIT on Commit (sends Vote-request); WAIT to ABORT on Vote-abort (sends Global-abort); WAIT to COMMIT on Vote-commit (sends Global-commit).]
[Figure: the finite state machine for a participant in 2PC. INIT to ABORT on Vote-request (sends Vote-abort); INIT to WAIT on Vote-request (sends Vote-commit); WAIT to ABORT on Global-abort (sends ACK); WAIT to COMMIT on Global-commit (sends ACK).]
The coordinator as well as the participants have states in which they block waiting for incoming messages. Consequently, the protocol can easily fail when a process crashes, because other processes may wait indefinitely for a message from that process. For this reason, a timeout mechanism is used.

2PC Algorithm
Actions by coordinator:

    write START_2PC to local log;
    multicast VOTE_REQUEST to all participants;
    while not all votes have been collected {
        wait for any incoming vote;
        if timeout {
            write GLOBAL_ABORT to local log;
            multicast GLOBAL_ABORT to all participants;
            exit;
        }
        record vote;
    }
    if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
        write GLOBAL_COMMIT to local log;
        multicast GLOBAL_COMMIT to all participants;
    } else {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
    }

Actions by participants:

    write INIT to local log;
    wait for VOTE_REQUEST from coordinator;
    if timeout {
        write VOTE_ABORT to local log;
        exit;
    }
    if participant votes COMMIT {
        write VOTE_COMMIT to local log;
        send VOTE_COMMIT to coordinator;
        wait for DECISION from coordinator;
        if timeout {
            multicast DECISION_REQUEST to other participants;
            wait until DECISION is received; /* remain blocked */
            write DECISION to local log;
        }
        if DECISION == GLOBAL_COMMIT {
            write GLOBAL_COMMIT to local log;
        } else if DECISION == GLOBAL_ABORT {
            write GLOBAL_ABORT to local log;
        }
    } else {
        send VOTE_ABORT to coordinator;
    }

Actions for handling decision requests: /* executed by separate thread */

    while true {
        wait until any incoming DECISION_REQUEST is received; /* remain blocked */
        read most recently recorded STATE from the local log;
        if STATE == GLOBAL_COMMIT
            send GLOBAL_COMMIT to requesting participant;
        else if STATE == INIT or STATE == GLOBAL_ABORT
            send GLOBAL_ABORT to requesting participant;
        else
            skip; /* participant remains blocked */
    }
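The decision-request rule explains why 2PC can block. In the sketch below, answer_decision_request mirrors the handler's three cases, and resolve is a hypothetical helper that shows what a timed-out participant learns from the logged states of its peers.

```python
def answer_decision_request(state):
    """Reply a participant gives, based on its most recently logged state."""
    if state == "GLOBAL_COMMIT":
        return "GLOBAL_COMMIT"
    if state in ("INIT", "GLOBAL_ABORT"):
        return "GLOBAL_ABORT"
    return None   # e.g. VOTE_COMMIT: this peer cannot help

def resolve(other_states):
    """Decision a blocked participant learns from its peers, or None."""
    for state in other_states:
        reply = answer_decision_request(state)
        if reply is not None:
            return reply
    return None   # every peer is also waiting: the participant stays blocked
```

When all other participants have also logged VOTE_COMMIT, nobody knows the coordinator's decision and the requester must keep waiting until the coordinator recovers.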

Recovery
Thus far, we have focused on algorithms that allow us to tolerate faults.
However, once a process fails, it is essential that it can recover to a correct state.
Two common recovery mechanisms:
  Checkpointing
  Message Logging

Why Checkpointing?
In fault-tolerant distributed systems, processes “regularly” save their states to stable storage.
This mechanism is referred to as checkpointing.
Checkpointing consists of storing a “distributed snapshot” of the current application state.
After a failure, the distributed snapshot is used to restart the system (or part of it) from a correct state.

Recovery Line
In capturing a distributed snapshot, if a process P has recorded the receipt of a message, m, then there should also be a process Q that has recorded the sending of m. That way, we are able to identify both senders and receivers.
[Figure: timelines of P and Q from the initial state to a failure, showing local snapshots, a message m sent from Q to P, one set of local snapshots marked as a recovery line and another marked as not a recovery line. The local snapshots on a recovery line jointly form a distributed snapshot.]
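The consistency condition can be checked mechanically. The sketch assumes each local snapshot records the identifiers of the messages its process has sent and received so far; is_recovery_line and the dictionary layout are hypothetical.

```python
def is_recovery_line(snapshots):
    """Check whether a set of local snapshots is globally consistent.

    snapshots maps a process name to {"sent": set, "received": set} of
    message identifiers. The set is consistent if every message recorded
    as received is also recorded as sent by some process.
    """
    all_sent = set()
    for state in snapshots.values():
        all_sent |= state["sent"]
    return all(state["received"] <= all_sent for state in snapshots.values())
```

A message that was sent but not yet received does not violate the condition; only a receipt without a matching send does.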

Checkpointing
Checkpointing can be classified into two types:
  Independent Checkpointing, wherein each process simply records its local state from time to time in an uncoordinated fashion
  Coordinated Checkpointing, wherein all processes synchronize to jointly write their states to stable storage
Which algorithm among the ones we’ve studied can be used to implement coordinated checkpointing? 2PC

Domino Effect
Independent checkpointing may make it difficult to find a recovery line, potentially leading to a domino effect resulting from cascaded rollbacks.
With coordinated checkpointing, the saved state is automatically globally consistent; hence, the domino effect is inherently avoided.
[Figure: after a failure, P and Q roll back through successive sets of independent checkpoints, none of which is a recovery line.]

Why Message Logging?
Considering that checkpointing is an expensive operation, techniques have been sought to reduce the number of checkpoints, but still enable recovery.
An important technique in distributed systems is message logging.
The basic idea is that if transmission of messages can be replayed, we can still reach a globally consistent state, yet without having to restore all the state from a distributed checkpoint.
In practice, the combination of having fewer checkpoints alongside message logging is more efficient than having to take many checkpoints.

Message Logging
Message logging can be classified into two types:
  Sender-based logging: A process can log its messages before sending them off
  Receiver-based logging: A receiving process can first log an incoming message before delivering it to the application
When a sending or a receiving process crashes, it can restore the most recently checkpointed state, and from there on “replay” the logged messages.
Will this work for non-deterministic applications?
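Receiver-based logging with replay can be sketched as below. The example assumes a deterministic application (the state is just a running sum) and keeps the checkpoint and the log in memory, whereas a real system would write both to stable storage; LoggedProcess and its methods are hypothetical names.

```python
class LoggedProcess:
    """Receiver-based message logging on top of a single checkpoint."""

    def __init__(self):
        self.state = 0        # application state: a running sum
        self.checkpoint = 0   # most recently checkpointed state
        self.log = []         # messages received since the checkpoint

    def deliver(self, message):
        self.log.append(message)   # log the message first
        self.state += message      # then deliver it to the application

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log = []              # older messages are covered by the checkpoint

    def crash_and_recover(self):
        self.state = self.checkpoint   # restore the checkpointed state
        for message in self.log:       # replay the logged messages in order
            self.state += message
```

Because the application is deterministic, replaying the log after the checkpoint reproduces exactly the state the process had before the crash.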

Replaying of Messages and Orphan Processes
Caveat: Incorrect logging/replaying of messages after recovery can lead to orphan processes.
[Figure: processes P, Q and R exchange messages M1, M2 and M3; some messages are logged and others are unlogged. Q crashes and recovers. M1 is replayed, but M2 can never be replayed, so M3 becomes an orphan.]

Objectives
Discussion on Fault Tolerance:
  Recovery from failures
  Atomicity and distributed commit protocols
  Process resilience, failure detection and reliable communication
  General background on fault tolerance
All Covered!

Next Class
Distributed File Systems
Thank You!
