1 8.3 Reliable Client-Server Communication So far: Concentrated on process resilience (by means of process groups). What about reliable communication channels?

Slides:



Advertisements
Similar presentations
6.852: Distributed Algorithms Spring, 2008 Class 7.
Advertisements

CS542: Topics in Distributed Systems Distributed Transactions and Two Phase Commit Protocol.
(c) Oded Shmueli Distributed Recovery, Lecture 7 (BHG, Chap.7)
Computer Science Lecture 18, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.
CS 582 / CMPE 481 Distributed Systems
Systems of Distributed Systems Module 2 -Distributed algorithms Teaching unit 3 – Advanced algorithms Ernesto Damiani University of Bozen Lesson 6 – Two.
Computer Science Lecture 17, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.
Non-blocking Atomic Commitment Aaron Kaminsky Presenting Chapter 6 of Distributed Systems, 2nd edition, 1993, ed. Mullender.
Chapter 7 Fault Tolerance Basic Concepts Failure Models Process Design Issues Flat vs hierarchical group Group Membership Reliable Client.
Distributed Systems CS Fault Tolerance- Part III Lecture 15, Oct 26, 2011 Majd F. Sakr, Mohammad Hammoud andVinay Kolar 1.
"Failure is not an option. It comes bundled with your system.“ (--unknown)
1 CS 194: Distributed Systems Distributed Commit, Recovery Scott Shenker and Ion Stoica Computer Science Division Department of Electrical Engineering.
©Silberschatz, Korth and Sudarshan19.1Database System Concepts Distributed Transactions Transaction may access data at several sites. Each site has a local.
1 More on Distributed Coordination. 2 Who’s in charge? Let’s have an Election. Many algorithms require a coordinator. What happens when the coordinator.
1 ICS 214B: Transaction Processing and Distributed Data Management Distributed Database Systems.
A Survey of Rollback-Recovery Protocols in Message-Passing Systems M. Elnozahy, L. Alvisi, Y. Wang, D. Johnson Carnegie Mellon University Presented by:
1 Rollback-Recovery Protocols II Mahmoud ElGammal.
Distributed Deadlocks and Transaction Recovery.
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved DISTRIBUTED SYSTEMS.
Distributed Commit Dr. Yingwu Zhu. Failures in a distributed system Consistency requires agreement among multiple servers – Is transaction X committed?
Checkpointing and Recovery. Purpose Consider a long running application –Regularly checkpoint the application Expensive task –In case of failure, restore.
1 Distributed Systems Fault Tolerance Chapter 8. 2 Course/Slides Credits Note: all course presentations are based on those developed by Andrew S. Tanenbaum.
A Survey of Rollback-Recovery Protocols in Message-Passing Systems.
Transaction Communications Yi Sun. Outline Transaction ACID Property Distributed transaction Two phase commit protocol Nested transaction.
Distributed Transactions Chapter 13
Distributed Systems CS Fault Tolerance- Part III Lecture 19, Nov 25, 2013 Mohammad Hammoud 1.
Distributed Systems Principles and Paradigms Chapter 07 Fault Tolerance 01 Introduction 02 Communication 03 Processes 04 Naming 05 Synchronization 06 Consistency.
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved DISTRIBUTED SYSTEMS.
EEC 688/788 Secure and Dependable Computing Lecture 7 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
12. Recovery Study Meeting M1 Yuuki Horita 2004/5/14.
Checkpointing and Recovery. Purpose Consider a long running application –Regularly checkpoint the application Expensive task –In case of failure, restore.
Fault Tolerance CSCI 4780/6780. Distributed Commit Commit – Making an operation permanent Transactions in databases One phase commit does not work !!!
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved Chapter 8 Fault.
Chapter 2 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University Building Dependable Distributed Systems.
More on Fault Tolerance Chapter 7. Topics Group Communication Virtual Synchrony Atomic Commit Checkpointing, Logging, Recovery.
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved Chapter 8 Fault.
Fault Tolerance Chapter 7.
Fault Tolerance. Basic Concepts Availability The system is ready to work immediately Reliability The system can run continuously Safety When the system.
Kyung Hee University 1/33 Fault Tolerance Chap 7.
Reliable Communication Smita Hiremath CSC Reliable Client-Server Communication Point-to-Point communication Established by TCP Masks omission failure,
Fault Tolerance Chapter 7. Failures in Distributed Systems Partial failures – characteristic of distributed systems Goals: Construct systems which can.
Chapter 11 Fault Tolerance. Topics Introduction Process Resilience Reliable Group Communication Recovery.
Reliable Client-Server Communication. Reliable Communication So far: Concentrated on process resilience (by means of process groups). What about reliable.
Revisiting failure detectors Some of you asked questions about implementing consensus using S - how does it differ from reaching consensus using P. Here.
EEC 688/788 Secure and Dependable Computing Lecture 6 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Fault Tolerance Chapter 7. Basic Concepts Dependability Includes Availability Reliability Safety Maintainability.
1 CHAPTER 5 Fault Tolerance Chapter 5-- Fault Tolerance.
Fault Tolerance Chapter 7. Goal An important goal in distributed systems design is to construct the system in such a way that it can automatically recover.
Fault Tolerance (2). Topics r Reliable Group Communication.
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved DISTRIBUTED SYSTEMS.
1 Fault Tolerance and Recovery Mostly taken from
Chapter 8 – Fault Tolerance Section 8.5 Distributed Commit Heta Desai Dr. Yanqing Zhang Csc Advanced Operating Systems October 14 th, 2015.
Distributed Databases – Advanced Concepts Chapter 25 in Textbook.
More on Fault Tolerance
Fault Tolerance Prof. Orhan Gemikonakli
Fault Tolerance Chap 7.
EEC 688/788 Secure and Dependable Computing
Chapter 8 Fault Tolerance Part I Introduction.
Reliable group communication
Commit Protocols CS60002: Distributed Systems
Outline Announcements Fault Tolerance.
EEC 688/788 Secure and Dependable Computing
Distributed Systems CS
Distributed Systems CS
DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
Distributed Databases Recovery
EEC 688/788 Secure and Dependable Computing
Last Class: Fault Tolerance
Presentation transcript:

1 8.3 Reliable Client-Server Communication So far: Concentrated on process resilience (by means of process groups). What about reliable communication channels? Error detection: Framing of packets to allow for bit error detection Use of frame numbering to detect packet loss Error correction: Add so much redundancy that corrupted packets can be automatically corrected Request retransmission of lost, or last N packets Observation: Most of this work assumes point-to-point communication

2 RPC Semantics in the Presence of Failures Five different classes of failures that can occur in RPC systems: 1.The client is unable to locate the server. 2.The request message from the client to the server is lost. 3.The server crashes after receiving a request. 4.The reply message from the server to the client is lost. 5.The client crashes after sending a request. [1:] Relatively simple – just report back to client [2:] Just resend message

3 Server Crashes Figure 8-7. A server in client-server communication. (a) The normal case. (b) Crash after execution. (c) Crash before execution. [3:] Server crashes are harder as you don’t know what it had already done. Problem: We need to decide on what we expect from the server At-least-once-semantics: The server guarantees it will carry out an operation at least once, no matter what. At-most-once-semantics: The server guarantees it will carry out an operation at most once.

4 Server Crashes These events can occur in six different orderings: 1.M →P →C: A crash occurs after sending the completion message and printing the text. 2.M →C (→P): A crash happens after sending the completion message, but before the text could be printed. 3.P →M →C: A crash occurs after sending the completion message and printing the text. 4.P→C(→M): The text printed, after which a crash occurs before the completion message could be sent. 5.C (→P →M): A crash happens before the server could do anything. 6.C (→M →P): A crash happens before the server could do anything. Three events that can happen at the server: Send the completion message (M), Print the text (P), Crash (C).

5 Server Crashes Figure 8-8. Different combinations of client and server strategies in the presence of server crashes.

6 [4:] Detecting lost replies can be hard, because it can also be that the server had crashed. You don’t know whether the server has carried out the operation Solution: None, except that you can try to make your operations idempotent: repeatable without any harm done if it happened to be carried out before. [5:] Problem: The server is doing work and holding resources for nothing (called doing an orphan computation). Orphan is killed (or rolled back) by client when it reboots Broadcast new epoch number when recovering ⇒ servers kill orphans Require computations to complete in a T time units. Old ones are simply removed.

7 8.4 Reliable Group Communication Basic model: We have a multicast channel c with two (possibly overlapping) groups: The sender group SND(c) of processes that submit messages to channel c The receiver group RCV(c) of processes that can receive messages from channel c Simple reliability: If process P ∈ RCV(c) at the time message m was submitted to c, and P does not leave RCV(c), m should be delivered to P Atomic multicast: How can we ensure that a message m submitted to channel c is delivered to process P ∈ RCV(c) only if m is delivered to all members of RCV(c) Basic Reliable-Multicasting Schemes

8 Observation: If we can stick to a local-area network, reliable multicasting is “easy” Principle: Let the sender log messages submitted to channel c: If P sends message m, m is stored in a history buffer Each receiver acknowledges the receipt of m, or requests retransmission at P when noticing message lost Sender P removes m from history buffer when everyone has acknowledged receipt Question: Why doesn’t this scale?

9 Basic Reliable-Multicasting Schemes Figure 8-9. A simple solution to reliable multicasting when all receivers are known and are assumed not to fail. (a) Message transmission. (b) Reporting feedback.

10 Nonhierarchical Feedback Control Basic idea: Let a process P suppress its own feedback when it notices another process Q is already asking for a retransmission Assumptions: All receivers listen to a common feedback channel to which feedback messages are submitted Process P schedules its own feedback message randomly, and suppresses it when observing another feedback message Question: Why is the random schedule so important?

11 Hierarchical Feedback Control Basic solution: Construct a hierarchical feedback channel in which all submitted messages are sent only to the root. Intermediate nodes aggregate feedback messages before passing them on. Question: What’s the main problem with this solution? Observation: Intermediate nodes can easily be used for retransmission purposes

12 Atomic Multicast Figure The logical organization of a distributed system to distinguish between message receipt and message delivery. Virtual Synchrony

13 Virtual Synchrony Idea: Formulate reliable multicasting in the presence of process failures in terms of process groups and changes to group membership: Guarantee: A message is delivered only to the non-faulty members of the current group. All members should agree on the current group membership.

14 Virtual Synchrony Essence: We consider views V ⊆ RCV(c) ∪ SND(c) Processes are added or deleted from a view V through view changes to V ∗ ; a view change is to be executed locally by each P ∈ V ∩ V ∗ (1) For each consistent state, there is a unique view on which all its members agree. Note: implies that all non-faulty processes see all view changes in the same order (2) If message m is sent to V before a view change vc to V ∗, then either all P ∈ V that execute vc receive m, or no processes P ∈ V that execute vc receive m. Note: all non-faulty members in the same view get to see the same set of multicast messages. (3) A message sent to view V can be delivered only to processes in V, and is discarded by successive views A reliable multicast algorithm satisfying (1)–(3) is virtually synchronous

15 Virtual Synchrony A sender to a view V need not be member of V If a sender S ∈ V crashes, its multicast message m is flushed before S is removed from V: m will never be delivered after the point that S  V Note: Messages from S may still be delivered to all, or none (non-faulty) processes in V before they all agree on a new view to which S does not belong If a receiver P fails, a message m may be lost but can be recovered as we know exactly what has been received in V. Alternatively, we may decide to deliver m to members in V − {P} Observation: Virtually synchronous behavior can be seen independent from the ordering of message delivery. The only issue is that messages are delivered to an agreed upon group of receivers.

16 Message Ordering Four different orderings are distinguished: Unordered multicasts FIFO-ordered multicasts Causally-ordered multicasts Totally-ordered multicasts Figure Three communicating processes in the same group. Figure Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting

17 Implementing Virtual Synchrony (1) Figure Six different versions of virtually synchronous reliable multicasting.

18 Implementing Virtual Synchrony (2) Figure (a) Process 4 notices that process 7 has crashed and sends a view change. (b) Process 6 sends out all its unstable messages, followed by a flush message. (c) Process 6 installs the new view when it has received a flush message from everyone else.

Distributed Commit Two-Phase Commit Model: The client who initiated the computation acts as coordinator; processes required to commit are the participants Phase 1a: Coordinator sends vote-request to participants (also called a pre-write) Phase 1b: When participant receives vote-request it returns either vote-commit or vote-abort to coordinator. If it sends vote- abort, it aborts its local computation Phase 2a: Coordinator collects all votes; if all are vote-commit, it sends global-commit to all participants, otherwise it sends global-abort Phase 2b: Each participant waits for global-commit or global- abort and handles accordingly.

20 Two-Phase Commit Figure (a) The finite state machine for the coordinator in 2PC. (b) The finite state machine for a participant.

21 Two-Phase Commit - Failing Participant Observation: Consider participant crash in one of its states, and the subsequent recovery to that state: Initial state: No problem, as participant was unaware of the protocol Ready state: Participant is waiting to either commit or abort. After recovery, participant needs to know which state transition it should make ⇒ log the coordinator’s decision Abort state: Merely make entry into abort state idempotent, e.g., removing the workspace of results Commit state: Also make entry into commit state idempotent, e.g., copying workspace to storage. Observation: When distributed commit is required, having participants use temporary workspaces to keep their results allows for simple recovery in the presence of failures.

22 Two-Phase Commit - Failing Participant Alternative: When a recovery is needed to the READY state, check what the other participants are doing. This approach avoids having to log the coordinator’s decision. Assume recovering participant P contacts another participant Q: Result: If all participants are in the READY state, the protocol blocks. Apparently, the coordinator is failing. Note: The protocol prescribes that we need the decision from the coordinator.

23 Two-Phase Commit - Failing Coordinator Observation: The real problem lies in the fact that the coordinator’s final decision may not be available for some time (or actually lost). Alternative: Let a participant P in the READY state timeout when it hasn’t received the coordinator’s decision; P tries to find out what other participants know (as discussed). Observation: Essence of the problem is that a recovering participant cannot make a local decision: it is dependent on other (possibly failed) processes

24 Two-Phase Commit Figure Outline of the steps taken by the coordinator in a two-phase commit protocol.

25 Two-Phase Commit Figure (a) The steps taken by a participant process in 2PC.

26 Two-Phase Commit Figure (b) The steps for handling incoming decision requests..

27 Three-Phase Commit The states of the coordinator and each participant satisfy the following two conditions: 1.There is no single state from which it is possible to make a transition directly to either a COMMIT or an ABORT state. 2.There is no state in which it is not possible to make a final decision, and from which a transition to a COMMIT state can be made.

28 Three-Phase Commit Phase 1a: Coordinator sends vote-request to participants Phase 1b: When participant receives vote-request it returns either vote-commit or vote-abort to coordinator. If it sends vote-abort, it aborts its local computation Phase 2a: Coordinator collects all votes; if all are vote-commit, it sends prepare-commit to all participants, otherwise it sends global- abort, and halts Phase 2b: Each participant waits for prepare-commit, or waits for global-abort after which it halts Phase 3a: (Prepare to commit) Coordinator waits until all participants have sent ready-commit, and then sends global-commit to all Phase 3b: (Prepare to commit) Participant waits for global-commit

29 Three-Phase Commit Figure (a) The finite state machine for the coordinator in 3PC. (b) The finite state machine for a participant.

30 Three-Phase Commit - Failing Participant Basic issue: Can P find out what it should it do after crashing in the ready or pre-commit state, even if other participants or the coordinator failed? Essence: Coordinator and participants on their way to commit, never differ by more than one state transition Consequence: If a participant timeouts in ready state, it can find out at the coordinator or other participants whether it should abort, or enter pre-commit state Observation: If a participant already made it to the pre- commit state, it can always safely commit (but is not allowed to do so for the sake of failing other processes) Observation: We may need to elect another coordinator to send off the final COMMIT

Recovery Introduction Checkpointing Message Logging

32 Recovery Introduction Essence: When a failure occurs, we need to bring the system into an error-free state: Forward error recovery: Find a new state from which the system can continue operation Backward error recovery: Bring the system back into a previous error-free state Practice: Use backward error recovery, requiring that we establish recovery points Observation: Recovery in distributed systems is complicated by the fact that processes need to cooperate in identifying a consistent state from where to recover

33 Stable Storage After a crash: If both disks are identical: you’re in good shape. If one is bad, but the other is okay (checksums): choose the good one. If both seem okay, but are different: choose the main disk. If both aren’t good: you’re not in a good shape.

34 Checkpointing Requirement: Every message that has been received is also shown to have been sent in the state of the sender Recovery line: Assuming processes regularly checkpoint their state, the most recent consistent global checkpoint. Observation: If and only if the system provides reliable communication, should sent messages also be received in a consistent state

35 Cascaded Rollback Figure The domino effect. Observation: If checkpointing is done at the “wrong” instants, the recovery line may lie at system startup time ⇒ cascaded rollback

36 Independent Checkpointing Essence: Each process independently takes checkpoints, with the risk of a cascaded rollback to system startup. Let CP i (m) denote mth checkpoint of process P i and INT i (m) the interval between CP i (m − 1) and CP i (m) When process P i sends a message in interval INT i (m), it piggybacks (i,m) When process P j receives a message in interval INT j (n), it records the dependency INT i (m)→INT j (n) The dependency INT i (m)→INT j (n) is saved to stable storage when taking checkpoint CP j (n) Observation: If process P i rolls back to CP i (m−1), P j must roll back to CP j (n−1).

37 Coordinated Checkpointing Essence: Each process takes a checkpoint after a globally coordinated action Question: What advantages are there to coordinated checkpointing? Simple solution: Use a two-phase blocking protocol: A coordinator multicasts a checkpoint request message When a participant receives such a message, it takes a checkpoint, stops sending (application) messages, and reports back that it has taken a checkpoint When all checkpoints have been confirmed at the coordinator, the latter broadcasts a checkpoint done message to allow all processes to continue Observation: It is possible to consider only those processes that depend on the recovery of the coordinator, and ignore the rest

38 Message Logging Alternative: Instead of taking an (expensive) checkpoint, try to replay your (communication) behavior from the most recent checkpoint ⇒ store messages in a log Assumption: We assume a piecewise deterministic execution model: The execution of each process can be considered as a sequence of state intervals Each state interval starts with a nondeterministic event (e.g., message receipt) Execution in a state interval is deterministic Conclusion: If we record nondeterministic events (to replay them later), we obtain a deterministic execution model that will allow us to do a complete replay

39 Message Logging and Consistency Problem: When should we actually log messages? Issue: Avoid orphans: Process Q has just received and subsequently delivered messages m1 and m2 Assume that m2 is never logged. After delivering m1 and m2, Q sends message m3 to process R Process R receives and subsequently delivers m3 Goal: Devise message logging schemes in which orphans do not occur

40 Message-Logging Scheme HDR[m]: The header of message m containing its source, destination, sequence number, and delivery number The header contains all information for resending a message and delivering it in the correct order (assume data is reproduced by the application) A message m is stable if HDR[m] cannot be lost (e.g., because it has been written to stable storage) DEP[m]: The set of processes to which message m has been delivered, as well as any message that causally depends on delivery of m COPY[m]: The set of processes that have a copy of HDR[m] in their volatile memory If C is a collection of crashed processes, then Q  C is an orphan if there is a message m such that Q ∈ DEP[m] and COPY[m] ⊆ C

41 Message-Logging Scheme Goal: No orphans means that for each message m, DEP[m] ⊆ COPY[m] Pessimistic protocol: for each nonstable message m, there is at most one process dependent on m, that is |DEP[m]| ≤ 1 Consequence: An unstable message in a pessimistic protocol must be made stable before sending a next message Optimistic protocol: for each unstable message m, we ensure that if COPY[m] ⊆ C, then eventually also DEP[m] ⊆ C, where C denotes a set of processes that have been marked as faulty Consequence: To guarantee that DEP[m] ⊆ C, we generally rollback each orphan process Q until Q  DEP[m]