Fault Tolerance Prof. Orhan Gemikonakli


Fault Tolerance Prof. Orhan Gemikonakli Module Leader: Prof. Leonardo Mostarda Università di Camerino Prof. Orhan Gemikonakli - Camerino,

Last lecture
  Replica management
    Permanent replicas
    Server-initiated replicas
    Client-initiated replicas
  Pull versus push protocols
  Consistency protocols

Outline
  Basic concepts
  Fault modelling
  Failure masking
  RPC Semantics in the Presence of Failures
  Reliable multicast schemes
  Virtual Synchrony
  Two-phase commit
  Three-phase commit
  Checkpointing and Message logging
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. 0-13-239227-5

Learning outcome
  To understand the basic concepts related to fault tolerance
    Availability
    Reliability
    Safety
  To describe and discuss process resilience
  To understand reliable client-server communication
  To understand two- and three-phase commit

Fault Tolerance Basic Concepts
Being fault tolerant is strongly related to what are called dependable systems.
Dependability implies the following:
  Availability
  Reliability
  Safety
  Maintainability


RMA & Safety
Availability vs reliability
If a system goes down for one millisecond every hour, it has an availability of over 99.9999 percent, but is still highly unreliable. Similarly, a system that never crashes but is shut down for two weeks every August has high reliability but only 96 percent availability.
Five nines (5 9s): 99.999% availability
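
The two figures quoted on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper name `availability` and the 365-day year are assumptions, not part of the slides.

```python
def availability(downtime: float, period: float) -> float:
    """Fraction of `period` during which the system is up (same unit for both)."""
    return 1.0 - downtime / period

# Down for one millisecond every hour (units: milliseconds).
ms_per_hour = 60 * 60 * 1000
blip = availability(1, ms_per_hour)

# Shut down for two weeks every year (units: days; a 365-day year is assumed).
august = availability(14, 365)

# Yearly downtime budget, in minutes, that "five nines" allows.
five_nines_budget = (1 - 0.99999) * 365 * 24 * 60

print(f"1 ms per hour:    {blip:.5%}")
print(f"2 weeks per year: {august:.1%}")
print(f"five nines:       {five_nines_budget:.2f} minutes of downtime per year")
```

Availability only measures the fraction of time the system is up; it says nothing about how often it goes down, which is why the first system is highly available yet unreliable.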

RMA & Safety
Safety refers to the situation that when a system temporarily fails to operate correctly, nothing catastrophic happens. For example, many process control systems, such as those used for controlling nuclear power plants or sending people into space, are required to provide a high degree of safety. If such control systems fail even for a very brief moment, the effects could be disastrous.
Maintainability refers to how easily a failed system can be repaired. A highly maintainable system may also show a high degree of availability, especially if failures can be detected and repaired automatically. (MTBSO)

RMA & Safety
We discussed Reliability, Maintainability, Availability, and Safety.
Security?

Reliability
Loss of reliability may lead to:
  Loss of revenue/customers
  Unrecoverable information or situation
  Loss of sensitive data
  Loss of life
Improving reliability

More on reliability
Overall reliability of components:
  Connected in series
  With redundancy
  In series and redundancy
Component-level redundancy versus system-level redundancy
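
A minimal sketch of the standard series and redundancy formulas may help here. It assumes that components fail independently and that a redundant set works as long as at least one member works; the slides do not state these assumptions, and the value 0.9 and the three-component chain are chosen only for illustration.

```python
from math import prod

def series(reliabilities):
    """All components must work: R = product of R_i."""
    return prod(reliabilities)

def parallel(reliabilities):
    """At least one replica must work: R = 1 - product of (1 - R_i)."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

r = 0.9
chain = [r, r, r]                     # three components connected in series

no_redundancy = series(chain)

# System-level redundancy: duplicate the whole chain.
system_level = parallel([series(chain), series(chain)])

# Component-level redundancy: duplicate each component, then chain the pairs.
component_level = series([parallel([x, x]) for x in chain])

print(f"no redundancy:   {no_redundancy:.6f}")
print(f"system level:    {system_level:.6f}")
print(f"component level: {component_level:.6f}")
```

Under these assumptions, duplicating each component gives a higher overall reliability than duplicating the whole chain, although both use the same number of spare parts.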

Failure Models
Figure 8-1. Different types of failures.

Failure Masking by Redundancy
Figure 8-2. Triple modular redundancy (TMR).
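
The idea behind triple modular redundancy can be sketched as three replicas feeding a majority voter. This is a simplified illustration: it uses a single voter per stage (in full TMR the voters are replicated as well), and the names `vote` and `tmr_stage` are hypothetical.

```python
from collections import Counter

def vote(inputs):
    """Return the value at least two of the three inputs agree on, else None."""
    value, count = Counter(inputs).most_common(1)[0]
    return value if count >= 2 else None

def tmr_stage(compute, x, faulty=None):
    """Run three replicas of `compute`; replica `faulty` (0..2) emits garbage."""
    outputs = [compute(x) for _ in range(3)]
    if faulty is not None:
        outputs[faulty] = object()    # an arbitrary wrong value
    return vote(outputs)

# One faulty replica is masked by the other two.
print(tmr_stage(lambda v: v + 1, 41, faulty=1))
```

A single faulty replica is outvoted; if two replicas of the same stage fail in different ways, the voter has no majority and the failure is no longer masked.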

Process resilience
Achieving fault tolerance in distributed systems
Protection against process failures
Group similar processes:
  All receive the same message
  All run the same procedures
  A collective response is returned
Grouping processes:
  Flat groups: all members are equal
  Hierarchical groups: have a coordinator

Flat Groups versus Hierarchical Groups
Figure 8-3. (a) Communication in a flat group. (b) Communication in a simple hierarchical group.

Groups
Flat groups
  All members are equal
  No single point of failure
  If one member fails, the group continues as a smaller one
  Decision making is more complicated (voting: delay + overhead)
Hierarchical groups
  Single point of failure (coordinator)
  As long as it is healthy, the coordinator makes the decisions

Agreement in Faulty Systems (1)
Possible cases:
  Synchronous versus asynchronous systems.
  Communication delay is bounded or not.
  Message delivery is ordered or not.
  Message transmission is done through unicasting or multicasting.

Agreement in Faulty Systems (2)
Figure 8-4. Circumstances under which distributed agreement can be reached.

Agreement in Faulty Systems (3)
Figure 8-5. The Byzantine agreement problem for three nonfaulty and one faulty process. (a) Each process sends its value to the others.

Agreement in Faulty Systems (4)
Figure 8-5. The Byzantine agreement problem for three nonfaulty and one faulty process. (b) The vectors that each process assembles based on (a). (c) The vectors that each process receives in step 3.

Agreement in Faulty Systems (5)
Figure 8-6. The same as Fig. 8-5, except now with two correct processes and one faulty process.
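
The exchange pictured in Figures 8-5 and 8-6 can be simulated in a few lines. The sketch below is an illustration under stated assumptions, not the textbook's code: faulty processes send a different arbitrary value every time (drawn from a counter starting at 100), a majority means more than half of the received vectors, and the names `byzantine_round` and `UNKNOWN` are hypothetical.

```python
from collections import Counter
from itertools import count

UNKNOWN = None

def byzantine_round(values, faulty):
    """values: pid -> private value; faulty: set of pids that lie arbitrarily."""
    pids = sorted(values)
    lie = count(100)                  # stream of distinct arbitrary values

    # Steps 1-2: every process sends its value; each assembles a vector.
    vectors = {
        p: {q: values[q] if (q == p or q not in faulty) else next(lie)
            for q in pids}
        for p in pids
    }

    # Steps 3-4: vectors are exchanged; each nonfaulty process takes, per
    # element, the majority over the vectors it received from the others.
    results = {}
    for p in pids:
        if p in faulty:
            continue
        received = [
            {k: next(lie) for k in pids} if q in faulty else vectors[q]
            for q in pids if q != p
        ]
        result = {}
        for k in pids:
            value, n = Counter(v[k] for v in received).most_common(1)[0]
            result[k] = value if n > len(received) / 2 else UNKNOWN
        results[p] = result
    return results

four = byzantine_round({1: 1, 2: 2, 3: 3, 4: 4}, faulty={3})
three = byzantine_round({1: 1, 2: 2, 3: 3}, faulty={3})
print(four)
print(three)
```

With four processes and one traitor, the three correct processes end up with the same result vector; with only three processes, no element reaches a majority and agreement fails.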

RPC Semantics in the Presence of Failures
Five different classes of failures that can occur in RPC systems:
  1. The client is unable to locate the server.
  2. The request message from the client to the server is lost.
  3. The server crashes after receiving a request.
  4. The reply message from the server to the client is lost.
  5. The client crashes after sending a request.

Server Crashes (1)
Figure 8-7. A server in client-server communication. (a) The normal case. (b) Crash after execution. (c) Crash before execution.

Server Crashes (2)
Three events that can happen at the server:
  Send the completion message (M)
  Print the text (P)
  Crash (C)

Server Crashes (3)
These events can occur in six different orderings:
  M →P →C: A crash occurs after sending the completion message and printing the text.
  M →C (→P): A crash happens after sending the completion message, but before the text could be printed.
  P →M →C: A crash occurs after sending the completion message and printing the text.
  P →C (→M): The text is printed, after which a crash occurs before the completion message could be sent.
  C (→P →M): A crash happens before the server could do anything.
  C (→M →P): A crash happens before the server could do anything.

Server Crashes (4)
Figure 8-8. Different combinations of client and server strategies in the presence of server crashes.
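
The combinations behind Figure 8-8 can be enumerated mechanically. The sketch below is a reconstruction under stated assumptions rather than a copy of the figure: the four client reissue strategies are the ones usually discussed with this example, a reissued request is assumed to run to completion on the recovered server, and the labels ZERO, OK and DUP mean the text is printed zero times, once, or twice.

```python
# Events that happen before the crash, per server strategy
# (M = send completion message, P = print text).
SERVER_STRATEGIES = {
    "M -> P": ["MP", "M", ""],   # MPC, MC(P), C(MP)
    "P -> M": ["PM", "P", ""],   # PMC, PC(M), C(PM)
}

# Client strategies: reissue the request or not, given whether M arrived.
CLIENT_STRATEGIES = {
    "always": lambda acked: True,
    "never": lambda acked: False,
    "only when ACKed": lambda acked: acked,
    "only when not ACKed": lambda acked: not acked,
}

def outcome(events_before_crash, reissue):
    acked = "M" in events_before_crash
    prints = events_before_crash.count("P")
    if reissue(acked):
        prints += 1   # assumption: the reissued request runs to completion
    return ("ZERO", "OK", "DUP")[prints]

def table():
    return {
        (client, server): [outcome(e, reissue) for e in sequences]
        for client, reissue in CLIENT_STRATEGIES.items()
        for server, sequences in SERVER_STRATEGIES.items()
    }

for (client, server), row in table().items():
    print(f"{client:20} {server}: {row}")
```

Every client strategy has at least one ordering in which the text is lost or printed twice, which is why exactly-once semantics cannot be guaranteed by the choice of strategy alone.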

Basic Reliable-Multicasting Schemes
Figure 8-9. A simple solution to reliable multicasting when all receivers are known and are assumed not to fail. (a) Message transmission. (b) Reporting feedback.

Nonhierarchical Feedback Control
Figure 8-10. Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.

Hierarchical Feedback Control
Figure 8-11. The essence of hierarchical reliable multicasting. Each local coordinator forwards the message to its children and later handles retransmission requests.

Virtual Synchrony (1)
Figure 8-12. The logical organization of a distributed system to distinguish between message receipt and message delivery.

Virtual Synchrony (2)
Figure 8-13. The principle of virtual synchronous multicast.

Message Ordering (1)
Four different orderings are distinguished:
  Unordered multicasts
  FIFO-ordered multicasts
  Causally-ordered multicasts
  Totally-ordered multicasts

Message Ordering (2)
Figure 8-14. Unordered multicast - three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.

Message Ordering (3)
Figure 8-15. Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting.
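
One common way to obtain FIFO-ordered delivery is a hold-back queue keyed by per-sender sequence numbers. The sketch below assumes that each sender numbers its messages from 0; the class name `FifoReceiver` and the message labels are hypothetical.

```python
class FifoReceiver:
    """Delivers each sender's messages in the order they were sent."""

    def __init__(self):
        self.next_seq = {}    # sender -> next sequence number to deliver
        self.held = {}        # (sender, seq) -> message received too early
        self.delivered = []   # messages handed to the application, in order

    def receive(self, sender, seq, msg):
        self.held[(sender, seq)] = msg
        expected = self.next_seq.get(sender, 0)
        # Deliver as many consecutive messages from this sender as possible.
        while (sender, expected) in self.held:
            self.delivered.append(self.held.pop((sender, expected)))
            expected += 1
        self.next_seq[sender] = expected

r = FifoReceiver()
r.receive("P1", 1, "m2")   # arrives early, is held back
r.receive("P4", 0, "m3")   # different sender, delivered at once
r.receive("P1", 0, "m1")   # releases m1 and then m2
print(r.delivered)
```

Messages from different senders may still be interleaved differently at different receivers; FIFO ordering only constrains messages that come from the same sender.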

Message ordering (4)
Reliable causal delivery
  Observes the causal relationship between messages when delivering them to the application layer
  This is done regardless of whether the messages come from the same sender or from different ones
  The communication layer holds on to a message until every message that causally precedes it has been delivered
  Vector timestamps are used to identify the order
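
The hold-back rule for causal delivery can be sketched with vector timestamps. The version below assumes the usual convention that entry j of a timestamp counts the messages multicast by process j; the class name `CausalReceiver` and the example messages are hypothetical.

```python
class CausalReceiver:
    """Holds back a message until its causal predecessors are delivered."""

    def __init__(self, n):
        self.clock = [0] * n   # clock[j] = messages from process j delivered
        self.pending = []      # (sender, timestamp, message) held back
        self.delivered = []

    def _deliverable(self, sender, ts):
        # Next message from `sender`, and nothing it depends on is missing.
        return ts[sender] == self.clock[sender] + 1 and all(
            ts[k] <= self.clock[k] for k in range(len(ts)) if k != sender
        )

    def receive(self, sender, ts, msg):
        self.pending.append((sender, ts, msg))
        progress = True
        while progress:        # one delivery may unblock held messages
            progress = False
            for item in list(self.pending):
                s, t, m = item
                if self._deliverable(s, t):
                    self.pending.remove(item)
                    self.clock[s] += 1
                    self.delivered.append(m)
                    progress = True

# P0 multicasts m1; P1 delivers m1 and then multicasts m2, so m1 precedes m2.
p2 = CausalReceiver(3)
p2.receive(1, [1, 1, 0], "m2")   # arrives first, but depends on m1
p2.receive(0, [1, 0, 0], "m1")
print(p2.delivered)
```

Messages that are not causally related carry incomparable timestamps and may be delivered in either order.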

Message ordering (5)
Totally-ordered delivery
  Message delivery can be unordered, FIFO-ordered, or causally ordered
  In addition, messages are delivered in the same order to all members
  Virtually synchronous reliable multicasting with totally-ordered delivery is known as atomic multicasting

Implementing Virtual Synchrony (1)
Figure 8-16. Six different versions of virtually synchronous reliable multicasting.

Implementing Virtual Synchrony (2)
Messages are kept at the communication layer
  Received by all group members: stable message
  Not yet received by all group members: unstable message

Implementing Virtual Synchrony (3)
Figure 8-17. (a) Process 4 notices that process 7 has crashed and sends a view change.

Implementing Virtual Synchrony (4)
Figure 8-17. (b) Process 6 sends out all its unstable messages, followed by a flush message.

Implementing Virtual Synchrony (5)
Figure 8-17. (c) Process 6 installs the new view when it has received a flush message from everyone else.

Distributed Commit
Either all group members perform an operation or none of them does
A coordinator manages the commit
Example: atomic multicasting
Three variations:
  One-phase commit protocol
  Two-phase commit (2PC)
  Three-phase commit (3PC)
One-phase commit: if one member fails to commit, it has no way to tell the coordinator.

Two-Phase Commit (1)
Figure 8-18. (a) The finite state machine for the coordinator in 2PC. (b) The finite state machine for a participant.

Two-Phase Commit (2)
Figure 8-19. Actions taken by a participant P when residing in state READY and having contacted another participant Q.
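
The decision rule summarised by Figure 8-19 can be written as a small lookup. The sketch below follows the standard 2PC recovery argument and is not taken from the slide itself; the function names and the `CONTACT_ANOTHER` and `BLOCK` labels are hypothetical.

```python
def decide_from_peer(q_state):
    """Action for a participant P stuck in READY, given peer Q's state."""
    if q_state == "COMMIT":
        return "COMMIT"            # Q has already seen the global commit
    if q_state == "ABORT":
        return "ABORT"             # Q has already seen the global abort
    if q_state == "INIT":
        return "ABORT"             # Q never voted, so no commit was decided
    if q_state == "READY":
        return "CONTACT_ANOTHER"   # Q is as uncertain as P
    raise ValueError(f"unknown state: {q_state}")

def resolve(peer_states):
    """Ask peers one by one until some peer allows a decision."""
    for state in peer_states:
        action = decide_from_peer(state)
        if action != "CONTACT_ANOTHER":
            return action
    return "BLOCK"   # every peer is READY: wait for the coordinator

print(resolve(["READY", "INIT"]))
print(resolve(["READY", "READY"]))
```

The last case is the weakness of 2PC: if all participants are in READY and the coordinator is down, nobody can decide and the protocol blocks until the coordinator recovers.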

Two-Phase Commit (3)
Figure 8-20. Outline of the steps taken by the coordinator in a two-phase commit protocol (first part).

Two-Phase Commit (4)
Figure 8-20. Outline of the steps taken by the coordinator in a two-phase commit protocol (continued).
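
A minimal in-memory sketch of the coordinator's side of 2PC is given below. It is an illustration, not the outline of Figure 8-20 itself: message passing is replaced by direct method calls, a timeout is modelled by raising `TimeoutError`, and the `Participant` class, its method names, and the list used as a log are all hypothetical.

```python
class Participant:
    """Hypothetical participant: votes when asked, then obeys the decision."""

    def __init__(self, name, ballot="VOTE_COMMIT", reachable=True):
        self.name, self.ballot, self.reachable = name, ballot, reachable
        self.state = "INIT"

    def vote_request(self):
        if not self.reachable:
            raise TimeoutError(self.name)     # stands in for a lost reply
        self.state = "READY" if self.ballot == "VOTE_COMMIT" else "ABORT"
        return self.ballot

    def decide(self, decision):
        if self.reachable:
            self.state = "COMMIT" if decision == "GLOBAL_COMMIT" else "ABORT"

def coordinator(participants, log):
    log.append("START_2PC")
    decision = "GLOBAL_COMMIT"
    for p in participants:                    # phase 1: collect votes
        try:
            if p.vote_request() != "VOTE_COMMIT":
                decision = "GLOBAL_ABORT"
        except TimeoutError:
            decision = "GLOBAL_ABORT"         # a missing vote forces an abort
            break
    log.append(decision)                      # log the decision before sending
    for p in participants:                    # phase 2: announce the decision
        p.decide(decision)
    return decision

log = []
group = [Participant("a"), Participant("b")]
print(coordinator(group, log), log)
```

Writing the decision to the log before announcing it is what lets a recovering coordinator resend the same decision instead of inventing a new one.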

Two-Phase Commit (5)
Figure 8-21. (a) The steps taken by a participant process in 2PC.

Two-Phase Commit (7)
Figure 8-21. (b) The steps for handling incoming decision requests.

Three-Phase Commit (1)
The states of the coordinator and each participant satisfy the following two conditions:
  1. There is no single state from which it is possible to make a transition directly to either a COMMIT or an ABORT state.
  2. There is no state in which it is not possible to make a final decision, and from which a transition to a COMMIT state can be made.

Three-Phase Commit (2)
Figure 8-22. (a) The finite state machine for the coordinator in 3PC. (b) The finite state machine for a participant.
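
The first of the two 3PC conditions can be checked mechanically on a transition table. The tables below are written down from the usual description of the 2PC and 3PC state machines and should be treated as assumptions to be compared against Figures 8-18 and 8-22; the function name is hypothetical, and the second condition is not checked here.

```python
# state -> set of states reachable in one transition
TWO_PC_COORDINATOR = {"INIT": {"WAIT"}, "WAIT": {"ABORT", "COMMIT"}}
TWO_PC_PARTICIPANT = {"INIT": {"READY", "ABORT"}, "READY": {"ABORT", "COMMIT"}}

THREE_PC_COORDINATOR = {
    "INIT": {"WAIT"},
    "WAIT": {"ABORT", "PRECOMMIT"},
    "PRECOMMIT": {"COMMIT"},
}
THREE_PC_PARTICIPANT = {
    "INIT": {"READY", "ABORT"},
    "READY": {"ABORT", "PRECOMMIT"},
    "PRECOMMIT": {"COMMIT"},
}

def states_violating_condition_1(fsm):
    """States with direct transitions to both COMMIT and ABORT."""
    return sorted(s for s, nxt in fsm.items() if {"COMMIT", "ABORT"} <= nxt)

print(states_violating_condition_1(TWO_PC_COORDINATOR))
print(states_violating_condition_1(THREE_PC_COORDINATOR))
```

The extra PRECOMMIT state is what separates the commit path from the abort path: in 2PC a single state leads directly to both outcomes, in 3PC none does.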

Recovery – Stable Storage
Figure 8-23. (a) Stable storage. (b) Crash after drive 1 is updated. (c) Bad spot.
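
The two failure cases in Figure 8-23 can be illustrated with a toy two-drive store. This is a sketch under assumptions: blocks live in Python lists instead of on disk, a CRC-32 stands in for the drive's error-detecting code, and the class and method names are hypothetical.

```python
import zlib

def make_block(data: bytes) -> dict:
    return {"data": data, "crc": zlib.crc32(data)}

def is_good(block: dict) -> bool:
    return zlib.crc32(block["data"]) == block["crc"]

class StableStorage:
    def __init__(self, nblocks: int):
        self.drive1 = [make_block(b"") for _ in range(nblocks)]
        self.drive2 = [make_block(b"") for _ in range(nblocks)]

    def write(self, i, data, crash_after_drive1=False):
        self.drive1[i] = make_block(data)     # drive 1 is always updated first
        if crash_after_drive1:
            return                            # simulated crash, case (b)
        self.drive2[i] = make_block(data)

    def recover(self):
        for i, (b1, b2) in enumerate(zip(self.drive1, self.drive2)):
            if is_good(b1) and (not is_good(b2) or b1 != b2):
                self.drive2[i] = dict(b1)     # drive 1 holds the newer copy
            elif is_good(b2) and not is_good(b1):
                self.drive1[i] = dict(b2)     # bad spot on drive 1, case (c)

store = StableStorage(2)
store.write(0, b"old")
store.write(0, b"new", crash_after_drive1=True)   # drives now disagree
store.write(1, b"x")
store.drive1[1]["data"] = b"garbled"              # simulated bad spot
store.recover()
print(store.drive2[0]["data"], store.drive1[1]["data"])
```

Because drive 1 is always written first, a disagreement between two blocks that both pass the checksum can only mean that drive 1 holds the newer value.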

Checkpointing
Figure 8-24. A recovery line.

Independent Checkpointing
Figure 8-25. The domino effect.

Characterizing Message-Logging Schemes
Figure 8-26. Incorrect replay of messages after recovery, leading to an orphan process.

Summary
  Basic concepts
  Fault modelling
  Failure masking
  RPC Semantics in the Presence of Failures
  Reliable multicast schemes
  Virtual Synchrony
  Two-phase commit
  Three-phase commit
  Checkpointing and Message logging

Next Lecture
Distributed Object-based Systems
  Architecture
  Processes
  Communication
  Naming
  Synchronization
  Consistency and Replication
  Fault Tolerance