Fault Tolerance Prof. Orhan Gemikonakli

Module Leader: Prof. Leonardo Mostarda Università di Camerino Prof. Orhan Gemikonakli - Camerino,

Last lecture
- Replica management
- Permanent replicas
- Server-initiated replicas
- Client-initiated replicas
- Pull versus push protocols
- Consistency protocols

Outline
- Basic concepts
- Fault modelling
- Failure masking
- RPC semantics in the presence of failures
- Reliable multicast schemes
- Virtual synchrony
- Two-phase commit
- Three-phase commit
- Checkpointing and message logging
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc.

Learning outcomes
- To understand the basic concepts related to fault tolerance: availability, reliability, safety
- To describe and discuss process resilience
- To understand reliable client-server communication
- To understand two- and three-phase commit

Fault Tolerance Basic Concepts
Being fault tolerant is strongly related to what are called dependable systems. Dependability implies the following:
- Availability
- Reliability
- Safety
- Maintainability

RMA & Safety
Availability vs reliability
- If a system goes down for one millisecond every hour, it has an availability of over 99.9999 percent, but is still highly unreliable.
- Similarly, a system that never crashes but is shut down for two weeks every August has high reliability but only 96 percent availability.
- "Five nines" (5 9s): 99.999% availability
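The two examples on this slide can be checked with a little arithmetic. The sketch below assumes availability is simply the fraction of time the system is up, and that "two weeks every August" means 2 weeks out of 52; the helper name `availability` is made up for illustration.

```python
def availability(downtime, period):
    """Fraction of the period during which the system is up (same units for both)."""
    return 1.0 - downtime / period

# One millisecond of downtime in every hour (both values in seconds).
ms_per_hour = availability(0.001, 3600.0)

# Two weeks of planned shutdown in a 52-week year.
august_shutdown = availability(2, 52)

# Downtime budget per year, in minutes, for "five nines" availability.
five_nines_downtime_min = (1 - 0.99999) * 365 * 24 * 60

print(f"1 ms/hour outage : {ms_per_hour:.7%}")
print(f"2 weeks/year off : {august_shutdown:.1%}")
print(f"five nines allows about {five_nines_downtime_min:.2f} min of downtime per year")
```

The first system is almost always available yet fails every hour; the second never fails unexpectedly yet is unavailable for about 4 percent of the year.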

RMA & Safety
- Safety refers to the situation that when a system temporarily fails to operate correctly, nothing catastrophic happens. For example, many process control systems, such as those used for controlling nuclear power plants or sending people into space, are required to provide a high degree of safety. If such control systems temporarily fail for only a very brief moment, the effects could be disastrous.
- Maintainability refers to how easily a failed system can be repaired. A highly maintainable system may also show a high degree of availability, especially if failures can be detected and repaired automatically. (MTBSO)

RMA & Safety
We discussed Reliability, Maintainability, Availability, and Safety. Security?

Reliability
Loss of reliability may lead to:
- Loss of revenue/customers
- Unrecoverable information or situation
- Loss of sensitive data
- Loss of life
Improving reliability

More on reliability
Overall reliability of components:
- Connected in series
- With redundancy
- In series and redundancy
Component-level redundancy versus system-level redundancy
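The slide's formulas are not in the transcript, so the following is a minimal sketch using the standard textbook results, assuming components fail independently: a series chain works only if every component works, and a redundant (parallel) group fails only if every member fails. The function names and the 0.9 reliability figure are illustrative assumptions.

```python
from math import prod

def series(reliabilities):
    """All components must work: R = product of R_i."""
    return prod(reliabilities)

def redundant(reliabilities):
    """At least one component must work: R = 1 - product of (1 - R_i)."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

r = 0.9  # assumed reliability of a single component

# Two components in series, no redundancy.
plain = series([r, r])

# System-level redundancy: duplicate the whole two-component chain.
system_level = redundant([series([r, r]), series([r, r])])

# Component-level redundancy: duplicate each component, then chain the pairs.
component_level = series([redundant([r, r]), redundant([r, r])])

print(plain, system_level, component_level)
```

Under these assumptions, duplicating each component gives a higher overall reliability than duplicating the whole chain, because a single failure in each copy of the chain no longer brings the system down.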

Failure Models
Figure 8-1. Different types of failures.

Failure Masking by Redundancy
Figure 8-2. Triple modular redundancy (TMR).
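The figure itself is not in the transcript, but the core of TMR is a majority voter placed after three replicated units. A minimal sketch follows, assuming replica outputs are hashable values; the replica functions and the name `vote` are hypothetical.

```python
from collections import Counter

def vote(a, b, c):
    """Return the value produced by at least two of the three replicas."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: more than one replica is faulty")
    return value

# Three replicas of the same stage; the third one is faulty (illustrative).
def good_replica(x):
    return x + 1

def faulty_replica(x):
    return -999

masked = vote(good_replica(41), good_replica(41), faulty_replica(41))
print(masked)
```

A single faulty replica is outvoted, so its failure is masked. Two simultaneous faults with different outputs leave no majority, which the sketch reports as an error.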

Process resilience
Achieving fault tolerance in distributed systems
- Protection against process failures
- Group similar processes
  - All receive the same message
  - All run the same procedures
  - A collective response is returned
Grouping processes:
- Flat groups – all members are equal
- Hierarchical groups – have a coordinator

Flat Groups versus Hierarchical Groups
Figure 8-3. (a) Communication in a flat group. (b) Communication in a simple hierarchical group.

Groups
Flat groups
- All members are equal
- No single point of failure
- If one member fails, the group continues as a smaller one
- Decision making is more complicated (voting: delay + overhead)
Hierarchical groups
- Single point of failure (coordinator)
- As long as it is healthy, the coordinator makes the decisions

Agreement in Faulty Systems (1)
Possible cases:
- Synchronous versus asynchronous systems.
- Communication delay is bounded or not.
- Message delivery is ordered or not.
- Message transmission is done through unicasting or multicasting.

Figure 8-4. Circumstances under which distributed agreement can be reached.

Figure 8-5. The Byzantine agreement problem for three nonfaulty and one faulty process. (a) Each process sends its value to the others.

Figure 8-5. The Byzantine agreement problem for three nonfaulty and one faulty process. (b) The vectors that each process assembles based on (a). (c) The vectors that each process receives in step 3.

Figure 8-6. The same as Fig. 8-5, except now with two correct processes and one faulty process.
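The exchange described in the captions for Figures 8-5 and 8-6 can be simulated in a few lines. This is a sketch under stated assumptions, not the textbook's code: each correct process i holds the value i, the faulty process sends a different arbitrary value in every message, and a correct process takes a per-element majority over the vectors it receives from the others. The function name and the "lie" values (integers from 1000 up) are invented for illustration.

```python
from collections import Counter
from itertools import count

UNKNOWN = None

def byzantine_agreement(n, faulty):
    """Simulate the vector exchange for processes 1..n, with `faulty` a set of ids."""
    lies = count(1000)              # every lie is a fresh, distinct value
    procs = range(1, n + 1)

    # Steps 1-2: each correct process collects one value from every process.
    vectors = {
        p: {q: (next(lies) if q in faulty else q) for q in procs}
        for p in procs if p not in faulty
    }

    # Steps 3-4: exchange vectors, then take a majority per element.
    results = {}
    for p in vectors:
        received = [
            {r: next(lies) for r in procs} if q in faulty else vectors[q]
            for q in procs if q != p
        ]
        result = []
        for r in procs:
            value, votes = Counter(v[r] for v in received).most_common(1)[0]
            result.append(value if votes > len(received) / 2 else UNKNOWN)
        results[p] = result
    return results

four = byzantine_agreement(4, faulty={3})   # three correct, one faulty
three = byzantine_agreement(3, faulty={3})  # two correct, one faulty
print(four)
print(three)
```

With four processes the correct ones all end up with the same vector, in which only the faulty process's entry is unknown. With three processes no element reaches a majority, so no agreement is possible.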

RPC Semantics in the Presence of Failures
Five different classes of failures that can occur in RPC systems:
1. The client is unable to locate the server.
2. The request message from the client to the server is lost.
3. The server crashes after receiving a request.
4. The reply message from the server to the client is lost.
5. The client crashes after sending a request.

Server Crashes (1)
Figure 8-7. A server in client-server communication. (a) The normal case. (b) Crash after execution. (c) Crash before execution.

Server Crashes (2)
Three events that can happen at the server:
- Send the completion message (M)
- Print the text (P)
- Crash (C)

Server Crashes (3)
These events can occur in six different orderings:
- M → P → C: A crash occurs after sending the completion message and printing the text.
- M → C (→ P): A crash happens after sending the completion message, but before the text could be printed.
- P → M → C: A crash occurs after printing the text and sending the completion message.
- P → C (→ M): The text is printed, after which a crash occurs before the completion message could be sent.
- C (→ P → M): A crash happens before the server could do anything.
- C (→ M → P): A crash happens before the server could do anything.

Server Crashes (4)
Figure 8-8. Different combinations of client and server strategies in the presence of server crashes.
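The table in Figure 8-8 is not reproduced in the transcript, but it can be derived from the events M, P and C. The sketch below assumes four client reissue strategies (always, never, only when the completion message was received, only when it was not) and assumes a reissued request is printed exactly once by the recovered server. The strategy labels and the `outcome` helper are illustrative.

```python
# Client strategies: should the request be reissued, given whether M arrived?
CLIENT_STRATEGIES = {
    "always": lambda acked: True,
    "never": lambda acked: False,
    "only when ACKed": lambda acked: acked,
    "only when not ACKed": lambda acked: not acked,
}

# Server strategies and the three places a crash (C) can fall in each.
ORDERINGS = {
    "M->P": ["MPC", "MC(P)", "C(MP)"],
    "P->M": ["PMC", "PC(M)", "C(PM)"],
}

def outcome(ordering, reissue):
    """Classify how many times the text is printed: ZERO, OK (once) or DUP (twice)."""
    before_crash = ordering.split("C")[0]   # events that happened before the crash
    printed = "P" in before_crash
    acked = "M" in before_crash
    prints = int(printed) + int(reissue(acked))
    return {0: "ZERO", 1: "OK", 2: "DUP"}[prints]

table = {
    server: {
        name: [outcome(o, reissue) for o in orderings]
        for name, reissue in CLIENT_STRATEGIES.items()
    }
    for server, orderings in ORDERINGS.items()
}

for server, rows in table.items():
    print(server, ORDERINGS[server])
    for name, row in rows.items():
        print(f"  {name:20s} {row}")
```

No combination of client and server strategy yields OK in every column, which is the point of the slide: exactly-once semantics cannot be guaranteed in the presence of server crashes.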

Basic Reliable-Multicasting Schemes
Figure 8-9. A simple solution to reliable multicasting when all receivers are known and are assumed not to fail. (a) Message transmission. (b) Reporting feedback.

Nonhierarchical Feedback Control
Figure. Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.

Hierarchical Feedback Control
Figure. The essence of hierarchical reliable multicasting. Each local coordinator forwards the message to its children and later handles retransmission requests.

Virtual Synchrony (1)
Figure. The logical organization of a distributed system to distinguish between message receipt and message delivery.

Virtual Synchrony (2)
Figure. The principle of virtual synchronous multicast.

Message Ordering (1)
Four different orderings are distinguished:
- Unordered multicasts
- FIFO-ordered multicasts
- Causally-ordered multicasts
- Totally-ordered multicasts

Message Ordering (2)
Figure. Unordered multicast: three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.

Message Ordering (3)
Figure. Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting.

Message ordering (4)
Reliable causal delivery
- Observes the causal relationship between messages when delivering them to the application layer
- This is done regardless of whether the messages come from the same sender or from different ones
- The communication layer holds on to a message until the messages that causally precede it have been received and delivered
- Vector timestamps are used to identify the order
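A minimal sketch of the hold-back behaviour follows. It assumes the usual vector-timestamp delivery rule for causally ordered multicast: a message from sender s with timestamp ts is deliverable once ts[s] is exactly one more than the receiver's entry for s, and no other entry of ts exceeds the receiver's. Clocks here count multicast messages only; the class and method names are hypothetical.

```python
def can_deliver(ts, sender, local):
    """Vector-timestamp test for causal delivery."""
    return ts[sender] == local[sender] + 1 and all(
        ts[k] <= local[k] for k in range(len(ts)) if k != sender
    )

class CausalReceiver:
    def __init__(self, n):
        self.clock = [0] * n     # messages delivered so far, per sender
        self.held = []           # received but not yet deliverable
        self.delivered = []      # payloads handed to the application, in order

    def receive(self, sender, ts, payload):
        self.held.append((sender, ts, payload))
        progress = True
        while progress:          # one delivery may unblock held messages
            progress = False
            for msg in list(self.held):
                s, t, p = msg
                if can_deliver(t, s, self.clock):
                    self.held.remove(msg)
                    self.clock[s] = t[s]
                    self.delivered.append(p)
                    progress = True

# P0 multicasts m1; P1 sees m1 and then multicasts m2, so m1 causally precedes m2.
p2 = CausalReceiver(3)
p2.receive(sender=1, ts=[1, 1, 0], payload="m2")   # arrives first, must wait
held_after_first = len(p2.held)
p2.receive(sender=0, ts=[1, 0, 0], payload="m1")
print(p2.delivered)
```

The receiver holds m2 until m1 arrives, then delivers both in causal order even though they came from different senders.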

Message ordering (5)
Total-ordered delivery
- Message delivery can be unordered, FIFO-ordered, or causally ordered
- However, messages are delivered in the same order to all members
- This is known as atomic multicasting

Implementing Virtual Synchrony (1)
Figure. Six different versions of virtually synchronous reliable multicasting.

Messages are kept at the communication layer
- Received by all group members – stable message
- Not yet received by all group members – unstable message

Figure. (a) Process 4 notices that process 7 has crashed and sends a view change.

Figure. (b) Process 6 sends out all its unstable messages, followed by a flush message.

Figure. (c) Process 6 installs the new view when it has received a flush message from everyone else.

Distributed Commit
- Either all group members perform an operation or none
- A coordinator manages commit
- Example: atomic multicasting
Three variations:
- One-phase commit protocol
- Two-phase commit (2PC)
- Three-phase commit (3PC)
One-phase commit: if one member fails to commit, there is no way to tell the coordinator.

Two-Phase Commit (1)
Figure. (a) The finite state machine for the coordinator in 2PC. (b) The finite state machine for a participant.

Two-Phase Commit (2)
Figure. Actions taken by a participant P when residing in state READY and having contacted another participant Q.
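The table behind this figure is missing from the transcript. The sketch below encodes the rule as it is usually stated for 2PC, which is an assumption here rather than a copy of the slide: a participant P stuck in READY asks another participant Q for its state, follows Q if Q has already decided, aborts if Q is still in INIT, and asks someone else if Q is also in READY. The names `ACTION_FOR_Q_STATE` and `resolve` are made up.

```python
# What P (in READY) does, depending on the state reported by Q.
ACTION_FOR_Q_STATE = {
    "COMMIT": "COMMIT",          # the coordinator decided to commit
    "ABORT": "ABORT",            # the coordinator decided to abort
    "INIT": "ABORT",             # Q never voted, so commit cannot have been decided
    "READY": "CONTACT_ANOTHER",  # Q knows as little as P does
}

def resolve(states_of_others):
    """Contact the other participants in turn until one of them settles the matter."""
    for q_state in states_of_others:
        action = ACTION_FOR_Q_STATE[q_state]
        if action != "CONTACT_ANOTHER":
            return action
    return "BLOCK"  # everyone is in READY: wait for the coordinator to recover

print(resolve(["READY", "COMMIT"]))
print(resolve(["READY", "INIT"]))
print(resolve(["READY", "READY"]))
```

The last case is the blocking behaviour of 2PC: if every reachable participant is in READY, none of them can decide without the coordinator.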

Two-Phase Commit (3)
Figure. Outline of the steps taken by the coordinator in a two-phase commit protocol. . . .

Two-Phase Commit (4)
. . . Figure. Outline of the steps taken by the coordinator in a two-phase commit protocol.
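The coordinator's steps are shown only as a figure, so here is a hedged sketch of the usual outline: log the start, request votes, abort on a timeout or on any abort vote, otherwise commit. Participants are modelled as plain functions that return "VOTE_COMMIT", "VOTE_ABORT", or None for a timeout; this modelling and all names are assumptions, and multicasting the decision is reduced to returning it.

```python
def run_coordinator(participants, coordinator_votes_commit=True):
    """Return (decision, log) for one run of the 2PC coordinator."""
    log = ["START_2PC"]                       # phase 1: write ahead, then ask
    votes = []
    for ask in participants:
        vote = ask("VOTE_REQUEST")
        if vote is None:                      # timeout while waiting for a vote
            log.append("GLOBAL_ABORT")
            return "GLOBAL_ABORT", log
        votes.append(vote)

    # Phase 2: commit only if everyone, including the coordinator, agrees.
    if coordinator_votes_commit and all(v == "VOTE_COMMIT" for v in votes):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"
    log.append(decision)                      # log before telling the participants
    return decision, log

def willing(request):
    return "VOTE_COMMIT"

def unwilling(request):
    return "VOTE_ABORT"

def crashed(request):
    return None                               # never answers

print(run_coordinator([willing, willing]))
print(run_coordinator([willing, unwilling]))
print(run_coordinator([willing, crashed]))
```

Writing the decision to the log before announcing it is what lets a recovering coordinator repeat the same decision after a crash.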

Two-Phase Commit (5)
Figure. (a) The steps taken by a participant process in 2PC.

Two-Phase Commit (7)
Figure. (b) The steps for handling incoming decision requests.

Three-Phase Commit (1)
The states of the coordinator and each participant satisfy the following two conditions:
1. There is no single state from which it is possible to make a transition directly to either a COMMIT or an ABORT state.
2. There is no state in which it is not possible to make a final decision, and from which a transition to a COMMIT state can be made.

Three-Phase Commit (2)
Figure. (a) The finite state machine for the coordinator in 3PC. (b) The finite state machine for a participant.
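The state machines are not reproduced in the transcript. The sketch below writes them down as transition tables the way they are commonly drawn (3PC adds a PRECOMMIT state between the vote and the commit), which is an assumption about the figure, and then checks the first 3PC condition: no state may lead directly to both COMMIT and ABORT.

```python
# state -> set of states reachable in one transition (assumed shape of the FSMs)
TWO_PC = {
    "coordinator": {"INIT": {"WAIT"}, "WAIT": {"COMMIT", "ABORT"}},
    "participant": {"INIT": {"READY", "ABORT"}, "READY": {"COMMIT", "ABORT"}},
}
THREE_PC = {
    "coordinator": {
        "INIT": {"WAIT"},
        "WAIT": {"PRECOMMIT", "ABORT"},
        "PRECOMMIT": {"COMMIT"},
    },
    "participant": {
        "INIT": {"READY", "ABORT"},
        "READY": {"PRECOMMIT", "ABORT"},
        "PRECOMMIT": {"COMMIT"},
    },
}

def states_violating_condition_1(fsm):
    """States from which both COMMIT and ABORT are one step away."""
    return sorted(s for s, nxt in fsm.items() if {"COMMIT", "ABORT"} <= nxt)

for name, protocol in (("2PC", TWO_PC), ("3PC", THREE_PC)):
    for role, fsm in protocol.items():
        print(name, role, states_violating_condition_1(fsm))
```

In this encoding 2PC fails the check in WAIT and READY, the states in which its processes can block, while the extra PRECOMMIT state lets 3PC pass.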

Recovery – Stable Storage
Figure. (a) Stable storage. (b) Crash after drive 1 is updated. (c) Bad spot.
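The three panels can be mimicked with two in-memory "drives". This is a sketch under assumptions, not the lecture's implementation: every block carries a CRC checksum, writes always go to drive 1 first and then to drive 2, and recovery compares the drives block by block. The `Drive` class and the function names are hypothetical.

```python
import zlib

class Drive:
    def __init__(self, nblocks):
        self.blocks = [(b"", zlib.crc32(b""))] * nblocks

    def write(self, i, data):
        self.blocks[i] = (data, zlib.crc32(data))

    def read(self, i):
        """Return the block, or None if its checksum does not match (bad spot)."""
        data, crc = self.blocks[i]
        return data if zlib.crc32(data) == crc else None

def stable_write(d1, d2, i, data, crash_between=False):
    d1.write(i, data)            # drive 1 is always updated first
    if crash_between:
        return                   # simulated crash: drive 2 keeps the old block
    d2.write(i, data)

def recover(d1, d2):
    for i in range(len(d1.blocks)):
        a, b = d1.read(i), d2.read(i)
        if a is None and b is not None:
            d1.write(i, b)       # bad spot on drive 1: repair from drive 2
        elif b is None and a is not None:
            d2.write(i, a)       # bad spot on drive 2: repair from drive 1
        elif a != b:
            d2.write(i, a)       # both valid but different: drive 1 is newer

# (b) Crash after drive 1 is updated.
d1, d2 = Drive(1), Drive(1)
stable_write(d1, d2, 0, b"old")
stable_write(d1, d2, 0, b"new", crash_between=True)
recover(d1, d2)
after_crash = (d1.read(0), d2.read(0))

# (c) Bad spot: corrupt the stored checksum on drive 1.
data, crc = d1.blocks[0]
d1.blocks[0] = (data, crc ^ 1)
recover(d1, d2)
after_bad_spot = (d1.read(0), d2.read(0))
print(after_crash, after_bad_spot)
```

Because the drives are never written at the same time, at least one of them holds a valid copy of each block after a single crash or a single bad spot.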

Checkpointing
Figure. A recovery line.

Independent Checkpointing
Figure. The domino effect.

Characterizing Message-Logging Schemes
Figure. Incorrect replay of messages after recovery, leading to an orphan process.

Summary
- Basic concepts
- Fault modelling
- Failure masking
- RPC semantics in the presence of failures
- Reliable multicast schemes
- Virtual synchrony
- Two-phase commit
- Three-phase commit
- Checkpointing and message logging

Next Lecture
Distributed Object-based Systems
- Architecture
- Processes
- Communication
- Naming
- Synchronization
- Consistency and Replication
- Fault Tolerance
