Fault Tolerance Prof. Orhan Gemikonakli


1 Fault Tolerance Prof. Orhan Gemikonakli
Module Leader: Prof. Leonardo Mostarda Università di Camerino Prof. Orhan Gemikonakli - Camerino,

2 Last lecture
- Replica management
- Permanent replicas
- Server-initiated replicas
- Client-initiated replicas
- Pull versus push protocols
- Consistency protocols

3 Outline
- Basic concepts
- Fault modelling
- Failure masking
- RPC semantics in the presence of failures
- Reliable multicast schemes
- Virtual synchrony
- Two-phase commit
- Three-phase commit
- Checkpointing and message logging
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc.

4 Learning outcome
- To understand the basic concepts related to fault tolerance: availability, reliability, safety
- To describe and discuss process resilience
- To understand reliable client-server communication
- To understand two-phase and three-phase commit

5 Fault Tolerance Basic Concepts
Being fault tolerant is strongly related to what are called dependable systems. Dependability implies the following:
- Availability
- Reliability
- Safety
- Maintainability

6 RMA & Safety
(The equation on this slide was not captured in the transcript.)

7 RMA & Safety
Availability vs reliability
If a system goes down for one millisecond every hour, it has an availability of over 99.9999 percent, but is still highly unreliable. Similarly, a system that never crashes but is shut down for two weeks every August has high reliability but only 96 percent availability.
5 9s: 99.999%
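
The two availability figures on this slide can be checked with a few lines of arithmetic. This is a minimal sketch, assuming availability is measured as the fraction of time the system is up; the function names `availability` and `yearly_downtime_minutes` are illustrative and do not come from the slides.

```python
def availability(downtime: float, period: float) -> float:
    """Fraction of the period during which the system is up (same unit for both arguments)."""
    return 1.0 - downtime / period


def yearly_downtime_minutes(target: float) -> float:
    """Downtime budget per 365-day year, in minutes, for a given availability target."""
    return (1.0 - target) * 365 * 24 * 60


# One millisecond of downtime in every hour (both values in seconds).
one_ms_per_hour = availability(0.001, 3600.0)

# Shut down for two weeks every year (both values in days).
august_shutdown = availability(14.0, 365.0)

# Downtime allowed per year by a "five nines" target.
five_nines_budget = yearly_downtime_minutes(0.99999)

print(f"1 ms down per hour:      {one_ms_per_hour:.5%}")
print(f"2 weeks down per year:   {august_shutdown:.1%}")
print(f"five nines budget/year:  {five_nines_budget:.2f} minutes")
```

The first system is up almost all the time yet fails every hour, so it is available but not reliable; the second never fails unexpectedly but is unavailable for about 4 percent of the year.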

8 RMA & Safety
Safety refers to the situation that when a system temporarily fails to operate correctly, nothing catastrophic happens. For example, many process control systems, such as those used for controlling nuclear power plants or sending people into space, are required to provide a high degree of safety. If such control systems temporarily fail for only a very brief moment, the effects could be disastrous.
Maintainability refers to how easily a failed system can be repaired. A highly maintainable system may also show a high degree of availability, especially if failures can be detected and repaired automatically. (MTBSO)

9 RMA & Safety
We discussed Reliability, Maintainability, Availability, and Safety. Security?

10 Reliability
Loss of reliability may lead to:
- Loss of revenue/customers
- Unrecoverable information or situation
- Loss of sensitive data
- Loss of life
Improving reliability

11 More on reliability
Overall reliability of components:
- Connected in series
- With redundancy
- In series and redundancy
Component-level redundancy versus system-level redundancy
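
The combinations listed on this slide can be compared numerically. The sketch below uses the standard series and parallel reliability formulas and assumes that components fail independently; the component reliability of 0.9 and the function names are illustrative, not taken from the slides.

```python
from math import prod


def series(reliabilities):
    """All components must work: R = R1 * R2 * ... * Rn."""
    return prod(reliabilities)


def parallel(reliabilities):
    """At least one redundant component must work: R = 1 - (1 - R1) * ... * (1 - Rn)."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


r = 0.9  # assumed reliability of a single component

two_in_series = series([r, r])
duplicated = parallel([r, r])

# Component-level redundancy: duplicate each component, then chain the pairs.
component_level = series([parallel([r, r]), parallel([r, r])])

# System-level redundancy: build the chain twice, then run the two chains in parallel.
system_level = parallel([series([r, r]), series([r, r])])

print(f"two in series:    {two_in_series:.4f}")
print(f"one duplicated:   {duplicated:.4f}")
print(f"component level:  {component_level:.4f}")
print(f"system level:     {system_level:.4f}")
```

Under these assumptions, duplicating each component gives a higher overall reliability than duplicating the whole chain, because a single failure in each chain no longer takes the system down.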

12 Failure Models
Figure 8-1. Different types of failures.

13 Failure Masking by Redundancy
Figure 8-2. Triple modular redundancy (TMR).
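
The voting step at the heart of TMR can be sketched in a few lines. This is an illustrative model only: it represents a single voter over three replica outputs, whereas a full TMR design also replicates the voters at each stage. The name `vote` is hypothetical.

```python
from collections import Counter


def vote(a, b, c):
    """Return the value produced by at least two of the three replicas."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three replicas disagree: the fault cannot be masked.
        raise ValueError("no majority among replica outputs")
    return value


# One faulty replica is outvoted by the two correct ones.
masked = vote(5, 5, 7)
print(masked)
```

A single wrong output is masked because the other two replicas still form a majority; two simultaneous faults in the same stage are not covered by this scheme.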

14 Process resilience
Achieving fault tolerance in distributed systems
- Protection against process failures
- Group similar processes
- All receive the same message
- All run the same procedures
- A collective response is returned
Grouping processes:
- Flat groups – all members are equal
- Hierarchical groups – have a coordinator

15 Flat Groups versus Hierarchical Groups
Figure 8-3. (a) Communication in a flat group. (b) Communication in a simple hierarchical group.

16 Groups
Flat groups
- All members are equal
- No single point of failure
- If one member fails, the group continues as a smaller one
- Decision making is more complicated (voting: delay + overhead)
Hierarchical groups
- Single point of failure (the coordinator)
- As long as it is healthy, the coordinator makes the decisions

17 Agreement in Faulty Systems (1)
Possible cases:
- Synchronous versus asynchronous systems.
- Communication delay is bounded or not.
- Message delivery is ordered or not.
- Message transmission is done through unicasting or multicasting.

18 Agreement in Faulty Systems (2)
Figure 8-4. Circumstances under which distributed agreement can be reached.

19 Agreement in Faulty Systems (3)
Figure 8-5. The Byzantine agreement problem for three nonfaulty and one faulty process. (a) Each process sends its value to the others.

20 Agreement in Faulty Systems (4)
Figure 8-5. The Byzantine agreement problem for three nonfaulty and one faulty process. (b) The vectors that each process assembles based on (a). (c) The vectors that each process receives in step 3.

21 Agreement in Faulty Systems (5)
Figure 8-6. The same as Fig. 8-5, except now with two correct processes and one faulty process.
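
The exchange described in the captions of Figures 8-5 and 8-6 can be mimicked with a small simulation. This is a sketch under stated assumptions: the function name `byzantine_round`, the way the faulty process lies (a fresh, unique value on every send), and the strict-majority rule over the received vectors are illustrative choices, not details taken from the slides.

```python
from collections import Counter
from itertools import count

UNKNOWN = None  # marker for "no majority for this element"


def byzantine_round(values, faulty):
    """Simulate one agreement round; return the result vector of each correct process."""
    n = len(values)
    fresh = count()

    def lie():
        # Assumption: a faulty process sends a different bogus value every time.
        return f"lie{next(fresh)}"

    # Each process announces its value; heard[i][j] is what process i heard from j.
    heard = [[lie() if j in faulty else values[j] for j in range(n)]
             for i in range(n)]

    results = {}
    for i in range(n):
        if i in faulty:
            continue
        # Process i receives the assembled vector of every other process.
        received = [[lie() for _ in range(n)] if k in faulty else heard[k]
                    for k in range(n) if k != i]
        result = []
        for j in range(n):
            value, votes = Counter(vec[j] for vec in received).most_common(1)[0]
            # Keep the value only if a strict majority of received vectors agree on it.
            result.append(value if votes > len(received) / 2 else UNKNOWN)
        results[i] = tuple(result)
    return results


four = byzantine_round([1, 2, 3, 4], faulty={2})   # three correct, one faulty
three = byzantine_round([1, 2, 3], faulty={2})     # two correct, one faulty
print(four)
print(three)
```

With four processes the correct ones end up with the same vector and agree on each other's values; with three processes no element reaches a majority, so no agreement is possible in this model.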

22 RPC Semantics in the Presence of Failures
Five different classes of failures that can occur in RPC systems:
1. The client is unable to locate the server.
2. The request message from the client to the server is lost.
3. The server crashes after receiving a request.
4. The reply message from the server to the client is lost.
5. The client crashes after sending a request.

23 Server Crashes (1)
Figure 8-7. A server in client-server communication. (a) The normal case. (b) Crash after execution. (c) Crash before execution.

24 Server Crashes (2)
Three events that can happen at the server:
- Send the completion message (M)
- Print the text (P)
- Crash (C)

25 Server Crashes (3)
These events can occur in six different orderings:
1. M → P → C: A crash occurs after sending the completion message and printing the text.
2. M → C (→ P): A crash happens after sending the completion message, but before the text could be printed.
3. P → M → C: A crash occurs after sending the completion message and printing the text.
4. P → C (→ M): The text is printed, after which a crash occurs before the completion message could be sent.
5. C (→ P → M): A crash happens before the server could do anything.
6. C (→ M → P): A crash happens before the server could do anything.

26 Server Crashes (4)
Figure 8-8. Different combinations of client and server strategies in the presence of server crashes.
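
The table in Figure 8-8 is not reproduced in this transcript, but the combinations it refers to can be enumerated. The sketch below is an illustrative reconstruction under assumptions: the four client strategies and the labels OK, DUP and ZERO are chosen here for the example, the recovered server is assumed not to crash a second time, and a reissued request is assumed to print the text once.

```python
# Client reissue strategies, expressed as a function of "did the client see the
# completion message (M)?". The names are assumptions made for this sketch.
STRATEGIES = {
    "always": lambda acked: True,
    "never": lambda acked: False,
    "when_acked": lambda acked: acked,
    "when_not_acked": lambda acked: not acked,
}


def outcome(server_order, events_before_crash, client_strategy):
    """Return how often the text ends up printed: ZERO, OK (once) or DUP (twice).

    server_order is "MP" (send message, then print) or "PM" (print, then send);
    events_before_crash is how many of those two events completed before C.
    """
    done = server_order[:events_before_crash]
    printed = "P" in done
    acked = "M" in done
    reissue = STRATEGIES[client_strategy](acked)
    prints = int(printed) + int(reissue)
    return ("ZERO", "OK", "DUP")[prints]


for order in ("MP", "PM"):
    for strategy in STRATEGIES:
        # Crash after both events, after the first event only, or before either.
        row = [outcome(order, k, strategy) for k in (2, 1, 0)]
        print(order, strategy, row)
```

In this model no pairing of client and server strategy yields OK for all three crash points, which is the reason exactly-once semantics cannot be guaranteed when the server may crash.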

27 Basic Reliable-Multicasting Schemes
Figure 8-9. A simple solution to reliable multicasting when all receivers are known and are assumed not to fail. (a) Message transmission. (b) Reporting feedback.

28 Nonhierarchical Feedback Control
Figure. Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.

29 Hierarchical Feedback Control
Figure. The essence of hierarchical reliable multicasting. Each local coordinator forwards the message to its children and later handles retransmission requests.

30 Virtual Synchrony (1)
Figure. The logical organization of a distributed system to distinguish between message receipt and message delivery.

31 Virtual Synchrony (2)
Figure. The principle of virtual synchronous multicast.

32 Message Ordering (1)
Four different orderings are distinguished:
- Unordered multicasts
- FIFO-ordered multicasts
- Causally-ordered multicasts
- Totally-ordered multicasts

33 Message Ordering (2)
Figure. Unordered multicast: three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.

34 Message Ordering (3)
Figure. Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting.

35 Message ordering (4)
Reliable causal delivery
- Observes the causal relationship between messages when delivering them to the application layer
- This is done regardless of whether the messages come from the same sender or from different ones
- The communication layer holds back a message until the messages that causally precede it have been delivered
- Vector timestamps are used to identify the order
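
The hold-back behaviour described on this slide can be sketched with vector timestamps. This is an illustrative model, not the lecture's implementation: it assumes each process increments its own vector entry only when it multicasts, and the names `can_deliver` and `CausalReceiver` are hypothetical.

```python
def can_deliver(msg_ts, sender, local):
    """A message is deliverable if it is the next one from its sender and the
    receiver has already delivered everything the sender had seen."""
    next_from_sender = msg_ts[sender] == local[sender] + 1
    nothing_missing = all(msg_ts[k] <= local[k]
                          for k in range(len(local)) if k != sender)
    return next_from_sender and nothing_missing


class CausalReceiver:
    def __init__(self, group_size):
        self.clock = [0] * group_size  # messages delivered so far, per sender
        self.held = []                 # received but not yet deliverable
        self.delivered = []            # payloads in delivery order

    def receive(self, sender, ts, payload):
        """Buffer the message, then deliver every message that has become eligible."""
        self.held.append((sender, ts, payload))
        progress = True
        while progress:
            progress = False
            for item in list(self.held):
                s, t, p = item
                if can_deliver(t, s, self.clock):
                    self.held.remove(item)
                    self.clock[s] += 1
                    self.delivered.append(p)
                    progress = True


# Process 1 multicasts m2 after delivering m1 from process 0, so m2 depends on m1.
receiver = CausalReceiver(group_size=3)
receiver.receive(sender=1, ts=[1, 1, 0], payload="m2")  # arrives first, is held back
receiver.receive(sender=0, ts=[1, 0, 0], payload="m1")  # releases m1, then m2
print(receiver.delivered)
```

The receiver gets m2 before m1 but still delivers m1 first, because the timestamp of m2 shows that its sender had already seen a message from process 0 that the receiver has not yet delivered.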

36 Message ordering (5)
Totally-ordered delivery
- Message delivery can be unordered, FIFO-ordered, or causally ordered
- However, messages are delivered in the same order to all members
- This is known as atomic multicasting

37 Implementing Virtual Synchrony (1)
Figure. Six different versions of virtually synchronous reliable multicasting.

38 Implementing Virtual Synchrony (2)
Messages are kept at the communication layer:
- Received by all group members – stable message
- Not yet received by all group members – unstable message

39 Implementing Virtual Synchrony (3)
Figure. (a) Process 4 notices that process 7 has crashed and sends a view change.

40 Implementing Virtual Synchrony (4)
Figure. (b) Process 6 sends out all its unstable messages, followed by a flush message.

41 Implementing Virtual Synchrony (5)
Figure. (c) Process 6 installs the new view when it has received a flush message from everyone else.

42 Distributed Commit
- Either all group members perform an operation or none
- A coordinator manages commit
- Example: atomic multicasting
Three variations:
- One-phase commit protocol
- Two-phase commit (2PC)
- Three-phase commit (3PC)
One-phase commit: If one member fails to commit, there is no way to tell the coordinator.

43 Two-Phase Commit (1)
Figure. (a) The finite state machine for the coordinator in 2PC. (b) The finite state machine for a participant.

44 Two-Phase Commit (2)
Figure. Actions taken by a participant P when residing in state READY and having contacted another participant Q.

45 Two-Phase Commit (3)
Figure. Outline of the steps taken by the coordinator in a two-phase commit protocol. . . .

46 Two-Phase Commit (4)
. . . Figure. Outline of the steps taken by the coordinator in a two-phase commit protocol.
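
The coordinator outline in the figure is not reproduced in this transcript. As a rough, in-process sketch of the general two-phase idea (collect votes, then broadcast one global decision), the following may help. Everything here is an assumption made for illustration: the class and function names, the state and log strings, and the modelling of a timeout as a participant that returns no vote.

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "VOTE_COMMIT"
    ABORT = "VOTE_ABORT"


class Participant:
    def __init__(self, name, will_commit=True, reachable=True):
        self.name = name
        self.will_commit = will_commit
        self.reachable = reachable
        self.state = "INIT"

    def on_vote_request(self):
        """Phase 1: answer the coordinator's vote request (None models a timeout)."""
        if not self.reachable:
            return None
        if self.will_commit:
            self.state = "READY"
            return Vote.COMMIT
        self.state = "ABORT"
        return Vote.ABORT

    def on_decision(self, decision):
        """Phase 2: adopt the coordinator's global decision."""
        if self.reachable:
            self.state = decision


def two_phase_commit(participants):
    """Run both phases and return the decision plus the coordinator's log."""
    log = ["START_2PC"]
    # Phase 1: collect a vote from every participant.
    votes = [p.on_vote_request() for p in participants]
    # Commit only if every participant voted to commit; a missing vote counts as abort.
    decision = "COMMIT" if all(v is Vote.COMMIT for v in votes) else "ABORT"
    log.append("GLOBAL_" + decision)
    # Phase 2: broadcast the decision.
    for p in participants:
        p.on_decision(decision)
    return decision, log


group = [Participant("P1"), Participant("P2"), Participant("P3")]
decision, log = two_phase_commit(group)
print(decision, log, [p.state for p in group])
```

A real coordinator would write its log entries to stable storage before sending messages, so that it can recover its decision after a crash; the list used here only marks where those writes would happen.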

47 Two-Phase Commit (5)
Figure. (a) The steps taken by a participant process in 2PC.

48 Two-Phase Commit (7)
Figure. (b) The steps for handling incoming decision requests.

49 Three-Phase Commit (1)
The states of the coordinator and each participant satisfy the following two conditions:
1. There is no single state from which it is possible to make a transition directly to either a COMMIT or an ABORT state.
2. There is no state in which it is not possible to make a final decision, and from which a transition to a COMMIT state can be made.

50 Three-Phase Commit (2)
Figure. (a) The finite state machine for the coordinator in 3PC. (b) The finite state machine for a participant.

51 Recovery – Stable Storage
Figure. (a) Stable storage. (b) Crash after drive 1 is updated. (c) Bad spot.

52 Checkpointing
Figure. A recovery line.

53 Independent Checkpointing
Figure. The domino effect.

54 Characterizing Message-Logging Schemes
Figure. Incorrect replay of messages after recovery, leading to an orphan process.

55 Summary
- Basic concepts
- Fault modelling
- Failure masking
- RPC semantics in the presence of failures
- Reliable multicast schemes
- Virtual synchrony
- Two-phase commit
- Three-phase commit
- Checkpointing and message logging

56 Next Lecture
Distributed Object-based Systems
- Architecture
- Processes
- Communication
- Naming
- Synchronization
- Consistency and Replication
- Fault Tolerance

