Chapter 7 Fault Tolerance Basic Concepts Failure Models Process Design Issues Flat vs hierarchical group Group Membership Reliable Client.


1 Chapter 7 Fault Tolerance
- Basic Concepts
- Failure Models
- Process Design Issues
- Flat vs hierarchical group
- Group Membership
- Reliable Client Server Communication
- Point to point Communication
- RPC
- Reliable Group Communication
- Atomic Multicast
- Two Phase Commit

2 Introduction
- A partial failure occurs in a distributed system when one component fails.
  - It may affect the operation of some components
  - while leaving other components totally unaffected.
- The design goal in a distributed system (DS) is to build a system that
  - automatically recovers from partial failures
  - without seriously affecting overall performance.

3 Basic Concepts
Dependability includes:
- Availability
- Reliability
- Safety
- Maintainability

4 Availability
- The system is ready to be used immediately.
- In general, the system is operating correctly at any given moment and is available to perform its functions.
- Percentage of availability = (total elapsed time - sum of downtime) / total elapsed time × 100%
- 99.9% hours
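The availability formula above can be checked with a few lines of Python. This is a minimal sketch: the function names, the one-year window of 8760 hours, and the outage durations are illustrative assumptions, not values from the slides.

```python
def availability(total_elapsed_hours: float, downtime_hours: list[float]) -> float:
    """Fraction of the elapsed time during which the system was up."""
    if total_elapsed_hours <= 0:
        raise ValueError("total elapsed time must be positive")
    return (total_elapsed_hours - sum(downtime_hours)) / total_elapsed_hours


def allowed_downtime(total_elapsed_hours: float, target: float) -> float:
    """Downtime budget (in hours) that still meets the availability target."""
    return total_elapsed_hours * (1.0 - target)


HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)

outages = [2.0, 0.5, 1.5]  # made-up outage durations, in hours
print(f"availability: {availability(HOURS_PER_YEAR, outages):.4%}")
print(f"downtime budget at 99.9%: {allowed_downtime(HOURS_PER_YEAR, 0.999):.2f} hours")
```

Under these assumptions, a 99.9% target over one year leaves a downtime budget of roughly 8.76 hours.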

5 Reliability
- The system can run continuously without failure.
- A highly reliable system is one that will most likely continue to work without interruption during a relatively long period of time.
- One measure of a component's or system's reliability is mean time between failures (MTBF):
  MTBF = (total elapsed time - sum of downtime) / number of failures
- A related measurement is mean time to repair (MTTR): the average time interval (usually expressed in hours) that it takes to repair a failed component.
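A small worked example of the MTBF and MTTR definitions may help. The sketch below assumes one repair interval per failure, so MTTR is taken as the mean of the downtime intervals; the function names and sample numbers are hypothetical.

```python
def mtbf(total_elapsed_hours: float, downtimes: list[float]) -> float:
    """Mean time between failures: uptime divided by the number of failures."""
    if not downtimes:
        raise ValueError("MTBF is undefined without at least one failure")
    return (total_elapsed_hours - sum(downtimes)) / len(downtimes)


def mttr(downtimes: list[float]) -> float:
    """Mean time to repair: average length of a downtime interval."""
    if not downtimes:
        raise ValueError("MTTR is undefined without at least one repair")
    return sum(downtimes) / len(downtimes)


total = 1000.0             # observation window in hours (made up)
repairs = [4.0, 6.0, 10.0]  # one downtime interval per failure (made up)

between = mtbf(total, repairs)
repair = mttr(repairs)
print(f"MTBF = {between:.2f} h, MTTR = {repair:.2f} h")

# With these definitions, MTBF / (MTBF + MTTR) equals the availability ratio
# (total elapsed time - sum of downtime) / total elapsed time.
print(f"availability = {between / (between + repair):.4f}")
```

With these definitions the two measures tie back to availability: uptime per failure divided by (uptime plus downtime) per failure is the same ratio as the availability formula.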

6 Safety
Nothing catastrophic happens if the system temporarily fails to operate correctly.

7 Maintainability
Refers to how easily a failed system can be repaired.

8 Terminology
- Failure: a component is not living up to its specifications.
- Error: the part of a component's state that can lead to a failure.
- Fault: the cause of an error.
- Fault prevention: prevent the occurrence of a fault.
- Fault tolerance: build a component in such a way that it can meet its specifications in the presence of faults.

9 Failure Models
Different types of failures.

Type of failure             Description
Crash failure               A server halts, but is working correctly until it halts
Omission failure            A server fails to respond to incoming requests
  Receive omission          A server fails to receive incoming messages
  Send omission             A server fails to send messages
Timing failure              A server's response lies outside the specified time interval
Response failure            The server's response is incorrect
  Value failure             The value of the response is wrong
  State transition failure  The server deviates from the correct flow of control
Arbitrary failure           A server may produce arbitrary responses at arbitrary times

10 Failure Models (cont)
- Timing failures: the output of a component is correct, but lies outside a specified real-time interval (performance failures: too slow).
- Response failures: the output of a component is incorrect.
  - Value failure: the wrong value is produced.
  - State transition failure: execution of the component's service brings it into a wrong state.

11 Failure Models (cont)
- Crash failures: a component halts, but behaves correctly before halting.
- Omission failures: a component fails to respond.
  - Receive omission: a server fails to receive incoming messages.
  - Send omission: a server fails to send messages.
- Arbitrary failures: a component may produce arbitrary output and be subject to arbitrary timing failures.

12 Process Resilience
Protect against faulty processes by replicating and distributing computations in a group:
- Flat groups: good for fault tolerance, as information is exchanged immediately with all group members; however, they may impose more overhead because control is completely distributed (hard to implement).
- Hierarchical groups: all communication goes through a single coordinator, so they are neither very fault tolerant nor scalable, but relatively easy to implement.

13 Flat Groups versus Hierarchical Groups
Communication in a flat group. Communication in a simple hierarchical group

14 Group Membership
- A method is needed for creating and deleting groups, and for allowing processes to join and leave groups.
- Group server
  - Centralized
  - Single point of failure
- Send a message to all group members
  - to announce that a process wants to join the group
  - send a goodbye message to leave the group

15 RPC Semantics in the Presence of Failures
What can go wrong?
1. The client can't locate the server
2. The client's request is lost
3. The server crashes
4. The server's response is lost
5. The client crashes

16 RPC Semantics
- The client is unable to locate the server
  - The server might be down
  - Relatively simple: just report back to the client
- Lost request message
  - Just resend the message
  - Use a timer while waiting for the acknowledgement; if the ack is not received within a certain time limit, the message is transmitted again
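The timer-and-resend rule for lost requests can be sketched as a retry loop. This is only an illustration under stated assumptions: `LossyChannel` is a hypothetical stand-in for a network that drops the first few messages, and returning `None` stands in for the timer expiring without an acknowledgement.

```python
from typing import Callable, Optional


class LossyChannel:
    """Hypothetical channel that silently drops the first `drops` messages."""

    def __init__(self, drops: int) -> None:
        self.drops_left = drops
        self.transmissions = 0

    def send(self, request: str) -> Optional[str]:
        self.transmissions += 1
        if self.drops_left > 0:
            self.drops_left -= 1
            return None  # stands in for "timer expired, no ack received"
        return f"ack:{request}"


def call_with_retransmission(
    send: Callable[[str], Optional[str]], request: str, max_attempts: int = 5
) -> str:
    """Resend the request until an acknowledgement arrives or attempts run out."""
    for _ in range(max_attempts):
        reply = send(request)
        if reply is not None:
            return reply
    raise TimeoutError(f"no acknowledgement after {max_attempts} attempts")


channel = LossyChannel(drops=2)
print(call_with_retransmission(channel.send, "read x"), channel.transmissions)
```

Bounding the number of attempts is a design choice of this sketch; the slide only states that the message is transmitted again when the timer expires.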

17 RPC Semantics
Server crashes are harder: you don't know what was done.
A server in client-server communication:
- Normal case
- Crash after execution
- Crash before execution

18 RPC Semantics (cont)
- Lost reply message
  - Use the timer again and resend the request.
  - Keep track of requests with a sequence number, to differentiate between the original request and a retransmission,
  - or use a bit in the message header to distinguish an initial request from a retransmission.
- Client crashes
  - The client crashes after sending a request to the server.
  - The server does the computation while no client is waiting for the result: an orphan.
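How a sequence number lets a server tell a retransmission from a new request can be sketched as follows. The `Server` class, its deposit operation, and the reply cache are assumptions made for illustration; the slides only mention the sequence number itself.

```python
class Server:
    """Toy server that executes each (client, sequence number) pair at most once."""

    def __init__(self) -> None:
        self.balance = 0
        self.executions = 0
        self.replies: dict[tuple[str, int], int] = {}  # cached reply per request

    def handle(self, client_id: str, seq: int, amount: int) -> int:
        key = (client_id, seq)
        if key in self.replies:
            # Same sequence number seen before: this is a retransmission,
            # so return the earlier reply instead of executing again.
            return self.replies[key]
        self.balance += amount  # a non-idempotent operation
        self.executions += 1
        self.replies[key] = self.balance
        return self.balance


server = Server()
first = server.handle("client-1", seq=1, amount=100)
retry = server.handle("client-1", seq=1, amount=100)  # reply was lost, client resends
print(first, retry, server.executions)
```

Without the check, the resent request would add the amount twice, which is why a lost reply is harder to handle than a lost request.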

19 Basic Reliable-Multicasting Schemes
A simple solution to reliable multicasting when all receivers are known and are assumed not to fail:
- Message transmission
- Reporting feedback

20 Atomic Multicast
Definition
- A message is delivered to either all processes or none at all.
- All messages are delivered in the same order to all processes.
Atomic multicast is required for many applications, for example replicated databases and bank transactions.
Example: replicated database
- Each replica is represented by a process.
- Each process needs to get the updates in the same order.
- When a process dies and restarts, it needs to perform the updates in order.
For atomic multicast, all processes need to agree on group membership, e.g. how does a crashed process rejoin the group?

21 Reliable Stock Exchange
[Figure labels: Reliable Multicast, Receivers, Object Group, External Data Feed, (ORB) Bridge]

22 Virtually Synchronous
- Group view: the view on the set of processes contained in the group.
- A message multicast to group view G is delivered to each non-faulty process in G.
- If the sender crashes, the message may either be delivered to all the remaining processes, or ignored by each of them.

23 Virtual Synchrony (1)
The logical organization of a distributed system to distinguish between message receipt and message delivery.

24 Virtual Synchrony (2)
The principle of virtual synchronous multicast.

25 Message Ordering
- Unordered multicast
- FIFO-ordered multicast
- Causally-ordered multicast
- Totally-ordered multicast

26 Message Ordering (1)

Process P1   Process P2    Process P3
sends m1     receives m1   receives m2
sends m2

Three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.

27 Message Ordering (2)
[Table residue: columns Process P1, Process P2, Process P3, Process P4; entries sends m1, receives m1, receives m3, sends m3, sends m2, sends m4, receives m2, receives m4]
Four processes in the same group with two different senders, and a possible delivery order of messages under FIFO-ordered multicasting.
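FIFO-ordered multicast only constrains messages from the same sender, and it relies on the difference between receiving a message and delivering it. A minimal sketch of one common way to get that behaviour, a per-sender hold-back queue, is shown below. The class name and the per-sender sequence numbers starting at 0 are assumptions, and the example assumes for illustration that P1 sends m1 then m2 while P4 sends m3 then m4.

```python
from collections import defaultdict


class FifoReceiver:
    """Delivers messages from each sender in the order they were sent."""

    def __init__(self) -> None:
        self.next_seq: dict[str, int] = defaultdict(int)  # next expected number per sender
        self.held: dict[str, dict[int, str]] = defaultdict(dict)  # received, not yet delivered
        self.delivered: list[str] = []

    def receive(self, sender: str, seq: int, message: str) -> None:
        self.held[sender][seq] = message
        # Deliver for as long as the next expected message from this sender is held.
        while self.next_seq[sender] in self.held[sender]:
            self.delivered.append(self.held[sender].pop(self.next_seq[sender]))
            self.next_seq[sender] += 1


receiver = FifoReceiver()
receiver.receive("P1", 1, "m2")  # arrives early: held back until m1 shows up
receiver.receive("P4", 0, "m3")  # different sender: delivered right away
receiver.receive("P1", 0, "m1")  # releases m1 and then m2
print(receiver.delivered)
```

Messages from different senders may still interleave differently at different receivers; ruling that out requires totally-ordered multicast.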

28 Implementing Virtual Synchrony (1)
- A message m is multicast to a group of processes G.
- There is no guarantee that all members of G will receive m: the sender may fail before transmitting m to all members.
- Because the sender has crashed, a process that missed m should get it from somewhere else.
- Let every process in G keep m until all members of G have received it.
- m is said to be stable if all members have received it.
- Only stable messages are allowed to be delivered.
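The notion of a stable message can be sketched as simple bookkeeping: a message becomes stable once every group member is known to have received it. The `StabilityTracker` class and its method names are hypothetical, and how receipt information is exchanged between processes is left out.

```python
class StabilityTracker:
    """Tracks which group members are known to have received each message."""

    def __init__(self, members: set[str]) -> None:
        self.members = frozenset(members)
        self.received_by: dict[str, set[str]] = {}

    def record_receipt(self, msg_id: str, member: str) -> None:
        if member not in self.members:
            raise ValueError(f"{member} is not in the group view")
        self.received_by.setdefault(msg_id, set()).add(member)

    def is_stable(self, msg_id: str) -> bool:
        # Stable means every member of the group has received the message.
        return self.received_by.get(msg_id, set()) >= self.members

    def unstable(self) -> list[str]:
        """Messages that must be kept, since some member may still need them."""
        return sorted(m for m in self.received_by if not self.is_stable(m))


tracker = StabilityTracker({"p1", "p2", "p3"})
tracker.record_receipt("m", "p1")
tracker.record_receipt("m", "p2")
print(tracker.is_stable("m"), tracker.unstable())
tracker.record_receipt("m", "p3")
print(tracker.is_stable("m"), tracker.unstable())
```

The list of unstable messages is what a process would have to keep around, because another member may still need a copy if the sender has crashed.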

29 Implementing Virtual Synchrony (2)
- Process 4 notices that process 7 has crashed and sends a view change.
- Process 6 sends out all its unstable messages, followed by a flush message.
- Process 6 installs the new view when it has received a flush message from everyone else.

30 Two-Phase Commit (1)
- Consists of a coordinator and participants (voting phase and decision phase).
- The coordinator sends a VOTE_REQUEST message to all participants.
- Each participant sends back either VOTE_COMMIT or VOTE_ABORT.
- The coordinator sends GLOBAL_COMMIT if it received VOTE_COMMIT from all participants, and GLOBAL_ABORT if at least one participant sent VOTE_ABORT.
- Each participant waits for the message from the coordinator:
  - GLOBAL_COMMIT: locally commit the transaction
  - GLOBAL_ABORT: locally abort the transaction
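The coordinator's decision rule can be written down compactly. The sketch below models only the decision, not the message exchange; the enum and function names are illustrative. It also assumes that a missing vote (`None`, for example after a timeout) and an empty participant set both lead to GLOBAL_ABORT.

```python
from enum import Enum
from typing import Mapping, Optional


class Vote(Enum):
    COMMIT = "VOTE_COMMIT"
    ABORT = "VOTE_ABORT"


class Decision(Enum):
    COMMIT = "GLOBAL_COMMIT"
    ABORT = "GLOBAL_ABORT"


def decide(votes: Mapping[str, Optional[Vote]]) -> Decision:
    """Commit only if every participant voted VOTE_COMMIT.

    A value of None models a vote that never arrived (assumption: treated as abort).
    """
    if votes and all(vote is Vote.COMMIT for vote in votes.values()):
        return Decision.COMMIT
    return Decision.ABORT


def apply_decision(decision: Decision) -> str:
    """What each participant does once the global decision arrives."""
    return "locally commit" if decision is Decision.COMMIT else "locally abort"


votes = {"p1": Vote.COMMIT, "p2": Vote.COMMIT, "p3": Vote.ABORT}
outcome = decide(votes)
print(outcome.value, "->", apply_decision(outcome))
```

A single VOTE_ABORT is enough to abort the whole transaction, which is what makes the outcome atomic across participants.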

31 Two-Phase Commit (2)
The finite state machine for the coordinator in 2PC. The finite state machine for a participant.

32 Two-Phase Commit (3)
- The protocol can easily fail when a process crashes: other processes may wait indefinitely for a message from that process.
- Timeout mechanisms are used to solve this problem:
  1. A participant may be waiting for VOTE_REQUEST in its INIT state: it decides to locally abort the transaction.
  2. The coordinator may be waiting for a message from participants in its WAIT state: it sends GLOBAL_ABORT.
  3. A participant may be blocked in its READY state waiting for the global vote from the coordinator: it cannot simply decide to abort the transaction. It can either
     - wait for the coordinator to come up again, or
     - contact another participant to check its current state.

33 Two-Phase Commit (4)

State        Coordinator Msg
COMMIT       GLOBAL_COMMIT
ABORT        GLOBAL_ABORT
INIT         VOTE_REQUEST
READY        Wait for coordinator to recover

State of Q   Action by P
COMMIT       Make transition to COMMIT
ABORT        Make transition to ABORT
INIT         Make transition to ABORT
READY        Contact another participant

Actions taken by a participant P when residing in state READY and having contacted another participant Q.
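The actions of a blocked participant P can be sketched as a lookup over the states of the participants it contacts. The names below are hypothetical. INIT is mapped to ABORT here, following the usual 2PC reasoning that a participant still in INIT cannot have voted yet; treat that row as an assumption if the slide's table is read differently.

```python
from enum import Enum
from typing import Iterable, Optional


class State(Enum):
    INIT = "INIT"
    READY = "READY"
    COMMIT = "COMMIT"
    ABORT = "ABORT"


# State of the contacted participant Q -> state P moves to (None: ask someone else).
ACTION_BY_P = {
    State.COMMIT: State.COMMIT,
    State.ABORT: State.ABORT,
    State.INIT: State.ABORT,  # assumption, see the note above
    State.READY: None,
}


def resolve_blocked_participant(states_of_others: Iterable[State]) -> Optional[State]:
    """Return the state P can move to, or None if P must keep waiting."""
    for state_of_q in states_of_others:
        target = ACTION_BY_P[state_of_q]
        if target is not None:
            return target
    # Everyone contacted is READY too: wait for the coordinator to recover.
    return None


print(resolve_blocked_participant([State.READY, State.COMMIT]))
print(resolve_blocked_participant([State.READY, State.READY]))
```

When every contacted participant is also in READY, the function returns `None`, which corresponds to waiting for the coordinator to recover.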

