1 CSE 8343 Presentation # 2 Fault Tolerance in Distributed Systems By Sajida Begum Samina F Choudhry.



2 Outline
- Terminology
- Goals of Fault Tolerance
- Fault Prevention vs. Fault Tolerance
- Phases in Fault Tolerance
- Causes of Faults
- Fault Classification
- Fault Tolerance in Distributed Systems
- Recovering a Consistent State
- Checkpoint
- Rollback Recovery

3 Terminology
- Fault: a physical defect
- Error: the manifestation of a fault
- Failure: incorrect functioning of the system
- Fault tolerance: providing the service despite the presence of faults in the system
- Fault-tolerant system: a system that masks the presence of faults by using redundancy

4 Goals of Fault Tolerance
Dependability: the trustworthiness of a computer system
Attributes of dependability:
- Reliability: used when even momentary periods of incorrect operation are unacceptable
- Availability
- Safety: different from reliability
- Security

5 Fault Prevention vs. Fault Tolerance
Fault avoidance:
- Assumes system failure will occasionally occur
- No redundancy in the system to mask failures
- The system fails when its component(s) fail
- Manual maintenance
Fault tolerance:
- Assumes fault prevention techniques will never be able to eliminate all possible faults
- Redundancy
- Fault detection
- Recovery

6 Phases in Fault Tolerance
- Error detection
- Damage confinement
- Error recovery
  - Forward recovery
  - Backward recovery
- Fault treatment and continued service

7 Causes of Faults
- Physical defects
- Wear and tear
- External intervention
- User errors

8 Fault Classification
[Figure: fault classification diagram. Labels: System Failure; Incorrect Design; Unstable or Marginal Components; Unstable Environment; Permanent Fault; Operator Mistake; Permanent Error; Transient Error; Intermittent Error]

9 FT in Distributed Systems: Failures and Fault Classification
- Crash fault
  - The component halts or loses its internal state
  - It will not go through a correct state transition
- Omission fault
  - The component will not respond to some inputs
- Timing fault
  - The component responds too slowly or too quickly (a performance fault)

10 Fault Classes (Cont'd)
- Byzantine fault
  - Behaves in an arbitrary way
- Incorrect computation fault
[Figure: fault classes diagram. Labels, in order: Byzantine, Timing, Omission, Crash]

11 FT Building Blocks
- Byzantine agreement
- Synchronized clocks
- Stable storage
- Fail-stop processors
- Detection and diagnosis
- Reliable messaging

12 Byzantine Agreement
[Figure: Byzantine agreement among a Transmitter, Node i, and Node j exchanging the values 1 and 0. Panel labels: Node j is faulty; Transmitter is faulty]

13 IC Protocol With Ordinary Messages
Assumptions:
- All messages are delivered correctly
- The receiver knows the sender
- The absence of a message can be detected
The algorithm runs in several rounds.

14 ICA Algorithm
ICA(0):
- The transmitter sends its value to the other (n-1) nodes
- Each node uses the received value, or the default value if nothing is received
ICA(m), m > 0:
- The transmitter sends its value to the other nodes
- Each node i runs ICA(m-1) to send its received value vi to the other (n-2) nodes
- Node i uses the value majority(v1, v2, …, vn-1)
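The recursion in ICA(m) can be sketched as a small simulation. This is a minimal sketch under stated assumptions, not the presenters' implementation: the names `ica`, `send`, `majority`, and `DEFAULT` are hypothetical, the default value is taken to be 0, ties in the majority vote fall back to the default, and a faulty sender is modeled (arbitrarily) as sending a value that depends on the receiver's id.

```python
from collections import Counter

DEFAULT = 0  # assumed default value; the slides do not specify one


def majority(values):
    """Return the most common value, or DEFAULT when the top two counts tie."""
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return DEFAULT
    return counts[0][0]


def send(sender, receiver, value, faulty):
    """Hypothetical fault model: a faulty sender sends receiver-dependent bits."""
    if sender in faulty:
        return receiver % 2
    return value


def ica(m, transmitter, nodes, value, faulty):
    """Run ICA(m); return {node: decided value} for each node in `nodes`.

    `nodes` lists the receivers and excludes the transmitter.
    """
    # The transmitter sends its value to every other node.
    received = {i: send(transmitter, i, value, faulty) for i in nodes}
    if m == 0:
        return received

    # Each node i acts as transmitter in ICA(m-1) to relay what it received.
    relayed = {}  # relayed[j][i] = value node j obtained for node i
    for i in nodes:
        others = [j for j in nodes if j != i]
        for j, v in ica(m - 1, i, others, received[i], faulty).items():
            relayed.setdefault(j, {})[i] = v

    # Node i decides by majority over its own value and the relayed ones.
    decided = {}
    for i in nodes:
        values = [received[i]] + [relayed[i][j] for j in nodes if j != i]
        decided[i] = majority(values)
    return decided
```

With four nodes and m = 1, calling `ica(1, 0, [1, 2, 3], 1, faulty={3})` lets the two non-faulty receivers decide the transmitter's value, and `ica(1, 0, [1, 2, 3], 1, faulty={0})` lets all receivers agree on a common value even though the transmitter is faulty.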

15 Protocol with Signed Messages
Algorithm SM(m):
- Initialize Vi to the empty set
- The transmitter (node 0) sends its signed value to all other nodes
- For each node i:
  - If node i receives a message v:0 from the transmitter, it sets Vi to {v} and sends the message v:0:i to every other node
  - If node i receives a message v:0:j1:j2:…:jk and v is not in Vi, it adds v to Vi; if k < m, it sends the message v:0:j1:j2:…:jk:i to every node other than j1, j2, …, jk
  - When no more messages will arrive, the final value is choice(Vi)
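The message flow of SM(m) can be illustrated with a queue-based simulation. This is a sketch under assumptions rather than a definitive implementation: signatures are modeled as a tuple of signer ids that cannot be forged, a faulty receiver is modeled only as staying silent, and `choice` is assumed to return the single value in the set or 0 otherwise. The names `sm`, `choice`, and `silent` are hypothetical.

```python
from collections import deque


def choice(values):
    """Assumed choice rule: the single value if there is exactly one, else 0."""
    return next(iter(values)) if len(values) == 1 else 0


def sm(m, n, transmitter_values, silent=frozenset()):
    """Simulate SM(m) with node 0 as transmitter and receivers 1..n-1.

    transmitter_values[i] is the signed value the transmitter sends to node i
    (a faulty transmitter may send different values to different nodes).
    `silent` holds faulty receivers that never forward anything.
    """
    V = {i: set() for i in range(1, n)}
    # Each queue entry is (receiver, value, chain of signers so far).
    queue = deque((i, transmitter_values[i], (0,)) for i in range(1, n))
    while queue:
        i, v, signers = queue.popleft()
        if v in V[i]:
            continue  # value already known: ignore the message
        V[i].add(v)
        k = len(signers) - 1  # receiver signatures appended so far
        if i in silent or k >= m:
            continue
        # Countersign and forward to every node that has not signed yet.
        for j in range(1, n):
            if j != i and j not in signers:
                queue.append((j, v, signers + (i,)))
    return {i: choice(V[i]) for i in V if i not in silent}
```

For example, with three nodes and a faulty transmitter that sends 1 to node 1 and 0 to node 2, `sm(1, 3, {1: 1, 2: 0})` leaves both receivers with the same set {0, 1}, so they make the same choice.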

16 Synchronized Clocks
- Internal synchronization
- External synchronization
- Drift of physical clocks
  - The values of all non-faulty processors' clocks must be approximately equal
  - The change to non-faulty processors' clocks during resynchronization should be minimal
- Deterministic and probabilistic clock synchronization

17 Stable Storage
Operations:
- write(address, data)
- read(address), returns (status, data)
Failures:
- Transient failures
- Bad sector
- Controller failure
- Disk failure

18 Implementation
Using only one disk:
- Careful read: repeat the read until it returns status good
- Careful write: a write followed by a careful read
- Does not cover decay events or crashes during a write
Partition the disk into ordered pairs of pages that are not decay related.
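The careful read and careful write operations can be sketched against a simulated disk. Everything here is illustrative: `FlakyDisk` is a hypothetical in-memory stand-in whose reads fail a fixed number of times, and the `max_tries` bound is an added assumption so the loops terminate (the slide simply says to repeat until the status is good).

```python
GOOD, BAD = "good", "bad"


class FlakyDisk:
    """Hypothetical in-memory disk whose first few reads fail transiently."""

    def __init__(self, transient_failures=0):
        self.pages = {}
        self.failures_left = transient_failures

    def write(self, address, data):
        self.pages[address] = data

    def read(self, address):
        if self.failures_left > 0:
            self.failures_left -= 1
            return BAD, None
        if address not in self.pages:
            return BAD, None
        return GOOD, self.pages[address]


def careful_read(disk, address, max_tries=5):
    """Repeat the read until it returns status good (bounded by max_tries)."""
    for _ in range(max_tries):
        status, data = disk.read(address)
        if status == GOOD:
            return status, data
    return BAD, None


def careful_write(disk, address, data, max_tries=5):
    """Write, then verify with a careful read; retry if verification fails."""
    for _ in range(max_tries):
        disk.write(address, data)
        status, stored = careful_read(disk, address, max_tries)
        if status == GOOD and stored == data:
            return True
    return False
```

As the slide notes, this masks transient failures only; it does not protect against decay of a page after it was written or a crash in the middle of a write, which is why pages are paired.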

19 Disk Shadowing
[Figure: disk shadowing diagram. Labels: CPU1, CPU2, Disk Controller, Disk 1, Disk 2]

20 Redundant Arrays of Disks
- Files are "striped" across multiple spindles
- Redundancy yields high data availability
- Mirroring/shadowing (high capacity cost)
Techniques:
- Horizontal Hamming codes (overkill)
- Parity and Reed-Solomon codes
- Capacity penalty to store the redundant data
- Bandwidth penalty to update it
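The parity technique, and the two penalties it carries, can be shown with byte-wise XOR. This is a minimal sketch assuming equal-sized blocks and a single lost block; the function names are hypothetical and Reed-Solomon codes are not covered.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity_block):
    """Rebuild the single missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity_block])


def update_parity(old_parity, old_block, new_block):
    """Recompute parity after one block changes, without reading the others."""
    return xor_blocks([old_parity, old_block, new_block])


# One stripe across three data disks plus one parity disk.
stripe = [b"abcd", b"efgh", b"ijkl"]
parity = xor_blocks(stripe)                      # capacity penalty: one extra block
recovered = reconstruct([stripe[0], stripe[2]], parity)  # disk 1 lost
```

The extra parity block is the capacity penalty, and `update_parity` shows the bandwidth penalty: every data write also requires reading the old data and parity and writing the new parity.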

21 Fail-Stop Processors
Fail-stop behavior: after a failure, the processor
- Stops executing
- Loses its internal state, including the volatile memory
- Can be detected as failed by any other processor
Impossible to implement with just one processor
k-fail-stop implementation

22 Reliable Broadcast
- Reliable
- Atomic
- Causal
- Using message forwarding
- Using piggybacked acks

23 Checkpoint
What is checkpointing?
- The saved local states of a system are called a checkpoint.
- The process of saving checkpoints to stable storage is called checkpointing.
Why is checkpointing needed?
- Checkpointing is used to bring a system to a consistent state after failures (rollback recovery).

24 Checkpoint (cont.)
- Simplifies the task of determining which transactions' actions need to be undone or redone when a failure occurs.
- A checkpoint record contains a list of active transactions.
Steps:
- Write a begin-checkpoint record into the log.
- Collect the checkpoint data into stable storage.
- Write an end-checkpoint record into the log.
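The three checkpointing steps can be sketched as follows. The record layout, the `take_checkpoint` and `last_complete_checkpoint` names, and the use of a Python list and dict to stand in for the log and stable storage are all assumptions made for illustration.

```python
def take_checkpoint(log, stable_storage, active_transactions, state):
    """Follow the three steps: begin record, save data, end record."""
    checkpoint_id = len(stable_storage)
    # Step 1: the begin record carries the list of active transactions.
    log.append(("BEGIN_CHECKPOINT", checkpoint_id, sorted(active_transactions)))
    # Step 2: collect the checkpoint data into stable storage (copied here).
    stable_storage[checkpoint_id] = dict(state)
    # Step 3: the end record marks the checkpoint as complete.
    log.append(("END_CHECKPOINT", checkpoint_id))
    return checkpoint_id


def last_complete_checkpoint(log):
    """Return the id of the newest checkpoint that has an end record."""
    for record in reversed(log):
        if record[0] == "END_CHECKPOINT":
            return record[1]
    return None
```

Bracketing the data with begin and end records means that a crash in the middle of a checkpoint leaves a begin record with no matching end record, so recovery can ignore the incomplete checkpoint and fall back to the previous one.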

25 Classification of Checkpoint Algorithms
- Uncoordinated checkpointing (asynchronous): each process may take a checkpoint when it is most convenient.
- Coordinated checkpointing (synchronous): when a process takes a checkpoint, it asks all relevant processes to take checkpoints.

26 Uncoordinated Checkpointing Algorithms
Allows each process maximum autonomy in deciding when to take a checkpoint.
Disadvantages:
- Possibility of the domino effect
- A process may take a useless checkpoint
- Forces each process to maintain multiple checkpoints

27 Coordinated Checkpointing Algorithm
Advantages:
- Not susceptible to the domino effect
- Maintains only one permanent checkpoint
Disadvantages:
- Large latency in committing output
- High overhead

28 Domino Effect
- An erroneous process rolls back to its previous checkpoint.
- A process that received a message sent by the recovering process after the rollback point also needs to roll back.
- When P2 rolls back to C, P1 needs to roll back to B, then P0 needs to roll back to A, then P2 needs to roll back again, and so on.
[Figure: processes P0, P1, P2 with checkpoints A, B, C]
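The cascade can be reproduced with a small rollback-propagation sketch. The model is an assumption made for illustration: each process has a list of checkpoint times that includes 0 for its initial state, each message is a tuple (sender, send time, receiver, receive time), and a receiver must roll back whenever it holds a message whose send has been undone. The function name and the example times are hypothetical.

```python
def recovery_line(checkpoints, messages, failed, fail_time):
    """Return {process: time it is restored to}; inf means no rollback."""

    def latest_before(proc, t):
        # Newest checkpoint taken strictly before time t.
        return max(c for c in checkpoints[proc] if c < t)

    line = {p: float("inf") for p in checkpoints}
    line[failed] = latest_before(failed, fail_time)

    changed = True
    while changed:
        changed = False
        for sender, sent, receiver, received in messages:
            # Orphan message: the send was undone by the sender's rollback,
            # but the receipt is still part of the receiver's state.
            if sent > line[sender] and received <= line[receiver]:
                line[receiver] = latest_before(receiver, received)
                changed = True
    return line


# Checkpoints A (P0, t=1), B (P1, t=3), C (P2, t=5), plus initial states at 0.
checkpoints = {"P0": [0, 1], "P1": [0, 3], "P2": [0, 5]}
messages = [("P2", 6, "P1", 7), ("P1", 4, "P0", 5), ("P0", 2, "P2", 3)]
line = recovery_line(checkpoints, messages, failed="P2", fail_time=9)
```

In this example P2 first rolls back to C, which forces P1 back to B and P0 back to A; the message P0 sent after A then forces P2 past C to its initial state, which is the cascade the slide describes.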

29 Rollback Recovery
If an error occurs in a process, all processes that depend on it are rolled back to a consistent state and restarted.

30 Issues in Rollback Recovery
- Minimize the extent of rollback.
- Minimize the number of processes rolling back.
- Avoid the domino effect: a cascade of rollbacks that ends up restoring the processes to their initial state.

31 Recovery Techniques
- Deferred update: all actual updates to the database are postponed until the transaction completes its execution successfully and reaches its commit point.
- Immediate update: the database is updated immediately, without waiting for the transaction to commit.
- Logging
- Shadow paging

32 Logging Implementation
- Do it at page level.
- Maintain a log buffer in memory.
- Write-ahead log protocol:
  - Flush the log buffer before dirty pages go to disk.
  - Flush the log buffer at commit.
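The two flush rules of the write-ahead log protocol can be sketched with an in-memory model. This is only an illustration under assumptions: the `WriteAheadLog` class, its record layout (sequence number, transaction, page id, data), and the use of Python lists and dicts in place of a real log file and disk are all hypothetical.

```python
class WriteAheadLog:
    """Toy page-level log with an in-memory buffer and a 'stable' log."""

    def __init__(self):
        self.buffer = []       # log records not yet flushed
        self.stable_log = []   # log records assumed to be on disk
        self.dirty_pages = {}  # page_id -> (data, sequence number of update)
        self.disk_pages = {}   # pages assumed to be on disk

    def _next_lsn(self):
        return len(self.stable_log) + len(self.buffer)

    def update(self, txn, page_id, data):
        """Log the page update in the buffer and mark the page dirty."""
        lsn = self._next_lsn()
        self.buffer.append((lsn, txn, page_id, data))
        self.dirty_pages[page_id] = (data, lsn)
        return lsn

    def flush_log(self):
        self.stable_log.extend(self.buffer)
        self.buffer.clear()

    def write_page(self, page_id):
        """Rule 1: flush the log before a dirty page goes to disk."""
        data, lsn = self.dirty_pages.pop(page_id)
        if self.buffer and self.buffer[0][0] <= lsn:
            self.flush_log()
        self.disk_pages[page_id] = data

    def commit(self, txn):
        """Rule 2: flush the log buffer at commit."""
        self.buffer.append((self._next_lsn(), txn, None, "COMMIT"))
        self.flush_log()
```

Because the log record always reaches stable storage before the page it describes, recovery can undo or redo any page update it finds on disk.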

33 Shadow Paging
- This recovery technique uses a shadow directory, which holds the page entries as of the last commit, and a current directory, which holds the entries for the updates in progress.
- Whenever a page is updated, a new copy of the modified database page is created.
- Committing a transaction corresponds to discarding the previous shadow directory.
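The interplay of the two directories can be sketched with a toy page store. The `ShadowPagedDB` class and its method names are hypothetical, the directories are plain dicts mapping logical pages to physical page ids, and garbage collection of superseded pages is deliberately left out.

```python
class ShadowPagedDB:
    """Toy shadow-paging store for one transaction at a time."""

    def __init__(self, pages):
        self.store = dict(enumerate(pages))               # physical id -> data
        self.shadow = {i: i for i in range(len(pages))}   # committed directory
        self.current = None                               # working directory
        self.next_id = len(pages)

    def begin(self):
        # The current directory starts as a copy of the shadow directory.
        self.current = dict(self.shadow)

    def write(self, logical, data):
        # Never overwrite in place: store a new copy and repoint the entry.
        self.store[self.next_id] = data
        self.current[logical] = self.next_id
        self.next_id += 1

    def read(self, logical):
        directory = self.current if self.current is not None else self.shadow
        return self.store[directory[logical]]

    def commit(self):
        # The current directory becomes the new shadow; the old one is dropped.
        self.shadow, self.current = self.current, None

    def abort(self):
        # Recovery is trivial: discard the current directory.
        self.current = None
```

No log records are written, and an abort only drops the current directory, but each transaction copies the whole directory and leaves superseded physical pages behind in `store`.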

34 Shadow Paging (cont.)
Advantages:
- No overhead of writing log records
- Recovery is trivial
Disadvantages:
- Storage management becomes complex
- Copying the entire page table is very expensive
- Garbage collection becomes difficult when a transaction commits
- Data gets fragmented (related pages get separated)

35 Goals in Designing a Recovery Method
Important goals in designing a recovery method:
1. Simplicity
2. Flexible storage management
3. Partial rollbacks
4. Flexible buffer management
5. Parallelism and fast recovery
6. Minimal overhead

36 Any Questions?

