1
Fault-tolerance and Availability in Distributed Systems. Distributed Systems, Lecture #11
2
How to deal with failure Do nothing Fail-fast: Ethernet Fail-safe: traffic light Fail-soft: Boeing 777 Fail-mask: –Let’s worry about this some more...
3
Fail Masking Must detect error –System must be analyzable –Boundaries must be clearly defined –Must monitor the “health” of the system Must correct the error –Understand the “correct” behavior of the system –Typically employ redundancy
4
Analyzability Use a language that can be easily analyzed –Constrained, domain-specific languages –Formal verification systems –Finite-state automata Regression testing Experimental evaluation –MTTF
5
Monitoring Useful for black-box analysis Periodic Ping –De facto system monitoring –TCP_KEEP_ALIVE Performance monitoring –System slowdown beyond a threshold –DDoS Stack state –Java loop termination Overhead? –Must keep monitoring overhead low Increase or decrease monitoring after failure?
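The periodic-ping idea can be sketched as a heartbeat monitor: every node reports in regularly, and a node that stays silent longer than a timeout is suspected to have failed. This is a minimal illustration, not the lecture's own code; the class name `HeartbeatMonitor`, the `timeout_s` parameter, and the injectable `clock` are assumptions made for the example.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and flags nodes that went quiet."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock      # injectable so the sketch can be driven by a fake clock
        self.last_seen = {}     # node name -> time of last heartbeat

    def heartbeat(self, node):
        """Record that `node` just answered a ping."""
        self.last_seen[node] = self.clock()

    def suspected(self):
        """Return the nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)
```

A short timeout detects failures sooner but raises both the monitoring overhead and the false-positive rate, which is the trade-off the slide asks about.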
6
Understand failure What is an ‘error’ –Slow down By how much? –Inconsistency Consistency semantics? –Data corruption Checksum Classification of Errors –Statistical analysis –False positives What is an acceptable rate?
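Detecting data corruption with a checksum can be sketched as follows. The example appends a CRC-32 (from Python's standard `zlib` module) to each record and re-checks it on read; the function names `seal` and `verify` and the 4-byte trailer layout are assumptions made for illustration.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Return the payload followed by its 4-byte big-endian CRC-32."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify(record: bytes) -> bool:
    """Recompute the CRC-32 of the payload and compare it with the stored trailer."""
    payload, stored = record[:-4], record[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored
```

A CRC catches accidental corruption such as flipped bits; it is not a defense against deliberate tampering.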
7
Out-of-place recovery Shadowing –Keep versions, never replace –Only update access paths –Disk space is cheap Differentials –For each file, maintain differentials –Only insertions, deletions –Update?
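Shadowing can be sketched with an in-memory store in which a write never modifies an existing version: it adds a new version and then switches a single pointer (the "access path"). The class `ShadowStore` and its method names are hypothetical, chosen only for this example.

```python
class ShadowStore:
    """Keeps every version; a write only redirects the access path."""

    def __init__(self):
        self.versions = {}   # version id -> content, never overwritten
        self.current = {}    # name -> version id (the access path)
        self._next_id = 0

    def write(self, name, content):
        """Store a new version and return the id of the version it shadows."""
        vid = self._next_id
        self._next_id += 1
        self.versions[vid] = content       # old versions stay untouched
        previous = self.current.get(name)
        self.current[name] = vid           # the only in-place update
        return previous

    def read(self, name):
        return self.versions[self.current[name]]

    def rollback(self, name, vid):
        """Recover by pointing the access path back at an older version."""
        self.current[name] = vid
```

Recovery is then a pointer switch rather than a repair of damaged data, at the cost of the extra space the slide calls cheap.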
8
Fault-recovery Logging –Undo –Redo Durable When to flush?
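Undo and redo logging can be sketched with a recovery routine that replays a log after a crash. The record layout, `("write", txn, key, old, new)` and `("commit", txn)`, is an assumption made for this example, as is the premise that no transaction overwrites another transaction's uncommitted data.

```python
def recover(log, store):
    """Bring `store` to a consistent state using undo/redo log records."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}

    # Redo pass: reapply the writes of committed transactions in log order.
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, _, key, _, new = rec
            store[key] = new

    # Undo pass: restore old values of uncommitted writes, newest first.
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] not in committed:
            _, _, key, old, _ = rec
            store[key] = old

    return store
```

The sketch only works if the log itself is durable: a log record has to reach stable storage before the change it describes does, which is why the flush policy matters.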
9
Fault-correction Redundancy –Encode FEC –Replicate Aha ….
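As one very simple instance of encoding for redundancy, the sketch below adds a single XOR parity block to a group of equal-length data blocks, so that any one lost block can be rebuilt from the rest. The lecture does not name a specific code; single parity and the function names here are assumptions for illustration.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def encode(data_blocks):
    """Return the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover_missing(blocks):
    """Rebuild the single block marked None from the surviving blocks."""
    missing = blocks.index(None)
    survivors = [b for b in blocks if b is not None]
    repaired = list(blocks)
    repaired[missing] = xor_blocks(survivors)
    return repaired
```

Parity costs one extra block per group, whereas full replication costs a whole extra copy but also tolerates the loss of more than one block.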
10
Overview Ordering –Lazy vs. absolute Transactions –Two-phase commit –Three-phase commit –Quorum-based protocols
11
Availability and Replication Global ordering –Timestamping Absolute –Vector clocks Causal ordering More available But lazy
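Vector clocks can be sketched as dictionaries mapping a node name to a counter. Clock `a` happened before clock `b` when every entry of `a` is less than or equal to the matching entry of `b` and at least one is strictly smaller; if each clock is ahead of the other somewhere, the events are concurrent. The function names are assumptions made for this example.

```python
def tick(clock, node):
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def merge(a, b):
    """Entry-wise maximum, applied when a message carrying `b` is received."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def happened_before(a, b):
    """True if the event stamped `a` causally precedes the event stamped `b`."""
    nodes = a.keys() | b.keys()
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))


def concurrent(a, b):
    """True if neither event causally precedes the other."""
    nodes = a.keys() | b.keys()
    return (any(a.get(n, 0) > b.get(n, 0) for n in nodes)
            and any(b.get(n, 0) > a.get(n, 0) for n in nodes))
```

Unlike a single absolute timestamp, this yields only a partial order: concurrent updates are detected but not resolved, so the application still has to decide how to reconcile them.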
12
Optimistic Replication Let everyone make changes –Only 3% of transactions ever abort Make changes, send updates –If someone else’s changes come through with T_him < T_you, your changes are overridden Wait for a bit before committing –deadlocks
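The override rule can be sketched with a replica that holds changes as tentative until a waiting period has passed. Local and remote changes both go through `propose`; a change with a smaller timestamp displaces a tentative one with a larger timestamp. The class `Replica`, its method names, and the assumption that timestamps are unique are all introduced for this example.

```python
class Replica:
    """Holds tentative changes until they are committed."""

    def __init__(self):
        self.committed = {}   # key -> value
        self.tentative = {}   # key -> (timestamp, value), not yet committed

    def propose(self, key, value, ts):
        """Apply a local or remote change; return False if it is overridden."""
        current = self.tentative.get(key)
        if current is not None and current[0] < ts:
            return False                   # an earlier change already holds the key
        self.tentative[key] = (ts, value)  # this change displaces any later one
        return True

    def commit(self, key):
        """Called after the waiting period: make the surviving change permanent."""
        _, value = self.tentative.pop(key)
        self.committed[key] = value
```

The waiting period is what gives a competing change with a smaller timestamp the chance to arrive before anything becomes permanent.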
13
Two Phase Commit Blocking How?
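A minimal, single-process sketch of two-phase commit shows where the blocking comes from. In phase 1 the coordinator asks every participant to prepare and collects votes; in phase 2 it broadcasts the decision. The `Participant` class, its `will_commit` flag, and the state names are assumptions made for illustration; a real protocol also needs messaging, timeouts, and logging.

```python
class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit   # hypothetical switch to simulate a "no" vote
        self.state = "init"

    def prepare(self):
        """Phase 1: vote. A 'yes' voter must now wait for the coordinator."""
        self.state = "prepared" if self.will_commit else "aborted"
        return self.will_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Run both phases and return True if the transaction committed."""
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2: broadcast the decision.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision
```

A participant in the `prepared` state has promised to commit but does not know the outcome. If the coordinator fails between the two phases, that participant can neither commit nor abort on its own, so it blocks until the coordinator recovers.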
14
Three-phase Commit