Download presentation
Presentation is loading. Please wait.
Published byCarol Bridges Modified over 8 years ago
1
Distributed Systems Lecture 4 Failure detection 1
2
Previous lecture Big Data MapReduce 2
3
Motivation Large datacenters need to be available on demand and to process data reliably However, hardware fails Need to ensure hardware is available with small downtime – Amazon promises 99.95% uptime per month – Google promises >99.9% per month Is it enough? What do we do when hardware fails? Butler Lampson A distributed system is a system in which I can’t get my work done because a computer that I’ve never heard of has failed. 3
4
Example Rate of disk failure is once every 10 years For 120 servers in the datacenter this translates to once every month For 120,000 servers it goes down to 7.2 hours What are our options? 1.Hire a team to monitor machines in the datacenter and report to you when they fails 2.Write a failure detector program (distributed) that automatically detects failures and reports to your workstation In 2002 when ASCI Q supercomputer (2 nd fastest in the world at the time) was installed at New Mexico Lab the computer could not run more than an hour without crashing 4
5
Synchronous Distributed System – Each message is received within bounded time – Each step in a process takes lb < time < ub Each local clock’s drift has a known bound – Example: Multiprocessor systems Asynchronous Distributed System – No bounds on message transmission delays – No bounds on process execution The drift of a clock is arbitrary – Example: Internet, wireless networks, datacenters, most real systems Two types of DSs 5
6
Problems with Distributed Systems Failures are more frequent – Many places to fail – More complex More pieces New challenges: asynchrony, communication Potential problems of failures – Single system everything stops – Distributed some parts may continue 6
7
First step: Goals Availability – Can I use it now? Reliability – Will it be up as long as I need it? Safety – If it fails, what are the consequences? Maintainability – How easy is it to fix if it breaks? 7
8
Next step: Failure models Failure: System does not behave as expected – Component-level failure (can compensate) – System-level failure (incorrect result) Fault: Cause of failure (component-level) – Transient: Not repeatable – Intermittent: Repeats, but (apparently) independent of system operations – Permanent: Exists until component repaired Failure model: How the system behaves when it doesn’t behave properly Failure semantics: describes and classifies errors that distributed systems can experience 8
9
Failure classification Correct – In response to inputs, behaves in a manner consistent with the service specification Omission Failure – Does not respond to input Crash: After first omission failure, subsequent requests result in omission failure Timing failure (early, late) – Correct response, but outside required time window Response failure – Value: Wrong output for inputs – State Transition: Server ends in wrong state 9
10
Crash failure types (based on recovery behavior) Crash-stop (fail-stop) – process halts and does not execute any further operations – Halting Never restarts Crash-recovery – process halts, but then recovers (reboots) after a while – Special case of crash-stop model (uses a new identifier on recovery) – Classification: Amnesia – Server recovers to predefined state independent of operations before crash Partial amnesia – Some part of state is as before crash, rest to predefined state Pause – Recovers to state before omission failure 10
11
Hierarchical failure masking – Dependency: Higher level gets (at best) failure semantics of lower level – Can compensate for lower level failure to improve this Example: – TCP fixes communication errors, so some failure semantics not propagated to higher level 11
12
Group failure masking Redundant servers – Failed server masked by others in group – Allows failure semantics of group to be higher than individuals k-fault tolerant – Group can mask k concurrent group member failures from client May “upgrade” failure semantics – Example: Group detects non-responsive server, other member picks up the slack – Omission failure becomes performance failure 12
13
pipi pjpj 13 Detecting failures
14
pipi pj X Crash-stop failure (p j is a failed process) 14 Detecting failures
15
pipi pj X needs to know about pj’s failure (pi is a non-faulty process or alive process) There are two main flavors of failure detectors 1. Ping-Ack (proactive) 2. Heartbeat (reactive) Crash-stop failure (p j is a failed process) 15 Detecting failures
16
pipi pjpj p i needs to know about p j ’s failure - p j replies - p i queries p j once every T time units - if p j does not respond within another T time units of being sent the ping, p i detects p j as failed ping ack Worst case Detection time = 2T If p j fails, then within T time units, p i will send it a ping message. p i will time out within another T time units. The waiting time T can be parameterized. 16 Ping-ack protocol
17
pipi pjpj - p j maintains a sequence number - p j sends p i a heartbeat with incremented sequence number after every T time units - if p i has not received a new heartbeat for the past, say 3T time units, since it received the last heartbeat, then p i detects p j as failed heartbeat If T >> round trip time of messages, then worst case detection time ~ 3*T (why?) The 3 can be changed to any positive number since it is a parameter 17 Heartbeat protocol p i needs to know about p j ’s failure
18
The Ping-ack and Heartbeat failure detectors are always “correct” – If a process p j fails, then p i will detect its failure as long as p i itself is alive Why? – Ping-ack: set waiting time T to be > round trip time upper bound p i p j latency + p j processing + p j p i latency + p i processing time – Heartbeat: set waiting time 3T to be > round trip time upper bound 18 Synchronous DS case
19
Completeness = every process failure is eventually detected (no misses) Accuracy = every detected failure corresponds to a crashed process (no mistakes) Completeness and Accuracy – Can both be guaranteed 100% in a synchronous distributed system – Can never be guaranteed simultaneously in an asynchronous distributed system 19 Failure detector properties
20
Impossible because of arbitrary message delays & message losses – If a heartbeat/ack is dropped (or several are dropped) from p j, then p j will be mistakenly detected as failed inaccurate detection – How large would the T waiting period in ping-ack or 3T heartbeat waiting period, need to be to obtain 100% accuracy? – In asynchronous systems, delay/losses on a network link are impossible to distinguish from a faulty process Heartbeat – satisfies completeness but not accuracy Ping-Ack – satisfies completeness but not accuracy 20 Satisfying completeness and accuracy in asynchronous DS
21
Most failure detector implementations are willing to tolerate some inaccuracy, but require 100% completeness Many distributed apps designed assuming 100% completeness, e.g., P2P systems – “Err on the side of caution” – Processes not “stuck” waiting for other processes If error in identifying is made then victim process rejoins as a new process and catches up Hearbeating and Ping-ack provide – Probabilistic accuracy: for a process detected as failed, with some probability close to 1.0 (but not equal) it is true that it has actually crashed 21 Completeness or accuracy in asynchronous DS
22
We want failure detection of not merely one process (p j ), but all processes in the DS Approaches: – Centralized heartbeat – Ring heartbeat – All-to-all heartbeat Who guards the failure detectors? 22 Failure detection across the DS
23
… p j, Heartbeat Seq. i++ pjpj pipi 23 Centralized heartbeat
24
p j, Heartbeat Seq. i++ pjpj … … pipi No SPOF (single point of failure) 24 Ring heartbeat
25
p j, Heartbeat Seq. i++ … pjpj pipi Advantage: Everyone is able to keep track of everyone 25 All-to-all heartbeat
26
Bandwidth: – the number of messages sent in the system during steady state (no failures) – Small is good Detection time – Time between a process crash and its detection – Small is good Scalability: – How do bandwidth and detection properties scale with N, the number of processes? Accuracy – Large is good 26 Detection efficiency metrics
27
False Detection Rate/False Positive Rate (inaccuracy) – Multiple possible metrics 1.Average number of failures detected per second, when there are in fact no failures 2.Fraction of failure detections that are false Tradeoffs: If you increase the T waiting period in ping- ack or 3T waiting period in heartbeating what happens to: – Detection Time? – False positive rate? – Where would you set these waiting periods? 27 Accuracy metrics
28
Maintain a list of other alive (non-faulty) processes at each process in the system Failure detector is a component in membership protocol – Failure of p j detected delete p j from membership list – New machine joins p j sends message to everyone add p j to membership list Flavors – Strongly consistent: all membership lists identical at all times (hard, may not scale) – Weakly consistent: membership lists not identical at all times – Eventually consistent: membership lists always moving towards becoming identical eventually (scales well) 28 Membership protocols
29
Array of Heartbeat Seq. i for member subset Good accuracy properties pipi 29 Gossip protocols Mimic gossip in a social network efficient to use due to DS large scale In a random search the access time to any VM is of at most n 3 for a regular graph and a third degree polynomial for any graph
30
Gossip based failure detection 1 11012066 21010362 31009863 41011165 2 4 3 Protocol Each process maintains a membership list Each process periodically increments its own heartbeat counter Each process periodically gossips its membership list On receipt, the heartbeats are merged, and local times are updated 11011864 21011064 31009058 41011165 11012070 21011064 31009870 41011165 Current time : 70 at node 2 (asynchronous clocks) Address Heartbeat Counter Time (local) 30
31
O(log(N)) time for a heartbeat update to propagate to everyone with high probability Very robust against failures – even if a large number of processes crash, most/all of the remaining processes still receive all heartbeats Failure detection: If the heartbeat has not increased for more than T fail seconds, the member is considered failed – T fail usually set to O(log(N)). – But entry not deleted immediately: wait another T cleanup seconds (usually = T fail ) Why? 31 Gossip based failure detection
32
What if an entry pointing to a failed node is deleted right after T fail (=24) seconds? Solution: remember for another T cleanup 1 11012066 21010362 31009855 41011165 2 4 3 11012066 21011064 31009850 41011165 11012066 21011064 41011165 11012066 21011064 31009875 41011165 Current time : 75 at node 2 32 Gossip based failure detection
33
Communication omission failures – Send-omission: loss of messages between the sending process and the outgoing message buffer (both inclusive) – Channel omission: loss of message in the communication channel – Receive-omission: loss of messages between the incoming message buffer and the receiving process (both inclusive) Arbitrary failures – Arbitrary process failure: arbitrarily omits intended processing steps or takes unintended processing steps. – Arbitrary channel failures: messages may be corrupted, duplicated, delivered out of order, incur extremely large delays; or non-existent messages may be delivered. – These are known as Byzantine failures, e.g., due to hackers, man-in-the- middle attacks, viruses, worms, etc., and even bugs in the code – A variety of Byzantine fault-tolerant protocols have been designed in literature 33 Other types of failures
34
Class of failureAffectsDescription Fail-stop or Crash-stop ProcessProcess halts and remains halted. Other processes may detect this state. OmissionChannelA message inserted in an outgoing message buffer never arrives at the other end’s incoming message buffer. Send-omissionProcessA process completes asend, but the message is not put in its outgoing message buffer. Receive-omissionProcessA message is put in a process’s incoming message buffer, but that process does not receive it. Arbitrary (Byzantine) Process or channel Process/channel exhibits arbitrary behaviour: it may send/transmit arbitrary messages at arbitrary times, commit omissions; a process may stop or take an incorrect step. 34 Omission and arbitrary failures
35
Time and synchronization 35 Next lecture
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.