1 DRAFTS Fault Tolerance Some background Claudio Pinello
2 Hammurabi code (~1750 BC)
229. If a builder build a house for some one, and does not construct it properly, and the house which he built fall in and kill its owner, then that builder shall be put to death.
If it kill the son of the owner, the son of that builder shall be put to death.
If it kill a slave of the owner, then he shall pay slave for slave to the owner of the house.
If it ruin goods, he shall make compensation for all that has been ruined, and inasmuch as he did not construct properly this house which he built and it fell, he shall re-erect the house from his own means.
Inspiration: Prof. Patterson; reproduced without permission from
3 Some Terminology
A fault is the cause of an error; an error is the part of the system state that may cause a failure; a failure is the deviation of the system from its specification.
Adapted from: J.C. Laprie, “Dependability: basic concepts and terminology in English, French, German, Italian, and Japanese”, Springer-Verlag, 1992. Series title: Dependable computing and fault-tolerant systems.
4 Example
Office Desk
–lamp bulb fails (fault)
–light level drops (error)
–I can’t get work done (failure)
–unless…
5 One Good Idea: Redundancy
6 One Bad Idea: Redundancy
7 Structure
System-level fault tolerance
–avoid single points of failure
–avoid common-mode failures (e.g. the same bug in replicated software, all power supplies failing above 50 °C, etc.)
–fault isolation
–cross fingers!
8 Fault Model
Silent Faults
–faults result in omission errors
Crash Faults (fail-stop)
–faults result in crashes: no more data, ever!
Non-silent Faults
–faults result in value errors
Byzantine Faults
–malicious attacks, non-silent faults, bounded delays, etc…
9 Fault Detection
Typically check for errors
–Silent Faults: no errors? There are: “omission” errors! Detection is easy in synchronous systems; otherwise use timeouts.
Question: You are sick in bed. How do you know if your door bell is broken?
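The timeout idea for silent faults can be sketched as a heartbeat monitor. This is a minimal illustration, not a production detector: the class name `HeartbeatMonitor`, its methods, and the injectable `clock` parameter are all assumptions made for this example. Note that a timeout only yields a suspicion: a slow component and a dead one look the same until the deadline passes.

```python
import time


class HeartbeatMonitor:
    """Suspects a silent fault when no heartbeat arrives within timeout_s.

    Hypothetical helper for illustration; `clock` is injectable so the
    behavior can be exercised without real waiting.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = clock()  # treat construction time as the first heartbeat

    def heartbeat(self):
        """Record that the monitored component just showed a sign of life."""
        self.last_seen = self.clock()

    def suspected(self):
        """True when the silence has lasted longer than the timeout."""
        return self.clock() - self.last_seen > self.timeout_s
```

In the door bell question, the monitor plays the role of a friend who rings at agreed times: silence past the deadline is the only evidence of the fault.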
10 Fault Detection
Typically check for errors
–Non-silent faults: how do you know if a result is wrong?
–e.g. your calculator computes sin(): how do you know if it is faulty?
–BTW: what time is it?
11 Fault Detection
Non-silent faults: try voting
–with n voters, a majority vote tolerates up to ⌈n/2⌉ − 1 faults (e.g. 1 fault for n = 3)
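A majority voter over replicated results can be sketched as follows. The function names `majority_vote` and `max_tolerated_faults` are assumptions for this example, and the sketch assumes results are hashable and compared for exact equality (floating-point results would usually need a tolerance instead).

```python
from collections import Counter


def majority_vote(results):
    """Return the value reported by a strict majority of replicas.

    Raises RuntimeError when no value has more than half of the votes,
    i.e. when too many replicas disagree to mask the faults.
    """
    if not results:
        raise ValueError("no results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if 2 * count > len(results):
        return value
    raise RuntimeError("no strict majority: too many faulty replicas")


def max_tolerated_faults(n):
    """Largest number of faulty voters a strict majority can outvote.

    Equals ceil(n / 2) - 1, written with integer arithmetic.
    """
    return (n - 1) // 2
```

With three replicas, `majority_vote([0.5, 0.5, 0.7])` masks the single wrong value; going from three to four replicas does not raise the number of tolerated faults, which is why odd replica counts are the usual choice.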
12 Fault Detection
Typically check for errors
–Byzantine faults: oh my! You can’t trust people on chatlines…
–can you ask them the time?
–the account number of the Red Cross for a donation?
–would you ask them what medicine to take?
13 Byzantine Generals
question: “attack or retreat?”
message passing (oral/written)
there are traitors
goal: reach consensus among the non-traitors
14 Byzantine Generals
Basic algorithm (by Lamport et al.)
–m + 1 rounds of oral message passing to tolerate m traitors
–use majority voting, decide
Tolerates fewer than 1/3 traitors (n ≥ 3m + 1 generals for m traitors)
If you can use signed messages, the 1/3 limit no longer applies
All methods require bounded asynchrony, i.e. bounded delays
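The oral-message algorithm can be simulated in a few lines. This is a sketch of the recursive OM(m) scheme under stated assumptions: generals are integers, the traitor behavior in `send` (lying based on the receiver's parity) is a made-up adversary chosen only to produce conflicting messages, and ties in the vote fall back to a default of "retreat". The names `om`, `send`, and `majority` are illustrative.

```python
from collections import Counter

ATTACK, RETREAT = "attack", "retreat"


def majority(values):
    """Strict majority of the values, or the default RETREAT if there is none."""
    top, count = Counter(values).most_common(1)[0]
    return top if 2 * count > len(values) else RETREAT


def send(sender, receiver, value, traitors):
    """Loyal senders relay the value; traitors send conflicting values.

    The parity rule is a hypothetical adversary used for illustration only.
    """
    if sender in traitors:
        return ATTACK if receiver % 2 == 0 else RETREAT
    return value


def om(m, commander, lieutenants, value, traitors):
    """Simulate OM(m); return {lieutenant: value that lieutenant decides on}."""
    # Round 1: the commander sends its value to every lieutenant.
    received = {l: send(commander, l, value, traitors) for l in lieutenants}
    if m == 0:
        return received
    # Each lieutenant j then acts as commander in OM(m - 1) toward the others.
    relayed = {}
    for j in lieutenants:
        others = [l for l in lieutenants if l != j]
        relayed[j] = om(m - 1, j, others, received[j], traitors)
    # Each lieutenant votes over its own copy and the relayed copies.
    decisions = {}
    for i in lieutenants:
        votes = [received[i]] + [relayed[j][i] for j in lieutenants if j != i]
        decisions[i] = majority(votes)
    return decisions
```

For example, `om(1, 0, [1, 2, 3], ATTACK, traitors={3})` lets the two loyal lieutenants follow the loyal commander, while the same call with only three generals, `om(1, 0, [1, 2], ATTACK, traitors={2})`, leaves the loyal lieutenant unable to tell who is lying. Entries for traitorous lieutenants in the returned dictionary are meaningless and should be ignored.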
15 What model to use?
Depends on your application
–internet transactions? probably Byzantine
–embedded systems? usually non-silent faults are sufficient, but… more networked applications…
–channel transmission? using a CRC, one “approximates” fail silence
HW faults or SW faults?
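The CRC remark can be made concrete: a receiver that drops every frame with a bad checksum turns most value errors into omission errors. The sketch below uses `zlib.crc32` from the Python standard library; the framing layout (payload followed by a 4-byte big-endian CRC) and the names `frame` and `unframe` are assumptions for this example. It only approximates fail silence, since a CRC can miss some multi-bit corruptions.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload as a 4-byte big-endian trailer."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unframe(data: bytes):
    """Return the payload if the CRC matches, else None.

    Returning None models dropping the message: the corrupted value is
    never delivered, so the receiver sees an omission instead.
    """
    if len(data) < 4:
        return None
    payload, crc = data[:-4], int.from_bytes(data[-4:], "big")
    return payload if zlib.crc32(payload) == crc else None
```

A receiver built this way can reuse the timeout-based detection for silent faults, because a corrupted frame now looks the same as a lost one.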
16 More on redundancy
Space-redundancy
–hw (e.g. 4 brakes, RAID disks, batteries,…)
–data structures (e.g. RAID)
–software (e.g. Domain name servers)
Time-redundancy
–same person, compute twice
–“reload” in web-browsers
–helps against transient faults
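Time redundancy ("compute twice") can be sketched as repeating a computation on the same resource until two consecutive runs agree. The function name `run_with_time_redundancy` and the `max_attempts` parameter are assumptions for this example. The approach only helps when the fault is transient; a permanent fault would produce the same wrong answer every time and pass the comparison.

```python
def run_with_time_redundancy(compute, max_attempts=3):
    """Call compute() until two consecutive results agree, then return it.

    Raises RuntimeError if no two consecutive runs agree within
    max_attempts calls, which suggests the fault is not transient.
    """
    previous = compute()
    for _ in range(max_attempts - 1):
        current = compute()
        if current == previous:
            return current
        previous = current
    raise RuntimeError("results never agreed; the fault may be permanent")
```

Compared with space redundancy, this trades extra latency for not needing a second copy of the hardware.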
17 Recovery
You detected a fault, now what?
Isolate the fault to avoid further errors
Recover from the fault
–backtrack to a known good checkpoint
–start another agent to compute the result
–use another already available result
–reduce functionality (e.g. slow down)
–bring the system to a safe state (e.g. turn off engine)
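Backtracking to a known good checkpoint can be sketched with an in-memory snapshot. Everything here is illustrative: the class `CheckpointedState`, the helper `apply_step`, and the caller-supplied `is_valid` check are assumptions, and a real system would typically persist checkpoints outside the failing component.

```python
import copy


class CheckpointedState:
    """Keeps a deep copy of the last state that was known to be good."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        """Declare the current state good and remember it."""
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        """Discard the current state and restore the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


def apply_step(cs, step, is_valid):
    """Run step on the state; keep it if valid, otherwise roll back.

    Returns True when the step was accepted and False when it was undone.
    """
    try:
        step(cs.state)
        ok = is_valid(cs.state)
    except Exception:
        ok = False  # a crash inside the step is handled like a detected error
    if ok:
        cs.checkpoint()
    else:
        cs.rollback()
    return ok
```

The quality of the recovery depends entirely on the error check: a step that corrupts the state in a way `is_valid` cannot see will be checkpointed as if it were good.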
18 Conclusions
Faults do occur, do you care?
Model them
Use redundancy right!
System-level fault tolerance
Techniques exist, some are complex to get right