Fault Tolerance Chapter 7
Topics Basic Concepts Failure Models Redundancy Agreement and Consensus Client Server Communication Group Communication and Virtual Synchrony Atomic Commit, Recovery, Checkpointing
Basic Concepts An important goal in distributed systems (DS) is to make the system resilient to failures of some of its components. Fault tolerance (FT) is frequently one of the reasons for making a system distributed in the first place. Dependability includes: Availability, Reliability, Safety, Maintainability.
Goals Availability: Can I use it now? Probability of being up at any given time. Reliability: Will it be up as long as I need it? Ability to run continuously without failure. If a system crashes briefly every hour, it may still have good availability (it is up most of the time) but has poor reliability, because it cannot run for very long before crashing. Safety: If it fails, does nothing bad happen? Maintainability: How easy is it to fix if it breaks?
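The availability/reliability distinction can be checked with a little arithmetic. The sketch below assumes, purely for illustration, that the hourly crash lasts one second; the function name and the numbers are hypothetical and do not come from the slides. It uses the common steady-state estimate availability = MTTF / (MTTF + MTTR).

```python
def availability(mttf_s: float, mttr_s: float) -> float:
    """Fraction of time the system is up, given mean time to failure
    and mean time to repair (both in seconds)."""
    return mttf_s / (mttf_s + mttr_s)


# Assumed scenario: the system runs 3599 s, then is down for 1 s, every hour.
mttf_s = 3599.0   # longest failure-free run: a measure of (poor) reliability
mttr_s = 1.0      # assumed duration of each brief crash

a = availability(mttf_s, mttr_s)
print(f"availability = {a:.5f}, longest continuous run = {mttf_s:.0f} s")
```

Under these assumed numbers the system is up more than 99.9% of the time, yet it never runs for a full hour without failing.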
Definitions FAULT - A fault is the cause of an error. FAULT TOLERANCE - A system can continue to function even in the presence of faults. Classification of faults: –Transient faults - occur once, then disappear. –Intermittent faults - occur, go away, then come back, go away … –Permanent faults - do not go away by themselves, like disk failures.
Failure Models Different types of failures.
Type of failure               Description
Crash failure or fail-stop    A server halts, but is working correctly until it halts
Omission failure              A server fails to respond to incoming requests
  Receive omission            A server fails to receive incoming messages
  Send omission               A server fails to send messages
Timing failure                A server's response lies outside the specified time interval
Response failure              The server's response is incorrect
  Value failure               The value of the response is wrong
  State transition failure    The server deviates from the correct flow of control
Arbitrary or Byzantine        A server may produce arbitrary responses at arbitrary times
Network Failures Link failure (one-way or two-way): 5 can talk to 6, but 6 cannot talk to 5. Network partitions: the network 1,2,3,4,5,6 is partitioned into 1,2,3,4 and 5,6.
Are The Models Realistic? No, of course not! Synchronous vs asynchronous –Asynchronous model is too weak (real systems have clocks, “most” timing meets expectations… but heavy tails) –Synchronous model is too strong (real systems lack a way to implement synchronized rounds) Failure types –Crash fail (fail-stop) model is too weak (systems usually display some odd behavior before dying) –Byzantine model is too strong (assumes an adversary of arbitrary speed who designs the “ultimate attack”)
Models: Justification If we can do something in the asynchronous model, we can probably do it even better in a real network –Clocks, a-priori knowledge can only help… If we can’t do something in the synchronous model, we can’t do it in a real network –After all, synchronized rounds are a powerful, if unrealistic, capability to introduce If we can survive Byzantine failures, we can probably survive a real distributed system.
Fault Tolerance Strategies Redundancy –Hardware, software, informational, temporal Hierarchy –Confinement of errors
Failure Masking by Redundancy Triple modular redundancy. Voter circuits choose majority of inputs to determine correct output
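A voter of this kind can be sketched in software. The function below is a minimal illustration, not a description of any particular circuit: it assumes the three module outputs are integers and takes the majority bit by bit, so a single faulty module is outvoted. The name tmr_vote is hypothetical.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three module outputs: each output bit is 1
    exactly when at least two of the three inputs have that bit set."""
    return (a & b) | (a & c) | (b & c)


# One module (the third) delivers a corrupted word; the voter masks it.
good = 0b1011
faulty = 0b0011
masked = tmr_vote(good, good, faulty)
```

With two or more faulty modules the majority may itself be wrong, which is why triple redundancy masks only a single failure per voting stage.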
Flat Groups versus Hierarchical Groups a) Communication in a flat group. b) Communication in a simple hierarchical group.
Identical Processes, Fail-stop A system is K fault tolerant if it can withstand faults in K components and still produce correct results. Example: FT through replication - each replica reports a result. If the nodes in a DS are fail-stop and there are K+1 identical processes, then the system can tolerate K failures: the result comes from the remaining one
Identical Processes, Byzantine Failures If K failures are Byzantine (with K-collusion) then 2K+1 processes are needed for K FT. Example: K processes can be faulty and "lie" about their result. (If they simply fail to report a result, that is not a problem). If there are 2K+1 processes, at least K+1 will be correct and report the same correct answer. So by taking the result reported by at least K+1 (which is a majority), we get the correct answer.
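The K+1 threshold can be sketched as a small voting function. This is an illustrative sketch only: the function name and the sample values are hypothetical, and it assumes every replica reports some result.

```python
from collections import Counter


def vote(results, k):
    """Return the value reported by at least k+1 replicas, or None if no
    value reaches that threshold. `results` must be non-empty."""
    value, count = Counter(results).most_common(1)[0]
    return value if count >= k + 1 else None


# K = 2 with 2K+1 = 5 replicas: two colluding liars cannot outvote three correct ones.
enough = vote([42, 42, 42, 7, 7], k=2)

# Only 2K = 4 replicas: the two liars tie with the correct replicas, so no decision.
too_few = vote([42, 42, 7, 7], k=2)
```

Requiring K+1 matching reports guarantees that at least one of them comes from a correct process, since at most K processes can lie.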
Agreement section Distributed agreement or "distributed consensus" is the fundamental problem in DS. –Distributed mutual exclusion and election are basically getting processes to agree on something. –Agreeing on time or the update of replicated data are special cases of the distributed consensus problem. Agreement sometimes means one process proposes a value and the others agree on it while consensus means all processes propose values and all agree on some function of those values.
Consensus (Agreement) There are M processes, P1, P2, … PM, in a DS that are trying to reach agreement. A subset F of the processes is faulty. Each process Pi stores a value Vi. During agreement, the processes each calculate a value Ai. At the end of the algorithm: – All non-faulty processes reach a decision. – For every pair of non-faulty processes Pi and Pj, Ai = Aj. This is the agreement value. – The agreement value is a function of the initial values {Vi} of the non-faulty processes. The function is often max (as in the case of election) or average or one of the Vi. If all non-faulty processes have the same Vi, then that must be the agreement value.
Consensus: Easy Case: No Failures No failures, synchronous, M processes. If there can be no failures, reaching consensus is easy. Every process sends its value to every other process. All processes now have identical info. All processes do the same calculation and come up with the same value. Processes need to maintain an array of M values. P1 has {1,2,3,4} P2 has {1,2,3,4} P3 has {1,2,3,4} P4 has {1,2,3,4}
Consensus: Fail-stop Fairly easy case: fail-stop, synchronous. If faulty processes are fail-stop, reaching consensus is reasonably easy: all non-faulty processes send their values to all others. However, K of them may fail at some time during the process... P1 has {1,2,3,4} P2 has {1,2,3,4} P3 has {x,2,3,4} P4 has {x,2,3,4}
Consensus: Fail-stop Solution: after all processes send their values to all others, every process broadcasts all the values it received (and from whom). This continues for f+1 rounds, where f = |F|. Processes maintain a tree of values. After the second round, P4 has: 1st round {x,2,3,4}; from P2 {1,2,3,4}; from P3 {x,2,3,4}.
Consensus: Fail-stop If M=4 and f=1, then we need f+1=2 rounds to reach consensus (previous example). Do we really need f+1 rounds? Consider M=4, f=2: P1 crashes during the 1st round after sending to P2. P2 crashes during the 2nd round after sending to P3. After the 1st round: P2:{1,2,3,4} P3:{x,2,3,4} P4:{x,2,3,4}
Consensus: Fail-stop What do P3 and P4 see?
           P2                      P3           P4
Round 1    {1,2,3,4}               {x,2,3,4}    {x,2,3,4}
Round 2    sends to P3 and dies    {1,2,3,4}    {x,2,3,4}
Round 3                            {1,2,3,4}    {1,2,3,4}
If processes are fail-stop, we can tolerate any number of faulty processes; however, we need f+1 rounds.
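The M=4, f=2 scenario can be replayed with a small simulation. This is a sketch under stated assumptions, not a reference implementation: rounds are perfectly synchronous, a crashing process reaches only a chosen subset of recipients in its final round, and the names flood_consensus and crashes are hypothetical.

```python
def flood_consensus(values, crashes, rounds):
    """Simulate synchronous flooding with fail-stop crashes.

    values:  dict pid -> initial value
    crashes: dict pid -> (round, recipients reached before crashing)
    Returns, for each surviving process, the dict of values it knows.
    """
    known = {p: {p: v} for p, v in values.items()}
    alive = set(values)
    for rnd in range(1, rounds + 1):
        # Snapshot first: all messages of a round are sent simultaneously.
        outgoing = {p: dict(known[p]) for p in alive}
        for sender in sorted(alive):
            crash = crashes.get(sender)
            crashing = crash is not None and crash[0] == rnd
            targets = crash[1] if crashing else alive - {sender}
            for receiver in targets:
                if receiver in alive:
                    known[receiver].update(outgoing[sender])
        # Processes that crashed in this round take no further part.
        alive -= {p for p, (r, _) in crashes.items() if r == rnd}
    return {p: known[p] for p in alive}


values = {1: 1, 2: 2, 3: 3, 4: 4}
# P1 crashes in round 1 after reaching only P2; P2 crashes in round 2 after reaching only P3.
crashes = {1: (1, {2}), 2: (2, {3})}

after_two = flood_consensus(values, crashes, rounds=2)    # f rounds: not enough
after_three = flood_consensus(values, crashes, rounds=3)  # f+1 rounds
```

After two rounds P3 knows P1's value but P4 does not, so any decision function applied to what they know (max, for instance) could differ. After the third round both survivors hold the same set and decide identically.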
Difficult Case: Agreement with Byzantine Failures We will look at agreement (single proposer) rather than consensus (all propose values). The faulty process may respond like a non-faulty process so the non-faulty processes do not know who is faulty. Faulty process can send a fake value to throw off the calculation and can send one value to some and a different value to others. Faulty process is an adversary and can see the global state: has more information than non-faulty nodes. But, can only affect the faulty processes.
Variations on Byzantine Agreement Process always knows who sent the received message. Default value - some algorithms assume a default value (retreat) when there is no agreement. Oral messages - message content is controlled by latest sender (relayer) so receiver doesn’t know whether or not it was tampered with. Signed messages - messages can be authenticated with digital signatures. Assume faulty processes can send arbitrary messages but they cannot forge signatures.
BA with Oral Messages(1) Commanding general coordinates other generals. If all loyal generals attack, victory is certain. If none attack, the Empire survives. If only some attack, the Empire is lost. Gong keeps time.
BA with Oral Messages(2) How it works. Disloyal generals have corrupt soldiers. Orders are distributed by exchange of messages, corrupt soldiers violate protocol at will. But corrupt soldiers can’t intercept and modify messages between loyal generals. The gong sounds slowly: there is ample time for exchange of messages. Commanding general sends his order. Then all other generals relay to all what they received.
BA with Oral Messages(3) Limitations Let t be the maximum number of faulty processes (disloyal generals). Byzantine agreement is not possible with fewer than 3t+1 processes. The same result holds for fault-tolerant consensus in the Byzantine model.
Byzantine Consensus Oral Messages(1) The Byzantine generals problem for 3 loyal generals and 1 traitor. a) The generals announce their troop strengths (in units of 1 kilosoldier) to all other generals. b) The vectors that each general assembles based on (a). c) Additional vectors that each general receives in the next round (all send what they received to all). Decide by majority.
Byzantine Consensus Oral Messages(2) The same as in the previous slide, except now with 2 loyal generals and one traitor. Majority decision does not guarantee consensus.
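The two-round vector exchange can be simulated to compare the two cases. The sketch below makes several assumptions that are not in the slides: troop strengths are small integers, the traitor sends a different bogus value in every message (the hypothetical lie function), and each loyal general takes a strict per-slot majority over the vectors it received in the second round, recording None when there is none.

```python
from collections import Counter


def majority(values):
    """Return the value held by a strict majority of `values`, else None."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else None


def exchange(strengths, traitor):
    """Run both rounds; return each loyal general's result vector."""
    ids = sorted(strengths)
    loyal = [g for g in ids if g != traitor]

    def lie(rnd, receiver, slot):
        # Assumed traitor strategy: a distinct bogus value per message.
        return 1000 * rnd + 100 * receiver + slot

    # Round 1: everyone announces a strength; loyal generals assemble vectors.
    vector = {
        g: {s: strengths[s] if s != traitor else lie(1, g, s) for s in ids}
        for g in loyal
    }

    # Round 2: everyone relays its vector; loyal generals vote slot by slot
    # over the vectors they received from the other generals.
    result = {}
    for g in loyal:
        received = [
            vector[k] if k != traitor else {s: lie(2, g, s) for s in ids}
            for k in ids
            if k != g
        ]
        result[g] = {s: majority([v[s] for v in received]) for s in ids}
    return result


four = exchange({1: 1, 2: 2, 3: 3, 4: 4}, traitor=3)   # 3 loyal, 1 traitor
three = exchange({1: 1, 2: 2, 3: 3}, traitor=3)        # 2 loyal, 1 traitor
```

With four generals, every loyal general recovers the true strengths of all loyal generals and marks the traitor's slot as unknown. With three, each loyal general receives only two vectors, one of them from the traitor, so no slot ever reaches a majority.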
BA with Signed Messages (1) A faulty process can send arbitrary messages, but cannot forge signatures. All messages are digitally signed for authentication. Assume at most f faulty nodes. At the start, the coordinator sends a signed message to each node. Each process at round i –endorses (authenticates) and forwards all messages received in round i-1
BA with Signed Messages (2) At round f+1, either: –1 value endorsed by at least f+1 nodes, decide majority –else, coordinator is faulty If coordinator is faulty: –either abort, –or retry after leader election to choose new coordinator f+1 rounds proven to be necessary and sufficient. Must have f+2 processes.
Consensus in Asynchronous Systems All of the preceding agreement and consensus algorithms are for synchronous systems; that is, the algorithms work by sending messages in rounds or phases. What about Byzantine consensus in an asynchronous system? Provably impossible [FLP1985].
Client-Server Communications Possible problems:
1. Client is unable to locate the server
2. Request message from client to server gets lost
3. Server crashes after receiving the request
4. Reply message from the server to client is lost
5. Client crashes after sending the request
Client-Server Communications Possible Solutions
1. Client cannot locate server: client reports an exception to the user.
2. Request message lost: use timeouts and message numbers.
3. Server crashes: the client cannot distinguish #2, #3, and #4. What to do? Application dependent.
4. Reply lost: see #3: time out and try again (resend the original request and hope that it is recognized as a duplicate and that the reply is sent again).
Client-Server Communications
5. Client crashes before the reply is received: resources are locked up; orphan processes may exist. Upon recovery, release resources and kill processes?
Solution 1 "log and exterminate": keep a log of activity and write it to stable storage before sending each request. Drawback: the expense of writing to disk.
Solution 2 "reincarnation": release everything, kill local processes, broadcast a message to kill orphans associated with this process.
Solution 3 "gentle reincarnation": a remote process is killed if its owner cannot be found.
Solution 4 "expiration": remote processes get a timeout value; if it is not renewed, they can be killed.
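The "timeouts and message numbers" remedy for lost requests and lost replies can be sketched as follows. Everything here is a hypothetical illustration: the class and function names are invented, message loss is scripted rather than real, and a timeout is modelled by raising Python's built-in TimeoutError.

```python
class Server:
    """Executes each numbered request once and caches the reply."""

    def __init__(self):
        self.replies = {}    # (client_id, seq) -> cached reply
        self.executed = 0    # how many requests were actually executed

    def handle(self, client_id, seq, payload):
        key = (client_id, seq)
        if key not in self.replies:          # first time we see this number
            self.executed += 1
            self.replies[key] = payload.upper()
        return self.replies[key]             # duplicates get the cached reply


class LossyChannel:
    """Loses messages according to a script: "request", "reply", or None."""

    def __init__(self, server, drops):
        self.server = server
        self.drops = list(drops)

    def call(self, client_id, seq, payload):
        fate = self.drops.pop(0) if self.drops else None
        if fate == "request":
            raise TimeoutError("request lost")   # server never saw it
        reply = self.server.handle(client_id, seq, payload)
        if fate == "reply":
            raise TimeoutError("reply lost")     # server ran it, client got nothing
        return reply


def send_with_retry(channel, client_id, seq, payload, max_attempts=5):
    """Resend the same numbered request until a reply arrives."""
    for _ in range(max_attempts):
        try:
            return channel.call(client_id, seq, payload)
        except TimeoutError:
            continue   # the client cannot tell which message was lost
    raise RuntimeError("no reply; report the failure to the application")


server = Server()
channel = LossyChannel(server, ["request", "reply"])
reply = send_with_retry(channel, "c1", 1, "hello")
```

The client sees only timeouts, so it resends in both loss cases; the message number is what lets the server recognize the third attempt as a duplicate and return the cached reply instead of executing the request a second time.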