Chapter 11 Fault Tolerance
Topics Introduction Process Resilience Reliable Group Communication Recovery
Basic Concepts Failure = state of a system where system fails to meet its contract Error= part of the system state that leads to failure (e.g. differing from its intended value) Faults = cause of an error, e.g. results from Design errors Manufacturing faults Deterioration External disturbance
Fault Error Failure Remark: Presence of a fault does not ensure that an error will occur, e.g. memory stuck-at-0
Characteristics of Faults Duration Permanent fault Once a component fails, it never works correctly again Easiest to diagnose Transient fault 1 time only 10 times as likely as permanent faults Intermittent fault Re-occurring May appear to be transient (if long period) Hard and expensive to detect
Fault Tolerance Includes… Fault tolerance: system can avoid a failure despite occurrence of faults Availability: probability that system works correctly at a given instance of time Reliability: expected time between failures Safety: absence of catastrophic consequences of a fault Maintainability: ease of recovering from a failure (incl. automatic recognizing of faults)
Failure Models Type of failureDescription Crash failureA server halts, but is working correctly until it halts Omission failure Receive omission Send omission A server fails to respond to incoming requests A server fails to receive incoming messages A server fails to send messages Timing failureA server's response lies outside the specified time interval Response failure Value failure State transition failure The server's response is incorrect The value of the response is wrong The server deviates from the correct flow of control Arbitrary failure (Byzantine failure) A server may produce arbitrary responses at arbitrary times
How to Overcome Failures? Design servers being able to announce that they might fail in the near future? Design a DS that is able to detect that A server is down and/or a server does no longer work correctly? Design a DS that is able to mask faults via redundancy?
Failure Masking Hide occurrence of faults using redundancy Information (e.g., additional bits, i.e. error correcting codes, e.g. Hamming-code) Time (e.g., retry an operation, an aborted transaction may be repeated without any side effects) Physical Hardware (replicated equipment) Software (replicated server processes/threads)
Hardware Redundancy Passive (static) Uses fault masking to hide occurrence of faults No action from system is required e.g. voting (see next slide) Active (dynamic) Uses comparison for detection and/or diagnosis Remove faulty hardware from system reconfiguration Hybrid Combine both approaches Masking until diagnostic complete Expensive, but better to achieve higher reliability
Failure Masking by Redundancy Triple modular redundancy.
Stand-by-Sparing Only one module is driving outputs Other modules are Idle hot spares Shut down cold spares In case of error detection switch to new module Hot spares No power up delays But may be significant power consumption Cold spares Vice versa to hot spares
Failure Masking by Software Redundancy How to improve reliability? What can we do to mask thread/process faults?
Process Resilience Protection against process failures Group of identical processes provides redundancy Software: multiple processes on same machine Hardware: processes on different machines Multicast communication ensures all members receive all messages (often atomic and ordered) Processes can join and leave groups dynamically e.g., to replace failed processes Membership protocol ensures agreement on group membership at any given time
Flat Groups versus Hierarchical Groups a) Communication in a flat group. b) Communication in a simple hierarchical group
Group Management 1. Use a single group-server with a single data base typical single point of failure 2. Use a single data base but several group-servers (standby solution) 3. Manage groups in a distributed way, i.e. every outsider wanting to enter a group sends a corresponding enter_group message per reliable multicast to every current group member, but When does a new group member gets all the group internal messages? When leaving the group, what about already sent but not yet received messages?
Agreement in Faulty Systems (1) Ensure all non-faulty processes reach consensus in a finite number of steps 1. Reliable processes, faulty communication (omission faults). Two-army problem 2. Reliable communication, faulty processes (Byzantine faults).
Agreement in Faulty Systems (2) The Byzantine generals problem for 3 loyal generals and1 traitor. a) The generals announce their troop strengths (in units of 1 kilosoldiers). b) The vectors that each general assembles based on (a) c) The vectors that each general receives in step 3.
Agreement in Faulty Systems (3) The same as in previous slide, except now with 2 loyal generals and one traitor. With m faulty processes, at least 2m+1 correctly functioning processes are required to reach an agreement.
Reliable Group Communication
Basic Reliable-Multicasting Schemes A simple solution to reliable multicasting when all receivers are known and are assumed not to fail a) Message transmission b) Reporting feedback
Scalability Feedback implosion: sender is swamped with feedback messages Nonhierarchical multicast: Use NACKS Feedback suppression: NACK’s multicast to everyone. Prevents other receivers from sending NACK’s if they have already seen one Reduces (N)ACK load on server Receivers have to be coordinated so they do not all multicast NACKs at same time Multicasting feedback also interrupts processes that successfully have received messages
Nonhierarchical Feedback Control Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.
Hierarchical Feedback Control The essence of hierarchical reliable multicasting. a) Each local coordinator forwards the message to its children. b) A local coordinator handles retransmission requests.
Atomic Multicast: Virtual Synchrony Deliver a message either to all group members (in the same order), or to none. Requires agreement about group membership Replica crash? Process group: Group view: list of processes the sender has when a message is sent. Each message uniquely associated with a group View changes need to be ordered with respect to message transmissions: Either the message is delivered to the old or the new view Special case: sender failure
Virtual Synchrony (2) The principle of virtual synchronous multicast. If the sender crashes during the multicast, the message may either be delivered to all or ignored by each of them.
Implementing Virtual Synchrony A message m sent in view G i is stable if it was received by all members of G i Only stable messages are delivered View changes are announced By the arriving/departing node or failure detecting node via a view change message, followed by any unstable messages in the old view, followed by a flush message View is changed after the flush message has arrived from all members of the old view
Implementing Virtual Synchrony (2) a) Process 4 notices that process 7 has crashed, sends a view change b) Process 6 sends out all its unstable messages, followed by a flush message c) Process 6 installs the new view when it has received a flush message from everyone else