Introduction to Fault Tolerance - Sandeep Karanam
Content
Fault tolerance in distributed systems
Failure models
Failure masking
Fault tolerance in distributed systems
Fault tolerance in distributed systems is closely related to the notion of dependable systems.
Dependability
Dependability is a term that covers several useful requirements for distributed systems:
Availability
Reliability
Safety
Maintainability
Availability: the property that the system is ready to perform its functions on behalf of the user at a given moment. A highly available system is most likely to be working at any given instant of time.
Reliability: the property that the system can run continuously without failure, defined over a time interval. A highly reliable system is most likely to keep working without interruption for a relatively long period of time.
Safety: when the system temporarily fails to operate correctly, nothing disastrous happens.
Maintainability: how easily the system can be repaired after a failure.
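The difference between availability and reliability is easier to see with numbers. The sketch below is only an illustration: the two systems and their downtime figures are hypothetical, and availability is taken to be the fraction of time the system is up.

```python
HOUR_S = 3600.0     # seconds in one hour
YEAR_DAYS = 365.0   # days in one (non-leap) year


def availability(uptime, downtime):
    """Fraction of the total time during which the system is ready for use."""
    return uptime / (uptime + downtime)


# Hypothetical system A: goes down for 1 ms in every hour.
# It is almost always available, but it never runs a full hour without failing.
system_a = availability(HOUR_S - 0.001, 0.001)

# Hypothetical system B: never crashes, but is shut down for 14 days each year.
# It runs for long periods without interruption, yet is unavailable about 4% of the time.
system_b = availability(YEAR_DAYS - 14, 14)

print(f"System A availability: {system_a:.7%}")
print(f"System B availability: {system_b:.2%}")
```

System A is highly available but not reliable. System B is highly reliable but less available.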
Fault and error
A fault is a physical defect, imperfection, or flaw that occurs in some hardware or software component. Examples are a short circuit between two adjacent interconnects, a broken pin, or a software bug.
An error is a deviation from correctness or accuracy in computation that occurs as a result of a fault. Errors are usually associated with incorrect values in the system state. For example, a circuit or a program computes an incorrect value, or incorrect information is received during data transmission.
Types of fault
Transient: These faults occur once and then disappear.
Intermittent: These faults occur, go away, and then reappear at irregular intervals. They are difficult to diagnose.
Permanent: These faults remain until the faulty component is diagnosed and repaired or replaced. Example: burnt-out chips.
Transient faults are the dominant type of fault in computer memories. For example, about 98% of RAM faults are transient. The causes of transient faults are mostly environmental, such as alpha particles, cosmic rays, electrostatic discharge, electrical power drops, overheating, or mechanical shock. Intermittent faults can be due to implementation flaws, aging and wear-out, or unexpected operating conditions.
Fault tolerance
Fault tolerance is the ability of a system to continue performing its intended function in spite of faults.
Fault tolerance is needed because it is practically impossible to build a perfect system. As the complexity of a system increases, its reliability deteriorates drastically unless compensatory measures are taken.
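One way to see why reliability drops with complexity is a simple series model. This is a sketch under stated assumptions, not something taken from the slides: every component must work for the system to work, components fail independently, and each has the same reliability. The function name and the value 0.999 are chosen for illustration only.

```python
def series_reliability(component_reliability, n_components):
    """Reliability of a system that works only if all n independent components work."""
    return component_reliability ** n_components


# Each hypothetical component is 99.9% reliable over the period of interest.
for n in (1, 10, 100, 1000):
    print(f"{n:5d} components -> system reliability {series_reliability(0.999, n):.3f}")
```

Under these assumptions, a system of 1000 such components has a reliability of roughly 0.37, even though each part is 99.9% reliable. Redundancy is one of the compensatory measures that counters this effect.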
Failure models
Crash failure: A server halts, but is working correctly until it halts.
Omission failure: A server fails to respond to incoming requests. It either fails to receive incoming messages or fails to send messages.
Timing failure: A server's response lies outside the specified time interval.
Response failure: A server's response is incorrect, for example the value of the response is wrong.
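The failure models can be read as a classification of what a client observes. The following is a minimal sketch of that idea, assuming the client knows the expected value and the allowed time interval. The names `Failure` and `classify` and all parameters are hypothetical. A missing reply is reported as "crash or omission" because a client cannot tell the two apart from a single unanswered request.

```python
from enum import Enum


class Failure(Enum):
    NONE = "no failure"
    CRASH_OR_OMISSION = "crash or omission failure"
    TIMING = "timing failure"
    RESPONSE = "response failure"


def classify(reply, elapsed_s, expected, earliest_s, latest_s):
    """Classify one observed request/response exchange.

    reply      -- value returned by the server, or None if nothing arrived
    elapsed_s  -- seconds between sending the request and receiving the reply
    expected   -- the value a correct server would have returned
    earliest_s, latest_s -- bounds of the specified time interval
    """
    if reply is None:
        # No reply at all: the server halted or dropped a message.
        return Failure.CRASH_OR_OMISSION
    if not (earliest_s <= elapsed_s <= latest_s):
        # A reply arrived, but outside the specified time interval.
        return Failure.TIMING
    if reply != expected:
        # A timely reply with the wrong value.
        return Failure.RESPONSE
    return Failure.NONE


print(classify(None, 0.0, 42, 0.0, 1.0).value)
print(classify(42, 2.5, 42, 0.0, 1.0).value)
print(classify(41, 0.2, 42, 0.0, 1.0).value)
```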
Failure masking by redundancy
Information redundancy: Extra information (bits) is added so that garbled bits can be recovered.
Time redundancy: An action is performed again if needed. Example: transactions, which can be redone after an abort.
Physical redundancy: Extra physical components are added so that the system can tolerate the malfunction of some components.
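The first two forms of redundancy can be sketched in a few lines. For information redundancy the example uses a Hamming(7,4) code, which adds three parity bits to four data bits and can correct any single flipped bit. The slides do not name a specific code, so this choice is an assumption. For time redundancy it uses a plain retry loop. All function names are hypothetical.

```python
def hamming74_encode(data):
    """Encode 4 data bits as the 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the flipped bit, or 0 if none.
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def retry(action, attempts=3):
    """Time redundancy: call action() again if it raises, up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as exc:   # a real system would catch narrower errors
            last_error = exc
    raise last_error


# Information redundancy: one garbled bit is recovered.
sent = hamming74_encode([1, 0, 1, 1])
received = list(sent)
received[4] ^= 1                    # simulate a single bit flip in transit
print(hamming74_decode(received))   # the original data bits are recovered
```

Retrying only masks transient or intermittent faults. A permanent fault makes every attempt fail, which is where physical redundancy comes in.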
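Triple modular redundancy
Triple modular redundancy (TMR) is a general technique for fault tolerance based on physical redundancy. Each device is replicated three times, and each stage is followed by three voters. A voter has three inputs and one output: if two or three of the inputs agree, the output equals that value; if all three inputs differ, the output is undefined.
If device A1 fails, the circuit still works, because each voter still receives two correct inputs from A2 and A3.
A fault in voter V1 has the same effect as a fault in device B1: B1 produces a wrong output, which is outvoted at the next stage.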
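The voting step can be sketched as a small majority function. This is an illustration in software of what TMR does in hardware. The names `vote` and `tmr_stage` are hypothetical, and `None` is used here to stand for an undefined output.

```python
def vote(a, b, c):
    """Majority voter: return the value on which at least two inputs agree.

    Returns None (standing for "undefined") if all three inputs differ.
    """
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None


def tmr_stage(replicas, x):
    """Run three replicas of a device on the same input and vote on the results."""
    outputs = [device(x) for device in replicas]
    return vote(*outputs)


def working(x):
    return 2 * x    # the intended function of the device


def faulty(x):
    return -1       # a replica stuck at a wrong output


# A1 is faulty, A2 and A3 work: the fault is masked by the voter.
print(tmr_stage([faulty, working, working], 21))
```

With one faulty replica the stage still yields the correct value. With two replicas failing in different ways, the voter output is undefined.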
Reference: Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems: Principles and Paradigms. Second Edition, 2007.
Thank you