Introduction to Fault Tolerance By Sahithi Podila
Basic Concepts
Fault tolerance in distributed systems is closely related to the notion of dependable systems.
Dependability
Dependability is a term that covers several useful requirements for distributed systems:
1. Availability
2. Reliability
3. Safety
4. Maintainability
Availability is the property that a system is ready to perform its functions on behalf of the user at any given moment. A highly available system is most likely to be working at a given instant in time.
Reliability is the property that a system can run continuously without failure. It is defined in terms of a time interval rather than an instant. A highly reliable system is most likely to keep working for a relatively long period of time without interruption.
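The difference between the two properties is easier to see with numbers. The sketch below is illustrative only: the `availability` helper and both example systems are assumptions made for this example, not measurements. One system is down for a millisecond every hour, the other runs without failure but is shut down for two weeks a year.

```python
def availability(uptime: float, downtime: float) -> float:
    """Fraction of time the system is ready for use (hypothetical helper)."""
    return uptime / (uptime + downtime)

# Assumed system A: down for 1 ms in every hour (times in seconds).
# Almost always available, yet it never runs longer than about
# an hour without a failure, so it is not reliable.
a = availability(uptime=3600.0 - 0.001, downtime=0.001)

# Assumed system B: never fails while running, but is shut down
# for 2 weeks out of 52 (times in weeks). Reliable, yet less available.
b = availability(uptime=50.0, downtime=2.0)

print(f"System A availability: {a:.7%}")
print(f"System B availability: {b:.2%}")
```

System A comes out above 99.9999% available and system B at roughly 96%, which shows that high availability does not imply high reliability, and vice versa.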
Safety means that nothing disastrous happens when the system temporarily fails to operate correctly.
Maintainability refers to how easily a failed system can be repaired.
Fault and Error
A system fails when it cannot provide one or more of its required services.
An error is a part of the system's state that may lead to a failure.
A fault is the cause of an error.
Fault Tolerance
Fault tolerance is the ability of a system to keep providing its services even in the presence of faults.
Types of Fault
Transient: These faults occur once and then disappear.
Intermittent: These faults occur, go away, and then come back again at varying times. They are difficult to diagnose.
Permanent: These faults remain until the faulty component is diagnosed and replaced with a working one. Ex: burnt-out chips.
Failure Models
Types of failure
Type of failure: Description
Crash failure: A server halts, but is working correctly until it halts
Omission failure: A server fails to respond to incoming requests
    Receive omission: A server fails to receive incoming messages
    Send omission: A server fails to send messages
Timing failure: A server’s response lies outside the specified time interval
Response failure: A server’s response is incorrect
    Value failure: The value of the response is wrong
    State transition failure: The server deviates from the correct flow of control
Arbitrary failure: A server may produce arbitrary responses at arbitrary times
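As a rough illustration of how these categories look from the client side, the sketch below labels a single request using only what the client can observe. All names (`Observation`, `classify`, the time bounds) are hypothetical and chosen for this example. A single missing reply cannot tell a crash apart from an omission, and arbitrary failures cannot be recognised from one observation, so those cases are deliberately left out.

```python
from enum import Enum
from typing import Optional


class Observation(Enum):
    OK = "ok"
    OMISSION = "omission failure"
    TIMING = "timing failure"
    VALUE = "response failure (value failure)"


def classify(reply: Optional[int], elapsed: float, expected: int,
             min_time: float = 0.0, max_time: float = 1.0) -> Observation:
    """Label one request/response pair (hypothetical helper).

    reply    -- the value received, or None if no reply arrived
    elapsed  -- seconds between sending the request and getting the reply
    expected -- the value a correct server would return (assumed known here)
    """
    if reply is None:
        # No reply at all: could also be a crash, which the client
        # cannot distinguish from an omission with one observation.
        return Observation.OMISSION
    if not (min_time <= elapsed <= max_time):
        # The reply arrived outside the specified time interval.
        return Observation.TIMING
    if reply != expected:
        # The reply arrived in time but carries the wrong value.
        return Observation.VALUE
    return Observation.OK


print(classify(reply=None, elapsed=5.0, expected=42).value)
print(classify(reply=42, elapsed=2.5, expected=42).value)
print(classify(reply=41, elapsed=0.2, expected=42).value)
```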
Redundancy
Failure Masking - Redundancy
Three kinds of redundancy
Information redundancy: Extra bits are added so that garbled bits can be recovered.
Time redundancy: An action is performed again if needed. Example: Transactions.
Physical redundancy: Extra physical components are added so that the system can tolerate the malfunction of some components.
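To make information redundancy concrete, here is a minimal sketch using a Hamming(7,4) code, which is one common way to add extra bits so that a single garbled bit can be corrected. The slides do not name a specific code, so the choice of Hamming(7,4) and the function names are assumptions made for illustration.

```python
from typing import List


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword with 3 parity bits."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: List[int]) -> List[int]:
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the flipped bit (0 = no error).
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


word = [1, 0, 1, 1]
received = hamming74_encode(word)
received[4] ^= 1  # simulate one garbled bit in transit
print(hamming74_decode(received) == word)  # True
```

The three extra parity bits are the redundancy: they carry no new data, but they let the receiver locate and repair one damaged bit. Two or more flipped bits in the same codeword are beyond what this particular code can correct.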
Physical Redundancy
Physical redundancy is a well-known technique for fault tolerance. The following example illustrates how physical redundancy achieves fault tolerance in an electronic circuit.
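Triple modular redundancy
Triple modular redundancy (TMR) is a general technique for fault tolerance. Each device is replicated three times, and each stage is followed by three voters. A voter has three inputs and one output: if two or three of its inputs agree, the output equals that value. If device A1 fails, the circuit still works, because every voter still receives two correct inputs from A2 and A3. A fault in voter V1 or in device B1 is masked in the same way by the voters of the next stage.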
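The voting idea can be sketched in software as follows. This is a simplified model, not a circuit description: the devices are plain functions, the names (`vote`, `tmr_stage`, `good_a`, `faulty_a`, `good_b`) are hypothetical, and the three voters of a stage are assumed to be fault-free.

```python
from collections import Counter
from typing import Callable, List, Optional, Sequence


def vote(a: int, b: int, c: int) -> Optional[int]:
    """Majority voter: return the value at least two inputs agree on.

    Returns None when all three inputs differ (output undefined).
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    return value if count >= 2 else None


def tmr_stage(devices: Sequence[Callable[[int], int]],
              inputs: Sequence[int]) -> List[Optional[int]]:
    """Run three replicated devices, then three voters on their outputs."""
    outputs = [device(x) for device, x in zip(devices, inputs)]
    # Each of the three voters sees all three device outputs.
    return [vote(*outputs) for _ in range(3)]


def good_a(x: int) -> int:
    return x + 1


def faulty_a(x: int) -> int:
    return -999  # assumed fault: this replica always emits garbage


def good_b(x: int) -> int:
    return x * 2


# Stage A has one faulty replica (A1); stage B is healthy.
after_a = tmr_stage([faulty_a, good_a, good_a], [1, 1, 1])
after_b = tmr_stage([good_b, good_b, good_b], after_a)
print(after_a)  # [2, 2, 2]
print(after_b)  # [4, 4, 4]
```

The wrong output of the faulty replica never reaches stage B, because each voter passes on the value that the two working replicas agree on.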
Reference: Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems: Principles and Paradigms. Second Edition.
Thank You