Fault Tolerance in Operating Systems
Zihao Mao
Fault-Tolerance Definition
- The ability of a system or component to continue normal operation despite the presence of hardware or software faults.
- A fault management technique that builds a component in such a way that it can meet its specification in the presence of faults.
Properties
- Reliability: R(t) is the probability that the system operates correctly up to time t, given that it was operating correctly at time t = 0. Related measures: Mean Time To Failure (MTTF) and Mean Time To Repair (MTTR).
- Availability: the fraction of time the system is available to service users' requests.
- Safety: the ability to minimize the impact of small failures.
- Maintainability: how easily faults can be repaired.
- High reliability does not imply high availability: a system that rarely fails but takes a long time to repair can still have low availability.
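As a rough illustration of how MTTF and MTTR relate to availability, the sketch below uses the standard steady-state formula availability = MTTF / (MTTF + MTTR). The function names and the sample numbers are made up for illustration, and the `reliability` helper additionally assumes a constant failure rate (an exponential model), which the slides do not state.

```python
import math

def availability(mttf_hours, mttr_hours):
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

def reliability(t_hours, mttf_hours):
    """R(t) under the ASSUMPTION of a constant failure rate 1/MTTF."""
    return math.exp(-t_hours / mttf_hours)

# Hypothetical system A: fails rarely, but each repair takes a long time.
a = availability(mttf_hours=1000.0, mttr_hours=100.0)
# Hypothetical system B: fails often, but recovers almost instantly.
b = availability(mttf_hours=10.0, mttr_hours=0.01)

print(f"A: {a:.4f}  B: {b:.4f}")
```

Under these assumed numbers, system A is far more reliable than system B (its MTTF is 100 times longer), yet B is the more available of the two.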
Fault Definitions
- Fault: an erroneous hardware or software state resulting from component failure, operator error, physical interference from the environment, design error, program error, or data structure error. For example:
  - a defect in a hardware device or component
  - an incorrect step, process, or data definition in a computer program
- Failure: the system fails to meet its promises. Failures are caused by faults.
Typical Failure Types
- Crash failure: the system halts, but it behaves correctly before halting.
- Omission failure: the system fails to respond.
- Timing failure: the output is correct, but the time taken to respond exceeds the specification.
- Response failure: the output is wrong.
- Arbitrary/Byzantine failure: the output is arbitrary or malicious.
Severity increases from top to bottom of the list.
Fault Categories
- Temporary: a fault that is not present all the time for all operating conditions.
  - Transient: a fault that occurs only once.
  - Intermittent: a fault that occurs at multiple, unpredictable times.
- Permanent: a fault that, after it occurs, is always present.
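Fault Detection Techniques
- Fail-stop: the component crashes and the crash itself can be detected.
- Fail-silent: the component simply remains silent after crashing, so the failure is detected from the absence of responses.
- Fail-safe: the component produces wrong outputs that can be recognized as wrong.
- Byzantine failure: hard to detect.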
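One common way to notice a silent component is a heartbeat timeout. The sketch below is a minimal illustration rather than a production detector; the class name, the `timeout` parameter, and the explicit `now` argument (used instead of a real clock so the example is deterministic) are all assumptions made for this example.

```python
class HeartbeatDetector:
    """Suspects a node once no heartbeat has arrived within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of its most recent heartbeat

    def heartbeat(self, node, now):
        """Record that `node` reported in at time `now`."""
        self.last_seen[node] = now

    def suspected(self, now):
        """Return the nodes that have been silent for longer than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )

detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("a", now=0.0)
detector.heartbeat("b", now=0.0)
detector.heartbeat("a", now=4.0)    # "a" keeps reporting, "b" goes silent
print(detector.suspected(now=5.0))  # ['b']
```

A timeout alone cannot tell a crashed node from a slow one, so a detector like this only suspects a failure; it does not prove one.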
Fault-Tolerance Techniques
- Redundancy: hide the effects of faults.
- Recovery: bring the system to a fault-free state to remove the effects of faults.
Redundancy
- Physical redundancy: the use of multiple components that either perform the same function simultaneously or are configured so that one component is available as a backup in case another component fails. Examples: extra CPUs, multiple parallel circuits, multi-version software, a backup name server.
- Temporal redundancy: repeating a function or operation when an error is detected. Examples: re-execution, executing a backup copy, retransmission.
- Information redundancy: replicating or coding data in such a way that bit errors can be both detected and corrected. Examples: parity, Hamming codes.
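To make information redundancy concrete, here is a small sketch of the classic Hamming(7,4) code, which adds three parity bits to four data bits so that any single flipped bit can be located and corrected. The function names are chosen for this example only.

```python
def hamming74_encode(data):
    """Encode 4 data bits as a 7-bit codeword laid out as [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(codeword):
    """Return (data_bits, error_position); error_position is 0 if no error was seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # 1-based index of the flipped bit
    if position:
        c[position - 1] ^= 1         # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], position

word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                         # simulate a bit error at position 5
data, position = hamming74_decode(word)
print(data, position)                # [1, 0, 1, 1] 5
```

This code corrects one flipped bit per codeword. If two bits flip, the decoder will "correct" the wrong position, which is why stronger codes are used where multi-bit errors are likely.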
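Triple Modular Redundancy (TMR)
- Each unit is replicated three times (A1, A2, A3), and a voter passes on the majority value of the three outputs.
- If A2 fails, voter V1 still receives two correct inputs, so its majority vote is correct and all B units get a good result.
- What if V1 fails? A voter is itself a component that can fail, which is why TMR designs typically replicate the voter at each stage as well.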
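The voting step can be sketched in a few lines. This is a minimal illustration of a majority voter, not a model of any particular hardware design; the function name and the choice to raise an error when no majority exists are assumptions made for this example.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of the replicas."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replica outputs")
    return value

# A1 and A3 agree; the faulty A2 is outvoted.
print(majority_vote([42, 7, 42]))  # 42
```

With three replicas the voter masks one faulty output. If two replicas fail in different ways there is no majority, and the sketch raises an error rather than guessing.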
Redundancy Level
- How many faults can the system tolerate?
- A k-fault-tolerant system can handle k faulty components.
- TMR is 1-fault tolerant.
- For silent faults, k + 1 components are required.
- For Byzantine faults, 2k + 1 components are required.
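The two counting rules can be written down directly. The helper names below are hypothetical, and the sketch only restates the k + 1 and 2k + 1 rules from the slide, together with their inverses.

```python
def replicas_needed(k, byzantine=False):
    """Components required to tolerate k faults under the rules above."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return 2 * k + 1 if byzantine else k + 1

def faults_tolerated(n, byzantine=False):
    """Inverse view: how many faults n components can tolerate."""
    if n < 1:
        raise ValueError("need at least one component")
    return (n - 1) // 2 if byzantine else n - 1

print(replicas_needed(1, byzantine=True))   # 3, which is the TMR case
print(faults_tolerated(3, byzantine=True))  # 1
```

The intuition: with silent faults one surviving component is enough, while with Byzantine faults the correct components must still outvote the k faulty ones.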
Recovery
- Forward recovery: move the system to a new state from which it can continue operating. Example: error correction.
- Backward recovery: bring the system back to a previous fault-free state. Examples: checkpoints, message logging, the Unix Targon/32 system.
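Checkpoints
- Periodically store the system state on stable storage while the system is operating correctly, that is, before it suffers the effects of a fault.
- At recovery, bring the system back to the last state stored in a checkpoint.
- Problem: the checkpoints of different processes may form an inconsistent cut, for example when one checkpoint records a message as received while no checkpoint records it as sent.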
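For a single process, checkpoint-and-restart can be sketched as follows. This is only an illustration under stated assumptions: the "stable storage" is an ordinary file written atomically, the workload (summing step numbers) is made up, and `crash_at` exists purely to simulate a fault.

```python
import os
import pickle
import tempfile

def save_checkpoint(state, path):
    """Write the state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint

def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default

def run(path, total_steps, checkpoint_every, crash_at=None):
    """Sum 0..total_steps-1, resuming from the last checkpoint if there is one."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state["total"]
```

If `run` crashes partway through, calling it again with the same path resumes from the last saved state, so only the steps after that checkpoint are redone.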
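Independent Checkpoints
- Each process periodically takes checkpoints independently of the others.
- Because the most recent checkpoints may not form a consistent cut, recovery must fix the inconsistency by rolling processes back until a consistent set of checkpoints is found.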
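The rollback search can be sketched as follows. The data layout is an assumption made for this example: each checkpoint records the IDs of the messages its process had sent and received so far, and every process has an initial checkpoint (index 0) with both sets empty. A set of checkpoints is treated as inconsistent when one of them records a message as received that no chosen checkpoint records as sent.

```python
def find_recovery_line(checkpoints):
    """Pick, per process, the latest checkpoint index that yields a consistent cut.

    `checkpoints` maps a process name to its list of checkpoints, oldest first.
    Each checkpoint is a dict with "sent" and "received" sets of message IDs.
    """
    chosen = {p: len(cps) - 1 for p, cps in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        # Every message recorded as sent by the currently chosen checkpoints.
        sent = set()
        for p, i in chosen.items():
            sent |= checkpoints[p][i]["sent"]
        # Roll back any process that has received a message nobody has sent.
        for p, i in list(chosen.items()):
            if not checkpoints[p][i]["received"] <= sent:
                chosen[p] = i - 1
                changed = True
    return chosen

empty = {"sent": set(), "received": set()}
history = {
    # P checkpointed after receiving m1 ...
    "P": [empty, {"sent": set(), "received": {"m1"}}],
    # ... but Q sent m1 only after its own last checkpoint.
    "Q": [empty, {"sent": set(), "received": set()}],
}
print(find_recovery_line(history))  # {'P': 0, 'Q': 1}
```

Rolling one process back shrinks the set of sent messages, which can force other processes to roll back in turn. In the worst case this cascade reaches the initial state.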
Message Logging
- Checkpointing is expensive.
  - Checkpointing: periodically save the state, then restart from the last consistent state.
- Message logging:
  - Take checkpoints infrequently.
  - Log all messages received between checkpoints to local stable storage.
  - At recovery, simply replay the logged messages starting from the previous checkpoint, so the work done before that checkpoint does not have to be recomputed.
- Problem: the recovered processes can still end up in mutually inconsistent states.
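A toy version of checkpoint-plus-replay for a single process might look like this. Everything here is an assumption made for illustration: the state is a running sum, messages are integers, and the "stable storage" is an in-memory list that is taken to survive the crash.

```python
class LoggedProcess:
    """Process whose state can be rebuilt from a checkpoint plus a message log."""

    def __init__(self):
        self.state = 0
        self.checkpoint = 0  # last saved state (assumed to survive a crash)
        self.log = []        # messages received since that checkpoint

    def apply(self, message):
        """Deterministic state update; replay relies on this determinism."""
        self.state += message

    def receive(self, message):
        self.log.append(message)  # log first, then process
        self.apply(message)

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log.clear()          # earlier messages are no longer needed

    def recover(self):
        """Restore the checkpoint and replay every message logged after it."""
        self.state = self.checkpoint
        for message in self.log:
            self.apply(message)

p = LoggedProcess()
p.receive(1)
p.receive(2)
p.take_checkpoint()
p.receive(3)
p.receive(4)
p.state = None  # simulate losing the in-memory state in a crash
p.recover()
print(p.state)  # 10
```

Replay reproduces the pre-crash state only if message handling is deterministic, which is why `apply` here depends on nothing but the current state and the message.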
Summary
- Fault tolerance is the ability of a system or component to continue normal operation despite the presence of hardware or software faults.
- Reliability
  - Mean Time To Failure (MTTF)
  - Mean Time To Repair (MTTR)
- Availability
- Techniques
  - Redundancy
    - Triple Modular Redundancy (TMR)
  - Recovery
    - Checkpoints
    - Independent checkpoints
    - Message logging