1
Fault-Tolerant Design
EE141 Chapter 3 Fault-Tolerant Design
2
What is this chapter about?
Gives an overview of fault-tolerant design, focusing on: basic concepts in fault-tolerant design; metrics used to specify and evaluate dependability; a review of coding theory; fault-tolerant design schemes (hardware redundancy, information redundancy, time redundancy); examples of fault-tolerant applications in industry.
3
Fault-Tolerant Design
Introduction; Fundamentals of Fault Tolerance; Fundamentals of Coding Theory; Fault Tolerant Schemes; Industry Practices; Concluding Remarks
4
Introduction
Fault tolerance: ability of a system to continue error-free operation in the presence of an unexpected fault. Important in mission-critical applications (e.g., medical, aviation, banking), where errors are very costly. Becoming important in mainstream applications: technology scaling makes circuit behavior less predictable and more prone to failures, so fault tolerance is needed to keep the failure rate within acceptable levels.
5
Faults
Permanent faults: due to manufacturing defects, early-life failures, and wearout failures. Wearout failures arise from various mechanisms, e.g., electromigration, hot-carrier degradation, dielectric breakdown. Temporary faults: present only for a short period of time; caused by external disturbance or marginal design parameters.
6
Temporary Faults
Transient errors (non-recurring): caused by external disturbance, e.g., radiation, noise, power disturbance. Intermittent errors (recurring): caused by marginal design parameters, including timing problems (e.g., races, hazards, skew) and signal integrity problems (e.g., crosstalk, ground bounce).
7
Redundancy
Fault tolerance requires some form of redundancy: time redundancy, hardware redundancy, or information redundancy.
8
Time Redundancy
Perform the same operation twice and see whether the result is the same both times; if not, a fault occurred. Can detect temporary faults, but cannot detect permanent faults, which would affect both computations. Advantage: little to no hardware overhead. Disadvantage: impacts system or circuit performance.
9
Hardware Redundancy
Replicate hardware and compare outputs from two or more modules. Detects both permanent and temporary faults. Advantage: little or no performance impact. Disadvantage: area and power for redundant hardware.
10
Information Redundancy
Encode outputs with an error detecting or correcting code, selected to minimize redundancy for the class of faults of interest. Advantage: less hardware to generate the redundant information than replicating the module. Drawback: added complexity in design.
11
Failure Rate
$\lambda(t)$ = component failure rate, measured in FITs (failures per $10^9$ hours)
12
System Failure Rate System constructed from components
No Fault Tolerance Any component fails, whole system fails
13
Reliability
If a component is working at time 0, R(t) = probability that it is still working at time t. Exponential failure law: if the failure rate is assumed constant, $R(t) = e^{-\lambda t}$. Good approximation once past the infant mortality period.
14
Reliability for Series System
All components need to work for system to work
15
System Reliability with Redundancy
System reliability with component B in Parallel Can tolerate one component B failing
16
Mean-Time-to-Failure (MTTF)
Average time before the system fails. Equal to the area under the reliability curve: $MTTF = \int_0^\infty R(t)\,dt$. For the exponential failure law, $MTTF = 1/\lambda$.
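The failure-rate, series-system, and MTTF relations above can be tried numerically. The following is a minimal Python sketch, assuming constant failure rates (the exponential failure law) and a system with no fault tolerance, so that component failure rates simply add. The function names and FIT values are illustrative, not taken from the slides.

```python
import math

HOURS_PER_FIT = 1e9  # 1 FIT = 1 failure per 10^9 hours


def series_failure_rate(component_fits):
    """Failure rate (in FITs) of a system that fails when any component fails."""
    return sum(component_fits)


def reliability(fit, hours):
    """R(t) = exp(-lambda * t) under the exponential failure law."""
    return math.exp(-(fit / HOURS_PER_FIT) * hours)


def mttf_hours(fit):
    """MTTF = 1 / lambda for a constant failure rate."""
    return HOURS_PER_FIT / fit


# Hypothetical system built from three components (illustrative FIT values).
system_fit = series_failure_rate([200, 300, 500])
print(system_fit, mttf_hours(system_fit), reliability(system_fit, 8760))
```

Because the exponents add, the product of the component reliabilities equals the reliability computed from the summed failure rate.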
17
Maintainability
If the system failed at time 0, M(t) = probability that it is repaired and operational at time t. System repair time is divided into passive repair time (time for the service engineer to travel to the site) and active repair time (time to locate the failing component, repair or replace it, and verify that the system is operational). Can be improved by designing the system so that it is easy to locate the failed component and verify operation.
18
Repair Rate and MTTR
$\mu$ = rate at which the system is repaired, analogous to the failure rate. Maintainability is often modeled as $M(t) = 1 - e^{-\mu t}$. Mean-Time-to-Repair (MTTR) = $1/\mu$
19
Availability
[Figure: timeline of normal system operation interrupted by failures] System availability: fraction of time the system is operational.
20
Availability
Telephone systems: required to have system availability of 0.9999 ("four nines"). High-reliability systems: may require 7 or more nines. Fault-tolerant design is needed to achieve such high availability from less reliable components.
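To get a feel for what these targets mean, the sketch below uses the common steady-state relation availability = MTTF / (MTTF + MTTR). That relation and the helper names are assumptions added for illustration; they are not stated on the slides.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(mttf_hours, mttr_hours):
    """Steady-state fraction of time the system is operational (assumed model)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail):
    """Expected minutes per year during which the system is not operational."""
    return (1.0 - avail) * MINUTES_PER_YEAR


for nines in (4, 5, 7):
    a = 1.0 - 10.0 ** -nines
    print(nines, "nines:", round(downtime_minutes_per_year(a), 3), "minutes/year")
```

Under this model, four nines allows a little under an hour of downtime per year.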
21
Coding Theory
Coding: using more bits than necessary to represent data. Provides a way to detect errors, which occur when bits get flipped. Error detecting codes: many types exist; they detect different classes of errors, use different amounts of redundancy, and vary in ease of encoding and decoding data.
22
Block Code
Message = data being encoded. A block code encodes m messages with n-bit codewords. If no redundancy, m messages are encoded with $\log_2(m)$ bits, the minimum possible.
23
Block Code
To detect errors, some redundancy is needed. The space of $2^n$ distinct blocks is partitioned into codewords and non-codewords. Can detect errors that cause a codeword to become a non-codeword; cannot detect errors that cause a codeword to become another codeword.
24
Separable Block Code
Separable: n-bit blocks are partitioned into k information bits, which directly represent the message, and (n-k) check bits. Denoted an (n,k) block code. Advantage: the k-bit message is directly extracted without decoding. Rate of a separable block code = k/n.
25
Example of Separable Block Code
(4,3) parity code (single-bit parity): the check bit is the XOR of the 3 message bits. Message 101 has codeword 1010.
26
Example of Non-Separable Block Code
One-hot code: each codeword has a single 1. Example of 8-bit one-hot: 10000000, 01000000, 00100000, 00010000, 00001000, 00000100, 00000010, 00000001. Redundancy = $1 - \log_2(8)/8 = 5/8$
27
Linear Block Codes
Special class of block codes in which the modulo-2 sum of any 2 codewords is also a codeword. The code is the null space of an (n-k)×n Boolean matrix called the parity check matrix, H: for any n-bit codeword c, $cH^T = 0$. The all-0 codeword exists in any linear code.
28
Linear Block Codes
Generator matrix, G: a k×n matrix. The codeword c for message m is $c = mG$, and $GH^T = 0$.
29
Systematic Block Code
First k bits correspond to the message; last n-k bits correspond to the check bits. For a systematic code, $G = [I_{k \times k} : P_{k \times (n-k)}]$ and $H = [P^T_{(n-k) \times k} : I_{(n-k) \times (n-k)}]$, which satisfies $GH^T = P + P = 0$.
30
Distance of Code
Distance between two codewords: number of bits in which they differ. Distance of a code: minimum distance between any two codewords in the code. If n = k (no redundancy), distance = 1; single-bit parity has distance 2. A code with distance d can detect up to d-1 errors or correct up to $\lfloor (d-1)/2 \rfloor$ errors.
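The two distance values quoted above can be checked by brute force. This is a small Python sketch; the helper names are illustrative, and the parity code is the (4,3) single-bit parity code from the earlier example.

```python
from itertools import combinations, product


def hamming_distance(a, b):
    """Number of bit positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))


def code_distance(codewords):
    """Minimum distance between any two distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))


def parity_encode(message):
    """Append a check bit equal to the XOR of the message bits."""
    return message + (sum(message) % 2,)


no_redundancy = list(product((0, 1), repeat=3))      # n = k = 3
parity_code = [parity_encode(m) for m in no_redundancy]
print(code_distance(no_redundancy), code_distance(parity_code))
```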
31
Error Correcting Codes
A code with distance 3 is called a single error correcting (SEC) code. A code with distance 4 is called a single error correcting and double error detecting (SEC-DED) code. The procedure for constructing an SEC code is described in [Hamming 1950]: any H-matrix with all columns distinct and no all-0 column is SEC.
32
Hamming Code
For any value of n, an SEC code is constructed by setting each column in H equal to the binary representation of its column number (starting from 1). The number of rows in H equals $\log_2(n+1)$. Example of SEC Hamming code for n = 7: the columns of H are 001, 010, 011, 100, 101, 110, 111.
33
Error Correction in Hamming Code
Syndrome, s: $s = Hv^T$ for received vector v. If v is a codeword, the syndrome is 0. If v is a non-codeword with a single-bit error, the syndrome matches one of the columns of H and therefore contains the binary value of the bit position in error.
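Because column j of H is the binary representation of j, the syndrome is simply the XOR of the (1-based) positions of the bits that are 1. A minimal Python sketch for n = 7 follows; the codeword used is one valid codeword chosen for illustration, not the example from the slides.

```python
def syndrome(bits):
    """s = H v^T for the Hamming code whose column j is the binary value of j."""
    s = 0
    for position, bit in enumerate(bits, start=1):
        if bit:
            s ^= position
    return s


def correct_single_error(bits):
    """Return (corrected word, syndrome), flipping the bit named by the syndrome."""
    s = syndrome(bits)
    fixed = list(bits)
    if s:
        fixed[s - 1] ^= 1
    return fixed, s


codeword = [1, 1, 1, 0, 0, 0, 0]   # positions 1, 2, 3 are set: 1 ^ 2 ^ 3 == 0
received = [1, 1, 1, 0, 1, 0, 0]   # single-bit error in position 5
print(correct_single_error(received))
```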
34
Example of Error Correction
For (7,3) Hamming Code Suppose codeword has one-bit error changing it to
35
SEC-DED Code
Make the SEC Hamming code SEC-DED by adding a parity check over all bits. The extra parity check is 1 for a single-bit error and 0 for a double-bit error. This makes it possible to detect a double-bit error and avoid assuming a single-bit error and miscorrecting it.
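The decision rule described above can be sketched in Python as follows. The sketch assumes a 7-bit SEC Hamming word (column j of H equal to the binary value of j) followed by one overall parity bit; the word layout and function names are assumptions made for illustration.

```python
def syndrome(bits):
    """XOR of the 1-based positions of the set bits (H column j = binary j)."""
    s = 0
    for position, bit in enumerate(bits, start=1):
        if bit:
            s ^= position
    return s


def secded_decode(word):
    """word = 7 Hamming bits followed by one parity bit over all bits."""
    hamming = list(word[:7])
    s = syndrome(hamming)
    overall_parity = sum(word) % 2
    if overall_parity == 1:            # odd number of flips: treat as single error
        if s:
            hamming[s - 1] ^= 1        # error is in one of the Hamming bits
        return "corrected", hamming    # s == 0 means the parity bit itself flipped
    if s:
        return "double error detected", None
    return "no error", hamming


good = [1, 1, 1, 0, 0, 0, 0, 1]        # valid 7-bit codeword plus overall parity
print(secded_decode(good))
print(secded_decode([1, 1, 1, 0, 1, 0, 0, 1]))   # one bit flipped
print(secded_decode([0, 0, 1, 0, 0, 0, 0, 1]))   # two bits flipped
```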
36
Example of Error Correction
For (7,4) SEC-DED Hamming Code Suppose codeword has two-bit error changing it to Doesn’t match any column in H
37
Hsiao Code
Weight of a column: number of 1's in the column. Constructing an n-bit SEC-DED Hsiao code: first use all possible weight-1 columns, then all possible weight-3 columns, then weight-5 columns, etc., until n columns are formed. Number of check bits is log2(n+1). Minimizes the number of 1's in the H-matrix, giving less hardware and delay for computing the syndrome. Disadvantage: correction logic is more complex.
38
Example of Hsiao Code (7,3) Hsiao Code
Uses weight-1 and weight-3 columns
39
Unidirectional Errors
Errors in a block of data that cause only 0→1 or only 1→0 changes, but not both. Any number of bits may be in error in the one direction. Example: for a given correct codeword, unidirectional errors could produce 001000 (only 1→0 errors), whereas non-unidirectional errors could produce 101001 (both 1→0 and 0→1).
40
Unidirectional Error Detecting Codes
All unidirectional error detecting (AUED) codes detect all unidirectional errors in a codeword. Single-bit parity is not AUED: it cannot detect an even number of errors. No linear code is AUED: every linear code contains the all-0 vector, so it cannot detect all 1→0 errors.
41
Two-Rail Code
One check bit for each information bit, equal to the complement of that information bit. The two-rail code is AUED, with 50% redundancy. Example of the (6,3) two-rail code: message 101 has codeword 101010. Set of all codewords: 000111, 001110, 010101, 011100, 100011, 101010, 110001, 111000.
42
Berger Codes
Lowest redundancy of separable AUED codes. For k information bits, $\lceil \log_2(k+1) \rceil$ check bits, equal to the binary representation of the number of 0's in the information bits. Example: 7 information bits need $\log_2(7+1) = 3$ check bits; if the information bits contain 4 zeros, the check bits are 100.
43
Berger Codes
Codewords for the (5,3) Berger code: 00011, 00110, 01010, 01101, 10010, 10101, 11001, 11100. If unidirectional errors contain 1→0 errors, they can only increase the number of 0's in the information bits and can only decrease the binary number in the check bits. If they contain 0→1 errors, they can only decrease the number of 0's in the information bits and can only increase the binary number in the check bits.
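The codeword list above can be reproduced with a few lines of Python. This is an illustrative sketch; the function names are not from the slides, and words are represented as strings of '0' and '1' with the information bits first.

```python
def berger_encode(info):
    """Append the number of 0's in the information bits as binary check bits."""
    k = len(info)
    width = k.bit_length()          # equals ceil(log2(k + 1)) for k >= 1
    return info + format(info.count("0"), "0{}b".format(width))


def berger_check(word, k):
    """True if the check bits match the number of 0's in the k information bits."""
    return berger_encode(word[:k]) == word


codewords = [berger_encode(format(i, "03b")) for i in range(8)]
print(codewords)
print(berger_check("11100", 3), berger_check("10000", 3))  # second has 1->0 errors
```

A 1→0 error pattern raises the zero count in the information bits while it can only lower the stored count, so the two can never agree again; the same argument applies in reverse for 0→1 errors.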
44
Berger Codes
If 8 information bits: the Berger code requires $\lceil \log_2(8+1) \rceil = 4$ check bits, whereas the (16,8) two-rail code requires 8 check bits (50% redundancy). The redundancy advantage of the Berger code increases as k increases.
45
Constant Weight Codes
Non-separable, but lower redundancy than Berger codes. Each codeword has the same number of 1's. Example: 2-out-of-3 constant weight code: 110, 011, 101. AUED code: unidirectional errors always change the number of 1's.
46
Constant Weight Codes
Number of codewords in an m-out-of-n code: $\binom{n}{m} = \frac{n!}{m!\,(n-m)!}$. Codewords are maximized when m is as close to n/2 as possible: n/2-out-of-n when n is even; (n/2-0.5 or n/2+0.5)-out-of-n when n is odd. This minimizes the redundancy of the code.
47
Example
A 6-out-of-12 constant weight code has $\binom{12}{6} = 924$ codewords. A 12-bit Berger code (8 information bits, 4 check bits) has only $2^8 = 256$ codewords.
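The comparison can be reproduced for other word lengths with a short Python sketch. The helper names are illustrative; the Berger count assumes the largest number of information bits k whose check bits still fit in the n-bit word.

```python
import math


def constant_weight_count(m, n):
    """Number of codewords in an m-out-of-n code."""
    return math.comb(n, m)


def berger_count(n):
    """Codewords in an n-bit Berger code using the largest k that fits."""
    k = max(k for k in range(1, n) if k + k.bit_length() <= n)
    return 2 ** k


print(constant_weight_count(6, 12), berger_count(12))
```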
48
Constant Weight Codes
Advantage: less redundancy than Berger codes. Disadvantage: non-separable, so decoding logic is needed to convert a codeword back to the binary message.
49
Burst Error
Common: multi-bit errors tend to be clustered, e.g., when a noise source affects a contiguous set of bus lines. Length of a burst error: number of bits between the first and last error, where the burst may wrap around from the last to the first bit of the codeword. Example: with original codeword 00000000, any corrupted word whose first and last erroneous bits span 4 positions is a burst error of length 4, whatever the number of errors between the first and last error.
50
Cyclic Codes
Special class of linear code in which any codeword shifted cyclically is another codeword. Used to detect burst errors, since less redundancy is required to detect burst errors than general multi-bit errors. Some distance-2 codes can detect all burst errors of length 4, whereas detecting all possible 4-bit errors requires a distance-5 code.
51
Cyclic Redundancy Check (CRC) Code
Most widely used cyclic code. Uses a binary alphabet based on GF(2). A CRC code is an (n,k) block code formed using a generator polynomial, g(x), also called the code generator: a polynomial of degree n-k (same degree as the number of check bits).
52
CRC Code Example: c(x) = m(x)g(x) with g(x) = x^2 + 1
Message  m(x)                c(x)                    Codeword
0000     0                   0                       000000
0001     1                   x^2 + 1                 000101
0010     x                   x^3 + x                 001010
0011     x + 1               x^3 + x^2 + x + 1       001111
0100     x^2                 x^4 + x^2               010100
0101     x^2 + 1             x^4 + 1                 010001
0110     x^2 + x             x^4 + x^3 + x^2 + x     011110
0111     x^2 + x + 1         x^4 + x^3 + x + 1       011011
1000     x^3                 x^5 + x^3               101000
1001     x^3 + 1             x^5 + x^3 + x^2 + 1     101101
1010     x^3 + x             x^5 + x                 100010
1011     x^3 + x + 1         x^5 + x^2 + x + 1       100111
1100     x^3 + x^2           x^5 + x^4 + x^3 + x^2   111100
1101     x^3 + x^2 + 1       x^5 + x^4 + x^3 + 1     111001
1110     x^3 + x^2 + x       x^5 + x^4 + x^2 + x     110110
1111     x^3 + x^2 + x + 1   x^5 + x^4 + x + 1       110011
53
CRC Code Linear block code Has G-matrix and H-matrix
G-matrix shifted version of generator polynomial
54
CRC Code Example
(6,4) CRC code generated by $g(x) = x^2 + 1$
55
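Systematic CRC Codes
To obtain a systematic CRC code, codewords are formed using Galois division: $c(x) = m(x)x^{n-k} + r(x)$, where r(x) is the remainder of $m(x)x^{n-k}$ divided by g(x). This is convenient because an LFSR can be used to perform the division.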
56
Galois Division Example
Encode $m(x) = x^2 + x$ with $g(x) = x^2 + 1$. Requires dividing $m(x)x^{n-k} = x^4 + x^3$ by g(x); the remainder is $r(x) = x + 1$. Then $c(x) = m(x)x^{n-k} + r(x) = (x^2 + x)(x^2) + x + 1 = x^4 + x^3 + x + 1$.
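The division above can be reproduced in software by treating polynomials over GF(2) as integers whose bit i is the coefficient of x^i. This is a minimal Python sketch with illustrative function names; it models the arithmetic only, not any particular hardware.

```python
def gf2_mod(dividend, divisor):
    """Remainder of polynomial division over GF(2), polynomials stored as ints."""
    divisor_bits = divisor.bit_length()
    while dividend.bit_length() >= divisor_bits:
        # Cancel the leading term by XOR-ing in a shifted copy of the divisor.
        dividend ^= divisor << (dividend.bit_length() - divisor_bits)
    return dividend


def crc_encode(message, generator):
    """Systematic codeword: c(x) = m(x) x^(n-k) + r(x)."""
    check_bits = generator.bit_length() - 1      # degree of g(x) = n - k
    shifted = message << check_bits
    return shifted | gf2_mod(shifted, generator)


g = 0b101                                        # g(x) = x^2 + 1
codeword = crc_encode(0b0110, g)                 # m(x) = x^2 + x
print(format(codeword, "06b"), gf2_mod(codeword, g))
```

Every valid codeword leaves a zero remainder when divided by g(x), which is the property used when checking a received word.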
57
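Systematic CRC Code Example: c(x) = m(x)x^2 + r(x) with g(x) = x^2 + 1
Message  m(x)                r(x)    c(x)                    Codeword
0000     0                   0       0                       000000
0001     1                   1       x^2 + 1                 000101
0010     x                   x       x^3 + x                 001010
0011     x + 1               x + 1   x^3 + x^2 + x + 1       001111
0100     x^2                 1       x^4 + 1                 010001
0101     x^2 + 1             0       x^4 + x^2               010100
0110     x^2 + x             x + 1   x^4 + x^3 + x + 1       011011
0111     x^2 + x + 1         x       x^4 + x^3 + x^2 + x     011110
1000     x^3                 x       x^5 + x                 100010
1001     x^3 + 1             x + 1   x^5 + x^2 + x + 1       100111
1010     x^3 + x             0       x^5 + x^3               101000
1011     x^3 + x + 1         1       x^5 + x^3 + x^2 + 1     101101
1100     x^3 + x^2           x + 1   x^5 + x^4 + x + 1       110011
1101     x^3 + x^2 + 1       x       x^5 + x^4 + x^2 + x     110110
1110     x^3 + x^2 + x       1       x^5 + x^4 + x^3 + 1     111001
1111     x^3 + x^2 + x + 1   0       x^5 + x^4 + x^3 + x^2   111100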
58
Generating Check Bits for CRC Code
Use an LFSR with characteristic polynomial equal to g(x). Append n-k 0's to the end of the message. Example: $m(x) = x^2 + x + 1$ and $g(x) = x^3 + x + 1$
59
Checking CRC Codeword Checking Received Codeword for Errors
Shift codeword into LFSR with same characteristic polynomial as used to generate it If final state of LFSR non-zero, then error
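Both uses of the register, generating check bits and checking a received codeword, can be modeled bit-serially. The Python sketch below is a software model of a division register under assumed conventions (most significant bit shifted in first, register initially zero); it is not a description of the specific circuit on the slides.

```python
def lfsr_remainder(bits, generator):
    """Shift bits (MSB first) through a division register for g(x); return its state."""
    degree = generator.bit_length() - 1
    mask = (1 << degree) - 1
    state = 0
    for bit in bits:
        feedback = (state >> (degree - 1)) & 1   # bit about to be shifted out
        state = ((state << 1) | bit) & mask
        if feedback:
            state ^= generator & mask            # subtract g(x) over GF(2)
    return state


g = 0b1011                                       # g(x) = x^3 + x + 1
message = [1, 1, 1]                              # m(x) = x^2 + x + 1
check = lfsr_remainder(message + [0, 0, 0], g)   # append n-k = 3 zeros
codeword = message + [(check >> i) & 1 for i in (2, 1, 0)]
print(codeword, lfsr_remainder(codeword, g))     # final state 0: no error detected
```

Flipping any single bit of the codeword leaves the register in a non-zero final state, which flags the error.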
60
Selecting Generator Polynomial
Key issue for CRC codes. If the first and last bits of the polynomial are 1, the code will detect burst errors of length n-k or less. If the generator polynomial is a multiple of (x+1), it will detect any odd number of errors. If $g(x) = (x+1)p(x)$, where p(x) is primitive of degree n-k-1 and $n < 2^{n-k-1}$, it will detect single, double, triple, and odd errors.
61
Commonly Used CRC Generators
CRC code                        Generator polynomial
CRC-5 (USB token packets)       x^5 + x^2 + 1
CRC-12 (Telecom systems)        x^12 + x^11 + x^3 + x^2 + x + 1
CRC-16-CCITT (X25, Bluetooth)   x^16 + x^12 + x^5 + 1
CRC-32 (Ethernet)               x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x + 1
CRC-64 (ISO)                    x^64 + x^4 + x^3 + x + 1
62
Fault Tolerance Schemes
Adding Fault Tolerance to Design Improves dependability of system Requires redundancy Hardware Time Information
63
Hardware Redundancy
Involves replicating hardware units at any level of design: gate-level, module-level, chip-level, board-level. Three basic forms: static (also called passive), which masks faults rather than detecting them; dynamic (also called active), which detects faults and reconfigures to spare hardware; hybrid, which combines active and passive approaches.
64
Static Redundancy Masks faults so no erroneous outputs
Provides uninterrupted operation Important for real-time systems No time to reconfigure or retry operation Simple self-contained No need to update or rollback system state
65
Triple Module Redundancy (TMR)
Well-known static redundancy scheme Three copies of module Use majority voter to determine final output Error in one module out-voted by other two
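A majority voter is small enough to show directly. The Python sketch below votes bitwise on three module outputs represented as integers; the function name and example values are illustrative.

```python
def majority_vote(a, b, c):
    """Bitwise 2-out-of-3 majority of three module outputs."""
    return (a & b) | (b & c) | (a & c)


good = 0b1010
faulty = 0b0101                      # one module produces an erroneous output
print(format(majority_vote(good, faulty, good), "04b"))
```

Each output bit is 1 only when at least two of the three modules agree on 1, so a single erroneous module is out-voted regardless of which bits it corrupts.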
66
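TMR Reliability and MTTF
TMR works if any 2 modules work. With $R_m$ = reliability of each module and $R_v$ = reliability of the voter, $R_{TMR} = R_v(3R_m^2 - 2R_m^3)$. The MTTF for TMR is the area under this reliability curve.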
67
Comparison with Simplex
Neglecting fault rate of voter TMR has lower MTTF, but Can tolerate temporary faults Higher reliability for short mission times
68
Comparison with Simplex
Crossover point: $R_{TMR} > R_{simplex}$ when the mission time is shorter than about 70% of the simplex MTTF.
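The 70% figure and the lower MTTF of TMR can be checked with a short derivation. This is a sketch assuming a perfect voter and identical modules with constant failure rate $\lambda$, so that $R = e^{-\lambda t}$ and the simplex MTTF is $1/\lambda$; the notation is chosen here for illustration.

```latex
\begin{align*}
R_{TMR} = 3R^2 - 2R^3 &> R \\
\iff\; 2R^2 - 3R + 1 &< 0 \qquad (\text{dividing by } R > 0) \\
\iff\; (2R - 1)(R - 1) &< 0 \\
\iff\; R = e^{-\lambda t} &> \tfrac{1}{2} \qquad (\text{since } R < 1) \\
\iff\; t &< \frac{\ln 2}{\lambda} \approx 0.69\,\mathrm{MTTF}_{simplex} \\[4pt]
\mathrm{MTTF}_{TMR} &= \int_0^\infty \left(3e^{-2\lambda t} - 2e^{-3\lambda t}\right) dt
  = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda}
\end{align*}
```

So TMR has only 5/6 of the simplex MTTF, yet it is the more reliable choice for any mission shorter than about 0.69 of the simplex MTTF.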
69
N-Modular Redundancy (NMR)
N modules along with a majority voter; TMR is the special case N = 3. Number of failed modules masked = (N-1)/2. As N increases, MTTF decreases, but reliability for short missions increases. If the goal is only to tolerate temporary faults, TMR is sufficient.
70
Interwoven Logic
Replace each gate with 4 gates using an interconnection pattern that automatically corrects errors. Traditionally not as attractive as TMR because it requires a lot of area overhead. Renewed interest from researchers investigating emerging nanoelectronic technologies.
71
Interwoven Logic with 4 NOR Gates
72
Example of Error on Third Y Input
73
Dynamic Redundancy Involves Detecting fault
Locating faulty hardware unit Reconfiguring system to use spare fault-free hardware unit
74
Unpowered (Cold) Spares
Advantage Extends lifetime of spares Equations Assume spare not failing until powered Perfect reconfiguration capability
75
Unpowered (Cold) Spares
One cold spare doubles MTTF, assuming faults are always detected and the reconfiguration circuitry never fails. Drawbacks of cold spares: extra time to power up and initialize; cannot be used to help in detecting faults. Fault detection requires either periodic offline testing or online testing using time or information redundancy.
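The doubling of MTTF follows directly from the stated assumptions. The derivation below is a sketch assuming the active module and the spare share the same constant failure rate $\lambda$, the spare cannot fail while unpowered, and switchover is perfect and instantaneous; the symbols are chosen here for illustration.

```latex
\begin{align*}
R_{sys}(t) &= e^{-\lambda t}
  + \int_0^t \lambda e^{-\lambda \tau}\, e^{-\lambda (t - \tau)}\, d\tau
  = e^{-\lambda t}\,(1 + \lambda t) \\
\mathrm{MTTF}_{sys} &= \int_0^\infty e^{-\lambda t}\,(1 + \lambda t)\, dt
  = \frac{1}{\lambda} + \frac{1}{\lambda} = \frac{2}{\lambda}
\end{align*}
```

The first term is the case where the original module survives to time t; the integral covers a failure at time $\tau$ followed by the spare surviving the remaining $t - \tau$.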
76
Powered (Hot) Spares Can use spares for online fault detection
One approach is duplicate-and-compare If outputs mismatch then fault occurred Run diagnostic procedure to determine which module is faulty and replace with spare Any number of spares can be used
77
Pair-and-a-Spare Avoids halting system to run diagnostic procedure when fault occurs
78
TMR/Simplex
When one module in TMR fails, disconnect one of the remaining modules and continue in simplex mode. Improves MTTF while retaining the advantages of TMR when there are 3 good modules. TMR/Simplex reliability is always better than either TMR or simplex alone.
79
Comparison of Reliability vs Time
80
Hybrid Redundancy Combines both static and dynamic redundancy
Masks faults like static Detects and reconfigures like dynamic
81
TMR with Spares If TMR module fails Replace with spare
can be either hot or cold spare While system has three working modules TMR will provide fault masking for uninterrupted operation
82
Self-Purging Redundancy
Uses a threshold voter instead of a majority voter. The threshold voter outputs 1 if the number of inputs that are 1 is greater than the threshold; otherwise it outputs 0. Requires hot spares.
83
Self-Purging Redundancy
84
Self-Purging Redundancy
Compared with 5MR, self-purging with 5 modules can tolerate up to 3 failing modules (5MR cannot), but cannot tolerate two modules failing simultaneously (5MR can). Compared with TMR with 2 spares, it has simpler reconfiguration circuitry but requires hot spares (TMR with spares can use either hot or cold spares).
85
Time Redundancy
Advantage: less hardware. Drawback: cannot detect permanent faults. If an error is detected, the system needs to roll back to a known good state before resuming operation.
86
Repeated Execution Repeat operation twice
Simplest time redundancy approach Detects temporary faults occurring during one execution (but not both) Causes mismatch in results Can reuse same hardware for both executions Only one copy of functional hardware needed
87
Repeated Execution
Requires a mechanism for storing and comparing the results of both executions. In a processor, results can be stored in memory or on disk and compared in software. Main cost: additional time for the redundant execution and comparison.
88
Multi-threaded Redundant Execution
Can use in processor-based system that can run multiple threads Two copies of thread executed concurrently Results compared when both complete Take advantage of processor’s built-in capability to exploit processing resources Reduce execution time Can significantly reduce performance penalty
89
Multiple Sampling of Outputs
Done at circuit level. Sample once at the end of the normal clock cycle, then sample again after a delay of t. The two samples are compared; a mismatch indicates that an error occurred. Detects faults whose duration is less than t. Performance overhead depends on the size of t relative to the normal clock period.
90
Multiple Sampling of Outputs
Simple approach using two latches
91
Multiple Sampling of Outputs
Approach using stability checker at output
92
Diverse Recomputation
Use the same hardware, but perform the computation differently the second time. Can detect permanent faults that affect only one of the computations. For arithmetic or logical operations: shift the operands when performing the second computation [Patel 1982]. Detects a permanent fault affecting only one bit-slice.
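The shifted-operand idea can be illustrated with a toy fault model. The Python sketch below assumes an adder with one output bit-slice stuck at 0; the fault model and function names are hypothetical and chosen only to show why the two computations disagree.

```python
def faulty_add(a, b, stuck_bit):
    """Adder model whose output bit `stuck_bit` is permanently stuck at 0."""
    return (a + b) & ~(1 << stuck_bit)


def recompute_with_shift(a, b, adder):
    """Compute a + b normally, then again with operands shifted left by one."""
    first = adder(a, b)
    second = adder(a << 1, b << 1) >> 1      # shift the result back
    return first == second, first, second


def good_adder(a, b):
    return a + b


def bad_adder(a, b):
    return faulty_add(a, b, stuck_bit=3)


print(recompute_with_shift(5, 3, good_adder))   # results agree
print(recompute_with_shift(5, 3, bad_adder))    # results differ: fault detected
```

The faulty slice corrupts a different bit of the logical result in each pass, so a permanent fault confined to one slice shows up as a mismatch.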
93
Information Redundancy
Based on Error Detecting and Correcting Codes Advantage Detects both permanent and temporary faults Implemented with less hardware overhead than using multiple copies of module Disadvantage More complex design
94
Error Detection Error detecting codes used to detect errors
If error detected Rollback to previous known error-free state Retry operation
95
Rollback
Requires adding storage to save the previous state. The amount of rollback depends on the latency of the error detection mechanism. With zero-latency error detection, rollback is implemented by preventing the system state from updating. If errors are detected after n cycles, rollback must restore the system to its state at least n clock cycles earlier.
96
Checkpoint
Execution is divided into a set of operations. Before each operation is executed, a checkpoint is created where the system state is saved. If any error is detected during the operation, roll back to the last checkpoint and retry the operation. If multiple retries fail, operation halts and the system flags that a permanent fault has occurred.
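The checkpoint, rollback, and retry sequence can be sketched in software. The Python below is a minimal illustration under assumed interfaces: `operation` mutates a state dictionary, `error_detected` stands in for whatever error detection mechanism is available, and `PermanentFaultError` and `max_retries` are hypothetical names introduced here.

```python
import copy


class PermanentFaultError(Exception):
    """Raised when every retry from the checkpoint still shows an error."""


def run_with_checkpoint(state, operation, error_detected, max_retries=3):
    checkpoint = copy.deepcopy(state)            # save state before the operation
    for _ in range(max_retries):
        working = copy.deepcopy(checkpoint)      # rollback: restart from saved state
        operation(working)
        if not error_detected(working):
            return working                       # commit the error-free result
    raise PermanentFaultError("operation failed after %d attempts" % max_retries)


attempts = {"count": 0}


def increment_with_transient_fault(state):
    attempts["count"] += 1
    state["x"] += 1
    if attempts["count"] == 1:                   # corrupt only the first attempt
        state["x"] ^= 0x80


result = run_with_checkpoint({"x": 0}, increment_with_transient_fault,
                             lambda s: s["x"] != 1)
print(result, attempts["count"])
```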
97
Error Detection Encode outputs of circuit with error detecting code
Non-codeword output indicates error
98
Self-Checking Checker
Has two outputs. In the normal error-free case they are complementary: (1,0) or (0,1). If they are equal to each other, (0,0) or (1,1), an error is indicated. A checker cannot have a single error-indicator output, because a stuck-at-0 fault on that output could never be detected.
99
Totally Self-Checking Checker
Requires three properties. Code disjoint: all codeword inputs are mapped to codeword outputs. Fault secure: for all codeword inputs, the checker in the presence of a fault will produce either the correct codeword output or a non-codeword output (not an incorrect codeword). Self-testing: for each fault, at least one codeword input gives an error indication.
100
Duplicate-and-Compare
Equality checker indicates error Undetected error can occur only if common-mode fault affecting both copies Only faults after stems detected Over 100% overhead (including checker)
101
Single-Bit Parity Code
Totally self-checking checker formed by removing final gate from XOR tree
102
Single-Bit Parity Code
Cannot detect even bit errors Can ensure no even bit errors by generating each output with independent cone of logic Only single bit errors can occur due to single point fault Typically requires a lot of overhead
103
Parity-Check Codes
Each check bit is the parity for some set of output bits. Example: 6 outputs and 3 check bits.
104
Parity-Check Codes
For c check bits and k functional outputs, there are $2^{ck}$ possible parity-check codes. The code can be chosen based on the structure of the circuit to minimize undetected error combinations. Fanouts in the circuit determine the possible error combinations due to a single-point fault.
105
Checker for Parity-Check Codes
Constructed from single-bit parity checkers and two-rail checkers
106
Two-Rail Checkers Totally self-checking two-rail checker
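The input/output behavior of a two-rail checker cell can be modeled in a few lines. The Python sketch below uses the widely used AND-OR formulation (z0 = x0·y0 + x1·y1, z1 = x0·y1 + x1·y0); whether this matches the circuit drawn on the slide is an assumption, and the model shows only the fault-free mapping, not the behavior under internal faults.

```python
from itertools import product


def two_rail_checker(x, y):
    """Combine two two-rail pairs into one two-rail output pair."""
    (x0, x1), (y0, y1) = x, y
    z0 = (x0 & y0) | (x1 & y1)
    z1 = (x0 & y1) | (x1 & y0)
    return z0, z1


def is_two_rail(pair):
    """A pair is a two-rail codeword when its bits are complementary."""
    return pair[0] != pair[1]


pairs = list(product((0, 1), repeat=2))
for x, y in product(pairs, repeat=2):
    print(x, y, "->", two_rail_checker(x, y))
```

The output pair is complementary exactly when both input pairs are complementary, so larger checkers can be built as a tree of such cells.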
107
Berger Codes Inverter-free circuit Inverters only at primary inputs
Can be synthesized using only algebraic factoring [Jha 1993] Only unidirectional errors possible for single point faults Can use unidirectional code Berger code gives 100% coverage
108
Constant Weight Codes Non-separable with lower redundancy
Drawback: need decoding logic to convert codeword back to its original binary value Can use for encoding states of FSM No need for decoding logic
109
Error Correction
Information redundancy can also be used to mask errors. Not as attractive as TMR for logic, because the logic for predicting check bits is very complex. However, very good for memories: check bits are stored with the data, and errors do not propagate in memories as they do in logic circuits, so SEC-DED is usually sufficient.
110
Error Correction Memories very dense and prone to errors
Especially due to single-event upsets (SEUs) from radiation SEC-DED check bits stored in memory 32-bit word, SEC-DED requires 7 check bits Increases size of memory by 7/32=21.9% 64-bit word, SEC-DED requires 8 check bits Increases size of memory by 8/64=12.5%
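The check-bit counts quoted above follow from the Hamming bound for single error correction plus one extra overall parity bit. This Python sketch assumes that standard construction; the function name is illustrative.

```python
def secded_check_bits(data_bits):
    """Check bits for SEC-DED: smallest r with 2^r >= k + r + 1, plus one parity bit."""
    r = 1
    while 2 ** r < data_bits + r + 1:   # syndromes must cover every bit plus "no error"
        r += 1
    return r + 1                        # extra overall-parity bit for double detection


for k in (32, 64):
    c = secded_check_bits(k)
    print(k, c, "{:.1%}".format(c / k))
```

The relative overhead shrinks as the word gets wider, which is why wide memory words are attractive for ECC.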
111
Memory ECC Architecture
112
Hamming Code for ECC RAM
[Table: parity groups 1 to 4 over data bits Z1 to Z8 and check bits c1 to c4. Groups 1 and 2 each cover six bits; groups 3 and 4 each cover five bits.]
113
Memory ECC
SEC-DED is generally very effective. Memory bit-flips tend to be independent and uniformly distributed. If a bit-flip occurs, it gets corrected the next time the memory location is accessed. The main risk is a memory word that is not accessed for a long time, in which multiple bit-flips could accumulate.
114
Memory Scrubbing Every location in memory read on periodic basis
Reduces chance of multiple errors accumulating in a memory word Can be implemented by having memory controller cycle through memory during idle periods
115
Multiple-Bit Upsets (MBU)
Can occur due to single SEU Typically occur in adjacent memory cells Memory interleaving used To prevent MBUs from resulting in multiple bit errors in same word
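The effect of interleaving can be illustrated with a simple address mapping. The Python sketch below assumes that physically adjacent cells in a row are assigned to different words in round-robin order; the mapping and names are illustrative, not a description of a specific memory design.

```python
def cell_to_word_bit(cell, interleave):
    """Map a physical cell index in a row to (word index, bit index)."""
    return cell % interleave, cell // interleave


def errors_per_word(upset_cells, interleave):
    """Count how many upset cells land in each logical word."""
    counts = {}
    for cell in upset_cells:
        word, _ = cell_to_word_bit(cell, interleave)
        counts[word] = counts.get(word, 0) + 1
    return counts


mbu = [8, 9, 10, 11]                      # four adjacent cells upset by one event
print(errors_per_word(mbu, interleave=4))  # one error in each of four words
print(errors_per_word(mbu, interleave=1))  # no interleaving: four errors in one word
```

With an interleaving factor at least as large as the upset width, each word sees at most one flipped bit, which SEC-DED can correct.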
116
Industry Practices
Long-Life Systems. Issues: difficult or expensive to repair. Goal: maximize MTTF. Examples: satellites, spacecraft, implanted biomedical. Techniques: dynamic redundancy.
Reliable Real-Time Systems. Issues: error or delay catastrophic. Goal: fault masking capability. Examples: aircraft, nuclear power plant, air bag electronics, radar. Techniques: TMR.
High Availability Systems. Issues: downtime very costly. Goal: high availability. Examples: reservation system, stock exchange, telephone systems. Techniques: no single point of failure; self-checking pairs; fault isolation.
High Integrity Systems. Issues: data corruption very costly. Goal: data integrity. Examples: banking, transaction processing, database. Techniques: checkpointing, time redundancy; ECC; redundant disks.
Mainstream Low-Cost Systems. Issues: reasonable level of failures acceptable. Goal: meet failure rate expectations at low cost. Examples: consumer electronics, personal computers. Techniques: often none; memory ECC; bus parity; changing as technology scales.
117
Concluding Remarks Many different fault-tolerant schemes
Choosing scheme depends on Types of faults to be tolerated Temporary or permanent Single or multiple point failures etc. Design constraints Area, performance, power, etc.
118
Concluding Remarks As technology scales
Circuits increasingly prone to failure Achieving sufficient fault tolerance will be major design issue