EE141 System-on-Chip Test Architectures Ch. 3 - Fault-Tolerant Design - P. 1 1 Chapter 3 Fault-Tolerant Design.

Presentation transcript:

Chapter 3: Fault-Tolerant Design

What is this chapter about?
• Gives an overview of fault-tolerant design
• Focus on
  – Basic concepts in fault-tolerant design
  – Metrics used to specify and evaluate dependability
  – Review of coding theory
  – Fault-tolerant design schemes: hardware redundancy, information redundancy, time redundancy
  – Examples of fault-tolerant applications in industry

Fault-Tolerant Design (outline)
• Introduction
• Fundamentals of Fault Tolerance
• Fundamentals of Coding Theory
• Fault-Tolerant Schemes
• Industry Practices
• Concluding Remarks

Introduction
• Fault tolerance
  – Ability of a system to continue error-free operation in the presence of an unexpected fault
• Important in mission-critical applications
  – E.g., medical, aviation, banking
  – Errors are very costly
• Becoming important in mainstream applications
  – Technology scaling makes circuit behavior less predictable and more prone to failures
  – Fault tolerance is needed to keep the failure rate within acceptable levels

Faults
• Permanent faults
  – Due to manufacturing defects, early-life failures, and wearout failures
  – Wearout failures arise from various mechanisms, e.g., electromigration, hot carrier degradation, dielectric breakdown
• Temporary faults
  – Only present for a short period of time
  – Caused by external disturbance or marginal design parameters

Temporary Faults
• Transient errors (non-recurring errors)
  – Caused by external disturbance, e.g., radiation, noise, power disturbance
• Intermittent errors (recurring errors)
  – Caused by marginal design parameters
  – Timing problems, e.g., races, hazards, skew
  – Signal integrity problems, e.g., crosstalk, ground bounce

Redundancy
• Fault tolerance requires some form of redundancy
  – Time redundancy
  – Hardware redundancy
  – Information redundancy

Time Redundancy
• Perform the same operation twice
  – Check whether both runs give the same result
  – If not, a fault occurred
  – Can detect temporary faults
  – Cannot detect permanent faults, since they would affect both computations
• Advantage: little to no hardware overhead
• Disadvantage: impacts system or circuit performance

Hardware Redundancy
• Replicate hardware and compare outputs from two or more modules
  – Detects both permanent and temporary faults
• Advantage: little or no performance impact
• Disadvantage: area and power for the redundant hardware

Information Redundancy
• Encode outputs with an error-detecting or error-correcting code
  – Code selected to minimize redundancy for the targeted class of faults
• Advantage: less hardware to generate the redundant information than replicating the module
• Drawback: added complexity in design

Failure Rate
• $\lambda(t)$ = component failure rate
• Measured in FITs (failures per $10^9$ hours)

System Failure Rate
• System constructed from components
• No fault tolerance: if any component fails, the whole system fails
  – The system failure rate is then the sum of the component failure rates, $\lambda_{sys} = \sum_i \lambda_i$

Reliability
• If a component is working at time 0, $R(t)$ = probability it is still working at time $t$
• Exponential failure law: $R(t) = e^{-\lambda t}$
  – Holds if the failure rate $\lambda$ is assumed constant
  – Good approximation once past the infant mortality period

Reliability for Series System
• Series system: all components need to work for the system to work
  – $R_{sys}(t) = \prod_i R_i(t)$

System Reliability with Redundancy
• System reliability with a second copy of component B in parallel
  – Can tolerate one copy of component B failing
  – The parallel pair works unless both copies fail, so it contributes $1 - (1 - R_B)^2$ in place of $R_B$
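The failure-rate and reliability definitions above can be tried numerically. The following is a minimal sketch, assuming constant failure rates (the exponential failure law), independent failures, and made-up FIT values for three components named A, B, and C; the function names are illustrative and do not come from the slides.

```python
import math

HOURS_PER_FIT = 1e9  # 1 FIT = 1 failure per 10^9 hours

def reliability(fit, hours):
    """Exponential failure law: R(t) = exp(-lambda * t) for constant lambda."""
    return math.exp(-(fit / HOURS_PER_FIT) * hours)

def series(reliabilities):
    """Series system: every component must work, so reliabilities multiply."""
    product = 1.0
    for r in reliabilities:
        product *= r
    return product

def parallel(r, copies=2):
    """Redundant copies: the group fails only if every copy fails."""
    return 1.0 - (1.0 - r) ** copies

# Hypothetical FIT rates for components A, B, and C (assumed values)
fits = {"A": 100, "B": 500, "C": 200}
t = 100_000  # mission time in hours

r = {name: reliability(fit, t) for name, fit in fits.items()}
r_no_ft = series(r.values())                                # no fault tolerance
r_with_spare_b = series([r["A"], parallel(r["B"]), r["C"]])  # B duplicated

print(f"series: {r_no_ft:.4f}, with redundant B: {r_with_spare_b:.4f}")
```

Duplicating only the least reliable component (B here) raises the system reliability while the other components stay in series.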

Mean-Time-to-Failure (MTTF)
• Average time before the system fails
• Equal to the area under the reliability curve: $MTTF = \int_0^\infty R(t)\,dt$
• For the exponential failure law: $MTTF = 1/\lambda$

Maintainability
• If the system failed at time 0, $M(t)$ = probability it is repaired and operational at time $t$
• System repair time is divided into
  – Passive repair time: time for the service engineer to travel to the site
  – Active repair time: time to locate the failing component, repair or replace it, and verify that the system is operational
  – Active repair time can be improved by designing the system so that failed components are easy to locate and verify

Repair Rate and MTTR
• $\mu$ = rate at which the system is repaired
  – Analogous to the failure rate
• Maintainability is often modeled as $M(t) = 1 - e^{-\mu t}$
• Mean-Time-to-Repair: $MTTR = 1/\mu$

Availability
• System availability: fraction of time the system is operational
[Figure: system state S (1 = normal system operation, 0 = failed) versus time t, with failures marked between t0 and t4]
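The availability slide lost its formula in this transcript. A common steady-state model, assumed here and not confirmed by the slide text, takes availability as MTTF / (MTTF + MTTR), with MTTF = 1/λ and MTTR = 1/μ. The sketch below uses hypothetical rates and illustrative function names.

```python
def mttf(failure_rate):
    """Mean time to failure for a constant failure rate (per hour)."""
    return 1.0 / failure_rate

def mttr(repair_rate):
    """Mean time to repair for a constant repair rate (per hour)."""
    return 1.0 / repair_rate

def availability(failure_rate, repair_rate):
    """Assumed steady-state model: uptime / (uptime + downtime)."""
    up = mttf(failure_rate)
    down = mttr(repair_rate)
    return up / (up + down)

def downtime_minutes_per_year(avail):
    """Expected downtime in a 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24 * 60

# Hypothetical system: one failure per 10,000 h, repairs take 10 h on average
a = availability(failure_rate=1e-4, repair_rate=1e-1)
print(f"availability = {a:.6f}")
print(f"four nines allows {downtime_minutes_per_year(0.9999):.2f} min/year of downtime")
```

Under this model, pushing availability up means either lowering the failure rate or shortening repairs.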

Availability
• Telephone systems
  – Required to have system availability of 0.9999 (“four nines”)
• High-reliability systems
  – May require 7 or more nines
• Fault-tolerant design
  – Needed to achieve such high availability from less reliable components

Coding Theory
• Coding
  – Using more bits than necessary to represent data
  – Provides a way to detect errors, which occur when bits get flipped
• Error-detecting codes
  – Many types
  – Detect different classes of errors
  – Use different amounts of redundancy
  – Ease of encoding and decoding data varies

Block Code
• Message = data being encoded
• Block code: encodes $m$ messages with $n$-bit codewords
• With no redundancy, $m$ messages are encoded with $\log_2(m)$ bits, the minimum possible

Block Code
• To detect errors, some redundancy is needed
  – The space of $2^n$ distinct blocks is partitioned into codewords and non-codewords
• Can detect errors that turn a codeword into a non-codeword
• Cannot detect errors that turn a codeword into another codeword

Separable Block Code
• Separable: $n$-bit blocks are partitioned into
  – $k$ information bits directly representing the message
  – $(n-k)$ check bits
• Denoted an $(n,k)$ block code
• Advantage: the $k$-bit message is extracted directly, without decoding
• Rate of a separable block code = $k/n$

Example of Separable Block Code
• (4,3) parity code (single-bit parity)
  – Check bit is the XOR of the 3 message bits
  – Message 101 gives codeword 1010
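The (4,3) parity example can be checked with a few lines of code. This is a small sketch with illustrative function names; it appends the check bit after the message bits, matching the 101 → 1010 example.

```python
def parity_encode(message_bits):
    """Append one check bit equal to the XOR of all message bits."""
    check = 0
    for bit in message_bits:
        check ^= bit
    return message_bits + [check]

def parity_ok(codeword):
    """A valid codeword has an even number of 1s (XOR of all bits is 0)."""
    total = 0
    for bit in codeword:
        total ^= bit
    return total == 0

codeword = parity_encode([1, 0, 1])
print(codeword)                 # the slide's example: 101 -> 1010
print(parity_ok([1, 1, 1, 0]))  # one flipped bit is detected
print(parity_ok([0, 1, 1, 0]))  # two flipped bits slip through
```

The last call shows the limit of a distance-2 code: a double-bit error maps one codeword onto another and goes unnoticed.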

Example of Non-Separable Block Code
• One-hot code
  – Each codeword has a single 1
• Example of 8-bit one-hot
  – 10000000, 01000000, 00100000, 00010000, 00001000, 00000100, 00000010, 00000001
• Redundancy = $1 - \log_2(8)/8 = 5/8$

Linear Block Codes
• Special class of block codes
  – Modulo-2 sum of any 2 codewords is also a codeword
• Null space of an $(n-k) \times n$ Boolean matrix
  – Called the parity check matrix, $H$
• For any $n$-bit codeword $c$: $cH^T = 0$
• The all-0 codeword exists in any linear code

Linear Block Codes
• Generator matrix, $G$
  – $k \times n$ matrix
• Codeword $c$ for message $m$: $c = mG$
• $GH^T = 0$

Systematic Block Code
• First $k$ bits correspond to the message
• Last $n-k$ bits correspond to the check bits
• For a systematic code
  – $G = [\,I_{k \times k} : P_{k \times (n-k)}\,]$
  – $H = [\,P^T_{(n-k) \times k} : I_{(n-k) \times (n-k)}\,]$, the ordering for which $GH^T = P + P = 0$ holds
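The relation between G, H, and the codewords can be verified on a small code. The sketch below assumes a hypothetical (6,3) code with an arbitrarily chosen matrix P (the slide's own example matrices are not preserved), and takes H = [P^T : I], the ordering that makes GH^T = 0 when the message bits come first. All helper names are illustrative.

```python
def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def hstack(a, b):
    """Place matrix b to the right of matrix a."""
    return [row_a + row_b for row_a, row_b in zip(a, b)]

def mat_mul_gf2(a, b):
    """Matrix product with arithmetic modulo 2."""
    return [[sum(x * y for x, y in zip(row, col)) % 2 for col in zip(*b)]
            for row in a]

# Hypothetical 3x3 parity sub-matrix P (an assumption for illustration)
P = [[1, 1, 0],
     [0, 1, 1],
     [1, 0, 1]]

G = hstack(identity(3), P)             # G = [I : P]
H = hstack(transpose(P), identity(3))  # H = [P^T : I]

def encode(message):
    """c = mG; the first k bits of c are the message itself."""
    return mat_mul_gf2([message], G)[0]

def is_codeword(v):
    """v is a codeword exactly when H v^T = 0."""
    return not any(row[0] for row in mat_mul_gf2(H, transpose([v])))

print(encode([1, 0, 1]))
print(mat_mul_gf2(G, transpose(H)))  # all zeros
```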

Distance of Code
• Distance between two codewords
  – Number of bits in which they differ
• Distance of a code
  – Minimum distance between any two codewords in the code
  – If $n = k$ (no redundancy), distance = 1
  – Single-bit parity, distance = 2
• A code with distance $d$ can
  – Detect up to $d-1$ errors
  – Correct up to $\lfloor (d-1)/2 \rfloor$ errors
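The distance definitions translate directly into a brute-force check over all codeword pairs. The sketch below uses three tiny example codes chosen for illustration (a code with no redundancy, a single-parity code, and a 3-bit repetition code); the function names are not from the slides.

```python
from itertools import combinations

def hamming_distance(a, b):
    """Number of bit positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def code_distance(codewords):
    """Minimum distance over every pair of distinct codewords."""
    return min(hamming_distance(a, b) for a, b in combinations(codewords, 2))

def capability(d):
    """Errors detectable and correctable by a code of distance d."""
    return {"detect": d - 1, "correct": (d - 1) // 2}

no_redundancy = ["00", "01", "10", "11"]      # n = k
single_parity = ["000", "011", "101", "110"]  # 2 message bits + parity
triple_repeat = ["000", "111"]                # 1 message bit sent 3 times

for name, code in [("no redundancy", no_redundancy),
                   ("single parity", single_parity),
                   ("repetition", triple_repeat)]:
    d = code_distance(code)
    print(name, d, capability(d))
```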

Error Correcting Codes
• Code with distance 3
  – Called a single error correcting (SEC) code
• Code with distance 4
  – Called a single error correcting and double error detecting (SEC-DED) code
• Procedure for constructing an SEC code
  – Described in [Hamming 1950]
  – Any H-matrix with all columns distinct and no all-0 column is SEC

Hamming Code
• For any value of $n$, an SEC code is constructed by
  – Setting each column of $H$ equal to the binary representation of its column number (starting from 1)
• Number of rows in $H$ equals $\lceil \log_2(n+1) \rceil$
• Example of SEC Hamming code for $n = 7$ (columns are the binary values 1 through 7):
  H = 0 0 0 1 1 1 1
      0 1 1 0 0 1 1
      1 0 1 0 1 0 1

Error Correction in Hamming Code
• Syndrome, $s$
  – $s = Hv^T$ for received vector $v$
• If $v$ is a codeword
  – Syndrome = 0
• If $v$ is a non-codeword with a single-bit error
  – Syndrome matches one of the columns of $H$
  – Syndrome contains the binary value of the bit position in error
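Syndrome decoding as described above can be sketched for n = 7. The code below builds H from the binary column numbers, then locates and flips a single erroneous bit. The test vector is a codeword chosen for illustration (positions 1, 2, and 3 set, whose column numbers XOR to zero), not the example from the slides.

```python
N = 7     # codeword length
ROWS = 3  # ceil(log2(N + 1)) check rows

# Column j (1-based) of H is the binary representation of j, MSB in row 0.
H = [[(col >> (ROWS - 1 - row)) & 1 for col in range(1, N + 1)]
     for row in range(ROWS)]

def syndrome(v):
    """Compute s = H v^T over GF(2) and return it as an integer."""
    bits = [sum(h * x for h, x in zip(row, v)) % 2 for row in H]
    return int("".join(str(b) for b in bits), 2)

def correct_single_error(v):
    """Flip the bit the syndrome points at (positions count from 1)."""
    s = syndrome(v)
    fixed = list(v)
    if s != 0:
        fixed[s - 1] ^= 1
    return fixed, s

codeword = [1, 1, 1, 0, 0, 0, 0]  # illustrative codeword: 1 ^ 2 ^ 3 == 0
received = list(codeword)
received[4] ^= 1                  # inject an error at position 5

fixed, s = correct_single_error(received)
print(s, fixed == codeword)
```

A zero syndrome leaves the word untouched. A double-bit error yields a nonzero syndrome that points at a third, innocent position, which is the miscorrection that SEC-DED codes guard against.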

Example of Error Correction
• For (7,3) Hamming code
  – Suppose a codeword has a one-bit error

SEC-DED Code
• Make the SEC Hamming code SEC-DED
  – By adding a parity check over all bits
• When the syndrome is non-zero, the extra parity check gives
  – 1 for a single-bit error
  – 0 for a double-bit error
• Makes it possible to detect a double-bit error
  – Avoids assuming a single-bit error and miscorrecting it

Example of Error Correction
• For (7,4) SEC-DED Hamming code
  – Suppose a codeword has a two-bit error
  – The resulting syndrome doesn’t match any column in $H$

Hsiao Code
• Weight of a column
  – Number of 1’s in the column
• Constructing an $n$-bit SEC-DED Hsiao code
  – First use all possible weight-1 columns
  – Then all possible weight-3 columns
  – Then weight-5 columns, etc.
  – Until $n$ columns are formed
• Number of check bits is $\lceil \log_2(n+1) \rceil$
• Minimizes the number of 1’s in the H-matrix
  – Less hardware and delay for computing the syndrome
  – Disadvantage: correction logic is more complex

Example of Hsiao Code
• (7,3) Hsiao code
  – Uses weight-1 and weight-3 columns

Unidirectional Errors
• Errors in a block of data that cause only 0→1 or only 1→0 changes, but not both
  – Any number of bits may be in error in the one direction
• Example
  – Unidirectional errors could turn the correct codeword into, e.g., 001000 (only 1→0 errors)
  – Non-unidirectional errors could turn it into, e.g., 101001 (both 1→0 and 0→1 errors)

Unidirectional Error Detecting Codes
• All unidirectional error detecting (AUED) codes
  – Detect all unidirectional errors in a codeword
• Single-bit parity is not AUED
  – Cannot detect an even number of errors
• No linear code is AUED
  – All linear codes must contain the all-0 vector, so they cannot detect all 1→0 errors

Two-Rail Code
• Two-rail code
  – One check bit for each information bit
  – Each check bit equals the complement of its information bit
• Two-rail code is AUED
• 50% redundancy
• Example of (6,3) two-rail code
  – Message 101 has codeword 101010
  – Set of all codewords: 000111, 001110, 010101, 011100, 100011, 101010, 110001, 111000

Berger Codes
• Lowest redundancy of the separable AUED codes
  – For $k$ information bits, $\lceil \log_2(k+1) \rceil$ check bits
• Check bits equal the binary representation of the number of 0’s in the information bits
• Example
  – 7 information bits containing four 0’s
  – $\lceil \log_2(7+1) \rceil = 3$ check bits
  – Check bits equal 100 (four 0’s)

Berger Codes
• Codewords for (5,3) Berger code
  – 00011, 00110, 01010, 01101, 10010, 10101, 11001, 11100
• If unidirectional errors contain 1→0 errors
  – They increase the number of 0’s in the information bits
  – They can only decrease the binary number in the check bits
• If unidirectional errors contain 0→1 errors
  – They decrease the number of 0’s in the information bits
  – They can only increase the binary number in the check bits
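The Berger construction is easy to reproduce in code. The sketch below works on bit strings, computes the check-bit width as ⌈log2(k+1)⌉ via the integer bit length of k, and regenerates the (5,3) codeword list; the function names are illustrative.

```python
def berger_check_bits(info):
    """Binary count of 0s in the information bits, zero-padded."""
    k = len(info)
    width = k.bit_length()  # equals ceil(log2(k + 1)) for k >= 1
    return format(info.count("0"), f"0{width}b")

def berger_encode(info):
    """Separable code: information bits followed by check bits."""
    return info + berger_check_bits(info)

def berger_ok(word, k):
    """Recompute the check bits from the first k bits and compare."""
    return berger_check_bits(word[:k]) == word[k:]

codewords = [berger_encode(format(i, "03b")) for i in range(8)]
print(codewords)

# Unidirectional 1->0 errors on 01101: the 0-count rises, the check value falls.
print(berger_ok("00000", 3))
```

The mismatch in the last line is guaranteed for any purely 1→0 (or purely 0→1) error pattern, because the count of 0's and the stored check value can only move in opposite directions.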

Berger Codes
• With 8 information bits
  – Berger code requires $\lceil \log_2(8+1) \rceil = 4$ check bits
• (16,8) two-rail code
  – Requires 50% redundancy
• Redundancy advantage of the Berger code
  – Increases as $k$ increases

Constant Weight Codes
• Constant weight codes
  – Non-separable, but lower redundancy than Berger codes
  – Each codeword has the same number of 1’s
• Example: 2-out-of-3 constant weight code
  – 110, 011, 101
• AUED code
  – Unidirectional errors always change the number of 1’s

Constant Weight Codes
• Number of codewords in an $m$-out-of-$n$ code: $\binom{n}{m} = \frac{n!}{m!\,(n-m)!}$
• Number of codewords is maximized when $m$ is as close to $n/2$ as possible
  – $n/2$-out-of-$n$ when $n$ is even
  – $(n/2 - 0.5)$- or $(n/2 + 0.5)$-out-of-$n$ when $n$ is odd
• This choice minimizes the redundancy of the code

Example
• 6-out-of-12 constant weight code
  – $\binom{12}{6} = 924$ codewords
• 12-bit Berger code
  – Only $2^8 = 256$ codewords
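The comparison between the two 12-bit codes can be recomputed. In this sketch `math.comb` counts the m-out-of-n codewords, and a small helper (an illustrative name, not from the slides) finds how many information bits fit in an n-bit Berger codeword.

```python
from math import comb

def constant_weight_codewords(m, n):
    """Number of n-bit words containing exactly m ones."""
    return comb(n, m)

def berger_message_bits(n):
    """Largest k such that k information bits plus ceil(log2(k+1)) check bits fit in n."""
    for k in range(n, 0, -1):
        if k + k.bit_length() <= n:  # k.bit_length() == ceil(log2(k + 1))
            return k
    return 0

n = 12
best_m = max(range(n + 1), key=lambda m: constant_weight_codewords(m, n))
print(best_m, constant_weight_codewords(best_m, n))  # weight that maximizes the count
print(2 ** berger_message_bits(n))                   # codewords in a 12-bit Berger code
```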

Constant Weight Codes
• Advantage
  – Less redundancy than Berger codes
• Disadvantage
  – Non-separable
  – Need decoding logic to convert the codeword back to the binary message

Burst Error
• Burst error
  – Common, because multi-bit errors tend to be clustered
  – E.g., a noise source affects a contiguous set of bus lines
• Length of a burst error
  – Number of bits between the first and last error
  – May wrap around from the last to the first bit of the codeword
  – Any number of errors may occur between the first and last error

Cyclic Codes
• Special class of linear code
  – Any codeword shifted cyclically is another codeword
• Used to detect burst errors
• Less redundancy is required to detect burst errors than general multi-bit errors
  – Some distance-2 codes can detect all burst errors of length 4
  – Detecting all possible 4-bit errors requires a distance-5 code

Cyclic Redundancy Check (CRC) Code
• Most widely used cyclic code
  – Uses a binary alphabet based on GF(2)
• A CRC code is an $(n,k)$ block code
• Formed using a generator polynomial, $g(x)$
  – Called the code generator
  – Polynomial of degree $n-k$ (the same as the number of check bits)

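(6,4) CRC code with g(x) = x^2 + 1, non-systematic: c(x) = m(x)g(x)

Message  m(x)               g(x)     c(x)                   Codeword
0000     0                  x^2 + 1  0                      000000
0001     1                  x^2 + 1  x^2 + 1                000101
0010     x                  x^2 + 1  x^3 + x                001010
0011     x + 1              x^2 + 1  x^3 + x^2 + x + 1      001111
0100     x^2                x^2 + 1  x^4 + x^2              010100
0101     x^2 + 1            x^2 + 1  x^4 + 1                010001
0110     x^2 + x            x^2 + 1  x^4 + x^3 + x^2 + x    011110
0111     x^2 + x + 1        x^2 + 1  x^4 + x^3 + x + 1      011011
1000     x^3                x^2 + 1  x^5 + x^3              101000
1001     x^3 + 1            x^2 + 1  x^5 + x^3 + x^2 + 1    101101
1010     x^3 + x            x^2 + 1  x^5 + x                100010
1011     x^3 + x + 1        x^2 + 1  x^5 + x^2 + x + 1      100111
1100     x^3 + x^2          x^2 + 1  x^5 + x^4 + x^3 + x^2  111100
1101     x^3 + x^2 + 1      x^2 + 1  x^5 + x^4 + x^3 + 1    111001
1110     x^3 + x^2 + x      x^2 + 1  x^5 + x^4 + x^2 + x    110110
1111     x^3 + x^2 + x + 1  x^2 + 1  x^5 + x^4 + x + 1      110011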

CRC Code
• Linear block code
  – Has a G-matrix and an H-matrix
  – Rows of the G-matrix are shifted versions of the generator polynomial

CRC Code Example
• (6,4) CRC code generated by $g(x) = x^2 + 1$
• Each row of G is the coefficient vector 101 of $g(x)$, shifted one position per row:
  G = 1 0 1 0 0 0
      0 1 0 1 0 0
      0 0 1 0 1 0
      0 0 0 1 0 1

Systematic CRC Codes
• To obtain a systematic CRC code
  – Codewords are formed using Galois division
  – Convenient because an LFSR can be used to perform the division

Galois Division Example
• Encode $m(x) = x^2 + x$ with $g(x) = x^2 + 1$
• Requires dividing $m(x)x^{n-k} = x^4 + x^3$ by $g(x)$
• Remainder $r(x) = x + 1$
  – $c(x) = m(x)x^{n-k} + r(x) = (x^2 + x)(x^2) + x + 1 = x^4 + x^3 + x + 1$

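(6,4) CRC code with g(x) = x^2 + 1, systematic: c(x) = m(x)x^2 + r(x)

Message  m(x)               g(x)     r(x)   c(x)                   Codeword
0000     0                  x^2 + 1  0      0                      000000
0001     1                  x^2 + 1  1      x^2 + 1                000101
0010     x                  x^2 + 1  x      x^3 + x                001010
0011     x + 1              x^2 + 1  x + 1  x^3 + x^2 + x + 1      001111
0100     x^2                x^2 + 1  1      x^4 + 1                010001
0101     x^2 + 1            x^2 + 1  0      x^4 + x^2              010100
0110     x^2 + x            x^2 + 1  x + 1  x^4 + x^3 + x + 1      011011
0111     x^2 + x + 1        x^2 + 1  x      x^4 + x^3 + x^2 + x    011110
1000     x^3                x^2 + 1  x      x^5 + x                100010
1001     x^3 + 1            x^2 + 1  x + 1  x^5 + x^2 + x + 1      100111
1010     x^3 + x            x^2 + 1  0      x^5 + x^3              101000
1011     x^3 + x + 1        x^2 + 1  1      x^5 + x^3 + x^2 + 1    101101
1100     x^3 + x^2          x^2 + 1  x + 1  x^5 + x^4 + x + 1      110011
1101     x^3 + x^2 + 1      x^2 + 1  x      x^5 + x^4 + x^2 + x    110110
1110     x^3 + x^2 + x      x^2 + 1  1      x^5 + x^4 + x^3 + 1    111001
1111     x^3 + x^2 + x + 1  x^2 + 1  0      x^5 + x^4 + x^3 + x^2  111100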

Generating Check Bits for CRC Code
• Use an LFSR
  – With characteristic polynomial equal to $g(x)$
  – Append $n-k$ 0’s to the end of the message
• Example: $m(x) = x^2 + x + 1$ and $g(x) = x^3 + x + 1$

Checking CRC Codeword
• Checking a received codeword for errors
  – Shift the codeword into an LFSR with the same characteristic polynomial as was used to generate it
  – If the final state of the LFSR is non-zero, an error occurred
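Systematic CRC encoding and checking can be modeled in software with polynomial division over GF(2). The sketch below stores each polynomial as an integer whose bits are the coefficients (bit i holds the coefficient of x^i) and performs the division with shifts and XORs. It models the arithmetic that the LFSR carries out and is not a cycle-accurate LFSR; the function names are illustrative.

```python
def poly_mod(dividend, generator):
    """Remainder of GF(2) polynomial division (polynomials as integers)."""
    g_degree = generator.bit_length() - 1
    while dividend.bit_length() - 1 >= g_degree:
        shift = dividend.bit_length() - 1 - g_degree
        dividend ^= generator << shift  # subtract (XOR) the aligned generator
    return dividend

def crc_encode(message, generator):
    """c(x) = m(x) * x^(n-k) + r(x), with r(x) the remainder modulo g(x)."""
    check_bits = generator.bit_length() - 1
    shifted = message << check_bits     # append n-k zeros to the message
    return shifted | poly_mod(shifted, generator)

def crc_check(codeword, generator):
    """A valid codeword is divisible by g(x), so the remainder is zero."""
    return poly_mod(codeword, generator) == 0

g = 0b101                      # g(x) = x^2 + 1
c = crc_encode(0b110, g)       # m(x) = x^2 + x
print(format(c, "06b"))        # x^4 + x^3 + x + 1
print(crc_check(c, g))         # True: no error
print(crc_check(c ^ 0b100, g)) # False: single-bit error detected
```

The first printed value reproduces the Galois division example: the remainder x + 1 lands in the two low-order check positions.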

Selecting Generator Polynomial
• Key issue for CRC codes
• If the first and last bits of the polynomial are 1
  – Will detect burst errors of length $n-k$ or less
• If the generator polynomial is a multiple of $(x+1)$
  – Will detect any odd number of errors
• If $g(x) = (x+1)p(x)$, where $p(x)$ is primitive of degree $n-k-1$ and $n < 2^{n-k-1}$
  – Will detect single, double, triple, and odd errors

Commonly Used CRC Generators

CRC code                        Generator polynomial
CRC-5 (USB token packets)       x^5 + x^2 + 1
CRC-12 (Telecom systems)        x^12 + x^11 + x^3 + x^2 + x + 1
CRC-16-CCITT (X25, Bluetooth)   x^16 + x^12 + x^5 + 1
CRC-32 (Ethernet)               x^32 + x^26 + x^23 + x^22 + x^16 + x^12 + x^11 + x^10 + x^8 + x^7 + x^5 + x^4 + x^2 + x + 1
CRC-64 (ISO)                    x^64 + x^4 + x^3 + x + 1

Fault Tolerance Schemes
• Adding fault tolerance to a design
  – Improves dependability of the system
  – Requires redundancy: hardware, time, or information

Hardware Redundancy
• Involves replicating hardware units
  – At any level of design: gate level, module level, chip level, board level
• Three basic forms
  – Static (also called passive): masks faults rather than detecting them
  – Dynamic (also called active): detects faults and reconfigures to spare hardware
  – Hybrid: combines active and passive approaches

Static Redundancy
• Masks faults so that no erroneous outputs appear
  – Provides uninterrupted operation
  – Important for real-time systems, which have no time to reconfigure or retry an operation
• Simple and self-contained
  – No need to update or roll back system state

Triple Module Redundancy (TMR)
• Well-known static redundancy scheme
  – Three copies of the module
  – A majority voter determines the final output
  – An error in one module is out-voted by the other two
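The voting step can be sketched as a bitwise majority function. In the sketch below the three module copies are modeled as three calls to the same Python function, and a fault is injected by XOR-ing a mask into one copy's output; the `faulty_copy` and `fault_mask` parameters are hypothetical test hooks, not part of any real design.

```python
def majority_vote(a, b, c):
    """Bitwise majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (b & c) | (a & c)

def tmr(module, x, faulty_copy=None, fault_mask=0):
    """Run three copies of a module and vote on their outputs."""
    outputs = [module(x) for _ in range(3)]
    if faulty_copy is not None:
        outputs[faulty_copy] ^= fault_mask  # corrupt one copy's output
    return majority_vote(*outputs)

def increment(x):
    return x + 1

print(tmr(increment, 41))                                # fault-free
print(tmr(increment, 41, faulty_copy=1, fault_mask=0xFF))  # one copy corrupted
```

Both calls return the same value because the corrupted copy is out-voted. Two simultaneously corrupted copies would defeat the voter.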

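TMR Reliability and MTTF
• TMR works if any 2 modules work
  – $R_m$ = reliability of each module
  – $R_v$ = reliability of the voter
  – $R_{TMR} = R_v\,(R_m^3 + 3R_m^2(1 - R_m)) = R_v\,(3R_m^2 - 2R_m^3)$
• MTTF for TMR
  – With a perfect voter and constant module failure rate $\lambda$: $MTTF_{TMR} = \int_0^\infty (3e^{-2\lambda t} - 2e^{-3\lambda t})\,dt = \frac{5}{6\lambda}$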

Comparison with Simplex
• Neglecting the fault rate of the voter
• TMR has a lower MTTF, but
  – Can tolerate temporary faults
  – Has higher reliability for short mission times

Comparison with Simplex
• Crossover point
  – $R_{TMR} > R_{simplex}$ when $R_m > 0.5$, i.e., $t < \ln 2 / \lambda$
  – That is, when the mission time is shorter than about 70% of the simplex MTTF
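The crossover can be checked numerically. This sketch assumes a constant module failure rate λ (exponential failure law) and a perfect voter unless `r_voter` is supplied, and uses the standard two-out-of-three expression 3R² − 2R³ for TMR; the failure rate chosen is hypothetical.

```python
import math

def r_simplex(lam, t):
    """Single module: R(t) = exp(-lambda * t)."""
    return math.exp(-lam * t)

def r_tmr(lam, t, r_voter=1.0):
    """TMR works if at least 2 of 3 modules work: Rv * (3 Rm^2 - 2 Rm^3)."""
    rm = r_simplex(lam, t)
    return r_voter * (3 * rm**2 - 2 * rm**3)

def mttf_simplex(lam):
    return 1.0 / lam

def mttf_tmr(lam):
    """Area under the TMR reliability curve with a perfect voter: 5 / (6 lambda)."""
    return 5.0 / (6.0 * lam)

lam = 1e-3                      # hypothetical failure rate per hour
crossover = math.log(2) / lam   # time at which Rm falls to 0.5

print(f"MTTF simplex {mttf_simplex(lam):.0f} h, TMR {mttf_tmr(lam):.0f} h")
print(f"crossover at {crossover:.0f} h ({crossover / mttf_simplex(lam):.0%} of simplex MTTF)")
for t in (100, 2000):
    print(t, round(r_simplex(lam, t), 4), round(r_tmr(lam, t), 4))
```

Before the crossover TMR is the more reliable choice; past it, the three aging modules make TMR worse than a single module.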

N-Modular Redundancy (NMR)
• NMR
  – $N$ modules along with a majority voter
  – TMR is a special case
• Number of failed modules masked = $\lfloor (N-1)/2 \rfloor$
• As $N$ increases, MTTF decreases
  – But reliability for short missions increases
• If the goal is only to tolerate temporary faults, TMR is sufficient

Interwoven Logic
• Replace each gate with 4 gates, using an interconnection pattern that automatically corrects errors
• Traditionally not as attractive as TMR
  – Requires a lot of area overhead
• Renewed interest from researchers investigating emerging nanoelectronic technologies

[Figure: Interwoven Logic with 4 NOR Gates]

[Figure: Example of Error on Third Y Input]

Dynamic Redundancy
• Involves
  – Detecting the fault
  – Locating the faulty hardware unit
  – Reconfiguring the system to use a spare fault-free hardware unit

Unpowered (Cold) Spares
• Advantage
  – Extends lifetime of spares
• Analysis assumes
  – A spare does not fail until it is powered
  – Perfect reconfiguration capability

Unpowered (Cold) Spares
• One cold spare doubles MTTF
  – Assuming faults are always detected and the reconfiguration circuitry never fails
• Drawbacks of a cold spare
  – Extra time to power up and initialize
  – Cannot be used to help detect faults
• Fault detection requires either
  – Periodic offline testing, or
  – Online testing using time or information redundancy

Powered (Hot) Spares
• Can use spares for online fault detection
• One approach is duplicate-and-compare
  – If the outputs mismatch, a fault occurred
  – Run a diagnostic procedure to determine which module is faulty and replace it with a spare
• Any number of spares can be used

Pair-and-a-Spare
• Avoids halting the system to run a diagnostic procedure when a fault occurs

TMR/Simplex
• When one module in TMR fails
  – Disconnect one of the remaining modules and continue in simplex mode
  – Improves MTTF while retaining the advantages of TMR while all 3 modules are good
• TMR/Simplex
  – Reliability is always better than either TMR or simplex alone

[Figure: Comparison of Reliability vs Time]

Hybrid Redundancy
• Combines both static and dynamic redundancy
  – Masks faults like static redundancy
  – Detects faults and reconfigures like dynamic redundancy

TMR with Spares
• If a TMR module fails
  – Replace it with a spare, which can be either a hot or a cold spare
• While the system has three working modules
  – TMR provides fault masking for uninterrupted operation

Self-Purging Redundancy
• Uses a threshold voter instead of a majority voter
  – The threshold voter outputs 1 if the number of inputs that are 1 is greater than the threshold
  – Otherwise it outputs 0
• Requires hot spares

[Figure: Self-Purging Redundancy]

Self-Purging Redundancy
• Compared with 5MR, self-purging with 5 modules
  – Can tolerate up to 3 failing modules (5MR cannot)
  – Cannot tolerate two modules failing simultaneously (5MR can)
• Compared with TMR with 2 spares, self-purging with 5 modules
  – Has simpler reconfiguration circuitry
  – Requires hot spares (TMR with spares can use either hot or cold spares)

Time Redundancy
• Advantage
  – Less hardware
• Drawback
  – Cannot detect permanent faults
• If an error is detected
  – The system needs to roll back to a known good state before resuming operation

Repeated Execution
• Perform the operation twice
  – Simplest time redundancy approach
• Detects temporary faults occurring during one execution (but not both)
  – Such a fault causes a mismatch in the results
• Can reuse the same hardware for both executions
  – Only one copy of the functional hardware is needed

Repeated Execution
• Requires a mechanism for storing and comparing the results of both executions
  – In a processor, results can be stored in memory or on disk and compared in software
• Main cost
  – Additional time for the redundant execution and comparison

Multi-threaded Redundant Execution
• Can be used in a processor-based system that can run multiple threads
  – Two copies of a thread are executed concurrently
  – Results are compared when both complete
• Takes advantage of the processor’s built-in capability to exploit processing resources
  – Reduces execution time
  – Can significantly reduce the performance penalty

Multiple Sampling of Outputs
• Done at circuit level
  – Sample once at the end of the normal clock cycle
  – Sample again after a delay of $\Delta t$
  – The two samples are compared; a mismatch indicates that an error occurred
• Detects faults whose duration is less than $\Delta t$
• Performance overhead depends on
  – Size of $\Delta t$ relative to the normal clock period

Multiple Sampling of Outputs
- Simple approach using two latches

Multiple Sampling of Outputs
- Approach using stability checker at output

Diverse Recomputation
- Use same hardware, but perform computation differently second time
- Can detect permanent faults that affect only one computation
- For arithmetic or logical operations
  – Shift operands when performing second computation [Patel 1982]
  – Detects permanent fault affecting only one bit-slice
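
Recomputing with shifted operands can be sketched for addition as follows. The fault model (one output bit-slice of the adder stuck at 0) and the helper names are assumptions made for illustration. The second computation shifts both operands left by one bit, so the faulty slice touches a different bit of the true sum.

```python
def make_adder(stuck_at_zero_bit=None):
    """Return an adder; optionally one output bit-slice is stuck at 0."""

    def add(a, b):
        result = a + b
        if stuck_at_zero_bit is not None:
            result &= ~(1 << stuck_at_zero_bit)  # permanent fault in one slice
        return result

    return add


def add_with_shifted_recompute(add, a, b):
    """Add normally, then recompute with operands shifted left by one bit."""
    first = add(a, b)
    second = add(a << 1, b << 1) >> 1  # shift result back before comparing
    if first != second:
        raise RuntimeError("mismatch: fault in one bit-slice suspected")
    return first


good = add_with_shifted_recompute(make_adder(), 5, 2)
```

For example, with bit 1 stuck at 0 and operands 5 and 2, the first pass yields 5 instead of 7, the shifted pass yields 6, and the mismatch exposes the permanent fault even though the same adder was used twice.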

Information Redundancy
- Based on error detecting and correcting codes
- Advantage
  – Detects both permanent and temporary faults
  – Implemented with less hardware overhead than using multiple copies of module
- Disadvantage
  – More complex design

Error Detection
- Error detecting codes used to detect errors
- If error detected
  – Roll back to previous known error-free state
  – Retry operation

Rollback
- Requires adding storage to save previous state
- Amount of rollback depends on latency of error detection mechanism
- Zero-latency error detection
  – Rollback implemented by preventing system state from updating
- If errors detected after n cycles
  – Need rollback restoring system to state at least n clock cycles earlier

Checkpoint
- Execution divided into set of operations
- Before each operation executed
  – Checkpoint created where system state saved
- If any error detected during operation
  – Roll back to last checkpoint and retry operation
- If multiple retries fail
  – Operation halts and system flags that permanent fault has occurred
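
The checkpoint, rollback, and retry sequence can be sketched in software as below. This is an illustrative model, not the slides' implementation: the state is a plain dictionary, `DetectedError` stands in for whatever error detection mechanism is used, and `max_retries` is an assumed parameter.

```python
import copy


class DetectedError(Exception):
    """Raised by an operation when its error detection mechanism fires."""


class PermanentFaultError(Exception):
    """Raised when retries are exhausted and a permanent fault is assumed."""


def run_with_checkpoints(state, operations, max_retries=3):
    """Run operations in order, checkpointing the state before each one."""
    for operation in operations:
        checkpoint = copy.deepcopy(state)  # save state before the operation
        for _ in range(max_retries):
            try:
                operation(state)           # may corrupt state and raise
                break                      # success: go to next operation
            except DetectedError:
                state.clear()              # roll back to the last checkpoint
                state.update(copy.deepcopy(checkpoint))
        else:
            # Every attempt failed: halt and flag a permanent fault.
            raise PermanentFaultError("operation failed on every retry")
    return state


def make_increment(failures):
    """Hypothetical operation that corrupts state and fails `failures` times."""
    remaining = {"count": failures}

    def increment(state):
        state["x"] += 1
        if remaining["count"] > 0:
            remaining["count"] -= 1
            state["x"] = -999              # simulated corruption
            raise DetectedError
    return increment
```

An operation that fails once is retried from the saved state and then succeeds, whereas one that keeps failing causes `PermanentFaultError` after `max_retries` attempts, with the state left at the last checkpoint.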

Error Detection
- Encode outputs of circuit with error detecting code
- Non-codeword output indicates error

Self-Checking Checker
- Has two outputs
- Normal error-free case: (1,0) or (0,1)
- If outputs equal each other, (0,0) or (1,1), then error
- Cannot have single error indicator output
  – Stuck-at-0 fault on output could never be detected

Totally Self-Checking Checker
- Requires three properties
- Code Disjoint
  – All codeword inputs mapped to codeword outputs, and all non-codeword inputs mapped to non-codeword outputs
- Fault Secure
  – For all codeword inputs, checker in presence of fault will either produce correct codeword output or non-codeword output (not incorrect codeword)
- Self-Testing
  – For each fault, at least one codeword input gives error indication

Duplicate-and-Compare
- Equality checker indicates error
- Undetected error can occur only if common-mode fault affects both copies
- Only faults after the input fanout stems are detected (a fault on a stem affects both copies identically)
- Over 100% overhead (including checker)

Single-Bit Parity Code
- Totally self-checking checker formed by removing final gate from XOR tree

Single-Bit Parity Code
- Cannot detect errors in an even number of bits
- Can ensure no even-bit errors by generating each output with independent cone of logic
  – Only single-bit errors can occur due to single-point fault
  – Typically requires a lot of overhead

Parity-Check Codes
- Each check bit is parity for some set of output bits
- Example: 6 outputs and 3 check bits

Parity-Check Codes
- For c check bits and k functional outputs
  – $2^{ck}$ possible parity-check codes
- Can choose code based on structure of circuit to minimize undetected error combinations
- Fanouts in circuit determine possible error combinations due to single-point fault

Checker for Parity-Check Codes
- Constructed from single-bit parity checkers and two-rail checkers

Two-Rail Checkers
- Totally self-checking two-rail checker
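
Since the checker figure is not part of this transcript, the following is a sketch of one commonly used two-rail checker cell, assumed here rather than copied from the slide. Each input is a pair that should be complementary, and the cell's two outputs are complementary only when every input pair is. The function names are illustrative.

```python
def two_rail_checker(pair_a, pair_b):
    """One two-rail checker cell: two input pairs in, one output pair out."""
    a0, a1 = pair_a
    b0, b1 = pair_b
    z0 = (a0 & b0) | (a1 & b1)
    z1 = (a0 & b1) | (a1 & b0)
    return z0, z1


def check_pairs(pairs):
    """Combine any number of pairs with a chain of two-rail checker cells."""
    result = pairs[0]
    for pair in pairs[1:]:
        result = two_rail_checker(result, pair)
    return result


def is_error(outputs):
    """Outputs (0,0) or (1,1) signal an error; (0,1) or (1,0) mean no error."""
    return outputs[0] == outputs[1]


ok = check_pairs([(1, 0), (0, 1), (1, 0)])
bad = check_pairs([(1, 0), (1, 1), (0, 1)])
```

Because the error indication is itself a two-rail pair, a stuck-at fault on either checker output eventually produces an equal pair and is therefore visible, which a single error-indicator output could not guarantee.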

Berger Codes
- Inverter-free circuit
  – Inverters only at primary inputs
  – Can be synthesized using only algebraic factoring [Jha 1993]
- Only unidirectional errors possible for single-point faults
  – Can use unidirectional code
  – Berger code gives 100% coverage
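
As an illustration of why a Berger code catches every unidirectional error, here is a minimal sketch. It assumes the usual formulation in which the check bits hold the binary count of 0s in the information bits; the function names are hypothetical.

```python
def berger_encode(info_bits):
    """Append check bits holding the number of 0s in `info_bits` (MSB first)."""
    zeros = info_bits.count(0)
    width = len(info_bits).bit_length()  # enough bits to count 0..k zeros
    check = [(zeros >> i) & 1 for i in reversed(range(width))]
    return list(info_bits) + check


def berger_check(word, k):
    """Return True if `word` (k information bits plus check bits) is a codeword."""
    info, check = list(word[:k]), list(word[k:])
    return berger_encode(info)[k:] == check


codeword = berger_encode([1, 0, 1, 1])   # one zero, so check bits encode 1
corrupted = list(codeword)
corrupted[0] = 0                         # 1 -> 0 error in the information bits
corrupted[6] = 0                         # 1 -> 0 error in the check bits
```

A 1-to-0 error in the information part raises the zero count while a 1-to-0 error in the check part lowers the stored count, so the two can never cancel; the same argument holds in the other direction for 0-to-1 errors.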

Constant Weight Codes
- Non-separable with lower redundancy
- Drawback: need decoding logic to convert codeword back to its original binary value
- Can use for encoding states of FSM
  – No need for decoding logic

Error Correction
- Information redundancy can also be used to mask errors
- Not as attractive as TMR because logic for predicting check bits very complex
- However, very good for memories
  – Check bits stored with data
  – Errors do not propagate in memories as in logic circuits, so SEC-DED usually sufficient

Error Correction
- Memories very dense and prone to errors
  – Especially due to single-event upsets (SEUs) from radiation
- SEC-DED check bits stored in memory
- 32-bit word, SEC-DED requires 7 check bits
  – Increases size of memory by 7/32 = 21.9%
- 64-bit word, SEC-DED requires 8 check bits
  – Increases size of memory by 8/64 = 12.5%
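
The check-bit counts above follow from the Hamming bound for single-error correction plus one extra overall parity bit for double-error detection. The short sketch below reproduces the arithmetic; the function names are illustrative.

```python
def sec_ded_check_bits(data_bits):
    """Number of check bits for a SEC-DED code over `data_bits` data bits."""
    r = 1
    # Single-error correction needs 2**r >= data_bits + r + 1, so that the
    # syndrome can name every bit position plus the "no error" case.
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one more overall parity bit adds double-error detection


def overhead_percent(data_bits):
    """Memory size increase caused by storing the check bits."""
    return 100.0 * sec_ded_check_bits(data_bits) / data_bits


for width in (32, 64):
    print(width, sec_ded_check_bits(width), round(overhead_percent(width), 1))
```

The relative overhead shrinks as the word gets wider, which is one reason ECC is usually applied to wide memory words.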

Memory ECC Architecture

Hamming Code for ECC RAM
[Figure: data bits Z1 to Z8 and check bits c1 to c4, with each check bit formed over one of four parity groups]
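
A single-error-correcting Hamming code for 8 data bits and 4 check bits can be sketched as follows. The bit-to-position assignment is the textbook one (check bits at positions 1, 2, 4, 8) and is an assumption; the parity groups in the slide's figure may be arranged differently. The function names are illustrative.

```python
CHECK_POSITIONS = (1, 2, 4, 8)
DATA_POSITIONS = (3, 5, 6, 7, 9, 10, 11, 12)


def hamming_encode(data_bits):
    """Encode 8 data bits into a 12-bit codeword (list index 0 = position 1)."""
    word = [0] * 13                      # index 0 unused, positions 1..12
    for pos, bit in zip(DATA_POSITIONS, data_bits):
        word[pos] = bit
    for c in CHECK_POSITIONS:
        parity = 0
        for pos in range(1, 13):
            if pos & c and pos != c:     # data positions in this parity group
                parity ^= word[pos]
        word[c] = parity                 # make the group's parity even
    return word[1:]


def hamming_correct(codeword):
    """Return (data_bits, syndrome); a nonzero syndrome is the flipped position."""
    word = [0] + list(codeword)
    syndrome = 0
    for pos in range(1, 13):
        if word[pos]:
            syndrome ^= pos              # XOR of the positions holding a 1
    if syndrome > 12:
        raise ValueError("syndrome points outside the word: multi-bit error")
    if syndrome:
        word[syndrome] ^= 1              # correct the single flipped bit
    return [word[pos] for pos in DATA_POSITIONS], syndrome


data = [1, 0, 1, 1, 0, 0, 1, 0]
stored = hamming_encode(data)
stored[4] ^= 1                           # single bit-flip at position 5
recovered, syndrome = hamming_correct(stored)
```

Here `recovered` equals the original data and `syndrome` is 5, the position of the flipped bit. Adding one overall parity bit to this code would give the SEC-DED variant discussed for memories.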

Memory ECC
- SEC-DED generally very effective
- Memory bit-flips tend to be independent and uniformly distributed
- If bit-flip occurs, gets corrected next time memory location accessed
- Main risk is if memory word not accessed for long time
  – Multiple bit-flips could accumulate

Memory Scrubbing
- Every location in memory read on periodic basis
- Reduces chance of multiple errors accumulating in a memory word
- Can be implemented by having memory controller cycle through memory during idle periods

Multiple-Bit Upsets (MBU)
- Can occur due to single SEU
- Typically occur in adjacent memory cells
- Memory interleaving used
  – To prevent MBUs from resulting in multiple bit errors in same word
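
The effect of interleaving can be shown with a simple address mapping. This is a sketch under an assumed layout in which physically adjacent columns of a memory row belong to different logical words; the function names and the interleaving factor of 4 are illustrative.

```python
def physical_to_logical(column, interleave):
    """Map a physical column in a row to (word index, bit index in that word)."""
    return column % interleave, column // interleave


def errors_per_word(upset_columns, interleave):
    """Count how many upset cells land in each logical word."""
    counts = {}
    for column in upset_columns:
        word, _ = physical_to_logical(column, interleave)
        counts[word] = counts.get(word, 0) + 1
    return counts


upset = [8, 9, 10]                          # three adjacent cells hit at once
without = errors_per_word(upset, interleave=1)
with_interleaving = errors_per_word(upset, interleave=4)
```

Without interleaving all three upsets fall in one word, beyond what SEC-DED can correct. With an interleaving factor of 4 each affected word sees a single error, which SEC-DED corrects.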

Type | Issues | Goal | Examples | Techniques
Long-Life Systems | Difficult or Expensive to Repair | Maximize MTTF | Satellites; Spacecraft; Implanted Biomedical | Dynamic Redundancy
Reliable Real-Time Systems | Error or Delay Catastrophic | Fault Masking Capability | Aircraft; Nuclear Power Plant; Air Bag Electronics; Radar | TMR
High Availability Systems | Downtime Very Costly | High Availability | Reservation System; Stock Exchange; Telephone Systems | No Single Point of Failure; Self-Checking Pairs; Fault Isolation
High Integrity Systems | Data Corruption Very Costly | High Data Integrity | Banking; Transaction Processing; Database | Checkpointing, Time Redundancy; ECC; Redundant Disks
Mainstream Low-Cost Systems | Reasonable Level of Failures Acceptable | Meet Failure Rate Expectations at Low Cost | Consumer Electronics; Personal Computers | Often None; Memory ECC; Bus Parity; Changing as Technology Scales

Concluding Remarks
- Many different fault-tolerant schemes
- Choosing scheme depends on
  – Types of faults to be tolerated: temporary or permanent, single or multiple point failures, etc.
  – Design constraints: area, performance, power, etc.

Concluding Remarks
- As technology scales
  – Circuits increasingly prone to failure
  – Achieving sufficient fault tolerance will be major design issue