4. Information Redundancy Reliable System Design 2010 by: Amir M. Rahmani
matlab1.ir
Information Redundancy
- Code: a means of representing information (e.g., Morse code)
- Code word: a collection of symbols or digits used to represent information according to the rules of a given code
- Binary code: a code whose code words contain only the symbols 0 and 1
- Purposes: error detection and error correction
- Coding is often applied to:
  - Information transfer: often serial communication through a channel
  - Information storage
- Start with a k-bit data word
- Add r code bits to the k data bits
- Total: an n-bit code word ($n = k + r$)
- Not all $2^n$ combinations are valid code words, so receiving an invalid word reveals an error
- For certain encoding schemes, some types of errors can also be corrected
- To extract the original data, the n bits must be decoded
- Overhead:
  - Additional bits required: $r/n$ (e.g., for single-bit parity the overhead is $1/n$)
  - Time to encode and decode
Hamming distance (d)
- Number of bit positions in which two words differ: $d(x,y) = \sum_k (x_k \oplus y_k)$
- E.g., 0010 and 1110 have a Hamming distance of 2
- Rules:
  - $d(x,y) = 0$ iff $x = y$
  - $d(x,y) = d(y,x)$
  - $d(x,y) + d(y,z) \ge d(x,z)$
- For a set of code words, d is the minimum Hamming distance over all pairs of distinct code words
- E.g., {000, 011, 101, 110} has a Hamming distance of 2
- d determines the code's ability to detect and/or correct errors:
  - Detects up to $d - 1$ bit errors
  - Corrects up to $\lfloor (d-1)/2 \rfloor$ bit errors
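The definitions above can be checked with a minimal sketch. The function names `hamming_distance` and `code_distance` are illustrative, not from the slides, and words are assumed to be given as bit strings of equal length.

```python
from itertools import combinations

def hamming_distance(x: str, y: str) -> int:
    """Number of positions in which two equal-length bit strings differ."""
    if len(x) != len(y):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(x, y))

def code_distance(code_words) -> int:
    """Minimum Hamming distance over all pairs of distinct code words."""
    return min(hamming_distance(x, y) for x, y in combinations(code_words, 2))

def capabilities(d: int):
    """(detectable bit errors, correctable bit errors) for code distance d."""
    return d - 1, (d - 1) // 2

d = code_distance(["000", "011", "101", "110"])
print(d, capabilities(d))
```

For the code {000, 111} the same functions give d = 3, which matches the detect-two or correct-one capability stated for TMR.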
Hamming distance (d)
- Two words in this figure are connected by an edge if their Hamming distance is 1
- d = 2: can detect single-bit errors
Hamming distance (d)
- The code {000, 111} can be used to encode a single data bit: 0 is encoded as 000 and 1 as 111
- This code is identical to TMR (triple modular redundancy)
- d = 3: can detect single- and double-bit errors, or correct single-bit errors (not both at once)
Separability of a Code
- A code is separable if it has separate fields for the data bits and the code bits
  - Decoding consists of disregarding the code bits
  - The code bits can be processed separately to verify the correctness of the data
- A non-separable code has the data and code bits integrated together: extracting the data from the encoded word requires some processing
Single-bit Parity
- Simplest separable error detection code: adds one bit of redundancy to each data word
- Encoding and decoding cost is low
- Even (odd) parity: add a bit such that the total number of 1s in the code word is even (odd)
  - E.g., a data word with an even number of 1s gets a parity bit of 0 for even parity (1 for odd)
- Can detect all single-bit errors (in fact, all errors affecting an odd number of bits)
  - Hamming distance >= 2
  - Could be greater than 2 if the data words do not use all bit combinations
- Drawback: unable to detect errors affecting an even number of bits, which are common
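A minimal sketch of single-bit parity, assuming data words are given as bit strings and the parity bit is appended at the end (the slides do not fix a position). The names `parity_encode` and `parity_check` are illustrative.

```python
def parity_encode(data: str, even: bool = True) -> str:
    """Append one parity bit so the total number of 1s is even (or odd)."""
    bit = data.count("1") % 2      # 1 if the data has an odd number of 1s
    if not even:
        bit ^= 1                   # odd parity uses the opposite bit
    return data + str(bit)

def parity_check(word: str, even: bool = True) -> bool:
    """True if the received word has the expected parity."""
    expected = 0 if even else 1
    return word.count("1") % 2 == expected

word = parity_encode("0110")                  # even parity
one_error = "1" + word[1:]                    # flip the first bit
two_errors = "10" + word[2:]                  # flip the first two bits
print(parity_check(word), parity_check(one_error), parity_check(two_errors))
```

The last check illustrates the drawback on the slide: flipping two bits leaves the parity unchanged, so the error passes the check.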
Single-bit Even Parity
Even or Odd Parity?
- The choice depends on which type of all-bits failure is more probable
- With even parity, the parity bit for the all-0s data word is 0, so an all-0s failure produces a valid code word and goes undetected
- Selecting odd parity allows the all-0s failure to be detected
- If an all-1s failure is more likely, select odd parity if the total number of bits (n+1) is even, and even parity if n+1 is odd
Byte-Interlaced Parity Code
- Example: n = 64, data bits $a_{63}, a_{62}, \ldots, a_0$
- Eight parity bits:
  - First: parity bit of $a_{63}, a_{55}, a_{47}, a_{39}, a_{31}, a_{23}, a_{15}, a_7$, the most significant bits of the eight bytes
  - Remaining seven parity bits: assigned so that the corresponding groups of bits are interlaced in the same way
- The scheme is beneficial when shorting of adjacent bits is a common failure mode (e.g., on a bus)
- If the parity type (odd or even) is alternated between groups, unidirectional errors (all-0s or all-1s) will also be detected
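The grouping can be sketched as follows, assuming the 64-bit word is held in a Python integer with $a_0$ as the least significant bit and even parity in every group. The function name `interlaced_parity` is illustrative.

```python
def interlaced_parity(word: int):
    """Eight even-parity bits for a 64-bit word.

    Entry 0 covers a63, a55, ..., a7 (the MSB of each byte);
    entry 7 covers a56, a48, ..., a0 (the LSB of each byte).
    """
    parities = []
    for pos in range(7, -1, -1):              # bit position inside a byte
        group = [(word >> (8 * byte + pos)) & 1 for byte in range(8)]
        parities.append(sum(group) % 2)
    return parities

# Two adjacent bits (a4 and a3) flipped relative to the all-0s word:
print(interlaced_parity(0b11000))
```

Because adjacent bits always belong to different groups, an error on two neighbouring lines disturbs two separate parity bits instead of cancelling out, as it would with a single parity bit over the whole word.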
Overlapping Parity Code
- Simplest scheme: data is organized in a 2-dimensional array
- Bit at the end of each row: parity over that row
- Bit at the bottom of each column: parity over that column
- Error-correcting code? Yes: a single-bit error anywhere causes exactly one row parity and one column parity to be erroneous, which identifies a unique erroneous bit
- This is an example of overlapping parity: each bit is covered by more than one parity bit
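A minimal sketch of row/column parity with single-error correction. It assumes even parity, a rectangular list of rows, and that the error hits a data bit, not a parity bit; the function names are illustrative.

```python
def encode_2d(rows):
    """Return (row parities, column parities) using even parity."""
    row_par = [sum(r) % 2 for r in rows]
    col_par = [sum(c) % 2 for c in zip(*rows)]
    return row_par, col_par

def correct_single_error(rows, row_par, col_par):
    """Locate and flip a single erroneous data bit in place.

    Returns the (row, column) of the corrected bit, or None if the
    parities do not point at exactly one row and one column.
    """
    new_row, new_col = encode_2d(rows)
    bad_rows = [i for i, (a, b) in enumerate(zip(new_row, row_par)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(new_col, col_par)) if a != b]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        rows[bad_rows[0]][bad_cols[0]] ^= 1
        return bad_rows[0], bad_cols[0]
    return None

data = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
row_par, col_par = encode_2d(data)
received = [row[:] for row in data]
received[1][2] ^= 1                           # inject a single-bit error
print(correct_single_error(received, row_par, col_par), received == data)
```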
Checksum
- Separable code
- The checksum is the sum of the original data words
- All checksum schemes allow error detection but not error location: the entire block of data must be retransmitted if an error is detected
- a) Single-precision checksum: adds n-bit words modulo $2^n$, so overflow is discarded
- b) Double-precision checksum: computes a 2n-bit checksum from n-bit words using modulo $2^{2n}$ arithmetic
- c) Residue checksum: like the single-precision checksum, but the overflow is fed back as a carry
- d) Honeywell checksum: forms double-length words by concatenating 2 consecutive words, then computes the checksum on these double words (modulo $2^{2n}$)
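The four variants can be compared with a small sketch. It assumes the words are non-negative integers below $2^n$ and, for the Honeywell checksum, that the block contains an even number of words; the function names and the sample block are illustrative.

```python
def single_precision(words, n):
    """Sum of n-bit words modulo 2^n (overflow discarded)."""
    return sum(words) % (1 << n)

def double_precision(words, n):
    """Sum of n-bit words modulo 2^(2n)."""
    return sum(words) % (1 << (2 * n))

def residue(words, n):
    """Like single precision, but each overflow is added back as a carry."""
    mask = (1 << n) - 1
    total = 0
    for w in words:
        total += w
        total = (total & mask) + (total >> n)   # end-around carry
    return total

def honeywell(words, n):
    """Concatenate consecutive word pairs into 2n-bit words, sum modulo 2^(2n)."""
    doubles = [(words[i] << n) | words[i + 1] for i in range(0, len(words), 2)]
    return sum(doubles) % (1 << (2 * n))

block = [0b0101, 0b1111, 0b0010, 0b0001]        # four 4-bit words
for scheme in (single_precision, double_precision, residue, honeywell):
    print(scheme.__name__, format(scheme(block, 4), "b"))
```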
Comparing the Checksum Types
Comparison - Example
- In the single-precision checksum, the transmitted checksum matches the computed checksum, so the error goes undetected
- In the Honeywell checksum, the computed checksum differs from the received checksum and the error is detected
Cyclic Codes
- Cyclic codes are often non-separable, although separable cyclic codes exist
- Encoding consists of multiplying (modulo 2) the data word by a constant number; the coded word is the product
- Decoding consists of dividing by the same constant: if the remainder is non-zero, an error has occurred
- Cyclic codes are widely used in data storage and communication
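A minimal sketch of non-separable cyclic encoding, treating integers as polynomials with binary coefficients. The generator 1011 ($X^3 + X + 1$) and the data word 1101 are assumptions chosen for illustration, not values from the slides.

```python
def gf2_multiply(a: int, b: int) -> int:
    """Multiply two binary polynomials (carry-less, modulo-2 arithmetic)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def gf2_divmod(dividend: int, divisor: int):
    """Modulo-2 long division; returns (quotient, remainder)."""
    quotient = 0
    dlen = divisor.bit_length()
    while dividend.bit_length() >= dlen:
        shift = dividend.bit_length() - dlen
        quotient ^= 1 << shift
        dividend ^= divisor << shift
    return quotient, dividend

G = 0b1011                       # assumed generator X^3 + X + 1
data = 0b1101                    # assumed data word
code_word = gf2_multiply(data, G)            # encoding: multiply by G
decoded, remainder = gf2_divmod(code_word, G)  # decoding: divide by G
_, bad_remainder = gf2_divmod(code_word ^ 0b100, G)  # single-bit error
print(format(code_word, "b"), decoded == data, remainder, bad_remainder)
```

The data bits do not appear as a separate field of the code word, which is what makes this form non-separable: recovering them requires the division.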
Cyclic Redundancy Checks (CRC)
- CRC is based on a mathematical calculation performed on the message
- We will use the following terms:
  - M: message to be sent (k bits)
  - F: Frame Check Sequence (FCS) or CRC to be appended to the message (n bits)
  - T: transmitted message, which includes both M and F (k+n bits)
  - G: (n+1)-bit pattern (called the generator polynomial) used to calculate F and check T
Cyclic Redundancy Check (CRC)
- Key idea:
  - Given a k-bit frame (message)
  - The transmitter generates an n-bit sequence, the frame check sequence (FCS)
  - So that the resulting frame of k+n bits is exactly divisible by some predetermined number
- Multiply M by $2^n$ to shift it left by n bits, then add F in place of the padded 0s: $T = 2^n M + F$
- Dividing $2^n M$ by G gives a quotient and a remainder (the remainder is at least 1 bit shorter than the divisor): $2^n M / G = Q + R/G$
- Using R as the FCS gives $T = 2^n M + R$
- On the receiving end, division by G leads to $T/G = (2^n M + R)/G = Q + R/G + R/G = Q$, since adding a number to itself modulo 2 gives 0
- If the remainder is non-zero, an error has occurred
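The construction $T = 2^n M + R$ can be sketched with modulo-2 long division on integers. The message 1010001101 and generator 110101 are assumed example values, not taken from the slides; the function names are illustrative.

```python
def mod2_remainder(value: int, generator: int) -> int:
    """Remainder of modulo-2 (XOR) long division of value by generator."""
    glen = generator.bit_length()
    while value.bit_length() >= glen:
        value ^= generator << (value.bit_length() - glen)
    return value

def crc_encode(message: int, generator: int) -> int:
    """Return T = 2^n * M + R, where n = len(G) - 1 and R is the FCS."""
    n = generator.bit_length() - 1
    shifted = message << n                     # 2^n * M
    return shifted | mod2_remainder(shifted, generator)

M = 0b1010001101                               # assumed 10-bit message
G = 0b110101                                   # assumed 6-bit generator, n = 5
T = crc_encode(M, G)
print(format(T, "b"))                          # message followed by the 5-bit FCS
print(mod2_remainder(T, G))                    # 0: frame accepted
print(mod2_remainder(T ^ (1 << 7), G))         # non-zero: error detected
```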
Cyclic Redundancy Check (CRC)
- Example: assume G(X) has at least 3 terms (at least three 1 bits). Then the CRC:
  - Detects all single-bit errors
  - Detects all double-bit errors
  - Detects any odd number of errors if G(X) contains the factor (X + 1)
  - Detects any burst error shorter than the FCS
  - Detects most larger burst errors
Cyclic Redundancy Check (CRC)
- A polynomial view: a polynomial in the variable X with binary coefficients, where the coefficients correspond to the bits of the number
- M = 110011 gives $M(X) = X^5 + X^4 + X + 1$, and for G we have $G(X) = X^4 + X$
- Arithmetic is still modulo 2
- An error pattern E(X) goes undetected iff it is divisible by G(X)
CRC Example M = , G = 1101; XOR is used instead of subtraction => CRC = 100
Cyclic Redundancy Check (CRC)
- Pre-defined polynomial examples:
  - CRC-12: $X^{12} + X^{11} + X^3 + X^2 + X + 1$
  - CRC-16: $X^{16} + X^{15} + X^2 + 1$
  - CRC-CCITT: $X^{16} + X^{12} + X^5 + 1$
- Why is CRC popular? It is easy to implement: only shifters and XORs are needed
- Hardware implementation: $G(X) = 1 + a_1 X + a_2 X^2 + \ldots + a_{n-1} X^{n-1} + a_n X^n$
Hamming Code (7,4)
- Class of (n,k) Hamming codes, e.g., (7,4) with $r = n - k = 3$
- Let $i_1, i_2, i_3, i_4$ be the information bits
- Let $p_1, p_2, p_4$ be the check bits:
  - $p_1 = i_1 \oplus i_2 \oplus i_4$
  - $p_2 = i_1 \oplus i_3 \oplus i_4$
  - $p_4 = i_2 \oplus i_3 \oplus i_4$
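A minimal sketch of encoding and single-error correction with these check equations. The bit layout $p_1\, p_2\, i_1\, p_4\, i_2\, i_3\, i_4$ (check bits at positions 1, 2 and 4) is an assumption suggested by the check-bit names; the slides do not state it. Function names are illustrative.

```python
def hamming74_encode(i1, i2, i3, i4):
    """Encode 4 information bits; assumed layout: p1 p2 i1 p4 i2 i3 i4."""
    p1 = i1 ^ i2 ^ i4
    p2 = i1 ^ i3 ^ i4
    p4 = i2 ^ i3 ^ i4
    return [p1, p2, i1, p4, i2, i3, i4]

def hamming74_correct(word):
    """Return (corrected word, error position); position 0 means no error."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]     # recheck p1 over positions 1,3,5,7
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]     # recheck p2 over positions 2,3,6,7
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]     # recheck p4 over positions 4,5,6,7
    position = s1 + 2 * s2 + 4 * s4    # syndrome read as a binary number
    if position:
        w[position - 1] ^= 1           # flip the erroneous bit
    return w, position

code_word = hamming74_encode(1, 0, 1, 1)
received = list(code_word)
received[4] ^= 1                       # single-bit error at position 5
print(hamming74_correct(received))
```

With this layout the syndrome directly gives the position of the flipped bit, which is why the check bits are conventionally placed at power-of-two positions.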
Unordered Codes
- Designed to detect all unidirectional errors
- Examples: m-of-n code, Berger code
m-of-n Codes
- All code words are n bits long and contain exactly m 1s
- Simple implementation
- Can detect all single errors
- Can detect all unidirectional multiple errors
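A minimal sketch of an m-of-n code, using the 2-of-5 code as an assumed example. Validity is checked simply by counting 1s; the function names are illustrative.

```python
from itertools import combinations

def m_of_n_code_words(m: int, n: int):
    """All n-bit strings that contain exactly m 1s."""
    words = []
    for ones in combinations(range(n), m):
        words.append("".join("1" if i in ones else "0" for i in range(n)))
    return words

def is_valid(word: str, m: int) -> bool:
    """A received word is accepted only if it has exactly m 1s."""
    return word.count("1") == m

code = m_of_n_code_words(2, 5)
print(len(code), is_valid("01100", 2), is_valid("00100", 2), is_valid("01111", 2))
```

A unidirectional error only removes 1s or only adds 1s, so the count of 1s always moves away from m and the check fails.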
Berger Code
- Separable code
- Encoding: count the number of 1s in the data word, express the count in binary, complement it, and append the result to the data
- Example: a data word with four 1s gives 100 in binary and 011 after complementing; the encoded word is the data word followed by 011
- Can detect all single errors
- Can detect all unidirectional errors: one or more 1s turn to 0s and no 0s turn to 1s (or vice versa)
- Overhead = $r/(k+r)$: k data bits contain at most k 1s, so $r = \lceil \log_2 (k+1) \rceil$ redundant bits are needed
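A minimal sketch of Berger encoding and checking on bit strings. The 7-bit data word 1101010 is a hypothetical example with four 1s, not the word from the original slide; the function names are illustrative.

```python
def berger_encode(data: str) -> str:
    """Append the complemented binary count of 1s to the data word."""
    k = len(data)
    r = k.bit_length()                  # equals ceil(log2(k + 1)) for k >= 1
    ones = data.count("1")
    check = ~ones & ((1 << r) - 1)      # bitwise complement within r bits
    return data + format(check, "0{}b".format(r))

def berger_check(word: str, k: int) -> bool:
    """Recompute the check bits from the first k bits and compare."""
    return berger_encode(word[:k]) == word

encoded = berger_encode("1101010")      # four 1s -> 100 -> 011
damaged = "0100010" + encoded[7:]       # two 1s turned to 0s (unidirectional)
print(encoded, berger_check(encoded, 7), berger_check(damaged, 7))
```

Losing 1s in the data lowers the count, which would require a larger complemented check value, while losing 1s in the check field can only make it smaller, so a unidirectional error cannot make the two fields agree again.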
Other Coding Schemes
- Many error detecting/correcting codes exist
  - E.g., arithmetic codes, Reed-Solomon codes, residue codes, bi-residue codes, etc.
- Many of them require more mathematics than belongs in this course
- Reasons for other types of codes:
  - Burst errors
  - Byte errors
  - Cost/performance
  - Multiple-bit errors
  - Ease of hardware implementation
Error Recovery
- Probably the most important phase of any fault-tolerance technique
- Two approaches:
  - Forward
  - Backward
Forward Error Recovery
- Forward error recovery continues from an erroneous state by making selective corrections to the system state
- This includes making safe the controlled environment, which may have been damaged by the failure
- It is system specific and depends on accurate predictions of the location and cause of errors
- Examples: redundant pointers in data structures and the use of self-correcting codes such as Hamming codes
Backward Error Recovery (BER)
- If an error is detected, recover backwards and re-execute
  - Recover to a previous state of the system that we know is error-free
  - Assumes that the error will be gone by the time of re-execution
- Some terminology:
  - Recovery point: the point to which we recover in case of error
  - Checkpointing: periodically saving the state of the system
  - Logging: saving changes made to the system state
- Many commercial machines use BER
  - Sequoia, Synapse N+1, Tandem NonStop
- BER also includes all-software schemes
  - Nightly backups of file systems
- May sacrifice performance to achieve availability
  - Where might we lose performance?
  - May not be suitable for real-time systems
- Disadvantage: it cannot undo errors in the environment!
The Domino Effect
- With concurrent processes that interact with each other, BER is more complex
- Consider:
  [Figure: processes P1 and P2 along an execution-time axis, with recovery points R11, R12, R13 and R21, R22, and inter-process communications IPC1 to IPC4]
- If the error is detected in P1, roll back to R13
- If the error is detected in P2?
6 BER Issues
1- What state needs to be saved?
2- How do we save this state?
3- Where do we save it?
4- How often do we save it?
5- How do we recover the system to this state?
6- How do we resume execution after recovery?
1- What State Needs to Be Saved
- Need to save all state that would be necessary if this were to become the recovery point
- In general, we only need to save the user-visible state
- For example, in microprocessors:
  - Must save the architectural state
  - Don't have to worry about the micro-architectural state
2- How to Save State
- Two approaches to BER:
  - Checkpointing: periodically stop the system and save its state
  - Logging: log all changes to the state
- Checkpointing
  - Only suffers overhead at periodic checkpoints
  - Can only recover at coarse granularity
  - Size of the checkpoint is often fixed
- Logging
  - Finer granularity of rollback
  - Suffers overhead for logging many common operations
  - Amount of state logged is variable
3- Where to Save State
- Have to save state where it is "reliable"
  - A fault in the recovery point state could make recovery impossible
- In the processor (can't survive loss of the processor chip)
  - Processor saves registers to shadow registers
- In the cache (same as processor, if the cache is on-chip)
  - Processor copies registers into the cache
- In memory (memory can be made very reliable)
  - Processor copies registers into memory
  - Write-through cache copies data into memory
- On disk (maybe the safest, but slow)
  - E.g., databases log updates to disks
- On tape (too slow except for rare backups)
4- When to Save State
- Checkpointing
  - Can choose the checkpoint interval
- Logging
  - Continuously saving state (every time it changes)
- For checkpointing, a larger checkpoint interval means:
  - Less overhead due to checkpointing (since it is less frequent)
  - Coarser checkpoint granularity (can't recover to an arbitrary point)
5- How to Recover State
- Checkpointing: copy the pre-fault recovery point checkpoint into the architectural state
- Logging: unroll the log to undo changes made since the recovery point
- The tradeoff between these two depends on the system
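The two recovery styles can be contrasted with a toy sketch in which the system state is a Python dictionary. The class names `CheckpointedState` and `LoggedState` and the register names are hypothetical; a real system would save architectural state in hardware or stable storage.

```python
import copy

class CheckpointedState:
    """Saves a full copy of the state at each checkpoint."""
    def __init__(self, state):
        self.state = state
        self.checkpoint = copy.deepcopy(state)
    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
    def write(self, key, value):
        self.state[key] = value              # no per-write overhead
    def recover(self):
        self.state = copy.deepcopy(self.checkpoint)

class LoggedState:
    """Records the old value before every change (an undo log)."""
    _MISSING = object()
    def __init__(self, state):
        self.state = state
        self.log = []
    def set_recovery_point(self):
        self.log.clear()
    def write(self, key, value):
        self.log.append((key, self.state.get(key, self._MISSING)))
        self.state[key] = value
    def recover(self):
        while self.log:                      # unroll newest change first
            key, old = self.log.pop()
            if old is self._MISSING:
                del self.state[key]
            else:
                self.state[key] = old

for system in (CheckpointedState({"pc": 0, "r1": 5}), LoggedState({"pc": 0, "r1": 5})):
    system.write("pc", 4)
    system.write("r2", 9)
    system.recover()
    print(type(system).__name__, system.state)
```

The sketch shows the tradeoff from the slides: the checkpoint copies a fixed amount of state regardless of activity, while the log grows with the number of writes but only touches what changed.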
6- How to Resume Execution
- Simply resuming execution after recovery may not be possible
  - E.g., recovery due to a hard fault in an interconnection switch
- May need to reconfigure before resuming, to ensure forward progress
  - E.g., reconfiguring the routing in the interconnect to avoid the dead switch
Implementing EDC/ECC in Hardware
- Where does EDC/ECC get used?
  - Disk, CD-ROM
  - Memory (DRAM, SRAM)
  - Buses
- Tradeoff between EDC and ECC:
  - ECC: forward error recovery
    - Often on the critical path, so it can slow down even a fault-free system
    - Used in one-way transmission
  - EDC: backward error recovery
    - Detecting an error requires recovery (can be slow)