4. Information Redundancy Reliable System Design 2010 by: Amir M. Rahmani.


Information Redundancy
- Code: a means of representing information (e.g., Morse code)
- Code word: a collection of symbols or digits used to represent information according to the rules of a given code
- Binary code: a code whose code words contain only the symbols 0 and 1
- Goals: error detection and error correction
- Coding is often applied to:
  - Information transfer: often serial communication through a channel
  - Information storage

- Start with a k-bit data word
- Add r code bits to the k-bit data
- Total = n-bit code word (n = k + r)
- Not all 2^n combinations are valid code words
- For certain encoding schemes, some types of errors can also be corrected
- To extract the original data, the n bits must be decoded
- Overhead = r/n (e.g., for single-bit parity the overhead is 1/n)
  - additional bits required
  - time to encode and decode

Hamming distance (d)
- Number of bits in which two words differ: d(x,y) = Σ_k (x_k XOR y_k)
- E.g., 0010 and 1110 have a Hamming distance of 2
- Rules:
  - d(x,y) = 0 iff x = y
  - d(x,y) = d(y,x)
  - d(x,y) + d(y,z) >= d(x,z)
- For a set of code words, d is the minimum Hamming distance over all pairs of distinct code words
- E.g., {000, 011, 101, 110} has a Hamming distance of 2
- d determines the code's ability to detect and/or correct errors:
  - up to d-1 bit errors can be detected
  - up to floor((d-1)/2) bit errors can be corrected
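The definitions on this slide can be checked with a short sketch. The function names (`hamming_distance`, `code_distance`, `capabilities`) are illustrative and not part of the lecture; words are assumed to be equal-length bit strings.

```python
from itertools import combinations

def hamming_distance(x: str, y: str) -> int:
    """Number of bit positions in which two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("words must have equal length")
    return sum(a != b for a, b in zip(x, y))

def code_distance(code_words) -> int:
    """Minimum Hamming distance over all pairs of distinct code words."""
    return min(hamming_distance(x, y) for x, y in combinations(code_words, 2))

def capabilities(d: int) -> dict:
    """Errors detectable (d-1) and correctable (floor((d-1)/2)) for distance d."""
    return {"detect": d - 1, "correct": (d - 1) // 2}

if __name__ == "__main__":
    print(hamming_distance("0010", "1110"))
    print(code_distance(["000", "011", "101", "110"]))
    print(capabilities(code_distance(["000", "111"])))
```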

Hamming distance (d)
- In the figure, two words are connected by an edge if their Hamming distance is 1
- d = 2: can detect single-bit errors

Hamming distance (d)
- The code {000, 111} can be used to encode a single data bit: 0 is encoded as 000 and 1 as 111
- This code is identical to TMR
- d = 3: can detect single- and double-bit errors, or correct single-bit errors
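A minimal sketch of the {000, 111} code, assuming decoding is done by majority vote as in TMR. The helper names are illustrative only.

```python
def encode_bit(bit: int) -> str:
    """Repeat the data bit three times: 0 -> 000, 1 -> 111."""
    return str(bit) * 3

def decode_majority(word: str) -> int:
    """Majority vote over the three received bits (corrects one flipped bit)."""
    return 1 if word.count("1") >= 2 else 0

def is_valid(word: str) -> bool:
    """Detection only: any word other than 000 or 111 signals an error."""
    return word in ("000", "111")
```

Used for correction, a double-bit error (e.g., 111 received as 001) is miscorrected to 0; used for detection only, the same error is flagged because 001 is not a code word.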

Separability of a Code
- A code is separable if it has separate fields for the data and the code bits
  - Decoding consists of disregarding the code bits
  - The code bits can be processed separately to verify the correctness of the data
- A non-separable code has the data and code bits integrated together; extracting the data from the encoded word requires some processing

Single-bit Parity
- Simplest separable error detection code: adds one bit of redundancy to each data word
- Encoding and decoding cost is low
- Even (odd) parity: add a bit such that the total number of ones in the code word is even (odd)
  - E.g., a data word with an even number of ones gets a parity bit of 0 for even parity (1 for odd)
- Can detect all single-bit errors (and all errors affecting an odd number of bits)
  - Hamming distance >= 2
  - Could be greater than 2 if the data words don't use all bit combinations
- Drawback: cannot detect errors affecting an even number of bits, which are common
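The parity rule can be sketched as follows. The data word `1010` is a hypothetical example, not the one from the original slide, and the function names are illustrative.

```python
def parity_bit(data: str, even: bool = True) -> str:
    """Return the bit that makes the total number of ones even (or odd)."""
    bit = data.count("1") % 2      # 1 if the data has an odd number of ones
    if not even:
        bit ^= 1                   # odd parity is the complement
    return str(bit)

def parity_ok(word: str, even: bool = True) -> bool:
    """Check a received code word (data bits followed by the parity bit)."""
    return word.count("1") % 2 == (0 if even else 1)

if __name__ == "__main__":
    word = "1010" + parity_bit("1010")   # even parity
    print(word, parity_ok(word))
    print(parity_ok("0010" + word[4]))   # one bit flipped: detected
    print(parity_ok("0110" + word[4]))   # two bits flipped: not detected
```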

Single-bit Even Parity

Even or Odd Parity?
- The decision depends on which type of all-bits failure (all-0's or all-1's) is more probable
- For even parity, the parity bit for the all-zeroes data word is 0, so an all-0's failure goes undetected because it is a valid code word
- Selecting the odd parity code allows detection of the all-0's failure
- If the all-1's failure is more likely, select odd parity if the total number of bits (n+1) is even, and even parity if n+1 is odd

Byte-Interlaced Parity Code
- Example: n = 64, data bits a_63, a_62, …, a_0
- Eight parity bits:
  - First: parity bit of a_63, a_55, a_47, a_39, a_31, a_23, a_15, a_7 (the most significant bits of the eight bytes)
  - Remaining seven parity bits: assigned so that the corresponding groups of bits are interlaced
- The scheme is beneficial when shorting of adjacent bits is a common failure mode (example: a bus)
- If the parity type (odd or even) is alternated between groups, unidirectional errors (all-0's or all-1's) will also be detected
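A sketch of the interlaced grouping for the 64-bit example, assuming even parity in every group and that a_63 is the most significant bit of the most significant byte. With those assumptions, XOR-ing the eight bytes yields all eight parity bits at once, because bit j of the result is the parity of bit j of every byte. The function name is illustrative.

```python
def interlaced_parity(word: int) -> int:
    """Return 8 even-parity bits for a 64-bit word.

    Bit j of the result covers bit j of each of the eight bytes,
    e.g. bit 7 covers a_63, a_55, ..., a_7.
    """
    parity = 0
    for i in range(8):
        parity ^= (word >> (8 * i)) & 0xFF
    return parity

if __name__ == "__main__":
    word = 0x0123456789ABCDEF
    stored = interlaced_parity(word)
    corrupted = word ^ 0b11            # two adjacent bits (a_1, a_0) flipped
    # Adjacent bits belong to different parity groups, so both groups flag it.
    print(interlaced_parity(corrupted) != stored)
```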

Overlapping Parity Code
- Simplest scheme: data is organized in a 2-dimensional array
  - Bits at the end of each row: parity over that row
  - Bits at the bottom of each column: parity over that column
- Error correcting code? A single-bit error anywhere causes one row and one column to be erroneous, which identifies a unique erroneous bit
- This is an example of overlapping parity: each bit is covered by more than one parity bit
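The row/column intersection argument can be sketched as below, assuming even parity for both rows and columns. The array contents and function names are hypothetical.

```python
def encode(rows):
    """Append an even-parity bit to each row, then an even-parity row."""
    with_row_parity = [r + [sum(r) % 2] for r in rows]
    column_parity = [sum(col) % 2 for col in zip(*with_row_parity)]
    return with_row_parity + [column_parity]

def correct_single_error(block):
    """Locate and flip a single erroneous bit in place.

    Returns the (row, column) that was corrected, or None if no parity
    check fails. Raises ValueError if the pattern is not a single-bit error.
    """
    bad_rows = [i for i, row in enumerate(block) if sum(row) % 2]
    bad_cols = [j for j, col in enumerate(zip(*block)) if sum(col) % 2]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        block[i][j] ^= 1
        return (i, j)
    raise ValueError("error pattern is not a single-bit error")

if __name__ == "__main__":
    block = encode([[1, 0, 1], [0, 1, 1]])
    block[0][1] ^= 1                      # inject a single-bit error
    print(correct_single_error(block))    # position of the repaired bit
```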

Checksum
- Separable code
- The checksum is the sum of the original data
- All checksum schemes allow error detection but not error location: the entire block of data must be retransmitted if an error is detected
- a) Single-precision checksum: n-bit words are added modulo 2^n, so overflow is lost
- b) Double-precision checksum: uses double precision, i.e., computes a 2n-bit checksum from n-bit words using modulo 2^(2n) arithmetic
- c) Residue checksum: like the single-precision checksum, but overflow is fed back as a carry
- d) Honeywell checksum: composes words of double length by concatenating 2 consecutive words, then computes the checksum on these double words (modulo 2^(2n))
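The four variants can be compared with a small sketch. Words are assumed to be non-negative integers that fit in n bits; padding an odd-length block with a zero word in the Honeywell variant is an assumption of this sketch, not something stated on the slide.

```python
def single_precision(words, n):
    """Sum of n-bit words modulo 2^n (overflow is discarded)."""
    return sum(words) % (1 << n)

def double_precision(words, n):
    """Sum of n-bit words kept in 2n bits (modulo 2^(2n))."""
    return sum(words) % (1 << (2 * n))

def residue(words, n):
    """Like single precision, but overflow is added back as a carry."""
    mask = (1 << n) - 1
    total = 0
    for w in words:
        total += w
        total = (total & mask) + (total >> n)   # end-around carry
    return total

def honeywell(words, n):
    """Concatenate consecutive word pairs into 2n-bit words, sum modulo 2^(2n)."""
    words = list(words)
    if len(words) % 2:
        words.append(0)                         # assumption: pad with a zero word
    doubles = [(words[i] << n) | words[i + 1] for i in range(0, len(words), 2)]
    return sum(doubles) % (1 << (2 * n))

if __name__ == "__main__":
    data = [0b1000, 0b1001, 0b0001]             # hypothetical 4-bit words
    for scheme in (single_precision, double_precision, residue, honeywell):
        print(scheme.__name__, scheme(data, 4))
```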

Comparing the Checksum Types

Comparison - Example
- In the single-precision checksum, the transmitted checksum differs from the computed checksum
- In the Honeywell checksum, the computed checksum differs from the received checksum and the error is detected

Cyclic Codes
- Cyclic codes are often non-separable, although separable cyclic codes exist
- Encoding consists of multiplying (modulo 2) the data word by a constant number; the coded word is the product
- Decoding consists of dividing by the same constant: if the remainder is non-zero, an error has occurred
- Cyclic codes are widely used in data storage and communication

Cyclic Redundancy Checks (CRC)
- A CRC is based on a mathematical calculation performed on the message. We will use the following terms:
  - M: message to be sent (k bits)
  - F: Frame Check Sequence (FCS), or CRC, to be appended to the message (n bits)
  - T: transmitted message, which includes both M and F (k+n bits)
  - G: an (n+1)-bit pattern (called the generator polynomial) used to calculate F and to check T

Cyclic Redundancy Check (CRC)
- Key idea: given a k-bit frame (message), the transmitter generates an n-bit sequence called the frame check sequence (FCS) so that the resulting frame of size k+n is exactly divisible by some predetermined number
- Multiply M by 2^n to shift it left by n bits, and add F in place of the padded 0s: T = 2^n M + F
- Dividing 2^n M by G (modulo-2 arithmetic) gives a quotient and a remainder (the remainder is at least 1 bit shorter than the divisor): 2^n M / G = Q + R/G
- Using R as the FCS (F = R) gives T = 2^n M + R
- On the receiving end, division by G gives T/G = (2^n M + R)/G = Q + R/G + R/G = Q, since R + R = 0 in modulo-2 arithmetic
- If the remainder is non-zero, an error has occurred
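A sketch of the computation T = 2^n M + R using integers as bit patterns and XOR as modulo-2 subtraction. The function names are illustrative, and the message and generator in the demo (`1010001101`, `110101`) are a hypothetical pair chosen for this sketch, not the values from the lecture.

```python
def mod2_remainder(value: int, g: int) -> int:
    """Remainder of modulo-2 (XOR) long division of value by g."""
    g_len = g.bit_length()
    while value.bit_length() >= g_len:
        # Align the divisor under the leading 1 of the dividend and subtract.
        value ^= g << (value.bit_length() - g_len)
    return value

def crc_encode(m: int, g: int) -> int:
    """Return T = 2^n * M + R, where n = degree of G and R = (2^n * M) mod G."""
    n = g.bit_length() - 1
    shifted = m << n
    return shifted | mod2_remainder(shifted, g)

def crc_check(t: int, g: int) -> bool:
    """A received frame is accepted if it is exactly divisible by G."""
    return mod2_remainder(t, g) == 0

if __name__ == "__main__":
    t = crc_encode(0b1010001101, 0b110101)
    print(format(t, "b"), crc_check(t, 0b110101))
    print(crc_check(t ^ (1 << 7), 0b110101))   # single-bit error is rejected
```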

Cyclic Redundancy Check (CRC)
- Example: assume G(X) has at least 3 terms, i.e., G has at least three 1-bits. Then the CRC:
  - detects all single-bit errors
  - detects all double-bit errors
  - detects any odd number of errors if G(X) contains the factor (X + 1)
  - detects any burst error shorter than the length of the FCS
  - detects most larger burst errors

Cyclic Redundancy Check (CRC)
- A polynomial view: use a variable X with binary coefficients, where the coefficients correspond to the bits in the number
- M = 110011 gives M(X) = X^5 + X^4 + X + 1, and for G = … we have G(X) = X^4 + X …
- The math is still modulo 2
- An error pattern E(X) goes undetected iff it is divisible by G(X)

CRC Example
- M = …, G = 1101; the long division uses XOR instead of subtraction
- Result: CRC = 100

Cyclic Redundancy Check (CRC)
- Pre-defined polynomial examples:
  - CRC-12: X^12 + X^11 + X^3 + X^2 + X + 1
  - CRC-16: X^16 + X^15 + X^2 + 1
  - CRC-CCITT: X^16 + X^12 + X^5 + 1
- Why is CRC popular? It is easy to implement: it needs only shifters and XORs
- Hardware implementation: G(X) = 1 + a_1 X + a_2 X^2 + … + a_(n-1) X^(n-1) + a_n X^n

Hamming Code (7,4)
- Class of (n,k) Hamming codes, e.g., (7,4) [r = n-k = 3]
- Let i_1, i_2, i_3, i_4 be the information bits
- Let p_1, p_2, p_4 be the check bits:
  - p_1 = i_1 XOR i_2 XOR i_4
  - p_2 = i_1 XOR i_3 XOR i_4
  - p_4 = i_2 XOR i_3 XOR i_4
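A sketch of encoding and single-error correction with these check equations. The code word layout (p_1, p_2, i_1, p_4, i_2, i_3, i_4 in positions 1 to 7) is the conventional one and is an assumption here, since the slide does not state bit positions; with that layout the syndrome value equals the position of the erroneous bit.

```python
def encode(i1, i2, i3, i4):
    """Return the 7-bit code word [p1, p2, i1, p4, i2, i3, i4]."""
    p1 = i1 ^ i2 ^ i4
    p2 = i1 ^ i3 ^ i4
    p4 = i2 ^ i3 ^ i4
    return [p1, p2, i1, p4, i2, i3, i4]

def correct(word):
    """Return (corrected_word, error_position); position 0 means no error."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]   # recheck p1 over positions 1, 3, 5, 7
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]   # recheck p2 over positions 2, 3, 6, 7
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]   # recheck p4 over positions 4, 5, 6, 7
    position = s1 + 2 * s2 + 4 * s4
    if position:
        w[position - 1] ^= 1         # flip the bit the syndrome points at
    return w, position

def data_bits(word):
    """Extract (i1, i2, i3, i4) from a code word."""
    return (word[2], word[4], word[5], word[6])

if __name__ == "__main__":
    sent = encode(1, 0, 1, 1)
    received = list(sent)
    received[4] ^= 1                 # corrupt position 5
    fixed, position = correct(received)
    print(position, data_bits(fixed))
```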

Unordered codes
- Purpose: to detect all unidirectional errors
- Examples: m-of-n code, Berger code

m-of-n codes
- All code words are n bits in length and contain exactly m 1's
- Simple implementation
- Can detect all single errors
- Can detect all unidirectional multiple errors
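A short sketch of the m-of-n check, with a hypothetical 3-of-5 code word. Any unidirectional error changes the number of 1's and is therefore caught, while a pair of opposite flips can go unnoticed.

```python
from itertools import combinations

def is_m_of_n(word: str, m: int) -> bool:
    """A word is valid iff it contains exactly m ones."""
    return word.count("1") == m

def all_code_words(m: int, n: int):
    """Enumerate every n-bit word with exactly m ones."""
    words = []
    for ones in combinations(range(n), m):
        words.append("".join("1" if i in ones else "0" for i in range(n)))
    return words

if __name__ == "__main__":
    print(len(all_code_words(2, 5)))
    print(is_m_of_n("00111", 3))   # valid code word
    print(is_m_of_n("00001", 3))   # two 1 -> 0 errors: detected
    print(is_m_of_n("01011", 3))   # one 1 -> 0 and one 0 -> 1: not detected
```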

Berger Code
- Separable code
- Encoding: count the number of 1s in the data word, express the count in binary, complement it, and append this quantity to the data
- Example: encoding a data word containing four 1s gives 100 in binary, 011 after complementing, and 011 is appended to form the encoded word
- Can detect all single errors
- Can detect all unidirectional bit errors: one or more 1s turn to 0s and no 0s turn to 1s (or vice versa)
- Overhead = r/(k+r)
- k data bits contain at most k 1s, so r = ceil(log2(k+1)) redundant bits are needed
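The encoding steps can be sketched as follows. The data word `11101` is a hypothetical example chosen to contain four 1s; the function names are illustrative.

```python
def berger_encode(data: str) -> str:
    """Append the complemented binary count of ones to the data word."""
    k = len(data)
    r = k.bit_length()                 # equals ceil(log2(k + 1)) for k >= 1
    ones = data.count("1")
    check = ~ones & ((1 << r) - 1)     # bitwise complement within r bits
    return data + format(check, f"0{r}b")

def berger_check(word: str, k: int) -> bool:
    """Recompute the check bits from the first k bits and compare."""
    return berger_encode(word[:k]) == word

if __name__ == "__main__":
    word = berger_encode("11101")
    print(word, berger_check(word, 5))
    print(berger_check("10100" + word[5:], 5))   # two 1 -> 0 errors: detected
```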

Other Coding Schemes
- Many error detecting/correcting codes exist, e.g., arithmetic codes, Reed-Solomon codes, residue codes, bi-residue codes
- Many of them require more mathematics than belongs in this course
- Reasons for other types of codes:
  - Burst errors
  - Byte errors
  - Cost/performance
  - Multiple-bit errors
  - Ease of hardware implementation

Error Recovery
- Probably the most important phase of any fault-tolerance technique
- Two approaches: forward and backward

Forward Error Recovery
- Forward error recovery continues from an erroneous state by making selective corrections to the system state
- This includes making safe the controlled environment, which may have been damaged by the failure
- It is system specific and depends on accurate predictions of the location and cause of errors
- Examples: redundant pointers in data structures and the use of self-correcting codes such as Hamming codes

Backward Error Recovery (BER)
- If an error is detected, recover backwards and re-execute
  - Recover to a previous state of the system that we know is error-free
  - Assumes that the error will be gone by the time of re-execution
- Some terminology:
  - Recovery point: the point to which we recover in case of error
  - Checkpointing: periodically saving the state of the system
  - Logging: saving the changes made to the system state
- Many commercial machines use BER: Sequoia, Synapse N+1, Tandem NonStop
- BER also includes all-software schemes, e.g., nightly backups of file systems
- May sacrifice performance to achieve availability
  - Where might we lose performance?
  - May not be suitable for real-time systems
- Disadvantage: it cannot undo errors in the environment!

The Domino Effect
- With concurrent processes that interact with each other, BER is more complex
- Consider: [Figure: execution timelines of processes P1 and P2, with recovery points R11, R12, R13 and R21, R22 and inter-process communications IPC1 to IPC4]
- If the error is detected in P1, roll back to R13
- If the error is detected in P2?

6 BER Issues
1- What state needs to be saved?
2- How do we save this state?
3- Where do we save it?
4- How often do we save it?
5- How do we recover the system to this state?
6- How do we resume execution after recovery?

1- What State Needs to Be Saved
- Need to save all state that would be necessary if this were to become the recovery point
- In general, we only need to save the user-visible state
- For example, in microprocessors:
  - Must save the architectural state
  - Don't have to worry about the micro-architectural state

2- How to Save State
- Two approaches in BER:
  - Checkpointing: periodically stop the system and save its state
  - Logging: log all changes to the state
- Checkpointing:
  - Only suffers overhead at the periodic checkpoints
  - Can only recover at coarse granularity
  - Size of a checkpoint is often fixed
- Logging:
  - Finer granularity of rollback
  - Suffers overhead for logging many common operations
  - Amount of state logged is variable

3- Where to Save State
- State has to be saved where it is "reliable": a fault in the recovery point state could make recovery impossible
- In the processor (cannot survive loss of the processor chip): processor saves registers to shadow registers
- In cache (same as processor, if the cache is on-chip): processor copies registers into cache
- In memory (memory can be made very reliable):
  - Processor copies registers into memory
  - Write-through cache copies data into memory
- On disk (maybe the safest, but slow): e.g., databases log updates to disks
- On tape (too slow except for rare backups)

4- When to Save State
- Checkpointing: can choose the checkpoint interval
- Logging: continuously saving state (every time it changes)
- For checkpointing, a larger checkpoint interval means:
  - Less overhead due to checkpointing (since it is less frequent)
  - Coarser checkpoint granularity (cannot recover to an arbitrary point)

5- How to Recover State
- Checkpointing: copy the pre-fault recovery point checkpoint into the architectural state
- Logging: unroll the log to undo changes made since the recovery point
- The tradeoff between these two depends on the system
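The two recovery styles can be contrasted with a toy software sketch. This is only an illustration under simplifying assumptions (the system state is a Python dictionary, and the class names are hypothetical); real systems apply the same idea to architectural state in hardware.

```python
import copy

_MISSING = object()   # marks a key that did not exist before a logged write

class CheckpointedState:
    """Recover by copying the last checkpoint back into the live state."""
    def __init__(self, state):
        self.state = dict(state)
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)

    def recover(self):
        self.state = copy.deepcopy(self._checkpoint)

class LoggedState:
    """Recover by unrolling a log of old values, newest entry first."""
    def __init__(self, state):
        self.state = dict(state)
        self._log = []

    def mark_recovery_point(self):
        self._log.clear()

    def write(self, key, value):
        # Log the old value before every change to the state.
        self._log.append((key, self.state.get(key, _MISSING)))
        self.state[key] = value

    def recover(self):
        while self._log:
            key, old = self._log.pop()
            if old is _MISSING:
                del self.state[key]
            else:
                self.state[key] = old
```

The checkpoint copy has a fixed cost per checkpoint regardless of how much changed, while the log grows with the number of writes since the recovery point.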

6- How to Resume Execution
- Simply resuming execution after recovery may not be possible, e.g., recovery due to a hard fault in an interconnection switch
- May need to reconfigure before resuming to ensure forward progress, e.g., reconfiguring the routing in the interconnect to avoid the dead switch

Implementing EDC/ECC in Hardware
- Where does EDC/ECC get used?
  - Disk, CD-ROM
  - Memory (DRAM, SRAM)
  - Buses
- Tradeoff between EDC and ECC:
  - ECC: forward error recovery
    - Often on the critical path, so it can slow down even a fault-free system
    - Used in one-way transmission
  - EDC: backward error recovery
    - Detecting an error requires recovery (can be slow)