Error Correcting Memory EECS 373 Jon Beaumont Ben Mason
What is ECC? An Error Correcting Code is a mechanism that lets a system detect and correct corrupted data, so that the data it stores or transmits stays reliable.
Why ECC? ECC protects against both soft errors and transmission errors. This is particularly necessary in systems that must run continuously with very low tolerance for error.
What happens after a Soft Error? Incorrect values appear in the instruction or data streams. Best case: an illegal instruction or illegal memory address is hit, triggering an automatic reboot. Worst case: the error goes undetected and multiplies as the corrupted data is used to calculate new data. http://www.eetimes.com/design/programmable-logic/4390101/Enabling-error-resilience-throughout-the-embedded-system
ECC vs No-ECC
ECC Considerations: What range of errors must be handled? How much overhead is acceptable? Is detection enough, or is correction required?
Different Methods of Memory Correction. Detection only: parity bit. Detection and correction: triple redundancy, Hamming code.
Parity Bit (even parity) For every chunk of data, add a single parity bit, set so that the total number of binary 1's (data plus parity) is even. An odd number of binary 1's on read means an error has occurred.
Parity Bit (even parity) Raw Data: 1001011 (4 1’s) 0101111 (5 1’s) Prepend a parity bit 01001011 10101111
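The two parity examples above can be reproduced with a short sketch. The function names (`even_parity_bit`, `add_even_parity`, `parity_ok`) are illustrative and not from the slides; data is modeled as a string of '0'/'1' characters for readability rather than as hardware would store it.

```python
def even_parity_bit(bits):
    """Return '1' if the data has an odd number of 1s, else '0'."""
    return str(bits.count("1") % 2)

def add_even_parity(bits):
    """Prepend the parity bit so the whole word has an even number of 1s."""
    return even_parity_bit(bits) + bits

def parity_ok(word):
    """A received word passes the check if it holds an even number of 1s."""
    return word.count("1") % 2 == 0

print(add_even_parity("1001011"))  # 01001011
print(add_even_parity("0101111"))  # 10101111
print(parity_ok("01001010"))       # False: one bit flipped, detected
print(parity_ok("01001000"))       # True: two bits flipped, missed
```

The last line shows the main limitation: an even number of flipped bits leaves the parity unchanged, so the check passes.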
Parity Bit Cons: Can detect only an odd number of bit errors. No way to tell which bit is in error, so the data can only be discarded. Pros: Simple to implement (XOR). Low overhead. Good for applications in which the original data can be easily resent/recalculated (e.g. SCSI, PCI, UART).
Triple Redundancy Data is calculated and stored 3 times. Majority wins. Pros: Simple to execute. Can correct errors (potentially multiple bits, as long as the same bit is not corrupted in two copies). Cons: Very inefficient (1:2 data-to-overhead ratio, i.e. two redundant bits for every data bit).
Hamming Code Objective: A concise method of finding the precise location of an error, so that it can be corrected without drastic action. Intuition: include multiple parity bits, so that each data bit is uniquely identified by the set of parity bits that cover it.
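A minimal sketch of "majority wins", assuming the three copies are compared bit by bit. The names `tmr_encode` and `tmr_decode` are made up for this illustration.

```python
def tmr_encode(bits):
    """Store the same data three times."""
    return [bits, bits, bits]

def tmr_decode(copies):
    """Bitwise majority vote across the three stored copies."""
    return "".join(
        "1" if (a, b, c).count("1") >= 2 else "0"
        for a, b, c in zip(*copies)
    )

stored = tmr_encode("1101")
stored[1] = "1001"   # second copy: bit flipped in one position
stored[2] = "1111"   # third copy: bit flipped in a different position
print(tmr_decode(stored))  # 1101
```

Two errors are corrected here because they hit different bit positions. If the same position were corrupted in two copies, the vote would return the wrong value.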
Hamming Code Algorithm: Number each bit position in the chunk in binary, starting at 1. Positions that are a power of 2 (i.e. have exactly one 1 bit) hold parity bits; the remaining positions hold data bits.
Hamming Code Algorithm: Each parity bit covers all data bits whose binary position has a 1 in the same place as the parity bit's own position. P1 group: [D7, D5, D3, P1]
Hamming Code Algorithm: Each parity bit covers all data bits whose binary position has a 1 in the same place as the parity bit's own position. P2 group: [D7, D6, D3, P2]
Hamming Code Algorithm: Each parity bit covers all data bits whose binary position has a 1 in the same place as the parity bit's own position. P4 group: [D7, D6, D5, P4]
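The coverage rule amounts to a bitwise AND between position numbers. The sketch below, with the hypothetical helper `covered_positions`, lists the data positions each parity bit covers in a 7-bit codeword.

```python
def covered_positions(parity_pos, length=7):
    """Data positions whose binary index shares the parity position's 1 bit."""
    return [
        pos for pos in range(1, length + 1)
        if pos & parity_pos and pos != parity_pos
    ]

for p in (1, 2, 4):
    print(f"P{p} covers", covered_positions(p))
# P1 covers [3, 5, 7]
# P2 covers [3, 6, 7]
# P4 covers [5, 6, 7]
```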
Hamming Code Example: Encoding the nibble b1101 (D7 D6 D5 D3) using even parity: Allocate space for parity bits: b110_1__
Hamming Code Example: Encoding the nibble b1101 using even parity: P1 covers [D7,D5,D3] b110_1_?
Hamming Code Example: Encoding the nibble b1101 using even parity: P1 covers [D7,D5,D3] = 1,0,1 (two 1's, so P1 = 0) b110_1_0
Hamming Code Example: Encoding the nibble b1101 using even parity: P2 covers [D7,D6,D3] b110_1?0
Hamming Code Example: Encoding the nibble b1101 using even parity: P2 covers [D7,D6,D3] = 1,1,1 (three 1's, so P2 = 1) b110_110
Hamming Code Example: Encoding the nibble b1101 using even parity: P4 covers [D7,D6,D5] b110?110
Hamming Code Example: Encoding the nibble b1101 using even parity: P4 covers [D7,D6,D5] = 1,1,0 (two 1's, so P4 = 0) b1100110
Hamming Code Example: Encoding the nibble b1101 using even parity: Encoded data b1100110
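The encoding steps above can be sketched as three XORs. This assumes the slide's bit layout (position 7 on the left, position 1 on the right) and a data nibble given in the order D7 D6 D5 D3; `hamming74_encode` is an illustrative name.

```python
def hamming74_encode(nibble):
    """Encode 'D7 D6 D5 D3' into a 7-bit word listed as positions 7..1."""
    d7, d6, d5, d3 = (int(b) for b in nibble)
    p1 = d7 ^ d5 ^ d3   # even parity over positions 7, 5, 3
    p2 = d7 ^ d6 ^ d3   # even parity over positions 7, 6, 3
    p4 = d7 ^ d6 ^ d5   # even parity over positions 7, 6, 5
    return "".join(str(b) for b in (d7, d6, d5, p4, d3, p2, p1))

print(hamming74_encode("1101"))  # 1100110
```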
Hamming Code D6 gets flipped between write and read b1100110 -> b1000110
Hamming Code D6 gets flipped between write and read b1100110 -> b1000110 Parity bit 1 checks positions 7, 5, 3, 1 of b1000110: 1, 0, 1, 0. Even number of 1 bits -> No Error
Hamming Code D6 gets flipped between write and read b1100110 -> b1000110 Parity bit 2 checks positions 7, 6, 3, 2 of b1000110: 1, 0, 1, 1. Odd number of 1 bits -> ERROR Parity bits generating error: [P2]
Hamming Code D6 gets flipped between write and read b1100110 -> b1000110 Parity bit 4 checks positions 7, 6, 5, 4 of b1000110: 1, 0, 0, 0. Odd number of 1 bits -> ERROR Parity bits generating error: [P2, P4]
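Hamming Code Parity bits generating error: [P2, P4] X = ERROR, O = NO ERROR (each parity bit marks only the data bits it covers)
      D3   D5   D6   D7
P1    O    O         O
P2    X         X    X
P4         X    X    X
The only column with just X's is D6, the incorrect bit. Equivalently, the positions of the failing parity bits add up to the position of the error: 2 + 4 = 6.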
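The lookup above can be sketched as a syndrome calculation: add up the positions of the parity checks that fail, and the sum is the position of the flipped bit. This is a minimal sketch for a single-bit error in a 7-bit word written as positions 7..1; `hamming74_correct` is an illustrative name.

```python
def hamming74_correct(word):
    """Return (corrected_word, error_position); position 0 means no error."""
    # Map position number -> bit value, with position 7 leftmost.
    bits = {pos: int(ch) for pos, ch in zip(range(7, 0, -1), word)}
    syndrome = 0
    for p in (1, 2, 4):
        group = [pos for pos in bits if pos & p]   # includes the parity bit
        if sum(bits[pos] for pos in group) % 2:    # odd count: check fails
            syndrome += p
    if syndrome:
        bits[syndrome] ^= 1                        # flip the faulty bit back
    corrected = "".join(str(bits[pos]) for pos in range(7, 0, -1))
    return corrected, syndrome

print(hamming74_correct("1000110"))  # ('1100110', 6)
print(hamming74_correct("1100110"))  # ('1100110', 0)
```

With only three parity bits, a double-bit error would produce a misleading syndrome, so this sketch assumes at most one flipped bit per word.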
Hamming Code Pros: Overhead of only O(log(n)) parity bits. 4 data bits -> 3 parity bits (57% of stored bits are data). 247 data bits -> 8 parity bits (97% of stored bits are data). Good for large chunks of memory (DRAM). Cons: Detection logic is more complicated to implement than a simple parity bit.
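The logarithmic overhead follows from the standard condition for a single-error-correcting Hamming code: r parity bits can protect m data bits when 2^r >= m + r + 1. The sketch below finds the smallest such r; `parity_bits_needed` is an illustrative name.

```python
def parity_bits_needed(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r

for m in (4, 247):
    r = parity_bits_needed(m)
    print(f"{m} data bits -> {r} parity bits ({m / (m + r):.0%} data)")
# 4 data bits -> 3 parity bits (57% data)
# 247 data bits -> 8 parity bits (97% data)
```

By the same condition, 8 parity bits top out at 247 data bits (a 255-bit codeword); one more data bit requires a ninth parity bit.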
Drawbacks of ECC: More expensive. When an error-correcting algorithm operates on a shorter correction code, performance can drop abruptly. This loss of performance is known as the "error floor" phenomenon.
Recent Developments in ECC: Moving away from the Hamming code scheme toward BCH codes, which are more efficient and can correct multiple bit errors per block. For more information visit http://www.princeton.edu/~achaney/tmve/wiki100k/docs/BCH_code.html
Questions?