Published by Agnes Watts. Modified over 9 years ago.
1
SAFER: Stuck-At-Fault Error Recovery for Memories
Nak Hee Seong †, Dong Hyuk Woo †, Vijayalakshmi Srinivasan ‡, Jude A. Rivers ‡, Hsien-Hsin S. Lee †
2
2 Emerging Memory Technologies
Resistive memories
–Motivated by the DRAM scaling challenge
Phase Change Memory (PCM)
–Scalability, high density
–Limited write endurance (avg. 10^8 writes)
–Worn-out cells incur stuck-at faults
3
3 Cell Write Endurance
Endurance variation
–No spatial correlation
–Increases with technology scaling
Issues
–Unpredictable cell endurance: read verification is required after each write
–The weakest cell dictates memory lifetime!
–The number of stuck-at faults grows gradually!
A multi-bit error recovery scheme is needed!
4
4 Existing Error Correcting Methods
(72,64) Hamming code
–For transient faults
–Single Error Correction, Double Error Detection (SECDED)
–12.5% overhead
Error-Correcting Pointers (ECP) [Schechter, ISCA37]
–Dynamically replace failed cells with extra cells
–Store multiple fail pointers for each data block
–Recover from 6 fails with 61-bit overhead (11.9%)
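The two overhead figures on this slide follow from simple ratios. The breakdown of the 61 bits below (six entries, each a 9-bit pointer plus one replacement bit, plus one "full" flag, for a 512-bit block) is recalled from the ECP design rather than stated on the slide, so treat it as an assumption.

```latex
\begin{align*}
\text{(72,64) Hamming:}\quad & \frac{72 - 64}{64} = \frac{8}{64} = 12.5\% \\
\text{ECP}_6 \text{ on a 512-bit block:}\quad & 6 \times (9 + 1) + 1 = 61 \text{ bits}, \qquad \frac{61}{512} \approx 11.9\%
\end{align*}
```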
5
5 SAFER: Stuck-At-Fault Error Recovery
6
6 Concept of SAFER
Exploit two properties of stuck-at faults
–Permanency
–Readability
Multiple error correction
–Fault separation
–Low-cost Single Error Correction (SEC)
7
7 SAFER: 1. Fault Separation 2. Single Error Correction
8
8 Fault Separation
Assuming 2 faults in an 8-bit block
–C(8,2) = 28 possible fault pairs
How to separate these 2 faults (for all 28 pairs)?
[Figure: three partition patterns, Pattern #2, Pattern #1, and Pattern #0, each drawn over bit positions 7 to 0]
9
9 Decision for Fault Separation
Use bit pointers for fault separation
Data block position:   7 6 5 4 3 2 1 0
Bit pointer, bit 2:    1 1 1 1 0 0 0 0   (Pattern #2)
Bit pointer, bit 1:    1 1 0 0 1 1 0 0   (Pattern #1)
Bit pointer, bit 0:    1 0 1 0 1 0 1 0   (Pattern #0)
10
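10 Decision for Fault Separation
Find pattern candidates by XORing bit pointers
Data block position:   7 6 5 4 3 2 1 0
Bit pointer, bit 2:    1 1 1 1 0 0 0 0   (Pattern #2)
Bit pointer, bit 1:    1 1 0 0 1 1 0 0   (Pattern #1)
Bit pointer, bit 0:    1 0 1 0 1 0 1 0   (Pattern #0)
Difference vector (bit 2, bit 1, bit 0) = 1 0 0, so Pattern #2 is the candidate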
11
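11 Decision for Fault Separation
Find pattern candidates by XORing bit pointers
Data block position:   7 6 5 4 3 2 1 0
Bit pointer, bit 2:    1 1 1 1 0 0 0 0   (Pattern #2)
Bit pointer, bit 1:    1 1 0 0 1 1 0 0   (Pattern #1)
Bit pointer, bit 0:    1 0 1 0 1 0 1 0   (Pattern #0)
Difference vector (bit 2, bit 1, bit 0) = 0 1 1, so Pattern #1 and Pattern #0 are candidates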
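The XOR step on these slides can be sketched in a few lines of Python. This is a minimal illustration, not the authors' hardware; the helper names `difference_vector` and `separating_patterns` and the example fault positions are hypothetical, chosen so that the difference vectors come out as 100 and 011.

```python
def difference_vector(pos_a: int, pos_b: int) -> int:
    """XOR of the two faulty cells' bit pointers (their positions in binary)."""
    return pos_a ^ pos_b


def separating_patterns(pos_a: int, pos_b: int, pointer_bits: int = 3) -> list[int]:
    """Return every pattern index whose bit is set in the difference vector.

    Pattern #i splits the block by bit i of the bit pointer, so it puts the
    two faults into different groups exactly when they differ in bit i.
    """
    diff = difference_vector(pos_a, pos_b)
    return [i for i in range(pointer_bits) if (diff >> i) & 1]


# Assumed faults at positions 6 (110) and 2 (010): difference vector 100.
print(separating_patterns(6, 2))  # [2]
# Assumed faults at positions 5 (101) and 6 (110): difference vector 011.
print(separating_patterns(5, 6))  # [0, 1]
```

Because two distinct positions always differ in at least one pointer bit, every one of the 28 fault pairs has at least one candidate pattern.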
12
12 Extension to Multi-Group Partition
Use two bits for a 4-group partition
Data block positions: 15 down to 0
Bit pointer, bit 3:    1111111100000000
Bit pointer, bit 2:    1111000011110000
Bit pointer, bit 1:    1100110011001100
Bit pointer, bit 0:    1010101010101010
Candidate bit pairs: (bit 3, bit 2), (bit 3, bit 1), (bit 3, bit 0), (bit 2, bit 1), (bit 2, bit 0), (bit 1, bit 0)
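As a rough sketch of the two-bit grouping on this slide, the group of a cell can be read straight off the chosen bits of its bit pointer. The function name `group_of` and the sample positions are assumptions made for illustration.

```python
from itertools import combinations


def group_of(position: int, partition_bits: tuple[int, ...]) -> int:
    """Group index formed from the chosen bits of a cell's bit pointer."""
    group = 0
    for bit in partition_bits:
        group = (group << 1) | ((position >> bit) & 1)
    return group


# All ways to pick two of the four pointer bits of a 16-bit block.
candidates = [(hi, lo) for lo, hi in combinations(range(4), 2)]
print(len(candidates))  # 6

# With (bit 3, bit 1), positions 8, 2 and 0 land in three different groups.
print([group_of(p, (3, 1)) for p in (8, 2, 0)])  # [2, 1, 0]
```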
13
13 Dynamic Partition
4-group partition for a 16-bit data block (bit positions 15 down to 0)
Metadata per data block: 1st Partition Field, 2nd Partition Field, Fixed Partition Counter
–Initial state: 1st Partition Field = bit 2, 2nd Partition Field = bit 0, Fixed Partition Counter = 0
–After XORing bit pointers, 1000 ⊕ 0010 = 1010: 1st Partition Field = bit 3, 2nd Partition Field = bit 0, Fixed Partition Counter = 1
–After XORing bit pointers, 0010 ⊕ 0000 = 0010: 1st Partition Field = bit 3, 2nd Partition Field = bit 1, Fixed Partition Counter = 2
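One way to read this slide is as an incremental refinement: whenever a new stuck-at cell falls into the same group as a known one, XOR their bit pointers and fix one more partition field on a bit where they differ. The sketch below is a simplified model under that assumption. The function `add_fault` is hypothetical, unfixed fields are simply omitted instead of holding default values, and picking the highest differing bit is an arbitrary choice.

```python
def add_fault(faults: list[int], fields: list[int], new_fault: int,
              max_fields: int) -> bool:
    """Track a new stuck-at cell, adding a partition field if it collides.

    `fields` lists the bit-pointer bits fixed so far, so len(fields) plays the
    role of the fixed partition counter. Returns False when no field is left.
    """
    if new_fault in faults:
        return True

    def group(position: int) -> tuple[int, ...]:
        return tuple((position >> bit) & 1 for bit in fields)

    for old in faults:
        if group(old) == group(new_fault):
            if len(fields) == max_fields:
                return False
            diff = old ^ new_fault                # difference vector
            fields.append(diff.bit_length() - 1)  # highest differing bit
            break  # known faults sit in distinct groups: one collision at most
    faults.append(new_fault)
    return True


faults, fields = [], []
for cell in (8, 2, 0):  # 1000, 0010, 0000
    add_fault(faults, fields, cell, max_fields=2)
print(fields)  # [3, 1]
```

Each added field only refines the existing groups, so faults that were already separated stay separated, and each new fault costs at most one field.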
14
14 Dynamic Partition
Objective
–Separate multiple stuck-at faults into different groups
Additional metadata
–Assuming an n-bit block and a k-group partition
–$\log_2 k \times \lceil \log_2 \log_2 n \rceil + \lceil \log_2(\log_2 k + 1) \rceil$ bits, where $\log_2 k$ is the number of partition fields, $\lceil \log_2 \log_2 n \rceil$ is the size of each partition field, and $\lceil \log_2(\log_2 k + 1) \rceil$ is the size of the fixed partition counter
Example: n = 512, k = 32
–Required metadata: 23 bits/block
–6 ≤ the number of separable stuck-at faults ≤ 32
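The metadata count can be checked numerically. The sketch below reads the slide's three labelled terms as (number of partition fields) × (size of each field) + (size of the fixed partition counter), with ceilings on the two logarithms that are not whole numbers; that reading is an assumption, although it does reproduce the 23 bits quoted for n = 512, k = 32. The function names are hypothetical.

```python
def ceil_log2(x: int) -> int:
    """Smallest b such that 2**b >= x, for x >= 1."""
    return (x - 1).bit_length()


def partition_metadata_bits(n: int, k: int) -> int:
    """Partition metadata bits for an n-bit block split into k groups.

    Assumes k is a power of two.
    """
    num_fields = ceil_log2(k)                 # number of partition fields
    field_size = ceil_log2(ceil_log2(n))      # size of each partition field
    counter_size = ceil_log2(num_fields + 1)  # fixed partition counter
    return num_fields * field_size + counter_size


print(partition_metadata_bits(512, 32))  # 23
```

For n = 512 and k = 32 this is 5 fields of 4 bits each plus a 3-bit counter.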
15
15 SAFER: 1. Fault Separation 2. Single Error Correction
16
16 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
[Figure: write 1010; verify reads 1010]
17
17 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
[Figure: the group holds 1010 and one cell is stuck at 0; write 0101; verify reads 0101]
18
18 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
[Figure: write 1010; verify reads 1000 because one cell is stuck at 0]
Need to recover!!
19
19 Low-cost Single Error Correction
Data Inversion as an SEC
–One additional bit per group (flip mark)
[Figure: 1010 is inverted to 0101 and the flip mark "F" is set; the 2nd write stores 0101 and the 2nd verify reads 0101; inverting again on a read returns 1010]
Recovered from Stuck-At Fault!!
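The inversion trick can be simulated for a single group. The following is a minimal sketch, assuming the group is a list of bits read left to right and that the stuck cell is the third one (which matches 1010 being read back as 1000); `write_group` and `read_group` are hypothetical names.

```python
def write_group(data: list[int], stuck: dict[int, int]) -> tuple[list[int], int]:
    """Write one group and return (stored bits, flip mark).

    `stuck` maps a cell index to the value that cell is stuck at.
    """
    def program(bits: list[int]) -> list[int]:
        # A stuck cell keeps its value no matter what is written.
        return [stuck.get(i, b) for i, b in enumerate(bits)]

    stored = program(data)                  # 1st write
    if stored == data:                      # verify
        return stored, 0
    inverted = [b ^ 1 for b in data]        # inversion
    stored = program(inverted)              # 2nd write
    if stored == inverted:                  # 2nd verify
        return stored, 1
    raise ValueError("group holds stuck-at faults that inversion cannot hide")


def read_group(stored: list[int], flip: int) -> list[int]:
    """Undo the inversion on a read when the flip mark is set."""
    return [b ^ flip for b in stored]


stored, flip = write_group([1, 0, 1, 0], {2: 0})
print(stored, flip)              # [0, 1, 0, 1] 1
print(read_group(stored, flip))  # [1, 0, 1, 0]
```

With a single stuck cell per group the second write always succeeds: the first write fails only when the data bit differs from the stuck value, so the inverted bit must equal it.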
20
20 Design Issues
21
21 SAFER Sequence for a Write
Flow: Start → Read → Write (1st) → Verify → Error?
–N: Success
–Y: Inversion → Write (2nd) → Verify → Error?
–N (after the 2nd write): Success
–Y (after the 2nd write): Fixed Partition Counter < MAX? Y: Re-partition; N: Failure
Drawbacks:
- accelerating wear-out
- performance degradation
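The flow on this slide can be modelled end to end in a small simulation. This is a sketch under stated assumptions, not the controller from the talk: `safer_write` is a hypothetical name, the write is restarted from the 1st write after every re-partition, and the new partition bit is the highest bit in which two conflicting faults differ.

```python
def safer_write(data, stuck, fields, max_fields):
    """Return (stored bits, inverted groups, number of writes issued).

    data:   list of bits, index = cell position
    stuck:  dict mapping a cell position to its stuck-at value
    fields: bit-pointer bits used for partitioning (extended on re-partition)
    """
    def group(pos):
        return tuple((pos >> bit) & 1 for bit in fields)

    def program(bits):
        # A stuck cell keeps its value no matter what is written.
        return [stuck.get(pos, bit) for pos, bit in enumerate(bits)]

    def mismatches(target, stored):
        return [pos for pos, (t, s) in enumerate(zip(target, stored)) if t != s]

    writes = 0
    while True:
        stored = program(data)                  # Write (1st)
        writes += 1
        first = mismatches(data, stored)        # Verify
        if not first:
            return stored, set(), writes
        flips = {group(pos) for pos in first}   # Inversion of failing groups
        target = [bit ^ 1 if group(pos) in flips else bit
                  for pos, bit in enumerate(data)]
        stored = program(target)                # Write (2nd)
        writes += 1
        second = mismatches(target, stored)     # Verify
        if not second:
            return stored, flips, writes
        if len(fields) >= max_fields:           # counter has reached MAX
            raise RuntimeError("unrecoverable block")
        bad = second[0]
        partner = next(pos for pos in first if group(pos) == group(bad))
        fields.append((bad ^ partner).bit_length() - 1)  # Re-partition


data = [1, 0, 1, 0, 1, 0, 1, 0]
stuck = {2: 0, 6: 1}   # two faults that want opposite inversions
fields = []
stored, flips, writes = safer_write(data, stuck, fields, max_fields=2)
print(fields, writes)  # [2] 4
```

The example needs four writes for one logical write, which illustrates both drawbacks listed on the slide.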
22
22 Fail Information Cache
Objective: avoid the 2nd writes
Solution: early inversion decision
Fail Info. Cache with 1K entries
–Keeps track of recent data blocks with stuck-at faults
–Stores fail positions and their stuck-at values
[Figure: cache organized as 16 banks (Bank #0 to Bank #15); each entry holds Valid, Tag, and Stuck Value; the Block Address and Fail Pointer fields map onto Tag, Index, and Bank Addr]
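The early inversion decision can be sketched as a lookup made before the first write. Everything structural here is an assumption inferred from the slide: a direct-mapped cache keyed by the block address concatenated with a 9-bit fail pointer, one stuck value per entry, no banking, and a software loop that probes every position. The class `FailInfoCache`, the function `groups_to_invert`, and the fixed grouping used in the example are hypothetical.

```python
POINTER_BITS = 9  # fail pointer width for a 512-bit block


class FailInfoCache:
    """Direct-mapped cache of known stuck-at cells (illustrative only)."""

    def __init__(self, entries: int = 1024):
        self.entries = entries
        self.lines = [None] * entries  # each line: (tag, stuck_value) or None

    def _split(self, block_addr: int, fail_pointer: int) -> tuple[int, int]:
        key = (block_addr << POINTER_BITS) | fail_pointer
        return key // self.entries, key % self.entries  # (tag, index)

    def record(self, block_addr: int, fail_pointer: int, stuck_value: int) -> None:
        tag, index = self._split(block_addr, fail_pointer)
        self.lines[index] = (tag, stuck_value)  # overwrite on conflict

    def lookup(self, block_addr: int, fail_pointer: int):
        tag, index = self._split(block_addr, fail_pointer)
        line = self.lines[index]
        if line is not None and line[0] == tag:
            return line[1]
        return None


def groups_to_invert(cache, block_addr, data, group_of):
    """Groups whose known stuck value conflicts with the data to be written."""
    flips = set()
    for pos, bit in enumerate(data):
        stuck = cache.lookup(block_addr, pos)
        if stuck is not None and stuck != bit:
            flips.add(group_of(pos))
    return flips


cache = FailInfoCache()
cache.record(block_addr=7, fail_pointer=2, stuck_value=0)
data = [1] * 512
print(groups_to_invert(cache, 7, data, lambda pos: pos // 16))  # {0}
```

On a hit, the conflicting group is inverted before the first write, so the second write is not needed for faults the cache already knows about.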
23
23 Evaluation
24
24 Evaluation
Monte Carlo simulations
–Data block size = 512 bits
–Perfect wear-leveling scheme (256-byte block)
–Cell write endurance: μ = 100M writes, σ = 10M writes
–Schemes compared: IdealECC, ECP, SAFER, SAFER_FC
[Chart: Hardware overhead]
25
25 Relative Lifetime Improvement
[Chart: relative lifetime improvement, annotated with 14.8%]
Cell write endurance:
–μ = 100M writes, σ = 10M writes
26
26 Conclusion
Need to recover from multiple stuck-at faults
SAFER
–Efficient recovery scheme that handles the growing number of stuck-at faults
–Dynamic partition
–Data inversion
SAFER32_FC
–11.9% (11.5%) better hardware efficiency than ECP6 (IdealECC8)
–14.8% (3.1%) better lifetime improvement than ECP6 (IdealECC8)
27
27 Thank You All!! Questions?
28
28 SRAM Fail Info. Cache Overhead
Cell size in 2024
–SRAM = 140 F^2 @ 10nm, PCM = 6 F^2 @ 8nm
–36.6X difference
Compared with an 8 Gbit PCM chip

Number of Entries   Tag Size (bits)   Entry Size (bits)   Cache Size (bits)   Area Overhead
1K                  23                25                  25.6K               0.01%
2K                  22                24                  49.2K               0.02%
4K                  21                23                  94.2K               0.04%
8K                  20                22                  0.18M               0.08%
16K                 19                21                  0.33M               0.15%
32K                 18                20                  0.63M               0.28%
64K                 17                19                  1.19M               0.53%
128K                16                18                  2.25M               1.00%
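The Area Overhead column can be recomputed from the cell sizes on this slide. In the sketch below the tag sizes are copied from the table, while the entry size of tag + 2 bits (consistent with one valid bit and one stuck-value bit) and the use of 2^30 bits per Gbit are assumptions. The cell-area ratio computed this way comes out near 36.5, close to the 36.6X quoted on the slide.

```python
SRAM_CELL = 140 * 10**2     # 140 F^2 at F = 10 nm, in nm^2
PCM_CELL = 6 * 8**2         # 6 F^2 at F = 8 nm, in nm^2
PCM_CHIP_BITS = 8 * 2**30   # 8 Gbit PCM chip

# (entries in K, tag size in bits), copied from the table.
ROWS = [(1, 23), (2, 22), (4, 21), (8, 20), (16, 19), (32, 18), (64, 17), (128, 16)]


def area_overhead_percent(entries_k: int, tag_bits: int) -> float:
    """SRAM cache area as a percentage of the PCM chip's cell area."""
    entry_bits = tag_bits + 2                  # assumed: tag + valid + stuck value
    cache_bits = entries_k * 1024 * entry_bits
    return 100 * cache_bits * SRAM_CELL / (PCM_CHIP_BITS * PCM_CELL)


for entries_k, tag_bits in ROWS:
    print(f"{entries_k}K entries: {area_overhead_percent(entries_k, tag_bits):.2f}%")
```

Rounded to two decimals, the results agree with the table, from 0.01% at 1K entries to 1.00% at 128K entries.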
29
29 Relative Lifetime Improvement
Need a method for measuring relative lifetime
–Independent from μ and T
Definition
–Cell write endurance distribution: μ = 100M writes, σ = 10M writes
–Bit toggle rate (T) = 0.5
–Recovery scheme contribution for lifetime: Lifetime Contribution
[Equation: lifetime contribution, expressed in terms of L, F, and T]
30
30 Lifetime Contribution per Meta-bit
31
31 Average Number of Recovered Fails
32
32 SAFER with Fail Cache
33
33 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
[Figure: write 0101, verify reads 0101; write 1010, verify reads 1010]
34
34 Low-cost Single Error Correction
Stuck-At Fault Property: Readability
[Figure: one cell is stuck at 0; write 0101, verify reads 0101; write 1010, verify reads 1000]
Need to recover!!
35
35 Low-cost Single Error Correction
Data Inversion as an SEC – one additional bit per group
[Figure: write 0101, verify reads 0101; 1010 is inverted to 0101 and the flip mark "F" is set; the 2nd write stores 0101 and the 2nd verify reads 0101; inverting again on a read returns 1010]
Recovered from Stuck-At Fault!!
36
36 Evaluation
Monte Carlo simulations
–Data block size = 512 bits
–Perfect wear-leveling scheme (256-byte block)
–Cell write endurance: μ = 100M writes, σ = 10M writes
–Schemes compared: IdealECC, ECP, SAFER, SAFER_FC
[Chart annotation: 11.9%]