Published by Randolf Ward. Modified over 9 years ago.
1
Efficient Scrub Mechanisms for Error-Prone Emerging Memories
Manu Awasthi ǂ, Manjunath Shevgoor ⁺, Kshitij Sudan ⁺, Rajeev Balasubramonian ⁺, Bipin Rajendran ‡, Viji Srinivasan ‡
ǂ Micron, ⁺ University of Utah, ‡ IBM Research
2
Executive Summary
Future memory hierarchies will likely include NVMs
  – Focus of this work: Phase Change Memory (PCM)
Multi-level cells in PCM appear imminent
A number of proposals exist to handle hard errors and lifetime issues of PCM devices
Resistance drift is a less-explored phenomenon
  – Will become the primary cause of "soft errors"
  – Need to explore holistic solutions to counter drift
3
Phase Change Memory - MLC
Chalcogenide material can exist in crystalline or amorphous states
The material can also be programmed into intermediate states
  – This paves the way for Multi Level Cells (MLCs)
[Figure: resistance axis between the crystalline and amorphous states, divided into four levels for 2-bit cells (11, 10, 01, 00) and eight levels for 3-bit cells (000 through 111)]
4
What is Resistance Drift?
[Figure: resistance versus time for the states 11, 10, 01, 00; labels A and B, times T0 and Tn, and an ERROR marker]
5
Resistance Drift - Issues
Programmed resistance drifts according to a power law equation:
  $R_{\mathrm{drift}}(t) = R_0 \times t^{\alpha}$
  – R_0 and α usually follow a Gaussian distribution
Time to drift (error) depends on
  – Programmed resistance (R_0), and
  – Drift coefficient (α)
  – Is highly unpredictable!!
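As a rough illustration of why drift time is so sensitive to R_0 and α, the power law on this slide can be inverted to give the time at which a cell's resistance reaches a state boundary. This is a minimal sketch, not the authors' model: the function name `drift_time`, the boundary resistance, the normalization time `t0`, and every numeric value are assumptions chosen only for illustration.

```python
import math


def drift_time(r0, alpha, r_boundary, t0=1.0):
    """Time at which R(t) = r0 * (t / t0) ** alpha reaches r_boundary.

    r0         -- programmed resistance (hypothetical units)
    alpha      -- drift coefficient
    r_boundary -- resistance at which the cell is read as the next state
    t0         -- normalization time (assumption; the slide omits it)
    """
    if r0 >= r_boundary:
        return 0.0          # already past the boundary
    if alpha <= 0:
        return math.inf     # no upward drift, boundary is never reached
    # Invert the power law: (t / t0) ** alpha = r_boundary / r0
    return t0 * (r_boundary / r0) ** (1.0 / alpha)


if __name__ == "__main__":
    # Hypothetical cells sharing the same boundary resistance.
    for label, r0, alpha in [("typical", 1e4, 0.05), ("high R_0, high alpha", 5e4, 0.1)]:
        print(label, drift_time(r0, alpha, r_boundary=1e5))
```

Because α sits in the exponent's denominator, a small increase in α or a programmed resistance closer to the boundary shortens the drift time by many orders of magnitude, which is consistent with the worst-case cell dictating the scrub rate.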
6
Resistance Drift - How it happens
Median case cell: typical R_0, typical α
Worst case cell: high R_0, high α
Scrub rate will be dictated by the worst-case R_0 and worst-case α
[Figure: number of cells versus resistance for the states 11, 10, 01, 00; drift moves a cell from R_0 to R_t, with an ERROR marked]
7
Resistance Drift Data
Cell type             Drift time at room temperature (secs)
Median 11 cell        10^499
Worst case 11 cell    10^15
Median 10 cell        10^24
Worst case 10 cell    5.94
Median 01 cell        10^8
Worst case 01 cell    1.81
8
Uncorrectable Errors
9
Problem Severity
10
Inadequacy of Device Level Solutions
Precise writes (write – compare – write)
  – Can be effective, but require multiple iterations
  – Increase write time and total write energy, decrease lifetime
Correcting values during read is not effective
Coding techniques
  – Can help, but cannot eliminate the problem by themselves
11
Naïve Solution: Scrubbing
Drift resets with every cell reprogram (write)
Leverage existing error correction mechanisms, e.g., ECC
  – Has its own drawbacks
A full refresh/scrub (read-compare-write) is extremely costly in PCM
  – Each PCM write takes 100 - 1000 ns
  – Writing to a 2-bit cell may consume as much as 1.6 nJ
  – Requires 600 refreshes in parallel
Refresh should be reactionary, NOT precautionary!
12
Solution Outline
A three-pronged approach
  – Incorporate stronger ECC codes
  – Incorporate low-cost error detection mechanisms
  – Adopt scrub algorithms that are conservative and adaptive
13
Additional ECC support
With stronger ECC support, scrub operations can be made infrequent
  – If E errors can be corrected and E+1 detected, then the scrub operation can be postponed until the E-th error
  – Can provide support for hard error tolerance as well
We advocate ECC-8 (E-8)
  – Correct eight errors, detect nine
  – Leads to 14.25% storage overhead, for 512 bits
14
Light Array Reads for Drift Detection (LARDD)
  – Support for E error-correcting, E+1 error-detecting codes assumed
  – Lines are read periodically and checked for correctness
  – Scrubbing is performed only after the number of errors reaches a threshold
  – Drift detection has to be simple and on-chip
[Flowchart: Read Line → Check for Errors → Errors < E? True: repeat after N cycles. False: Scrub Line.]
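The LARDD decision on this slide can be sketched as a single pass over per-line error counts. This is an illustrative sketch only: the function name `lardd_pass` and the idea of receiving the error counts as a list are assumptions, and the real mechanism is on-chip hardware, not software.

```python
from typing import List, Sequence


def lardd_pass(error_counts: Sequence[int], e_correctable: int) -> List[int]:
    """Return the indices of lines that need a full scrub on this pass.

    error_counts[i] is the number of errors the light read found in line i
    (hypothetical input). A line is left alone while its count is below E
    and is scrubbed once the count reaches E.
    """
    if e_correctable < 1:
        raise ValueError("E must be at least 1")
    return [i for i, n in enumerate(error_counts) if n >= e_correctable]


if __name__ == "__main__":
    # Four hypothetical lines checked against ECC-8 (E = 8).
    print(lardd_pass([0, 3, 8, 9], e_correctable=8))
```

Lines that are not returned are simply re-read after the next N cycles, so the expensive read-compare-write is paid only for lines that have actually accumulated errors.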
15
Read Pipeline with LARDD & Parity
Simple error detection mechanism to bypass the complex BCH circuit
  – 256-cell line divided up into eight 32-cell fields
  – Count number of drift-prone states, store as parity information
  – For every LARDD, check if parity has changed
  – If true, invoke BCH circuitry
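A minimal sketch of the parity filter described on this slide follows. Two points are assumptions that the slide does not state: that the drift-prone states are 10 and 01, and that the stored "parity information" is one bit per field holding the count of drift-prone cells modulo 2. The names `field_parities` and `needs_bch` are hypothetical.

```python
DRIFT_PRONE = {0b10, 0b01}   # assumption: which 2-bit states count as drift prone
FIELD_CELLS = 32             # cells per field (from the slide)
FIELDS_PER_LINE = 8          # fields per 256-cell line (from the slide)


def field_parities(cells):
    """Return one parity bit per field: (number of drift-prone cells) mod 2."""
    if len(cells) != FIELD_CELLS * FIELDS_PER_LINE:
        raise ValueError("expected a 256-cell line")
    parities = []
    for f in range(FIELDS_PER_LINE):
        field = cells[f * FIELD_CELLS:(f + 1) * FIELD_CELLS]
        count = sum(1 for c in field if c in DRIFT_PRONE)
        parities.append(count & 1)
    return parities


def needs_bch(cells_read, stored_parities):
    """True if any field's parity changed, i.e. the BCH decoder should run."""
    return field_parities(cells_read) != stored_parities


if __name__ == "__main__":
    line = [0b11] * 256
    line[5] = 0b01                       # one drift-prone cell at write time
    stored = field_parities(line)
    drifted = list(line)
    drifted[5] = 0b00                    # hypothetical drift into the next state
    print(needs_bch(line, stored), needs_bch(drifted, stored))
```

Under these assumptions the check is only a cheap filter: an even number of changes inside one field, or a cell moving between two drift-prone states, would leave the parity unchanged.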
16
Headroom
Headroom-h scheme – scrub is triggered if E-h errors are detected
  + Decreases probability of errors slipping through
  – Increases frequency of full scrub and hence decreases lifetime
Presents a trade-off between hard and soft errors
[Flowchart: Read Line → Check for Errors → Errors < E-h? True: repeat after N cycles. False: Scrub Line.]
17
Headroom Gradual
Start with a small LARDD frequency
Increase frequency as drift-based errors increase
  – Decreases energy overhead
  – Might cause a slight increase in error rate
[Flowchart: Read Line → Check for Errors → Errors <= E-h-g? True: repeat after N cycles. False: Double LARDD rate.]
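The rate adjustment in the gradual scheme can be sketched as a function that returns the next LARDD interval. This covers only the branch shown on this slide (keep the interval while errors <= E-h-g, otherwise double the rate). The function name `next_lardd_interval`, the use of seconds, and the lower bound `min_interval_s` are assumptions added for illustration.

```python
def next_lardd_interval(errors, e, h, g, interval_s, min_interval_s=1.0):
    """Return the interval (in seconds) to wait before the next light read.

    errors          -- errors found in the line on this read
    e, h, g         -- ECC strength, headroom, and gradual margin
    interval_s      -- current LARDD interval
    min_interval_s  -- hypothetical floor so the interval cannot shrink forever
    """
    if errors <= e - h - g:
        return interval_s                      # few errors: keep the slow rate
    # Doubling the LARDD rate is the same as halving the interval.
    return max(interval_s / 2.0, min_interval_s)


if __name__ == "__main__":
    # ECC-8 with headroom 3 and gradual 1, starting from a 512 s interval.
    for errors in (2, 4, 5):
        print(errors, next_lardd_interval(errors, e=8, h=3, g=1, interval_s=512.0))
```

Starting slow and speeding up only when errors appear is what saves energy; the cost is that a line may be checked less often than its drift behavior would ideally require.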
18
Adaptive Scrub
Dynamic events can affect reliability
  – Temperature increases can increase α and decrease drift time
  – Cell lifetime/wearout is also an issue
  – Soft error rate depends on prevalence of drift-prone states
  – Hard errors show up with aging cells
Start with a headroom-h policy; drop the headroom once there are at least E-h-1 hard errors
Can increase LARDD frequency when the total number of hard errors exceeds a preset threshold
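One possible reading of the adaptive headroom rule is sketched below: the scrub threshold stays at E-h until hard errors use up most of the margin, then falls back to E. The function name `scrub_threshold` and the exact comparison are assumptions; the slide gives only the E-h-1 condition.

```python
def scrub_threshold(e, h, hard_errors):
    """Number of detected errors at which a scrub is triggered.

    e           -- errors the ECC can correct
    h           -- headroom kept in reserve
    hard_errors -- permanent (wear-out) errors known for the line
    """
    if hard_errors >= e - h - 1:
        # Hard errors alone nearly reach E-h, so headroom would force
        # almost constant scrubbing; fall back to the full ECC strength.
        return e
    return e - h


if __name__ == "__main__":
    # ECC-8 with headroom 3, for a young line and an aged line.
    print(scrub_threshold(8, 3, hard_errors=0), scrub_threshold(8, 3, hard_errors=4))
```

The intent under this reading is that scrubbing cannot clear hard errors, so once they dominate the error count, keeping headroom only adds writes that wear the cells further.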
19
Results
Stronger ECC + LARDD support leads to a decrease in unrecoverable errors
20
Results - Headroom Schemes
Compared to ECC-1, for a 512 s LARDD interval
  – ECC-8-headroom-3 leads to a 99.6% reduction in error rate and a 35.4% reduction in scrub energy
  – ECC-8-headroom-3-gradual-1 leads to a 96.5% reduction in error rate and a 37.8% reduction in scrub energy
21
Device Level Solution – Non Uniform Banding
[Figure: resistance bands for the states 11, 10, 01, 00 before and after non-uniform banding, with the mean R_0 marked]
22
Comparing with Device Level Solutions
ECC-8-headroom-3 yields the fewest uncorrectable errors
23
Conclusions
Resistance drift will worsen with MLC scaling
Naïve solutions based on ECC support are costly for PCM
  – Increased write energy, decreased lifetimes
Holistic solutions need to be explored to counter drift at the device, architectural and system levels
  – 39% reduction in energy, 4x fewer errors, 102x increase in lifetime
24
Thanks!! www.cs.utah.edu/arch-research
25
Backup Slides