Presentation is loading. Please wait.

Presentation is loading. Please wait.

Efficient Scrub Mechanisms for Error-Prone Emerging Memories Manu Awasthi ǂ, Manjunath Shevgoor⁺, Kshitij Sudan⁺, Rajeev Balasubramonian⁺, Bipin Rajendran.

Similar presentations


Presentation on theme: "Efficient Scrub Mechanisms for Error-Prone Emerging Memories Manu Awasthi ǂ, Manjunath Shevgoor⁺, Kshitij Sudan⁺, Rajeev Balasubramonian⁺, Bipin Rajendran."— Presentation transcript:

1 Efficient Scrub Mechanisms for Error-Prone Emerging Memories Manu Awasthi ǂ, Manjunath Shevgoor⁺, Kshitij Sudan⁺, Rajeev Balasubramonian⁺, Bipin Rajendran ‡, Viji Srinivasan ‡ ǂ Micron, ⁺ University of Utah, ‡ IBM Research

2 Executive Summary Future memory hierarchies will likely include NVMs – Focus of this work: Phase Change Memory (PCM) Multi level cells in PCM appear imminent A number of proposals exist to handle hard errors and lifetime issues of PCM devices Resistance Drift is a lesser explored phenomenon – Will become primary cause of “soft errors” -- Need to explore holistic solutions to counter drift 2

3 Phase Change Memory - MLC Chalcogenide material can exist in crystalline or amorphous states The material can also be programmed into intermediate states – Leads to many intermediate states, paving way for Multi Level Cells (MLCs) 3 Resistance AmorphousCrystalline (11)(00)(10) (01) (111) (000) (101)(011)(010)(001) (110)(100)

4 What is Resistance Drift? 4 Resistance Time 11100100 A B ERROR!! T0T0 TnTn

5 Resistance Drift - Issues Programmed resistance drifts according to power law equation - R 0, α usually follow a Gaussian distribution Time to drift (error) depends on – Programmed resistance (R 0 ), and – Drift Coefficient (α) – Is highly unpredictable!! 5 R drift (t) = R 0 х (t) α

6 Resistance Drift - How it happens 6 11100100 Median case cell Typical R 0 Typical α Median case cell Typical R 0 Typical α Worst case cell High R 0 High α Worst case cell High R 0 High α Scrub rate will be dictated by the Worst Case R 0 and Worst Case α Scrub rate will be dictated by the Worst Case R 0 and Worst Case α Drift R0R0 R0R0 RtRt RtRt ERROR!! Number of Cells

7 Resistance Drift Data 7 Cell TypeDrift Time at Room temperature (secs) Median 11 cell10 499 Worst 11 Case cell10 15 Median 10 cell10 24 Worst Case 10 cell5.94 Median 01 cell10 8 Worst Case 01 cell1.81 (11) (00)(10) (01)

8 Uncorrectable Errors 8

9 Problem Severity 9

10 Inadequacy of Device Level Solutions Precise writes –> write – compare – write – Can be effective, requires multiple iterations – Increases write time, total write energy, decreases lifetime Correcting values during read not effective Coding techniques – Can help, but cannot eliminate the problem by themselves 10

11 Naïve Solution : Scrubbing Drift resets with every cell reprogram (write) Leverage existing error correction mechanisms, e.g., ECC - has its own drawbacks A Full Refresh/Scrub (read-compare-write) is extremely costly in PCM – Each PCM write takes 100 - 1000ns – Writing to a 2-bit cell may consume as much as 1.6nJ – Requires 600 refreshes in parallel 11 Refresh should be reactionary NOT precautionary!

12 Solution Outline A three pronged approach – Incorporate stronger ECC codes – Incorporate low-cost error detection mechanisms – Adapt scrub algorithms that are conservative and adaptive 12

13 Additional ECC support With stronger ECC support, scrub operations can be made infrequent – If E errors can be corrected and E+1 detected, then scrub operation can be postponed till the E th error – Can provide support for hard error tolerance as well We advocate ECC-8 (E-8) – Correct eight errors, detect nine – Leads to 14.25% storage overhead, for 512 bits 13

14 Light Array Reads for Drift Detection – Support for E Error-correcting, E+1 error detecting codes assumed – Lines are read periodically and checked for correctness – Only after the number of errors reaches a threshold, scrubbing is performed – Drift detection has to be simple, on-chip 14 LARDD Read Line Check for Errors Errors < E Scrub Line True False After N cycles

15 Read Pipeline with LARDD & Parity Simple error detection mechanism to bypass complex BCH circuit – 256 cell line divided up into eight 32-cell fields – Count number of drift prone states, store as parity information – For every LARDD, check if parity has changed – If true, invoke BCH circuitry 15 (11) (00)(10) (01)

16 Headroom Headroom-h scheme – scrub is triggered if E-h errors are detected †Decreases probability of errors slipping through –Increases frequency of full scrub and hence decreases life time Presents trade-off between Hard and Soft errors 16 Read Line Check for Errors Errors < E-h Scrub Line True False After N cycles

17 Headroom Gradual Start with a small LARDD frequency Increase frequency as drift-based errors increase – Decreases energy overhead – Might cause slight increase in error rate 17 Read Line Check for Errors Errors <= E-h-g True False After N cycles Double LARDD rate

18 Adaptive Scrub Dynamic events can affect reliability – Temperature increases can increase α and decrease drift time – Cell lifetime/wearout is also an issue – Soft error rate depends on prevalence of drift prone states – Hard errors show up with aging cells Start with headroom h policy, get rid of headroom when at least E-h-1 hard errors Can increase LARDD frequency when total number of hard errors increases preset threshold 18

19 Results 19 Stronger ECC + LARDD support leads to decrease in unrecoverable errors

20 Results - Headroom Schemes Compared to ECC-1, for 512 s LARDD interval – ECC-8-headroom-3 leads to 99.6% reduction in error rate, 35.4% reduction in scrub energy – ECC-8-headroom-3-gradual-1 leads to 96.5% reduction in error rate, 37.8% reduction in scrub energy 20

21 Device Level Solution – Non Uniform Banding 21 Resistance Before After Mean R 0 11100100

22 Comparing with Device Level Solutions 22 ECC-8-headroom-3 provides for least number of uncorrectable errors

23 Conclusions Resistance drift will exacerbate with MLC scaling Naïve solutions based on ECC support are costly for PCM – Increased write energy, decreased lifetimes Holistic solutions need to be explored to counter drift at device, architectural and system levels – 39% reduction in energy, 4x less errors, 102x increase in lifetime 23

24 Thanks!! www.cs.utah.edu/arch-research 24

25 Backup Slides 25


Download ppt "Efficient Scrub Mechanisms for Error-Prone Emerging Memories Manu Awasthi ǂ, Manjunath Shevgoor⁺, Kshitij Sudan⁺, Rajeev Balasubramonian⁺, Bipin Rajendran."

Similar presentations


Ads by Google