Prof. Onur Mutlu ETH Zürich Fall November 2018


1 Computer Architecture Lecture 15: Flash Memory and Solid-State Drives (II)
Prof. Onur Mutlu ETH Zürich Fall 2018 14 November 2018

2 Readings on Flash Memory

3 More Background and State-of-the-Art
Proceedings of the IEEE, Sept. 2017

4 More Up-to-date Version
Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo, and Onur Mutlu, "Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery" Invited Book Chapter in Inside Solid State Drives, 2018.  [Preliminary arxiv.org version] 

5 Flash Memory Reliability

6 Agenda
Background, Motivation and Approach
Experimental Characterization Methodology
Error Analysis and Management
Main Characterization Results
Retention-Aware Error Management
Threshold Voltage and Program Interference Analysis
Read Reference Voltage Prediction
Neighbor-Assisted Error Correction
Read Disturb Error Handling
Retention Error Handling
3D NAND Flash Memory Reliability
Summary

7 Using the Vth Distribution Models
So, what can we do with the model? Goal: Mitigate the effects of program interference caused voltage shifts

8 Optimum Read Reference for Flash Memory
Read reference voltage affects the raw bit error rate
There exists an optimal read reference voltage
Predictable if the statistics (i.e., mean, variance) of threshold voltage distributions are characterized and modeled
[Figure: Vth distributions f(x) and g(x) for State-A and State-B, with axis labels v0 and v1, read at vref in one panel and at v’ref in the other]
We introduce a new error tolerance mechanism that can adjust the read reference voltage dynamically based on the predictive threshold voltage model. The key observation is that the threshold voltage distribution of a cell shifts to the right and widens with program interference in a manner that is predictable via online learning of the distribution. As a result, the optimum read reference voltage between different threshold voltage ranges (i.e., logical values) can be predicted. Our evaluations show that doing so with our online mechanism reduces the raw bit error rate (BER) by 64% and improves the P/E cycle lifetime of NAND flash memory by 30% compared to a baseline that uses state-of-the-art error-correcting codes (ECC).
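
As a rough illustration of why an optimal read reference voltage exists, the sketch below models two neighboring states as Gaussian Vth distributions and grid-searches the reference that minimizes the raw bit error rate. The Gaussian assumption, the (mean, std) values, and the function names are illustrative assumptions, not the model used in the lecture.

```python
import math

def gaussian_cdf(x, mean, std):
    """Probability that a Gaussian(mean, std) sample falls below x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def raw_ber(vref, state_a, state_b):
    """Misread probability for two equally likely states (A below B).

    Each state is a hypothetical (mean, std) pair for its Vth distribution.
    """
    a_read_as_b = 1.0 - gaussian_cdf(vref, *state_a)  # A cells above vref
    b_read_as_a = gaussian_cdf(vref, *state_b)        # B cells below vref
    return 0.5 * (a_read_as_b + b_read_as_a)

def optimal_vref(state_a, state_b, steps=2000):
    """Grid-search the reference between the two means that minimizes raw BER."""
    lo, hi = state_a[0], state_b[0]
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=lambda v: raw_ber(v, state_a, state_b))

# Made-up numbers: both states shift right by 0.2 V and widen after interference.
before = ((2.0, 0.20), (3.0, 0.20))
after = ((2.2, 0.25), (3.2, 0.25))

default_vref = optimal_vref(*before)   # reference tuned to the original distributions
adjusted_vref = optimal_vref(*after)   # reference tuned to the shifted distributions
```

With these made-up numbers, reading the shifted distributions at `adjusted_vref` gives a lower raw BER than reading them at `default_vref`, which is the effect the slide describes.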

9 Optimum Read Reference Voltage Prediction
Vth shift learning (done every ~1k P/E cycles)
  Program sample cells with known data pattern and test Vth
  Program aggressor neighbor cells and test victim Vth after interference
  Characterize the mean shift in Vth (i.e., program interference noise)
Optimum read reference voltage prediction
  Default read reference voltage + Predicted mean Vth shift by model
[Figure: Vth shift after program interference]
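
The learning and prediction steps above can be sketched as follows. This is a minimal sketch that assumes the controller can obtain per-cell Vth samples of the victim cells before and after the aggressor cells are programmed; the function names and sample values are hypothetical.

```python
from statistics import mean

LEARNING_INTERVAL_PE_CYCLES = 1000  # "done every ~1k P/E cycles"

def should_relearn(pe_cycles):
    """True when the block has reached another learning interval."""
    return pe_cycles > 0 and pe_cycles % LEARNING_INTERVAL_PE_CYCLES == 0

def learn_mean_vth_shift(vth_before, vth_after):
    """Mean victim Vth shift caused by programming the aggressor cells."""
    return mean(after - before for before, after in zip(vth_before, vth_after))

def predict_read_reference(default_vref, mean_shift):
    """Predicted optimum = default read reference + predicted mean Vth shift."""
    return default_vref + mean_shift

# Made-up sample-cell measurements, in volts.
victim_before = [2.00, 2.10, 1.90]
victim_after = [2.10, 2.25, 2.00]

shift = learn_mean_vth_shift(victim_before, victim_after)
predicted_vref = predict_read_reference(2.5, shift)
```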

10 Effect of Read Reference Voltage Prediction
[Figure: raw bit error rate with and without read reference voltage prediction, compared against the acceptable BER of 2×10^-3 for a 32k-bit BCH code; annotated with a 30% lifetime improvement]
Read reference voltage prediction reduces raw BER (by 64%) and increases the P/E cycle lifetime (by 30%)

11 More on Read Reference Voltage Prediction
Yu Cai, Onur Mutlu, Erich F. Haratsch, and Ken Mai, "Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation" Proceedings of the 31st IEEE International Conference on Computer Design (ICCD), Asheville, NC, October

12 More Accurate and Online Channel Modeling
Yixin Luo, Saugata Ghose, Yu Cai, Erich F. Haratsch, and Onur Mutlu, "Enabling Accurate and Practical Online Flash Channel Modeling for Modern MLC NAND Flash Memory" to appear in IEEE Journal on Selected Areas in Communications (JSAC), 2016.

13 Non-Gaussian Vth Distributions (1X-nm)
Luo+, “Enabling Accurate and Practical Online Flash Channel Modeling for Modern MLC NAND Flash Memory”, JSAC 2016.

14 Better Modeling of Vth Distributions (I)
Luo+, “Enabling Accurate and Practical Online Flash Channel Modeling for Modern MLC NAND Flash Memory”, JSAC 2016.

15 Better Modeling of Vth Distributions (II)

16 Vth Prediction vs. Reality with Better Modeling
Luo+, “Enabling Accurate and Practical Online Flash Channel Modeling for Modern MLC NAND Flash Memory”, JSAC 2016.

17 Online Read Reference Voltage Prediction

18 Effect on RBER of Read Ref V Prediction



21 Goal
Develop a better error correction mechanism for cases where ECC fails to correct a page

22 Observations So Far
Immediate neighbor cell has the most effect on the victim cell when programmed
A single set of read reference voltages is used to determine the value of the (victim) cell
The set of read reference voltages is determined based on the overall threshold voltage distribution of all cells in flash memory

23 New Observations [Cai+ SIGMETRICS’14]
Vth distributions of cells with different-valued immediate-neighbor cells are significantly different
  Because neighbor value affects the amount of Vth shift
Corollary: If we know the value of the immediate neighbor, we can find a more accurate set of read reference voltages based on the “conditional” threshold voltage distribution
Cai et al., Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories, SIGMETRICS 2014.

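24 Secrets of Threshold Voltage Distributions
[Figure: an aggressor WL whose cells hold 11, 10, 01, or 00 next to a victim WL. Before the MSB page of the aggressor WL is programmed, the victim WL shows State P(i) and State P(i+1). After the MSB page of the aggressor WL is programmed, the victim WL shows State P’(i) and State P’(i+1), each composed of the conditional distributions N11, N00, N10, and N01.]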

25 If We Knew the Immediate Neighbor …
Then, we could choose a different read reference voltage to more accurately read the “victim” cell

26 Overall vs Conditional Reading
[Figure: State P’(i) and State P’(i+1) on the Vth axis, each split into the conditional distributions N11, N00, N10, and N01, with read reference REFx]
Using the optimum read reference voltage based on the overall distribution leads to more errors
Better to use the optimum read reference voltage based on the conditional distribution (i.e., value of the neighbor)
  Conditional distributions of two states are farther apart from each other

27 Real NAND Flash Chip Measurement Results
[Figure: measured Vth distributions of the P1, P2, and P3 states, annotated with “Small margin” and “Large margin”]
Raw BER of conditional reading is much smaller than overall reading

28 Idea: Neighbor Assisted Correction (NAC)
Read a page with the read reference voltages based on overall Vth distribution (same as today) and buffer it
If ECC fails:
  Read the immediate-neighbor page
  Re-read the page using the read reference voltages corresponding to the voltage distribution assuming a particular immediate-neighbor value
  Replace the buffered values of the cells with that particular immediate-neighbor cell value
  Apply ECC again

29 Neighbor Assisted Correction Flow
How to select next local optimum read reference voltage?
Trigger neighbor-assisted reading only when ECC fails
Read neighbor values and use corresponding read reference voltages in a prioritized order until ECC passes
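
The NAC flow can be sketched as below. This is a simplified illustration, not the authors' implementation: it assumes a single read reference per read (a real MLC read uses a set of references), and the callbacks `read_page`, `read_neighbor`, and `ecc_decode`, along with the toy data, are hypothetical stand-ins for the flash interface.

```python
def read_with_nac(read_page, read_neighbor, ecc_decode, overall_vref, conditional_vrefs):
    """Neighbor-assisted correction, triggered only when the normal read fails ECC.

    conditional_vrefs is a list of (neighbor_value, vref) pairs in priority order.
    """
    buffered = read_page(overall_vref)      # normal read with the overall reference
    decoded = ecc_decode(buffered)
    if decoded is not None:
        return decoded                      # common case: no extra reads
    neighbors = read_neighbor()             # read the immediate-neighbor page once
    for neighbor_value, vref in conditional_vrefs:
        reread = read_page(vref)            # re-read with the conditional reference
        for i, value in enumerate(neighbors):
            if value == neighbor_value:     # update only cells with this neighbor value
                buffered[i] = reread[i]
        decoded = ecc_decode(buffered)      # apply ECC again
        if decoded is not None:
            return decoded
    return None                             # still uncorrectable

# Toy model: bit 1 = lower Vth state; cells next to a "00" neighbor shifted further up.
TRUTH = [1, 1, 0, 0]
VTH = [2.1, 2.7, 3.1, 3.7]
NEIGHBORS = ["11", "00", "11", "00"]

def toy_read_page(vref):
    return [1 if vth < vref else 0 for vth in VTH]

def toy_ecc_decode(bits):
    # Stand-in ECC with zero correction capability: succeeds only on exact data.
    return list(TRUTH) if bits == TRUTH else None

recovered = read_with_nac(toy_read_page, lambda: NEIGHBORS, toy_ecc_decode,
                          2.5, [("00", 3.2), ("11", 2.6)])
```

In the toy data, the overall reference misreads the second cell, and the first conditional re-read (for neighbor value "00") fixes it.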

30 Lifetime Extension with NAC
[Figure: lifetime across Stage-0, Stage-1, Stage-2, and Stage-3 for an ECC capable of correcting 40 bits per 1k-Byte; labeled values 22%, 33%, and 39%]
33% lifetime improvement at no performance loss

31 Performance Analysis of NAC
No performance loss within nominal lifetime and with reasonable (1%) ECC fail rates

32 More on Neighbor-Assisted Correction
Yu Cai, Gulay Yalcin, Onur Mutlu, Eric Haratsch, Osman Unsal, Adrian Cristal, and Ken Mai, "Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories" Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Austin, TX, June


34 Read Disturb Errors in Flash Memory

35 One Issue: Read Disturb in Flash Memory
All scaled memories are prone to read disturb errors
DRAM
SRAM
Hard Disks: Adjacent Track Interference
NAND Flash

36 NAND Flash Memory Background
[Figure: a flash controller connected to Block 0 (Page 0 to Page 255), Block 1 (Page 256 to Page 511), and so on up to Block N (Page M to Page M+255); one page is marked Read and the other pages in its block are marked Pass]
When we read from one page in a block, we apply a high voltage to all the other pages in the same block. We will explain why later.

37 Flash Cell Array
[Figure: Block X as an array of flash cells arranged in rows and columns, with Page Y as one row and sense amplifiers at the bottom of the columns]
Now, let’s zoom in to a single block of flash memory, which consists of an array of flash pages. Each flash page is written to flash cells located in the same row. Each row of cells is controlled by the same horizontal wire. Each column of cells is connected in series with a sense amplifier at the bottom.

38 Floating Gate Transistor
[Figure: a floating gate transistor (flash cell) with gate, floating gate, source, and drain; Vth = 2.5 V]
Now let’s zoom in again to look at a single flash cell, which can also be called a floating gate transistor.

39 Flash Read
[Figure: two cells with Vth = 2 V and Vth = 3 V, each with Vread = 2.5 V applied to the gate; the Vth = 2 V cell is read as 1]
Now, if we have two flash cells programmed to different threshold voltages, 2 V and 3 V, we can distinguish them with a read voltage of 2.5 V. When we apply 2.5 V to the gate, since the read voltage is higher than the threshold voltage, … Now if we look back at this example, we can see that the threshold voltage of each flash cell actually represents the value being stored.

40 Flash Pass-Through
[Figure: the same two cells (Vth = 2 V and Vth = 3 V) with Vpass = 5 V applied to the gate; both cells are read as 1]
But for the same cells, when we apply a high voltage, any cell is passed through regardless of its threshold voltage. So we call this voltage the pass-through voltage, or Vpass.

41 Read from Flash Cell Array
Page 1: Pass (Vpass = 5.0 V), cell Vth: 3.0V 3.8V 3.9V 4.8V
Page 2: Read (Vread = 2.5 V), cell Vth: 3.5V 2.9V 2.4V 2.1V
Page 3: Pass (Vpass = 5.0 V), cell Vth: 2.2V 4.3V 4.6V 1.8V
Page 4: Pass (Vpass = 5.0 V), cell Vth: 3.5V 2.3V 1.9V 4.3V
Correct values for page 2: 0 0 1 1
Now, let’s look at an example array of flash cells. In this example, we make up threshold voltage numbers for each cell. Let’s say we want to read from the second page in this array. So we apply a read voltage of 2.5 V to the second page. And since all the other pages are connected in series, we apply a high pass-through voltage of 5 V to the other pages. As we can see, the high pass-through voltage switches on all the flash pages being passed through, allowing the sense amplifier to read the values stored in the second page. In this case, the correct values …
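
The read operation in this example can be mimicked with a small model. The sketch assumes an idealized cell that conducts exactly when its gate voltage exceeds its threshold voltage, and a column that reads as 1 only when every cell in the series string conducts; the function name is hypothetical, and the Vth values are the made-up numbers from the slide.

```python
def read_page(vth_array, page, vread=2.5, vpass=5.0):
    """Read one page (row) of a block; each column is a series string of cells."""
    bits = []
    for column in zip(*vth_array):
        # Unselected cells get Vpass and must all conduct for the string to conduct.
        others_conduct = all(vth < vpass
                             for row, vth in enumerate(column) if row != page)
        # The selected cell gets Vread and conducts only if its Vth is below Vread.
        selected_conducts = column[page] < vread
        bits.append(1 if others_conduct and selected_conducts else 0)
    return bits

VTH = [
    [3.0, 3.8, 3.9, 4.8],  # Page 1
    [3.5, 2.9, 2.4, 2.1],  # Page 2
    [2.2, 4.3, 4.6, 1.8],  # Page 3
    [3.5, 2.3, 1.9, 4.3],  # Page 4
]

page2_bits = read_page(VTH, page=1)  # row index 1 is Page 2
```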

42 Read Disturb Problem: “Weak Programming” Effect
Page 1: Pass (5V), cell Vth: 3.0V 3.8V 3.9V 4.8V
Page 2: Pass (5V), cell Vth: 3.5V 2.9V 2.4V 2.1V
Page 3: Read (2.5V), cell Vth: 2.2V 4.3V 4.6V 1.8V
Page 4: Pass (5V), cell Vth: 3.5V 2.3V 1.9V 4.3V
Repeatedly read page 3 (or any page other than page 2)
However, the high pass-through voltage can cause a read disturb problem, which we introduce now. When we want to read page 3 instead of page 2, this high pass-through voltage needs to be applied to page 2. When we repeatedly read page 3, the pass-through voltage induces a weak programming effect on page 2.

43 Read Disturb Problem: “Weak Programming” Effect
Page 1: Vpass = 5.0 V, cell Vth: 3.0V 3.8V 3.9V 4.8V
Page 2: Vread = 2.5 V, cell Vth: 3.5V 2.9V 2.6V (previously 2.4V) 2.1V
Page 3: Vpass = 5.0 V, cell Vth: 2.2V 4.3V 4.6V 1.8V
Page 4: Vpass = 5.0 V, cell Vth: 3.5V 2.3V 1.9V 4.3V
Incorrect values from page 2: 0 0 0 1
High pass-through voltage induces “weak-programming” effect
As the weak programming effect accumulates, the threshold voltage of each flash cell increases, and the threshold voltage state of a cell can be altered. In this case, when we read page 2 again, we read incorrect values from page 2: 0001. Flash vendors conservatively set this pass-through voltage to a high voltage. However, we find in the paper that it is unnecessary to set it this high, and that reducing this voltage by just a bit will reduce read disturb errors significantly in the future.

44 Executive Summary [DSN’15]
Read disturb errors limit flash memory lifetime today
  Apply a high pass-through voltage (Vpass) to multiple pages on a read
  Repeated application of Vpass can alter stored values in unread pages
We characterize read disturb on real NAND flash chips
  Slightly lowering Vpass greatly reduces read disturb errors
  Some flash cells are more prone to read disturb
Technique 1: Mitigate read disturb errors online
  Vpass Tuning dynamically finds and applies a lowered Vpass per block
  Flash memory lifetime improves by 21%
Technique 2: Recover after failure to prevent data loss
  Read Disturb Oriented Error Recovery (RDR) selectively corrects cells more susceptible to read disturb errors
  Reduces raw bit error rate (RBER) by up to 36%
When we read data from one page, we have to apply a high voltage to multiple other flash pages that are connected to this page, and we call this the pass-through voltage. Over time, this high pass-through voltage can alter the values stored in unread flash pages. We call this a read disturb error. For the first time in open literature, we characterize read disturb on real NAND flash chips, and show that read disturb errors exist in today’s flash chips and are expected to increase in the future. We make two key observations from the characterization. Using our characterization, we develop two techniques that can help flash memory tolerate such read disturb errors.

45 Reducing The Pass-Through Voltage
Key Observation 1: Slightly lowering Vpass greatly reduces read disturb errors
Now let’s look at the effect of reducing the pass-through voltage because, as we introduced earlier, a high pass-through voltage is the main cause of read disturb errors. We set up an experiment that emulates a reduced pass-through voltage. We assume an acceptable raw bit error rate of 10^-3, which is a typical value for flash memory, and we record the tolerable read disturb count before the raw bit error rate exceeds this value. In this figure, the x axis is … y axis is … We normalize the tolerable read disturb count to that of the default pass-through voltage, which is 0% on this figure. As we can see, as we reduce the pass-through voltage by up to 6%, the tolerable read disturb count increases significantly, by up to 1300x.

46 Outline
Background (Problem and Goal)
Key Experimental Observations
Mitigation: Vpass Tuning
Recovery: Read Disturb Oriented Error Recovery
Conclusion
Now that we have developed an understanding of the read disturb phenomenon in flash memory, let’s look at how we exploit these observations to mitigate read disturb errors.

47 Read Disturb Mitigation: Vpass Tuning
Key Idea: Dynamically find and apply a lowered Vpass
Trade-off for lowering Vpass
  Allows more read disturbs
  Induces more read errors
To mitigate read disturb errors, we propose Vpass Tuning, whose key idea is to dynamically find and apply a lowered pass-through voltage. However, lowering the pass-through voltage presents a trade-off. First, as we have just described, lowering the pass-through voltage by a little significantly reduces the read disturb effect. However, lowering Vpass also induces more read errors, which we will introduce next.

48 Read Errors Induced by Vpass Reduction
Reducing Vpass to 4.9V
Page 1: Vpass = 4.9 V, cell Vth: 3.0V 3.8V 3.9V 4.8V
Page 2: Vread = 2.5 V, cell Vth: 3.5V 2.9V 2.4V 2.1V
Page 3: Vpass = 4.9 V, cell Vth: 2.2V 4.3V 4.6V 1.8V
Page 4: Vpass = 4.9 V, cell Vth: 3.5V 2.3V 1.9V 4.3V
Values read from page 2: 0 0 1 1
If we only reduce the pass-through voltage by a little, very few read errors will be generated. In this example, we reduce Vpass to 4.9V, which is higher than the threshold voltage of any cell that we want to pass through.

49 Read Errors Induced by Vpass Reduction
Reducing Vpass to 4.7V
Page 1: Vpass = 4.7 V, cell Vth: 3.0V 3.8V 3.9V 4.8V
Page 2: Vread = 2.5 V, cell Vth: 3.5V 2.9V 2.4V 2.1V
Page 3: Vpass = 4.7 V, cell Vth: 2.2V 4.3V 4.6V 1.8V
Page 4: Vpass = 4.7 V, cell Vth: 3.5V 2.3V 1.9V 4.3V
Incorrect values from page 2: 0 0 1 0
However, as we further reduce the pass-through voltage to 4.7V, some cells with a higher threshold voltage cannot be correctly passed through, and thus generate a read error. In this case, incorrect values are being read from page 2.
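
A small sweep over Vpass makes the two examples concrete. This sketch uses the same idealized model as the slides (a cell conducts when its gate voltage exceeds its Vth, and a blocked string reads as 0); the function name is hypothetical.

```python
def count_read_errors(vth_array, page, vpass, vread=2.5):
    """Count bits of the selected page that flip because a pass-through cell blocks."""
    errors = 0
    for column in zip(*vth_array):
        blocked = any(vth >= vpass
                      for row, vth in enumerate(column) if row != page)
        # A blocked string reads as 0, which is wrong only if the stored value is 1.
        if blocked and column[page] < vread:
            errors += 1
    return errors

VTH = [
    [3.0, 3.8, 3.9, 4.8],  # Page 1
    [3.5, 2.9, 2.4, 2.1],  # Page 2 (the page being read)
    [2.2, 4.3, 4.6, 1.8],  # Page 3
    [3.5, 2.3, 1.9, 4.3],  # Page 4
]

errors_by_vpass = {v: count_read_errors(VTH, page=1, vpass=v) for v in (5.0, 4.9, 4.7)}
```

In this model, the 4.8 V cell in page 1 is the only cell that fails to pass at Vpass = 4.7 V, so exactly one bit of page 2 is misread.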

50 Utilizing the Unused ECC Capability
[Figure: RBER (×10^-3, axis from 0.2 to 1.0) versus N-day retention, with the ECC correction capability marked and the gap above the RBER curve labeled as unused ECC capability]
ECC provisioned for high retention “age”
2. Unused ECC capability can be used to fix read errors
3. Unused ECC capability decreases over retention age
Dynamically adjust Vpass so that read errors fully utilize the unused ECC capability
However, these read errors can be tolerated by the unused ECC capability when the error rate is low, allowing us to reduce read disturb errors when the error rate is high. In this figure, we plot how the raw bit error rate increases over retention age, the time since the data was last programmed. The x-axis is … y-axis is … Over time, flash retention errors and read disturb errors accumulate, so the error rate keeps increasing. The red line shows … As we can see, at an early retention age, there is a large unused ECC correction capability that can be used to tolerate read errors. Then over time, the unused ECC capability decreases, so we need to readjust the Vpass reduction to reduce read errors. This leads to our Vpass tuning technique that …

51 Vpass Reduction Trade-Off Summary
Today: Conservatively set Vpass to a high voltage
  Accumulates more read disturb errors at the end of each refresh interval
  No read errors
Idea: Dynamically adjust Vpass to unused ECC capability
  Minimize read disturb errors
  Control read errors to be tolerable by ECC
  If read errors exceed ECC capability, read again with a higher Vpass to correct read errors
Now, let’s summarize and compare the trade-offs for Vpass reduction. If we conservatively …, which is what flash vendors do today, we accumulate lots of read disturb errors at the end of each refresh interval, but we do not incur read errors. If we dynamically …, we can minimize read disturb errors and, in the meantime, control the number of read errors to be tolerable by ECC. Even if we tune Vpass so aggressively that the read errors …

52 Vpass Tuning Steps
Perform once for each block every day:
1. Estimate unused ECC capability (using retention age)
2. Aggressively reduce Vpass until read errors exceed ECC capability
3. Gradually increase Vpass until read errors become just less than ECC capability
Here, we list the detailed steps for pass-through voltage tuning. These steps are performed once for each block every day. Using these steps, we can maximize the utilization of the unused ECC capability such that read disturb errors are minimized when the error rate is high.
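
The three tuning steps could look roughly like the sketch below. The step sizes, the voltage floor, the way unused ECC capability is estimated, and the `count_read_errors` callback are all assumptions made for illustration; voltages are in millivolts to keep the arithmetic exact.

```python
def estimate_unused_ecc(ecc_capability, expected_retention_errors):
    """Step 1 (hypothetical model): capability left after expected retention errors."""
    return max(0, ecc_capability - expected_retention_errors)

def tune_vpass(count_read_errors, unused_ecc, default_mv=5000,
               coarse_mv=400, fine_mv=100, floor_mv=3000):
    """Return a lowered Vpass (mV) whose read errors fit in the unused ECC capability."""
    vpass = default_mv
    # Step 2: aggressively reduce Vpass until read errors exceed the unused capability.
    while vpass - coarse_mv >= floor_mv and count_read_errors(vpass) <= unused_ecc:
        vpass -= coarse_mv
    # Step 3: gradually increase Vpass until read errors fit within it again.
    while vpass < default_mv and count_read_errors(vpass) > unused_ecc:
        vpass = min(default_mv, vpass + fine_mv)
    return vpass

# Toy block: Vth (mV) of the cells that must be passed through on a read.
PASS_CELL_VTH_MV = [4800, 4600, 4300, 4300, 3900, 3800, 3500, 3000]

def toy_count_read_errors(vpass_mv):
    # A cell whose Vth is not below Vpass blocks its string and causes a read error.
    return sum(vth >= vpass_mv for vth in PASS_CELL_VTH_MV)

unused = estimate_unused_ecc(ecc_capability=40, expected_retention_errors=38)
tuned_vpass_mv = tune_vpass(toy_count_read_errors, unused)
```

With this toy block, the coarse search overshoots to 4200 mV and the fine search settles on 4400 mV, where two read errors exactly fit the two unused ECC bits.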

53 Evaluation of Vpass Tuning
19 real workload I/O traces
Assume 7-day refresh period
Similar methodology as before to determine acceptable Vpass reduction
Overhead for a 512 GB flash drive:
  128 KB storage overhead for per-block Vpass setting and worst-case page
  24.34 sec/day average Vpass Tuning overhead
… which can be hidden by performing in the background.

54 Vpass Tuning Lifetime Improvements
This figure shows the Vpass tuning lifetime improvements. The length of each bar shows the PE cycle lifetime for each configuration. The blue bars show our baseline system without Vpass tuning. The red bars show the lifetime after applying Vpass tuning for each workload. On average, Vpass tuning improves lifetime by 21%. Average lifetime improvement: 21.0%

55 Read Disturb Prone vs. Resistant Cells
[Figure: PDF over normalized Vth showing a disturb-resistant cell (R) and a disturb-prone cell (P) before and after N read disturbs]
RDR exploits each cell’s read disturb resistance. Because of process variation, each cell can be classified as either disturb-resistant or disturb-prone. After the same number of read disturbs, the threshold voltage of a disturb-resistant cell does not change much, but the threshold voltage change of a disturb-prone cell is high.

56 Observation 2: Some Flash Cells Are More Prone to Read Disturb
After 250K read disturbs:
  Disturb-prone cells have higher threshold voltages
  Disturb-resistant cells have lower threshold voltages
[Figure: PDF over normalized Vth of the ER and P1 states with disturb-prone (P) and disturb-resistant (R) cells marked; near the boundary lie the disturb-prone ER state cells and the disturb-resistant P1 state cells]
As our fourth observation, we find that some flash cells are more prone to read disturb. Now let’s look at an example of the erased and P1 states. With no read disturb, the threshold voltages of disturb-prone and disturb-resistant cells are randomly distributed. After a significant amount of read disturb, the cells redistribute. At this point, we can see that disturb-prone cells have higher threshold voltages within each state, and disturb-resistant cells have lower threshold voltages. Now, if we look at the cells susceptible to errors, which are the cells with threshold voltages close to the boundary, the disturb-prone cells are from the erased state and the disturb-resistant cells are from the P1 state.

57 Read Disturb Oriented Error Recovery (RDR)
Triggered by an uncorrectable flash error
Back up all valid data in the faulty block
Disturb the faulty page 100K times (more)
Compare Vth’s before and after read disturb
Select cells susceptible to flash errors (Vref−σ<Vth<Vref+σ)
Predict among these susceptible cells
  Cells with more Vth shifts are disturb-prone → Lower Vth state
  Cells with less Vth shifts are disturb-resistant → Higher Vth state
Reduces total error count by up to 36% at 1M read disturbs
ECC can be used to correct the remaining errors
Based on this finding, we propose read disturb oriented error recovery, which identifies disturb-prone and disturb-resistant cells after a failure has happened in order to recover from read disturb errors. After an uncorrectable flash error, the steps above are triggered.
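
The selection and prediction steps can be sketched as follows. The sketch assumes per-cell Vth readings taken before and after the extra read disturbs, applies the susceptibility window to the earlier reading, and separates "more" from "less" shift with a fixed threshold; the function name, the threshold, and the sample values are hypothetical.

```python
def rdr_predict_states(vth_before, vth_after, vref, sigma, shift_threshold):
    """Predict the original state of cells whose Vth lies near the read reference.

    Returns {cell_index: "lower" | "higher"} for susceptible cells only.
    """
    predictions = {}
    for i, (before, after) in enumerate(zip(vth_before, vth_after)):
        if not (vref - sigma < before < vref + sigma):
            continue  # far from the boundary: leave the cell to the normal read
        shift = after - before
        # Large shift: disturb-prone, so it likely drifted up from the lower state.
        # Small shift: disturb-resistant, so it likely belongs to the higher state.
        predictions[i] = "lower" if shift > shift_threshold else "higher"
    return predictions

# Made-up readings (volts) before and after the additional read disturbs.
before = [2.45, 2.55, 1.50, 2.48]
after = [2.60, 2.57, 1.60, 2.50]

predicted = rdr_predict_states(before, after, vref=2.5, sigma=0.1, shift_threshold=0.05)
```

Cells left out of the result, and any cells the prediction gets wrong, are left for ECC to correct.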

58 RDR Evaluation
[Figure: RBER (×10^-3, axis from 2 to 12) versus read disturb count (0.2M to 1M), with one curve for No Recovery and one for RDR]
Reduces total error counts by up to 36% at 1M read disturbs
ECC can be used to correct the remaining errors
Finally, we evaluate RDR from 0 to 1M read disturbs. The x-axis shows the read disturb count, and the y-axis shows the raw bit error rate. The blue curve shows the RBER without RDR, and the dotted red curve shows the RBER with RDR. On average, RDR reduces RBER by 36% and the ECC can be used to correct the remaining errors.

59 More on Flash Read Disturb Errors [DSN’15]
Yu Cai, Yixin Luo, Saugata Ghose, Erich F. Haratsch, Ken Mai, and Onur Mutlu, "Read Disturb Errors in MLC NAND Flash Memory: Characterization and Mitigation" Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Rio de Janeiro, Brazil, June 2015.


61 Data Retention in Flash Memory

62 Characterize, Optimize, Recover
Characterize retention loss in real NAND chip
Optimize read performance for old data
Recover old data after failure
Today, I will teach you something new about retention in flash memory through characterization of real NAND chips. I will also show you an interesting way to optimize read performance when reading old data. Finally, I will show you an effective technique to recover data from a failed flash page.

63 Read performance degradation
Old files: as slow as 30 MB/s
Newly-written files: 500 MB/s
Now let’s start with a news report I downloaded from TechSpot. What they observe is that …
Reference: (May 5, 2015) Per Hansson, “When SSD Performance Goes Awry”

64 Why is old data slower? Retention loss!

65 Retention loss
Charge leakage over time
[Figure: flash cells losing charge over time until a retention error occurs]
One dominant source of flash memory errors [DATE ‘12, ICCD ‘12]
Side effect: Longer read latency
Retention loss is the phenomenon of charge leakage over time. A flash memory cell stores data as the amount of charge in its floating gate. Unfortunately, the more charge it holds, the faster it leaks charge. As it leaks more and more charge, the data stored in the cell can be altered, resulting in a retention error. Prior works have shown that retention error is one dominant source of flash memory errors. In the meantime, retention loss also has a side effect, which is longer read latency. Now let’s talk about how we get this side effect.

66 Multi-Level Cell (MLC) threshold voltage distribution
[Figure: PDF over normalized Vth of the four states Erased (11), P1 (10), P2 (00), and P3 (01), separated by the read reference voltages Va, Vb, and Vc]
The story starts from the threshold voltage distribution. This is a figure of the threshold voltage distribution, which shows the probability density function of the threshold voltage for each multi-level cell. Multi-level cells use 4 threshold voltage states to represent 2 bits. Due to program variation, the threshold voltage of each state is distributed across a range, which looks like a hump. We use the read reference voltages Va, Vb, and Vc to read the bit values. Note that in the rest of our talk, we will not show the erased state, because we are not able to measure the threshold voltage for the erased state.

67 Experimental Testing Platform
[Figure: HAPS-52 mother board with a Virtex-V FPGA (NAND controller), a USB daughter board with a Virtex-II Pro (USB controller) and a USB jack, and a NAND daughter board with 3x-nm NAND flash]
In order to characterize the distribution, we use an experimental testing platform, which is shown in this figure. This is an FPGA-based platform to test 2x-nm MLC raw NAND chips. We have used this in many prior works including this one.
[Cai+, FCCM 2011, DATE 2012, ICCD 2012, DATE 2013, ITJ 2013, ICCD 2013, SIGMETRICS 2014, DSN 2015, HPCA 2015]
Cai et al., FPGA-based Solid-State Drive prototyping platform, FCCM 2011.

68 Characterized threshold voltage distribution
[Figure: measured Vth distributions of the P1, P2, and P3 states at 0-day and 40-day retention]
Finding: Cell’s threshold voltage decreases over time
This figure shows the real characterization of the threshold voltage distribution using the platform. The blue line shows the distribution 0 days after programming. The black line shows the distribution 40 days after programming. The key takeaway from this figure is the finding that … We can see this from …

69 Threshold voltage reduces over time
[Figure: PDF over normalized Vth of P1 (10), P2 (00), and P3 (01) for new data and old data, with the axis running from less charge to more charge]
Now let’s explain this effect a bit more. A higher threshold voltage represents more charge in the floating gate, and a lower threshold voltage represents less charge. As data becomes older, electric charge leaks over time, so the threshold voltage decreases gradually, as we have seen in the real characterization. But the flash controller is unaware of this change, so it will apply the default read reference voltage.

70 First read attempt fails
[Figure: PDF over normalized Vth of P1 (10), P2 (00), and P3 (01) for old data, read at the default reference voltages Vb and Vc]
Raw bit errors > ECC correctable errors
So it will still use the default read reference voltages, which are not a good fit for the shifted distribution. So we get lots of raw bit errors. For old data, raw bit errors exceed the ECC correctable errors, and so we have to read again.

71 Read-retry
[Figure: PDF over normalized Vth for old data; the read reference voltages move from Vb and Vc to Vb’ and Vc’]
Fewer raw bit errors
Increase read latency
After the first read to old data fails, multiple read retries will be attempted to reduce the raw bit errors until a read succeeds. Note that all these read attempts and ECC checks are processed sequentially, so the overall read latency becomes much longer and read bandwidth is significantly reduced.

72 Why is old data slower?
Retention loss → Leak charge over time → Generate retention errors → Require read-retry → Longer read latency

73 Characterize, Optimize, Recover
Characterize retention loss in real NAND chip
Optimize read performance for old data
Recover old data after failure

74 The ideal read voltage
OPT: Optimal read reference voltage → minimal read latency
[Figure: PDF over normalized Vth of P1 (10), P2 (00), and P3 (01) for old data, with OPTb and OPTc at the intersections of neighboring distributions; minimal raw bit errors]
Ideally, to read the data with the shortest latency, we can apply OPT, which is at the exact intersection of two neighboring distributions. OPT by definition achieves the minimal raw bit errors and does not need read-retry, so it achieves minimal read latency. So we say OPT is the ideal read voltage.

75 In reality
OPT changes over time due to retention loss
Luckily, OPT change is:
  Gradual
  Uni-directional (decreases over time)
But in reality, OPT changes over time due to retention loss and many other factors, so it is not easy to predict the value of OPT accurately. Luckily, OPT changes gradually, so it does not change much in a day. Also, OPT change is uni-directional. In fact, it decreases monotonically over time. As we will show next, these two properties become useful when we design techniques to reduce latency.

76 Retention Optimized Reading (ROR)
Components:
1. Online pre-optimization algorithm
  Learns and records OPT
  Performs in the background once every day
2. Simpler read-retry technique
  If recorded OPT is out-of-date, read-retry with lower voltage
The technique we developed to reduce read latency is called retention optimized reading, or ROR, which is composed of two major components that coordinate with each other. The first component is an online pre-optimization algorithm. This algorithm periodically learns OPT. The second component is an improved read-retry technique. This technique utilizes the recorded OPT to minimize read retries. We next explain each component in turn. Note that this pre-optimization is performed in the background once every day to have minimal performance impact. And the read-retry only needs to find a read that succeeds, and is now triggered much less frequently.

77 1. Online Pre-Optimization Algorithm
Triggered periodically (e.g., per day)
Find and record an OPT as per-block Vpred
Performed in background
Small storage overhead
[Figure: PDF over normalized Vth showing the old Vpred and the new Vpred]
We next explain each component in detail. The online pre-optimization algorithm is triggered periodically. For each block, it finds and records an upper bound for OPT within a time period. The algorithm itself is very simple. It starts with the old starting read reference voltage. Then it progressively increases or decreases the read reference voltage until it achieves a minimal raw bit error rate. Note that this algorithm is performed in the background to have minimal performance impact.
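
One way to picture the pre-optimization search is a simple descent on the measured raw bit error rate, starting from the previously recorded value. The sketch below assumes a `measure_rber(vref)` callback and integer voltage steps; both, along with the toy error curve, are hypothetical and not taken from the paper.

```python
def pre_optimize(measure_rber, old_vpred, step=1, max_steps=64):
    """Move the read reference up or down from old_vpred until RBER stops improving."""
    vref = old_vpred
    rber = measure_rber(vref)
    for _ in range(max_steps):
        # Try one step in each direction and keep the better neighbor.
        neighbor_rber, neighbor = min((measure_rber(vref - step), vref - step),
                                      (measure_rber(vref + step), vref + step))
        if neighbor_rber >= rber:
            break                      # local minimum reached
        vref, rber = neighbor, neighbor_rber
    return vref

# Toy RBER curve with its minimum at 240 (arbitrary voltage units).
def toy_rber(vref):
    return 1e-4 + 1e-6 * (vref - 240) ** 2

vpred_per_block = {0: 250}                          # previously recorded Vpred
vpred_per_block[0] = pre_optimize(toy_rber, vpred_per_block[0])
```

Because OPT changes gradually, starting from the old value keeps the number of measurement reads small in this sketch.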

78 2. Improved Read-Retry Technique
Performed as normal read
Vpred already close to actual OPT
Decrease Vref if Vpred fails, and retry
[Figure: PDF over normalized Vth with Vpred very close to OPT]
The second component of ROR is an improved read-retry technique. This technique is performed as a normal read operation. Since we already have a starting read reference voltage that is close to OPT, even if the actual OPT shifts over retention age, we only need to decrease the read reference voltage for a few read-retries to successfully read the data.
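
The retry loop can be sketched as below, relying on the earlier observation that OPT only decreases over time. The `read_and_decode(vref)` callback, the step size, and the retry limit are hypothetical.

```python
def read_with_ror(read_and_decode, vpred, step=1, max_retries=8):
    """Start at the recorded Vpred and retry with a lower Vref until ECC succeeds.

    Returns (data, retries); data is None if every attempt fails.
    """
    vref = vpred
    for retries in range(max_retries + 1):
        data = read_and_decode(vref)   # one flash read plus one ECC decode
        if data is not None:
            return data, retries
        vref -= step                   # OPT only moves down, so only search downward
    return None, max_retries

# Toy page: ECC succeeds when Vref is within 2 units of the true OPT (236).
def toy_read_and_decode(vref):
    return "data" if abs(vref - 236) <= 2 else None

result = read_with_ror(toy_read_and_decode, vpred=240)
```

Here the recorded Vpred is slightly out of date, and the read succeeds after two retries.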

79 ROR result (fast) Now let’s look at the result.
Blue bars are the baseline, red bars are ROR, our proposal. The first two bars show that ROR reduces the read-retry count to 30% of the baseline. The next two bars show that ROR also reduces ECC decoding latency due to fewer retention errors. The last two bars show that the total latency is reduced to 29% of the baseline, which is less than a third.

80 Retention optimized reading
Retention loss → longer read latency Optimal read reference voltage (OPT) → Shortest read latency → Decreases gradually over time (retention) → Learn OPT periodically → Minimize read-retry & RBER → Shorter read latency (fast)

81 Characterize Optimize Recover retention loss in real NAND chip
read performance for old data Next, let’s look at an even more challenging and of course more interesting problem. Recover old data after failure

82 Retention failure Old data Very old data PDF P1 (10) P2 (00) P3 (01)
OPTc OPTb (natural) With ROR, we can quickly read old data with low RBER, which we optimistically assume is always correctable. But when the data becomes very old, retention loss shifts the threshold voltage even further and results in many more errors than before. At some point they become uncorrectable errors even when we use the optimal read reference voltage. In this case, the read to this page fails, so we call this a retention failure. What we want to do next is to recover the data after such a retention failure happens. Normalized Vth Uncorrectable errors

83 Leakage speed variation
PDF N-day retention Slow-leaking cell S S Fast-leaking cell F F N-day retention (natural) At a very high level, what we do is relate a property called leakage speed variation to the cell’s correct state, in order to correct a majority of retention errors before feeding the data to the ECC decoder. Now let me elaborate on how we did that. Due to process variation, there are two types of cells. Slow-leaking cells leak charge slowly over retention time. Fast-leaking cells leak charge quickly over time. Now how can we correlate this property with a cell’s correct state? Normalized Vth

84 A simplified example PDF P2 P3 S S F F F F S S Normalized Vth
Let’s look at a simplified example of P2 and P3 state distribution. Initially, due to program variation, the fast and slow leaking cells are randomly programmed into different threshold voltage levels within each state. F F F F S S Normalized Vth

85 Reading very old data Fast-leaking cells have lower Vth
PDF Slow-leaking cells have higher Vth P2 P3 S S But as the data becomes old, fast-leaking cells naturally shift to lower threshold voltages within each state, and slow-leaking cells are left behind in higher voltages ranges within each state. But at this point, we are still missing something to get the correct values. Because for example, F F F F F F F F S S Normalized Vth

86 “Risky” cells OPT Key Formula OPT–σ + S = P2 PDF OPT+σ Risky cells
So we look at “risky” cells, which are defined as cells with threshold voltages close to OPT. Now we can see the correlation: (slow) risky and slow-leaking cells have P2 state, risky and fast-leaking cells have P3 state. Now if we apply this key formula to the cells, we are able to correct these retention errors. (Pause) F F S Normalized Vth Uncorrectable errors

87 Retention Failure Recovery (RFR)
Key idea: Guess original state of the cell from its leakage speed property Three steps Identify risky cells Identify fast-/slow-leaking cells Guess original states Key Formula + S = P2 Risky cells + F = P3 (fast) The technique we propose is retention failure recovery, RFR, whose key idea is to guess original state of the cell from its leakage speed property. This technique is composed of three steps.
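The three steps can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: leakage speed is estimated by comparing each cell's threshold voltage at failure time with a second reading taken after additional retention time, and `shift_threshold` (the cut between slow and fast leakers) is a hypothetical parameter.

```python
def rfr_guess_states(vth_at_failure, vth_later, opt, sigma, shift_threshold):
    """Guess the original state (P2 or P3) of risky cells.

    vth_at_failure, vth_later: per-cell threshold voltages read when the
    failure is detected and again after additional retention time.
    shift_threshold is an assumed cut between slow and fast leakers.
    """
    guesses = {}
    for cell, v_old in enumerate(vth_at_failure):
        # Step 1: risky cells lie within sigma of OPT.
        if abs(v_old - opt) > sigma:
            continue
        # Step 2: a large downward shift marks a fast-leaking cell.
        shift = v_old - vth_later[cell]
        fast_leaking = shift > shift_threshold
        # Step 3: key formula. Risky + fast = P3, risky + slow = P2.
        guesses[cell] = "P3" if fast_leaking else "P2"
    return guesses
```

Cells outside the risky window are left to the normal read path, since their state is unambiguous at OPT.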

88 RFR Evaluation Expect to eliminate 50% of raw bit errors
ECC can correct remaining errors Program with random data 28 days Detect failure, backup data (fast) We evaluate our RFR technique based on our characterization. We assume that the faulty page is originally programmed with random data. After 28 days, we detect the failure and back up the data. After 12 additional days, we perform RFR to recover the data. On average, RFR reduces RBER by 50%, which means that it is expected to eliminate 50% of the original raw bit errors. At this point, ECC code has a much higher chance to correct the remaining errors to recover the data. 12 addt’l. days Recover data

89 Characterize Optimize Recover retention loss in real NAND chip
read performance for old data Now let me conclude Recover old data after failure

90 Conclusion Retention loss → Longer read latency Retention optimized reading (ROR) → Learns OPT periodically → 71% shorter read latency Retention failure recovery (RFR) → Use leakage property to guess correct state → 50% error reduction before ECC correction → Recover data after failure (very fast) We have walked through how retention loss leads to longer read latency. And we have talked about how we reduce read latency with retention optimized reading. And we have also introduced retention failure recovery, which recovers data after failure.

91 More on Flash Data Retention
Yu Cai, Yixin Luo, Erich F. Haratsch, Ken Mai, and Onur Mutlu, "Data Retention in MLC NAND Flash Memory: Characterization, Optimization and Recovery" Proceedings of the 21st International Symposium on High-Performance Computer Architecture (HPCA), Bay Area, CA, February 2015. [Slides (pptx) (pdf)]

92 Agenda Background, Motivation and Approach
Experimental Characterization Methodology Error Analysis and Management Main Characterization Results Retention-Aware Error Management Threshold Voltage and Program Interference Analysis Read Reference Voltage Prediction Neighbor-Assisted Error Correction Read Disturb Error Handling Retention Error Handling Large Scale Field Analysis 3D NAND Flash Memory Reliability Summary

93 Large Scale Field Analysis of Flash Memory Errors

94 SSD Error Analysis of Facebook Systems
First large-scale field study of flash memory errors Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu, "A Large-Scale Study of Flash Memory Errors in the Field" Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Portland, OR, June 2015. [Slides (pptx) (pdf)] [Coverage at ZDNet] [Coverage on The Register] [Coverage on TechSpot] [Coverage on The Tech Report]

95 A few SSDs cause most errors

97 New reliability trends
Summary SSD lifecycle New reliability trends Access pattern dependence Read disturbance Temperature

98 New reliability trends
Summary SSD lifecycle New reliability trends Early detection lifecycle period distinct from hard disk drive lifecycle. Access pattern dependence Read disturbance Temperature

99 New reliability trends
SSD lifecycle New reliability trends Access pattern dependence Read disturbance Temperature

100 Storage lifecycle background: the bathtub curve for disk drives
Failure rate Usage [Schroeder+,FAST'07]

101 Storage lifecycle background: the bathtub curve for disk drives
Early failure period Wearout period Failure rate Useful life period Usage [Schroeder+,FAST'07]

102 Do SSDs display similar lifecycle periods?
Storage lifecycle background: the bathtub curve for disk drives Early failure period Wearout period Do SSDs display similar lifecycle periods? Failure rate Useful life period Usage [Schroeder+,FAST'07]

103 Use data written to flash to examine SSD lifecycle
(time-independent utilization metric) We use data written to flash to examine SSD lifecycle trends. This is because data written to flash is a time-independent metric for SSD utilization.

104 [Figure: x-axis: Data written (TB), ticks at 40 and 80; series: 720GB, 1 SSD and 720GB, 2 SSDs]

105 [Figure: x-axis: Data written (TB), ticks at 40 and 80; series: 720GB, 1 SSD and 720GB, 2 SSDs; annotated periods: early failure period, useful life period, wearout period]

106 [Figure: x-axis: Data written (TB), ticks at 40 and 80; series: 720GB, 1 SSD and 720GB, 2 SSDs; annotated periods: early detection period, early failure period, useful life period, wearout period]

107 New reliability trends
SSD lifecycle New reliability trends Early detection lifecycle period distinct from hard disk drive lifecycle. Access pattern dependence Read disturbance Temperature

108 New reliability trends
SSD lifecycle New reliability trends Access pattern dependence Read disturbance Temperature

109

110 Temperature sensor

111 720GB, 1 SSD 720GB, 2 SSDs

112 High temperature: may throttle or shut down

113 1.2TB, 1 SSD 3.2TB, 1 SSD Throttling

114 New reliability trends
SSD lifecycle New reliability trends Access pattern dependence Throttling SSD usage helps mitigate temperature-induced errors. Read disturbance Temperature

115 New reliability trends
Summary SSD lifecycle New reliability trends We do not observe the effects of read disturbance errors in the field. Access pattern dependence Read disturbance Temperature

116 New reliability trends
Summary SSD lifecycle New reliability trends Access pattern dependence Throttling SSD usage helps mitigate temperature-induced errors. Read disturbance Temperature

117 New reliability trends
Summary SSD lifecycle New reliability trends We quantify the effects of the page cache and write amplification in the field. Access pattern dependence Read disturbance Temperature

118 Large-Scale SSD Error Analysis [SIGMETRICS’15]
First large-scale field study of flash memory errors Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu, "A Large-Scale Study of Flash Memory Errors in the Field" Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Portland, OR, June 2015. [Slides (pptx) (pdf)] [Coverage at ZDNet] [Coverage on The Register] [Coverage on TechSpot] [Coverage on The Tech Report]

119 Other Works on NAND Flash Memory Modeling & Issues

120 Flash Memory Programming Vulnerabilities
Yu Cai, Saugata Ghose, Yixin Luo, Ken Mai, Onur Mutlu, and Erich F. Haratsch, "Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques" Proceedings of the 23rd International Symposium on High-Performance Computer Architecture (HPCA) Industrial Session, Austin, TX, USA, February 2017. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)]

121 Accurate and Online Channel Modeling
Yixin Luo, Saugata Ghose, Yu Cai, Erich F. Haratsch, and Onur Mutlu, "Enabling Accurate and Practical Online Flash Channel Modeling for Modern MLC NAND Flash Memory" to appear in IEEE Journal on Selected Areas in Communications (JSAC), 2016.

122 Agenda Background, Motivation and Approach
Experimental Characterization Methodology Error Analysis and Management Main Characterization Results Retention-Aware Error Management Threshold Voltage and Program Interference Analysis Read Reference Voltage Prediction Neighbor-Assisted Error Correction Read Disturb Error Handling Retention Error Handling Large Scale Field Analysis 3D NAND Flash Memory Reliability Summary

123 3D NAND Flash Memory

124 3D NAND Flash Reliability I [HPCA’18]
Yixin Luo, Saugata Ghose, Yu Cai, Erich F. Haratsch, and Onur Mutlu, "HeatWatch: Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature-Awareness"  Proceedings of the 24th International Symposium on High-Performance Computer Architecture (HPCA), Vienna, Austria, February 2018.  [Lightning Talk Video]  [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)]

125 3D NAND Flash Reliability II [SIGMETRICS’18]
Yixin Luo, Saugata Ghose, Yu Cai, Erich F. Haratsch, and Onur Mutlu, "Improving 3D NAND Flash Memory Lifetime by Tolerating Early Retention Loss and Process Variation"  Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Irvine, CA, USA, June 2018.  [Abstract]

126 NAND Flash Memory Lifetime Problem
Gen N+1 ECC Limit Gen N Lifetime ECC Limit Raw Bit Error Rate (RBER) Lifetime Flash lifetime decreases in each generation despite increased ECC strength NAND flash memory is a type of storage medium widely used in today’s SSDs. NAND flash memory wears out as we write more data to it. This means that the raw bit error rate (on the y-axis) increases as the program/erase cycle count (on the x-axis) increases, as we show in this figure. These errors are tolerated by ECC, which can correct raw bit errors up to a certain limit. When RBER goes beyond this limit, the flash lifetime ends. In each new generation, we trade off reliability for higher storage density. So flash lifetime decreases despite the increased ECC strength and ECC limit. Wearout (Program/Erase Cycles, or PEC)
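The relationship between RBER growth, the ECC limit, and lifetime can be sketched in a few lines. The RBER curves and limits below are made-up numbers chosen only to reproduce the trend on the slide; they are not measurements.

```python
def lifetime_pec(rber_at, ecc_limit, max_pec=100_000):
    """Largest P/E cycle count whose RBER is still within the ECC limit."""
    last_ok = 0
    for pec in range(max_pec + 1):
        if rber_at(pec) > ecc_limit:
            break  # ECC can no longer correct the raw bit errors
        last_ok = pec
    return last_ok

# Hypothetical generations: Gen N+1 has a stronger ECC (higher limit)
# but its RBER starts higher and grows faster with wearout.
gen_n = lambda pec: 1e-4 + 1e-7 * pec
gen_n1 = lambda pec: 2e-4 + 3e-7 * pec

lifetime_n = lifetime_pec(gen_n, ecc_limit=1.0e-3)
lifetime_n1 = lifetime_pec(gen_n1, ecc_limit=1.5e-3)
```

With these assumed curves, the newer generation reaches its (higher) ECC limit at fewer P/E cycles, which is the effect the figure illustrates.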

127 Planar vs. 3D NAND Flash Memory
Planar NAND Flash Memory To further increase storage density, manufacturers developed 3D NAND technology, which stacks dozens of layers of flash memory on top of each other. This new technology changes a lot of designs inside flash memory, from circuit-level design to flash cell organization. But at this point, due to its recent development, the implication of 3D NAND technology on flash reliability is not well studied. Reduce flash cell size, Reduce distance b/w cells Scaling Increase # of layers Not well studied! Reliability Scaling hurts reliability

128 Charge Trap Based 3D Flash Cell
Cross-section of a charge trap transistor

129 2D vs. 3D Flash Cell Design
[Figure: 2D floating-gate cell (control gate, gate oxide, floating gate (conductor), tunnel oxide, substrate with source and drain) next to 3D charge-trap cell (control gate, gate oxide, charge trap (insulator), tunnel oxide, substrate with source and drain)]

130 3D NAND Flash Memory Organization

131 More Background and State-of-the-Art
Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo, and Onur Mutlu, "Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery" Invited Book Chapter in Inside Solid State Drives, 2018.  [Preliminary arxiv.org version] 

132 3D vs. Planar NAND Errors: Comparison

133 Improving 3D NAND Flash Memory Lifetime by Tolerating Early Retention Loss and Process Variation
Yixin Luo Saugata Ghose Yu Cai Erich F. Haratsch Onur Mutlu Thank you for the introduction. Today, I will present our work about …

134 Executive Summary Problem: 3D NAND error characteristics are not well studied Goal: Understand & mitigate 3D NAND errors to improve lifetime Contribution 1: Characterize real 3D NAND flash chips Process variation: 21× error rate difference across layers Early retention loss: Error rate increases by 10× after 3 hours Retention interference: Not observed before in planar NAND Contribution 2: Model RBER and threshold voltage RBER (raw bit error rate) variation model Retention loss model Contribution 3: Mitigate 3D NAND flash errors LaVAR: Layer Variation Aware Reading LI-RAID: Layer-Interleaved RAID ReMAR: Retention Model Aware Reading Improve flash lifetime by 1.85× or reduce ECC overhead by 78.9% In this work, our goal is to … In the xth contribution, we …

135 Agenda Background & Introduction
Contribution 1: Characterize real 3D NAND flash chips Contribution 2: Model RBER and threshold voltage Contribution 3: Mitigate 3D NAND flash errors Conclusion

136 Agenda Background & Introduction
Contribution 1: Characterize real 3D NAND flash chips Process variation Early retention loss Retention interference Contribution 2: Model RBER and threshold voltage Contribution 3: Mitigate 3D NAND flash errors Conclusion

137 Process Variation Across Layers
[Figure: flash cells on vertical bitlines BL 0, BL 1, ..., BL N, stacked in layers Layer 0, Layer 1, ..., Layer M] Flash cells on different layers may have different error characteristics This figure shows how flash cells are organized in a 3-dimensional manner, with M+1 layers. Each layer of a vertical bitline is a flash cell. The key takeaway is that flash cells on different layers may have different error characteristics, because of the systematic process variation caused by the way these vertical bitlines are manufactured. So our goal is to experimentally characterize this phenomenon using real chips.

138 Characterization Methodology
Modified firmware version in the flash controller Controls the read reference voltage of the flash chip Bypasses ECC to get raw data (with raw bit errors) Analysis and post-processing of the data on the server Server Here is the methodology we use. Our characterization platform consists of a form-factor SSD connected to a server machine. We use a modified firmware version in the flash controller, which allows us to control the read reference voltage of the flash chip, and to bypass error correction to get raw NAND data. SSD

139 Layer-to-Layer Process Variation
Max RBER 21× 2.4× Here is the characterization result. X-axis shows …, y-axis shows … We make several observations from the data. 1. There is a large variation in RBER across different layers; we call this layer-to-layer RBER variation. 2. There is RBER variation between LSB and MSB pages; we call this LSB-MSB RBER variation. 3. The maximum RBER occurs in the middle layers. 4. These trends are consistent across chips.

140 Layer-to-Layer Process Variation
Here is the characterization result. X-axis shows …, y-axis shows … We make several observations from the data. Large variation in RBER caused by layers RBER variation caused by LSB-MSB page Max RBER in middle layers Consistent across chips The key takeaway is that due to layer-to-layer process variation, there are large RBER variations. Large RBER variation across layers and LSB-MSB pages

141 Retention Loss Phenomenon
Planar NAND Cell 3D NAND Cell Control Gate Substrate Charge Trap (Insulator) Gate Oxide Source Floating Gate (Conductor) Control Gate Tunnel Oxide Gate Oxide Substrate Source Drain Drain Tunnel Oxide The second phenomenon we characterize is retention loss. NAND flash cell stores data in the form of electric charges. Retention loss refers to the phenomenon that charge leaks away over time, leading to retention errors. Prior work has shown that retention error is the most dominant type of error in planar NAND flash memory. So how about 3D NAND, which uses a new cell design? Will this change the behavior of retention loss? That’s what we are about to find out. Most dominant type of error in planar NAND. Is this true for 3D NAND as well?

142 Retention errors increase quickly immediately after programming
Early Retention Loss 3 years 10x 10x 11 days 10x 3 hours Here is the characterization result. X-axis shows …, y-axis shows … blue curve … orange curve …. We make several observations from the data. 1. Retention errors increase much faster in 3D NAND than planar NAND 2. Retention errors increase quickly immediately after programming Retention errors increase quickly immediately after programming

143 Characterization Summary
Layer-to-layer process variation Large RBER variation across layers and LSB-MSB pages → Need new mechanisms to tolerate RBER variation! Early retention loss RBER increases quickly after programming → Need new mechanisms to tolerate retention errors! Retention interference Amount of retention loss correlated with neighbor cells’ states → Need new mechanisms to tolerate retention interference! More threshold voltage and RBER results in the paper: 3D NAND P/E cycling, program interference, read disturb, read variation, bitline-to-bitline process variation Our approach based on insights developed via our experimental characterization: Develop error models, and build online error mitigation mechanisms using the models In summary, we have discussed layer-to-layer process variation and early retention loss. Both are new and different in 3D NAND. In our paper, we also show the retention interference phenomenon where the amount of …, which is also new in 3D NAND. So these three new phenomena in 3D NAND motivate us to design new mechanisms to tolerate 3D NAND errors. We have many other results regarding threshold voltage and RBER, which we don’t have time to talk about. Next, we will discuss our approach to model and mitigate these new 3D NAND error characteristics.

144 Agenda Background & Introduction
Contribution 1: Characterize real 3D NAND flash chips Contribution 2: Model RBER and threshold voltage Retention loss model RBER variation model Contribution 3: Mitigate 3D NAND flash errors Conclusion

145 Threshold Voltage Distribution
What Do We Model? Va Vb Vc Read Reference Voltages Probability Threshold Voltage (Vth) Threshold Voltage Distribution 11 01 00 10 MSB LSB First, let’s look at what we model. NAND flash memory stores data as the threshold voltage of a flash cell. The flash chips that we study in this work use multi-level cell technology, which divides the threshold voltage range into 4 possible states, so that each cell can store a 2-bit value, consisting of an MSB and an LSB. We model the threshold voltage distribution of each state as a Gaussian distribution, with a mean and a variance. The bit values stored in flash memory can be read by applying three read reference voltages at the boundaries of the 4 states: Va, Vb, and Vc. The cells in each distribution that cross these reference voltages cause raw bit errors, and the number of such cells changes over time. Raw Bit Errors

146 Optimal Read Reference Voltage
Va Vb Vc Probability Raw bit errors are determined by two things. First, the distribution shifts quickly due to retention loss. If we keep the read reference voltage the same, raw bit errors will increase. Second, the read reference voltage itself affects raw bit errors. If we shift the read reference voltage accordingly, we can minimize raw bit errors. We call this the optimal read reference voltage. Our goal is to model the shift accurately to reduce raw bit errors. Threshold Voltage (Vth) Raw Bit Errors
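To make the two effects concrete, here is a small sketch for a single boundary between two neighboring states, each modeled as a Gaussian as described above. The means, standard deviations, and the assumption that both states are equally likely are illustrative, not measured values.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """P(Vth <= x) for a Gaussian threshold voltage distribution."""
    return 0.5 * math.erfc((mu - x) / (sigma * math.sqrt(2)))

def raw_bit_error_prob(v_ref, low, high):
    """Error probability at one boundary; low/high are (mean, sigma).

    Assumes both states are equally likely (an illustrative assumption).
    """
    low_read_as_high = 1.0 - gaussian_cdf(v_ref, *low)
    high_read_as_low = gaussian_cdf(v_ref, *high)
    return 0.5 * (low_read_as_high + high_read_as_low)

def optimal_vref(low, high, steps=1000):
    """Grid search between the two means for the error-minimizing Vref."""
    lo, hi = low[0], high[0]
    candidates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda v: raw_bit_error_prob(v, low, high))

fresh = ((1.0, 0.2), (2.0, 0.2))    # hypothetical states right after programming
leaked = ((1.0, 0.2), (1.8, 0.2))   # higher state shifted down by retention loss
v_fresh = optimal_vref(*fresh)
v_leaked = optimal_vref(*leaked)
```

Reading the leaked distributions at the stale `v_fresh` gives more errors than reading them at `v_leaked`, which is why the read reference voltage has to track the shift.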

147 Retention Loss Model Early retention loss can be modeled as
This figure shows a part of our retention model. X-axis shows …, y-axis shows the optimal read reference voltage, and each subfigure shows Va, Vb, and Vc, respectively. Blue dots are the actual optimal read reference voltages from measured data. We find that the voltage shift due to retention loss fits a linear trend of the logarithm of retention time. In fact, we find that other parameters also follow a linear function. Early retention loss can be modeled as a simple linear function of log(retention time)

148 Retention Loss Model Goal: Develop a simple linear model that can be used online Models Optimal read reference voltage (Vb and Vc) Raw bit error rate (log(RBER)) Mean and standard deviation of threshold voltage distribution (μ and σ) As a function of Retention time (log(t)) P/E cycle count (PEC) e.g., Vopt = (α × PEC + β) × log(t) + γ × PEC + δ Model error <1 step for Vb and Vc Adjusted R² > 89% Here is an overview of our full model. We refer to our paper for more detailed information. Our retention loss model predicts … And it models these variables as a function of … Here is an example of how simple our model is to predict the optimal read reference voltage Our model is highly accurate, …
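At a fixed P/E cycle count, a model that is linear in log(t) reduces to a straight line, so it can be learned with an ordinary least-squares fit. The sketch below shows that fit; the natural logarithm and the function name are assumptions, since the slides do not specify the log base or the fitting procedure.

```python
import math

def fit_log_linear(times, voltages):
    """Least-squares fit of v = slope * log(t) + intercept.

    times: retention times (any positive unit); voltages: measured
    optimal read reference voltages at those times, at one fixed PEC.
    """
    xs = [math.log(t) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_v = sum(voltages) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxv = sum((x - mean_x) * (v - mean_v) for x, v in zip(xs, voltages))
    slope = sxv / sxx
    intercept = mean_v - slope * mean_x
    return slope, intercept
```

Repeating the fit at several PEC values and fitting the resulting slopes and intercepts against PEC would give the four coefficients of a model that is linear in both log(t) and PEC.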

149 RBER distribution follows gamma distribution
RBER Variation Model This figure shows the distribution of per-page RBER within a flash block. X-axis shows …, y-axis shows …. The bars show the actual RBER distribution. We find that this variation can be accurately modeled by a gamma distribution, regardless of which read reference voltage is used. Variation-aware Vopt improves both average and worst-case RBER RBER distribution follows gamma distribution: KL-divergence error = 0.09 Large gap between worst-case RBER and average RBER → Need new mechanisms to narrow the RBER distribution! Variation-agnostic Vopt Same Vref for all layers optimized for the entire block Variation-aware Vopt Different Vref optimized for each layer

150 Agenda Background & Introduction
Contribution 1: Characterize real 3D NAND flash chips Contribution 2: Model RBER and threshold voltage Contribution 3: Mitigate 3D NAND flash errors LaVAR: Layer Variation Aware Reading LI-RAID: Layer-Interleaved RAID ReMAR: Retention Model Aware Reading Conclusion

151 LaVAR: Layer Variation Aware Reading
Layer-to-layer process variation Error characteristics are different in each layer Goal: Adjust read reference voltage for each layer Key Idea: Learn a voltage offset (Offset) for each layer Vopt(layer-aware) = Vopt(layer-agnostic) + Offset Mechanism Offset: Learned once for each chip & stored in a table Uses (2×Layers) Bytes memory per chip Vopt(layer-agnostic): Predicted by any existing Vopt model E.g., ReMAR [Luo+Sigmetrics’18], HeatWatch [Luo+HPCA’18], OFCM [Luo+JSAC’16], ARVT [Papandreou+GLSVLSI’14] Reduces RBER on average by 43% (based on our characterization data)
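The offset-table idea can be sketched as below. How the per-layer optimum is measured during the one-time learning step is not shown; `per_layer_vopt` is assumed to come from characterization, and the function names are hypothetical.

```python
def learn_offsets(per_layer_vopt, v_opt_agnostic):
    """One-time learning: offset of each layer's optimum from the block-wide one.

    per_layer_vopt[i] is the measured optimal Vref of layer i (assumed
    to come from a characterization pass over the chip).
    """
    return [v - v_opt_agnostic for v in per_layer_vopt]

def layer_aware_vopt(v_opt_agnostic, layer, offset_table):
    """Vopt(layer-aware) = Vopt(layer-agnostic) + Offset[layer]."""
    return v_opt_agnostic + offset_table[layer]
```

At read time, `v_opt_agnostic` can come from any existing Vopt predictor, so the table only has to capture the static layer-to-layer difference.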

152 LI-RAID: Layer-Interleaved RAID
Layer-to-layer process variation Worst-case RBER much higher than average RBER Goal: Significantly reduce worst-case RBER Key Idea Group flash pages on less reliable layers with pages on more reliable layers Group MSB pages with LSB pages Mechanism Reorganize RAID layout to eliminate worst-case RBER <0.8% storage overhead

153 Worst-case RBER in any layer limits the lifetime of conventional RAID
Wordline # Layer # Page Chip 0 Chip 1 Chip 2 Chip 3 MSB Group 0 LSB Group 1 1 Group 2 Group 3 2 Group 4 Group 5 3 Group 6 Group 7 Worst-case RBER in any layer limits the lifetime of conventional RAID

154 LI-RAID: Layer-Interleaved RAID
Wordline # Layer # Page Chip 0 Chip 1 Chip 2 Chip 3 MSB Group 0 Blank Group 4 Group 3 LSB Group 1 Group 5 Group 2 1 2 3 Any page with worst-case RBER can be corrected by other reliable pages in the RAID group

155 LI-RAID: Layer-Interleaved RAID
Layer-to-layer process variation Worst-case RBER much higher than average RBER Goal: Significantly reduce worst-case RBER Key Idea Group flash pages on less reliable layers with pages on more reliable layers Group MSB pages with LSB pages Mechanism Reorganize RAID layout to eliminate worst-case RBER <0.8% storage overhead Reduces worst-case RBER by 66.9% (based on our characterization data)
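One way to picture the grouping idea is the rotation below: each RAID group takes its page from a different layer on each chip, and alternates between MSB and LSB pages across chips. This is a hypothetical layout written only to illustrate the two grouping rules; it is not the layout used in the paper.

```python
def li_raid_groups(num_chips, num_layers):
    """Assign every (chip, layer, page type) to a RAID group.

    Hypothetical interleaving: the layer is rotated by the chip index and
    the page type is flipped on odd chips, so no group is built entirely
    from the same layer or the same page type.
    """
    other = {"MSB": "LSB", "LSB": "MSB"}
    groups = {}
    for base_layer in range(num_layers):
        for base_type in ("MSB", "LSB"):
            members = []
            for chip in range(num_chips):
                layer = (base_layer + chip) % num_layers
                page_type = other[base_type] if chip % 2 else base_type
                members.append((chip, layer, page_type))
            groups[(base_layer, base_type)] = members
    return groups
```

In a conventional layout every member of a group sits on the same layer and page type, so a group on the worst layer is made only of worst-case pages; the rotation spreads those pages over different groups.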

156 ReMAR: Retention Model Aware Reading
Early retention loss Threshold voltage shifts quickly after programming Goal: Adjust read reference voltages based on retention loss Key Idea: Learn and use a retention loss model online Mechanism Periodically characterize and learn retention loss model online Retention time = Read timestamp - Write timestamp Uses 800 KB memory to store program time of each block Predict retention-aware Vopt using the model Reduces RBER on average by 51.9% (based on our characterization data)
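The mechanism can be sketched as a small class that records each block's program time and evaluates the model at read time. It assumes the model form Vopt = (α × PEC + β) × log(t) + γ × PEC + δ, the natural logarithm, and timestamps in seconds; the class and method names, and the coefficient values used with it, are placeholders.

```python
import math

class ReMAR:
    """Sketch of retention-model-aware reading (names are hypothetical)."""

    def __init__(self, alpha, beta, gamma, delta):
        self.coeffs = (alpha, beta, gamma, delta)  # learned online
        self.program_time = {}  # block id -> write timestamp in seconds

    def on_program(self, block, now):
        self.program_time[block] = now

    def predict_vopt(self, block, pec, now):
        # Retention time = read timestamp - write timestamp.
        retention = max(now - self.program_time[block], 1.0)  # clamp: log(t) >= 0
        a, b, g, d = self.coeffs
        return (a * pec + b) * math.log(retention) + g * pec + d
```

The per-block timestamp table is the memory cost the slide refers to; the prediction itself is a handful of arithmetic operations per read.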

157 Impact on System Reliability
ECC Limit 85% longer flash lifetime 79% lower ECC storage overhead LaVAR, LI-RAID, and ReMAR improve flash lifetime or reduce ECC overhead significantly

158 Error Mitigation Techniques Summary
LaVAR: Layer Variation Aware Reading Learn a Vopt offset for each layer and apply layer-aware Vopt LI-RAID: Layer-Interleaved RAID Group flash pages on less reliable layers with pages on more reliable layers Group MSB pages with LSB pages ReMAR: Retention Model Aware Reading Learn retention loss model and apply retention-aware Vopt Benefits: Improve flash lifetime by 1.85× or reduce ECC overhead by 78.9% ReNAC (in paper): Reread a failed page using Vopt based on the retention interference induced by neighbor cell

159 Agenda Background & Introduction
Contribution 1: Characterize real 3D NAND flash chips Contribution 2: Model RBER and threshold voltage Contribution 3: Mitigate 3D NAND flash errors Conclusion

160 Conclusion Problem: 3D NAND error characteristics are not well studied
Goal: Understand & mitigate 3D NAND errors to improve lifetime Contribution 1: Characterize real 3D NAND flash chips Process variation: 21× error rate difference across layers Early retention loss: Error rate increases by 10× after 3 hours Retention interference: Not observed before in planar NAND Contribution 2: Model RBER and threshold voltage RBER (raw bit error rate) variation model Retention loss model Contribution 3: Mitigate 3D NAND flash errors LaVAR: Layer Variation Aware Reading LI-RAID: Layer-Interleaved RAID ReMAR: Retention Model Aware Reading Improve flash lifetime by 1.85× or reduce ECC overhead by 78.9% In this work, our goal is to … In the xth contribution, we …

162 3D NAND Flash Reliability II [SIGMETRICS’18]
Yixin Luo, Saugata Ghose, Yu Cai, Erich F. Haratsch, and Onur Mutlu, "Improving 3D NAND Flash Memory Lifetime by Tolerating Early Retention Loss and Process Variation"  Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Irvine, CA, USA, June 2018.  [Abstract]

163 One More Idea

164 Improving NAND Flash Memory Lifetime with Write-hotness Aware Retention Management
WARM Yixin Luo, Yu Cai, Saugata Ghose, Jongmoo Choi*, Onur Mutlu Carnegie Mellon University, *Dankook University

165 Executive Summary Flash memory can achieve 50x endurance improvement by relaxing retention time using refresh [Cai+ ICCD ’12] Problem: Frequent refresh consumes the majority of endurance improvement Goal: Reduce refresh overhead to increase flash memory lifetime Key Observation: Refresh is unnecessary for write-hot data Key Ideas of Write-hotness Aware Retention Management (WARM) Physically partition write-hot pages and write-cold pages within the flash drive Apply different policies (garbage collection, wear-leveling, refresh) to each group Key Results WARM w/o refresh improves lifetime by 3.24x WARM w/ adaptive refresh improves lifetime by 12.9x (1.21x over refresh only) As you know, flash memory has limited write endurance.

166 Conventional Write-Hotness Oblivious Management
[Figure: the flash controller writes hot pages and cold pages interleaved into the same flash memory blocks (Pages 0 to 255, Pages 256 to 511, ..., Pages M to M+255)] Unable to relax retention time for blocks with write-hot and cold pages

167 Key Idea: Write-Hotness Aware Management
[Figure: the flash controller places hot pages and cold pages in separate flash memory blocks (Pages 0 to 255, Pages 256 to 511, ..., Pages M to M+255)] Can relax retention time for blocks with write-hot pages only
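The partitioning step can be sketched as follows. Classifying pages by a write-count threshold over a recent window is an assumption made for illustration; the slides do not describe how WARM identifies write-hot pages, and the policy table is likewise a placeholder.

```python
def partition_pages(write_counts, hot_threshold):
    """Split logical pages into a write-hot pool and a write-cold pool.

    write_counts maps page id -> number of writes in a recent window;
    hot_threshold is a hypothetical cut-off between hot and cold.
    """
    hot = sorted(p for p, n in write_counts.items() if n >= hot_threshold)
    cold = sorted(p for p, n in write_counts.items() if n < hot_threshold)
    return hot, cold

# Each pool gets its own policy. Hot pages are rewritten before their
# relaxed retention time expires, so they can skip refresh.
POOL_POLICY = {
    "hot": {"refresh": False, "relaxed_retention": True},
    "cold": {"refresh": True, "relaxed_retention": False},
}
```

Keeping the two pools in physically separate blocks is what lets the controller apply a different retention and refresh policy to each.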

168 Write-Hotness Aware Retention Management
Yixin Luo, Yu Cai, Saugata Ghose, Jongmoo Choi, and Onur Mutlu, "WARM: Improving NAND Flash Memory Lifetime with Write-hotness Aware Retention Management" Proceedings of the 31st International Conference on Massive Storage Systems and Technologies (MSST), Santa Clara, CA, June 2015.  [Slides (pptx) (pdf)] [Poster (pdf)]

169 Agenda Background, Motivation and Approach
Experimental Characterization Methodology Error Analysis and Management Main Characterization Results Retention-Aware Error Management Threshold Voltage and Program Interference Analysis Read Reference Voltage Prediction Neighbor-Assisted Error Correction Read Disturb Error Handling Retention Error Handling Large Scale Field Analysis 3D NAND Flash Memory Reliability Summary

170 Summary of Key Works
Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo, and Onur Mutlu, “Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid State Drives,” Proceedings of the IEEE, September 2017.
Cai+, “Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis,” DATE 2012.
Cai+, “Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory Lifetime,” ICCD 2012.
Cai+, “Threshold Voltage Distribution in MLC NAND Flash Memory: Characterization, Analysis and Modeling,” DATE 2013.
Cai+, “Error Analysis and Retention-Aware Error Management for NAND Flash Memory,” Intel Technology Journal 2013.
Cai+, “Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation,” ICCD 2013.
Cai+, “Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories,” SIGMETRICS 2014.
Cai+, “Data Retention in MLC NAND Flash Memory: Characterization, Optimization and Recovery,” HPCA 2015.
Cai+, “Read Disturb Errors in MLC NAND Flash Memory: Characterization and Mitigation,” DSN 2015.
Luo+, “WARM: Improving NAND Flash Memory Lifetime with Write-hotness Aware Retention Management,” MSST 2015.
Meza+, “A Large-Scale Study of Flash Memory Errors in the Field,” SIGMETRICS 2015.
Luo+, “Enabling Accurate and Practical Online Flash Channel Modeling for Modern MLC NAND Flash Memory,” IEEE JSAC 2016.
Cai+, “Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques,” HPCA 2017.
Fukami+, “Improving the Reliability of Chip-Off Forensic Analysis of NAND Flash Memory Devices,” DFRWS EU 2017.
Luo+, “HeatWatch: Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature-Awareness,” HPCA 2018.
Luo+, “Improving 3D NAND Flash Memory Lifetime by Tolerating Early Retention Loss and Process Variation,” SIGMETRICS 2018.
Cai+, “Error Characterization, Mitigation, and Recovery in Flash Memory Based Solid State Drives,” Proc. IEEE 2017.

171 NAND Flash Vulnerabilities [HPCA’17]
HPCA, Feb. 2017

172 NAND Flash Errors: A Modern Survey
Proceedings of the IEEE, Sept. 2017

173 More Up-to-date Version
Yu Cai, Saugata Ghose, Erich F. Haratsch, Yixin Luo, and Onur Mutlu, "Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery" Invited Book Chapter in Inside Solid State Drives, 2018.  [Preliminary arxiv.org version] 

174 Prof. Onur Mutlu ETH Zürich Fall 2018 14 November 2018
Computer Architecture Lecture 15: Flash Memory and Solid-State Drives (II) Prof. Onur Mutlu ETH Zürich Fall 2018 14 November 2018

175 Other Works on Flash Memory

176 HeatWatch: Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature Awareness
Yixin Luo Saugata Ghose Yu Cai Erich F. Haratsch Onur Mutlu Thank you for the introduction. Today, I will present our work about …

177 Storage Technology Drivers - 2018
3D NAND Flash Memory Store large amounts of data reliably for months to years Stacked layers Many applications today require storing large amounts of data reliably for a long time, which can be up to several years. To keep up with this need, manufacturers have introduced 3D NAND technology, which stacks multiple flash cell arrays vertically on top of each other.

178 Executive Summary 3D NAND flash memory susceptible to retention errors
Charge leaks out of flash cell
Two unreported factors: self-recovery and temperature
We study self-recovery and temperature effects
Experimental characterization of real 3D NAND chips
Unified Self-Recovery and Temperature (URT) Model
Predicts impact of retention loss, wearout, self-recovery, temperature on flash cell voltage
Low prediction error rate: 4.9%
We develop a new technique to improve flash reliability
HeatWatch
Uses URT model to find optimal read voltages for 3D NAND flash
Improves flash lifetime by 3.85x
However, the problem with 3D NAND flash memory is that it is susceptible to retention errors, where charge leaks out of the flash cell over time. We also find two factors of retention errors that are understudied in prior work: self-recovery and temperature. In this work, we develop a new understanding of these two factors through experimental characterization of real 3D NAND chips. Using this characterization, we develop a new unified model that captures retention, wearout, self-recovery, and temperature. We validate that this new model is highly accurate, with an error rate of only 4.9%. We show that this new understanding is beneficial because it enables new reliability techniques. For example, we develop HeatWatch, which uses our model to adapt 3D NAND flash memory to these effects. This improves lifetime by 3.85x.

179 Outline Executive Summary Background on NAND Flash Reliability
Characterization of Self-Recovery and Temperature Effect on Real 3D NAND Flash Memory Chips URT: Unified Self-Recovery and Temperature Model HeatWatch Mechanism Conclusion First, let me introduce some background about 3D NAND flash memory in order to understand self-recovery and temperature effects.

180 3D NAND Flash Memory Background
Charge = Threshold Voltage 3D NAND Flash Memory Higher Voltage State Lower Voltage State Data Value = 0 Data Value = 1 Flash Cell Read Reference Voltage 3D NAND flash memory consists of flash cells, each of which holds different amount of charge which determines the threshold voltage of the cell. And the threshold voltage determines the data value stored in that cell. For example, for a cell that stores one bit of data, 0 can be assigned to higher voltage state, and 1 assigned to lower voltage state. We can sense the data value by applying a read reference voltage to the cell that can distinguish between these two states.

181 Flash Wearout
Program/Erase (P/E) → Wearout. Wearout Introduces Errors. Wearout Effects: 1. Retention Loss (voltage shift over time) 2. Program Variation (initial voltage difference between states). A flash cell can hold data for much longer than DRAM because it conceptually uses a layer of insulator to prevent the charge from leaking away. However, this means that when we program or erase the cell, we have to force the charge through this insulator, which causes wearout. This wearout has two consequences. First, it speeds up retention loss. When the insulator is damaged, charge can be trapped within the insulator and form pathways for other charge to leak away from the flash cell, shifting a cell from the higher voltage state to the lower voltage state. Second, it decreases program variation. Program variation is the initial voltage gap between states, so a larger program variation helps distinguish between different states. As more and more charge gets permanently trapped within the insulator, the threshold voltage of the lower voltage state increases. This makes the difference between the higher and lower voltage states smaller. Both effects make it harder to distinguish between different states and increase the error rate in NAND flash memory. At some point, the flash cell can no longer reliably hold the data. We call the duration before this point the flash lifetime, which is typically counted in P/E cycles. There are two ways to potentially improve flash lifetime.

182 Improving Flash Lifetime
Errors introduced by wearout limit flash lifetime (measured in P/E cycles). Two Ways to Improve Flash Lifetime: Exploiting the Self-Recovery Effect, Exploiting the Temperature Effect

183 Exploiting the Self-Recovery Effect
Partially repairs damage due to wearout. Dwell Time: Idle Time Between P/E Cycles. Longer Dwell Time: More Self-Recovery. Reduces Retention Loss. A flash cell partially recovers from damage during the dwell time, which is the idle time between P/E cycles. When the dwell time is longer, the flash cell recovers more. This reduces retention loss in the flash cell, improving reliability.

184 Exploiting the Temperature Effect
High Program Temperature Increases Program Variation. High Storage Temperature Accelerates Retention Loss. The second effect is the temperature effect, which changes electron mobility. During program time, a high temperature allows more charge to be programmed or erased, increasing the program variation. During retention time, a high temperature makes the charge stored in the flash cell leak away faster. So temperature affects reliability differently depending on when it is applied.

185 Prior Studies of Self-Recovery/Temperature
Effect | Planar (2D) NAND | 3D NAND
Self-Recovery Effect | Mielke 2006 | x (unreported)
Temperature Effect | JEDEC 2010 (no characterization) | x (unreported)
There are only a few prior works that study self-recovery and temperature effects in planar NAND, and JEDEC only provides a guideline for the temperature effect without characterization. These two error characteristics are completely unreported in 3D NAND.

186 Outline Executive Summary Background on NAND Flash Reliability
Characterization of Self-Recovery and Temperature Effect on Real 3D NAND Flash Memory Chips URT: Unified Self-Recovery and Temperature Model HeatWatch Mechanism Conclusion Next, I will introduce our characterization, which is the first characterization of these effects on real 3D NAND chips.

187 Characterization Methodology
Modified firmware version in the flash controller. Control the read reference voltage of the flash chip. Bypass ECC to get raw NAND data (with raw bit errors). Control temperature with a heat chamber. Diagram: Server, Heat Chamber, SSD. Here is the methodology we use. This is a toy diagram of our characterization platform, which consists of a form-factor SSD connected to a server machine. We use a modified firmware version in the flash controller, which allows us to control the read reference voltage of the flash chip and to bypass error correction to get raw NAND data. When necessary, we use a heat chamber to control the SSD temperature.

188 Characterized Devices
Real 30- to 39-Layer 3D MLC NAND Flash Chips. 2-bit MLC. The devices that we characterize are real 3D NAND flash chips from a major flash vendor. Each chip has 30 to 39 stacked layers and uses MLC flash, so each flash cell stores 2 bits of data.

189 MLC Threshold Voltage Distribution Background
Figure: threshold voltage distribution (probability vs. threshold voltage) with four states, 11, 10, 00, 01, from the lowest to the highest voltage state, separated by three read reference voltages. Each MLC cell has 4 threshold voltage states. We assign a different 2-bit value to each state, and the states are read using three read reference voltages instead of one. To better understand the reliability of the cell, we typically model the threshold voltage of each state as a probability density function. We call this the threshold voltage distribution.
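As a rough illustration of reading an MLC cell with three read reference voltages, here is a minimal Python sketch. It assumes the slide's labels (11, 10, 00, 01) run from the lowest to the highest voltage state; the function name and the voltage values in the usage note are hypothetical and not taken from the slides.

```python
from bisect import bisect_right

# 2-bit values from the lowest to the highest voltage state (assumed slide order).
STATE_VALUES = ["11", "10", "00", "01"]


def read_mlc_cell(vth, vrefs):
    """Return the 2-bit value sensed for a cell with threshold voltage vth.

    vrefs is a list of three read reference voltages in ascending order
    (arbitrary, hypothetical units).
    """
    if len(vrefs) != 3 or list(vrefs) != sorted(vrefs):
        raise ValueError("expected three ascending read reference voltages")
    # The number of reference voltages at or below vth (0..3) selects the state.
    return STATE_VALUES[bisect_right(vrefs, vth)]
```

For example, with made-up references `[1.0, 2.0, 3.0]`, a cell programmed to 3.1 reads as `01`, but if retention loss shifts it down to 2.9 it is misread as `00`.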

190 Characterization Goal
Characterized Metrics: Retention Loss Speed (how fast voltage shifts over time), Program Variation (initial voltage difference between states). Characterized Phenomena: Self-Recovery Effect, Temperature Effect. Using this distribution, we characterize the retention loss speed as how fast the distribution shifts over time, and the program variation as the difference between the distributions of the highest and the lowest voltage states. The goal of our characterization is to evaluate how self-recovery and temperature affect 3D NAND reliability using these two metrics. In the next few slides, I will show our characterization results on how self-recovery affects retention loss speed and how temperature affects both metrics.

191 Self-Recovery Effect Characterization Results
Dwell time: Idle time between P/E cycles. First, to understand the self-recovery effect, we compare the retention loss speed under different dwell times. The x-axis is the dwell time, which is the idle time between P/E cycles and correlates with the self-recovery effect. The y-axis is the retention loss speed. By comparing the highest and lowest points, we conclude that increasing dwell time from 1 minute to 2.3 hours slows down retention loss speed by 40%.

192 Program Temperature Effect Characterization Results
Second, to understand the program temperature effect, we compare the program variation under different temperatures. The x-axis is the program temperature. The y-axis is the program variation. By comparing the highest and lowest points, we conclude that increasing program temperature from 0°C to 70°C improves program variation by 21%.

193 Storage Temperature Effect Characterization Results
Third, to understand the storage (retention) temperature effect, we compare the retention loss speed under different temperatures. The x-axis is the storage temperature. The y-axis is the retention loss speed. By comparing the highest and lowest points, we conclude that lowering storage temperature from 70°C to 0°C slows down retention loss speed by 58%.

194 Characterization Summary
Major Results (combined in the Unified Model): Self-recovery affects retention loss speed. Program temperature affects program variation. Storage temperature affects retention loss speed. Other Characterizations in the Paper: More detailed results on self-recovery and temperature. Effects on error rate. Effects on threshold voltage distribution. Effects of recovery cycle (P/E cycles with long dwell time) on retention loss speed. We conclude that self-recovery and temperature significantly affect flash reliability, and in different ways. We provide a new unified model that combines them. We have other characterization results in the paper that we won't talk about here.

195 Outline Executive Summary Background on NAND Flash Reliability
Characterization of Self-Recovery and Temperature Effect on Real 3D NAND Flash Memory Chips URT: Unified Self-Recovery and Temperature Model HeatWatch Mechanism Conclusion Next, I will talk about our unified model.

196 Minimizing 3D NAND Errors
Figure: threshold voltage distributions (probability vs. threshold voltage) of two neighboring states, 00 and 01, showing the original read reference voltage, the optimal read reference voltage, and the retention errors between them. The goal of this model is to provide guidelines for minimizing 3D NAND errors. Now let’s look at how we can do that. Ideally, the read reference voltage is set at the boundary of two neighboring states. Because of retention loss, charge leaks away from the flash cell, decreasing its threshold voltage, so the distribution shifts. But the flash controller is unaware of that. As a result, the cells in the shaded part of the distribution cross the read reference voltage and become retention errors. One way to reduce these errors is to shift the read reference voltage along with the distribution. This minimizes the errors, so we call it the optimal read reference voltage. Optimal read reference voltage minimizes 3D NAND errors.
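The idea of an optimal read reference voltage can be sketched numerically. The Python below is only an illustration under assumptions that are not in the slides: each state is modeled as a Gaussian, both states hold the same number of cells, and the optimum is found with a simple grid search. All function names and numbers are hypothetical.

```python
import math


def gaussian_cdf(x, mean, std):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def raw_error_fraction(vref, low, high):
    """Fraction of cells misread at vref.

    low and high are (mean, std) tuples for the lower and higher voltage
    states, assumed to be equally populated.
    """
    low_read_as_high = 1.0 - gaussian_cdf(vref, *low)
    high_read_as_low = gaussian_cdf(vref, *high)
    return 0.5 * (low_read_as_high + high_read_as_low)


def optimal_vref(low, high, steps=2000):
    """Grid-search the read reference voltage between the two state means."""
    start, stop = low[0], high[0]
    candidates = [start + (stop - start) * i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda v: raw_error_fraction(v, low, high))
```

With made-up states `(1.0, 0.2)` and `(3.0, 0.2)`, the search lands at the midpoint; if retention loss moves the higher state's mean down to 2.6, the optimum moves down with it, and reading at the stale reference produces more errors than reading at the shifted one.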

197 Predicting the Mean Threshold Voltage
Our URT Model: V = V0 + ΔV, where V is the mean threshold voltage, V0 is the initial voltage before retention (program variation), and ΔV is the voltage shift due to retention loss. The URT model that we propose can predict the mean threshold voltage of each state, so that the flash controller can predict the optimal read reference voltage. URT predicts the mean voltage as the sum of the initial voltage V0 before retention loss and the voltage shift due to retention loss.

198 V = V0 + ΔV URT Model Overview 1. Program Variation Component tr Tr td
3. Temperature Scaling Component tr,eff PEC td,eff 2. Self-Recovery and Retention Component PEC Tp V = V0 + ΔV (Initial Voltage Before Retention + Voltage Shift Due to Retention Loss). Here is an overview of this model. The initial voltage is predicted by the program variation component. The voltage shift is predicted by two components: the self-recovery and retention component, and the temperature scaling component, which scales the inputs of component 2 based on temperature. Next, we will go through them one by one.
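The data flow between the three components can be written compactly. In the LaTeX sketch below, the symbols V, V0, ΔV, PEC, Tp, tr, Tr, td, Td, tr,eff, and td,eff come from the slides; the function names f_pv, f_sr, and g are placeholders introduced here only to show which inputs feed which component, and their actual forms are given in the paper, not here.

```latex
\begin{aligned}
V &= V_0 + \Delta V \\
V_0 &= f_{\mathrm{pv}}(\mathrm{PEC},\, T_p)
  && \text{(1. program variation component)} \\
\Delta V &= f_{\mathrm{sr}}(t_{r,\mathrm{eff}},\, t_{d,\mathrm{eff}},\, \mathrm{PEC})
  && \text{(2. self-recovery and retention component)} \\
t_{r,\mathrm{eff}} &= g(t_r,\, T_r), \qquad t_{d,\mathrm{eff}} = g(t_d,\, T_d)
  && \text{(3. temperature scaling component)}
\end{aligned}
```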

199 1. Program Variation Component
Inputs: P/E Cycle (PEC), Program Temperature (Tp). Output: Initial Voltage (V0). First, the program variation component. According to our new understanding, we add program temperature awareness to an existing model. We find that the initial voltage is linear in both the P/E cycle count and the program temperature. We validate this model using statistical analysis and show that it fits real results nicely. Validation: R² = 91.7%
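To make "linear in both the P/E cycle count and the program temperature" concrete, here is a small Python sketch of such a model together with the R² statistic used to judge a fit. The slides do not give the fitted coefficients, so the parameter names and every number used with this code are hypothetical.

```python
def predict_v0(pec, tp_celsius, intercept, pec_slope, temp_slope):
    """Linear model of the initial voltage V0 (hypothetical coefficients)."""
    return intercept + pec_slope * pec + temp_slope * tp_celsius


def r_squared(measured, predicted):
    """Coefficient of determination between measurements and predictions."""
    mean = sum(measured) / len(measured)
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    ss_tot = sum((m - mean) ** 2 for m in measured)
    return 1.0 - ss_res / ss_tot
```

An R² of 1.0 means the predictions reproduce the measurements exactly; the 91.7% reported on the slide corresponds to a value of 0.917 on this scale.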

200 2. Self-Recovery and Retention Component
Inputs: Retention Time (tr), P/E Cycle (PEC), Dwell Time (td). Output: Retention Shift (ΔV). Second, the self-recovery and retention component. Based on our new understanding, we add dwell time awareness to an existing model. We find that the correlation follows the equation shown on the slide; more details are in the paper. Our validation shows that this component is 3x more accurate on 3D NAND than the existing model designed for planar NAND. Validation: 3x more accurate than state-of-the-art model

201 3. Temperature Scaling Component
Actual Retention Time Actual Dwell Time Storage Temp. Dwell Temp. tr Tr td Td tr,eff td,eff Effective Retention Time Effective Dwell Time Third is the temperature scaling component. Using the Arrhenius equation recommended by the JEDEC standard, we scale the retention time and dwell time to an effective time that has equivalent effect under room temperature. However, during validation, we find that because the cell structure is drastically changed in 3D NAND, we can’t use the existing parameters. In fact, we had to adjust the activation energy Ea from 1.1eV to 1.04eV for 3D NAND. Arrhenius Equation: Validation: Adjust an important parameter, Ea, from 1.1 eV to 1.04 eV
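The temperature scaling step can be sketched with the standard Arrhenius acceleration factor. This is an illustration, not the exact equation from the slide: the reference "room temperature" of 25°C is an assumption, and the function names are hypothetical. The activation energy default of 1.04 eV is the adjusted value quoted above.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_factor(temp_c, ea_ev=1.04, ref_temp_c=25.0):
    """Acceleration factor at temp_c relative to the reference temperature.

    ref_temp_c = 25 degrees C is an assumed room temperature.
    """
    temp_k = temp_c + 273.15
    ref_k = ref_temp_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / ref_k - 1.0 / temp_k))


def effective_time(actual_time, temp_c, ea_ev=1.04, ref_temp_c=25.0):
    """Scale an actual retention or dwell time to its room-temperature equivalent."""
    return actual_time * arrhenius_factor(temp_c, ea_ev, ref_temp_c)
```

Under these assumptions, an hour spent at 70°C counts as a few hundred hours at room temperature, and using Ea = 1.1 eV instead of 1.04 eV makes the scaling noticeably more aggressive, which is why the choice of Ea matters for prediction accuracy.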

202 V = V0 + ΔV URT Model Summary Validation: Prediction Error Rate = 4.9%
1. Program Variation Component tr Tr tr,eff td Td td,eff 3. Temperature Scaling Component PEC 2. Self-Recovery and Retention Component PEC Tp V = V0 + ΔV Putting everything together, we have the program variation component that models the initial voltage. We have the effective retention/dwell time component that models the retention and dwell time. And we have the self-recovery component that models the self-recovery and retention effects and predicts the retention shift. We validate the overall model, the prediction error rate of URT is only 4.9%. Initial Voltage Before Retention Voltage Shift Due to Retention Loss Validation: Prediction Error Rate = 4.9%

203 Outline Executive Summary Background on NAND Flash Reliability
Characterization of Self-Recovery and Temperature Effect on Real 3D NAND Flash Memory Chips URT: Unified Self-Recovery and Temperature Model HeatWatch Mechanism Conclusion Using this model, we develop HeatWatch to improve flash lifetime

204 HeatWatch Mechanism Key Idea
Predict change in threshold voltage distribution by using the URT model. Adapt read reference voltage to near-optimal (Vopt) based on predicted change in voltage distribution. The key idea is to adapt the read reference voltage toward the optimal value using the URT model. HeatWatch has two groups of components. The first group is the tracking components, whose job is to efficiently track the URT parameters, which are the bubbles that we showed in the previous slide. The second group is the prediction components, which use the flash controller to accurately predict the optimal read reference voltage. Next, I will go through these components one by one.

205 HeatWatch Mechanism Overview
Tracking Components SSD Temperature Dwell Time P/E Cycles & Retention Time URT Prediction Components Vopt Prediction Fine-Tuning URT Parameters

206 Tracking SSD Temperature
Use existing sensors in the SSD. Precompute temperature scaling factor at logarithmic time intervals.

207 Tracking Dwell Time
Only need to log the timestamps of last 20 full drive writes. Self-recovery effect diminishes after 20 P/E cycles.
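Keeping only the last 20 full-drive-write timestamps maps naturally onto a bounded queue. The Python sketch below is one possible way to do it; the class and method names are hypothetical, and how a controller detects a "full drive write" is not covered by the slides and is left out.

```python
from collections import deque


class DwellTimeLog:
    """Keeps the timestamps of the most recent full drive writes."""

    def __init__(self, depth=20):
        # A bounded deque silently drops the oldest timestamp when full.
        self._stamps = deque(maxlen=depth)

    def record_full_drive_write(self, timestamp):
        self._stamps.append(timestamp)

    def dwell_times(self):
        """Idle times between consecutive logged writes, oldest first."""
        stamps = list(self._stamps)
        return [later - earlier for earlier, later in zip(stamps, stamps[1:])]
```

Because the log is bounded, the storage cost stays constant no matter how long the drive has been in service.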

208 Tracking P/E Cycles and Retention Time
P/E cycle count already recorded by SSD. Log write timestamp for each block. Retention time = read timestamp – write timestamp.
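The retention time bookkeeping amounts to one timestamp per block and a subtraction at read time. A minimal Python sketch, with hypothetical class and method names and timestamps in arbitrary units:

```python
class BlockWriteLog:
    """Remembers when each block was last written."""

    def __init__(self):
        self._write_time = {}

    def on_write(self, block_id, now):
        # Overwrites any earlier timestamp: only the latest write matters.
        self._write_time[block_id] = now

    def retention_time(self, block_id, now):
        """Retention time = read timestamp - write timestamp."""
        return now - self._write_time[block_id]
```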

209 Predicting Optimal Read Reference Voltage
Calculate URT using tracked information. Modeling error: 4.9%

210 Fine-Tuning URT Parameters Online
Accommodates chip-to-chip variation. Uses periodic sampling.

211 HeatWatch Mechanism Summary
Storage Overhead: 0.16% of DRAM in 1TB SSD. Latency Overhead: < 1% of flash read latency.

212 HeatWatch Evaluation Methodology
28 real workload storage traces MSR-Cambridge We use real dwell time, retention time values obtained from traces Temperature Model: Trigonometric function + Gaussian noise Represents periodic temperature variation in each day Includes small transient temperature variation To evaluate the benefit of HeatWatch, we use 28 real-workload traces from MSR-Cambridge, to provide real dwell time and retention time information. And we use a temperature model that captures both the periodic temperature variation during a day, and the small transient temperature variations due to a high CPU load.
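A temperature model of the "trigonometric function + Gaussian noise" kind could look like the Python sketch below. The mean temperature, daily swing, and noise level are made-up defaults, since the slides do not give the parameters used in the evaluation, and the function name is hypothetical.

```python
import math
import random


def ssd_temperature(t_hours, rng, mean_c=40.0, swing_c=5.0, noise_std_c=1.0):
    """Synthetic SSD temperature in degrees C at time t_hours.

    A 24-hour sinusoid models the periodic daily variation; Gaussian noise
    models small transient variations. All default values are hypothetical.
    """
    daily = swing_c * math.sin(2.0 * math.pi * t_hours / 24.0)
    return mean_c + daily + rng.gauss(0.0, noise_std_c)
```

Passing a seeded `random.Random` instance as `rng` makes a simulated trace repeatable across runs.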

213 HeatWatch Greatly Improves Flash Lifetime
Figure: error rate (y-axis) vs. lifetime in P/E cycles (x-axis) for Fixed Vref, State-of-the-art, HeatWatch, and Oracle, with the ECC limit marked. HeatWatch: 3.85x over Fixed Vref, 24% over state-of-the-art. Here is the result. The x-axis is P/E cycles, and the y-axis is the error rate. The dotted line shows the error correction limit. Flash lifetime ends when the error rate crosses this limit. The blue curve shows the baseline, where we apply a fixed read reference voltage because we do not know the retention time. The orange curve shows the conventional model. The green curve shows HeatWatch. The Oracle curve shows the ideal case in which we magically apply the actual optimal read reference voltage. We see that HeatWatch is very close to the oracle. Overall, HeatWatch improves lifetime by 24% over the conventional model and by 3.85 times over the baseline. HeatWatch improves lifetime by capturing the effects of retention, wearout, self-recovery, and temperature.
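The lifetime metric in this plot, the last P/E cycle count before the error rate crosses the ECC limit, can be computed from a measured curve as in the Python sketch below. The function name is hypothetical, and any curve passed to it in an example would be made-up data, not results from the paper.

```python
def lifetime_in_pe_cycles(curve, ecc_limit):
    """Return the largest P/E cycle count whose error rate is within the ECC limit.

    curve is a list of (pe_cycles, error_rate) pairs sorted by pe_cycles.
    The scan stops at the first point that exceeds the limit.
    """
    last_ok = 0
    for pe_cycles, error_rate in curve:
        if error_rate > ecc_limit:
            break
        last_ok = pe_cycles
    return last_ok
```

Dividing the lifetime obtained for one mechanism by the lifetime of the baseline gives the kind of improvement factor quoted on the slide.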

214 Outline Executive Summary Background on NAND Flash Reliability
Characterization of Self-Recovery and Temperature Effect on Real 3D NAND Flash Memory Chips URT: Unified Self-Recovery and Temperature Model HeatWatch Mechanism Conclusion

215 Conclusion 3D NAND flash memory susceptible to retention errors
Charge leaks out of flash cell
Two unreported factors: self-recovery and temperature
We study self-recovery and temperature effects
Experimental characterization of real 3D NAND chips
Unified Self-Recovery and Temperature (URT) Model
Predicts impact of retention loss, wearout, self-recovery, temperature on flash cell voltage
Low prediction error rate: 4.9%
We develop a new technique to improve flash reliability
HeatWatch
Uses URT model to find optimal read voltages for 3D NAND flash
Improves flash lifetime by 3.85x
In conclusion, we have looked at two previously unreported factors that affect 3D NAND retention errors: self-recovery and temperature. We present an experimental characterization of these two effects on real 3D NAND chips, and we develop a highly accurate model that achieves a 4.9% error rate. Using this model, we develop a new technique called HeatWatch that improves flash lifetime by 3.85x.

216 HeatWatch: Improving 3D NAND Flash Memory Device Reliability by Exploiting Self-Recovery and Temperature Awareness
Yixin Luo Saugata Ghose Yu Cai Erich F. Haratsch Onur Mutlu Thank you, I will be happy to take any questions.

