Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories SIGMETRICS’14
Summary
Problem: as the P/E cycle count increases, the raw BER rises significantly, beyond the fixed ECC capability.
Goal: extend lifetime by reducing the number of bit errors in a page to a level that ECC can correct.
How to reduce the number of bit errors: when reading a page, use multiple sets of reference voltages instead of the conventional single set.
Rationale for multiple sets of reference voltages: the authors observed “the threshold voltage distributions based on the values in the neighboring cell”, and reading with “the new reference voltage from the threshold voltage distributions” yields fewer errors.
Solution: when a page read fails ECC, the Neighbor-cell Assisted Correction (NAC) mechanism re-reads the page several times using multiple reference voltages, which lowers the number of errors in the page to within what ECC can correct.
Background: Program Interference
Prior work characterized and modeled this error type.
The threshold voltage of a cell (victim cell) can change when its neighbor cells (aggressor cells) are being programmed.
Program interference from neighbor cells on the same WL is negligible.
The dominant program interference on a victim cell C(n, j) comes from the aggressor cell C(n+1, j) on the WL immediately above the victim WL.
Optimizing Read Reference Voltage
Neighboring states (Pi and Pi+1) overlap.
For a given reference voltage (Vref), the blue region is the error due to “Pi misread as Pi+1” and the red region is the error due to “Pi+1 misread as Pi”.
f(x) and g(x) are the probability density functions (PDFs) of cells programmed into states Pi and Pi+1, respectively.
The optimal read reference voltage, which minimizes the sum of these two errors (the raw BER), is at the cross-point of the neighboring distributions.
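The cross-point claim can be checked numerically. The following is a minimal sketch, assuming Gaussian f(x) and g(x) with equal prior probability for the two states; the function names and the grid search are illustrative choices, not taken from the paper.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def q_function(x):
    """Standard Gaussian tail probability Q(x) = P(Z > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def raw_ber(v_ref, mu1, sigma1, mu2, sigma2):
    """Error probability for two equally likely states read at v_ref (assumption: P0 = P1)."""
    low_read_as_high = q_function((v_ref - mu1) / sigma1)   # Pi misread as Pi+1
    high_read_as_low = q_function((mu2 - v_ref) / sigma2)   # Pi+1 misread as Pi
    return 0.5 * (low_read_as_high + high_read_as_low)

def best_reference(mu1, sigma1, mu2, sigma2, steps=20000):
    """Grid-search the reference voltage between the two means that minimizes raw_ber."""
    candidates = (mu1 + (mu2 - mu1) * i / steps for i in range(steps + 1))
    return min(candidates, key=lambda v: raw_ber(v, mu1, sigma1, mu2, sigma2))
```

At the voltage returned by `best_reference`, the two densities computed by `gaussian_pdf` are nearly equal, which is the cross-point condition. With equal standard deviations the result is approximately the midpoint of the two means.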
Modeling Raw BER
Algebraic manipulation under a set of assumptions:
  Threshold voltage distributions (f(x) and g(x)) are Gaussian.
  The two Gaussians have equal variance (σ1 = σ2 = σ).
  Random data are programmed (P0 = P1).
  The optimum read reference voltage is used, v = (μ1+μ2)/2, from the previous slide.
Raw BER is then expressed as Q(x), where x = (μ2-μ1)/2σ.
As x increases, Q(x) (the raw BER) decreases monotonically.
A higher value of (μ2-μ1)/2σ is therefore desirable for minimizing raw BER:
  Larger threshold voltage distance (μ2-μ1) between neighboring distributions.
  Smaller variance (smaller σ) of the threshold voltage distribution, i.e. narrower distributions.
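The algebra behind this model can be written out as follows. This is a sketch under the listed assumptions, additionally taking P0 = P1 = 1/2 for the pair of neighboring states and reading Q(x) as the standard Gaussian tail function; both are assumptions about what the slide intends.

```latex
\begin{aligned}
\mathrm{BER}(v) &= P_0 \int_{v}^{\infty} f(x)\,dx + P_1 \int_{-\infty}^{v} g(x)\,dx \\
                &= \tfrac{1}{2}\,Q\!\left(\frac{v-\mu_1}{\sigma}\right)
                 + \tfrac{1}{2}\,Q\!\left(\frac{\mu_2-v}{\sigma}\right),
                 \qquad Q(x) = \frac{1}{\sqrt{2\pi}} \int_{x}^{\infty} e^{-t^{2}/2}\,dt \\
\mathrm{BER}\!\left(\frac{\mu_1+\mu_2}{2}\right) &= Q\!\left(\frac{\mu_2-\mu_1}{2\sigma}\right)
\end{aligned}
```

Substituting v = (μ1+μ2)/2 makes both arguments equal to (μ2-μ1)/2σ, so the two half-weighted terms collapse into a single Q term.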
Observations on Voltage Distribution
[Figure: threshold voltage distributions of the victim page before and after the aggressor WL is programmed]
We want to read the “victim page” (WL) “with minimum raw bit error”.
Before the aggressor page (WL) is programmed, the two neighboring distributions of the victim page are easy to distinguish.
After the aggressor page is programmed, program interference causes the distributions to overlap, increasing raw BER.
The threshold voltage distribution of all cells (overall distribution) can be further divided into four threshold voltage distributions (conditional distributions), based on the values of the aggressor cells.
Overall vs Conditional Distribution
The overall distribution is the sum of all four conditional distributions.
From the perspective of minimizing raw BER, two factors matter:
  Threshold voltage distance between neighboring distributions
  Variance of the threshold voltage distribution
Distance: overall distribution ≈ conditional distribution
Variance: overall distribution > conditional distribution
Reading a page using the conditional distributions, instead of the overall distribution, can therefore reduce raw BER.
[Figure: distance of conditional distribution pairs; variance of overall distribution; variance of conditional distribution]
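One way to see why the overall variance exceeds the conditional variance is the law of total variance for a mixture. The sketch below assumes the overall distribution is a weighted mixture of the four conditional distributions, each shifted by a different amount of interference; the function name and any numeric values passed to it are hypothetical, not measurements from the paper.

```python
def mixture_variance(means, sigmas, weights=None):
    """Variance of a mixture: weighted within-component variance plus variance of the means."""
    n = len(means)
    if weights is None:
        weights = [1.0 / n] * n          # assumption: aggressor values are equally likely
    overall_mean = sum(w * m for w, m in zip(weights, means))
    within = sum(w * s ** 2 for w, s in zip(weights, sigmas))
    between = sum(w * (m - overall_mean) ** 2 for w, m in zip(weights, means))
    return within + between

# Hypothetical example: four conditional distributions with the same sigma
# but slightly different means because of different aggressor values.
conditional_means = [2.00, 2.05, 2.10, 2.15]
conditional_sigmas = [0.1, 0.1, 0.1, 0.1]
overall_variance = mixture_variance(conditional_means, conditional_sigmas)
```

Whenever the conditional means differ, the "between" term is positive, so the mixture is wider than any single component with the same sigma.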
Multiple Sets of Reference Voltages
The optimal read reference voltage is (μ1+μ2)/2, from the previous model.
REFx is the single read reference voltage for the overall distribution.
REFx11 is the read reference voltage for the conditional distribution whose neighbor (aggressor) cell is programmed with the value “11”.
For 2-bit MLC flash, there are four additional read reference voltages, one for each conditional distribution.
Because the conditional distributions have smaller variance (are narrower), using multiple sets of reference voltages (REFx11, REFx00, REFx10, REFx01) instead of a single set (REFx) can reduce raw BER.
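The effect of choosing a reference per neighbor value can be illustrated with a toy read. This is a minimal sketch: the voltage values, the ordering of the four conditional references, and the helper names are all hypothetical and chosen only to show the mechanism.

```python
OVERALL_REF = 2.05                      # hypothetical REFx
CONDITIONAL_REFS = {                    # hypothetical REFx11, REFx10, REFx01, REFx00
    "11": 2.00,
    "10": 2.04,
    "01": 2.07,
    "00": 2.10,
}

def read_state(vth, ref):
    """Return 1 if the cell reads as the upper state Pi+1, else 0 for Pi."""
    return 1 if vth >= ref else 0

def read_with_overall_ref(vths, ref=OVERALL_REF):
    """Read every cell against the single overall reference."""
    return [read_state(v, ref) for v in vths]

def read_with_conditional_refs(vths, neighbor_values, refs=CONDITIONAL_REFS):
    """Read each cell against the reference selected by its neighbor's 2-bit value."""
    return [read_state(v, refs[n]) for v, n in zip(vths, neighbor_values)]

# Two hypothetical cells: one in Pi+1 with neighbor "11", one in Pi with neighbor "00".
vths = [2.03, 2.08]
neighbors = ["11", "00"]
true_states = [1, 0]
```

With these made-up numbers the overall reference misreads both cells, while the conditional references recover the true states.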
Measurement Analysis
[Figure: measured conditional distributions]
SNR (Signal to Noise Ratio): x = (μ2-μ1)/2σ; a larger x gives a smaller Q(x).
Because of their smaller variance (narrower distributions), the conditional distributions are likely to produce fewer errors when reading the threshold voltage.
Neighbor-cell Assisted Correction
(1) Read the target page using the overall read reference (REFx).
(2) Check ECC; if it fails, NAC starts.
(3) First read the neighbor (aggressor) pages (MSB/LSB) using REFx.
(4) Then read the target page using a conditional read reference (REF00, 11, 10, 01).
(5) Once the page is partially corrected, run ECC again.
(6) If ECC fails, try another conditional read reference.
(7) If ECC still fails after all conditional references have been tried, return an error.
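The steps above can be sketched as a control loop. This is a simplified illustration, not the paper's implementation: `read_target`, `read_neighbor_values`, and `ecc_ok` are hypothetical callables standing in for the flash read and the ECC decoder, and each cell is reduced to a single bit.

```python
def nac_read(read_target, read_neighbor_values, ecc_ok, overall_ref, conditional_refs):
    """Return corrected page data, or None if ECC fails for every conditional reference.

    read_target(ref)        -> list of bits read from the target page at reference ref
    read_neighbor_values()  -> list of 2-bit neighbor values ("11", "10", ...) per cell
    ecc_ok(data)            -> True if ECC can correct the page
    conditional_refs        -> dict mapping neighbor value to its conditional reference
    """
    data = read_target(overall_ref)                 # step 1: read with REFx
    if ecc_ok(data):                                # step 2: ECC check
        return data
    neighbors = read_neighbor_values()              # step 3: read aggressor pages
    for value, ref in conditional_refs.items():     # steps 4 and 6: one reference at a time
        local = read_target(ref)                    # temporary read with this reference
        # Keep the new bit only for cells whose neighbor holds this value.
        data = [new if n == value else old
                for old, new, n in zip(data, local, neighbors)]
        if ecc_ok(data):                            # step 5: retry ECC on the partial fix
            return data
    return None                                     # step 7: all references exhausted
```

The loop retries ECC after each conditional read, so it can stop early as soon as enough errors have been removed.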
NAC Implementation
Page-to-be-Corrected Buffer: stores the final read data
Neighbor LSB/MSB Page Buffers: store the aggressor page data
Bit1/Bit2: determine which conditional reference voltage is used
Local-Optimum-Read Buffer: stores the temporary page read with one of the four conditional references
Prioritized NAC
As P/E cycles increase, NAC degrades latency because of the additional reads (up to 6 extra):
  The first read with the overall read reference (1 read)
  The reads of the neighbor MSB/LSB pages (2 reads)
  The reads with the conditional read references (4 reads)
Observation reveals that specific errors are dominant.
  Pi+1(11)->Pi: a cell in state Pi+1 whose neighbor cell is 11 is misread as Pi.
Try REFx11 first among the four conditional references.
If ECC still fails, then use REFx10, REFx01, REFx00.
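The read-count arithmetic can be made explicit. The sketch below only tallies page reads per the counts listed on this slide; the function name and the priority list constant are illustrative.

```python
PRIORITY_ORDER = ["11", "10", "01", "00"]   # REFx11 first, then REFx10, REFx01, REFx00

def nac_page_reads(conditional_reads_tried):
    """Total page reads for one request, given how many conditional references were tried."""
    if not 0 <= conditional_reads_tried <= len(PRIORITY_ORDER):
        raise ValueError("conditional_reads_tried must be between 0 and 4")
    if conditional_reads_tried == 0:
        return 1                            # the first read passed ECC, NAC not invoked
    # 1 overall read + 2 neighbor page reads (MSB/LSB) + conditional reads tried so far
    return 1 + 2 + conditional_reads_tried
```

If ECC passes right after REFx11, the request costs 4 reads instead of the worst-case 7, which is the motivation for trying the dominant error case first.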
Lifetime Extension
NAC with different strengths (different numbers of conditional references)
Lifetime can be largely increased by lowering raw BER (to a level that ECC can correct)
Performance Analysis
Low P/E cycles: performance is slightly improved, because the neighbor MSB/LSB page reads of NAC generate hits in the SSD buffer, owing to the good locality of some workloads.
18K~24K P/E cycles: less than 5% degradation while providing a 33% lifetime improvement.
Over 25K P/E cycles: latency increases sharply because one of every 3 reads requires NAC due to ECC failure.