MACAU: A Markov Model for Reliability Evaluations of Caches Under Single-bit and Multi-bit Upsets
Jinho Suh, Murali Annavaram, Michel Dubois
Outline
  Definitions
  Model-based Soft-Error Reliability Evaluations
  Modeling Multiple bit Upsets
  MACAU
  Describing Markov Chain
  Measuring Intrinsic Reliability
  Benchmarking Realistic Reliability
  Evaluations
  Summary
HPCA-18
Definitions
  Domain: set of bits bundled together; protection domain
  Spatial MBU (Multi-Bit Upset; spatial q-BU): multiple (q) bits are flipped due to one particle hit (cf. spatial MBE)
  Temporal MBU: multiple bits are flipped due to more than one particle hit
Model-based Soft-Error Reliability Evaluations
Comparison [Saleh’90, Reviriego’09, Mukherjee’03, Suh’11]
                        Intrinsic MTTF   AVF   PARMA   MACAU
  Masking effect        X                O     O       O
  Single-Bit Upset      O                O     O       O
  Spatial MBU           O                X     X       O
  Temporal MBU          --               X     O       O
  Protection code       O                X     O       O
  Variable domain size  X                X     O       X
  Computation: Extremely cheap, Cheap, Expensive
Projection from industry [ITRS’07, ITRS’10]
  Rows: Year of production, Feature size [nm], Gate length [nm], SER [FIT per Mb], % MBU in SEU
  SER [FIT per Mb]: 1,250  1,300  1,350  1,400
  % MBU in SEU: 64%, 100%
Challenges in Modeling Spatial MBUs
  Multiple spatial MBUs may leave complex patterns of flipped bits
  Protection domain: word
  Dealing with MBUs spanning domains vertically/horizontally
    MBUs spanning domains vertically: easy to address if isolated in multiple protection domains
    MBUs spanning domains horizontally (edge effect): negligible impact
  [Figure: array of protection domains PD#1 to PD#7]
Assumptions in MACAU
  1. Spatial MBUs always occur in contiguous patterns
     Recent studies [Georgakos2007, Mahatme2011] report that (a) happens much more frequently than (b)/(c)
     (b)/(c) can be approximated by contiguous patterns, framed by the red dotted rectangles, as in (a) (with minimal overestimation)
  2. At most one SEU strikes a protection domain in one cycle, and at most two SEUs flip bits in the same protection domain
     Soft errors are extremely rare
  3. Edge effect is ignored
     It affects only a small number of bits next to the border of two protection domains
  [Figure: MBU patterns (a) Contiguous, (b) Disjoint, (c) Diagonal]
Distribution of Spatial MBUs in a SEU
  Probability distribution of MBUs for omni-directional galactic cosmic rays [Tipton’08] in the cache
  We concentrate on the MBU patterns included in the dotted square
  [Figure: probability distribution of MBU patterns; visible label: 0.5%]
Probability of a SEU in One Cycle
  SEU model
    p_SEU_PD: probability that one SEU arrives during one cycle period in the protection domain
    Poisson probability mass function gives p_SEU_PD
    λ_SEU_PD: Poisson rate of SEUs in a 32-bit word protection domain
    ex) 3× nm 3GHz CPU
  Probability of having a spatial q-BU in PD due to one SEU
    Including the effect of vertical spatial MBUs
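The Poisson step above can be sketched as follows. This is a minimal illustration, not the slide's own numbers: the raw rate of 1000 FIT/Mbit, the convention 1 Mbit = 2**20 bits, and the function names are assumptions made for the example. It uses only the standard fact that a Poisson process with per-cycle rate λ produces exactly one arrival in a cycle with probability λ·e^(-λ).

```python
import math

def seu_rate_per_cycle(fit_per_mbit, bits_in_domain, clock_hz):
    """Expected SEUs per cycle (Poisson rate) for one protection domain.

    FIT = events per 10**9 device-hours. 1 Mbit is taken as 2**20 bits,
    which is an assumption of this sketch.
    """
    seus_per_hour_per_bit = fit_per_mbit / 1e9 / 2**20
    seus_per_second = seus_per_hour_per_bit * bits_in_domain / 3600.0
    return seus_per_second / clock_hz

def p_one_seu(lam):
    """Poisson PMF at k = 1: probability of exactly one SEU in one cycle."""
    return lam * math.exp(-lam)

# Hypothetical inputs: 1000 FIT/Mbit, 32-bit word, 3 GHz clock.
lam = seu_rate_per_cycle(1000.0, 32, 3e9)
p = p_one_seu(lam)
```

Because λ is many orders of magnitude below 1 for realistic rates, p is numerically indistinguishable from λ itself, which is consistent with the assumption that at most one SEU strikes a protection domain per cycle.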
Modeling Spatial Effects with Markov Chain
  Markov states
    Transient state (non-recurrent state): after departing from the state, the probability of not returning is nonzero
    Absorbing state: no more transitions to other states are possible once the state is visited
  Markov chain
    State expresses the number of flipped, incorrect bits
  [Figure: Markov chains for SBU only, SBU and 2BU, and up to qBU; states 0, ..., q-1, q, q+1, ...]
Markov Chain and Overlap of SEUs
  Number of overlapped bits
    k bits flipped by 1st SEU: current Markov state = k
    q-BU arrives with 2nd SEU: next Markov state = k + d
    Overlap (o): k' = k + d = k + q - 2o
  Examples (overlap types: no, partial, full)
    k   q   o   d    k'   overlap
    8   3   1   1    9    partial
    8   3   2   -1   7    partial
  d = q - 2o
    d is even if and only if q is even
    d is odd if and only if q is odd
    o = 1, 2, …, min(k, q)
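The state update k' = k + q - 2o can be checked with a few lines of code. This is a small sketch with hypothetical function names; it also allows o = 0 so that the no-overlap case is covered, and it reuses k = 8, q = 3 from the partial-overlap rows of the slide.

```python
def next_state(k, q, o):
    """Flipped-bit count after a q-BU overlaps o already-flipped bits."""
    if not 0 <= o <= min(k, q):
        raise ValueError("overlap must satisfy 0 <= o <= min(k, q)")
    d = q - 2 * o  # o bits flip back to correct, q - o new bits become wrong
    return k + d

def overlap_kind(q, o):
    """Classify the overlap as 'no', 'partial' or 'full'."""
    if o == 0:
        return "no"
    return "full" if o == q else "partial"

# Enumerate every possible overlap for k = 8, q = 3.
rows = [(8, 3, o, 3 - 2 * o, next_state(8, 3, o), overlap_kind(3, o))
        for o in range(0, 4)]
for row in rows:
    print(row)
```

Each overlapped bit is flipped back to its correct value, which is why the overlap is subtracted twice: once because the bit is not a new error, and once because an old error disappears.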
Overlap of SEUs
  In an N-bit protection domain with a contiguous k-bit fault, an SEU of spatial q-BU arrives:
  1. Full overlap (0 < o = q)
  2. Partial overlap (0 < o < q)
  3. No overlap (o = 0)
  [Figure: N-bit protection domain with a k-bit fault and an arriving spatial q-BU]
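The three overlap cases can be explored by brute-force enumeration. The sketch below is not the closed form used in MACAU; it assumes, for illustration only, that the existing fault is a contiguous run of k bits starting at a known position and that the incoming q-BU is a contiguous run whose start is uniformly distributed over the N - q + 1 positions lying fully inside the domain (edge effect ignored). All names and the example values are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def overlap_distribution(n_bits, fault_start, k, q):
    """Probability of each overlap count o under uniform q-BU placement."""
    if fault_start < 0 or fault_start + k > n_bits or q > n_bits:
        raise ValueError("fault and q-BU must fit inside the domain")
    fault = set(range(fault_start, fault_start + k))
    positions = n_bits - q + 1  # assumed: q-BU lies fully inside the domain
    counts = Counter()
    for start in range(positions):
        hit = set(range(start, start + q))
        counts[len(fault & hit)] += 1
    return {o: Fraction(c, positions) for o, c in sorted(counts.items())}

# Hypothetical example: 32-bit word, 8-bit fault at bits 12..19, 3-BU arrives.
dist = overlap_distribution(32, 12, 8, 3)
```

Here o = 3 is the full-overlap case, o = 1 and o = 2 are partial overlaps, and o = 0 is no overlap; the resulting probabilities are the kind of quantity that fills the transition matrix T.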
Markov Transition Matrix T: Example for the Case of D
  Building T with overlapping probabilities
  Example: matrix D for MBUs with up to 3 horizontal BUs and up to 2 vertical BUs
Markov Transition Matrix T: General Case
  T contains the probabilities of transition between any two states in one cycle
Using MACAU
  Measuring intrinsic MTTF
    1. Build T
    2. Add transitions on T for scrubbing
    3. Calculate mean first-passage time from state 0 to failing states
  Benchmarking
    1. Build T
    2. Program starts
    3. Manage VCC
    4. Word consumed? If no, continue managing VCC
    5. If yes, calculate probability of failure on word from T^VCC
    6. Accumulate E[#fail]
    7. Program ends? If no, continue managing VCC
    8. If yes, calculate failure rate
Calculating intrinsic MTTF
  Mean first-passage time gives intrinsic MTTF
  1. Make the states that cause failure absorbing states
     With a b-bit error correcting code, states > b are failing states
  2. Measure the transition time from state 0 (clean state) to a failing state
  With transition matrix T:
    Without scrubbing
      First-passage time from state 0 to any absorbing state gives the intrinsic MTTF
    With (stochastic) scrubbing at a scrubbing interval of L
      States that can be scrubbed have extra transitions to state 0 with probability = 1/L
      Then the first-passage time gives the intrinsic MTTF
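The first-passage computation can be sketched on a deliberately small chain. Everything specific here is an assumption of the example rather than MACAU's actual matrix: SBUs only, SEC protection (b = 1, so state 2 is the failing state), a toy per-cycle SBU probability p, and scrubbing modelled as an extra 1 -> 0 transition of probability 1/L taken from the self-loop. The calculation uses the standard absorbing-chain result that the expected times to absorption t satisfy (I - Q) t = 1, where Q is the transient-to-transient part of T. Exact fractions are used because 1 - p rounds to 1.0 in floating point for realistic p.

```python
from fractions import Fraction

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination (exact with Fractions)."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        pv = m[col][col]
        m[col] = [x / pv for x in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0:
                f = m[r][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] for i in range(n)]

def mean_cycles_to_failure(p, n_bits, scrub_interval=None):
    """Expected cycles from state 0 to failure for SEC with SBUs only.

    Transient states: 0 (clean) and 1 (one correctable flipped bit).
    """
    s = Fraction(0) if scrub_interval is None else Fraction(1, scrub_interval)
    back = p * Fraction(1, n_bits)           # SBU re-flips the faulty bit: 1 -> 0
    fail = p * Fraction(n_bits - 1, n_bits)  # SBU hits a new bit: 1 -> 2 (fail)
    q = [[1 - p, p],
         [back + s, 1 - back - fail - s]]    # transient part of T
    a = [[(1 if i == j else 0) - q[i][j] for j in range(2)] for i in range(2)]
    return solve(a, [Fraction(1), Fraction(1)])[0]

# Toy values (assumptions): p = 1/1000 per cycle, 32-bit word, L = 100 cycles.
p = Fraction(1, 1000)
no_scrub = mean_cycles_to_failure(p, 32)
scrubbed = mean_cycles_to_failure(p, 32, scrub_interval=100)
```

With these toy values the scrubbed chain takes several times longer to reach the failing state, which is the qualitative effect the scrubbing columns of an intrinsic-MTTF table capture.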
Benchmarking FIT rate with T
  Whenever an access is made:
  1. Measure VCC
  2. Calculate S = T^VCC to get the transition matrix after VCC cycles
  3. Add to the expected number of failures by summing the probabilities of reaching failure states
     Failure probability is obtained from the state transition probabilities in S
  [Table: SDC and DUE by protection scheme (No protection, Odd-parity, SECDED, DECTED, TECQED)]
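The accumulation loop above can be sketched as follows. The chain is a toy stand-in, not MACAU's T: it assumes SBUs only, lumps all states with three or more flipped bits into one absorbing top state, and assumes every consumed word was clean (state 0) at the start of its vulnerable interval. The SECDED mapping (2 flipped bits counted as DUE, more than 2 as SDC) follows the addendum slide on temporal effects; the values of p and the VCC list are hypothetical.

```python
def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(t, exponent):
    """Plain repeated multiplication; adequate for small toy VCC counts."""
    n = len(t)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(exponent):
        result = mat_mul(result, t)
    return result

def sbu_chain(p, n_bits, top):
    """Toy T: state = number of flipped bits, SBUs only, top state absorbing."""
    t = [[0.0] * (top + 1) for _ in range(top + 1)]
    for k in range(top):
        up = p * (n_bits - k) / n_bits  # SBU lands on a correct bit
        down = p * k / n_bits           # SBU lands on an already-flipped bit
        t[k][k + 1] = up
        if k > 0:
            t[k][k - 1] = down
        t[k][k] = 1.0 - up - down
    t[top][top] = 1.0
    return t

def expected_failures(t, vccs, failing_states):
    """Sum over consumed words of P(word is in a failing state at access)."""
    total = 0.0
    for vcc in vccs:
        s = mat_pow(t, vcc)  # S = T^VCC
        total += sum(s[0][j] for j in failing_states)
    return total

# Hypothetical run: three word consumptions after 10, 50 and 200 VCCs.
t = sbu_chain(0.01, 32, 3)
vccs = [10, 50, 200]
due = expected_failures(t, vccs, [2])  # SECDED: 2 flipped bits detected
sdc = expected_failures(t, vccs, [3])  # SECDED: more than 2 flipped bits
```

Dividing the accumulated expectation by the simulated time would then give a failure rate, as in the last step of the benchmarking flow.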
Evaluations
  Intrinsic MTTFs
    Differences in MTTF: MACAU addresses the ‘overlapping effect’, which Saleh/Reviriego ignore [Ming’11]
  FIT rates from benchmarking
    MACAU differs by ≤ 0.015% from PARMA when benchmarking SBUs only
  Intrinsic MTTF, 32b word (columns: No scrub, Once/year, Once/month, Once/day)
    SEUs                Protection   Model       Values
    SBUs only           SEC          MACAU       6.715E E E E+15
    SBUs only           SEC          Saleh       6.245E E E E+15
    1BU+2BU (0.5:0.5)   DEC          MACAU       8.012E E E E+15
    1BU+2BU (0.5:0.5)   DEC          Reviriego   7.211E E E E+15
    D                   DEC          MACAU       9.700E E+08
    D                   DEC          Reviriego   ---
    D                   TEC          MACAU       1.330E E E E+16
    D                   TEC          Reviriego   1.700E E E E+16
  Benchmarked rates (columns: No-protection, Odd-parity, SECDED, DECTED, TECQED)
    DUE TRUE    N/A E-16
    DUE FALSE   N/A E-15
    SDC         E E-17
Summary
  MACAU
    Model for temporal/spatial MBU effects
    Capable of evaluating various protection schemes
    Useful for quick evaluation of caches, by measuring intrinsic MTTFs
    Useful for rigorously benchmarking FIT rates in caches under MBUs and SBUs
  Future work
    Refining the model to address the edge effect
    Spatial MBU model for arbitrarily shaped patterns
    Model for TAG and meta-bit vulnerability
    Application to processor buffers (ROB, LSQ, IFQ)
THANK YOU!
(Some) References
[Biswas’10] Arijit Biswas, Charles Recchia, Shubhendu S. Mukherjee, Vinod Ambrose, Leo Chan, Aamer Jaleel, Mike Plaster, and Norbert Seifert. Explaining Cache SER Anomaly Using Relative DUE AVF Measurement. HPCA.
[Li’05] X. Li, S. Adve, P. Bose, and J. A. Rivers. SoftArch: An Architecture Level Tool for Modeling and Analyzing Soft Errors. In Proceedings of the International Conference on Dependable Systems and Networks.
[Mahatme’11] N. Mahatme, B. Bhuva, Y. Fang, and A. Oates. Analysis of multiple cell upsets due to neutrons in SRAMs for a deep-N-well process. In Reliability Physics Symposium (IRPS), 2011 IEEE International (April 2011), pp. SE.7.1 – 6.
[Ming’11] Z. Ming, X. L. Yi, L. Chang, and Z. J. Wei. Reliability of memories protected by multibit error correction codes against MBUs. IEEE Transactions on Nuclear Science 58, 1 (Feb. 2011), 289 – 295.
[Mukherjee’03] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin. A systematic methodology to calculate the architectural vulnerability factors for a high-performance microprocessor. In Proceedings of the 36th International Symposium on Microarchitecture, pages 29-40.
[Mukherjee’04] S. S. Mukherjee, J. Emer, T. Fossum, and S. K. Reinhardt. Cache Scrubbing in Microprocessors: Myth or Necessity? In Proceedings of the 10th IEEE Pacific Rim Symposium on Dependable Computing, 37-42.
[Reviriego’09] P. Reviriego and J. A. Maestro. Study of the effects of multibit error correction codes on the reliability of memories in the presence of MBUs. IEEE Transactions on Device and Materials Reliability 9 (2009).
[Saleh’90] A. M. Saleh, J. J. Serrano, and J. H. Patel. Reliability of Scrubbing Recovery Techniques for Memory Systems. IEEE Transactions on Reliability, 39(1).
[Tipton’08] A. D. Tipton, J. A. Pellish, J. M. Hutson, R. Baumann, X. Deng, A. Marshall, M. A. Xapsos, H. S. Kim, M. R. Friendlich, M. J. Campola, C. M. Seidleck, K. A. LaBel, M. H. Mendenhall, R. A. Reed, R. D. Schrimpf, R. A. Weller, and J. D. Black. Device-orientation effects on multiple-bit upset in 65 nm SRAMs. IEEE Transactions on Nuclear Science 55 (2008).
[Ziegler] J. F. Ziegler and H. Puchner. SER – History, Trends and Challenges. Cypress Semiconductor Corp.
ADDENDUM
Definitions
  Domain: set of bits bundled together; protection domain
  Fault: incorrect state in a domain
  Error: a manifested fault, propagated outside the original domain
  Failure: visible error
  Consumption: an event resulting in the change of architectural state
  SEU (Single Event Upset): state change in memory due to one particle hit
  SBU (Single-Bit Upset): one bit is flipped due to one SEU
  Spatial MBU (Multi-Bit Upset; spatial q-BU): multiple (q) bits are flipped due to one SEU
  Temporal MBU: multiple bits are flipped due to more than one SEU
  Vulnerability clock cycles (VCCs): time in cycles that a bit is exposed to particle hits
Model-based Soft-Error Reliability Evaluations
  Intrinsic reliability
    (Intrinsic) Mean-Time-To-Failure [Saleh’90, Reviriego’09]
      + Fast, first-cut estimation of circuit-level reliability
      − Highly pessimistic: no consideration of masking effects
      − Unclear for protected memories: no consideration of cleaning effects on accesses
  Benchmarked reliability
    AVF (Architectural Vulnerability Factor) [Mukherjee’03]
      + Quickly calculates SDC without protection or DUE under parity due to SBUs
      − Ignores temporal/spatial MBUs
      − Cannot account for error detection/correction schemes
    PARMA (Precise Analytical Reliability Model for Architectures) [Suh’11]
      + Addresses temporal MBUs caused by multiple SBUs
      + Evaluates FIT (Failures-In-Time) of protected caches
      − Cannot account for spatial MBUs
  MACAU models spatial MBUs and their temporal effects to evaluate soft-error vulnerabilities on cache data bit-cells
Vulnerability Clock Cycles (VCCs)
  Common assumption in model-based studies
  We measure a bit’s exposure time with VCCs
  VCCs are equivalent to ACE (Architecturally Correct Execution) cycles in AVF methods
  Managing VCCs is similar to (reliability-)lifetime analysis in AVF
Two Basic Models in Soft Error Benchmarking
  1. Fault generation model
     Poisson Single Event Upset model
     Probability distribution of having k faulty bit(s) in a domain (set of bits) during vulnerability clock cycles
  2. Fault propagation model
     Observing consumption for tracking failures due to SEUs
     Accumulating the expected number of (total system) failures whenever consumption happens
  Benchmarking measures: generated faults, errors (propagated faults), expected number of failures, failure rate
Temporal & Spatial Effects
  Temporal effects
    Required for evaluating/quantifying reliability in the presence of protection codes
    Example: if SBUs are dominant, temporal effects should be addressed when evaluating SECDED-protected caches
      Evaluating DUE FIT rates requires quantifying failures with 2 flipped bits
      Evaluating SDC FIT rates requires quantifying failures with >2 flipped bits
  Spatial effects
    Growing concern with future technologies
    All SEUs are expected to be spatial MBUs in the near future [ITRS’07]
    Radiation-hardened/interleaved designs may not always be possible
Spatial MBUs and Layout
  Circuit layout determines the population of spatial MBUs
  The deep-N-well process is commonly used by TSMC, Infineon, etc.
  Parasitic bipolar transistors contribute to spatial MBUs
    With the deep-N-well process, only parasitic NPN transistors are turned on
    At most two bit flips are observed in the same direction of wells [Mahatme’11]
Markov Chain and Overlap of SEUs
  Consider how many bits are overlapped
    k bits flipped by 1st SEU: current Markov state = k
    q-BU arrives with 2nd SEU: next Markov state = k + d
    Overlap (o): k' = k + d = k + q - 2o
  d = q - 2o
    d is even if and only if q is even
    d is odd if and only if q is odd
    o = 1, 2, …, min(k, q)
Set-up for Markov Chain
  Overlapping probabilities
    o overlapped bits when a q-BU hits a word with k already-flipped bits
    1. If 0 < o = q
    2. If 0 < o < q
    3. Else o = 0
MACAU: Computing S = T^Nc
  Major computation bottleneck in MACAU
  Square-and-multiply method for efficient matrix multiplication
  Matrix computation is also data-parallel computation
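The square-and-multiply idea named above can be sketched generically. This is a textbook exponentiation-by-squaring routine written for plain Python lists, not MACAU's implementation; the function names and the small test matrix are illustrative. It needs O(log N_c) matrix products instead of N_c - 1, each costing O(n^3) for an n x n matrix.

```python
def mat_mul(a, b):
    """Dense n x n matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

def mat_pow(t, n_c):
    """Compute T**n_c by scanning the bits of n_c from least significant."""
    result = identity(len(t))
    base = t
    while n_c > 0:
        if n_c & 1:                   # this bit of n_c is set: multiply it in
            result = mat_mul(result, base)
        base = mat_mul(base, base)    # square for the next bit
        n_c >>= 1
    return result

# Tiny stochastic example: state 1 is absorbing.
t = [[0.5, 0.5],
     [0.0, 1.0]]
s = mat_pow(t, 3)
```

The independent row-by-column dot products inside `mat_mul` are what make the computation data-parallel, as the slide notes.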
AVF vs PARMA vs MACAU
Comparison (VCC = N_c)
                                 AVF    PARMA    MACAU
  SBU                            O      O        O
  Spatial MBU                    X      X        O
  Temporal MBU                   X      O        O
  Protection code                X      O        O
  Variable domain size (n-bit)   X      O        X
  Computation complexity         O(1)   O(n^3)   O(n^3 × log(N_c))
  MACAU is capable of addressing all the soft-error-related situations
  Both PARMA and MACAU are much slower than AVF
  Is the computation overhead overkill for practical use? No
    Reliability-aware sampling for accelerating reliability simulations