MACAU: A Markov Model for Reliability Evaluations of Caches Under Single-bit and Multi-bit Upsets Jinho Suh Murali Annavaram Michel Dubois.

Presentation transcript:


Outline
  Definitions
  Model-based Soft-Error Reliability Evaluations
  Modeling Multiple-bit Upsets
  MACAU
  Describing Markov Chain
  Measuring Intrinsic Reliability
  Benchmarking Realistic Reliability
  Evaluations
  Summary
HPCA-18

Definitions
  Domain: set of bits bundled together; protection domain
  Spatial MBU (Multi-Bit Upset; spatial q-BU): multiple (q) bits are flipped due to one particle hit (cf. spatial MBE)
  Temporal MBU: multiple bits are flipped due to more than one particle hit

Model-based Soft-Error Reliability Evaluations
Comparison [Saleh’90, Reviriego’09, Mukherjee’03, Suh’11]

                        Intrinsic MTTF   AVF   PARMA   MACAU
  Masking effect        X                O     O       O
  Single-Bit Upset      O                O     O       O
  Spatial MBU           O                X     X       O
  Temporal MBU          --               X     O       O
  Protection code       O                X     O       O
  Variable domain size  X                X     O       X
  Computation: Extremely cheap; Cheap; Expensive

Projection from industry [ITRS’07, ITRS’10]
  Year of production
  Feature size [nm]
  Gate length [nm]
  SER [FIT per Mb]: 1,250; 1,300; 1,350; 1,400
  % MBU in SEU: 64%; 100%

Challenges in Modeling Spatial MBUs
  Multiple spatial MBUs may leave complex patterns of flipped bits
  Protection domain: word
  Dealing with MBUs spanning domains vertically/horizontally
    MBUs spanning domains vertically: easy to address if isolated in multiple protection domains
    MBUs spanning domains horizontally (edge effect): negligible impact
[Figure: array of protection domains PD#1, PD#2, ..., PD#7, ...]

Assumptions in MACAU
1. Spatial MBUs always occur in contiguous patterns
   Recent studies [Georgakos2007, Mahatme2011] report that (a) happens much more frequently than (b)/(c)
   (b)/(c) can be approximated, with minimal overestimation, by the contiguous patterns framed by the red dotted rectangles, as in (a)
2. At most one SEU strikes a protection domain in one cycle, and at most two SEUs flip bits in the same protection domain
   Soft errors are extremely rare
3. Edge effect is ignored
   It affects only a small number of bits next to the border of two protection domains
[Figure: MBU patterns (a) Contiguous, (b) Disjoint, (c) Diagonal]

Distribution of Spatial MBUs in a SEU
  Probability distribution of MBUs for omni-directional galactic cosmic rays [Tipton’08] in the cache
  We concentrate on the MBU patterns included in the dotted square
[Figure: probability distribution of MBU patterns; one surviving label reads 0.5%]

Probability of a SEU in One Cycle
SEU model
  p_SEU_PD: probability that one SEU arrives in the protection domain during one cycle period
  The Poisson probability mass function gives p_SEU_PD
  λ_SEU_PD: Poisson rate of SEUs in a 32-bit word protection domain
  ex) 3× nm 3GHz CPU
Probability of having a spatial q-BU in the PD due to one SEU
  Includes the effect of vertical spatial MBUs
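The Poisson step can be illustrated with a short sketch. It assumes the SER is given in FIT per Mb (failures per 10^9 hours per 2^20 bits, which is an assumption about the unit) and that the rate scales linearly with the number of bits in the domain. The function name and the sample inputs (1,250 FIT/Mb from the ITRS table, a 32-bit word, a 3 GHz clock) are illustrative and not taken from the slides' own calculation.

```python
import math

def seu_probability_per_cycle(fit_per_mbit, bits_in_domain, clock_hz):
    """P(exactly one SEU hits the protection domain in one cycle), Poisson model."""
    # FIT = events per 1e9 hours; scale from 1 Mb (assumed 2**20 bits) to the domain.
    events_per_hour = fit_per_mbit * 1e-9 * bits_in_domain / 2**20
    # Convert the hourly rate into the expected number of SEUs per clock cycle.
    lam = events_per_hour / 3600.0 / clock_hz
    # Poisson probability mass function evaluated at k = 1.
    return lam * math.exp(-lam)

p_seu_pd = seu_probability_per_cycle(1250, 32, 3e9)
print(p_seu_pd)
```

Because the per-cycle rate is so small, the result is practically equal to the rate itself, which is consistent with the assumption that soft errors are extremely rare.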

Modeling Spatial Effects with Markov Chain
Markov states
  Transient state (non-recurrent state): after departing from the state, the probability of not returning is nonzero
  Absorbing state: no more transitions to other states are possible once the state is visited
Markov chain
  Each state expresses the number of flipped, incorrect bits
[Figure: Markov chains for SBU only, for SBU and 2BU, and for up to q-BU, with states 0, ..., q-1, q, q+1, ...]

Markov Chain and Overlap of SEUs
Number of overlapped bits
  k bits flipped by the 1st SEU: current Markov state = k
  q-BU arrives with the 2nd SEU: next Markov state = k + d
  Overlap (o): k' = k + d = k + q - 2o
  d = q - 2o
  d is even if and only if q is even
  d is odd if and only if q is odd
  o = 1, 2, …, min(k, q)

  k   q   o   d    k'      overlap
  k   q   0   q    k + q   no
  8   3   1   1    9       partial
  8   3   2   -1   7       partial
  k   q   q   -q   k - q   full
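The state update k' = k + q - 2o can be checked with a few lines of code. This is only a sketch of the arithmetic on the slide; the function name `next_state` is hypothetical. The overlapped bits flip back to their correct value, so o bits leave the fault and q - o new bits join it.

```python
def next_state(k, q, o):
    """Number of flipped bits after a q-BU overlaps o of the k already-flipped bits."""
    if not 0 <= o <= min(k, q):
        raise ValueError("overlap must satisfy 0 <= o <= min(k, q)")
    d = q - 2 * o  # net change: (q - o) newly flipped bits minus o restored bits
    return k + d

for o in range(4):
    print(o, next_state(8, 3, o))
```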

Overlap of SEUs
In an N-bit protection domain with a contiguous k-bit fault, a SEU of spatial q-BU arrives:
1. Full overlap (0 < o = q)
2. Partial overlap (0 < o < q)
3. No overlap (o = 0)
[Figure: N-bit protection domain with a k-bit fault and an arriving spatial q-BU]
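One way to get a feel for the three overlap cases is to enumerate every placement of the incoming q-BU. The sketch below is not the formula from the slides (those did not survive in the transcript). It assumes that the q-BU is contiguous, that it lands with equal probability at each of the N - q + 1 positions fully inside the domain (edge effect ignored), and that the existing fault occupies a known contiguous span. The function and parameter names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def overlap_distribution(n_bits, fault_start, k, q):
    """Distribution of the overlap o between a fixed k-bit fault and a random q-BU."""
    fault_end = fault_start + k          # fault occupies bits [fault_start, fault_end)
    positions = n_bits - q + 1           # start positions that keep the q-BU in the domain
    counts = Counter()
    for t in range(positions):           # q-BU occupies bits [t, t + q)
        o = max(0, min(fault_end, t + q) - max(fault_start, t))
        counts[o] += 1
    return {o: Fraction(c, positions) for o, c in counts.items()}

dist = overlap_distribution(n_bits=32, fault_start=12, k=8, q=3)
for o in sorted(dist):
    print(o, dist[o])
```

In this example, o = 3 is the full-overlap case, o = 1 and o = 2 are partial overlaps, and o = 0 is no overlap.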

Markov Transition Matrix T: Example for the Case of D
  Building T with overlapping probabilities
  Example: matrix D for MBUs with up to 3 horizontal BUs and up to 2 vertical BUs

Markov Transition Matrix T: General Case
  T contains the probabilities of transition between any two states in one cycle
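As a much simplified illustration of what such a matrix looks like, the sketch below builds T for the SBU-only case. It is not the general construction from the slides. It assumes that at most one SBU arrives per cycle with probability p_seu, and that the hit lands uniformly on one of the N bits, so it restores a flipped bit with probability k/N (overlap o = 1) and flips a clean bit otherwise (o = 0). All names are hypothetical.

```python
from fractions import Fraction

def build_sbu_transition_matrix(n_bits, p_seu):
    """One-cycle transition matrix over states 0..n_bits (number of flipped bits)."""
    size = n_bits + 1
    T = [[Fraction(0)] * size for _ in range(size)]
    for k in range(size):
        T[k][k] += 1 - p_seu                                  # no SEU this cycle
        if k < n_bits:
            T[k][k + 1] += p_seu * Fraction(n_bits - k, n_bits)  # hit on a clean bit
        if k > 0:
            T[k][k - 1] += p_seu * Fraction(k, n_bits)           # hit on a flipped bit
    return T

T = build_sbu_transition_matrix(4, Fraction(1, 100))
for row in T:
    print([str(x) for x in row])
```

Every row sums to 1, as required of a transition matrix.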

Using MACAU
Measuring intrinsic MTTF
  1. Build T
  2. Add transitions on T for scrubbing
  3. Calculate mean first-passage time from state 0 to failing states
Benchmarking
  1. Build T
  2. Program starts; manage VCC
  3. Word consumed? If yes, calculate the probability of failure on the word from T^VCC and accumulate E[#fail]
  4. Program ends? If no, continue from step 2; if yes, calculate the failure rate

Calculating intrinsic MTTF
Mean first-passage time gives intrinsic MTTF
1. Make the states that cause failure absorbing states
   With a b-bit error correcting code, states > b are failing states
2. Measure the transition time from state 0 (clean state) to a failing state
With transition matrix T:
  Without scrubbing
    First-passage time from state 0 to any absorbing state gives the intrinsic MTTF
  With (stochastic) scrubbing with scrubbing interval L
    States that can be scrubbed have extra transitions to state 0 with probability 1/L
    The first-passage time then gives the intrinsic MTTF
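The mean first-passage time to the absorbing states can be obtained with the standard absorbing-chain result: with Q the part of T restricted to the transient states, the expected times t solve (I - Q) t = 1. The sketch below solves that system exactly with fractions. The three-state chain is a toy example (SEC-like, states above 1 fail), and taking the scrub probability out of the self-loop is an assumption made here for illustration, not something the slides specify.

```python
from fractions import Fraction

def mean_first_passage(T, transient):
    """Expected number of cycles until absorption, from each transient state."""
    n = len(transient)
    # Augmented system [I - Q | 1], Q = T restricted to the transient states.
    A = [[(1 if i == j else 0) - Fraction(T[transient[i]][transient[j]])
          for j in range(n)] + [Fraction(1)] for i in range(n)]
    for col in range(n):                                   # Gauss-Jordan elimination
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [x / pivot_value for x in A[col]]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col]
                A[r] = [x - factor * y for x, y in zip(A[r], A[col])]
    return {transient[i]: A[i][n] for i in range(n)}

p = Fraction(1, 10)   # toy per-cycle upset probability
s = Fraction(1, 5)    # toy scrub probability 1/L with L = 5
no_scrub = [[1 - p, p, 0], [0, 1 - p, p], [0, 0, 1]]
with_scrub = [[1 - p, p, 0], [s, 1 - p - s, p], [0, 0, 1]]
print(mean_first_passage(no_scrub, [0, 1])[0])    # 20
print(mean_first_passage(with_scrub, [0, 1])[0])  # 40
```

In this toy chain, scrubbing doubles the expected time to failure; real values would be converted from cycles to time using the clock period.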

Benchmarking FIT rate with T
Whenever an access is made:
1. Measure VCC
2. Calculate S = T^VCC to get the transition matrix after VCC cycles
3. Add to the expected number of failures by summing the probabilities of reaching failure states
   Failure probability is obtained from the state transition probability in S
[Table: SDC and DUE rows for each protection scheme: No protection, Odd-parity, SECDED, DECTED, TECQED]
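The three steps above can be sketched as follows. Instead of forming S = T^VCC explicitly, this sketch propagates the state distribution starting from the clean state 0 for VCC cycles, which yields row 0 of S. The two-state matrix, the list of per-access VCC values, and the function names are all illustrative assumptions.

```python
from fractions import Fraction

def failure_probability(T, vcc, fail_states, start=0):
    """Probability of being in a failure state after vcc cycles (row `start` of T^vcc)."""
    n = len(T)
    v = [Fraction(0)] * n
    v[start] = Fraction(1)                 # the word starts in the clean state
    for _ in range(vcc):                   # one vector-matrix product per cycle
        v = [sum(v[i] * T[i][j] for i in range(n)) for j in range(n)]
    return sum(v[f] for f in fail_states)

def expected_failures(T, access_vccs, fail_states):
    """Accumulate E[#fail] over all accesses, each with its own measured VCC."""
    return sum(failure_probability(T, vcc, fail_states) for vcc in access_vccs)

T = [[Fraction(9, 10), Fraction(1, 10)],
     [Fraction(0), Fraction(1)]]           # state 1 is the (absorbing) failure state
print(expected_failures(T, access_vccs=[1, 2], fail_states=[1]))  # 29/100
```

Dividing the accumulated expectation by the simulated execution time would then give a failure rate.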

Evaluations
Intrinsic MTTFs
  Differences in MTTF: MACAU addresses the ‘overlapping effect’, which Saleh/Reviriego ignore [Ming’11]
  Table columns: SEUs; Protection on a word; Model; 32b-word with No scrub, Once/year, Once/month, Once/day
    SBUs only, SEC: MACAU 6.715E E E E+15; Saleh 6.245E E E E+15
    1BU+2BU (0.5:0.5), DEC: MACAU 8.012E E E E+15; Reviriego 7.211E E E E+15
    D, DEC: MACAU 9.700E E+08; Reviriego ---
    TEC: MACAU 1.330E E E E+16; Reviriego 1.700E E E E+16
FIT rates from benchmarking
  MACAU differs by ≤ 0.015% from PARMA when benchmarking SBUs only
  Table columns: No-protection, Odd-parity, SECDED, DECTED, TECQED
    DUE TRUE: N/A E-16
    DUE FALSE: N/A E-15
    SDC: E E-17

Summary
MACAU
  Model for temporal/spatial MBU effects
  Capable of evaluating various protection schemes
  Useful for quick evaluation of caches, by measuring intrinsic MTTFs
  Useful for rigorously benchmarking FIT rates in caches under MBUs and SBUs
Future work
  Refining the model to address the edge effect
  Spatial MBU model for arbitrarily shaped patterns
  Model for TAG and meta-bit vulnerability
  Application to processor buffers (ROB, LSQ, IFQ)

THANK YOU!

(Some) References
[Biswas’10] Arijit Biswas, Charles Recchia, Shubhendu S. Mukherjee, Vinod Ambrose, Leo Chan, Aamer Jaleel, Mike Plaster, and Norbert Seifert, “Explaining Cache SER Anomaly Using Relative DUE AVF Measurement,” HPCA.
[Li’05] X. Li, S. Adve, P. Bose, and J. A. Rivers. SoftArch: An Architecture Level Tool for Modeling and Analyzing Soft Errors. In Proceedings of the International Conference on Dependable Systems and Networks.
[Mahatme’11] Mahatme, N., Bhuva, B., Fang, Y., and Oates, A. Analysis of multiple cell upsets due to neutrons in SRAMs for a deep-n-well process. In Reliability Physics Symposium (IRPS), 2011 IEEE International (April 2011), pp. SE.7.1-6.
[Ming’11] Ming, Z., Yi, X. L., Chang, L., and Wei, Z. J. Reliability of memories protected by multibit error correction codes against MBUs. Nuclear Science, IEEE Transactions on 58, 1 (Feb. 2011), 289-295.
[Mukherjee’03] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin. A systematic methodology to calculate the architectural vulnerability factors for a high-performance microprocessor. In Proceedings of the 36th International Symposium on Microarchitecture, pages 29-40.
[Mukherjee’04] S. S. Mukherjee, J. Emer, T. Fossum, and S. K. Reinhardt. Cache Scrubbing in Microprocessors: Myth or Necessity? In Proceedings of the 10th IEEE Pacific Rim Symposium on Dependable Computing, 37-42.
[Reviriego’09] Reviriego, P., and Maestro, J. A. Study of the effects of multibit error correction codes on the reliability of memories in the presence of MBUs. IEEE Transactions on Device and Materials Reliability 9 (2009).
[Saleh’90] A. M. Saleh, J. J. Serrano, and J. H. Patel. Reliability of Scrubbing Recovery Techniques for Memory Systems. In IEEE Transactions on Reliability, 39(1).
[Tipton’08] Tipton, A. D., Pellish, J. A., Hutson, J. M., Baumann, R., Deng, X., Marshall, A., Xapsos, M. A., Kim, H. S., Friendlich, M. R., Campola, M. J., Seidleck, C. M., LaBel, K. A., Mendenhall, M. H., Reed, R. A., Schrimpf, R. D., Weller, R. A., and Black, J. D. Device-orientation effects on multiple-bit upset in 65 nm SRAMs. IEEE Transactions on Nuclear Science 55 (2008).
[Ziegler] J. F. Ziegler and H. Puchner, “SER – History, Trends and Challenges,” Cypress Semiconductor Corp.

ADDENDUM

Definitions
  Domain: set of bits bundled together; protection domain
  Fault: incorrect state in a domain
  Error: a manifested fault, propagated outside the original domain
  Failure: visible error
  Consumption: an event resulting in the change of architectural state
  SEU (Single Event Upset): state change in memory due to one particle hit
  SBU (Single-Bit Upset): one bit is flipped due to one SEU
  Spatial MBU (Multi-Bit Upset; spatial q-BU): multiple (q) bits are flipped due to one SEU
  Temporal MBU: multiple bits are flipped due to more than one SEU
  Vulnerability clock cycles (VCCs): time in cycles that a bit is exposed to particle hits

Model-based Soft-Error Reliability Evaluations
Intrinsic reliability
  (Intrinsic) Mean-Time-To-Failure [Saleh’90, Reviriego’09]
    + Fast, first-cut estimation of circuit-level reliability
    − Highly pessimistic: no consideration of masking effects
    − Unclear for protected memories: no consideration of cleaning effects on accesses
Benchmarked reliability
  AVF (Architectural Vulnerability Factor) [Mukherjee’03]
    + Quickly calculates SDC without protection or DUE under parity due to SBUs
    − Ignores temporal/spatial MBUs
    − Cannot account for error detection/correction schemes
  PARMA (Precise Analytical Reliability Model for Architectures) [Suh’11]
    + Addresses temporal MBUs caused by multiple SBUs
    + Evaluates FIT (Failures-In-Time) of protected caches
    − Cannot account for spatial MBUs
MACAU models spatial MBUs and their temporal effects to evaluate soft-error vulnerabilities on cache data bit-cells

Vulnerability Clock Cycles (VCCs)
  Common assumption in model-based studies
  We measure a bit’s exposure time with VCCs
  VCCs are equivalent to ACE (Architecturally Correct Execution) cycles in AVF methods
  Managing VCCs is similar to (reliability-)lifetime analysis in AVF

Two Basic Models in Soft Error Benchmarking
1. Fault generation model
   Poisson Single Event Upset model
   Probability distribution of having k faulty bit(s) in a domain (set of bits) during vulnerability clock cycles
2. Fault propagation model
   Observing consumption for tracking failures due to SEUs
   Accumulating the expected number of (total system) failures whenever consumption happens
Benchmarking measures: Generated faults -> Errors (propagated faults) -> Expected number of failures -> Failure rate

Temporal & Spatial Effects
Temporal effects
  Required for evaluating/quantifying reliability in the presence of protection codes
  Example: if SBU is dominant, temporal effects should be addressed for evaluating SECDED-protected caches
    Evaluating DUE FIT rates requires quantifying failures with 2 flipped bits
    Evaluating SDC FIT rates requires quantifying failures with >2 flipped bits
Spatial effects
  Growing concerns with future technologies
  All SEUs are expected to be spatial MBUs in the near future [ITRS’07]
  Radiation-hardened/interleaving design may not always be possible

Spatial MBUs and Layout
  Circuit layout determines the population of spatial MBUs
  Deep-N-well process is commonly used by TSMC, Infineon, etc.
  Parasitic bipolar transistors contribute to spatial MBUs
  With the deep-N-well process, only parasitic NPN transistors are turned on
  At most two bit flips are observed in the same direction of wells [Mahatme’11]

Markov Chain and Overlap of SEUs
Consider how many bits are overlapped
  k bits flipped by the 1st SEU: current Markov state = k
  q-BU arrives with the 2nd SEU: next Markov state = k + d
  Overlap (o): k' = k + d = k + q - 2o
  d = q - 2o
  d is even if and only if q is even
  d is odd if and only if q is odd
  o = 1, 2, …, min(k, q)
[Table columns: k, q, o, d, k']

Set-up for Markov Chain
Overlapping probabilities
  o overlapped bits when a q-BU hits a word that already has k flipped bits
  1. If 0 < o = q
  2. If 0 < o < q
  3. Else o = 0

MACAU: Computing S = T^Nc
  Major computation bottleneck in MACAU
  Square-and-multiply method for efficient matrix exponentiation
  Matrix computation is also a data-parallel computation
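Square-and-multiply raises T to the power N_c with a number of matrix products proportional to log2(N_c) rather than N_c. The sketch below is a plain-Python illustration with hypothetical helper names; a practical implementation would more likely rely on an optimized linear-algebra library.

```python
from fractions import Fraction

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(T, exponent):
    """T**exponent by square-and-multiply (binary exponentiation)."""
    n = len(T)
    result = [[1 if i == j else 0 for j in range(n)] for i in range(n)]  # identity
    base = [row[:] for row in T]
    while exponent > 0:
        if exponent & 1:                  # current bit set: multiply into the result
            result = mat_mul(result, base)
        base = mat_mul(base, base)        # square for the next bit
        exponent >>= 1
    return result

T = [[Fraction(9, 10), Fraction(1, 10)], [Fraction(0), Fraction(1)]]
S = mat_pow(T, 10)
print(S[0][1])  # probability of having left state 0 within 10 cycles
```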

AVF vs PARMA vs MACAU
Comparison (VCC = N_c)

                                 AVF    PARMA    MACAU
  SBU                            O      O        O
  Spatial MBU                    X      X        O
  Temporal MBU                   X      O        O
  Protection code                X      O        O
  Variable domain size (n-bit)   X      O        X
  Computation complexity         O(1)   O(n^3)   O(n^3 × log(N_c))

  MACAU is capable of addressing all the soft-error-related situations
  Both PARMA and MACAU are much slower than AVF
  Is the computation overhead overkill for practical use? No
    Reliability-aware sampling for accelerating reliability simulations