Cache Scrubbing in Microprocessors: Myth or Necessity? Practical Experience Report

Presentation transcript:

Slide 1: Cache Scrubbing in Microprocessors: Myth or Necessity?
Practical Experience Report
Shubu Mukherjee, Joel Emer, Tryggve Fossum, and Steven K. Reinhardt*
Fault Aware Computing Technology (FACT) Group, Massachusetts Microprocessor Design Center, Intel Corporation
10th IEEE Pacific Rim International Symposium on Dependable Computing, French Polynesia, March 3-5, 2004
* Also with the University of Michigan, Ann Arbor

Slide 2: Summary
- SECDED ECC (single error correction, double error detection)
  - commonly used in on-chip caches
  - interleaving converts spatial multi-bit errors into multiple single-bit errors
- Scrubbing
  - periodically reads cache blocks and corrects all single-bit errors
  - this prevents single-bit errors from accumulating, thereby avoiding temporal double-bit errors
- Our conclusion, given a detected-error target of 10 years MTTF
  - scrubbing is necessary only for very large caches (e.g., hundreds of megabytes to gigabytes)

Slide 3: Origin of Cosmic Rays
- Cosmic rays come from deep space
[Figure: protons (p) and neutrons (n) arriving at the Earth's surface]

Slide 4: Impact of Neutron Strike on a Si Device
- Strikes release electron-hole pairs that can be absorbed by the source and drain, altering the state of the device
- Secondary source of upsets: alpha particles from packaging
[Figure: neutron strike on a transistor device, with source and drain labeled]

Slide 5: Strike Changes State of a Single Bit
[Figure: a stored bit changing state (0, 1)]
- Example solution
  - error correction codes (ECC) for single-bit correction
  - overhead = 7 bits per 64 bits of data

Slide 6: Strike Changes State of Two Adjacent Bits
Spatial double-bit error
- Example solution
  - SECDED ECC (single error correction, double error detection): 8 bits of code per 64 bits of data
  - interleaving for the more general case

Slide 7: Interleaving
- Interleaving converts a spatial multi-bit error into multiple single-bit errors
[Figure: a row of interleaved bits with the struck bits marked X; the legend distinguishes bits covered by a single ECC code from bits covered by a different ECC code]
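The effect of interleaving can be sketched in a few lines. This is only an illustration, not the layout of any particular cache: it assumes a hypothetical round-robin mapping in which physical bit `b` belongs to codeword `b % ways`, and both function names are invented for this sketch.

```python
def codeword_of(physical_bit: int, ways: int) -> int:
    """Assumed round-robin interleaving: neighbouring bits go to different codewords."""
    return physical_bit % ways


def errors_per_codeword(struck_bits, ways):
    """Count how many flipped bits each ECC codeword has to deal with."""
    counts = {}
    for bit in struck_bits:
        cw = codeword_of(bit, ways)
        counts[cw] = counts.get(cw, 0) + 1
    return counts


# One strike flips three physically adjacent bits.
burst = [10, 11, 12]
print(errors_per_codeword(burst, ways=1))  # {0: 3}
print(errors_per_codeword(burst, ways=4))  # {2: 1, 3: 1, 0: 1}
```

Without interleaving (`ways=1`) the burst puts three errors in one codeword, which SECDED cannot correct. With four-way interleaving each codeword sees at most one flipped bit, so every error is individually correctable.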

Slide 8: Two Separate Strikes on Different Bits
Temporal double-bit errors
- SECDED ECC (single error correction, double error detection)
  - can detect the error but cannot correct it
  - if errors accumulate, a correctable single-bit error becomes a double-bit error that is only detectable
[Figure: first strike at cycle 100, second strike at cycle 1,000,000]

Slide 9: Solutions for Temporal Double Bit Errors
- Natural effects
  - whenever a processor reads a cache block, the single-bit error can be corrected
  - errors are checked when cache blocks are replaced from the cache
- More powerful ECC
  - SECDED ECC requires 8 bits per 64 bits
    - 7 bits for single-bit correction
    - 8th bit for double-bit detection
    - overhead = 13%
  - ECC with two-bit correction requires 12 bits per 64 bits
    - overhead = 19%
- Scrubbing
  - periodically read memory and correct all single-bit errors
  - prevents single-bit errors from accumulating into temporal double-bit errors
  - standard technique in main memories (DRAMs)
  - our calculations (later) assume the worst case for soft errors: cache blocks do not get scrubbed naturally
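The overhead percentages on this slide are simply check bits divided by data bits, rounded to whole percent. A minimal check, with a helper name chosen only for this sketch:

```python
def ecc_overhead(check_bits: int, data_bits: int = 64) -> float:
    """Storage overhead of an ECC code as a fraction of the protected data."""
    return check_bits / data_bits


print(f"{ecc_overhead(8):.1%}")   # 12.5%  (SECDED, quoted as 13%)
print(f"{ecc_overhead(12):.2%}")  # 18.75% (two-bit correction, quoted as 19%)
```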

Slide 10: Memory Hierarchy of a Processor
- Do we need to scrub on-chip caches?
  - depends on the size of these caches
[Figure: CPU, L1 cache (kilobytes), L2 cache (megabytes), main memory (gigabytes)]

Slide 11: Detected Unrecoverable Error (DUE)
- Interval-based
  - MTTF = Mean Time to Failure
  - e.g., goal = 10 years MTTF for application crash (Bossen, IRPS 2002)
- Rate-based
  - FIT = Failure in Time; 1 FIT = 1 failure in a billion hours
  - 10 year MTTF = $10^9 / (24 \times 365 \times 10)$ FIT = 11,415 FIT
- Hypothetical example: Cache: 62 FIT + IQ: 100 FIT + FU: 58 FIT; total of 210 FIT
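The conversion between the interval-based and rate-based views is a single division. The sketch below uses the slide's convention of 24 * 365 hours per year; the function names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365      # convention used on the slide
HOURS_PER_FIT = 1e9            # 1 FIT = 1 failure per 10^9 hours


def mttf_years_to_fit(years: float) -> float:
    """Failure rate in FIT that corresponds to a given MTTF in years."""
    return HOURS_PER_FIT / (years * HOURS_PER_YEAR)


def fit_to_mttf_years(fit: float) -> float:
    """MTTF in years that corresponds to a given failure rate in FIT."""
    return HOURS_PER_FIT / (fit * HOURS_PER_YEAR)


print(int(mttf_years_to_fit(10)))  # 11415
```

Because FIT rates of independent components add, a chip-level DUE budget can be compared directly against the sum of its per-structure FIT contributions.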

Slide 12: MTTF Calculations: Probabilities
- 1 quadword = 64 bits + 8 bits = 72 bits of data + SECDED ECC
- $Q$ = number of quadwords in cache memory
- $P_d[n]$ = probability that a sequence of $n$ strikes causes $n-1$ single-bit errors, followed by a double-bit error on the $n$th strike
- $P_d[1] = 0$
- $P_d[2] = 1/Q$
  - first strike: probability = $Q/Q$
  - second strike: probability = $1/Q$
  - $P_d[2] = (Q/Q) \cdot (1/Q) = 1/Q$

Slide 13: MTTF Calculations: Probabilities
- 1 quadword = 64 bits + 8 bits = 72 bits of data + SECDED ECC
- $Q$ = number of quadwords in cache memory
- $P_d[n]$ = probability that a sequence of $n$ strikes causes $n-1$ single-bit errors, followed by a double-bit error on the $n$th strike
- $P_d[3] = \frac{Q-1}{Q} \cdot \frac{2}{Q}$
  - first strike: probability = $Q/Q$
  - second strike: probability = $(Q-1)/Q$
  - third strike: probability = $2/Q$
  - $P_d[3] = \frac{Q}{Q} \cdot \frac{Q-1}{Q} \cdot \frac{2}{Q}$

Slide 14: MTTF Calculations: Probabilities
- 1 quadword = 64 bits + 8 bits = 72 bits of data + SECDED ECC
- $Q$ = number of quadwords in cache memory
- $P_d[n]$ = probability that a sequence of $n$ strikes causes $n-1$ single-bit errors, followed by a double-bit error on the $n$th strike
- $P_d[1] = 0$
- $P_d[2] = 1/Q$
- $P_d[3] = \frac{Q-1}{Q} \cdot \frac{2}{Q}$
- $P_d[4] = \frac{Q-1}{Q} \cdot \frac{Q-2}{Q} \cdot \frac{3}{Q}$
- ...
- $P_d[n] = \frac{Q-1}{Q} \cdot \frac{Q-2}{Q} \cdot \frac{Q-3}{Q} \cdots \frac{Q-n+2}{Q} \cdot \frac{n-1}{Q}$
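The general term can be checked numerically. The sketch below follows the slides' model as I read it: every strike lands in a uniformly random quadword, and a second strike in an already-struck quadword counts as a double-bit error. The function name `p_d` is chosen for this example.

```python
def p_d(n: int, q: int) -> float:
    """P_d[n]: the first n-1 strikes hit distinct quadwords, the n-th hits one of them."""
    if n < 2:
        return 0.0
    prob = 1.0
    # Strikes 2 .. n-1 must each land in a quadword that has not been hit yet.
    for i in range(1, n - 1):
        prob *= (q - i) / q
    # The n-th strike must land in one of the n-1 quadwords already holding an error.
    return prob * (n - 1) / q


# Tiny cache with Q = 4 quadwords: a double-bit error is certain by the 5th strike.
print([p_d(n, 4) for n in range(1, 6)])  # [0.0, 0.25, 0.375, 0.28125, 0.09375]
```

The probabilities sum to 1 over n = 2 .. Q+1, which is a quick sanity check that the product form is a proper distribution.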

Slide 15: MTTF Calculations: Equation
- $M$ = mean number of single-bit errors to get a double-bit error = expected value of the random variable with $P_d[n]$ as its probability distribution function
- $M$ can be easily generated using a computer program
- MTTF(double-bit error) = $M \times$ MTTF(single-bit error)
- For a 32 megabyte cache and FIT/bit = 0.001 [Normand 1996, Tosaka 1996]:
  MTTF(double-bit error) = $M \times$ MTTF(single-bit error)
  $= 2567 \times (1 / \text{Cache FIT})$
  $= 2567 \times \left(10^9 / (0.001 \times 2^{22} \times 72 \times 24 \times 365)\right)$
  $= 970$ years
- Saleh et al.'s 1990 closed-form equation
  - MTTF(double-bit error) = $\frac{1}{72 f} \sqrt{\frac{\pi}{2Q}}$ = 970 years, where $f$ = FIT/bit
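The slide says that M "can be easily generated using a computer program". One way such a program might look is sketched below; it is not the authors' code. It assumes the uniform-strike model of the preceding slides, takes FIT/bit = 0.001 from the slide's arithmetic, and converts FIT/bit to failures per bit-hour before applying the closed form. All function names are illustrative.

```python
import math

HOURS_PER_YEAR = 24 * 365
BITS_PER_QUADWORD = 72          # 64 data bits + 8 SECDED bits


def mean_strikes_to_double_bit_error(q: int, tail: float = 1e-15) -> float:
    """M = sum over n of n * P_d[n], accumulated with a running product."""
    mean = 0.0
    survive = 1.0               # P(first n-1 strikes hit distinct quadwords)
    n = 1
    while survive > tail and n <= q + 1:
        mean += n * survive * (n - 1) / q      # n * P_d[n]
        survive *= (q - (n - 1)) / q           # extend to n distinct strikes
        n += 1
    return mean


def single_bit_mttf_years(q: int, fit_per_bit: float) -> float:
    """Mean time between strikes anywhere in the cache, in years."""
    cache_fit = fit_per_bit * q * BITS_PER_QUADWORD
    return 1e9 / (cache_fit * HOURS_PER_YEAR)


def saleh_mttf_years(q: int, fit_per_bit: float) -> float:
    """Closed form quoted on the slide, with f expressed in failures per bit-hour."""
    f = fit_per_bit / 1e9
    hours = (1.0 / (BITS_PER_QUADWORD * f)) * math.sqrt(math.pi / (2 * q))
    return hours / HOURS_PER_YEAR


q = 2**22                       # 32 MB cache / 8 data bytes per quadword
m = mean_strikes_to_double_bit_error(q)
print(m, m * single_bit_mttf_years(q, 0.001), saleh_mttf_years(q, 0.001))
```

Under these assumptions the three printed values should land close to the slide's M = 2567 and MTTF of roughly 970 years. The loop stops once the probability of having seen no double-bit error drops below `tail`, so it needs only a few tens of thousands of iterations rather than Q of them.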

Slide 16: Temporal Double Bit MTTF Variations with Cache Size
- FIT/bit = 0.001 to 0.01 (Normand 1996, Tosaka 1996)
  - higher at higher altitudes (e.g., 3-5x at 1.5 km in Denver)
- Temporal double-bit errors make a very small contribution to the DUE rate
  - compared to a goal of 10 years DUE MTTF

Slide 17: MTTF with Scrubbing
- $I$ = scrubbing interval; scrub at the end of each interval
- $N$ = number of scrubbing intervals to reach MTTF = expected value of the random variable with probability distribution function $(1 - p_f)^N \cdot p_f$, where $p_f$ = probability of a temporal double-bit error at the end of an interval
- Assuming a 16 GB cache, FIT/bit = 0.001 (Normand 1996, Tosaka 1996), and scrubbing once a year ($I$ = 1 year):
  MTTF(double-bit error) = $N \times I = 2281 \times 1 = 2281$ years
- Saleh et al. closed-form equation
  - $2 / [Q \cdot I \cdot (72 f)^2]$ = 2341 years, where $f$ = FIT/bit
[Figure: timeline divided into consecutive scrubbing intervals I]
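The closed form for the scrubbed case is easy to evaluate directly. The sketch below is a hedged reading of it: it assumes 16 GB means 16 * 2^30 bytes with 8 data bytes per quadword, takes FIT/bit = 0.001 (the value that reproduces the slide's 2341 years), and converts FIT/bit to failures per bit-hour. The function name is invented for this example.

```python
HOURS_PER_YEAR = 24 * 365
BITS_PER_QUADWORD = 72          # 64 data bits + 8 SECDED bits


def scrubbed_mttf_years(cache_bytes: int, fit_per_bit: float,
                        interval_years: float) -> float:
    """Closed form 2 / (Q * I * (72 f)^2), returned in years."""
    q = cache_bytes // 8                        # quadwords in the cache
    f = fit_per_bit / 1e9                       # failures per bit-hour
    interval_hours = interval_years * HOURS_PER_YEAR
    mttf_hours = 2.0 / (q * interval_hours * (BITS_PER_QUADWORD * f) ** 2)
    return mttf_hours / HOURS_PER_YEAR


SIXTEEN_GB = 16 * 2**30
print(scrubbed_mttf_years(SIXTEEN_GB, 0.001, interval_years=1.0))
```

With these inputs the result comes out at about 2341 years, matching the closed-form figure on the slide. The expression is inversely proportional to the interval, so scrubbing twice as often doubles the temporal double-bit MTTF.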

Slide 18: Impact of Scrubbing on Temporal Double Bit MTTF
- FIT/bit = 0.001 to 0.01 (Normand 1996, Tosaka 1996)
  - higher at higher altitudes (e.g., 3-5x at 1.5 km in Denver)
- For 16 gigabytes of cache, scrubbing can help
  - compared to a DUE MTTF goal of 10 years
[Chart: 16 gigabyte cache]

Slide 19: Summary
- SECDED ECC (single error correction, double error detection)
  - commonly used in on-chip caches
  - interleaving converts spatial multi-bit errors into multiple single-bit errors
- Scrubbing
  - periodically reads cache blocks and corrects all single-bit errors
  - this prevents single-bit errors from accumulating, thereby avoiding temporal double-bit errors
- Our conclusion, given a detected-error target of 10 years MTTF
  - scrubbing is necessary only for very large caches (e.g., hundreds of megabytes to gigabytes)

Slide 20: Backups

Slide 21: Raw Soft Error Rate: 0.001 to 0.01 FIT/bit
- Y. Tosaka, S. Satoh, K. Suzuki, T. Suguii, H. Ehara, G. A. Woffinden, and S. A. Wender, "Impact of Cosmic Ray Neutron Induced Soft Errors on Advanced Submicron CMOS Circuits," VLSI Symposium on VLSI Technology Digest of Technical Papers, 1996.
- Normand, "Single Event Upset at Ground Level," IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996.