Radiation-Induced Soft Errors: An Architectural Perspective. Shubu Mukherjee (FACT Group), Joel Emer, and Steven K. Reinhardt

Presentation transcript:

Slide 1: Radiation-Induced Soft Errors: An Architectural Perspective
Shubu Mukherjee (1), Joel Emer (2), and Steven K. Reinhardt (1,3)
(1) Fault Aware Computing Technology (FACT) Group, Intel; (2) VSSAD, Intel; (3) University of Michigan, Ann Arbor
11th International Symposium on High-Performance Computer Architecture (HPCA), 2005
“If a problem has no solution, it may not be a problem, but a FACT, not to be solved, but to be coped with over time,” Shimon Peres, Nobel Laureate

Slide 2: Evidence of Cosmic Ray Strikes
- Documented strikes in large servers found in error logs
  - Normand, “Single Event Upset at Ground Level,” IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996
- Sun Microsystems, 2000 (R. Baumann, workshop talk)
  - Cosmic ray strikes on L2 cache with defective error protection caused Sun’s flagship servers to suddenly and mysteriously crash!
  - Companies affected: Baby Bell (Atlanta), America Online, eBay, and dozens of other corporations
  - Verisign moved to IBM Unix servers (for the most part)

Slide 3: Reactions from Companies
- Typical server system data corruption target: around 1000 years MTBF
  - Very hard to achieve this goal in a cost-effective way
  - Bossen, 2002 IRPS Workshop Talk
- Fujitsu SPARC in 130 nm technology (2003)
  - 80% of 200k latches protected with parity
  - Compare with very few latches protected in McKinley
  - ISSCC, 2003

Slide 4: Evolution of a Product Team’s Psyche
- Shock: “SER is the crabgrass in the lawn of computer design”
- Denial: “We will do the SER work two months before tapeout”
- Anger: “Our reliability target is too ambitious”
- Acceptance: “You can deny physics only for so long”

Slide 5: Outline
- Faults from Cosmic Rays
- Terminology
- Computing a Chip’s Soft Error Rate
- The Soft Error Opportunity
- Summary

Slide 6: Strike Changes State of a Single Bit
[Figure: a single stored bit changing state (0, 1)]

Slide 7: Impact of Neutron Strike on a Si Device
- Strikes release electron & hole pairs that can be absorbed by source & drain to alter the state of the device
- Secondary source of upsets: alpha particles from packaging
[Figure: neutron strike on a transistor device, with source and drain labeled]

Slide 8: Cosmic Rays Come From Deep Space
- Neutron flux is higher at higher altitudes
[Figure: cascade of protons (p) and neutrons (n) descending to Earth’s surface]

Slide 9: Impact of Elevation
- 3x - 5x increase in Denver at 5,000 feet
- 100x increase in airplanes at 30,000+ feet
[Figure: Figure 8, Ziegler, et al., “IBM experiments in soft fails in computer electronics ( ),” IBM J. of R. & D., Vol. 40, No. 1, Jan]

Slide 10: Physical Solutions Are Hard
- Shielding?
  - No practical absorbent (e.g., approximately > 10 ft of concrete)
  - Unlike alpha particles
- Technology solution: SOI?
  - Partially-depleted SOI of some help, effect on logic unclear
  - Fully-depleted SOI may help, hard to manufacture in high volumes
- Radiation-hardened cells?
  - 10x improvement possible with significant penalty in performance, area, cost
  - 2-4x improvement may be possible with less penalty
- We think some of these techniques will help alleviate the impact of soft errors, but not completely remove it

Slide 11: Outline
- Faults from Cosmic Rays
- Terminology
- Computing a Chip’s Soft Error Rate
- The Soft Error Opportunity
- Summary

Slide 12: Strike Changes State of a Single Bit
[Figure: a single stored bit changing state (0, 1)]

Slide 13: Strike on state bit (e.g., in register file)
[Flowchart]
- Bit read?
  - no: benign fault, no error
  - yes: Bit has error protection?
    - yes, error can be corrected (e.g., ECC): no error
    - yes, error is only detected (e.g., parity + no recovery): detected, but unrecoverable error (DUE)
    - no: Does bit matter?
      - yes: Silent Data Corruption (SDC)
      - no: benign fault, no error
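The decision flow on this slide can be sketched as a small function. This is only an illustration of the flowchart as read from the transcript; the names `Outcome` and `classify_strike` and the string labels for the protection level are hypothetical, not from the talk.

```python
from enum import Enum


class Outcome(Enum):
    BENIGN = "benign fault, no error"
    CORRECTED = "no error (corrected)"
    DUE = "detected, but unrecoverable error"
    SDC = "silent data corruption"


def classify_strike(bit_read: bool, protection: str, bit_matters: bool) -> Outcome:
    """Classify one strike on a state bit.

    protection is a hypothetical label: "none", "detect" (e.g., parity with
    no recovery) or "correct" (e.g., ECC).
    """
    if protection not in ("none", "detect", "correct"):
        raise ValueError(f"unknown protection level: {protection}")
    if not bit_read:
        return Outcome.BENIGN          # the flipped value is never consumed
    if protection == "correct":
        return Outcome.CORRECTED       # fault is masked by correction
    if protection == "detect":
        return Outcome.DUE             # flagged, but cannot be repaired
    # Unprotected bit: the result depends on whether the bit matters.
    return Outcome.SDC if bit_matters else Outcome.BENIGN


print(classify_strike(True, "none", True).value)
```

In this reading, only an unprotected bit that is both read and relevant to the output produces SDC; adding detection converts that case into DUE rather than removing it.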

Slide 14: Definitions 1
- SDC = Silent Data Corruption
- DUE = Detected & Unrecoverable Error
- SER = Soft Error Rate = total of SDC & DUE

Slide 15: Definitions 2
- Interval-based
  - MTTF = Mean Time to Failure
  - MTTR = Mean Time to Repair
  - MTBF = Mean Time Between Failures = MTTF + MTTR
  - Availability = MTTF / MTBF
- Rate-based
  - FIT = Failure in Time = 1 failure in a billion hours
  - 1 year MTTF = 10^9 / (24 * 365) FIT = 114,155 FIT
  - SER FIT = SDC FIT + DUE FIT
- Hypothetical example: Cache: 0 FIT + IQ: 100K FIT + FU: 58K FIT = total of 158K FIT
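The conversions on this slide are easy to check numerically. The sketch below uses the slide's own convention of 24 * 365 hours per year; the function names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365      # the slide's convention (no leap years)
BILLION_HOURS = 1e9            # 1 FIT = 1 failure per 10^9 hours


def mttf_years_to_fit(mttf_years: float) -> float:
    """Convert a mean time to failure in years into a FIT rate."""
    return BILLION_HOURS / (mttf_years * HOURS_PER_YEAR)


def fit_to_mttf_years(fit: float) -> float:
    """Convert a FIT rate back into a mean time to failure in years."""
    return BILLION_HOURS / (fit * HOURS_PER_YEAR)


def availability(mttf: float, mttr: float) -> float:
    """Availability = MTTF / MTBF, with MTBF = MTTF + MTTR (same units)."""
    return mttf / (mttf + mttr)


# FIT rates add, so the hypothetical example on the slide is a plain sum.
example_fit = {"cache": 0, "IQ": 100_000, "FU": 58_000}
total_fit = sum(example_fit.values())

print(round(mttf_years_to_fit(1)))       # FIT for a 1-year MTTF
print(total_fit, fit_to_mttf_years(total_fit))
```

Because rates add while intervals do not, budgets are usually tracked in FIT and converted to MTTF only at the end.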

Slide 16: Typical Server System Reliability Goals (D. C. Bossen, 2002 IRPS Tutorial Reliability Notes)
Error Type | System MTBF Goal
SDC (Silent Data Corruption) | 1000 years (114 FIT)
DUE for system crash | 25 years
DUE for application crash | 10 years

Slide 17: Outline
- Faults from Cosmic Rays
- Terminology
- Computing a Chip’s Soft Error Rate
- The Soft Error Opportunity
- Summary

Slide 18: Measuring a Chip’s FIT
- Like performance measurement
- Chip
  - Physically bombard with neutrons in neutron accelerators
  - Expose to alpha particles in radioactive foils
- Chip
  - Study error logs of running machines
- Circuit models + RTL
  - Obtain raw error rate
  - Statistical fault injection
- Circuit models + performance model
  - Obtain raw error rate
  - Work in progress in FACT group

Slide 19: Computing FIT Rate of a Chip
- FIT Rate Law: FIT rate of a system is the sum of the FIT rates of its individual components
- Vulnerable Bit Law: FIT rate of a chip is the sum of the FIT rate of vulnerable bits in that chip!
- $\text{Total Soft Error FIT} = \sum_{i} (\text{intrinsic error rate}_i \times \text{vulnerability factor}_i)$, summed over each vulnerable device $i$
  - Vulnerability factor = fraction of faults that become errors
  - Vulnerability factor is also known as “derating factor” and “soft error sensitivity (SES)”
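A minimal sketch of the sum above, assuming each structure is described by a bit count, a raw FIT per bit, and a single vulnerability factor. The `Structure` type and all numbers in the example are hypothetical and chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Structure:
    name: str
    bits: int                    # number of vulnerable (unprotected) bits
    raw_fit_per_bit: float       # intrinsic error rate per bit
    vulnerability_factor: float  # fraction of faults that become errors


def chip_fit(structures: list[Structure]) -> float:
    """Sum of (intrinsic error rate * vulnerability factor) over all bits."""
    return sum(
        s.bits * s.raw_fit_per_bit * s.vulnerability_factor for s in structures
    )


# Hypothetical structures; sizes and factors are made up for illustration.
example = [
    Structure("queue", bits=10_000, raw_fit_per_bit=0.001, vulnerability_factor=0.3),
    Structure("predictor", bits=50_000, raw_fit_per_bit=0.001, vulnerability_factor=0.0),
]
print(chip_fit(example))
```

A structure with a vulnerability factor of zero contributes nothing to the total, however many bits it holds.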

Slide 20: FIT Equation: Raw Soft Error Rate
$\text{FIT} = \sum_{i} (\text{intrinsic error rate}_i \times \text{vulnerability factor}_i)$, summed over each vulnerable device $i$
- SRAM cells
  - FIT/bit decreasing slightly across generations with the usual voltage scaling
  - FIT/chip increasing overall
- Latch cells
  - FIT/bit constant across generations with the usual voltage scaling
- Static logic gates
  - See later
- Dynamic logic
  - Keeper similar to latches, but extra reduction due to specific function implemented

Slide 21: FIT Equation: Vulnerability Factors
$\text{FIT} = \sum_{i} (\text{intrinsic error rate}_i \times \text{vulnerability factor}_i)$, summed over each vulnerable device $i$
- Vulnerability Factor = Timing Vulnerability Factor * Architectural Vulnerability Factor
- Timing Vulnerability Factor: fraction of time bit is vulnerable
- Architectural Vulnerability Factor (AVF): fraction of time bit matters for final output of a program

Slide 22: Timing Vulnerability Factor
- SRAM cells
  - 100%
- Latch cells
  - ~ 50%
  - Depends on min. delay of signal propagation through logic chain (ref: Norbert Seifert, Intel)
- Static logic gates
  - Shivakumar, et al. (DSN 2002) predict near zero today
  - Signal attenuation, latch window, & logical masking
  - May be a problem in future
- Dynamic logic
  - Same as latches
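Combining the timing factors listed on this slide with an AVF gives the overall vulnerability factor. The sketch below is a lookup under stated assumptions: "near zero" for static logic is approximated as 0.0, and the dictionary keys are hypothetical labels.

```python
# Timing vulnerability factors as listed on the slide.
# Assumption: "near zero" for static logic is rounded to 0.0 here.
TIMING_VF = {
    "sram": 1.0,
    "latch": 0.5,
    "dynamic_logic": 0.5,   # "same as latches"
    "static_logic": 0.0,
}


def vulnerability_factor(cell_type: str, avf: float) -> float:
    """Vulnerability factor = timing VF * architectural VF."""
    if not 0.0 <= avf <= 1.0:
        raise ValueError("AVF must be a fraction between 0 and 1")
    return TIMING_VF[cell_type] * avf


print(vulnerability_factor("latch", 0.2))
```

The two factors are independent deratings, so a latch with a modest AVF ends up with a small overall factor.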

Slide 23: Architectural Vulnerability Factor: Does a Bit Matter?
- Branch predictor
  - Doesn’t matter at all (AVF = 0%)
- Program counter
  - Almost always matters (AVF ~ 100%)
- Computing AVF for complex structures
  - Statistical fault injection
  - ACE analysis (next)
  - Other methods being researched

Slide 24: Architecturally Correct Execution (ACE)
- ACE path requires only a subset of values to flow correctly through the program’s data flow graph (and the machine)
- Anything else (un-ACE path) can be derated away
[Figure: data flow graph from program input to program outputs]

Slide 25: Example of un-ACE Instruction: Dynamically Dead Instruction
- Most bits of an un-ACE instruction do not affect program output
[Figure: dynamically dead instruction]
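To make the idea of a dynamically dead instruction concrete, here is a minimal sketch that marks dead instructions in a register-only trace using a backward liveness pass. It is not the analysis used in the talk; it assumes no memory operations, branches, or side effects, and the trace format (destination register plus source registers) is hypothetical.

```python
def dynamically_dead(trace, live_out):
    """Return a list of booleans, True where the instruction is dead.

    trace:    list of (dest_register, source_registers) in program order
    live_out: registers whose final values count as program output
    """
    live = set(live_out)
    dead = [False] * len(trace)
    # Walk backward: a write is useful only if a later reader needs it.
    for i in range(len(trace) - 1, -1, -1):
        dest, srcs = trace[i]
        if dest not in live:
            dead[i] = True       # result never reaches an output
            continue             # its sources are not made live
        live.discard(dest)
        live.update(srcs)
    return dead


# Hypothetical four-instruction trace.
trace = [
    ("r1", ("r2", "r3")),   # overwritten before being read
    ("r1", ("r5",)),
    ("r6", ("r1", "r4")),
    ("r7", ("r6",)),        # r7 is never an output
]
print(dynamically_dead(trace, live_out={"r6"}))
```

Because a dead instruction does not make its sources live, the same pass also catches instructions whose results feed only other dead instructions.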

Slide 26: Dynamic Instruction Breakdown
[Figure: dynamic instruction breakdown, average across SPEC2K slices]

Slide 27: Mapping ACE & un-ACE Instructions to the Instruction Queue
[Figure: instruction queue entries grouped under “Architectural un-ACE” and “Micro-architectural un-ACE”, with entry labels: wrong-path inst, idle, NOP, prefetch, ACE inst, ex-ACE inst]

Slide 28: Instruction Queue
- ACE percentage = AVF = 29%
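One way to read "ACE percentage = AVF" is as the average fraction of the structure occupied by ACE state. The sketch below assumes entry-level granularity (an entry is either fully ACE or not), which is a simplification; the Little's Law form is included because the backup slides mention it. Function names and inputs are hypothetical.

```python
def avf_from_occupancy(ace_entries_per_cycle, total_entries):
    """Average fraction of entries holding ACE state over the sampled cycles."""
    cycles = len(ace_entries_per_cycle)
    if cycles == 0:
        raise ValueError("need at least one cycle")
    return sum(ace_entries_per_cycle) / (total_entries * cycles)


def avf_littles_law(ace_arrivals_per_cycle, mean_ace_residency_cycles, total_entries):
    """Little's Law: average ACE occupancy = arrival rate * residency time."""
    return ace_arrivals_per_cycle * mean_ace_residency_cycles / total_entries


# Hypothetical 64-entry queue observed for four cycles.
print(avf_from_occupancy([10, 20, 30, 20], total_entries=64))
print(avf_littles_law(2.0, 10.0, total_entries=64))
```

Both forms measure the same quantity; the second only needs average throughput and residency rather than a per-cycle trace.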

Slide 29: Punchline: Simple Conceptual Model
- FIT rate = sum of FIT rate of “vulnerable” bits
- Vulnerable bits (RAM & latch cells)
  - For SDC, this means unprotected bits
- Rule of thumb: vulnerability factor
  - Architectural vulnerability factor ~= 20%
  - Timing vulnerability factor = 50% for latches & 13% dynamic
- Rule of thumb: raw FIT rate
  - 0.001 – FIT/bit (Normand 1996, Tosaka 1996)

Slide 30: # Vulnerable Bits Growing with Moore’s Law
- Fujitsu SPARC has 20% of 200k latches vulnerable in 2003
  - Aggressive designs have significantly higher number of vulnerable latches
- Additional SDC FIT from RAM cells, static logic, & dynamic logic
- Higher SDC FIT in multiprocessor systems
  - Gap ~= 100x for 8-processor system!
  - A data center with 300 such systems will encounter a data corruption almost every week
[Figure annotation: 12x GAP]
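The data-center claim can be approximated with a back-of-the-envelope calculation. The sketch below assumes the "gap" is measured against the 1000-year SDC goal quoted earlier in the talk and that failures across systems are independent; both are assumptions, and the function name is hypothetical.

```python
DAYS_PER_YEAR = 365


def fleet_days_between_corruptions(goal_mtbf_years, gap, systems):
    """Expected days between data corruptions across a fleet.

    goal_mtbf_years: per-system SDC goal (assumed baseline for the gap)
    gap:             factor by which the actual rate exceeds the goal
    systems:         number of independent systems in the data center
    """
    per_system_years = goal_mtbf_years / gap   # actual per-system MTBF
    return per_system_years / systems * DAYS_PER_YEAR


days = fleet_days_between_corruptions(goal_mtbf_years=1000, gap=100, systems=300)
print(days)
```

Under these assumptions the interval comes out at roughly 12 days, the same order of magnitude as the slide's "almost every week"; since the slide gives the gap only approximately, the exact figure should not be read too closely.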

Slide 31: Outline
- Faults from Cosmic Rays
- Terminology
- Computing a Chip’s Soft Error Rate
- The Soft Error Opportunity
- Summary

Slide 32: The Soft Error Opportunity
- Key differences with classical fault tolerance
  - FIT budget 100x – 1000x more than Tandem-style machines
  - Traditional “big hammer” solutions are too expensive for the volume market & can be overkill
- Why does architecture play a critical role?
  - Error often defined in architecture & microarchitecture
    - e.g., strike on a branch predictor doesn’t cause an error
  - Architectural solutions are often more cost-effective
    - One bit of parity can protect 64 bits, overhead < 2%
    - Radiation-hardened cells can have overhead around 20-40%
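The parity figure is simple to verify: one extra bit per 64 data bits is 1/64, or about 1.6% storage overhead. The sketch below shows even parity over a 64-bit word detecting any single-bit flip; it is an illustration only, and the function names are hypothetical.

```python
WORD_BITS = 64
PARITY_OVERHEAD = 1 / WORD_BITS   # one parity bit per 64 data bits


def parity64(word: int) -> int:
    """Even-parity bit of a 64-bit word (1 if the number of set bits is odd)."""
    if not 0 <= word < (1 << WORD_BITS):
        raise ValueError("word must fit in 64 bits")
    return bin(word).count("1") & 1


def detects_flip(word: int, bit: int) -> bool:
    """True if flipping one bit changes the parity stored for the word."""
    stored = parity64(word)
    corrupted = word ^ (1 << bit)
    return parity64(corrupted) != stored


word = 0xDEADBEEFCAFEF00D
print(all(detects_flip(word, b) for b in range(WORD_BITS)))
print(f"{PARITY_OVERHEAD:.2%}")
```

Parity only detects an odd number of flipped bits and cannot correct them, which is why a parity-protected bit shows up as DUE rather than as no error.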

Slide 33: Research Directions
1. AVF characterization of processor structures
   - Architectural abstraction for soft errors
2. AVF reduction techniques & tradeoff with performance
   - Reduce exposure
   - Reduce false errors
   - Fault detection & recovery techniques
3. Protecting un-core components
   - Data flows unchanged
   - Microarchitectural state changes
4. Software solutions
   - e.g., the Princeton CRAFT approach
   - But software doesn’t have full visibility into hardware
5. AVF vs. AF (activity factor) tradeoff
   - Structures with high AF and low AVF may require a closer look
6. Other sources of soft errors, definitions carry over
   - Timing errors, Vcc reduction errors, etc.

Slide 34: Summary
- Soft errors: real problem today
  - Primary culprit: neutrons from deep space
  - Industry seeing this now
- Major problem in next few technology generations
  - Problem scales with Moore’s Law, die size, & system size
  - Industry will have a hard time making chips reliable
- SER effort across Intel
  - Number of projects aimed at modeling, measuring, detecting, and correcting soft errors

Slide 35: Backups Follow

Slide 36: Faults, Errors, Failures (from Pradhan, “Fault-Tolerant Computer System Design”)
- Fault
  - Defect in hardware or software component
  - Defect for cosmic ray = upset from high-energy neutron strike
- Error
  - Manifestation of a fault, resulting in deviation from accuracy
  - Faults cause errors (but not vice versa)
  - A masked fault is not an error!
  - Vulnerability factor = fraction of faults that cause errors (Intel term)
- Failure
  - Non-performance of expected action
  - Errors cause failures (but not vice versa)
  - A corrected error doesn’t cause a failure

Slide 37: References
- Documented strikes
  - (Sun Microsystems) R. Baumann, “Soft Errors in Commercial Semiconductor Technology,” 2002 IRPS Tutorial Notes
  - Normand, “Single Event Upset at Ground Level,” IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996
- Raw soft error rate: – FIT/bit
  - Y. Tosaka, S. Satoh, K. Suzuki, T. Suguii, H. Ehara, G. A. Woffinden, and S. A. Wender, “Impact of Cosmic Ray Neutron Induced Soft Errors on Advanced Submicron CMOS Circuits,” VLSI Symposium on VLSI Technology Digest of Technical Papers, 1996
  - Normand, “Single Event Upset at Ground Level,” IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996
- Typical server system goals
  - D. C. Bossen, “CMOS Soft Errors and Server Design,” IEEE 2002 Reliability Physics Tutorial Notes, Reliability Fundamentals, pp. 121_07.1 – 121_07.6, April 7, 2002

Slide 38: FIT/bit for SRAM Cells Decreasing
- Shivakumar, et al., “Modeling the Effect of Technology Trends on the Soft Error Rate of Combinatorial Logic,” DSN 2002
  - FIT/bit decreasing, FIT/chip increasing
- Hareland, et al., “Impact of CMOS Process Scaling and SOI on the Soft Error Rates of Logic Processes,” 2001 Symposium on VLSI Technology Digest of Technical Papers
  - FIT/bit decreasing
- R. Baumann, 2002 IRPS Tutorial Notes
  - FIT/bit decreasing because of voltage saturation
  - FIT/bit increasing in products with B10

Slide 39: FIT/bit for Latches Constant
- Shivakumar, et al., “Modeling the Effect of Technology Trends on the Soft Error Rate of Combinatorial Logic,” DSN 2002
  - Prediction using models
  - FIT/bit constant (within 2x error range)
- Karnik, et al., “Scaling Trends of Cosmic Rays Induced Soft Errors in Static Latches Beyond 0.18µ,” 2001 Symposium on VLSI Circuits Digest of Technical Papers
  - Neutron beam experiment
  - FIT/bit constant

Slide 40: Raw FIT Equation
- Raw neutron FIT rate
  - $\text{Raw neutron FIT rate} \propto \text{Neutron Flux} \times \text{Area} \times e^{-Q_{crit}/Q_s}$
- When $Q_{crit} \gg Q_s$
  - Exponential dominates
  - We are still in this region
- When $Q_{crit} \le Q_s$
  - Reached saturation
  - Area dominates, so FIT/bit will continue to decrease with area
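Since the relation above is a proportionality, it is most useful for comparing two design points. The sketch below computes a relative (unitless) rate; the function name and the example values are hypothetical and carry no physical calibration.

```python
import math


def relative_raw_fit(flux: float, area: float, qcrit: float, qs: float) -> float:
    """Relative raw neutron FIT: flux * area * exp(-Qcrit / Qs).

    Only ratios between two calls are meaningful, because the constant of
    proportionality is left out.
    """
    return flux * area * math.exp(-qcrit / qs)


base = relative_raw_fit(flux=1.0, area=1.0, qcrit=2.0, qs=1.0)

# Halving the linear dimensions quarters the area; with Qcrit/Qs held
# fixed, the per-bit rate drops by the same factor.
scaled_area = relative_raw_fit(flux=1.0, area=0.25, qcrit=2.0, qs=1.0)

# Lowering Qcrit (e.g., through voltage scaling) raises the exponential term.
lower_qcrit = relative_raw_fit(flux=1.0, area=1.0, qcrit=1.0, qs=1.0)

print(scaled_area / base, lower_qcrit / base)
```

The two ratios show the opposing trends the slide describes: shrinking area pulls FIT/bit down, while a falling Qcrit/Qs ratio pushes it up exponentially.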

Slide 41: $e^{-Q_{crit}/Q_s}$ Trends (Shivakumar et al., DSN 2002)
- $e^{-Q_{crit}/Q_s}$ increasing
- Area decreasing quadratically

Slide 42: SRAM: FIT/bit Decreasing
[Figure: source: Shivakumar, et al., DSN 2002]

Slide 43: Latch: FIT/bit Roughly Constant
[Figure: source: Shivakumar, et al., DSN 2002]

Slide 44: Timing Vulnerability Factor for Latches
- Timing vulnerability factor = latch time / clock time ~= 50%
[Figure: flow-through latch timing diagram showing data, setup time, and hold time]

Slide 45: Energy Spectrum of Cosmic Ray Particles
- Neutrons constitute > 96% of cosmic ray particles at sea level
- Higher # of lower energy particles (significant)
[Figure: Figure 4, Ziegler, et al., “Terrestrial Cosmic Rays,” IBM J. of R. & D., Vol. 40, No. 1, Jan]

Slide 46: SFI vs. ACE Analysis
Criterion | SFI | ACE
Accuracy of microarchitectural un-ACE | Better than ACE analysis | Conservative
Accuracy of architectural un-ACE | Conservative | Better than SFI (e.g., covers dynamically dead instructions)
Insight | Per-structure insights harder | Little’s Law & per-structure breakdown easier
# of experiments | Large # required to be statistically significant | Small # of experiments can give good accuracy
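The "large # required" entry for SFI can be illustrated with a standard sampling estimate. The sketch below uses the normal approximation to a binomial proportion, which is a textbook assumption rather than anything stated in the talk; function names and the example numbers are hypothetical.

```python
import math


def sfi_estimate(errors: int, injections: int, z: float = 1.96):
    """Estimated vulnerability and the half-width of its ~95% interval."""
    p = errors / injections
    half_width = z * math.sqrt(p * (1 - p) / injections)
    return p, half_width


def injections_needed(p_guess: float, half_width: float, z: float = 1.96) -> int:
    """Injections needed to reach a target half-width around p_guess."""
    return math.ceil(z * z * p_guess * (1 - p_guess) / half_width ** 2)


p, hw = sfi_estimate(errors=29, injections=100)
print(p, hw)
print(injections_needed(p_guess=0.29, half_width=0.01))
```

With only 100 injections the interval is wide (several percentage points either side), and narrowing it to one point takes thousands of injections per structure, which is the cost the table contrasts with ACE analysis.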