Soft Error Modeling and Mitigation
Mehdi B. Tahoori, Northeastern University
Copyright 2005, M. Tahoori



Outline
- Soft Error Introduction
- Soft Error Modeling for Memory Hierarchy
- Soft Error Modeling in Random Logic
  - Combinational logic
  - Sequential logic
- More Issues

Soft Error: Introduction

Evidence of Cosmic Ray Strikes
- Documented strikes in large servers, found in error logs
  - Normand, “Single Event Upset at Ground Level,” IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December
- Sun Microsystems, 2000 (R. Baumann, 2002 IRPS Workshop talk)
  - Cosmic ray strikes on L2 cache with defective error protection caused Sun’s flagship servers to crash suddenly and mysteriously
  - Companies affected: Baby Bell (Atlanta), America Online, eBay, and dozens of other corporations
  - Verisign moved to IBM Unix servers (for the most part)

Reactions from Companies
- Fujitsu SPARC in 130 nm technology
  - 80% of 200k latches protected with parity
  - Compare with very few latches protected in McKinley
  - ISSCC, 2003
- IBM declared a 1000-year system MTBF as the product goal for the Power4 line
  - Very hard to achieve this goal in a cost-effective way
  - Bossen, 2002 IRPS Workshop talk

Impact of Neutron Strike on a Si Device
- Strike creates electron-hole pairs that can be absorbed by source/diffusion areas to change the state of the device
- Figure 3, Ziegler, et al., “IBM experiments in soft fails in computer electronics ( ),” IBM J. of R. & D., Vol. 40, No. 1, Jan

Impact of Elevation
- 3x - 5x increase in Denver at 5,000 feet
- 100x increase in airplanes at 30,000+ feet
- Figure 8, Ziegler, et al., “IBM experiments in soft fails in computer electronics ( ),” IBM J. of R. & D., Vol. 40, No. 1, Jan

Physical Solutions Are Hard
- Shielding?
  - No practical absorbent (e.g., approximately > 10 ft of concrete), unlike alpha particles
- Technology solution: SOI?
  - Partially-depleted SOI is probably no help in 250 nm and beyond
  - Fully-depleted SOI can help, but is very hard to manufacture in high volumes
- Radiation-hardened cells?
  - 10x improvement possible with significant penalty in performance, area, cost
  - 2-4x improvement may be possible with less penalty
- Some of these techniques will help alleviate the impact of soft errors, but will not completely remove it

Flowchart (figure not reproduced): outcome of a strike on a state bit (e.g., in the register file)
- Decision: is the bit read? If not: benign fault, no error
- Decision: does the bit have error protection?
  - Error is only detected (e.g., parity + no recovery): detected, but unrecoverable error (DUE)
  - Error can be corrected (e.g., ECC): no error
- Decision (unprotected bit): does the bit matter?
  - Yes: silent data corruption (SDC)
  - No: benign fault, no error
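
The decision flow on this slide can be sketched as a small function. This is an illustrative reading of the flattened flowchart, not code from the talk: the names `Protection` and `classify_strike` are hypothetical, and the sketch assumes that a detection-only scheme (e.g., parity) reports a DUE whenever the struck bit is read.

```python
from enum import Enum


class Protection(Enum):
    NONE = "none"                # no error protection on the bit
    DETECT_ONLY = "detect_only"  # e.g., parity with no recovery
    CORRECT = "correct"          # e.g., ECC


def classify_strike(bit_is_read, protection, bit_matters):
    """Classify the outcome of a particle strike on one state bit."""
    if not bit_is_read:
        # The corrupted value is never consumed.
        return "benign fault, no error"
    if protection is Protection.CORRECT:
        return "no error"
    if protection is Protection.DETECT_ONLY:
        # Assumption: detection fires on the read, with no way to recover.
        return "DUE"
    # Unprotected bit: the outcome depends on whether the bit matters.
    return "SDC" if bit_matters else "benign fault, no error"


print(classify_strike(True, Protection.NONE, True))         # SDC
print(classify_strike(True, Protection.DETECT_ONLY, True))  # DUE
print(classify_strike(True, Protection.CORRECT, True))      # no error
```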

Definitions
- Interval-based
  - MTTF = Mean Time to Failure
  - MTTR = Mean Time to Repair
  - MTBF = Mean Time Between Failures = MTTF + MTTR
  - Availability = MTTF / MTBF
- Rate-based
  - FIT = Failure in Time = 1 failure in a billion hours
  - 1 year MTTF = $10^9$ / (24 × 365) FIT = 114,155 FIT
  - SER FIT = SDC FIT + DUE FIT
- Hypothetical example: Cache: 40K FIT + IQ: 100K FIT + FU: 58K FIT = total of 198K FIT
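
The conversions on this slide can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, the 24 × 365 hours-per-year convention is the slide's, and the component FIT values are the slide's hypothetical example.

```python
HOURS_PER_YEAR = 24 * 365  # the slide's convention (leap years ignored)
FIT_HOURS = 1e9            # 1 FIT = 1 failure per 10^9 hours


def mttf_years_to_fit(mttf_years):
    """Convert a mean time to failure in years to a FIT rate."""
    return FIT_HOURS / (mttf_years * HOURS_PER_YEAR)


def fit_to_mttf_years(fit):
    """Convert a FIT rate to a mean time to failure in years."""
    return FIT_HOURS / (fit * HOURS_PER_YEAR)


def availability(mttf, mttr):
    """Availability = MTTF / MTBF, with MTBF = MTTF + MTTR."""
    return mttf / (mttf + mttr)


# Hypothetical example from the slide: FIT rates add across components.
component_fit = {"cache": 40_000, "IQ": 100_000, "FU": 58_000}
total_fit = sum(component_fit.values())

print(round(mttf_years_to_fit(1.0)))           # 114155
print(total_fit)                               # 198000
print(round(fit_to_mttf_years(total_fit), 2))  # 0.58
```

At 198K FIT the hypothetical chip would have an MTTF of a little over half a year, which is why the per-structure contributions matter.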

Number of Vulnerable Bits Growing with Moore’s Law
- Fujitsu SPARC has 20% of 200k latches vulnerable in 2003
- Additional SDC FIT from RAM cells, static logic, and dynamic logic
- Higher SDC FIT in multiprocessor systems
- Gap ~= 100x for 8-processor system
- A data center with 300 such systems will encounter a data corruption almost every week
- Figure annotation: 12x gap

Soft Error Issues
1. Why is soft error a problem today?
   - Industry is at the cross-over point
   - Future is worse, if we don’t do anything
2. What about system FIT contribution?
   - System FIT decreased dramatically (e.g., RAID, ECC on DRAM)
   - Large part of system moving on-chip (e.g., memory controller)
3. Is this a server problem or a desktop problem?
   - Definitely a server (e.g., data center) problem
   - Desktop problem from IT manager’s point of view
4. How do software bugs compare to soft error rates?
   - Limited number of bugs in mature software (e.g., servers, company environment)
   - If we don’t do anything, soft errors will be your dominant failure rate

Balancing Reliability and Performance in Memory Hierarchy

Goals
- Accurately estimate reliability in cache memory early in the design cycle
- Methods to increase reliability of cache memory
- Minimize power / performance impacts

Motivation
- Memory elements are the components most vulnerable to soft errors (Gaisler 97)
- Previous method: fault injection (Faure 03)
  - Software (during design cycle): time-consuming
  - Radiation-based: only after chip fab

Cache Reliability Model
- Separately measures reliability of:
  - Data array
  - Tag array
  - Status bits: valid bit, dirty bit
- Can be extended to other status bits (coherency, etc.)
- Model provides an upper bound on error rate for a given workload

Errors in Data RAM
- Critical Words (CW): read by CPU or written to memory
- Critical Time (CT)

Reliability Computation (I)
- Vulnerability factor: fraction of faults that become errors
- M: cache size
- TT: total execution time
- CT: critical time
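
The slide's formula did not survive in the transcript. One plausible reading of the symbols, assumed here rather than taken from the talk, is that the vulnerability factor is the summed critical time of all cache words divided by M × TT, i.e., the fraction of the cache's total word-time during which a fault would become an error. The function name and the sample numbers below are hypothetical.

```python
def cache_vulnerability_factor(critical_times, cache_size_words, total_time):
    """Assumed form: VF = sum(CT_i) / (M * TT).

    critical_times: critical time accumulated by each cache word, in the
                    same time unit as total_time (e.g., cycles).
    cache_size_words: M, the number of words in the cache.
    total_time: TT, the total execution time.
    """
    if cache_size_words <= 0 or total_time <= 0:
        raise ValueError("cache size and total time must be positive")
    return sum(critical_times) / (cache_size_words * total_time)


# Hypothetical 4-word cache observed for 100 cycles.
vf = cache_vulnerability_factor([50, 0, 25, 25], cache_size_words=4, total_time=100)
print(vf)  # 0.25
```

Under this reading, techniques such as flushing or write-through reduce the factor by shortening the time that critical words sit in the cache.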

Reliability Computation (II)
- Define vulnerability as follows: [equation not recoverable from the transcript]
- Independent of environment and cache size
- Goal: decrease vulnerability while not impacting power and performance

Experimental Setup
- SimpleScalar 4.0
- SPEC2000 benchmarks
- Programs run for 500M instructions

Experiments
- Examined three methods to reduce vulnerability
  - Flushing: periodically flush entire cache
  - Write policy: change to write-thru policy (from write-back)
  - Refreshing: periodically refetch cache blocks from L2 cache

D-Cache: Flushing
- 4x reduction in vulnerability

D-Cache: Write Policy
- 10x reduction in vulnerability

D-Cache: Refresh
- 3x reduction in vulnerability using write-thru (30x total)

Summary
- Reliability estimation of cache hierarchy
  - Based on critical words and critical times
- Several methods to reduce vulnerability
  - Flushing
  - Write-thru policy
  - Refreshing
- 30x decrease in vulnerability with minimal IPC impact

Soft Error Modeling at Logic-Level

Exponential Increase of Soft Errors
- $e^{-Q_{crit}/Q_s}$ trend with technology scaling (Shivakumar, DSN 2002)
  - $Q_{crit}$: the critical charge (depends on characteristics of the circuit)
  - $Q_s$: the charge collection efficiency of a particle strike on the device
- Particles of lower energy occur far more frequently than particles of higher energy
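
To see how steep this exponential dependence is, the sketch below compares the soft error rate before and after a drop in critical charge, holding everything else fixed. The function name and the charge values are hypothetical; only the $e^{-Q_{crit}/Q_s}$ term from the slide is modeled, so particle flux and sensitive area are assumed unchanged.

```python
import math


def relative_ser(qcrit_new, qcrit_old, qs):
    """Ratio SER(new) / SER(old), assuming SER is proportional to exp(-Qcrit/Qs)."""
    return math.exp(-(qcrit_new - qcrit_old) / qs)


# Hypothetical numbers: Qcrit drops from 40 fC to 20 fC with Qs = 10 fC.
ratio = relative_ser(qcrit_new=20.0, qcrit_old=40.0, qs=10.0)
print(round(ratio, 2))  # 7.39
```

Halving the critical charge in this made-up case raises the per-device error rate by a factor of $e^2$.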

Soft Error Rate
- Error rate of node n = Nominal FIT × Logic Derating × Timing Derating
- Nominal FIT: occurrence rate of SEUs at node n causing a glitch
- Logic Derating: propagation of the error from node n to system bistables or outputs
- Timing Derating: propagated transient captured in system bistables
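
The three-factor product can be sketched as follows. The node names and all numeric values are made up for illustration, and summing the per-node rates into a circuit-level figure is an assumption of this sketch rather than something stated on the slide.

```python
def node_ser(nominal_fit, logic_derating, timing_derating):
    """Error rate of one node = nominal FIT x logic derating x timing derating."""
    return nominal_fit * logic_derating * timing_derating


# Hypothetical per-node data: (nominal FIT, logic derating, timing derating).
nodes = {
    "n1": (0.010, 0.14, 0.30),
    "n2": (0.020, 0.50, 0.10),
}

# Assumption: per-node rates add up to the circuit-level rate.
circuit_ser = sum(node_ser(*factors) for factors in nodes.values())
print(round(circuit_ser, 5))  # 0.00142
```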

Combinational Logic (Logic Derating)

SER Estimation in Combinational Logic
- The main idea:
  - Traversing structural paths from the fault’s origin to POs
  - Using signal probabilities for SER estimation

Example: Simple Path
- EPP(gate C) = 1 × 0.2 = 0.2
- EPP(gate D) = 0.2 × (1 − SP_B) = 0.2 × 0.7 = 0.14
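
The simple-path arithmetic can be reproduced with a short sketch. The circuit figure is not in the transcript, so the gate types are assumptions chosen to match the numbers: gate C is taken to be an AND whose off-path input has signal probability 0.2, and gate D an OR whose off-path input B has signal probability 0.3. The function and table names are hypothetical.

```python
# Off-path value each gate type needs so that an error on one input
# reaches the output (standard gate behavior).
SENSITIZING_VALUE = {"AND": 1, "NAND": 1, "OR": 0, "NOR": 0}


def epp_along_path(stages):
    """Return the error propagation probability after each stage of a
    single path without reconvergence.

    stages: list of (gate_type, off_path_sps), where off_path_sps holds the
    signal probability (probability of logic 1) of each off-path input.
    """
    epp = 1.0  # the error is present at the fault site
    trace = []
    for gate_type, off_path_sps in stages:
        if gate_type not in ("NOT", "BUF"):  # inverters and buffers always propagate
            need = SENSITIZING_VALUE[gate_type]  # KeyError for unsupported gates
            for sp in off_path_sps:
                epp *= sp if need == 1 else (1.0 - sp)
        trace.append(epp)
    return trace


epp_c, epp_d = epp_along_path([("AND", [0.2]), ("OR", [0.3])])
print(round(epp_c, 4), round(epp_d, 4))  # 0.2 0.14
```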

Propagation Rules
- $P_a(U_i) + P_{\bar{a}}(U_i) + P_1(U_i) + P_0(U_i) = 1$
- Need 4 logic values
  - 0, 1: no propagation
  - $a$, $\bar{a}$: propagation (same and opposite polarities)
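
The slide's rule table is not in the transcript. The sketch below shows one way four-valued rules of this kind can be written for AND, OR, and NOT gates, under the assumption that gate inputs are statistically independent. The class and function names are hypothetical, and the example reuses the assumed AND/OR path with off-path signal probabilities 0.2 and 0.3.

```python
from math import prod
from typing import NamedTuple


class Sig(NamedTuple):
    pa: float     # carries the error with the same polarity (a)
    pabar: float  # carries the error with opposite polarity (a-bar)
    p1: float     # error-free logic 1
    p0: float     # error-free logic 0


def error_site():
    return Sig(1.0, 0.0, 0.0, 0.0)


def off_path(sp):
    """Off-path signal with signal probability sp (probability of logic 1)."""
    return Sig(0.0, 0.0, sp, 1.0 - sp)


def and_gate(inputs):
    p1 = prod(x.p1 for x in inputs)
    pa = prod(x.p1 + x.pa for x in inputs) - p1        # all inputs in {1, a}, not all 1
    pabar = prod(x.p1 + x.pabar for x in inputs) - p1  # all inputs in {1, a-bar}, not all 1
    return Sig(pa, pabar, p1, 1.0 - (pa + pabar + p1))


def or_gate(inputs):
    p0 = prod(x.p0 for x in inputs)
    pa = prod(x.p0 + x.pa for x in inputs) - p0
    pabar = prod(x.p0 + x.pabar for x in inputs) - p0
    return Sig(pa, pabar, 1.0 - (pa + pabar + p0), p0)


def not_gate(x):
    return Sig(x.pabar, x.pa, x.p0, x.p1)  # polarity and logic value both flip


c = and_gate([error_site(), off_path(0.2)])
d = or_gate([c, off_path(0.3)])
print(round(d.pa + d.pabar, 4))  # 0.14
```

The four probabilities of every signal sum to 1, matching the constraint on the slide, and the error propagation probability at a node is the sum of its two error-carrying terms.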

Algorithm
For any gate $g_i$:
1. Extract all on-path signals (and gates) from $g_i$ to any reachable primary output $PO_j$ and/or flip-flop $ff_j$
2. Levelize signals on these paths
3. Traverse the paths in order
   - Use signal probabilities for off-path signals
   - Use propagation rules for on-path signals
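
Steps 1 and 2 can be sketched on a toy netlist. The netlist encoding (a dict from each gate to its fan-in signals), the function name, and the example circuit are all hypothetical; the sketch only identifies on-path and off-path signals and assigns levels, and leaves the probability computation of step 3 out.

```python
from collections import deque


def on_path_levels(fanins, site):
    """Return (level, off_path) for an error at `site`.

    fanins: dict mapping each gate output to the list of its input signals.
    level: dict mapping every on-path signal to its level (site = 0).
    off_path: sorted list of signals that feed on-path gates but cannot
              carry the error themselves.
    """
    # Build the fan-out map so the netlist can be walked forward.
    fanouts = {}
    for gate, ins in fanins.items():
        for s in ins:
            fanouts.setdefault(s, []).append(gate)

    # Step 1: forward reachability from the error site.
    reached = {site}
    queue = deque([site])
    while queue:
        s = queue.popleft()
        for g in fanouts.get(s, []):
            if g not in reached:
                reached.add(g)
                queue.append(g)

    # Step 2: level = longest on-path distance from the site.
    level = {site: 0}

    def lvl(g):
        if g not in level:
            level[g] = 1 + max(lvl(s) for s in fanins[g] if s in reached)
        return level[g]

    for g in reached:
        lvl(g)

    off_path = sorted({s for g in reached if g != site and g in fanins
                       for s in fanins[g] if s not in reached})
    return level, off_path


# Toy netlist with a reconvergent fan-out at C (C feeds both D and F).
netlist = {"C": ["A", "B"], "D": ["C", "E"], "F": ["C"], "G": ["D", "F"]}
level, off_path = on_path_levels(netlist, "A")
print(sorted(level.items(), key=lambda kv: (kv[1], kv[0])))
print(off_path)  # ['B', 'E']
```

Traversing the on-path signals in increasing level order guarantees that every input of a gate has been processed before the gate itself.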

Example: Reconvergent Fanouts

Sequential Logic (Timing Derating)

Glitch Propagation: Simple Path
- Duration and time of propagated glitch
  - Depends on propagation and transition delays along the path
- Glitch propagation probability
  - Depends on signal probabilities of off-path signals
- Error propagation probability (EPP) = Propagation probability (PP) × Latching probability (LP)

Latching Probability
- LP = (S + H + W) / T
- S, H: setup and hold time
- W: glitch width
- T: clock period
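
A minimal sketch of the two formulas above. The function names and the timing values are hypothetical, and clamping LP to 1 for very wide glitches is an addition of this sketch, not something stated on the slide.

```python
def latching_probability(setup, hold, width, period):
    """LP = (S + H + W) / T, clamped to 1 (the clamp is an assumption)."""
    return min(1.0, (setup + hold + width) / period)


def error_propagation_probability(pp, lp):
    """EPP = propagation probability x latching probability."""
    return pp * lp


# Hypothetical timing in ns: S = H = 0.05, W = 0.2, T = 1.0.
lp = latching_probability(setup=0.05, hold=0.05, width=0.2, period=1.0)
epp = error_propagation_probability(pp=0.14, lp=lp)
print(round(lp, 3), round(epp, 4))  # 0.3 0.042
```

Because W appears in the numerator, wider glitches are latched more often for a fixed clock period.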

Reconvergent Paths
- Propagated waveforms
  - Multiple waveforms, not simple glitches

Approach
- Find all possible propagated waveforms
  - Enhanced static timing analysis
  - All possible transitions at each reachable gate, due to a glitch at the error site
- Find the probability of each waveform
  - Using an approach similar to logic derating
- Compute time-logic derating at each node
- Compute overall SER

Example

Validation
- Monte Carlo simulation
  - Inject glitches at the outputs of random gates, at random times
  - Perform timing-accurate simulation
  - Identify whether the error is captured in a flip-flop
  - Compute soft error rate
  - Stop if the computed value reaches the confidence interval (e.g., 3% error margin), or if the simulation doesn’t converge after N iterations
- Too time-consuming
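
The stopping rule can be illustrated with a toy Monte Carlo loop. This is a sketch under strong simplifications, not the validation setup from the talk: instead of a timing-accurate circuit simulation it injects a glitch of fixed width at a random time in one clock period and counts it as captured when it overlaps the setup/hold window, assuming S + H + W < T. The 95% normal-approximation confidence interval, the batch size, and all names are assumptions.

```python
import math
import random


def monte_carlo_latching(setup, hold, width, period, rel_margin=0.03,
                         max_iters=1_000_000, batch=1000, seed=0):
    """Estimate the capture probability; return (estimate, samples, converged)."""
    rng = random.Random(seed)
    hits = 0
    n = 0
    while n < max_iters:
        for _ in range(batch):
            t = rng.uniform(0.0, period)  # glitch start time within one cycle
            # Captured if the glitch [t, t + width] overlaps the latching
            # window around the clock edge at 0 or at `period`.
            if t <= hold or t + width >= period - setup:
                hits += 1
        n += batch
        if hits:
            p = hits / n
            half_width = 1.96 * math.sqrt(p * (1.0 - p) / n)  # 95% normal CI
            if half_width <= rel_margin * p:
                return p, n, True  # reached the requested error margin
    return hits / n, n, False      # gave up after max_iters samples


estimate, samples, converged = monte_carlo_latching(0.05, 0.05, 0.2, 1.0)
analytical = (0.05 + 0.05 + 0.2) / 1.0
print(converged, samples, round(estimate, 3), analytical)
```

Even this one-flip-flop toy needs on the order of ten thousand samples to reach a 3% margin, which hints at why full-circuit Monte Carlo is labeled too time-consuming.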

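Results: Timing Derating
- Table (most cell values lost in extraction). Columns: Circuit | #Gates | Run time (sec): MC Sim, Proposed | Speedup | w = 50 ns: % Diff, Sim Status | w = 70 ns: % Diff, Sim Status
- Sim Status entries are C or NC, which appear to denote whether the Monte Carlo simulation converged
- Surviving row fragments, in order (circuit names begin with "s"): C 5.03 NC; C 3.3 NC; C 0.51 C; NC 1.01 NC; NC 3.48 NC; C 1.47 C; C 0.92 C; C 2.25 C; C 1.29 C; N/A 537 - NC -; N/A 671 - NC -; N/A 645 - NC -; followed by an "average" row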

SER vs Glitch Width

Summary
- Analytical method for SER estimation at logic level
  - Logic and timing derating
  - Based on signal probabilities
  - Traversing topological paths in the netlist
- Very fast and accurate compared to Monte Carlo fault injection
  - 4-5 orders of magnitude faster
  - More than 96% accurate
- Application
  - Reliability measurement
  - Cost-effective soft error hardening

More Issues
- Soft error hardening
  - For individual gates
    - Gate sizing
    - Using isolation devices (c-pass transistors)
    - …
  - For entire design
    - Balancing reliability and overheads (area, power, delay)

References
- J. Kumar, M.B. Tahoori, “Use of Pass Transistor Logic to Minimize the Impact of Soft Errors in Combinational Circuits,” in Workshop on System Effects of Logic Soft Errors (SELSE).
- G. Asadi, M.B. Tahoori, “An Analytical Approach for Soft Error Rate Estimation in Digital Circuits,” in IEEE International Symposium on Circuits and Systems (ISCAS).
- G. Asadi, V. Sridharan, M.B. Tahoori, D. Kaeli, “Balancing Performance and Reliability in the Memory Hierarchy,” in IEEE Boston Area Architecture (BARC) Workshop.
- G. Asadi, M.B. Tahoori, “Soft Error Mitigation for SRAM-based FPGAs,” in VLSI Test Symposium (VTS).
- G. Asadi, V. Sridharan, M.B. Tahoori, D. Kaeli, “Balancing Performance and Reliability in the Memory Hierarchy,” in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
- G. Asadi, M.B. Tahoori, “An Accurate SER Estimation Method Based on Propagation Probability,” in Design Automation and Test in Europe (DATE) Conference.
- G. Asadi, M.B. Tahoori, “Soft Error Rate Estimation and Mitigation for SRAM-based FPGAs,” in ACM International Conference on Field Programmable Gate Arrays (FPGA), 2005.

Questions?

Origin of Cosmic Rays
- Cosmic rays come from deep space
- Figure 2, Ziegler, et al., “IBM experiments in soft fails in computer electronics ( ),” IBM J. of R. & D., Vol. 40, No. 1, Jan

Computing FIT Rate of a Chip
- FIT Rate Law: the FIT rate of a system is the sum of the FIT rates of its individual components
- Vulnerable Bit Law: the FIT rate of a chip is the sum of the FIT rates of the vulnerable bits in that chip
- Total Soft Error FIT = $\sum_i (\text{intrinsic error rate}_i \times \text{vulnerability factor}_i)$, summed over each vulnerable device $i$
- Vulnerability factor = fraction of faults that become errors
- Vulnerability factor is also known as “derating factor” and “soft error sensitivity (SES)”
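
The two laws compose directly, as the sketch below shows. The function names, the structure names, and every number are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def chip_fit(devices):
    """Vulnerable Bit Law: sum of intrinsic error rate x vulnerability factor.

    devices: iterable of (intrinsic_fit, vulnerability_factor) pairs.
    """
    return sum(fit * vf for fit, vf in devices)


def system_fit(component_fits):
    """FIT Rate Law: a system's FIT rate is the sum of its components' rates."""
    return sum(component_fits)


# Hypothetical chip: three structures with made-up rates and factors.
structures = [
    (1000.0, 0.25),  # e.g., a cache array
    (400.0, 0.50),   # e.g., an instruction queue
    (200.0, 0.10),   # e.g., a functional unit
]
one_chip = chip_fit(structures)
print(one_chip)                    # 470.0
print(system_fit([one_chip] * 8))  # 3760.0 for a hypothetical 8-chip system
```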

Issues: Output Dependency
- Same error propagated to multiple outputs
- Solution: forward the signal probability of one PO to the other PO
  - SP of $PO_k$ is forwarded to the next stages instead of EPP of $PO_k$