Published by Celine Popson. Modified over 9 years ago.
1
Evaluating Impact of Soft-Errors in an Embedded System
Vijay Sheshadri, Graduate Student, Dept. of Electrical Engineering
2
May 3, 2015
What is a Soft-error?
A soft error is a transient fault caused by cosmic-ray particles.
A charged particle strikes a component.
The charged particle creates electron-hole pairs (EHPs), which are collected by the drain.
Sufficient charge collection causes an erroneous bit-flip (for example, a stored 1 becomes 0).
3
Soft-error in a System
[Flowchart, flattened in extraction. Decision nodes: "Bit read?", "Bit has error protection?", "Does bit matter?". Protection branches: error is only detected (e.g., parity + no recovery); error can be corrected (e.g., ECC). Outcomes: benign fault (no error); no error; Silent Data Corruption (SDC); detected but unrecoverable error (DUE).]
Source: Shubhu Mukherjee et al., Radiation-Induced Soft Errors: An Architectural Perspective, HPCA 2005
4
Masking of Soft-error
[Figure: combinational logic (inputs I1 to I7, outputs O1 and O2) placed between input and output registers, with a particle strike on an internal node. One path ends in a soft error, the other in no soft error.]
Masking mechanisms shown: electrical masking, latching-window masking, logical masking.
5
FIT Equation: Vulnerability Factors
$$\mathrm{FIT} = \sum_{i} \left(\text{intrinsic error rate}_i \times \text{vulnerability factor}_i\right)$$
where the sum runs over each vulnerable device i.
Vulnerability Factor = Timing Vulnerability Factor × Architectural Vulnerability Factor
Timing Vulnerability Factor (TVF): fraction of time the bit is vulnerable.
Architectural Vulnerability Factor (AVF): fraction of time the bit matters for the final output of a program.
Source: Shubhu Mukherjee et al., Radiation-Induced Soft Errors: An Architectural Perspective, HPCA 2005
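A minimal sketch of the FIT equation above, written out as a direct sum. The device names and all numeric values are hypothetical placeholders chosen for illustration; they are not figures from the talk or from the cited paper.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str             # hypothetical label, for illustration only
    intrinsic_fit: float  # intrinsic error rate of the device, in FIT
    tvf: float            # timing vulnerability factor, 0.0 to 1.0
    avf: float            # architectural vulnerability factor, 0.0 to 1.0


def vulnerability_factor(device):
    """Vulnerability factor = TVF * AVF, as defined on the slide."""
    return device.tvf * device.avf


def total_fit(devices):
    """Sum intrinsic error rate * vulnerability factor over all devices."""
    return sum(d.intrinsic_fit * vulnerability_factor(d) for d in devices)


# Placeholder values: a structure whose bits matter 20% of the time,
# and a branch predictor whose AVF is 0% (it never affects the output).
devices = [
    Device("register_file", intrinsic_fit=100.0, tvf=1.0, avf=0.2),
    Device("branch_predictor", intrinsic_fit=50.0, tvf=1.0, avf=0.0),
]
system_fit = total_fit(devices)
```

A structure with AVF = 0 contributes nothing to the total, however high its intrinsic error rate.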
6
Architectural Vulnerability Factor
Fraction of time a bit matters for the final output of a program.
Branch predictor: doesn't matter at all (AVF = 0%).
Program counter: almost always matters (AVF ~ 100%).
Computing AVF for complex structures:
  Statistical fault injection
  ACE (Architecturally Correct Execution) analysis
Source: Shubhu Mukherjee et al., Radiation-Induced Soft Errors: An Architectural Perspective, HPCA 2005
7
Soft-error & Automobiles
Mar 2010: NHTSA enlisted the NASA Engineering and Safety Center (NESC) to investigate "Unintended Acceleration".
Apr 2011: NESC discounted SEU (single-event upset) in its report to NHTSA, stating that the ICs were manufactured using SOI (silicon-on-insulator) technology.
Per the AEC-Q100 standard, SEU testing is required for automobile electronics with RAM > 1 Mb.
8
An Example
Predicted block RAM upset rate for a Virtex-5 FPGA = 635 FIT/Mb ≈ 1.5E-05 upsets per day per Mb (1 FIT = 1 failure per 1E9 device-hours, so 635 x 24 / 1E9 = 1.524E-05).
Ref: A. Lesea, "Continuing Experiments of Atmospheric Neutron Effects on Deep Submicron Integrated Circuits," WP286 (v1.0), Xilinx, Inc., 2008
Assume this FPGA is used in a throttle control module.
If the vendor produces 500,000 such vehicles, then the total upsets per day, per Mb of block RAM in each vehicle, = 1.524E-05 x 500,000 ≈ 7.6 vehicle upsets per day.
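The arithmetic on this slide can be checked with a short sketch. It relies only on the standard definition of FIT (one failure per 1E9 device-hours) and on the slide's own figures; the function name is made up for illustration, and the calculation is per Mb of block RAM in each vehicle.

```python
HOURS_PER_DAY = 24
DEVICE_HOURS_PER_FIT = 1e9  # 1 FIT = 1 failure per 1e9 device-hours


def fit_to_upsets_per_day(fit):
    """Convert a FIT rate into expected upsets per day for one device."""
    return fit * HOURS_PER_DAY / DEVICE_HOURS_PER_FIT


fit_per_mb = 635      # predicted block RAM upset rate, FIT/Mb (from the slide)
fleet_size = 500_000  # vehicles produced (from the slide)

upsets_per_day_per_mb = fit_to_upsets_per_day(fit_per_mb)  # about 1.524e-05
fleet_upsets_per_day = upsets_per_day_per_mb * fleet_size  # about 7.62
```

Using the rounded 1.5E-05 gives 7.5; the slide's 7.6 comes from the unrounded 1.524E-05.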
9
Soft-error Mitigation
Robust (radiation-hardened) circuit designs are resilient to soft-errors.
Soft-error mitigation at:
  Device level: silicon-on-insulator, triple-well
  Circuit level: DICE cell, triple-modular redundancy
  Architecture level: RMT (redundant multithreading), lock-stepping, ECC
10
Soft-error Mitigation
Soft-error mitigation techniques incur penalties in area (spatial redundancy) and timing (temporal redundancy).
Selective hardening of components reduces the penalty; it is often based on logical, electrical, or timing derating.
A low-cost mitigation technique based on application derating is proposed for critical applications.
Certain applications can mask or recover from transient faults*.
*Ref: V. Wong et al., "Soft Error Resilience of Probabilistic Inference Applications," SELSE II, 2006
11
Critical Application - An Analogy
[Figure: applications sharing one micro-controller: climate monitor/display, airbag deployment, GPS, cruise control.]
A micro-controller embedded in a car dashboard may be handling many applications. A critical application in this case could be 'Airbag deployment'. A soft error (SE) during this application could be catastrophic.
12
Target Module
PWM: the output is a pulse whose width decides the speed of the motor.
Etpwmi0 module: ~800 FFs and ~3000 logic gates.
180-nm CMOS technology, 80 MHz frequency.
[Block diagram: ADC, CPU core, PWM, Motor.]
13
Basic Simulation Steps*
Pre-analysis: identify the components utilized by the critical application.
Fault injection: inject a single fault at a random time instance by depositing the opposite value on the component.
Error metrics:
  Error count: number of mismatches between the output and the reference.
  PW count: number of clock cycles for which the output is '1', compared with the reference.
*Ref: J. Blome et al., "Cost-Efficient Soft Error Protection for Embedded Microprocessors," CASES, 2006
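The two error metrics can be sketched as follows, assuming the output and the reference are available as lists with one 0/1 sample per clock cycle. That representation, the function names, and the sample waveforms are assumptions made for illustration; the talk does not show how the metrics were coded.

```python
def error_count(output, reference):
    """Number of clock cycles in which the output differs from the reference."""
    if len(output) != len(reference):
        raise ValueError("waveforms must cover the same number of clock cycles")
    return sum(1 for o, r in zip(output, reference) if o != r)


def pw_count(waveform):
    """Number of clock cycles for which the waveform is '1' (pulse width)."""
    return sum(1 for value in waveform if value == 1)


# Hypothetical waveforms: the faulty pulse stays high one cycle too long.
reference = [0, 1, 1, 1, 0, 0, 0, 0]
faulty = [0, 1, 1, 1, 1, 0, 0, 0]

mismatches = error_count(faulty, reference)
pw_difference = pw_count(faulty) - pw_count(reference)
```

Both metrics are needed: a pulse shifted by one cycle has an unchanged PW count but a non-zero error count.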
14
Simulation tools
Verilog netlist simulated with timing information, using Synopsys VCS.
Fault-injection module coded in C. It uses VPI (Verilog Procedural Interface) functions to:
  Access a net in the netlist (vpiHandle)
  Read the value of the net (vpi_get_value)
  Overwrite the value of the net (vpi_put_value)
15
Simulation - Pre-analysis
Categorize FFs based on their activity:
  a) Low-activity FFs (no. of toggles less than 2)
  b) High-activity FFs (no. of toggles higher than 2)
Opposite values are forced onto the FFs and the output pulse is observed for errors.
FFs in which errors were observed are identified and subjected to fault injection.
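The activity-based categorization can be sketched as below, assuming each FF's value has been recorded once per clock cycle. The FF names and traces are hypothetical. The slide does not say which group an FF with exactly 2 toggles belongs to; this sketch places it in the high-activity group, which is an assumption.

```python
def count_toggles(trace):
    """Number of value changes between consecutive samples of one FF."""
    return sum(1 for prev, cur in zip(trace, trace[1:]) if prev != cur)


def categorize_by_activity(traces, threshold=2):
    """Split FFs into (low_activity, high_activity) lists by toggle count.

    FFs with fewer than `threshold` toggles are low-activity. FFs with
    exactly `threshold` toggles go to the high-activity list (assumption).
    """
    low, high = [], []
    for name, trace in traces.items():
        if count_toggles(trace) < threshold:
            low.append(name)
        else:
            high.append(name)
    return low, high


# Hypothetical per-cycle traces for three flip-flops.
traces = {
    "ff_cfg": [0, 0, 0, 0, 0, 0],   # never toggles
    "ff_cnt0": [0, 1, 0, 1, 0, 1],  # toggles every cycle
    "ff_mode": [0, 0, 1, 1, 1, 1],  # toggles once
}
low_activity, high_activity = categorize_by_activity(traces)
```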
16
Simulation - Fault-injection
For the FFs obtained from pre-analysis, inject a fault at a random instance of time (within the time interval of the first output pulse).
Measure Error count and PW count.
Identify FFs whose error is within acceptable limits.
[Figure: the test bench (Verilog) passes the original value to the fault-injection module (C + VPI), which returns the modified value; the fault-injection window spans the first output pulse.]
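The injection loop can be illustrated on a toy model. The sketch below is not the Verilog/VPI setup used in the talk: it replaces the netlist with a hypothetical three-register PWM model (a counter, a duty register, and an unused spare register), flips one randomly chosen bit at a random cycle inside the first output pulse, and compares the result against a fault-free run. All names and parameters are assumptions.

```python
import random

# Hypothetical registers and their widths in bits.
REGISTER_BITS = {"counter": 3, "duty": 3, "spare": 3}


def simulate_pwm(cycles, period=8, duty=3, fault=None):
    """Return the per-cycle PWM output; fault = (cycle, register, bit) or None."""
    state = {"counter": 0, "duty": duty, "spare": 0}
    output = []
    for cycle in range(cycles):
        if fault is not None and fault[0] == cycle:
            _, register, bit = fault
            state[register] ^= 1 << bit  # deposit the opposite value on one bit
        output.append(1 if state["counter"] < state["duty"] else 0)
        state["counter"] = (state["counter"] + 1) % period
    return output


def run_campaign(n_faults, seed=0, cycles=32, period=8, duty=3):
    """Inject single faults at random and count runs that differ from golden."""
    rng = random.Random(seed)
    golden = simulate_pwm(cycles, period, duty)
    errors = {name: 0 for name in REGISTER_BITS}
    injected = {name: 0 for name in REGISTER_BITS}
    for _ in range(n_faults):
        cycle = rng.randrange(duty)  # first output pulse spans cycles 0..duty-1
        register = rng.choice(sorted(REGISTER_BITS))
        bit = rng.randrange(REGISTER_BITS[register])
        faulty = simulate_pwm(cycles, period, duty, (cycle, register, bit))
        injected[register] += 1
        if faulty != golden:
            errors[register] += 1
    return errors, injected


errors, injected = run_campaign(50)
```

In this toy model a flip in the spare register never reaches the output, which mirrors the idea that only FFs used by the application need to be considered for hardening.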
17
Absolute error vs. Acceptable error
Absolute error: raise the error flag for any mismatch between the output pulse and the reference.
Acceptable error: raise the error flag only if the mismatch between the output pulse and the reference lies outside the tolerance limit*.
Examples: delayed pulse, self-correcting pulse.
[Figure: for each example, waveforms of the target FF (with the fault-injection point marked), the reference copy, and the actual output; the delayed-pulse example marks the delay.]
*Ref: X. Li et al., "Exploiting Soft Computing for Increased Fault Tolerance," Workshop on Architectural Support for Gigascale Integration, 2006
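The difference between the two criteria can be sketched as two predicates. The talk does not define the tolerance limit; here it is assumed to be a maximum number of mismatching clock cycles, and the waveforms are hypothetical.

```python
def mismatch_cycles(output, reference):
    """Number of clock cycles in which output and reference differ."""
    return sum(1 for o, r in zip(output, reference) if o != r)


def absolute_error_flag(output, reference):
    """Flag an error for any mismatch at all."""
    return mismatch_cycles(output, reference) > 0


def acceptable_error_flag(output, reference, tolerance_cycles):
    """Flag an error only if the mismatch exceeds the tolerance.

    `tolerance_cycles` is an assumed definition of the tolerance limit.
    """
    return mismatch_cycles(output, reference) > tolerance_cycles


# Hypothetical delayed pulse: same width as the reference, one cycle late.
reference = [0, 1, 1, 1, 0, 0]
delayed = [0, 0, 1, 1, 1, 0]

flag_absolute = absolute_error_flag(delayed, reference)
flag_acceptable = acceptable_error_flag(delayed, reference, tolerance_cycles=2)
```

With a tolerance of two cycles, the delayed pulse is reported under the absolute criterion but tolerated under the acceptable one.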
18
Simulations - Combinational logic
Fault injection steps:
  The SE is modeled as a 1 ns pulse (system clock frequency = 80 MHz).
  The transient pulse is injected onto the gate output.
  The target combinational circuit is selected at random.
Example: 2-input NAND gate.
[Figure: NAND gate with inputs A, B and output Y, shown as a reference copy and as the actual output with the fault injected on Y.]
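A toy sketch of why a 1 ns transient on a gate output is often not captured at an 80 MHz clock (12.5 ns period). It assumes the NAND output feeds a flip-flop that samples at one instant per clock period, and it ignores setup/hold windows, electrical attenuation, and downstream logic; those simplifications, the 0.5 ns time step, and all names are assumptions, not details from the talk.

```python
STEP_NS = 0.5            # simulation time step (assumed)
CLOCK_PERIOD_STEPS = 25  # 12.5 ns clock period at 80 MHz
PULSE_STEPS = 2          # 1 ns transient pulse


def nand(a, b):
    """2-input NAND gate."""
    return 0 if (a and b) else 1


def sampled_outputs(a, b, n_cycles, pulse_start=None):
    """Values of Y captured at each clock edge, with an optional transient.

    The transient inverts Y for PULSE_STEPS steps starting at `pulse_start`.
    """
    samples = []
    for step in range(n_cycles * CLOCK_PERIOD_STEPS):
        y = nand(a, b)
        if pulse_start is not None and pulse_start <= step < pulse_start + PULSE_STEPS:
            y ^= 1  # the injected pulse flips the gate output
        if step % CLOCK_PERIOD_STEPS == 0:
            samples.append(y)  # idealized flip-flop sampling instant
    return samples


golden = sampled_outputs(1, 1, n_cycles=3)

# Sweep the pulse start over one full clock period and count captures.
starts = range(CLOCK_PERIOD_STEPS, 2 * CLOCK_PERIOD_STEPS)
captured = sum(
    1 for s in starts if sampled_outputs(1, 1, n_cycles=3, pulse_start=s) != golden
)
capture_fraction = captured / len(starts)
```

Under these assumptions only pulses overlapping a sampling instant are latched, so the captured fraction equals the pulse width divided by the clock period.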
19
Results
Pre-analysis: ~18% of FFs are used by the application.
Fault injection: the number of faults injected is proportional to the number of flip-flops in each group.
Low-toggle FFs are more numerous, hence the number of faults injected into low-toggle FFs is higher.
20
Results
Low-toggle FFs are more vulnerable to soft-errors, since an erroneous bit-flip may persist without being overwritten.
A high-toggle FF is written very often, so an erroneous bit-flip has a higher probability of being overwritten.
21
Computing AVF
$$\mathrm{AVF} = P_e \times \%\,\text{component}$$
$$P_e = \frac{\text{no. of errors}}{\text{no. of faults injected}}$$
P_e = probability that a fault injected in the component results in an error.
% component = the percentage of that component with respect to the total number of components.
Example: for a latch,
  a. if # errors = 50% of injected faults, then P_e = 0.5
  b. if latches make up 20% of the circuit, then % component = 0.2
  AVF = 0.5 x 0.2 = 0.1
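The AVF calculation above, as a short sketch. The function names are made up for illustration, and the counts are chosen only to reproduce the slide's latch example (P_e = 0.5, latches = 20% of the circuit).

```python
def error_probability(n_errors, n_faults_injected):
    """P_e = (no. of errors) / (no. of faults injected)."""
    if n_faults_injected <= 0:
        raise ValueError("at least one fault must be injected")
    return n_errors / n_faults_injected


def component_avf(n_errors, n_faults_injected, n_component, n_total):
    """AVF = P_e * (fraction of the circuit made up of this component)."""
    fraction = n_component / n_total
    return error_probability(n_errors, n_faults_injected) * fraction


# Slide example: 50 errors out of 100 injected faults, and latches
# account for 20 of every 100 components.
latch_avf = component_avf(n_errors=50, n_faults_injected=100,
                          n_component=20, n_total=100)
```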
22
AVF - Results
Low-activity FFs have a higher P_e and are more numerous; hence they have a higher AVF.
Combinational logic, though high in number, has P_e ~4E-03, which causes its AVF to drop.
23
Summary
Fault-resilience scheme for critical applications using application derating and inherent error tolerance.
For the application considered:
  ~12% of the sequential logic was safety-critical (previous work reports 30% of sequential logic hardened for 99% fault coverage in an ARM embedded processor running an image-processing algorithm).
  Failures in combinational logic were negligible.
The worst-case scenario would only be the same as radiation-hardening a generic system, i.e., all the hardware is identified as safety-critical.
24
Future Work
Perform fault-injection analysis on the processor core managing the control loop.
Conduct neutron-beam experiments on the circuit to compare with simulations and find the FIT rate.
Implement circuit hardening and test the system to ascertain its robustness.