Spring 2008 CSE 591 Compilers for Embedded Systems Aviral Shrivastava Department of Computer Science and Engineering Arizona State University.


2 Spring 2008 CSE 591 Compilers for Embedded Systems Aviral Shrivastava Department of Computer Science and Engineering Arizona State University

3 Lecture 3: Soft Errors Models and Techniques

4 Outline □Soft Errors Recap □Process Technology and Packaging Solutions □Gate-level and Circuit-level Solutions □Microarchitectural Solutions □Single-core □Multi-threaded □Software Solutions □Multi Bit Upsets (MBUs) □Single Event Latchup

5 Phenomenon of Soft Error □Transient Faults □Random and spontaneous bit-changes in system □Can be caused by □Circuit noise □Cross-talk □More than 50% due to radiation strike

6 Metrics □FIT: Failure in Time □No. of failures in 1 billion hours of operation □MTTF: Mean Time To Failure □1000 FITs => MTTF of 114 years □1 GByte of RAM @ 500 FIT/Mbit can expect an error every two weeks □ECC reduces failure rate by 2 orders of magnitude □hypothetical Terabyte system would experience a soft error every few minutes
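The FIT and MTTF figures on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper names are hypothetical, and it assumes binary prefixes (1 GByte = 8 × 1024 Mbit) and a 365-day year. Under those assumptions the results land at the same order of magnitude as the slide's "two weeks" and "few minutes".

```python
HOURS_PER_YEAR = 24 * 365
FIT_HOURS = 1e9  # 1 FIT = one failure per billion hours of operation


def mttf_hours(fit: float) -> float:
    """Mean time to failure in hours for a given FIT rate."""
    return FIT_HOURS / fit


def memory_fit(capacity_mbit: float, fit_per_mbit: float) -> float:
    """Total FIT of a memory, assuming every Mbit fails independently."""
    return capacity_mbit * fit_per_mbit


# 1000 FIT corresponds to an MTTF of about 114 years.
years_1000_fit = mttf_hours(1000) / HOURS_PER_YEAR

# 1 GByte of RAM at 500 FIT/Mbit (assumed: 1 GByte = 8 * 1024 Mbit).
gbyte_days = mttf_hours(memory_fit(8 * 1024, 500)) / 24

# A hypothetical 1 TByte system at the same per-Mbit rate.
tbyte_minutes = mttf_hours(memory_fit(8 * 1024 * 1024, 500)) * 60

print(f"1000 FIT     -> MTTF of {years_1000_fit:.0f} years")
print(f"1 GByte RAM  -> one error every {gbyte_days:.1f} days")
print(f"1 TByte RAM  -> one error every {tbyte_minutes:.1f} minutes")
```

With these assumptions the 1 GByte case comes out near 10 days and the Terabyte case near 14 minutes.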

7 Trends □DRAM □System error rate of DRAMs is fairly constant □SRAM □Increasing exponentially □Logic □Increasing exponentially

8 Masking Effects □Logic Masking □Occurs when particle strikes a portion of combinational logic that is blocked from affecting the output due to a subsequent gate whose result is completely determined by its other input values □Electrical Masking □Occurs when the pulse resulting from a particle strike is attenuated by subsequent logic gates, and does not affect the result of the circuit □Latching Window Masking □Occurs when the pulse resulting from a particle strike reaches a latch, but not at the clock transition where the latch captures its input values □Microarchitectural Masking □Occurs when the incorrect value in the latch is ignored in evaluation of a program variable □Software Masking □Occurs when an incorrect value of a variable is ignored by the software while computing the outputs
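Logic masking can be illustrated on a tiny circuit. The sketch below uses a made-up circuit, out = (a AND b) OR c, which is not from the lecture, and flips the internal AND node to imitate a particle strike. Whenever c is 1, the OR gate's output is completely determined by c, so the flip is masked.

```python
from itertools import product


def circuit(a: int, b: int, c: int, flip_internal: bool = False) -> int:
    """Hypothetical circuit: out = (a AND b) OR c."""
    n = a & b
    if flip_internal:
        n ^= 1  # model a transient fault on the internal node
    return n | c


# Count the input combinations for which the flipped node does not
# change the output (the fault is logically masked).
inputs = list(product((0, 1), repeat=3))
masked = sum(circuit(a, b, c) == circuit(a, b, c, True) for a, b, c in inputs)

print(f"fault masked for {masked} of {len(inputs)} input combinations")
```

In this toy case the fault is masked for exactly the four input combinations with c = 1.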

9 Faults, Errors, Failures (“Fault Tolerant Computer Systems”, by Pradhan) □Fault □Defect in hardware or software component □defect for cosmic ray = upset from high-energy neutron strike □Error □manifestation of a fault, resulting in deviation from accuracy □faults cause errors (but not vice versa) □a masked fault is not an error! □vulnerability factor = fraction of faults that cause errors □Failure □non-performance of expected action □errors cause failures (but not vice versa) □a corrected error doesn’t cause a failure

10 Fault Tolerance in Microprocessors □Information Redundancy □Protecting data words with information coding □Parity or Hamming codes □ECC mainly in memory arrays □Cost is additional storage for coding overhead, and checking logic □Space Redundancy □Carrying out the same computation on multiple independent hardware units at the same time □Errors are exposed by checking the independent results □Causes large hardware overhead □Good for permanent faults □Time Redundancy □Execute the same computation on the same hardware at different times
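A minimal sketch of two of these ideas, assuming nothing beyond the slide: a single even-parity bit as information redundancy, and a majority vote over independently computed results as space redundancy (the same vote could be applied to repeated runs for time redundancy). All function names are hypothetical.

```python
from collections import Counter


def even_parity(word: int) -> int:
    """Parity bit that makes the total number of 1s even."""
    return bin(word).count("1") & 1


def parity_ok(word: int, stored_parity: int) -> bool:
    """True when the word still matches the parity bit stored with it."""
    return even_parity(word) == stored_parity


def majority(results):
    """Return the value produced by a strict majority of the redundant units."""
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(results):
        raise ValueError("no majority: error detected but not correctable")
    return value


word = 0b10110010
parity = even_parity(word)

single_flip = word ^ 0b00000001  # one upset bit: detected by parity
double_flip = word ^ 0b00000011  # two upset bits: parity cannot see it

print(parity_ok(word, parity), parity_ok(single_flip, parity),
      parity_ok(double_flip, parity))
print(majority([12, 12, 13]))  # one faulty unit is outvoted
```

Parity only detects an odd number of flipped bits and cannot correct anything, which is why correction needs a Hamming-style code or redundant execution.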

11 The Soft Error Opportunity □Key differences with classical fault tolerance □FIT budget 100x – 1000x more than Tandem-style machines □Traditional “big hammer” solutions are too expensive for the volume market and can be overkill □Why does architecture play a critical role? □error often defined in architecture & microarchitecture □e.g., strike on a branch predictor doesn’t cause an error □architectural solutions are often more cost-effective □one bit of parity can protect 64 bits, overhead < 2% □radiation-hardened cells can have overhead around 20-40%

12 Outline □Soft Errors Recap □Process Technology and Packaging Solutions □Gate-level and Circuit-level Solutions □Microarchitectural Solutions □Single-core □Multi-threaded □Software Solutions □Multi Bit Upsets (MBUs) □Single Event Latchup

13 Processing and Packaging Solutions □Reduce the number of particles that strike □Reduce upsets □Use of highly purified fabrication materials □Remove traces of boron and heavy metals □Surround by metallic frame □Reduce low-energy particles □But neutrons can pass through > 10 ft of concrete □Process Technology Solutions □Partially depleted SOI: no help after 250 nm □Fully depleted SOI: very expensive

14 Transistor Level Techniques □Normally CMOS inverter is scaled with 2:1 ratio between p- and n-channel devices □To compensate for electron and hole mobilities □Changing this ratio can increase the tolerance

15 Gate-Level Techniques □Some gates are more vulnerable than others □Radiation hardened designs use NAND gates □When all inputs are low, drive of p-stack is low, high leakage of n-transistors → rise in the output slow → functional failure □Gate vulnerability may change by 5X depending on the state □NAND gate □Extremely vulnerable when inputs are 10 □Not vulnerable when inputs are 00 □How to synthesize to minimize vulnerability

16 Circuit-Level Techniques □Adding resistance introduces additional time constants that filter out the very fast SEU-induced transients □High temperature coefficients of poly-silicon resistors □Difficult to control variation of resistance

17 Outline □Soft Errors Recap □Process Technology and Packaging Solutions □Gate-level and Circuit-level Solutions □Microarchitectural Solutions □Single-core □Multi-threaded □Software Solutions □Multi Bit Upsets (MBUs) □Single Event Latchup

18 Architectural Vulnerability Factor □AVF: Probability that a fault in a particular structure will result in system failure □AVF of branch predictor = 0% □AVF of PC = 100% □ACE-bit: “Architectural bits” that must be correct for “Correct Execution” □Count number of ACE-bits in a structure □Identifying Un-ACE bits □Microarchitectural Un-ACE bits: Cannot influence correct instruction execution □Idle or Invalid state, e.g., inputs to un-chosen paths of mux □Mis-speculated state, e.g., wrong path instruction □Predictor structures, e.g., branch predictor □Ex-ACE state, e.g., registers □Architectural Un-ACE bits: Affect correct path execution, but do not change the output □NOP-instructions □Prefetch instructions □Predicated false instructions □Dynamically dead instructions, FDD, TDD □Computing AVF from a Performance Model □Gather the number of ACE-bits in each cycle
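The last bullet can be sketched as follows. The slide does not spell out the formula, so this assumes the commonly used definition: AVF is the average number of ACE bits resident in a structure per cycle, divided by the total number of bits in the structure. The traces and the 64-bit structure size are invented for illustration.

```python
def avf(ace_bits_per_cycle, structure_bits: int) -> float:
    """AVF = (sum of ACE bits over all cycles) / (structure bits * cycles)."""
    cycles = len(ace_bits_per_cycle)
    if cycles == 0 or structure_bits <= 0:
        raise ValueError("need at least one cycle and a non-empty structure")
    return sum(ace_bits_per_cycle) / (structure_bits * cycles)


# Hypothetical per-cycle ACE-bit counts, as a performance model might log them.
queue_trace = [64, 32, 0, 32]      # a 64-bit structure that is partly idle
predictor_trace = [0, 0, 0, 0]     # predictor state is never ACE
pc_trace = [64, 64, 64, 64]        # every PC bit is ACE in every cycle

print(avf(queue_trace, 64))      # 0.5
print(avf(predictor_trace, 64))  # 0.0
print(avf(pc_trace, 64))         # 1.0
```

The two extreme traces reproduce the slide's endpoints: 0% for the branch predictor and 100% for the PC.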

19 Vulnerability Contributions □DCache - largest contributor to vulnerability □Data + tags □ICache: Close second □Instructions only □Tags are (almost) not vulnerable □Register File, Pipeline □Rate of errors may be higher in Pipeline and RF □Compute Cache and Register File Vulnerability

20 Vulnerability Variations □System vulnerability changes with time □How can you use this information?

21 D-Cache: Flushing □4x reduction in vulnerability (Copyright 2005, M. Tahoori)

22 D-Cache: Write Policy □10x reduction in vulnerability

23 D-Cache: Refresh □3x reduction in vulnerability using write-thru (30x total)

24 DIVA Microarchitecture □[Figure: pipeline with BPred, I-$, Dec/Ren, IQ, ALU, D-$, Rename Regs, Arch Regs] □Example: LR3 + LR7 → LR15, with operand values 4 and 8 and result 12 □Storage Check: read LR3 and LR7 from Arch Regs and confirm they equal 4 and 8 □ALU Check: add 4 + 8 and confirm it equals 12 □If both checks succeed, write 12 into LR15
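The storage check and ALU check for the LR3 + LR7 → LR15 example can be sketched in software. This is a toy model of the checker's commit-time decision, not the DIVA hardware: the `CommittedAdd` record and `check_and_commit` function are hypothetical names, and only an add instruction is modeled.

```python
from dataclasses import dataclass


@dataclass
class CommittedAdd:
    """An add as reported by the primary core at commit (hypothetical record)."""
    src1: str
    src2: str
    dst: str
    val1: int    # operand values the primary core claims to have read
    val2: int
    result: int  # result the primary core claims to have computed


def check_and_commit(instr: CommittedAdd, arch_regs: dict) -> bool:
    """Run the storage and ALU checks; update architected state only if both pass."""
    storage_ok = (arch_regs[instr.src1] == instr.val1
                  and arch_regs[instr.src2] == instr.val2)
    alu_ok = instr.val1 + instr.val2 == instr.result
    if storage_ok and alu_ok:
        arch_regs[instr.dst] = instr.result
        return True
    return False  # error detected; recovery would restart the primary core


arch_regs = {"LR3": 4, "LR7": 8, "LR15": 0}
good = CommittedAdd("LR3", "LR7", "LR15", 4, 8, 12)
bad = CommittedAdd("LR3", "LR7", "LR15", 4, 8, 13)  # ALU result corrupted

print(check_and_commit(bad, arch_regs), arch_regs["LR15"])
print(check_and_commit(good, arch_regs), arch_regs["LR15"])
```

The corrupted result is rejected and LR15 stays untouched; the correct one commits 12 into LR15.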

25 Microarchitecture Details □Instructions are fed to the checker in order during commit □The logic and storage checks detect errors in ALUs and datapath □The checker core is a simple in-order pipeline: easy to design and verify □An error in an earlier stage (LR3 instead of LR2) can be detected by also adding a rename/decode stage to the checker □The in-order checker core has no stalls (needs a bypass for the register file): no data dependences, cache misses, or branch mispredicts □Contention for the register file and data cache can degrade the primary thread

26 Recovery □The architected register file and data cache are ECC protected; when an error is detected, it is assumed that the checker and architected state are correct □The primary core is restarted from the faulting instruction □A fault in the primary core may result in deadlock, e.g., the instruction that produces R5 is waiting for R5 to be produced (instead of R4) □A timeout in the checker signals an error

27 PPC (Partially Protected Caches) □2 Caches at the same level of memory hierarchy □Main Cache, and the protected mini-cache □Mini-cache □low power, low latency □Timing slack to harden it □Compiler maps data to the two caches □Map Failure-Critical (FC) data to the protected mini-cache □Map Not Failure-Critical (FNC) data to the unprotected main cache □Intuition is to provide protection to only the FC data □In multimedia applications, the multimedia data is NOT failure critical □An error → Loss in Quality of Service □How to use PPCs for general applications? □[Figure: Processor Pipeline, Unprotected Main Cache, Protected Mini Cache, Memory Controller, Memory, Page Mapping; labels HPC, PPC, FNC, FC]

28 Razor □Originally proposed to tolerate process variations □Shadow latch clocked with a delayed clock □If difference in values latched, raise error □How to use it to detect soft errors?
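One possible answer to the slide's question, sketched as a toy timing model: a soft-error pulse that is present when the main latch samples but gone when the delayed shadow latch samples makes the two values differ. This is an assumed illustration rather than the Razor circuit itself; the time units and function names are invented.

```python
def signal(t: float, correct: int, pulse_start: float, pulse_end: float) -> int:
    """Value on the wire at time t; a transient inverts it during the pulse."""
    return correct ^ 1 if pulse_start <= t < pulse_end else correct


def razor_sample(correct: int, clock_edge: float, delay: float,
                 pulse_start: float, pulse_end: float):
    """Sample the main latch at the clock edge and the shadow latch after a delay."""
    main = signal(clock_edge, correct, pulse_start, pulse_end)
    shadow = signal(clock_edge + delay, correct, pulse_start, pulse_end)
    return main, shadow, main != shadow  # mismatch raises the error signal


# A short pulse that overlaps only the main latch's sampling point is flagged.
print(razor_sample(correct=1, clock_edge=10.0, delay=2.0,
                   pulse_start=9.5, pulse_end=10.5))

# A pulse that spans both sampling points is latched twice and goes unnoticed.
print(razor_sample(correct=1, clock_edge=10.0, delay=2.0,
                   pulse_start=9.5, pulse_end=12.5))
```

In this model, detection depends on the shadow-clock delay being longer than the transient pulse.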

