CML CSE 591: Advances in Reliable Computing Aviral Shrivastava.


2 CML CSE 591: Advances in Reliable Computing Aviral Shrivastava

3 Saving Galileo
CML Web page: aviral.lab.asu.edu
• 1978 – Galileo commissioned for Jupiter exploration
• 1980 – Design and architecture decided
  - Use of AT 2901 for attitude control
• 1982 – Voyager reaches Jupiter
  - Intermittent resets
  - Sulfur ions from Jupiter’s volcanic moon were being whipped up to high energy by the Jovian gravity.
• After extensive testing of Galileo, chief engineer decided “not worth flying if soft error problem not solved”
• Overheads
  - 5 years, 5 million dollars
  - Sandia National Laboratories was subcontracted to custom-make radiation-hardened AT 2901

4 Radiation Induced Soft Errors
• Two time constants are given for the induced current pulse: 1.64 × 10^-10 sec and 5.10 × 10^-11 sec
• Typically, the induced current has a rapid rise time but a more gradual fall time
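The pulse shape described on this slide can be sketched with a double-exponential current model. This is an assumption: the slide does not name the model or say which constant is which, so the sketch treats the larger value as the fall time constant and the smaller one as the rise time constant. The function names and the deposited charge are illustrative only.

```python
import math

# Assumption: the slide's two values are the fall and rise time constants
# of a double-exponential pulse; the slide itself does not label them.
TAU_FALL = 1.64e-10  # seconds
TAU_RISE = 5.10e-11  # seconds


def induced_current(t, charge, tau_fall=TAU_FALL, tau_rise=TAU_RISE):
    """Current in amperes at time t (seconds) for a strike depositing `charge` coulombs."""
    if t < 0:
        return 0.0
    scale = charge / (tau_fall - tau_rise)
    return scale * (math.exp(-t / tau_fall) - math.exp(-t / tau_rise))


def peak_time(tau_fall=TAU_FALL, tau_rise=TAU_RISE):
    """Time at which the pulse peaks, found by setting dI/dt to zero."""
    return (tau_fall * tau_rise / (tau_fall - tau_rise)) * math.log(tau_fall / tau_rise)


if __name__ == "__main__":
    q = 10e-15  # hypothetical 10 fC strike
    tp = peak_time()
    print(f"peak at {tp:.3e} s, current {induced_current(tp, q):.3e} A")
```

With these constants the current climbs to its peak quickly and then decays on the slower time constant, and the area under the curve equals the deposited charge.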

5 It started with nuclear tests…
• 1954-57: Nuclear tests
  - Electronic anomalies in monitoring equipment
  - Could not be traced to any hardware fault
  - Equipment worked properly after restart
• 1962: Wallmark and Marcus (RCA Labs, Princeton)
  - "Minimum Size and Maximum Packing Density of Non-Redundant Semiconductor Devices," March 1962
  - Predicted that cosmic rays would start affecting microelectronics
• 1962: Telstar, the first communication satellite
• July 9, 1962: Starfish Prime
  - The United States tested a high-altitude nuclear device (called Starfish Prime) which super-energized the Earth's Van Allen Belt where Telstar took orbit
  - 100X increase in radiation
  - Rendered the satellite inoperable
  - Worked after reboot

6 Radioactive Contamination
• 1978: Intel could not deliver chips to AT&T to upgrade its switching system from mechanical relays to ICs
  - May and Woods traced the problem to packaging
  - Packaging modules were contaminated with uranium from an old uranium mine upstream.
  - Also proposed the Q_critical model of soft errors
  - Q_critical must be overcome by the accumulated charge generated by a particle strike to cause a fault.
• 1986-87: IBM faced problems of radioactive contamination
  - Traced the problem to a distant chemical plant that used a radioactive contaminant to clean bottles that were used to store an acid required in the chip manufacturing process.
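The Q_critical rule above reduces to a threshold comparison. The sketch below is a minimal illustration, not the May and Woods formulation: it assumes the common first-order estimate Q_critical ≈ C_node × V_dd, and the capacitance, voltage, and charge values are hypothetical.

```python
def critical_charge(node_capacitance_f, supply_voltage_v):
    """First-order estimate of Q_critical in coulombs (assumed: C_node * V_dd)."""
    return node_capacitance_f * supply_voltage_v


def causes_fault(collected_charge_c, q_critical_c):
    """A strike causes a fault only if its collected charge exceeds Q_critical."""
    return collected_charge_c > q_critical_c


if __name__ == "__main__":
    q_crit = critical_charge(2e-15, 1.2)  # hypothetical 2 fF node at 1.2 V
    for strike in (1e-15, 5e-15):         # hypothetical collected charges
        print(f"{strike:.1e} C -> fault: {causes_fault(strike, q_crit)}")
```

Under this estimate, lowering either the node capacitance or the supply voltage lowers Q_critical, so smaller strikes become sufficient to flip the node.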

7 History of Radiation-induced SERs
• 1979: Ziegler and Lanford
  - Presented solid evidence that the sensitivity of electronics to radiation-induced soft errors could become a nightmare for future technologies.
  - Predicted that soft errors due to cosmic radiation would increase with altitude
• 1995: Baumann et al.
  - Soft errors caused by Boron-10 isotopes activated by low-energy atmospheric neutrons.
• 1996: Normand
  - Documented strikes in large servers found in error logs
  - Discovered that memory error rates were very significantly correlated with the altitude of the computers, and attributed them to soft errors (Z&L)
  - High in servers in Los Alamos, and in fighter planes.
  - “Single Event Upset at Ground Level,” IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996.

8 Here comes the Sun…
• 11-year solar cycle of sunspots
• Major solar storms this year and next
• 10^9 kg/s of material lost by the Sun as ejected solar wind.
• Protons (~70%), electrons, ionized helium, less than 0.5% minor ions.
• 2 × 10^10 protons/cm^2
• Loss of satellites

9 Fault, Error and Failure
• FAULT: a physical defect that occurs within hw or sw components (HW defect, SW bug)
  - Physical Universe: physical entities making up a system
• ERROR: a deviation from accuracy or correctness; manifestation of a fault
  - Informational Universe: units of information (e.g., data words)
• FAILURE: nonperformance of some action that is due or expected; malfunction
  - External Universe: the user of a system ultimately sees the effects
• Fault to error: activation (fault latency)
• Error to failure: propagation (error latency)
[Geffroy, 02] Jean-Claude Geffroy and Gilles Motet, “Design of Dependable Computing Systems”, Kluwer Academic Publishers, 2002, ISBN 1-4020-0437-0

10 Electrical Masking
• Pulse attenuated by electrical resistance in the circuit
• Pulse still strong enough to be latched at output

11 Single Event Latchup
• SEL: Single Event Latchup
• Parasitic circuit elements forming a silicon controlled rectifier (SCR)
• Potentially destructive
  - The device current may destroy the device if not current limited and removed "in time."
• Removal of power to the device is required in all non-catastrophic SEL conditions in order to recover device operations.
• SEL probability increases with temperature!

12 Logical Masking
• Value unchanged at the gate

13 Logical Masking
• Error propagated to the output
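The two logical-masking cases can be reproduced with a single gate. The slides' circuit diagrams did not survive extraction, so the NAND gate and the input values below are assumptions chosen only to show one masked and one propagated fault.

```python
def nand(a, b):
    """Two-input NAND on 0/1 values."""
    return 1 - (a & b)


def is_logically_masked(gate, inputs, faulty_index):
    """True if flipping one input leaves the gate output unchanged."""
    good_output = gate(*inputs)
    flipped = list(inputs)
    flipped[faulty_index] ^= 1  # inject a single bit flip
    return gate(*flipped) == good_output


if __name__ == "__main__":
    # Other input is 0: the NAND output is forced to 1, so the flip is masked.
    print(is_logically_masked(nand, (1, 0), 0))
    # Other input is 1: the output follows the flipped input, so the error propagates.
    print(is_logically_masked(nand, (1, 1), 0))
```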

14 Temporal Masking
• Diagram labels: Transient Fault, Soft Error
• A transient pulse at the latching window:
  1) Before t_setup → masked (not latched)
  2) After t_setup, before t_hold → race condition
  3) At the latching window → not masked (latched)
[Firouzi ROCS 2010]
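The three cases can be approximated by comparing the pulse interval with the latching window. This is a simplified sketch, not the model from the cited paper: it assumes the window runs from `clock_edge - t_setup` to `clock_edge + t_hold`, treats a pulse that covers the whole window as latched, and labels any partial overlap as a race. The function name and the time values are hypothetical.

```python
def classify_pulse(pulse_start, pulse_end, clock_edge, t_setup, t_hold):
    """Classify a transient pulse against the latching window (all times in the same unit)."""
    window_start = clock_edge - t_setup
    window_end = clock_edge + t_hold
    # No overlap with the window: the pulse is gone before (or arrives after) sampling.
    if pulse_end <= window_start or pulse_start >= window_end:
        return "masked"
    # Pulse spans the whole window: the wrong value is stable while being sampled.
    if pulse_start <= window_start and pulse_end >= window_end:
        return "latched"
    # Pulse changes value inside the window: outcome is not determined.
    return "race"


if __name__ == "__main__":
    # Hypothetical numbers in picoseconds: clock edge at 1000, setup 50, hold 30.
    for pulse in ((800, 900), (940, 1040), (1000, 1100)):
        print(pulse, classify_pulse(*pulse, clock_edge=1000, t_setup=50, t_hold=30))
```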

15 Soft Error Trends
• DRAM
  - System error rate of DRAMs is fairly constant
• SRAM
  - Increasing exponentially
• Logic
  - Increasing exponentially

16 Increasing Soft Error Rates
• Reducing feature sizes and lower supply voltage
  - Decreasing node capacitances and noise margins
  - Q_critical reducing
  - Exponentially more low-energy particles than high-energy ones
• More transistors per chip
  - More functionality is moving on-chip
  - Higher probability of error due to more faults.
• Increasing clock rates
  - Larger fraction of time falls between setup and hold times, so transients are more likely to be latched

17 One Failure per Day per Chip
• Soft error rates could increase from one error per year to one error per day in a decade! [Shivakumar et al. 2002]
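Going from one error per year to one error per day is a 365x increase. As a rough back-of-the-envelope sketch, assuming the growth is steady and exponential over the ten years (the slide does not state the growth shape), that works out to roughly 1.8x per year. The helper names are illustrative.

```python
import math

DAYS_PER_YEAR = 365  # one error/year -> one error/day is a 365x increase


def annual_growth_factor(total_increase, years):
    """Per-year multiplier that compounds to `total_increase` over `years`."""
    return total_increase ** (1.0 / years)


def doubling_time_years(total_increase, years):
    """Years needed for the error rate to double under the same steady growth."""
    return years * math.log(2) / math.log(total_increase)


if __name__ == "__main__":
    print(f"growth per year: {annual_growth_factor(DAYS_PER_YEAR, 10):.2f}x")
    print(f"doubling time:   {doubling_time_years(DAYS_PER_YEAR, 10):.2f} years")
```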

18 Processing and Packaging Solutions
• Reduce the number of particles that strike
  - Reduce upsets
• Use of highly purified fabrication materials
  - Remove traces of boron and heavy metals
• Surround by metallic frame
  - Reduce low-energy particles
  - But neutrons can pass through > 10 ft of concrete
• Process Technology Solutions
  - Partially depleted SOI: no help after 250 nm
  - Fully depleted SOI: very expensive

19 Transistor Level Techniques
• Normally a CMOS inverter is scaled with a 2:1 ratio between p- and n-channel devices
  - To compensate for electron and hole mobilities
• Changing this ratio can increase the tolerance

20 Gate-Level Techniques
• Some gates are more vulnerable than others
• Radiation hardened designs use NAND gates
  - When all inputs are low, drive of p-stack is low, high leakage of n-transistors → rise in the output slow → functional failure
• Gate vulnerability may change by 5X depending on the state
• NAND gate
  - Extremely vulnerable when inputs are 10
  - Not vulnerable when inputs are 00
• How to synthesize to minimize vulnerability?

21 Circuit-Level Techniques
• Adding resistance introduces additional time constants that filter out the very fast SEU-induced transients
• High temperature coefficients of poly-silicon resistors
• Difficult to control variation of resistance
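The filtering effect of added resistance can be illustrated with an ideal first-order RC low-pass stage driven by a rectangular pulse. This is a simplifying assumption (real SEU transients are not rectangular and real cells are not a single RC stage), and the component values below are hypothetical.

```python
import math


def filtered_peak(pulse_amplitude_v, pulse_width_s, resistance_ohm, capacitance_f):
    """Peak output of an ideal RC low-pass stage driven by a rectangular pulse.

    The output charges toward the pulse amplitude with time constant R*C and
    reaches its maximum at the end of the pulse.
    """
    tau = resistance_ohm * capacitance_f
    return pulse_amplitude_v * (1.0 - math.exp(-pulse_width_s / tau))


if __name__ == "__main__":
    # Hypothetical 1 V, 100 ps transient on a 10 fF node.
    for resistance in (1e3, 1e5):
        peak = filtered_peak(1.0, 100e-12, resistance, 10e-15)
        print(f"R = {resistance:.0e} ohm -> peak {peak:.3f} V")
```

A larger resistance lengthens the time constant, so a fast transient ends before the node voltage has moved far; the same time constant also slows legitimate writes, which is the cost of this technique.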

22 D-Cache: Flushing
• 4x reduction in vulnerability
Copyright 2005, M. Tahoori

23 D-Cache: Write Policy
• 10x reduction in vulnerability
Copyright 2005, M. Tahoori

24 D-Cache: Refresh
• 3x reduction in vulnerability using write-thru (30x total)
Copyright 2005, M. Tahoori

25 Replica Cache

26 PPC (Partially Protected Caches)
• 2 caches at the same level of memory hierarchy
  - Main cache, and the protected mini-cache
• Mini-cache
  - Low power, low latency
  - Timing slack to harden it
• Compiler maps data to the two caches
  - Map Failure-Critical (FC) data to the protected mini-cache
  - Map Not Failure-Critical data to the unprotected main cache
• Intuition is to provide protection to only the FC data
  - In multimedia applications, the multimedia data is NOT failure critical
  - An error → loss in Quality of Service
• How to use PPCs for general applications?
[Figure labels: Processor Pipeline, Unprotected Main Cache, Protected Mini Cache, Memory Controller, Page Mapping, Memory, FNC, FC, HPC, PPC]

27 Cache Scrubbing
• Periodically read memory and correct all single bit errors
• Disallows accumulation of temporal double bit errors
• Standard technique in main memories (DRAMs)
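A scrubbing pass can be sketched with a small single-error-correcting code. The slide does not say which code is used; this example assumes a Hamming(7,4) code per stored word purely for illustration, and the function names are hypothetical.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # codeword positions holding data bits
PARITY_POSITIONS = (1, 2, 4)    # codeword positions holding parity bits


def syndrome(word):
    """XOR of the positions of all set bits; 0 for a valid codeword."""
    s = 0
    for position in range(1, 8):
        if word[position]:
            s ^= position
    return s


def encode(nibble):
    """Encode a 4-bit value as a Hamming(7,4) codeword (index 0 is unused)."""
    word = [0] * 8
    for i, position in enumerate(DATA_POSITIONS):
        word[position] = (nibble >> i) & 1
    s = syndrome(word)
    for position in PARITY_POSITIONS:
        if s & position:
            word[position] = 1  # choose parity bits so the syndrome becomes 0
    return word


def decode(word):
    """Extract the 4-bit value from a codeword without correcting it."""
    return sum(word[p] << i for i, p in enumerate(DATA_POSITIONS))


def scrub(memory):
    """Read every word, fix any single-bit error in place, return the fix count."""
    corrected = 0
    for word in memory:
        s = syndrome(word)
        if s:
            word[s] ^= 1  # for a single-bit error the syndrome is its position
            corrected += 1
    return corrected


if __name__ == "__main__":
    memory = [encode(0b1011)]
    memory[0][3] ^= 1          # first upset
    scrub(memory)              # corrected before a second upset can accumulate
    memory[0][6] ^= 1          # second upset, later in time
    scrub(memory)
    print(decode(memory[0]) == 0b1011)
```

If both upsets land between two scrub passes, the syndrome points at a third position and the correction makes things worse, which is why the scrub interval matters.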

28 Pipeline Protection: Razor
• Originally proposed to tolerate process variations
• Shadow latch clocked with a delayed clock
• If the two latched values differ, raise an error
• How to use it to detect soft errors?
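The shadow-latch comparison can be modeled behaviorally as two samples of the same signal taken at different times. This is a minimal sketch of the mechanism only, not the Razor circuit itself; the signal model, function names, and timing values are assumptions made for illustration.

```python
def step_signal(arrival_time, old_value, new_value):
    """Return a signal that switches from old_value to new_value at arrival_time."""
    return lambda t: new_value if t >= arrival_time else old_value


def razor_check(signal, clock_edge, shadow_delay):
    """Sample at the clock edge and again after shadow_delay; flag any mismatch."""
    main_value = signal(clock_edge)
    shadow_value = signal(clock_edge + shadow_delay)
    return main_value, shadow_value, main_value != shadow_value


if __name__ == "__main__":
    # Hypothetical times in picoseconds: clock edge at 1000, shadow clock 20 later.
    on_time = step_signal(arrival_time=990, old_value=0, new_value=1)
    late = step_signal(arrival_time=1010, old_value=0, new_value=1)
    print(razor_check(on_time, 1000, 20))
    print(razor_check(late, 1000, 20))
```

The late-arriving value is missed by the main latch but caught by the shadow latch, so the mismatch flag is raised.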

