Published by Tiffany Miles. Modified over 9 years ago.
2
CML
CSE 591: Advances in Reliable Computing
Aviral Shrivastava
3
CML Web page: aviral.lab.asu.edu
Saving Galileo
- 1978: Galileo commissioned for Jupiter exploration
- 1980: Design and architecture decided
  - Use of AT 2901 for attitude control
- 1982: Voyager reaches Jupiter
  - Intermittent resets
  - Sulfur ions from Jupiter's volcanic moon were being whipped up to high energy by the Jovian gravity
- After extensive testing of Galileo, the chief engineer decided it was "not worth flying if soft error problem not solved"
- Overheads: 5 years, 5 million dollars
  - Sandia National Laboratories was subcontracted to custom-make a radiation-hardened AT 2901
4
Radiation Induced Soft Errors
- Time constants of the induced current pulse: 1.64 × 10^-10 s and 5.10 × 10^-11 s
- Typically, the induced current has a rapid rise time but a more gradual fall time
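A pulse with a fast rise and a slower fall is often approximated by a double exponential. The sketch below assumes that form, and assumes the two values on the slide are its fall (1.64 × 10^-10 s) and rise (5.10 × 10^-11 s) time constants; the slide's own symbols did not survive extraction, so both the model and this assignment are assumptions. The function names and the 10 fC charge in the usage note are illustrative.

```python
import math

# Values taken from the slide; their roles (fall vs. rise) are an assumption.
TAU_FALL = 1.64e-10  # seconds, slower constant: governs the gradual fall
TAU_RISE = 5.10e-11  # seconds, faster constant: governs the rapid rise


def induced_current(t, charge, tau_fall=TAU_FALL, tau_rise=TAU_RISE):
    """Double-exponential current (amperes) at time t for a total charge in coulombs."""
    if t < 0:
        return 0.0
    scale = charge / (tau_fall - tau_rise)
    return scale * (math.exp(-t / tau_fall) - math.exp(-t / tau_rise))


def peak_time(tau_fall=TAU_FALL, tau_rise=TAU_RISE):
    """Time at which the double-exponential pulse reaches its maximum."""
    return (tau_fall * tau_rise / (tau_fall - tau_rise)) * math.log(tau_fall / tau_rise)
```

For example, `induced_current(peak_time(), 1e-14)` gives the peak current of a pulse that deposits a hypothetical 10 fC. Integrating the pulse over time recovers the deposited charge, which is what makes this form convenient for comparing a strike against a critical charge.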
5
It started with nuclear tests…
- 1954-57: Nuclear tests
  - Electronic anomalies in monitoring equipment
  - Could not be traced to any hardware fault
  - Equipment worked properly after restart
- 1962: Wallmark and Marcus (RCA Labs, Princeton)
  - "Minimum Size and Maximum Packing Density of Non-Redundant Semiconductor Devices," March 1962
  - Predicted that cosmic rays would start affecting microelectronics
- 1962: Telstar, the first communication satellite
- July 9, 1962: Starfish Prime
  - The United States tested a high-altitude nuclear device (called Starfish Prime), which super-energized the Earth's Van Allen belt, where Telstar orbited
  - 100X increase in radiation
  - Rendered the satellite inoperable; it worked after reboot
6
Radioactive Contamination
- 1978: Intel could not deliver chips to AT&T to upgrade its switching system from mechanical relays to ICs
  - May and Woods traced the problem to packaging
  - Packaging modules were contaminated with uranium from an old uranium mine upstream
  - They also proposed the Q_critical model of soft errors: the charge accumulated from a particle strike must exceed Q_critical to cause a fault
- 1986-87: IBM faced problems of radioactive contamination
  - Traced the problem to a distant chemical plant that used a radioactive contaminant to clean bottles used to store an acid required in the chip manufacturing process
7
History of Radiation-induced SERs
- 1979: Ziegler and Lanford
  - Presented solid evidence that the sensitivity of electronics to radiation-induced soft errors could become a nightmare for future technologies
  - Predicted that soft errors due to cosmic radiation would increase with altitude
- 1995: Baumann et al.
  - Soft errors caused by boron-10 isotopes activated by low-energy atmospheric neutrons
- 1996: Normand
  - Documented strikes in large servers, found in error logs
  - Discovered that memory error rates are very significantly correlated with the altitude of the computers; attributed them to soft errors (Z&L)
  - High in servers in Los Alamos, and in fighter planes
  - "Single Event Upset at Ground Level," IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996
8
Here comes the Sun…
- 11-year solar cycle of sunspots
- Major solar storms this year and next
- 10^9 kg/s of material lost by the Sun as ejected solar wind
  - Protons (~70%), electrons, ionized helium, less than 0.5% minor ions
  - 2 × 10^10 protons/cm^2
- Loss of satellites
9
Fault, Error and Failure
- FAULT: a physical defect that occurs within HW or SW components (HW defect, SW bug)
  - Physical universe: the physical entities making up a system
- ERROR: a deviation from accuracy or correctness; the manifestation of a fault
  - Informational universe: units of information (e.g., data words)
- FAILURE: nonperformance of some action that is due or expected; a malfunction
  - External universe: the user of a system ultimately sees the effects
- Fault → error: activation (fault latency)
- Error → failure: propagation (error latency)
[Geffroy, 02] Jean-Claude Geffroy and Gilles Motet, "Design of Dependable Computing Systems", Kluwer Academic Publishers, 2002, ISBN 1-4020-0437-0
10
Electrical Masking
- Pulse attenuated by electrical resistance in the circuit
- Pulse still strong enough to be latched at output
11
Single Event Latchup (SEL)
- Parasitic circuit elements form a silicon controlled rectifier (SCR)
- Potentially destructive: the device current may destroy the device if it is not current-limited and removed "in time"
- Removal of power to the device is required in all non-catastrophic SEL conditions in order to recover device operation
- SEL probability increases with temperature!
12
Logical Masking
- Value unchanged at the gate
13
Logical Masking
- Error propagated to the output
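The two logical-masking cases can be sketched with a single gate. The example below is a minimal illustration, assuming a 2-input NAND and a fault that flips exactly one input; the helper names are hypothetical and not from the slides.

```python
def nand(a, b):
    """2-input NAND on 0/1 values."""
    return 1 - (a & b)


def is_logically_masked(gate, inputs, flipped_index):
    """Return True if flipping one input leaves the gate output unchanged."""
    faulty = list(inputs)
    faulty[flipped_index] ^= 1  # inject the transient on one input
    return gate(*inputs) == gate(*faulty)


# Other input is 0: the NAND output is forced to 1, so the flip is masked.
masked = is_logically_masked(nand, (0, 1), flipped_index=1)

# Other input is 1: the flipped input controls the output, so the error propagates.
propagated = not is_logically_masked(nand, (1, 1), flipped_index=0)
```

Whether a transient is masked therefore depends on the values on the other inputs at the time of the strike, not only on the gate type.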
14
Temporal Masking
- Transient fault → soft error
- A transient pulse at the latching window:
  1) Before t_setup: masked (not latched)
  2) After t_setup, before t_hold: race condition
  3) At the latching window: not masked (latched)
[Firouzi ROCS 2010]
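The three cases can be expressed as a small classifier. This is a sketch under stated assumptions: the latching window is taken to be [t_clk - t_setup, t_clk + t_hold], a pulse that spans the whole window is treated as latched, and a pulse that only partly overlaps it is treated as a race. The function name and the numbers in the usage note are illustrative.

```python
def classify_pulse(pulse_start, pulse_end, t_clk, t_setup, t_hold):
    """Classify a transient pulse against the latching window of one clock edge."""
    window_start = t_clk - t_setup
    window_end = t_clk + t_hold

    # Pulse is entirely outside the window: the latch never sees it.
    if pulse_end <= window_start or pulse_start >= window_end:
        return "masked"

    # Pulse holds the wrong value for the whole window: it is latched.
    if pulse_start <= window_start and pulse_end >= window_end:
        return "latched"

    # Pulse changes value inside the window: outcome is not deterministic.
    return "race"
```

With a clock edge at 1000 ps, a 30 ps setup time and a 20 ps hold time, a pulse from 900 ps to 950 ps is masked, a pulse from 960 ps to 1030 ps is latched, and a pulse from 990 ps to 1010 ps is a race.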
15
Soft Error Trends
- DRAM: system error rate of DRAMs is fairly constant
- SRAM: increasing exponentially
- Logic: increasing exponentially
16
Increasing Soft Error Rates
- Reducing feature sizes and lower supply voltage
  - Decreasing node capacitances and noise margins
  - Q_critical reducing
  - Exponentially more low-energy particles than high-energy ones
- More transistors per chip
  - More functionality is moving on-chip
  - Higher probability of error due to more faults
- Increasing clock rates
  - Larger fraction of time spent between setup and hold times, so transients are more likely to be latched
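The first two trends can be illustrated with a rough model. The sketch below assumes the common first-order approximation Q_critical ≈ C_node × V_dd and an empirical exponential dependence of the per-node upset rate on Q_critical (in the style of the Hazucha-Svensson model); neither formula appears in the slides. The charge-collection parameter `q_s` and all numeric values in the usage note are hypothetical.

```python
import math


def critical_charge(node_capacitance_f, supply_voltage_v):
    """First-order estimate of Q_critical in coulombs (assumed: Q = C * V)."""
    return node_capacitance_f * supply_voltage_v


def relative_ser(q_crit_c, q_s_c, num_nodes):
    """Relative chip-level soft error rate (arbitrary units).

    Assumes the per-node rate falls off as exp(-Q_crit / Q_s) and that
    nodes fail independently, so the chip rate scales with num_nodes.
    """
    return num_nodes * math.exp(-q_crit_c / q_s_c)


# Hypothetical older node: larger capacitance, higher voltage, fewer nodes.
old = relative_ser(critical_charge(2e-15, 1.8), q_s_c=1e-15, num_nodes=1_000_000)

# Hypothetical newer node: smaller capacitance, lower voltage, more nodes.
new = relative_ser(critical_charge(1e-15, 1.0), q_s_c=1e-15, num_nodes=2_000_000)
```

In this toy setting `new` is larger than `old` for two reasons that match the slide: each node needs less charge to flip, and there are more nodes to hit.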
17
One Failure per Day per Chip
- Soft error rates could increase from one error per year to one error per day in a decade! [Shivakumar et al. 2002]
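To get a feel for what that projection implies, the arithmetic below converts "one per year to one per day in a decade" into a compound annual growth factor. It assumes a 365-day year and steady exponential growth over the ten years; both are simplifications for illustration, and the function name is hypothetical.

```python
def annual_growth_factor(start_rate, end_rate, years):
    """Compound per-year multiplier that takes start_rate to end_rate in `years`."""
    return (end_rate / start_rate) ** (1.0 / years)


start_errors_per_year = 1.0    # one error per year
end_errors_per_year = 365.0    # one error per day, assuming a 365-day year

growth = annual_growth_factor(start_errors_per_year, end_errors_per_year, years=10)
```

Under these assumptions the error rate would have to grow by roughly 80% every year, that is, nearly double annually, to reach one error per day within the decade.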
18
Processing and Packaging Solutions
- Reduce the number of particles that strike, and so reduce upsets
  - Use of highly purified fabrication materials
  - Remove traces of boron and heavy metals
- Surround by a metallic frame
  - Reduces low-energy particles
  - But neutrons can pass through > 10 ft of concrete
- Process technology solutions
  - Partially depleted SOI: no help after 250 nm
  - Fully depleted SOI: very expensive
19
Transistor Level Techniques
- Normally a CMOS inverter is scaled with a 2:1 ratio between p- and n-channel devices
  - To compensate for the difference between electron and hole mobilities
- Changing this ratio can increase the tolerance
20
Gate-Level Techniques
- Some gates are more vulnerable than others
- Radiation hardened designs use NAND gates
  - When all inputs are low, drive of p-stack is low, high leakage of n-transistors → rise in the output slow → functional failure
- A gate's vulnerability may change by 5X depending on the state
  - NAND gate: extremely vulnerable when inputs are 10
  - NAND gate: not vulnerable when inputs are 00
- How to synthesize to minimize vulnerability?
21
Circuit-Level Techniques
- Adding resistance introduces additional time constants that filter out the very fast SEU-induced transients
- High temperature coefficients of poly-silicon resistors
- Difficult to control variation of resistance
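The filtering effect can be approximated with a first-order RC low-pass. The sketch below is a simplification: it assumes the transient is a rectangular voltage pulse and that the added resistance and the node capacitance form a single RC stage; real hardened cells are more involved. The function name and component values are hypothetical.

```python
import math


def filtered_peak(pulse_amplitude_v, pulse_width_s, resistance_ohm, capacitance_f):
    """Peak voltage reached at the output of an RC low-pass for a rectangular pulse."""
    tau = resistance_ohm * capacitance_f  # time constant added by the resistor
    return pulse_amplitude_v * (1.0 - math.exp(-pulse_width_s / tau))


# 100 ps, 1 V transient on a 1 fF node (illustrative numbers).
fast_path = filtered_peak(1.0, 100e-12, resistance_ohm=1e3, capacitance_f=1e-15)  # tau = 1 ps
slow_path = filtered_peak(1.0, 100e-12, resistance_ohm=1e6, capacitance_f=1e-15)  # tau = 1 ns
```

With the small resistance the pulse passes almost unattenuated, while with the large resistance it reaches under a tenth of its amplitude. The same time constant also slows legitimate writes, and its sensitivity to temperature and resistor variation is the difficulty the slide points out.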
22
D-Cache: Flushing
- 4x reduction in vulnerability
Copyright 2005, M. Tahoori
23
D-Cache: Write Policy
- 10x reduction in vulnerability
Copyright 2005, M. Tahoori
24
D-Cache: Refresh
- 3x reduction in vulnerability using write-through (30x total)
Copyright 2005, M. Tahoori
25
Replica Cache
26
PPC (Partially Protected Caches)
- 2 caches at the same level of the memory hierarchy
  - Main cache, and the protected mini-cache
  - Mini-cache: low power, low latency
  - Timing slack to harden it
- Compiler maps data to the two caches
  - Map failure-critical (FC) data to the protected mini-cache
  - Map not-failure-critical (FNC) data to the unprotected main cache
- Intuition is to provide protection to only the FC data
  - In multimedia applications, the multimedia data is NOT failure critical
  - An error → loss in quality of service
- How to use PPCs for general applications?
[Figure: PPC architecture. Labels: HPC processor, processor pipeline, unprotected main cache, protected mini-cache, page mapping, memory controller, memory, FNC, FC]
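The mapping step can be sketched as a page-level decision. The example below is only an illustration of the idea, assuming that each page already carries a failure-critical flag supplied by the compiler; how that flag is derived is not covered here. The `Page` and `Cache` types and the page names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Cache(Enum):
    MAIN = "unprotected main cache"
    MINI = "protected mini-cache"


@dataclass(frozen=True)
class Page:
    name: str
    failure_critical: bool  # assumed to be provided by compiler analysis


def map_pages(pages):
    """Send failure-critical pages to the mini-cache and the rest to the main cache."""
    return {
        page.name: Cache.MINI if page.failure_critical else Cache.MAIN
        for page in pages
    }


# Hypothetical multimedia workload: control state is FC, pixel data is not.
mapping = map_pages([
    Page("stack", failure_critical=True),
    Page("loop_counters", failure_critical=True),
    Page("video_frame_buffer", failure_critical=False),
])
```

Here an error in `video_frame_buffer` would at worst degrade a frame, while the pages that could crash the program are kept in the protected cache.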
27
Cache Scrubbing
- Periodically read memory and correct all single-bit errors
- Prevents single-bit errors from accumulating over time into double-bit errors
- Standard technique in main memories (DRAMs)
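A scrubbing pass can be sketched with a small single-error-correcting code. The example below uses a Hamming(7,4) code purely for illustration; the slides do not say which code is used, and real memories typically protect wider words. The function names are hypothetical.

```python
def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def syndrome(word):
    """Return 0 if the word is a valid codeword, else the 1-based position of a single flipped bit."""
    s1 = word[0] ^ word[2] ^ word[4] ^ word[6]
    s2 = word[1] ^ word[2] ^ word[5] ^ word[6]
    s4 = word[3] ^ word[4] ^ word[5] ^ word[6]
    return s1 + 2 * s2 + 4 * s4


def decode(word):
    """Extract the 4 data bits from a codeword."""
    return [word[2], word[4], word[5], word[6]]


def scrub(memory):
    """Read every word, fix single-bit errors in place, and return how many were fixed."""
    corrected = 0
    for word in memory:
        position = syndrome(word)
        if position:
            word[position - 1] ^= 1
            corrected += 1
    return corrected
```

If `scrub` runs between two strikes on the same word, each flip is corrected on its own. If both flips land before the next pass, this code miscorrects the word, which is why the scrubbing interval matters.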
28
Pipeline Protection: Razor
- Originally proposed to tolerate process variations
- Shadow latch clocked with a delayed clock
- If the two latched values differ, raise an error
- How to use it to detect soft errors?
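One way to think about the closing question is to model the sampled signal as a function of time and compare the two samples. The sketch below is a behavioral illustration only, assuming an ideal main flip-flop that samples at the clock edge and a shadow latch that samples a fixed delay later; it is not a circuit-level description of Razor, and all names and timings are hypothetical.

```python
def with_transient(correct_value, pulse_start, pulse_end):
    """Build a signal that carries correct_value except during a transient pulse."""
    def signal(t):
        if pulse_start <= t < pulse_end:
            return correct_value ^ 1  # value is inverted while the pulse lasts
        return correct_value
    return signal


def razor_sample(signal, t_clk, shadow_delay):
    """Sample at the clock edge and at the delayed edge; flag an error if they differ."""
    main_value = signal(t_clk)
    shadow_value = signal(t_clk + shadow_delay)
    return main_value, shadow_value, main_value != shadow_value


# Transient present at the clock edge but gone before the shadow latch samples.
short_pulse = with_transient(1, pulse_start=990, pulse_end=1010)
detected = razor_sample(short_pulse, t_clk=1000, shadow_delay=50)[2]

# Transient that spans both sampling points goes unnoticed in this model.
long_pulse = with_transient(1, pulse_start=990, pulse_end=1100)
missed = not razor_sample(long_pulse, t_clk=1000, shadow_delay=50)[2]
```

In this model a transient is caught only if it ends between the two sampling points, so the shadow delay would need to exceed the expected pulse width.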