Spring 2008 CSE 591 Compilers for Embedded Systems Aviral Shrivastava Department of Computer Science and Engineering Arizona State University
Lecture 2: Soft Errors
Beginnings □Nuclear tests □Electronic monitoring equipment failed □Could not identify the reason! □Worked fine after rebooting □No hardware fault, no permanent fault □Wallmark and Marcus □Surmised that cosmic rays can cause failures in electronic systems □"Minimum Size and Maximum Packing Density of Non-Redundant Semiconductor Devices" □1978 – May and Woods of Intel □Reported alpha-particle-induced soft errors in the 2107-series 16-KB DRAMs □1979 – Ziegler and Lanford of IBM □Presented solid evidence that the sensitivity of electronics to radiation-induced soft errors could become a nightmare for future technologies
First Space Casualty □Telstar □First communication satellite □AT&T, Bell Telephone, NASA, British GPO, and French PTT □Launched July 10, 1962 □July 23 – live transatlantic television signal □Supposed to telecast a speech from President John F. Kennedy □Instead telecast major league baseball □Telstar ushered in a new age of the benevolent use of technology □July 9, 1962 □United States tested a high-altitude nuclear device (called Starfish Prime), which super-energized the Earth's Van Allen Belt, where Telstar orbited □100X increase in radiation □Out of service in December, repaired, but unusable after February
Saving Galileo □1978 – Galileo commissioned for Jupiter exploration □1980 – Design and architecture decided □Use of AT 2901 for attitude control □1982 – Voyager reaches Jupiter □Intermittent resets □Sulfur ions from Jupiter’s volcanic moon, Io, were being whipped up to high energy by the Jovian gravity □After extensive testing of Galileo, the chief engineer decided it was “not worth flying if soft error problem not solved” □Overheads □5 years, 5 million dollars □Sandia National Laboratories was subcontracted to custom-make a radiation-hardened 2901
Recent – Hubble Space Telescope □Intermittent resets after 1996 upgrade of software on Hubble Space Telescope □South Atlantic Anomaly
Sun-Earth Interactions □11-year solar cycle of sunspots □10^9 kg/s of material lost by the Sun as ejected solar wind □Protons (~70%), electrons, ionized helium, less than 0.5% minor ions □2×10^10 protons/cm^2 □Loss of satellites
Impact on Earth-bound Electronics (Copyright 2005, M. Tahoori) □Documented strikes in large servers found in error logs □Normand, “Single Event Upset at Ground Level,” IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December □Sun Microsystems, 2000 (R. Baumann, 2002 IRPS Workshop talk) □Cosmic ray strikes on L2 cache with defective error protection □Caused Sun’s flagship servers to suddenly and mysteriously crash! □Companies affected □Baby Bell (Atlanta), America Online, eBay, and dozens of other corporations □Verisign moved to IBM Unix servers (for the most part) □Cisco line cards may reset after single event upset (SEU) failures
Reactions from Companies □Fujitsu SPARC in 130 nm technology □80% of 200k latches protected with parity □Compare with very few latches protected in McKinley □ISSCC, 2003 □IBM declared a 1000-year system MTBF as a product goal □For the Power4 line □Very hard to achieve this goal in a cost-effective way □Bossen, 2002 IRPS Workshop talk
Evolution of a Product Team’s Psyche □Shock □“SER is the crabgrass in the lawn of computer design” □Denial □“We will do the SER work two months before tapeout” □Anger □“Our reliability target is too ambitious” □Acceptance □“You can deny physics only for so long”
Growing Problem □Will become an everyday problem (omnipresent) for every device (ubiquitous) □Soft errors in embedded systems □Not only a space phenomenon anymore!
Phenomenon of Soft Error □Transient faults □Random and spontaneous bit changes in the system □Can be caused by □Circuit noise □Cross-talk □More than 50% due to radiation strikes
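As a minimal sketch of what a "random and spontaneous bit change" looks like to software, the snippet below flips one randomly chosen bit in a stored word and checks whether a single parity bit (the latch protection scheme mentioned on the "Reactions from Companies" slide) notices it. The helper names `flip_random_bit` and `parity`, the 32-bit word width, and the fixed random seed are illustrative assumptions, not part of the lecture.

```python
import random


def flip_random_bit(word, width=32, rng=random):
    """Return `word` with exactly one randomly chosen bit inverted.

    Hypothetical helper: models a single-bit transient fault.
    """
    bit = rng.randrange(width)
    return word ^ (1 << bit)


def parity(word):
    """Return 1 if `word` has an odd number of set bits, else 0."""
    return bin(word).count("1") % 2


original = 0x000000FF
stored_parity = parity(original)      # computed when the word is written

# Assumed fixed seed so the sketch is repeatable.
corrupted = flip_random_bit(original, rng=random.Random(0))

# A single-bit flip always changes the parity, so the mismatch is detected.
detected = parity(corrupted) != stored_parity
print(f"original={original:#010x} corrupted={corrupted:#010x} detected={detected}")
```

Parity only detects an odd number of flipped bits and cannot say which bit changed, so it flags the error without correcting it.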
Causes of Soft Errors □Alpha particles emitted by traces of uranium, thorium, or lead impurities in packaging materials □Alpha particles emitted by decaying radioactive impurities in packaging and interconnect materials (plastic packages are the worst; ceramic, HyperBGA, flip-chip PBGA) □High-energy (> 1 MeV) neutrons from cosmic radiation can induce soft errors in semiconductor devices via secondary ions produced by the neutron reaction with silicon nuclei □Less than 1% of the primary flux reaches ground level □Secondary radiation induced by the interaction of low-energy neutrons and boron □Boron-10 in BPSG (borophosphosilicate glass) □New process technologies use highly refined packaging and no boron □The 2nd effect (high-energy neutrons) is the most important □Shielding is effective only for low-energy neutrons □High-energy neutrons can pass through 6 feet of concrete
LET Spectrum □Linear energy transfer □Measure of energy deposition □Units: MeV per mg/cm^2, MeV/μm, or pC/μm
Metrics □FIT: Failures in Time □Number of failures in 1 billion hours of operation □MTTF: Mean Time To Failure □1000 FITs => MTTF of 114 years □1 GByte of memory at 500 FIT/Mbit can expect an error about every two weeks □ECC reduces the failure rate by 2 orders of magnitude □A hypothetical Terabyte system would experience a soft error every few minutes
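The FIT and MTTF figures on this slide follow from MTTF (hours) = 10^9 / FIT. The sketch below reproduces the arithmetic, assuming binary capacities (1 GByte = 8 × 1024 Mbit), a 365.25-day year, and FIT rates that simply add across bits; the function names are illustrative. Under these assumptions the 1-GByte case comes out at roughly ten days and the Terabyte case at roughly a quarter of an hour, the same order of magnitude as the slide's rounded "two weeks" and "few minutes".

```python
HOURS_PER_YEAR = 24 * 365.25   # assumed year length: 8766 hours
FIT_HOURS = 1e9                # FIT counts failures per 10^9 device-hours


def fit_to_mttf_hours(fit):
    """Convert a FIT rate to mean time to failure in hours."""
    return FIT_HOURS / fit


def memory_fit(capacity_mbit, fit_per_mbit):
    """Total FIT of a memory, assuming per-Mbit rates simply add up."""
    return capacity_mbit * fit_per_mbit


# 1000 FIT expressed as MTTF in years.
mttf_years = fit_to_mttf_hours(1000) / HOURS_PER_YEAR

# 1 GByte at 500 FIT/Mbit (binary capacity assumed).
gbyte_mbit = 8 * 1024
gbyte_fit = memory_fit(gbyte_mbit, 500)
gbyte_mttf_days = fit_to_mttf_hours(gbyte_fit) / 24

# Same memory with ECC, taking the slide's "2 orders of magnitude" as 100x.
ecc_mttf_years = fit_to_mttf_hours(gbyte_fit / 100) / HOURS_PER_YEAR

# Hypothetical 1 TByte system at the same per-Mbit rate.
tbyte_fit = memory_fit(gbyte_mbit * 1024, 500)
tbyte_mttf_minutes = fit_to_mttf_hours(tbyte_fit) * 60

print(f"1000 FIT        -> MTTF {mttf_years:.1f} years")
print(f"1 GByte         -> one error per {gbyte_mttf_days:.1f} days")
print(f"1 GByte + ECC   -> one error per {ecc_mttf_years:.1f} years")
print(f"1 TByte         -> one error per {tbyte_mttf_minutes:.1f} minutes")
```

Changing `fit_per_mbit` or the capacity shows how quickly the system-level error interval shrinks as memory grows, which is the point of the Terabyte example.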
Trends □DRAM □System error rate of DRAMs is fairly constant □SRAM □Increasing exponentially □Logic □Increasing exponentially
Masking Effects □Logic Masking □Electrical Masking □Latching Window Masking □Microarchitectural Masking □Software Masking