Published by Roderick Mathews. Modified over 9 years ago.
1
CML CSE 520: Advanced Computer Architecture: Reliability
Aviral Shrivastava
2
CML Web page: aviral.lab.asu.edu
Therac-25 (1985-1987)
The Therac-25 was a machine for administering radiation therapy, generally for treating cancer patients. An ‘arithmetic overflow’ sometimes occurred during automatic safety checks. If the operator was configuring the machine at that precise moment, the safety checks would be skipped and the metal target would not be moved into place. The result was that beams around 100 times the intended dose were fired into a patient, causing radiation poisoning. This happened on 6 known occasions and led to the later deaths of 4 patients.
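The overflow described on this slide can be illustrated with a small sketch. It assumes, following widely cited accounts of the accidents, that a one-byte flag was incremented on every pass of a setup loop and that a value of zero meant "no check needed". The function names and loop structure are hypothetical and are not taken from the Therac-25 software.

```python
def passes_that_skip_check(total_passes, width_bits=8):
    """Return the pass numbers on which an incremented flag wraps to zero.

    Assumption: the flag is an unsigned counter of `width_bits` bits and a
    value of zero is read as "safety check not required".
    """
    flag = 0
    skipped = []
    for pass_number in range(1, total_passes + 1):
        flag = (flag + 1) % (1 << width_bits)  # increment with wraparound
        if flag == 0:  # the overflow makes the flag look "clear"
            skipped.append(pass_number)
    return skipped


def passes_that_skip_check_fixed(total_passes):
    """Same loop, but the flag is set to a constant instead of incremented."""
    skipped = []
    for pass_number in range(1, total_passes + 1):
        flag = 1  # a fixed non-zero value can never overflow
        if flag == 0:
            skipped.append(pass_number)
    return skipped
```

With an 8-bit flag, every 256th pass skips the check. An operator action that happens to land on one of those passes goes unverified, which is why the failure appeared only occasionally.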
3
Patriot Missile Bug (February 25, 1991)
During Operation Desert Storm, a US Patriot missile battery in Dhahran, Saudi Arabia, failed to intercept an incoming Iraqi Scud missile. The Scud hit a US Army barracks, where it killed 28 soldiers and injured a further 98. The system's internal clock would ‘drift’ further and further from accurate time the longer it ran. The battery had been left running for 100 hours, by which point the internal clock had drifted by 0.34 of a second. As a result, the system looked for the target over half a kilometer away from the incoming missile’s true location.
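The 0.34 second figure can be reproduced with a short calculation. This sketch rests on two assumptions that come from widely cited analyses of the incident rather than from these slides: the clock counted tenths of a second, with 1/10 chopped to 23 bits after the binary point, and the incoming missile travelled at roughly 1676 m/s. The function names are hypothetical.

```python
from fractions import Fraction

FRACTION_BITS = 23        # assumed number of bits kept after the binary point
SCUD_SPEED_M_S = 1676.0   # assumed speed of the incoming missile


def chopped_tenth(bits=FRACTION_BITS):
    """1/10 truncated (not rounded) to `bits` binary fraction digits."""
    scale = 2 ** bits
    return Fraction(int(Fraction(1, 10) * scale), scale)


def clock_drift_seconds(hours_running, bits=FRACTION_BITS):
    """Accumulated clock error after counting tenths of a second for `hours_running`."""
    ticks = hours_running * 3600 * 10                 # one tick per 0.1 s
    error_per_tick = Fraction(1, 10) - chopped_tenth(bits)
    return float(ticks * error_per_tick)


def tracking_error_metres(hours_running, speed=SCUD_SPEED_M_S):
    """Distance the target covers during the accumulated clock error."""
    return clock_drift_seconds(hours_running) * speed
```

Under these assumptions, `clock_drift_seconds(100)` comes out at about 0.34 s and `tracking_error_metres(100)` at a bit over half a kilometer, matching the figures on the slide. Because the error grows linearly with uptime, rebooting the system periodically would have kept it small.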
4
Skynet Brings Judgment Day (1997)
Cost: 6 billion dead, near-total destruction of human civilization and animal ecosystems (fictional)
Disaster: Human operators attempt to shut off the Skynet global computer network. Skynet responds by firing U.S. nuclear missiles at Russia, initiating global nuclear war on what became known as Judgment Day (August 29, 1997).
Cause: Cyberdyne, the leading weapons manufacturer, installed Skynet technology in all military hardware, including stealth bombers and missile defense systems. The Skynet technology formed a seamless network and effectively removed humans from strategic defense. Eventually Skynet became sentient, was threatened when the humans tried to take it offline, sought to survive, and retaliated with nuclear war.
5
Cold War Missile Crisis (September 26, 1983)
Soviet military officer Stanislav Petrov received an alert that the US had launched five Minuteman intercontinental ballistic missiles. Petrov found it strange that the US would attack with just a handful of warheads. Considering that the early warning system was known to have flaws and had been rushed into service, Petrov decided to treat the alert as a false alarm. It was later determined that the early detection software had picked up the sun’s reflection from the tops of clouds and misinterpreted it as missile launches.
6
Michigan Dept. of Corrections Grants Prisoners Early Release
In October 2005, The Register reported on the early release of 23 prisoners due to a computer programming glitch at the Michigan Department of Corrections. The prisoners were released between 39 and 161 days early, while an undisclosed number of inmates were kept in jail past their release dates. State assembly representative Rick Jones was concerned about the matter, but noted that he was “glad it’s not murderers.”
7
North American Blackout (August 14, 2003)
Affecting around 55 million people, mainly in the northeastern United States but also in Ontario, Canada, this was one of the biggest power blackouts in history. The blackout was not originally caused by a software bug, but it could have been averted were it not for a bug in the control centre alarm system. The alarm system had a ‘race condition’ that caused it to freeze and stop processing alerts. It failed ‘silently’ and did not notify anybody.
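A race condition of the general kind mentioned on this slide can be sketched as follows. This is a generic illustration, not the alarm system's actual code: the first function replays one bad interleaving of two unsynchronized updates by hand, and the class shows the usual fix with a lock. All names are hypothetical.

```python
import threading


def lost_update_demo():
    """Replay by hand an interleaving in which one of two increments is lost."""
    pending_alarms = 0
    a_read = pending_alarms        # thread A reads 0
    b_read = pending_alarms        # thread B reads 0 before A writes back
    pending_alarms = a_read + 1    # A writes 1
    pending_alarms = b_read + 1    # B also writes 1, so A's update is lost
    return pending_alarms


class AlarmCounter:
    """Counter whose read-modify-write step is protected by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.count = 0

    def record(self):
        with self._lock:           # only one thread updates at a time
            self.count += 1


def run_workers(n_threads=4, per_thread=1000):
    """Record alarms from several threads and return the final count."""
    counter = AlarmCounter()

    def work():
        for _ in range(per_thread):
            counter.record()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.count
```

The hand-replayed version ends with a count of 1 even though two alarms arrived, while the locked counter always ends with the full total. Because the bad interleaving only occurs under particular timing, such bugs can survive testing and appear only in the field.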
8
Blue screen of death
9
Sources of Errors
Specification errors: functionality in footnotes
Programming errors: incorrect implementation (Michigan prison error); algorithm error (Cold War missile crisis); floating point errors (Patriot missile); race conditions (Blackout)
Manufacturing errors: process variations; silicon failures
Runtime errors: Negative Bias Temperature Instability (NBTI); noise effects; voltage emergencies
Environmental: soft errors
Assuming systems are mechanically and physically protected!
10
Fault Tolerant Computing is not new!
1940s: ENIAC, with 17.5K vacuum tubes and 1000s of other electrical elements, failed once every 2 days
1950s: Early ideas by von Neumann (multichannel, with voting) and Moore-Shannon (“crummy” relays)
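The "multichannel, with voting" idea can be sketched as a majority voter over redundant copies of a computation. This is a minimal illustration assuming three or more channels that should all produce the same value. The function names are hypothetical, and it is not meant to reproduce von Neumann's original construction.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the channels."""
    winner, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        raise ValueError("no majority among channel outputs")
    return winner


def bitwise_majority(a, b, c):
    """Bit-by-bit majority of three words, as a hardware voter would compute it."""
    return (a & b) | (a & c) | (b & c)
```

For example, `majority_vote([7, 7, 9])` masks the single faulty channel and returns 7. The bitwise form can also correct errors that hit different bit positions in different copies, as long as no bit position is wrong in two copies at once.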
11
Need is changing: Automation
Space age
Age of Automation
Proliferation of robots
12
Need is changing: Proximity
Near-body computing
Google Glass
In-body computing
Accurate drug delivery
Robotic surgery
13
Need is changing: Technology
Transistors are smaller
Even low-energy particles can cause soft errors
There are exponentially more low-energy particles
14
Welcome
To the course on designing reliable computing systems
Focus of the course will be on “soft errors”
Class webpage: http://www.public.asu.edu/~ashriva6/teaching/ARC/