Famous Software Failures: Why They Failed and Lessons to Be Drawn
Samuel Franklin
G53QAT: Quality Assurance and Testing
Overview
Three software failures:
  Patriot Missile
  Russian Satellite Missile Detection
  London Ambulance Service
Summary of Findings
Questions
The Patriot Missile Failure
February 1991, Gulf War
Failed to intercept a Scud missile launched from Iraq
28 dead, 100 injured
Cause: an error from storing a value in a fixed-point register
[Figures: "The Patriot failing" and "The Patriot in action"]
Why it went wrong
[Table columns: Hours, Seconds, Calculated Time (sec), Inaccuracy (sec), Approx. Shift in Range Gate (meters)]
The system had been running for 100 hours
By then its time calculations were out by 0.34 seconds
As a result it missed the Scud by over 600 meters
The error was already large enough to cause a miss after only 20 hours of continuous running
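The 0.34 second figure can be reproduced with a short calculation. The sketch below is illustrative only and rests on assumptions not stated on the slide: that the system clock counted tenths of a second, and that the value 0.1 was truncated to 23 fractional bits in a 24-bit fixed-point register (the widely reported explanation). The names `FRACTION_BITS`, `truncated_tick` and `clock_drift_seconds` are hypothetical.

```python
from fractions import Fraction

# Assumption: 24-bit fixed-point register with 23 bits after the binary point.
FRACTION_BITS = 23
# Assumption: the clock ticks once every tenth of a second.
TICK = Fraction(1, 10)


def truncated_tick(bits=FRACTION_BITS):
    """Return 0.1 as stored in the register: chopped (not rounded) to `bits` fractional bits."""
    scale = 2 ** bits
    # Integer division drops every bit that does not fit in the register.
    return Fraction(TICK.numerator * scale // TICK.denominator, scale)


def clock_drift_seconds(hours, bits=FRACTION_BITS):
    """Return how far the computed time lags real time after `hours` of uptime."""
    ticks = hours * 3600 * 10              # number of 0.1 s ticks elapsed
    per_tick_error = TICK - truncated_tick(bits)
    return float(ticks * per_tick_error)


if __name__ == "__main__":
    for hours in (1, 8, 20, 48, 72, 100):
        print(f"{hours:>3} h uptime -> drift {clock_drift_seconds(hours):.4f} s")
```

Each tick loses under a ten-millionth of a second, but the loss accumulates with uptime, which is why a regular reboot would have reset the drift. Multiplying the drift by the target's speed gives the shift in the range gate.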
What America learnt from this
The USA already knew of the fault from the Israeli military
The Americans did not reboot the system regularly enough
The software update arrived the day after the soldiers died
Russian Satellite Missile Detection System
The Oko system was put in place to detect threats from America during the Cold War
Stanislav Petrov was monitoring the system on 26th September 1983
Oko alerted Petrov that 5 missiles were heading towards Russia
Petrov had to choose:
  Declare it a false alarm
  Start a counterstrike and probably a nuclear war
Stanislav Petrov: The Man Who Saved the World
What Russia learnt from this
The Russians dissected the Oko system
They found the software full of bugs
They launched the SPRN-2 Prognoz to supplement the Oko system
The cost of this failure could have been World War III
London Ambulance Fiasco
The London Ambulance Service (LAS) introduced a Computer Aided Dispatch (CAD) system on 26th October 1992
LAS:
  Carries over 5000 patients per day
  Receives approx. 2500 calls per day
  65% of calls are emergencies
The new system needed near 100% accuracy and full cooperation from all LAS staff to succeed
26th October 1992
The new CAD system could not handle the volume of calls under regular use
Response times rose to several hours
Communication between ambulances and LAS was lost
The system had:
  A poor interface between crews and the system
  A number of technical problems:
    Failed to identify duplicate calls
    Did not prioritise exception messages
What London learnt from this
Do not use direct conversion
Implement in a step-by-step fashion
Full consultation
Quality assurance and testing
User training
Conclusion
Testing is essential for all critical systems
Rushing to get a system in place is bad
Training matters
Humans have value in the process
Questions and Discussion
Any questions?