CS444A: Software for Critical Systems. 2 Staff Prof. David L. Dill Prof. Armando Fox.

Presentation transcript:

CS444A: Software for Critical Systems

2 Staff
Prof. David L. Dill
Prof. Armando Fox

3 Topic
The engineering of software for applications where failure is unacceptable... for some value of “failure” and “unacceptable”.
Costs of failure exceed the value of the software.

4 Critical software is growing in importance
Computers are getting exponentially smaller, cheaper, faster, and better connected.
Communications are improving at least as fast.
Increased use of critical software is irresistible:
- Automation of tasks that were previously manual or infeasible.
- Sophisticated control replacing simple control.
- Replacing mechanical, analog, digital hardware.

5 Software is growing
Software will replace mechanical, analog, and digital hardware
- Cheaper to copy.
- Easier to manufacture.
- Easier to upgrade.
- Provides more functionality.
Software will replace manual processes
- Cheaper and more reliable than human workers
- Relieves them of tedious tasks
- Faster and more predictable

6 Complexity is increasing
COTS is coming to software
- Large projects increasingly use commercial off-the-shelf components
- Commodity hardware, OS’s, tools, other building blocks
- Example: Mars Pathfinder
This is good and bad
- COTS reduces development cost & development time
- Sophisticated “building blocks” allow creation of more complex systems
- But they are often brittle: intra-component and inter-component failure modes are poorly understood
- Composition of pieces that were designed separately sometimes leads to unexpected failure modes

7 Software will be used in safety-critical applications
All of the above reasons (esp. cost)
Software can make systems safer
- TCAS - Aircraft collision avoidance system
Software can enhance system performance
- Fly-by-wire
- Antilock braking
Software can perform life-saving functions
- Computer-controlled pacemakers

9 Subtopics
Successful engineering of software encompasses many different issues:
- Relationship of software to the larger system
- Software development processes
- Software design
- Algorithms
- Programming practices

10 Goal: Best Of Both Worlds
Traditional safety-engineering perspective
- Formal verification, requirements specification, related formal methods
- Traditional hazard/fault analysis
- Fault tolerance
Systems perspective
- Design techniques and programming practices
- As much “folklore” as formal
- Especially recent experience in Internet-scale mission-critical systems

11 Formal Methodology Outline
Safety engineering of systems
- Hazard identification
- Hazard avoidance
- Standards
Requirements specification and tools
- Specification for reactive systems
- Model checking
- Logical specification (Z, VDM?)
- Theorem proving
Fault tolerance
- Fault models
- Fault tolerant protocols
Etc.

12 The Case for the Systems Perspective
Many visible success stories
- The Internet
- Mars Pathfinder
- Gargantuan-scale 24x7 mission-critical systems: Wal-Mart, financial exchanges, Visa, CIRRUS banking network…
Some spectacular failures
- Therac-25 (today)
System design combines engineering judgment and “folklore” with formal methodology

13 The Role of the Internet
The distributed system from hell
- Evolved over >25 years, lots of legacy code layers
- Widely distributed, both geographically and administratively
- Transient failure (hardware & software) is a way of life
- Yet, it mostly works... What great ideas can we steal?
The Internet is a good testbed for new approaches to reliability
- “Internet scale” implies large size, exponential growth, and 24x7 operational requirements
- People don’t die (usually) when systems go down
- Strong financial incentive spurs industrial deployment :-)

14 Systems Track Outline
- Conceptual vocabulary, research landscape
- Fault isolation, fault containment, orthogonal guard mechanisms
- Transactions, replication, consistency
- State maintenance
- Availability vs. consistency tradeoffs, harvest and yield
- Application-level vs. OS-level mechanisms
- Systems case studies
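The outline names "harvest and yield" without defining them. As a rough sketch, assuming the usual reading of these terms (yield as the fraction of offered requests that are answered, harvest as the fraction of the full data set reflected in an answer), the two metrics could be computed as below. The function names and the example numbers are illustrative assumptions, not part of the course material.

```python
def yield_fraction(completed: int, offered: int) -> float:
    """Fraction of offered requests that were answered at all."""
    if offered <= 0:
        raise ValueError("offered must be positive")
    return completed / offered


def harvest_fraction(data_reflected: int, data_total: int) -> float:
    """Fraction of the complete data set reflected in one answer."""
    if data_total <= 0:
        raise ValueError("data_total must be positive")
    return data_reflected / data_total


# Hypothetical numbers: 990 of 1000 requests answered,
# each answer built from 3 of 4 data partitions.
print(yield_fraction(990, 1000), harvest_fraction(3, 4))
```

Under this reading, a system that loses one partition can choose between refusing some requests (lower yield) and answering all of them from partial data (lower harvest).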

15 Goals
- Identify recurrent design philosophies that work well
- Taxonomize the “folklore” in software systems design
- Identify fertile crossover areas to the “formal world”

16 Example: Software failures in the Therac-25

17 Motivation
The "Therac-25" is a classic case study in engineering failure -- like Tacoma Narrows bridge, Challenger disaster, etc.
Illustrates many problems and issues of software safety.
Shows how not to do it.
Related to assignment.

18 The Machine
The Therac-25 is a linear accelerator used for radiation therapy (e.g. cancer treatment).
Safety issues:
- overdose: Patient is injured or dies from radiation burns.
- underdose: Serious disease is not treated properly, patient may be injured or die because of this.
Therac-25 much more dependent on software for safety than its predecessors (Therac-20, Therac-6)
- "Hardware interlocks" replaced by software.

19 Technical details
Dual-mode machine: electrons or X-rays (photons).
X-rays generated when electron beam collides with target.
- This is inefficient, so electron beam must be very powerful.
Different modes require turntable to be properly positioned with targets, spreaders, etc. between beam and patient.

20 Accidents
Machine reliably treated thousands of patients, but occasionally weird things would happen.
There were at least 6 accidents.
Kennestone 1985:
- Patient treated for breast cancer is unexpectedly burned.
- Est. 15K-20K rad dose (for comparison, 500 rad to the whole body is about 50% fatal).
- Patient lost breast; shoulder and arm paralyzed.
- Patient sued, settled out of court.
- FDA not informed until much later.

21 Another accident
Tyler 1986:
- Patient to be treated with electron beam.
- Operator said to treat with X-ray, then corrected.
- Patient felt "electric shock".
- Operator saw "malfunction 54" and under-dose reading, so said "proceed" to zap patient again.
- Patient overdosed a second time (in arm) as he was trying to escape.
- Patient died horribly of radiation overdose 5 months later.

22 Software issues
- No locks on shared variables (race conditions).
- Control flow bug: some newly entered data can be ignored.
- Timing sensitivity in user interface.
- Wrap-around on counters.
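Two of these issues are easy to show in miniature. The sketch below is purely illustrative and is not the Therac-25 code: the class, field, and function names are hypothetical, and the one-byte counter is an assumed width. The first part guards a shared record with a lock so that a reader never sees a half-edited mode/energy pair; the second shows how a counter that is also used as a "check needed" flag wraps to zero and silently skips the check.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Prescription:
    mode: str        # hypothetical labels: "electron" or "xray"
    energy_mev: int  # hypothetical value


class SharedPrescription:
    """Record shared by the data-entry task and the setup task."""

    def __init__(self, initial: Prescription) -> None:
        self._lock = threading.Lock()
        self._current = initial

    def edit(self, mode: str, energy_mev: int) -> None:
        # Replace mode and energy together, under the lock.
        with self._lock:
            self._current = Prescription(mode, energy_mev)

    def snapshot(self) -> Prescription:
        # Readers take the same lock and get one consistent record.
        with self._lock:
            return self._current


def increment_u8(value: int) -> int:
    """Increment a counter stored in one byte; 255 + 1 wraps to 0."""
    return (value + 1) & 0xFF


def skipped_checks(passes: int) -> int:
    """Count passes on which a nonzero-means-check flag reads zero."""
    flag = 0
    skipped = 0
    for _ in range(passes):
        flag = increment_u8(flag)  # intended as "set flag", implemented as "+1"
        if flag == 0:              # wrapped: the check is silently skipped
            skipped += 1
    return skipped
```

With a one-byte flag, `skipped_checks(256)` reports one skipped check; assigning a fixed nonzero value instead of incrementing would avoid the wrap entirely.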

23 User interface issues
- “Malfunction 54” (patient might have received overdose or under-dose).
- No indication about patient safety with error messages.
- “Proceed” button continues after error message - one patient overdosed twice.

24 System issues
Inadequate mechanical checks on turntable
- 3 microswitches for position sensing.
- 1-bit error in encoding makes position inaccurate.
- potentiometer installed later to sense position.
No independent hardware to suppress beam.
Dosage measurement devices (ion chambers) report inaccurate results for very high doses.
Therac-20 had same bugs, but no accidents because of independent protective systems.
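The idea of an independent protective check can be sketched in a few lines. This is only a software stand-in for what the slide says should have been independent hardware, and every name in it (the enums, the position labels, `beam_permitted`) is a hypothetical chosen for illustration. The guard looks only at the selected mode and an independently sensed turntable position, and it denies the beam unless the two agree.

```python
from enum import Enum


class Mode(Enum):
    ELECTRON = "electron"
    XRAY = "xray"


class Turntable(Enum):
    # Hypothetical position labels for illustration.
    ELECTRON_SCAN = "electron_scan"
    XRAY_TARGET = "xray_target"
    FIELD_LIGHT = "field_light"
    UNKNOWN = "unknown"


# Position that must be sensed before the beam may fire in each mode.
REQUIRED_POSITION = {
    Mode.ELECTRON: Turntable.ELECTRON_SCAN,
    Mode.XRAY: Turntable.XRAY_TARGET,
}


def beam_permitted(mode: Mode, sensed: Turntable) -> bool:
    """Deny by default; permit only on an exact mode/position match."""
    return REQUIRED_POSITION.get(mode) is sensed
```

The guard shares no state with the control logic, so a bug in setup or data entry cannot also disable it. In a real machine this role belongs to a hardware interlock, as on the Therac-20.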

25 Management issues
Software complacency
- software errors not modelled in fault trees.
- users told “no possibility of overdose”.
Absurdly low probabilities assigned to SW failure.
Guesswork in analyzing observed failures
- blamed microswitches on turntable.
- no actual failures found in microswitches.
- problem was probably software.
Inadequate software processes
- unclear safety analyses.
- no audit trails.
- inadequate testing.

26 Regulatory and legal issues
FDA, Canadian regulators not heavily involved
- no software regulation in med. devices (at that time).
- not notified of incidents (no requirement to do so).
- inadequate investigation of early incidents.
When FDA got involved, the machine got fixed.
(speculation) Out-of-court settlements impeded dissemination of information about hazards.

27 A more Armando-like example?