CS444A: Software for Critical Systems. Staff: Prof. David L. Dill, Prof. Armando Fox.


1 CS444A: Software for Critical Systems

2 Staff
Prof. David L. Dill
Prof. Armando Fox

3 Topic
The engineering of software for applications where failure is unacceptable... for some value of “failure” and “unacceptable”.
Costs of failure exceed the value of the software.

4 Critical software is growing in importance
Computers are getting exponentially smaller, cheaper, faster, and better connected. Communications are improving at least as fast.
Increased use of critical software is irresistible:
- Automation of tasks that were previously manual or infeasible.
- Sophisticated control replacing simple control.
- Replacing mechanical, analog, and digital hardware.

5 Software is growing
Software will replace mechanical, analog, and digital hardware
- Cheaper to copy.
- Easier to manufacture.
- Easier to upgrade.
- Provides more functionality.
Software will replace manual processes
- Cheaper and more reliable than human workers
- Relieves them of tedious tasks
- Faster and more predictable

6 Complexity is increasing
COTS is coming to software
- Large projects increasingly use commercial off-the-shelf components
- Commodity hardware, OSs, tools, other building blocks
- Example: Mars Pathfinder
This is good and bad
- COTS reduces development cost and development time
- Sophisticated “building blocks” allow creation of more complex systems
- But they are often brittle: intra-component and inter-component failure modes are poorly understood
- Composition of pieces that were designed separately sometimes leads to unexpected failure modes

7 Software will be used in safety-critical applications
All of the above reasons (esp. cost)
Software can make systems safer
- TCAS: aircraft collision avoidance system
Software can enhance system performance
- Fly-by-wire
- Antilock braking
Software can perform life-saving functions
- Computer-controlled pacemakers

9 Subtopics
Successful engineering of software encompasses many different issues:
- Relationship of software to the larger system
- Software development processes
- Software design
- Algorithms
- Programming practices

10 Goal: Best Of Both Worlds
Traditional safety-engineering perspective
- Formal verification, requirements specification, related formal methods
- Traditional hazard/fault analysis
- Fault tolerance
Systems perspective
- Design techniques and programming practices
- As much “folklore” as formal
- Especially recent experience in Internet-scale mission-critical systems

11 Formal Methodology Outline
Safety engineering of systems
- Hazard identification
- Hazard avoidance
- Standards
Requirements specification and tools
- Specification for reactive systems
- Model checking
- Logical specification (Z, VDM?)
- Theorem proving
Fault tolerance
- Fault models
- Fault-tolerant protocols
Etc.

12 The Case for the Systems Perspective
Many visible success stories
- The Internet
- Mars Pathfinder
- Gargantuan-scale 24x7 mission-critical systems: Wal-Mart, financial exchanges, Visa, CIRRUS banking network…
Some spectacular failures
- Therac-25 (today)
System design combines engineering judgment and “folklore” with formal methodology

13 The Role of the Internet
The distributed system from hell
- Evolved over >25 years, lots of legacy code layers
- Widely distributed, both geographically and administratively
- Transient failure (hardware and software) is a way of life
- Yet, it mostly works... What great ideas can we steal?
The Internet is a good testbed for new approaches to reliability
- “Internet scale” implies large size, exponential growth, and 24x7 operational requirements
- People don’t die (usually) when systems go down
- Strong financial incentive spurs industrial deployment :-)

14 Systems Track Outline
- Conceptual vocabulary, research landscape
- Fault isolation, fault containment, orthogonal guard mechanisms
- Transactions, replication, consistency
- State maintenance
- Availability vs. consistency tradeoffs, harvest and yield
- Application-level vs. OS-level mechanisms
- Systems case studies

15 Goals
- Identify recurrent design philosophies that work well
- Taxonomize the “folklore” in software systems design
- Identify fertile crossover areas to the “formal world”

16 Example: Software failures in the Therac-25

17 Motivation
The "Therac-25" is a classic case study in engineering failure, like the Tacoma Narrows bridge, the Challenger disaster, etc.
Illustrates many problems and issues of software safety.
Shows how not to do it.
Related to assignment.

18 The Machine
The Therac-25 is a linear accelerator used for radiation therapy (e.g. cancer treatment).
Safety issues:
- Overdose: patient is injured or dies from radiation burns.
- Underdose: serious disease is not treated properly; patient may be injured or die because of this.
The Therac-25 is much more dependent on software for safety than its predecessors (Therac-20, Therac-6)
- "Hardware interlocks" replaced by software.

19 Technical details
Dual-mode machine: electrons and X-rays (photons).
X-rays are generated when the electron beam collides with a target.
- This is inefficient, so the electron beam must be very powerful.
Different modes require the turntable to be properly positioned, with targets, spreaders, etc. between beam and patient.

20 Accidents
Machine reliably treated thousands of patients, but occasionally weird things would happen.
There were at least 6 accidents.
Kennestone 1985:
- Patient treated for breast cancer is unexpectedly burned.
- Estimated dose of 15,000-20,000 rad (for comparison, a 500 rad whole-body dose is fatal in about 50% of cases).
- Patient lost breast; shoulder and arm paralyzed.
- Patient sued, settled out of court.
- FDA not informed until much later.

21 Another accident
Tyler 1986:
- Patient to be treated with electron beam.
- Operator entered X-ray mode by mistake, then corrected it.
- Patient felt "electric shock".
- Operator saw "malfunction 54" and an under-dose reading, so said "proceed" to zap patient again.
- Patient overdosed a second time (in arm) as he was trying to escape.
- Patient died horribly of radiation overdose 5 months later.

22 Software issues
- No locks on shared variables (race conditions).
- Control flow bug: some newly entered data can be ignored.
- Timing sensitivity in user interface.
- Wrap-around on counters.
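
The "wrap-around on counters" item can be illustrated with a minimal sketch. It assumes a flag meaning "safety check still needed" that is incremented on every pass instead of being set to a constant, and that is stored in a fixed-width (here 8-bit) counter. The function names, the 8-bit width, and the loop structure are hypothetical and chosen for illustration; this is not the Therac-25 code.

```python
# Illustrative only: names and structure are hypothetical, not the Therac-25 code.

def run_setup_passes(passes, bits=8):
    """Return the pass numbers on which the safety check is skipped.

    The flag is incremented on every pass (nonzero means "check needed")
    and lives in a fixed-width counter, so it periodically wraps to zero.
    """
    mask = (1 << bits) - 1
    flag = 0
    skipped = []
    for n in range(1, passes + 1):
        flag = (flag + 1) & mask   # wraps to 0 on every 2**bits-th pass
        if flag == 0:
            skipped.append(n)      # flag now reads "no check needed"
        # else: the safety check would run here
    return skipped


def run_setup_passes_fixed(passes):
    """Same loop, but the flag is assigned a constant, so it cannot wrap."""
    skipped = []
    for n in range(1, passes + 1):
        flag = 1                   # set, not incremented
        if flag == 0:
            skipped.append(n)
    return skipped


if __name__ == "__main__":
    print(run_setup_passes(600))        # passes on which the check is skipped
    print(run_setup_passes_fixed(600))  # no skipped passes
```

With an 8-bit flag, the check is silently skipped on every 256th pass, which is rare enough to survive casual testing. Assigning a constant removes the failure mode entirely.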

23 User interface issues
- “Malfunction 54” (patient might have received overdose or under-dose).
- No indication about patient safety with error messages.
- “Proceed” button continues after error message: one patient overdosed twice.

24 System issues
Inadequate mechanical checks on turntable
- 3 microswitches for position sensing.
- 1-bit error in encoding makes position inaccurate.
- Potentiometer installed later to sense position.
No independent hardware to suppress beam.
Dosage measurement devices (ion chambers) report inaccurate results for very high doses.
Therac-20 had same bugs, but no accidents because of independent protective systems.
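
The point about a 1-bit error in the position encoding can be made concrete with a small sketch. It compares two 3-bit encodings of three turntable positions: one in which some valid codes differ by a single bit, and one in which every pair of valid codes differs by at least two bits, so that any single-bit error yields an invalid, detectable reading. The code assignments and position names below are assumptions for illustration, not the actual microswitch wiring.

```python
from itertools import combinations

def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")

def min_distance(codes):
    """Smallest Hamming distance between any two valid codes."""
    return min(hamming(a, b) for a, b in combinations(codes.values(), 2))

def undetected_single_bit_errors(codes, bits=3):
    """List (true position, flipped bit, misread position) triples where one
    flipped bit turns a valid code into a different valid code."""
    by_code = {code: name for name, code in codes.items()}
    errors = []
    for name, code in codes.items():
        for bit in range(bits):
            corrupted = code ^ (1 << bit)
            if corrupted in by_code:
                errors.append((name, bit, by_code[corrupted]))
    return errors

# Hypothetical encodings of three positions on three switches.
DENSE = {"electron": 0b001, "xray": 0b011, "field_light": 0b010}
SPACED = {"electron": 0b001, "xray": 0b010, "field_light": 0b100}

if __name__ == "__main__":
    for label, codes in (("dense", DENSE), ("spaced", SPACED)):
        print(label, min_distance(codes), undetected_single_bit_errors(codes))
```

A minimum distance of 2 only detects a single-bit fault; it does not say which position is the true one. That is one reason an independent sensor or interlock, such as the potentiometer added later, is still needed.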

25 Management issues
Software complacency
- Software errors not modelled in fault trees.
- Users told “no possibility of overdose”.
Absurdly low probabilities assigned to SW failure.
Guesswork in analyzing observed failures
- Blamed microswitches on turntable.
- No actual failures found in microswitches.
- Problem was probably software.
Inadequate software processes
- Unclear safety analyses.
- No audit trails.
- Inadequate testing.
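
To see why leaving software out of a fault tree, or assigning it a tiny probability, matters, consider a minimal sketch of an OR gate over independent basic events. All probabilities below are made-up placeholders chosen only to show the arithmetic; they are not figures from the Therac-25 analysis.

```python
import math

def or_gate(probs):
    """Probability that at least one of several independent events occurs."""
    return 1.0 - math.prod(1.0 - p for p in probs)

# Hypothetical per-treatment failure probabilities for hardware basic events.
HARDWARE = [1e-6, 2e-6]

if __name__ == "__main__":
    print("software omitted:       ", or_gate(HARDWARE))
    print("software at 1e-11:      ", or_gate(HARDWARE + [1e-11]))
    print("software at 1e-3 (say): ", or_gate(HARDWARE + [1e-3]))
```

Under these assumed numbers, omitting the software event or assigning it 1e-11 leaves the top-level estimate essentially unchanged, while a less optimistic value dominates it completely. The conclusion of the analysis is therefore only as good as the software estimate.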

26 Regulatory and legal issues
FDA, Canadian regulators not heavily involved
- No software regulation in medical devices (at that time).
- Not notified of incidents (no requirement to do so).
- Inadequate investigation of early incidents.
When FDA got involved, the machine got fixed.
(Speculation) Out-of-court settlements impeded dissemination of information about hazards.

27 A more Armando-like example?

