Reliability and Safety


Reliability and Safety (Week 7): What can go wrong?

Issues:
- Hardware errors
- Software errors
- Fault vs. error

Computer failure causes:
- Faulty design
- Sloppy implementation
- Careless or insufficiently trained users
- Poor user interfaces
- Hardware/software malfunctions
- Specification errors
- Scope/application inconsistency

Computer User's Perspective
- Should understand the limitations of computers
- Need for proper training
- Need for responsible use
- Difference between good products and bad ones

Computer Professional's Perspective
- Study computer failures
- Study computer ethics

Educated Member of Society Perspective
- Helps us evaluate the reliability and safety of various computer applications
- Helps us evaluate computer technology

Three Categories of Failures
- Problems for individuals
- System failures that affect large numbers of people or cost large amounts of money
- Problems in safety-critical applications

Problems for Individuals
- Billing errors
  - Errors in the design and/or implementation of programs
  - Not enough care (input errors)
  - Not enough testing (for example, for a reasonable range of values)
  - Not enough training
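
The "reasonable range" point can be made concrete with a plausibility check on computed bills. The following is a minimal sketch in Python, not something taken from the slides: the limits and the function name `check_bill` are hypothetical, and real limits would come from the billing domain.

```python
from decimal import Decimal

# Hypothetical plausibility limits; real values depend on the billing domain.
MIN_BILL = Decimal("0.00")
MAX_BILL = Decimal("10000.00")


def check_bill(amount: Decimal) -> Decimal:
    """Return the amount if it is plausible, otherwise raise ValueError.

    A bill outside the range is not necessarily wrong, but it should be
    held for human review instead of being sent out automatically.
    """
    if not (MIN_BILL <= amount <= MAX_BILL):
        raise ValueError(f"bill of {amount} is outside the reasonable range")
    return amount
```

A check like this does not find the underlying bug; it only stops an implausible result from reaching the customer unreviewed.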

Database Accuracy Problems
- Info in the database is not accurate
- Automatic entering of info: mistakes can be overlooked
- Copies of incorrect info can be sent to other systems
- Users are not knowledgeable enough about the system

Causes
- Large population
- Most of our financial interactions are with strangers
- Automated processing without human common sense
- Overconfidence in the accuracy of data
- Lack of accountability

Consumer Hardware and Software
- Usually have more serious errors in their first releases
- Regularly sold with known bugs
- Hardware also has flaws
- Tradeoff between cost, debugging, and marketing
- Dishonesty, denials of problems, lack of adequate response to complaints

System Failures
- Lots of $$$$
- Complete shutdown of basic services
- Areas:
  - Communications
  - Business and financial systems
  - Military

Why?
- Not enough testing
- Technical difficulties
- Poor management decisions
- Dishonesty in promoting the system and responding to problems

Communications
- Phone service
- How bad?
  - Pagers
  - Phone calls
  - 911
  - Communications for airports
  - Cellular phones

Business and financial systems
- Stock exchange
- ATM
- Contest by Pepsi: too many winning tickets issued

Destroying Business
- Loss of sales
- Incorrect info affects business
- Dissatisfied customers
- Incorrect prices
- Loss of data

Military
- Data management
- Weapons system design
- Battle simulation
- Battle management
  - Command/control
  - Communications
  - Intelligence
- Nuclear war

Why?
- Not enough testing
- Technical difficulties
- Poor management decisions
- Dishonesty in promoting the system and responding to problems
- Results in delays and abandonment of projects
- Heard before?

The Denver Airport baggage system
- Outbound luggage checked at ticket counters or curbside, to be delivered anywhere in under 10 minutes via an automated system of cars on tracks
- Connecting flights or terminals
- Laser scanners
- Tracks - 4000 cars

Problems Encountered
- Cars crashed into each other at intersections
- Luggage misrouted, dumped, or flung
- Needed cars were idle or put to rest

Specific problems
- Real-world problems
  - Scanners got dirty or were knocked out of alignment
- Software error
  - Rerouting of cars to a waiting area, where they sat idle

Causes
- Time allowed for development and testing was insufficient
- Significant changes in specifications were made after the project began
- Not enough debug time
- Poor management
- Unrealistic plan

Safety-Critical Applications
- Use of computers is increasing rapidly in these areas
- Use of computers in these areas can save $
- Areas:
  - Military
  - Medical applications
  - Power plants
  - Aircraft
  - Trains
  - Automobiles

Aircraft - Fly by Wire
- Pilots do not directly control the plane
- Actions are input to computers that control the aircraft systems
- Pilot interaction is critical
- Need for an easy way to override computers
- Easy transfer between automatic and manual control

Air Traffic Control
- Long delays
- Increased risk of collision
- Political: government spends $ elsewhere

Case Study - Therac-25
- Software-controlled radiation therapy machine used to treat people with cancer
- Problems:
  - Massive overdoses administered
  - Repeated overdoses due to faulty display
  - Death
- Operated in dual machine mode: electron beam or x-ray photon beam

Why?
- Lapses in good safety design
- Insufficient testing
- Bugs in the software that controlled the machines
- Inadequate system of reporting and investigating accidents and deaths

Specific problems
- Some hardware safety features were eliminated in newer models
- Software reused from older systems was assumed to be correct
- Malfunctioned frequently
- Weakness in the design of the operator interface
  - Inadequate explanation of error messages, if any

Specific problems continued
- Machine allowed one-key intervention instead of automatic shutdown
- Inadequate documentation
- Poor test plan

Software Errors - bugs
- Fatal error was a simple fix
- Fixes can be complex and expensive, and they prevent use of the machine while it is being fixed
- Bugs can be intermittent and hard to detect
  - Importance of self-checking
  - Importance of using good programming techniques

Overconfidence
- Leaving out changes that are necessary
- Ignoring error messages
- Not using backup devices (video or audio)

Conclusion and Perspective
- Irresponsibility leads to criminal charges
- Responsibility leads to merit awards
- Importance of good software development
- Consequences of carelessness, cutting corners, unprofessional work, or attempts to avoid responsibility
- Lack of appreciation for risks
- Poor training

Ways to prevent problems
- Good computer systems
- Good training
- Accountability
- Individual responsibility
- Management responsibility
- IEEE Code of Ethics

Increasing Reliability and Safety
- What goes wrong?
- Many lines of code and many programmers
- Problems are managerial, technical, social, legal, ethical

Overconfidence
- Unappreciative of risks
- Ignore warnings
- Don't consult manuals

Professional Techniques
- Use good software engineering techniques at all stages of development:
  - Requirements
  - Specifications
  - Design
  - Implementation
  - Documentation
  - Testing (V&V)

Professional Techniques
- Study the techniques and tools available
- Know or learn enough about the application field and the software or systems being used (domain knowledge)

Why Study Failures?
- Provides technical lessons
  - Leads to improved hardware and software products
- Provides ethical data
  - Leads to improved ethical codes/laws

Lessons Learned
- Accidents are not the result of unknown scientific principles but rather of a failure to apply well-known engineering practices
- Accidents will not be prevented by technological fixes alone; prevention requires control of all aspects of the development and operation of the system

Lessons Learned
- Software developers need to recognize the limitations of software and use hardware safety mechanisms
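
One way to read this lesson is that a software check should never be the only barrier. The sketch below is illustrative only and does not describe any real device: `HardwareInterlock` is a hypothetical stand-in for an independent physical sensor, and `may_activate` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class HardwareInterlock:
    """Stand-in for an independent physical sensor (hypothetical)."""
    circuit_closed: bool


def may_activate(software_check_ok: bool, interlock: HardwareInterlock) -> bool:
    """Allow activation only when software and hardware both agree it is safe."""
    # A bug that wrongly sets software_check_ok cannot enable the
    # system on its own; the independent interlock must agree.
    return software_check_ok and interlock.circuit_closed
```

In a real system the interlock would act on the hardware directly; modelling it as a value passed to software is a simplification made for this sketch.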

Redundancy and Self-checking
- Redundancy - judging - expensive
- Complex systems collect information to diagnose and correct errors
- Audit trails are vital
- Detailed records help protect against theft and help trace and correct errors
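
The audit-trail point can be sketched as an append-only log in which each record carries a hash of the record before it, so that later tampering is detectable. Hash chaining is a technique chosen for this illustration, not one named in the slides; the field names and the functions `append_entry` and `verify` are hypothetical.

```python
import hashlib
import json


def _digest(body: dict) -> str:
    """Hash a record body in a stable (sorted-key) JSON form."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(trail: list, user: str, action: str) -> dict:
    """Append a record whose hash also covers the previous record's hash."""
    prev_hash = trail[-1]["hash"] if trail else ""
    body = {"user": user, "action": action, "prev": prev_hash}
    entry = {**body, "hash": _digest(body)}
    trail.append(entry)
    return entry


def verify(trail: list) -> bool:
    """Return True if no record was altered or removed from mid-chain."""
    prev_hash = ""
    for entry in trail:
        body = {"user": entry["user"], "action": entry["action"], "prev": entry["prev"]}
        if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True
```

A production audit trail would also record timestamps and be stored where the audited users cannot rewrite it; both are left out here to keep the sketch short.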

Redundancy and Self-checking
- Designed to constantly monitor itself and correct problems automatically
- Half of the computing power is devoted to checking the rest for errors
- When a problem is found, the system:
  - Closes off part of the system
  - Reroutes
  - Corrects problems and reroutes again
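
The monitor-and-reroute behaviour can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any real switching system: each component is represented by a hypothetical zero-argument health check, and `route_call` is a hypothetical name.

```python
def route_call(health_checks: dict, preferred: list) -> str:
    """Return the first component in `preferred` whose self-check passes.

    `health_checks` maps a component name to a zero-argument function
    that returns True when the component reports itself healthy.
    """
    for name in preferred:
        if health_checks[name]():
            return name
        # A failed check closes off this component; try the next one.
    raise RuntimeError("no healthy component available")
```

For example, with `{"a": lambda: False, "b": lambda: True}` and the preference order `["a", "b"]`, the call is routed to `"b"`. Once `"a"` is repaired and its check passes again, the same call routes back to `"a"`.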

Testing
- Critical!
- Principles and techniques exist
- Can use another company to perform independent verification and validation

Dangerous Tendencies
- Operators
  - Bypass check mechanisms through familiarity
- Technicians
  - Blame random mechanical or signal glitches rather than software
- Corporate managers
  - Initially deny and ignore, then cover up
  - Finally deal with expensive fixes

Overall Lessons Learned
- Should not declare a problem understood on the first hypothesis
- Should not expect management to follow through on field reports
- Overconfidence in software leads to economical but marginal designs

Overall Lessons Learned
- Enforcement of software engineering practices is often abysmal
- Basing risk assessments on individual subsystems often leads to unrealistic optimism
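
A small calculation shows why per-subsystem figures mislead. The sketch assumes the simplest possible model, which is not from the slides: the system works only if every subsystem works, and subsystems fail independently. The function name `system_reliability` is hypothetical.

```python
import math


def system_reliability(subsystem_reliabilities):
    """Probability that every subsystem works.

    Assumes a series system with independent failures, so the
    individual probabilities simply multiply.
    """
    return math.prod(subsystem_reliabilities)


# Fifty subsystems that are each 99% reliable:
overall = system_reliability([0.99] * 50)
```

Here `overall` is about 0.605, so a system built from fifty "99% reliable" parts works only about 60% of the time under this model. Real systems can also fail through interactions between subsystems, which this model ignores.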

Lessons for systems engineering
- Hardware backups are valuable
- Software must not be presumed innocent
- Audit trails are critical
- Risk estimates are subjective
- User feedback is valuable

Lessons for software engineering
- Documentation should be ongoing
- Designs should be kept simple
- Testing should be built into software
- Software must be tested both outside the system and in the system
- Reused software should be tested like new software
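
"Testing should be built into software" can be read as shipping checks inside the code itself. The sketch below shows one possible form, assuming a deliberately trivial calculation: `scale` and `self_test` are hypothetical names, and the checks are examples and not a complete strategy.

```python
def scale(value: float, factor: float) -> float:
    """Scale a non-negative value by a factor in (0, 1], with built-in checks."""
    # Precondition: reject inputs outside the range the code was designed for.
    if value < 0 or not (0 < factor <= 1):
        raise ValueError("input out of range")
    result = value * factor
    # Postcondition: scaling down can never produce a larger value.
    if result > value:
        raise AssertionError("internal consistency check failed")
    return result


def self_test() -> None:
    """Known-answer checks that can be run at startup, before real use."""
    assert scale(100.0, 0.5) == 50.0
    assert scale(0.0, 1.0) == 0.0
```

Calling `self_test()` at startup catches a broken build or environment early, while the checks inside `scale` keep running on every real call.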

Lessons for oversight
- Users are more likely to make initial observations than monitoring officials
- Users need reliable information

Laws and Regulations
- Criminal and civil penalties
- Suits against the company that designs or sells the system
- Criminal charges when fraud or criminal negligence occurs
- Need contracts
- Need well-designed laws and standards

Regulation
- Requirement for approval by a government agency before a new product can be sold, including specific testing requirements
- The profit motive causes skimping on safety/testing
- Better to abandon in some cases
- Customers have inadequate ability to judge
- Hard to sue large companies

Regulation
- Expensive and time-consuming
- Newer procedures may not be enforced
- Lots of paperwork

Professional licensing
- Licensing of software development professionals to protect against poor quality and unethical behavior
  - Specific training
  - Passing a competency exam
  - Ethical requirements
  - Continuing education