Presentation is loading. Please wait.

Presentation is loading. Please wait.

Chapter 8: Errors, Failures, and Risk

Similar presentations


Presentation on theme: "Chapter 8: Errors, Failures, and Risk"— Presentation transcript:

1 Chapter 8: Errors, Failures, and Risk
CSE/ISE 312 Chapter 8: Errors, Failures, and Risk Legacy systems are out-of-date systems (hardware, software, or peripheral equipment) still in use, often with special interfaces, conversion software, and other adaptations to make them interact with more modern systems. Major users of computers in the early days included banks, airlines, government agencies, and providers of infrastructure services such as power companies. The systems grew gradually. A complete redesign would be expensive and would possibly involve downtime. Thus legacy systems persist. Limited computer memory led to obscure and terse programming practices. A variable a programmer might now call “flight-number” might have been simply “f.”

2 What We Will Cover Failures and Errors in Computer Systems
Case Study: The Therac-25 Increasing Reliability and Safety Dependence, Risk, and Progress

3 Failures and Errors in Computer Systems
Most computer applications are so complex it is virtually impossible to produce programs with no errors The cause of failure is often more than one factor Faulty design, sloppy implementation, careless users, poor user interface, insufficient user training… Design and testing of mission critical systems is much more complex than typical computer-based systems Computer professionals must study failures to learn how to avoid them, and to understand the impacts of poor work

4 Problems for Individuals
Billing errors Inaccurate and misinterpreted data in databases Large population where people may share names Automated processing may not be able to recognize special cases Overconfidence in the accuracy of data Errors in data entry Lack of accountability for errors P407 A woman received a $6.3M bill for electricity; IRS issued an Illinois couple a bill for a few thousand dollars in tax and $68B in penalties, auto insurance rate of a 101-yo man suddenly tripled, because the code only handle two digits - Coder can check if a bill amount is unreasonable (in the billions), or if there is sudden jump from previous bill. A family was harassed, threatened, and physically attacked after its state posted an online list of addresses where sex offenders live. The state did not know the offender had moved away before the family moved in. After 9/11/2001, more than 30K ppl have been mistakenly matched to names on terrorist watch lists at airports and border crossings Police arrested a Michigan man for several crimes, including murders, committed LA. Another man had assumed his id after finding his lost wallet

5 System Failures AT&T, Galaxy IV satellite, Amtrak
Businesses have gone bankrupt after spending huge amounts on computer systems that failed Voting systems in presidential elections Stalled airports: Denver, Hong Kong, Malaysia Abandoned systems Systems discarded after wasting millions even billions of dollars Legacy systems Reliable but inflexible, expensive to replace, little documentation P413 -AT&T: a change of three lines in 2-million line program caused phone svc stall in many cities. The change had a typo Amtrak: ticketing and reservation failure during Thxgiving weekend causes delays and no printed schedules/fares When a Galaxy IV satellite computer failed, many systems we take for granted stopped working. Pager service stopped for an estimated 85% of users in the U.S., including hospitals and police departments. Airlines that got their weather information from the satellite had to delay flights. The gas stations of a major chain could not verify credit cards. Some services were quickly switched to other satellites or backup systems. It took days to restore others. Inventory system (Warehouse Mgr) failed due to a port to a different platform/OS, causing business to go bankrupt. Wrong prices, wrong inventory levels, extreme delay in processing, placing a purchase order takes weeks. Causes of loss: Technical difficulties, poor mgmt decisions(inadequate testing) and dishonesty in promoting the system and responding to the problems. (P415) Voting machines not properly secured, fraud, errors, crashed, incorrect usage, storage, memory full. Accuracy and authenticity of vote counts are essential in a healthy democracy, electronic ones have not reached level of trust we want

6 Denver Airport Baggage handling system costs ~ $200 million, caused most of the delay Baggage system failed due to real world problems, problems in other systems and software errors Carts crashed into each other at track intersections, mistaken route. Scanner got dirty or knocked out of alignment, faulty latches, power surges Main causes: Time allowed for development was insufficient Denver made significant changes in specifications after the project began In 1994, 3.2Billion airport still not open 10 months after planned. Delay costs $30million/month in bond interest and operating costs. Any bag travels to any part of airport in <10min. Automated systems of carts travel on 19mph or 22mph tracks. Each cart is bar-coded with destination, carries bags. Laser scanners track 4K carts, send to computer. Which controls motors and switches to route carts. - Software errors cause intersection, misrouted, - Faulty latches cause luggage to fall onto tracks between stops flung about luggage. Scanners can’t detect carts going by - Electrical system can’t handle the power surges

7 High-level, management-related causes of computer-system failures
Lack of clear, well thought out goals and specifications Poor management decisions and poor communication among customers, designers, programmers, etc. Institutional and political pressures that encourage unrealistically low bids, low budget requests, and underestimates of time requirements Use of very new technology, with unknown reliability and problems Refusal to recognize or admit a project is in trouble High-level, mgmt-related causes of system failures. Many projects that cost millions or billions of dollars were abondened. British food retailer - supply mgmt Ford Motor - puchasing system CA and WA state motor vehicle dept Consortium of hotels and rental car business CA - tracking parents with child-support payments IRS (4Billion) - tax modernization plan FBI - evidence mgmt 5-15% of IT projects out of $1Trillion per year abandoned - 50 to 150Billion dolloars/year

8 Case Study: The Therac-25
Therac-25 Radiation Overdoses: Therac-25: a software controlled radiation therapy machine used to treat cancer patients , 4 medical centers Massive overdoses of radiation were given; the machine said no dose had been administered at all Caused severe and painful injuries and the death of three patients Important to study to avoid repeating errors Manufacturer, computer programmer, and hospitals/clinics all have some responsibility P425 A sw-ctrled radiation therapy machine used to treat ppl w/ cancer. , at 4 medical centers, six patients received massive overdoses of radiation

9 Case Study: The Therac-25 (cont.)
Software and Design problems: Re-used software from older systems, unaware of bugs in previous software Weaknesses in design of operator interface Obscure error messages with no documentation on them Inadequate test plan Bugs in software Allowed beam to deploy when table not in proper position Ignored changes and corrections operators made at console - Fully computer controlled. Older machines had hw safety interlock mechanisms, independent of the computer, that prevented the beam from firing in unsafe conditions. - Many error messages or obscure msgs (Malfunction 54, H-tilt). Operator’s manual did not have explanation of these msgs. Maintenance manual did not explain them either. - Setup procedure does a polling to see if ready, and can poll hundreds of times to set up for a treatment. A flag variable indicates readiness. Zero means ready. It is stored in a byte, and incremented by 1 each time system is not ready. So, when it wraps up, things still not ready, but it would let the beam to deploy!!!!!!!! >-( - Concurrency control problem. While machine is moving into place ctrled by computer, operator may edit the input, sw did not have any consistency check. Cause incorrect settings that are not safe.

10 Therac-25 - Why So Many Incidents?
Hospitals had never seen such massive overdoses before, were unsure of the cause Manufacturer said the machine could not have caused the overdoses and no other incidents had been reported (which was untrue) The manufacturer made changes to the turntable and claimed they had improved safety after the second accident. The changes did not correct any of the causes identified later

11 Therac-25 - Why So Many Incidents?
Recommendations were made for further changes to enhance safety; the manufacturer did not implement them The FDA declared the machine defective after the fifth accident The sixth accident occurred while the FDA was negotiating with the manufacturer on what changes were needed

12 Observations and Perspective
Minor design and implementation errors usually occur in complex systems; they are to be expected The problems in the Therac-25 case were not minor and suggest irresponsibility Accidents occurred on other radiation treatment equipment without computer controls when the technicians: Left a patient after treatment started to attend a party Did not properly measure the radioactive drugs Confused micro-curies and milli-curies

13 Increasing Reliability and Safety
What goes wrong? Design and development problems Management and use problems Misrepresentation, hiding problems and inadequate response to reported problems Insufficient market or legal incentives to do a better job Re-use of software without sufficiently understanding the code and without thoroughly testing it Failure to update or maintain a database Overconfidence First four bullets: Some factors in computer system errors and failures. P431 Next three are just highlights in the four. Overconfidence: do not back up data until disk crashes Inadequate understanding of real-world risks (loose wires, fallen leaves on train tracks) Unrealistic safety estimates can from lack of understanding, carelessness, or intentional misrepresentation. Ppl without a high regard for honesty sometimes give in to Business or political pressure to . exaggerate safety . Hide flaws . Avoid unfavorable publicity . Avoid expense of corrections or lawsuits

14 Professional Techniques
Importance of good software engineering and professional responsibility User interfaces and human factors Feedback Should behave as an experienced user expects Workload that is too low can lead to mistakes Redundancy and self-checking Testing Include real world testing with real users 1st bullet: stages of sw development: spec., design, implementation, documentation, testing P435. Should provide feedbacks to users. So they can take over if needed. 2nd bullet: mainly wrt autopilot system interface design. When approaching an airport. the beacon signal Romeo rather than Rozo pops up at top after a piolot typed “R”, he did not check carefully, and chose the top one, led the plane to have a sharp turn, in the dark, crashed into a mountain, killing 159 ppl. P435  3rd bullet: space shuttles have four identical independent systems that receive input from sensors and check results against each other. In addition it has another separately designed system as backup. AT&T 100million calls/day, system does self-checking and corrections. 4th: beta-testing a near-final stage of testing IV&V – Independent Validation and Verification for space shuttle permanent by NASA

15 Law, Regulation and Markets
Criminal and civil penalties Provide incentives to produce good systems, but shouldn't inhibit innovation Warranties for consumer software Most are sold ‘as-is’ Regulation for safety-critical applications Professional licensing Taking responsibility 1st bullet: legal remedies for faulty systems: law suits against sw company. Criminal charges for fraud or criminal negligence. However, contracts for business computer systems dominate here. Business pay up to the amount of the sw if damage occurs. Some customers try to allege fraud and misrepresentation by the seller in an attempt to recover loss. Well-designed liability laws and criminal laws for increasing reliability and safety are still lacking. Many flaws in them. Complexity of large systems makes designing liability standards difficult. 2nd: Licensing agreements say you are buying as-is. Some even prohibit publically criticizing the sw. Some say vendor can choose state to settle legal disputes. Again, consumer protection argue for mandatory warranties on software. So encourage responsibility and good quality. Opponents say such requirements would raise prices. Reduce innovation and development. So far no warranty law. Liability laws for software and physical products differ, but needs to study many things. What products. Microwave with sw in it counts as former or later? 3rd. E.g. FDA regulates drugs and medical devices for decades. Extensive testing, lengthy docs needed for approval. But lengthy process may cost lives by not introducing them to market faster. 4th. Mandatory licensing of sw development professionals. 5th: Market mechanism: to keep customer happy, business absorbs losses due to their errors. Insurance company charge higher for those irresponsible. We can check BBB’s record of the vendor, check third party reviews. Online groups.

16 Dependence, Risk, and Progress
Are We Too Dependent on Computers? Computers are tools They are not the only dependence Electricity Risk and Progress Many new technologies were not very safe when they were first developed We develop and improve new technologies in response to accidents and disasters We should compare the risks of using computers with the risks of other methods and the benefits to be gained Discussion questions: Do you believe we are too dependent on computers? Why or why not? In what ways are we safer due to new technologies?


Download ppt "Chapter 8: Errors, Failures, and Risk"

Similar presentations


Ads by Google