COMS W3156: Software Engineering, Fall 2001
Lecture #2: The Open Class
Janak J. Parekh, janak@cs.columbia.edu
Important terminology (I)
NEW: Different colors from previous version.
ALL NEW: Software is not compatible with previous version.
UNMATCHED: Almost as good as the competition.
ADVANCED DESIGN: Upper management doesn't understand it.
NO MAINTENANCE: Impossible to fix.
Important terminology (II)
BREAKTHROUGH: It finally booted on the first try.
DESIGN SIMPLICITY: Developed on a shoestring budget.
UPGRADED: Did not work the first time.
UPGRADED AND IMPROVED: Did not work the second time.
Some leftover points from last class
Plagiarism: I was being cute last time – you will get into trouble if you are caught
Books: they're available from Papyrus, 114th and Broadway
Office hours: sorry about this week…
Questionnaire: finally done; see http://softe.cs.columbia.edu
C/C++ students, talk to me
Next class – course “begins”
Read chapters 1 and 4 of Schach, if you have the book
The first one should be a breeze (introduction); the fourth isn't that bad (teams)
We will also start discussing the project in detail next class
Recitations will begin next week
Why Software Engineering?
We started discussing this last class
Mythical Man-Month: start reading it when you get a chance; we'll go over it later
In the meantime, let's discuss some case studies of how software engineering (or the lack thereof) changed certain operations
Success/Failure: Mars Rover (I)
http://catless.ncl.ac.uk/Risks/19.49.html#subj1
In 1997, the public was told that “software glitches” and “too many things trying to be done at once” caused the Pathfinder's failures
In reality, “priority inversion” was at fault
Success/Failure: Mars Rover (II)
There were three main threads, scheduled preemptively:
– Information bus data-moving: high priority, frequent
– Meteorological data-gathering: low priority, occasional
– Communications task: medium priority, occasional
Occasionally, the communications task would be scheduled during a blocked information bus operation, since the bus was waiting for the meteorological data to be gathered
Success/Failure: Mars Rover (III)
The communications task would prevent the meteorological work from being done, since it was higher priority
Since the info bus then appeared “dead”, a watchdog reset would fire, restarting the entire system
The low-priority meteorological task upended the system: “priority inversion”
Success/Failure: Mars Rover (IV)
Good news:
– They had left debugging mode on
– The Rover was running VxWorks, a small real-time OS that has tracing capabilities
– They managed to trace the source
– VxWorks also supports priority inheritance: a lower-priority task holding a resource inherits the priority of a higher-priority task blocked on that resource
– As a consequence, they were able to upload a small change to solve the crash
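The inversion, and the inheritance fix, can be sketched with a toy scheduler. This is purely illustrative: the task names, step sequences, and scheduler are invented for this sketch and are not Pathfinder or VxWorks code.

```python
# Toy priority-inversion simulation (names and steps invented for
# illustration). Three tasks contend for one mutex; the scheduler
# always runs the highest-priority task that is runnable.

def simulate(priority_inheritance=False):
    # Remaining steps per task: "L" = take mutex, "U" = release, "." = work
    tasks = {
        "bus":  {"prio": 3, "steps": list("L.U")},   # high, needs mutex
        "comm": {"prio": 2, "steps": list("....")},  # medium, no mutex
        "met":  {"prio": 1, "steps": list("L..U")},  # low, holds mutex
    }
    # met grabs the mutex first, then bus wakes, then comm becomes ready
    ready_at = {"met": 0, "bus": 1, "comm": 2}
    holder, timeline, t = None, [], 0

    while any(tk["steps"] for tk in tasks.values()):
        def eff_prio(name):
            tk = tasks[name]
            # With inheritance, the mutex holder borrows the priority of
            # the highest-priority task currently blocked on the mutex.
            if priority_inheritance and name == holder:
                blocked = [tasks[n]["prio"] for n, x in tasks.items()
                           if n != holder and x["steps"] and x["steps"][0] == "L"]
                return max([tk["prio"]] + blocked)
            return tk["prio"]

        runnable = [n for n, tk in tasks.items()
                    if tk["steps"] and ready_at[n] <= t
                    and not (tk["steps"][0] == "L" and holder not in (None, n))]
        if not runnable:
            t += 1
            continue
        name = max(runnable, key=eff_prio)
        step = tasks[name]["steps"].pop(0)
        if step == "L":
            holder = name
        elif step == "U":
            holder = None
        timeline.append(name)
        t += 1
    return timeline

# Without inheritance, the medium task runs while high is blocked:
print(simulate(False))
# With inheritance, the low task finishes first, unblocking high:
print(simulate(True))
```

In the first run, the high-priority bus task only completes after all of the medium-priority communication work, even though it outranks it; in the second, inheritance boosts the mutex holder so the bus task runs first.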
Lessons: Mars Rover
Black box testing would have been impossible – they had to see interrupts, etc.
Therefore, leaving debugging facilities on afterwards was a big win here
– Designing for maintenance
Just because the data bus task ran frequently and finished quickly didn't mean it could never be starved
Failure: Therac-25 (I)
http://sunnyday.mit.edu/papers/therac.pdf – don't read it if you are squeamish
The Therac-25 was a linear accelerator released in 1982 for cancer treatment, delivering limited doses of radiation
This new model was software-controlled as opposed to hardware-controlled; previous units had used software merely for convenience
Failure: Therac-25 (II)
Controlled by a PDP-11 computer; safety was enforced in software
In case of error, the software was designed to prevent harmful effects
However, on a software error, cryptic codes were given back to the operator: “MALFUNCTION xx”, where xx ranged from 1 to 64
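To see why bare numeric codes are a defensive-design failure, consider mapping each code to a description and a required operator action. The codes and messages below are invented for illustration; they are not from the real Therac-25 software.

```python
# Hypothetical sketch: translate opaque fault codes into messages an
# operator can act on, instead of printing just "MALFUNCTION xx".
# (Codes and texts invented; not actual Therac-25 codes.)

MALFUNCTIONS = {
    9:  ("dose rate out of tolerance", "do NOT proceed; call service"),
    54: ("dose input mismatch",        "do NOT proceed; call service"),
}

def report(code):
    # Unknown codes are treated as critical rather than ignorable --
    # the safe default when the software doesn't know what went wrong.
    desc, action = MALFUNCTIONS.get(
        code, ("unknown fault", "treat as critical; call service"))
    return f"MALFUNCTION {code}: {desc} -- {action}"

print(report(54))
```

The design point is the default branch: an unrecognized error should demand attention, not train operators to press "proceed".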
Failure: Therac-25 (III)
Operators were rendered insensitive to the errors; they happened often, and operators were told it was impossible to overdose a patient
However, from 1985–1987, six people received massive overdoses of radiation; several of them died
Failure: Therac-25 (IV)
Main cause:
– A race condition often happened when the operator entered data quickly, then hit the UP arrow key to correct it, and values weren't reset properly
– AECL (the company) never noticed quick data entry – their people didn't do this on a daily basis
– Apparently the problem existed in previous units, but they had a hardware interlock mechanism to prevent it; here, they trusted the software and took out the hardware interlock
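The class of bug can be sketched as an edit that fails to invalidate work already done with the old value. Everything here (class, method, and field names) is invented for illustration and greatly simplified; it is not the actual Therac-25 code.

```python
# Illustrative sketch of the bug class: changing a parameter does not
# invalidate derived state computed from the old value. (All names
# invented; not actual Therac-25 code.)

class TreatmentSetup:
    def __init__(self):
        self.mode = None
        self.magnet_position = None   # derived from mode during setup

    def enter_mode(self, mode):
        self.mode = mode              # BUG: does not clear/redo derived state

    def configure_magnets(self):
        # Slow hardware step; uses whatever mode is set when it runs
        self.magnet_position = "electron" if self.mode == "e" else "xray"

    def fire(self):
        # Safety depends on mode and magnet position agreeing
        expected = "electron" if self.mode == "e" else "xray"
        return "OK" if self.magnet_position == expected else "OVERDOSE RISK"

s = TreatmentSetup()
s.enter_mode("x")         # operator selects X-ray mode...
s.configure_magnets()     # ...setup completes for X-ray...
s.enter_mode("e")         # ...operator quickly edits to electron mode
print(s.fire())           # -> "OVERDOSE RISK": magnets still set for X-ray
```

The fix in this sketch would be for `enter_mode` to invalidate `magnet_position` and force reconfiguration; a hardware interlock would have caught the mismatch independently of the software.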
Lessons from Therac-25 (I)
Overconfidence in software, especially for embedded systems
Reliability != safety
No defensive design; bizarre error messages
They just “bugfixed”, didn't look for root causes
Complacency
Lessons from Therac-25 (II)
Improper software engineering practices:
– Most testing, in reality, was done in a simulated environment or on the complete unit; little if any unit or software testing
– They claimed 2700 hours of testing; it was really 2700 hours “of use”
– Overly complex, poorly organized design
– Blind software reuse
Is there a “successful” way?
Hard to say – software engineering is an imprecise field
There's always “room to improve”
Nevertheless, there are many examples of million-dollar savings, where initial investments that seemed large were quickly offset by the cost savings
See the book