Learning from Disaster - Jack Ganssle (MAPLD 2005)

Presentation transcript:

I'm here to talk to you about failures. Not because I embrace failure, but because we can learn so much from it. We could learn as much from success, but we don't: when the miracle occurs and we ship the product, we're so astonished and relieved that we don't want to think about it anymore... other than those secret night fears of calls from irate customers as they start discovering bugs.

At this show you'll attend a lot of classes, formal teaching venues using math, analysis, studies and more to get a point across. Not here. We embedded people haven't learned to be afraid of software. Fear, like any emotional response, comes from experience, often experience passed along from generation to generation. So this morning let's pretend we're sitting around a campfire, telling stories about how to avoid the saber-toothed tiger and how to find the best-tasting roots and berries.

One of my concerns is that in 50 years of software engineering, and 30 of embedded systems, we have accumulated lots of experiential data that tells us the right and wrong ways to do things. Yet we all ignore that experience and repeat the same old dysfunctional development techniques we've always used. That's because we haven't got a lore of disaster that's passed between colleagues and from generation to generation. So let's do some storytelling. Once upon a time there was a toilet... and once upon a time there was a bridge...

Clementine

Lessons learned: Schedules can't rule. Never sacrifice testing. Tired people make mistakes. Error handlers save systems.

Clementine was a 1994 mission designed to test new technologies. It mapped the Moon for two months, then was to go on to asteroid 1620 Geographos. The final report: "An inadequate schedule ensured the spacecraft was launched without all the software having been written and tested." And: "spacecraft performance was marred by numerous computer crashes."

Schedule pressure: launch dates are constrained by planetary geometry - a Mars window opens only every couple of years. But that does not allow capricious deadlines; if the schedule is driven by external forces, management must accept a high level of risk.

Repeat that finding: "An inadequate schedule ensured the spacecraft was launched without all the software having been written and tested." Skimp on testing and the system will be a disaster.

Tired people make mistakes: the final report faults the ambitious schedule, which produced tired and stressed-out workers. Yet the final report says nothing about what actually happened. That's in the smoking-gun memo, written a year later at APL, which NASA claimed was destroyed. I got it from an unhappy engineer.

What happened: after Clementine left lunar orbit, the data in the telemetry packets stopped changing - the main computer was locked up. The ground crew spent 20 minutes trying to get a "soft" reset to work (a long-distance Ctrl-Alt-Del?) and finally sent a hard reset, which worked. But all the fuel was gone! Though the spacecraft was not supposed to be thrusting, the runaway software had wandered off and enabled the thrusters. The software timeout on the thrusters did not work - the same fault that caused the crash. The CPU had a built-in watchdog timer, but it was not used: there was no time to write the code. What - five lines of code! (See the sketch below.) The system had experienced 16 previous hangs, all cured by the soft reset.

The smoking-gun memo, by the way, is titled "How Clementine REALLY failed and what NEAR can learn" - NEAR being the Near Earth Asteroid Rendezvous mission.
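About those five lines: here's roughly what servicing a built-in watchdog looks like. This is a minimal sketch against a hypothetical memory-mapped watchdog peripheral - the register names and addresses are invented for illustration, not taken from Clementine's hardware.

```c
#include <stdint.h>

/* Hypothetical watchdog registers; real parts differ. */
#define WDT_CTRL (*(volatile uint32_t *)0x40001000u)
#define WDT_KICK (*(volatile uint32_t *)0x40001004u)

static void wdt_enable(void)  { WDT_CTRL = 1u; }    /* start the timer */
static void wdt_service(void) { WDT_KICK = 0x5Au; } /* reload ("kick") */

int main(void)
{
    wdt_enable();
    for (;;) {
        /* ... do the real work ... */
        wdt_service(); /* if the code hangs, the kicks stop and the
                          hardware forces the reset the ground crew
                          had to send by hand, 20 minutes too late */
    }
}
```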

NEAR

Lessons learned: Tired people make mistakes. Use the VCS. Test everything!

NEAR was launched two years later - and suffered a huge failure three years before its planned rendezvous with the asteroid Eros. It started a burn; the engine immediately shut down because the normal start-up transient exceeded a limit that had been set too low. Then the spacecraft went quiet for 27 hours. During that time it dumped most of its fuel, initiating THOUSANDS of thruster firings. The abort script was incorrect in the way it transferred attitude control to the momentum wheels.

A quote from the Clementine smoking-gun memo: "When I informed them that NEAR had no ability to accept ground-commanded reset or WDT, they expressed total disbelief, and advised me to scream and kick, if necessary, to get a hardware reset."

Tired people make mistakes: the project was typically 20-30% short of planned staff. Just like Clementine.

Use the VCS: scripts were lost, and flight code was stored on uncontrolled servers. There were two different versions of the "version 1.11" code. (Hold that thought for the FAA's stolen-software story, coming up.)

Test everything: the simulator used for testing was unreliable, so not all scripts were checked on it... including the abort script that caused the failure. That's typical of prototype hardware (PCB versus wire-wrap) - you know how it is, we accept "glitches" because we don't trust the hardware.

And yet the engineers saved the mission. They added a year for a low-energy trajectory, and on February 12, 2001, NEAR landed on Eros. What amazing guys! Engineers rock! But why didn't they learn from the experience of Clementine? We must learn from disaster.

Mars Polar Lander / Deep Space 2

Lessons learned: Tired people make mistakes. Test everything! Test like you fly; fly what you test.

December 1999, a triple failure: the twin DS2 probes were to be released five minutes before MPL encountered Mars' atmosphere, and the lander and both probes failed. The goal had been to deliver a lander to Mars for half the cost of Pathfinder, which itself was much cheaper than earlier planetary missions.

Tired people make mistakes: the report found that LMA used excessive overtime, with workers averaging 60-80 hours a week for extended periods of time.

Test everything: the report cites insufficient up-front design, and the use of analysis and modeling instead of test and validation. Though I'm a great believer in modeling, people pushing it - UML - to the exclusion of all else are on drugs. Only when the system gets used, in real life, do we find the real problems. Relying on a model alone is a problem because if the model is wrong, so is the code. Testing is like double-entry bookkeeping: it ensures everything is right.

Test like you fly, fly what you test: there was no impact test of a running, powered DS2 system. One was planned, but was deleted midway through the project due to schedule considerations. One possible reason for the DS2 failures (two units!) is electronics failure on impact. A second possibility is ionization around the antenna after impact, reducing transmitter emissions. But the antenna was never tested in Mars' 6-torr atmosphere.

As for the lander, here is what they believe happened. The landing legs deployed at 1500 meters. At 40 meters the software starts looking at three touchdown sensors, one per leg; any signal causes the code to shut down the engine. It was known that deploying the legs at 1500 meters would create a transient on those sensors... but the software people did not account for that. It was NOT in the software requirements. The transient got latched; when the code started looking at the sensors at 40 meters, the latched transient looked like the spacecraft had landed, and it shut down the motors. A system test failed to pick up the problem because the sensors were miswired. The wiring error was found - but the test was not repeated.
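The touchdown logic is a textbook case for defensive sensor handling. Here's a minimal sketch of the general fix - my illustration, not JPL's flight code, with invented names and thresholds: clear any state latched before the sensing window opens, and require several consecutive readings before believing a sensor.

```c
#include <stdbool.h>
#include <stdint.h>

#define SENSE_BELOW_M      40.0f /* only trust sensors below this altitude */
#define CONSECUTIVE_NEEDED 3     /* debounce: one-sample transients rejected */

/* Hypothetical HAL; names invented for illustration. */
extern float read_altitude_m(void);
extern bool  read_touchdown_sensor(int leg);

bool touchdown_detected(void)
{
    static uint8_t run[3]; /* consecutive-assertion count per leg */

    if (read_altitude_m() > SENSE_BELOW_M) {
        /* Above the window: discard everything, so a leg-deploy
         * transient latched at 1500 m can't pose as a landing later. */
        run[0] = run[1] = run[2] = 0;
        return false;
    }
    for (int leg = 0; leg < 3; leg++) {
        run[leg] = read_touchdown_sensor(leg) ? (uint8_t)(run[leg] + 1) : 0;
        if (run[leg] >= CONSECUTIVE_NEEDED)
            return true; /* sustained signal: believe it */
    }
    return false;
}
```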

Pathfinder

Lessons learned: There's no such thing as a glitch - believe your tests! Error handlers save systems.

On Mars, Pathfinder suffered repeated system resets. The problem was priority inversion: a low-priority task held a resource needed by a high-priority task while medium-priority tasks kept running, so a deadline was missed and the watchdog reset the system. The team had seen the same "glitch" on Earth during test and shrugged it off. They saw it on Earth! Believe your tests. Once diagnosed, the fix was coded and sent to Mars, 100 million miles away.

This is also a case where the watchdog did work - a great exception handler. WDTs do save systems; they're an essential part of many embedded systems. I hear plenty of arguments that WDTs aren't needed since the code shouldn't crash. But stuff happens: code is not perfect, hardware is erratic, cosmic rays flip bits. And not just on spacecraft - remember the Intel Itanium 2. Now, you may think the VxWorks thing is an isolated issue...
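Pathfinder's fix was to turn on priority inheritance for the offending VxWorks mutex. A rough POSIX equivalent of that setting - a sketch, not the flight code - looks like this:

```c
#include <pthread.h>

/* Create a mutex with priority inheritance: while a low-priority task
 * holds the lock, it temporarily runs at the priority of the highest
 * task blocked on it, so medium-priority work can't starve it. */
int make_pi_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}
```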

Titan IVB / Centaur

Lessons learned: Test like you fly; fly what you test. Use the VCS.

The 1999 launch of a Milstar spacecraft failed. The first stage was fine. Nine minutes in, the Centaur separated and immediately experienced instabilities around the roll axis; after a restart, around yaw and pitch as well. The reaction control system used up all of its fuel trying to compensate. Meant to go to geosynchronous orbit, the spacecraft wound up in a useless low elliptical orbit.

The actual flight configuration - the software plus the data setting the flight parameters - was never tested; tests used a mix of new and old software and data. One data element was wrong. The critical constants had not been put under version control, and were lost; an engineer used a diff file to try to recreate the data, and made an order-of-magnitude error.

We all know about version control - right? How about the FAA? In 1999 a disgruntled programmer quit and deleted the sole - sole! - copy of the source code that controls communications between O'Hare and the regional airports. It sat on his home computer, encrypted; the FBI needed six months to get it back. Starting to see a pattern?
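Version control is the real cure, but a cheap second line of defense is to sanity-check loaded flight constants against physically plausible bounds at startup, so an order-of-magnitude slip fails loudly on the bench rather than in flight. A minimal sketch - the structure and field names here are invented for illustration:

```c
#include <stdbool.h>
#include <stdio.h>

struct param_check {
    const char *name;
    double value;  /* value loaded from the constants file */
    double lo, hi; /* physically plausible range, per the designers */
};

/* Refuse to proceed if any constant is outside its plausible range. */
bool validate_params(const struct param_check *p, int n)
{
    bool ok = true;
    for (int i = 0; i < n; i++) {
        if (!(p[i].value >= p[i].lo && p[i].value <= p[i].hi)) {
            fprintf(stderr, "BAD CONSTANT %s = %g (expected %g..%g)\n",
                    p[i].name, p[i].value, p[i].lo, p[i].hi);
            ok = false;
        }
    }
    return ok;
}
```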

Chinook

Lessons learned: Do reviews... before shipping! Test like you fly; fly what you test.

The Chinook's electronic engine controller had a history of "uncommanded run-ups." A 1994 crash killed the two pilots; there was lots of alleged pilot error, and even more controversy, with many claims of cover-ups. BUT: a code inspection was finally done. The code was so awful the inspectors gave up after auditing only 17% of it. In that fraction they found 485 errors, 56 of which were category one - serious safety issues.

Inspect the code! We've known the value of inspections since 1976... so why did they inspect after the crashes instead of before? Question: how many here religiously inspect all new code? Even today, though we've known since 1976 how important inspections are, we don't do them. We're repeating this mistake.

Testing: the contractor claims to have done 70,000 hours of testing on the software. It turns out they tested the wrong version - not the one that flew. The MoD says the tests were worthless.

That's enough on rockets and aviation. Or is it? I didn't mention civil aviation accidents, like the 757 that crashed in Colombia in 1995, killing 159; the jury found the software vendors 25% responsible for the damages. And instruments fail in flight all the time: on many planes, like the DC-10, instrument failure is so common that the accepted procedure is to reset the circuit breaker - reboot. They put orange collars on those breakers to make them easier to spot.

Therac-25

Lessons learned: Use tested components. Use accepted practices. Use peer reviews.

The Therac-25 was a radiation-therapy machine. Eleven were installed, and between 1985 and 1988 they delivered six massive overdoses, with three fatalities. There are many, many lessons here; these are three of the most important.

Use tested components: use a decent RTOS. Massive problems stemmed from the homemade RTOS - yet today over half of the RTOSes in use are still homemade.

Use accepted practices: there was no test-and-set; all synchronization was done via globals. Yet most real-time developers still do not understand the issues in reentrancy.
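Tasks coordinating through bare shared variables are exactly how race conditions are born. For contrast, here's a minimal sketch of atomic test-and-set in modern C (C11 atomics - illustrative, obviously not the Therac's original code):

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_flag busy = ATOMIC_FLAG_INIT;

/* atomic_flag_test_and_set is one indivisible operation, so two tasks
 * can never both see "not busy" and proceed together - the exact
 * failure a plain global flag permits. */
bool try_enter(void) { return !atomic_flag_test_and_set(&busy); }
void leave(void)     { atomic_flag_clear(&busy); }
```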

Radiation Deaths in Panama

Lessons learned: Test carefully. Write better requirements. Use a defined process and peer reviews.

May 2001: over 20 patients dead. It was possible to enter treatment data in a way that confused the machine: the unit printed a safe treatment plan but delivered an overdose.

Pacemakers

Lessons learned: Test everything! Flash is not a schedule enhancer.

December 1997: Guidant disclosed a pacemaker defect that drove patients' hearts at 190 BPM. The fix? Download new code into the implanted device via an inductive loop.

Therein lies the seduction of flash, and its perils. In 2003 a Japanese woman's pacemaker was reprogrammed by her rice cooker. An arthritis-therapy machine set a pacemaker to 214 BPM and killed the patient. In 1998 an anti-shoplifting device reset a man's pacemaker. And there's the OEM story about units shipped to the Middle East...

All old news; we've gotten better, right??? A JAMA study says otherwise: the recall rate actually got worse in the 1995-2000 period versus 1990-1995. I said testing was a big part of this. Surely those handling the REALLY dangerous stuff know about that? Flash is not a schedule enhancer.
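If we're going to reflash devices inside people's chests, the least the loader can do is verify the image before activating it. A minimal sketch of that idea - the header layout and magic value are invented, not any vendor's protocol; the CRC-32 is the standard reflected algorithm:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct fw_header {       /* hypothetical downloaded-image header */
    uint32_t magic;      /* identifies a valid image for THIS device */
    uint32_t length;     /* payload bytes */
    uint32_t crc32;      /* CRC over the payload */
};

#define FW_MAGIC 0x50414345u   /* arbitrary illustrative value */
#define FW_MAX   (64u * 1024u) /* illustrative size limit */

static uint32_t crc32_of(const uint8_t *p, size_t n)
{
    uint32_t c = 0xFFFFFFFFu; /* standard bitwise CRC-32 */
    while (n--) {
        c ^= *p++;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0xEDB88320u & (0u - (c & 1u)));
    }
    return ~c;
}

bool firmware_acceptable(const struct fw_header *h, const uint8_t *payload)
{
    if (h->magic != FW_MAGIC) return false;          /* not ours: reject */
    if (h->length == 0 || h->length > FW_MAX) return false;
    return crc32_of(payload, h->length) == h->crc32; /* corrupt: reject */
}
```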

Near Meltdown

Lessons learned: Test everything! Improve your error handling.

March 1998, Los Alamos: a criticality experiment. The joystick controlling it failed and returned "?" - ASCII 63 - and the code accepted the character as if it were valid data.

Test everything. And improve your error handling: impossible inputs will and do occur.
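The generic defense is to validate every raw device byte before anyone acts on it. A sketch, with an invented valid-command window (note that "?", 0x3F, falls just outside it):

```c
#include <stdbool.h>
#include <stdint.h>

#define JOY_MIN 0x20u /* hypothetical valid-command window; */
#define JOY_MAX 0x3Eu /* '?' (0x3F) lies just outside it    */

/* Accept a raw joystick byte only if it's plausible; on anything
 * else, fail safe instead of treating garbage as a position. */
bool joystick_read(uint8_t raw, int *cmd_out)
{
    if (raw < JOY_MIN || raw > JOY_MAX)
        return false; /* impossible input: reject, don't guess */
    *cmd_out = (int)raw;
    return true;
}
```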

Our Criminal Behavior

No peer reviews. Inadequate testing. Ignoring or cheating the VCS. Lousy error handlers.

No peer reviews: average uninspected code contains 50-100 bugs per 1000 lines. Inspections find most of these, cheaply - they're 20 times cheaper than debugging and find more problems (work through the numbers on a 10 KLOC project). From both a technical and a business standpoint, inspections make sense.

Inadequate testing: the problem is that testing is always left to the end, so schedule pressure - there go those two lessons again - means we shortchange it. Better to test all the time: adopt the XP approach and integrate continuously. Believe your data; there are no "glitches" (remember Pathfinder). Use realistic tests, as Ariane should have, even when that costs real money. And always remember that a simulation is just a simulation; no matter how good, it's fiction (remember MER).

Ignoring or cheating the VCS: only amateurs skip version control, even on a one-person project. It gives you control and backup. And the team lead is responsible for finding those who leave stuff checked out forever.

Lousy error handlers: it's appalling that the diagnostic bits were ignored on Ariane and at Los Alamos. Impossible inputs will and do occur. Software engineering is topsy-turvy: a mechanical engineer beefs up a strut for more margin, an electrical engineer uses a bigger wire to handle unexpected surges and protects the circuit with fuses and weak links. In software, one error causes total collapse. Worse, we write code as if everything's going to work and there will be no unexpected inputs - hence buffer-overrun problems and the rest. Instead: send the spacecraft to safe mode, use watchdog timers, write proper exception handlers, use the MMU (a sketch of one such failure path follows). This means adopting a culture of anticipating and planning for failures! And for FPGA users it means adopting a philosophy that things do fail.
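One concrete shape for "proper error handling" is a single, well-tested recovery path that every impossible condition funnels into, rather than each check improvising. A minimal sketch - the helper names are hypothetical:

```c
#include <stdnoreturn.h>

/* Hypothetical platform hooks, invented for illustration. */
extern void log_event(const char *msg);  /* persist the cause     */
extern void outputs_to_safe_state(void); /* de-energize actuators */

/* Centralized failure path: record why, make the outputs safe, and
 * stop servicing the watchdog so the hardware restarts the system. */
noreturn void enter_safe_mode(const char *why)
{
    log_event(why);
    outputs_to_safe_state();
    for (;;) { /* no more WDT kicks: a clean hardware reset follows */ }
}

/* Route every "can't happen" onto that path instead of ploughing on. */
#define REQUIRE(cond) \
    do { if (!(cond)) enter_safe_mode("REQUIRE failed: " #cond); } while (0)
```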

Our Criminal Behavior: The Use of Dangerous Tools!

Bug rates found by static analysis (from Andy German, "Software Static Code Analysis Lessons Learned," CrossTalk, Nov 2003, http://www.stsc.hill.af.mil/crosstalk/2003/11/0311German.html):

C (worst): 500 bugs/KLOC
C (average): 167-26
Ada (worst): 50
Ada (average): 25
SPARK (average): 4

I want to take a swipe at some sacred cows. Like C. The slide shows real code from the Obfuscated C Code Contest - real, working code! We're working with fundamentally unmanageable tools, building hugely complex systems out of very fragile components. A decent language should help us write correct code. At this point in the talk you'd expect me to give you the contest's URL - no way! These people are code terrorists who should be hunted down and shot. Any language that lets us do this must be either tamed or outlawed.

Ada was great: if the damn thing compiled, it would work. The tool made us write more-or-less correct code. Why did we abandon that?

The Boss's Criminal Behavior

Schedules can't rule. Corollary: tired people make mistakes.

It's well known that after about 50 hours a week, people are doing their own stuff at the office. And developers are only 55% utilized to begin with; that's 22 hours a week of actual engineering. Tired people make mistakes - XP recognizes this. Fatigue was implicated in Clementine, NEAR, Mars Polar Lander and many others.

Are we criminals? Or are we still in the dark ages?

I used the phrase "criminal behavior." Why? Because, in my opinion, that's what the courts will soon be saying. In no other industry can you ship products with known defects and not get sued. How long do you think that's going to last? The lawyers are going to find fresh meat in the embedded systems industry; expect high-tech ambulance chasing in the next few years.

At the moment we're in the dark ages, like the early bridge builders: we're still learning how to build reliable code. But there is a lot we do know, so we're negligent - and will be culpable - if we don't consistently use best practices. If we continue to ignore these lessons we're crooks, defrauding our customers.

It is possible to learn from failure, to build a body of knowledge from experience. One of the painful parts of parenting is finding your teenagers unable to learn from the experience of others, and sometimes I wonder if we all must learn things the hard way. So I give you this body of knowledge. Go forth and use it wisely, my friends. I don't want any of you to find 60 Minutes on your doorstep after some disaster.