1
Disaster Learning from Jack Ganssle
I'm here to talk to you about failures. Not because I embrace them, but because we can learn so much from them. We could learn as much from success, but we don't. When the miracle occurs and we ship the product, we're so astonished and relieved that we don't want to think about it anymore, other than those secret night fears of calls from irate customers as they start discovering bugs.

At this show you'll attend a lot of classes: formal teaching venues using math, analysis, studies and more to get a point across. Not here. We embedded people haven't learned to be afraid of software. Fear, like any emotional response, comes from experience, often experience passed along from generation to generation. So this morning let's pretend we're sitting around a campfire, telling stories about how to avoid the saber-toothed tiger and how to find the best-tasting roots and berries.

One of my concerns is that in 50 years of software engineering and 30 of embedded systems we have accumulated lots of experiential data that tells us the right and wrong ways to do things. Yet we ignore that experience and repeat the same old dysfunctional development techniques we've always used. That's because we haven't built a lore of disaster that gets passed between colleagues and from generation to generation. So let's do some storytelling. Once upon a time there was a toilet… Once upon a time there was a bridge…
2
Clementine Lessons learned: Schedules can’t rule
A 1994 mission designed to test new technologies: map the Moon for two months, then go on to asteroid 1620 Geographos. The final report said, "An inadequate schedule ensured the spacecraft was launched without all the software having been written and tested," and, "spacecraft performance was marred by numerous computer crashes."

Schedule pressure: launch dates are constrained by planetary geometry (a Mars window opens only every couple of years), but that does not allow capricious deadlines. If the schedule is driven by external forces, management must accept a high level of risk.

Repeat it: "An inadequate schedule ensured the spacecraft was launched without all the software having been written and tested." Skip testing, and the system will be a disaster.

Tired people make mistakes: the final report faults the ambitious schedule, which produced tired and stressed-out workers. Yet the report says nothing about what actually happened. A smoking-gun memo written a year later at APL, which NASA claimed had been destroyed, tells the rest; I got it from an unhappy engineer. After Clementine left lunar orbit, the data being sent in its telemetry packets stopped changing: the main computer was locked up. The ground crew spent 20 minutes trying to get a "soft" reset to work (a long-distance Ctrl-Alt-Del?) and finally sent a hard reset, which worked. But all the fuel was gone. Though the spacecraft was not supposed to be thrusting, the hung software had wandered off and enabled the thrusters, and the software timeout on the thrusters did not work; that was the cause of the loss. The CPU had a built-in watchdog timer, and it was not used; there was no time to write the code. What! Five lines of code!!! The spacecraft had experienced 16 previous hangs, all cured by the software reset.

The "Smoking Gun Memo" is titled "How Clementine REALLY failed and what NEAR can learn," NEAR being the Near Earth Asteroid Rendezvous mission.

Never sacrifice testing. Tired people make mistakes. Error handlers save systems.
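For scale, servicing a hardware watchdog really is only a few lines of code. Here's a minimal sketch in C; the register names, addresses and reload value are hypothetical placeholders, since every CPU's watchdog peripheral is different.

    /* Minimal watchdog sketch. Register names, addresses and the reload
     * value are hypothetical; use the actual CPU reference manual. */
    #include <stdint.h>

    #define WDT_CTRL (*(volatile uint32_t *)0x40001000u) /* hypothetical control register */
    #define WDT_KICK (*(volatile uint32_t *)0x40001004u) /* hypothetical reload register  */

    void wdt_init(void)
    {
        WDT_CTRL = 0x0001u;      /* enable the watchdog, e.g. a ~1 s timeout */
    }

    void wdt_kick(void)
    {
        WDT_KICK = 0xA5A5u;      /* reload the counter before it expires */
    }

    /* Kick the watchdog only from a point that proves the whole system is
     * healthy (say, the main loop after every task has checked in), never
     * from a timer interrupt that keeps running while the tasks are hung. */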
3
NEAR Lessons Learned: Tired people make mistakes. Use the VCS
Test everything! NEAR was launched two years later, and suffered a huge failure three years before its planned rendezvous with asteroid Eros. It started a burn; the engine immediately shut down because the normal start-up transient exceeded a limit that had been set too low. Then the spacecraft went quiet for 27 hours. During that time it dumped most of its fuel, initiating thousands of thruster firings. The abort script was incorrect in the way it transferred attitude control to the momentum wheels.

A quote from the Clementine smoking-gun memo: "When I informed them that NEAR had no ability to accept ground-commanded reset or WDT, they expressed total disbelief, and advised me to scream and kick, if necessary, to get a hardware reset."

The team was typically 20-30% short of planned staff, just like Clementine. Scripts were lost, and flight code was stored on uncontrolled servers; there were two different versions of the "version 1.11" code. (The FAA has its own stolen-software story; more on that later.) The simulator used for testing was unreliable, and not all scripts were checked on it, including the abort script that caused the failure. That's typical of prototype hardware: you know how it is, we accept "glitches" because we don't trust the hardware, the way we shrug off a wire-wrapped prototype versus a finished PCB.

And yet the engineers saved the mission. They added a year for a low-energy trajectory, and on February 12, 2001 NEAR landed on Eros. What amazing guys! But why didn't they learn from the experience of Clementine? Engineers rock! We must learn from disaster.
4
Mars Polar Lander/Deep Space 2
Lessons learned: Tired people make mistakes. Test everything! Test like you fly; fly what you test.

December 1999, a triple failure: the twin DS2 probes were to be released five minutes before MPL encountered Mars' atmosphere. Both the lander and the two probes failed. The goal was to deliver a lander to Mars for half the cost of Pathfinder, which itself was much cheaper than earlier planetary missions.

Tired people make mistakes: the report found that LMA used excessive overtime, with workers averaging long weeks for extended periods of time.

Test everything: the report cites insufficient up-front design; analysis and modeling were used instead of test and validation. Though I'm a great believer in modeling, people pushing it (UML) to the exclusion of all else are on drugs. Only when the system gets used, in real life, do we find the real problems. Modeling alone is a problem because if the model is wrong, so is the code. Testing is like double-entry bookkeeping: it ensures everything is right.

Test like you fly, fly what you test: there was no impact test of a running, powered DS2 system. One was planned, but it was deleted midway through the project due to schedule considerations. One possible reason for the failure of DS2 (two units!) was an electronics failure on impact. A second possibility is ionization around the antenna after it impacted, reducing transmitter emissions; but the antenna was never tested in Mars' 6-torr environment.

As for the lander, they believe the landing legs deployed at 1500 meters. At 40 meters the software starts looking at three touchdown sensors, one per leg; any signal causes the code to shut down the engine. It was known that deploying the legs at 1500 m would create a transient in these sensors, but the software people did not account for that; it was not in the software requirements. The transient got latched, and when the code started looking at the sensors at 40 m, the latched transient looked like the spacecraft had landed. It shut down the motors. A system test failed to pick up the problem because the sensors were miswired. The wiring error was found, but the test was not repeated.
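As an illustration of the missing defensive logic (not the actual MPL flight code; the names, thresholds and persistence rule here are my own), a touchdown check might discard anything latched before the sensing window opens and require the signal to persist before acting on it:

    /* Illustrative sketch only; hypothetical names and thresholds.
     * Ignore any touchdown indication latched before the sensing window
     * opens, and require the signal to persist before shutting down. */
    #include <stdbool.h>

    #define ARM_ALTITUDE_M 40.0f /* start trusting the touchdown sensors here */
    #define PERSIST_READS  2     /* consecutive reads required per leg */

    static int confirm_count[3]; /* one counter per leg sensor */

    bool touchdown_detected(float altitude_m, const bool raw_sensor[3])
    {
        if (altitude_m > ARM_ALTITUDE_M) {
            /* Sensing window not open yet: throw away anything latched
             * during leg deployment instead of carrying it forward. */
            confirm_count[0] = confirm_count[1] = confirm_count[2] = 0;
            return false;
        }
        for (int leg = 0; leg < 3; leg++) {
            confirm_count[leg] = raw_sensor[leg] ? confirm_count[leg] + 1 : 0;
            if (confirm_count[leg] >= PERSIST_READS)
                return true;     /* a sustained signal, not a transient */
        }
        return false;
    }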
5
Pathfinder Lessons learned: There's no such thing as a glitch – believe your tests! Error handlers save systems
The problem: during the mission the spacecraft kept resetting itself. The "glitch" had already been seen on Earth during testing. It was a priority inversion. They diagnosed it, fixed it, and sent the corrected code to Mars, a hundred million miles away.

They saw it on Earth! And this is a case where the watchdog did work; a great exception handler kept the mission going. WDTs do save systems; they're an essential part of many embedded systems. I hear plenty of arguments that WDTs aren't needed since the code shouldn't crash. But stuff happens: code is not perfect, hardware is erratic, cosmic rays flip bits. And this is not just a spacecraft issue; think of the Intel Itanium 2. Now, you may think the VxWorks thing is an isolated issue…
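The widely reported cure for this class of problem is priority inheritance on the shared lock, so a low-priority task holding it gets temporarily boosted when a high-priority task blocks on it. Here's a minimal sketch using POSIX threads purely for illustration (VxWorks exposes the same idea through its mutex-semaphore options); it is not the actual flight code.

    /* Sketch of the generic cure for priority inversion: a mutex that
     * uses priority inheritance. Illustrative only. */
    #include <pthread.h>

    pthread_mutex_t bus_mutex;

    int init_bus_mutex(void)
    {
        pthread_mutexattr_t attr;
        pthread_mutexattr_init(&attr);
        /* Without this, a medium-priority task can starve the low-priority
         * lock holder and leave the high-priority task blocked for ages. */
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        return pthread_mutex_init(&bus_mutex, &attr);
    }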
6
Titan IVb Centaur Lessons Learned: Test like you fly; fly what you test. Use the VCS
The 1999 launch of a Milstar spacecraft failed. The first stage was fine, but nine minutes in, when the Centaur upper stage separated, it immediately experienced instabilities around the roll axis, and after a restart around yaw and pitch as well. The reaction control system used up all of its fuel trying to compensate. The spacecraft was meant to go to geosynchronous orbit but wound up in a useless low elliptical orbit.

The actual flight configuration, the software plus the data that set the flight parameters, was never tested; the tests used a mix of new and old software and data. One data element was wrong. The critical constants had not been kept under version control and were lost, so an engineer used a diff file and tried to recreate the data, and made an order-of-magnitude error.

We all know about version control, right? How about the FAA? In 1999 a disgruntled programmer quit and deleted the sole – sole – copy of the source code that controls communications between O'Hare and the regional airports. He had taken a copy home and encrypted it; it took the FBI six months to recover the code. Starting to see a pattern?
7
Chinook Lessons Learned: Do reviews… before shipping! Test like you fly; fly what you test
The Chinook's electronic engine controller had a history of "uncommanded run-ups." A 1994 crash killed the two pilots; there was a lot of pilot-error controversy, with many claims of cover-ups and more. But here's the point: a code inspection was eventually done, and the code was so awful the auditors gave up after examining only 17% of it, having found 485 errors, 56 of which were category one: serious safety issues.

Inspect the code. We've known the value of inspections since 1976, so why did they do the inspection after the crash instead of before? Question: how many here religiously inspect all new code? Even today, though we've known since 1976 how important code inspections are, we don't do them. We're repeating this mistake.

Testing: the contractor claims to have done 70,000 hours of testing on the software. It turns out they tested the wrong version, not the one that flew. The MoD says the tests were worthless.

That's enough on rockets and aviation. Or is it? I didn't mention civil aviation accidents, like the 757 that crashed in Colombia in 1995, killing 70; the jury found the software vendors 25% responsible for the damages. The crew had lost all their instruments. On many planes, like the DC-10, instrument failure is very common. The accepted procedure is to reset the circuit breaker. Reboot. They put orange collars on the breakers to make them easier to spot.
8
Therac 25 Lessons Learned: Use tested components; use accepted practices; use peer reviews
Describe the instrument: a computer-controlled radiation therapy machine, with 11 installed. Between 1985 and 1987 it delivered massive overdoses; there were three fatalities. There are many, many lessons learned; here are three of the most important.

Use a decent RTOS: massive problems came from the homemade RTOS. Yet today over half of the RTOSes in use are still homemade.

There was no test-and-set; all synchronization was done via globals. Yet most real-time developers still do not understand the issues in reentrancy.

Use accepted practices. Use peer reviews.
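Since the lesson calls out the lack of a test-and-set and synchronization through globals, here is a minimal sketch of the difference, assuming a C11 toolchain; the names are hypothetical and this is illustrative only, not Therac-25 code.

    /* Illustrative only. A shared flag updated from both task level and
     * an interrupt, "protected" by nothing but a plain global, is a
     * classic reentrancy bug: the read-modify-write can be torn.
     * C11 atomics provide a real test-and-set. */
    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_flag treatment_busy = ATOMIC_FLAG_INIT;

    bool try_start_treatment(void)
    {
        /* Atomic test-and-set returns the previous value, so only one
         * context can ever see "was clear" and proceed. */
        if (atomic_flag_test_and_set(&treatment_busy))
            return false;        /* another context already owns it */
        /* ... set up and verify the treatment, then fire ... */
        return true;
    }

    void end_treatment(void)
    {
        atomic_flag_clear(&treatment_busy);
    }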
9
Radiation Deaths in Panama
May 2001: more than 20 patients died. It was possible to enter data in such a way as to confuse the machine: the unit printed a safe-looking treatment plan but delivered an overdose.

Lessons learned: Test carefully. Write better requirements. Use a defined process and peer reviews.
10
Pacemakers Lessons Learned: Test everything! Flash is not a schedule enhancer
December 1997, Guidant: a pacemaker running at 190 BPM. Test everything. The fix was to download new code into the implanted device over an inductive loop, and new code was indeed downloaded.

The seduction of flash: flash has its perils. A Japanese woman's pacemaker was reprogrammed by her rice cooker. An arthritis-therapy machine set a pacemaker to 214 and killed the patient. In 1998 an anti-shoplifting device reset a man's pacemaker. (And there's the OEM story about units shipped to the Middle East.)

That's all old news; we've gotten better, right??? A JAMA study says otherwise, and the rate got worse in the later period versus 1990-95. I said testing was a big part of this. Surely those handling the REALLY dangerous stuff know about that?
11
Near Meltdown Lessons Learned: Test everything! Improve error handling
March 1998, Los Alamos: a criticality experiment. The joystick failed and returned "?", ASCII 63, and the code accepted that impossible value as data instead of trapping it. Test everything. Improve error handling.
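A minimal sketch of the kind of input checking that lesson implies, with hypothetical names and ranges: an impossible reading from a failed device should trip the error handler, never be treated as data.

    /* Illustrative sketch, hypothetical names and limits: validate every
     * field input before acting on it. A failed device that returns '?'
     * (ASCII 63) must trip the fault path, not become a motion command. */
    #include <stdbool.h>
    #include <stdlib.h>

    #define JOYSTICK_MIN 0
    #define JOYSTICK_MAX 100

    typedef enum { CMD_OK, CMD_FAULT } cmd_status;

    cmd_status read_joystick(const char *raw, int *position_out)
    {
        char *end;
        long value = strtol(raw, &end, 10);

        /* Non-numeric input, trailing junk, or a value outside the
         * physical range is a hardware fault, never motion data. */
        if (end == raw || *end != '\0' ||
            value < JOYSTICK_MIN || value > JOYSTICK_MAX) {
            return CMD_FAULT;    /* caller must go to a safe state */
        }
        *position_out = (int)value;
        return CMD_OK;
    }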
12
Our Criminal Behavior No Peer Reviews Inadequate testing
No peer reviews: average uninspected code contains many bugs per 1000 lines of code. Inspections find most of these, cheaply; they are about 20 times cheaper than debugging and find more problems (run the numbers on a 10 KLOC project). From both a technical and a business standpoint, inspections make sense.

Inadequate testing: the problem is that testing is always left to the end, and then schedule pressure (the two lessons are related) means we shortchange it. Better: test all the time; adopt the XP approach, continuous integration. Believe your data: there are no glitches (Pathfinder). Use realistic tests, as Ariane teaches, even when that costs real money. And always remember, a simulation is just a simulation; no matter how good it is, it's fiction (MER).

Ignoring or cheating the VCS: only amateurs don't use a version control system, even on a one-person project. It gives you control and backup. The team lead is responsible for finding those who leave things checked out forever.

Lousy error handlers: it's appalling that the diagnostic bits were ignored in Ariane and at Los Alamos. Impossible inputs will and do occur. Software engineering is topsy-turvy: a mechanical engineer beefs up a strut for more margin, an electrical engineer uses a bigger wire to handle unexpected surges and protects the circuit with fuses and weak links. In software, one error causes total collapse. Worse, we write code as if everything is going to work and there will be no unexpected inputs; thus buffer-overrun problems and the like. The cures: a safe mode the system can drop into, watchdog timers (though few are done well), proper exception handlers, and an MMU to contain the damage. This means adopting a culture of anticipating and planning for failures! And for FPGA users it means adopting a philosophy that things do fail!
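As a sketch of what a culture of planning for failure can look like in code, here is a hypothetical default fault handler; the hook functions are assumptions for illustration, not any particular system's API.

    /* Hypothetical default fault handler: on any unexpected error, make
     * the system safe, preserve evidence, and let the watchdog restart
     * us cleanly rather than limping on with corrupt state. */
    #include <stdint.h>

    extern void shut_down_actuators(void);  /* assumed board-specific hook */
    extern void log_fault(uint32_t code);   /* assumed board-specific hook */
    extern void await_watchdog_reset(void); /* assumed board-specific hook */

    void default_fault_handler(uint32_t fault_code)
    {
        shut_down_actuators();   /* first, put the system into a safe mode */
        log_fault(fault_code);   /* keep evidence for later diagnosis */
        await_watchdog_reset();  /* let the WDT give us a clean restart */
    }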
13
Our Criminal Behavior The use of dangerous tools!
Defect densities from static analysis (Andy German, QinetiQ Ltd., "Software Static Code Analysis Lessons Learned," CrossTalk, November 2003):

C (worst): 500 bugs/KLOC
C (average): –
Ada (worst): 50 bugs/KLOC
Ada (average): 25 bugs/KLOC
SPARK (average): 4 bugs/KLOC

I want to take a swipe at some sacred cows. Like C. The code on the slide is real code from the Obfuscated C Code Contest. Real, working code! We're working with fundamentally unmanageable tools, building hugely complex systems out of very fragile components. A decent language should help us write correct code. At this point in the talk you'd expect me to give you the URL. No way! These people are code terrorists who should be hunted down and shot. Any language that lets us do this must be either tamed or outlawed.

Ada was great: if the damn thing compiled, it would work. The tool made us write more or less correct code. Why did we abandon that?
14
The Boss’s Criminal Behavior
Schedules can't rule. Corollary: tired people make mistakes. It's well known that after about 50 hours a week people are doing their own stuff at the office. And developers are only 55% utilized anyway; that's about 22 hours a week of actual engineering. Tired people make mistakes, and XP recognizes this. Fatigue was implicated in Clementine, NEAR, Mars Polar Lander and many others.
15
Are we criminals? Or are we still in the dark ages?
I used the phrase "criminal behavior." Why? Because, in my opinion, that's what the courts will soon be saying. In no other industry can we ship products with known defects and not get sued. How long do you think that's going to last? The lawyers are going to find fresh meat in the embedded systems industry; expect high-tech ambulance chasing in the next few years.

At the moment we're still in the dark ages, like the early bridge builders; we're still learning how to build reliable code. But there's a lot we do know and don't practice, so we're negligent, and will be culpable, if we don't consistently use best practices. If we continue to ignore these lessons we're crooks, defrauding our customers.

It is possible to learn from these things, to build a body of knowledge from experience. One of the painful parts of parenting is finding your teenagers unable to learn from the experience of others; sometimes I wonder if we all must learn things the hard way. So I give you this body of knowledge. Go forth and use it wisely, my friends. I don't want any of you to find 60 Minutes on your doorstep after some disaster.