1 Reliability Engineering Program, University of Maryland at College Park, September 5, 2001. Integrating the Contribution of Software into Probabilistic Risk Assessment
2 Research Objectives Probabilistic Risk Assessment (PRA) is a technique for assessing the probability of failure or success of a mission. Current PRA neglects the contribution of software to mission risk. The objective of our research is to extend current PRA methodology so that software is integrated into the risk assessment process. The approach will be tested on a subsystem of the Space Station PRA.
3 The PRA Process PRA is a process designed to answer four basic questions:
1. What can go wrong?
2. What are the consequences of things going wrong?
3. How likely are these undesirable consequences?
4. How confident are we about our answers to the above questions?
4 What Can Go Wrong Mariner I Venus Probe Loses Its Way (1962) –A probe launched from Cape Canaveral was bound for Venus. After liftoff, the unmanned rocket carrying the probe went off course, and NASA had to destroy it to avoid endangering lives on Earth. NASA later attributed the error to a faulty line of Fortran code: a hyphen had been dropped.
5 What Can Go Wrong Mars Polar Lander (MPL) Failure (2000) –Premature shutdown of the descent engine on the $165 million MPL spacecraft is the most likely cause of the mission’s failure. The three landing legs sent spurious signals to the MPL’s computer, convincing it that the legs had touched down on the Martian surface, so it turned off the descent engine used to slow the spacecraft in the final seconds before landing.
6
7 Software Failure Mode Taxonomies Diverse failure mode taxonomies have been proposed in the literature:
–Chillarege, Kao and Condit: function, interface, checking, assignment, timing/serialization, build/package/merge, documentation, and algorithm;
–Lutz (software failures caused by requirements): inadequate interface requirements, and discrepancies between the documented requirements and the requirements actually needed for correct functioning of the system;
–Smidts, Stutzke and Stoddard: process failure modes and product failure modes.
8 Software Failure Mode Taxonomies An ideal classification should: cover the entire spectrum of possible failures; consist of mutually exclusive failure modes; focus on product failure modes.
9 Software Functional Failure Modes Building on existing taxonomies, we obtain the following classes:
–Omission of a function;
–Incorrect realization of a function;
–Implementation of a function that was not specified in the requirements.
[Figure: diagram with labels F0.1, Word Processor, Utility Software, RC, F1.2, VDC, F1.3, ICC, F1.4, VDWD, F1.5, Pr(1)]
10 Software Functional Failure Modes
–Omission of one of the attributes of a function;
–Incorrect realization of one of the attributes of a function;
–Introduction of an attribute not specified in the requirements;
–Omission of one of the functions in the set S;
–Introduction of a function not in set S;
–Replacement of a function in set S by another function.
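The set-level failure modes above can be illustrated with a small comparison between what was specified and what was built. This is only a sketch: it assumes S denotes the set of functions specified in the requirements (the slide does not define S), and the function names are hypothetical word-processor functions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetLevelFindings:
    omitted: frozenset      # functions in S that were not implemented
    introduced: frozenset   # implemented functions that are not in S


def compare_function_sets(specified, implemented):
    """Compare the specified set S against the set actually built."""
    s, built = frozenset(specified), frozenset(implemented)
    return SetLevelFindings(omitted=s - built, introduced=built - s)


# Hypothetical function names, for illustration only.
S = {"open_file", "save_file", "spell_check"}
built = {"open_file", "save_file", "auto_format"}

findings = compare_function_sets(S, built)
print(sorted(findings.omitted))     # ['spell_check']
print(sorted(findings.introduced))  # ['auto_format']
```

An omission that appears together with an introduction may indicate a replacement, but deciding that requires engineering judgment about whether the new function took the place of the missing one. The sketch deliberately reports the two sets separately.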
11 Interaction Failure Modes Interaction failure modes are divided into: input/output failure modes; support failure modes; environmental impact factors.
12 Input/Output Failure Modes
Interaction with hardware
–Input: electrical signals (originating from a sensor)
–Output: electrical signals (sent to an actuator)
Interaction with a human
–Input: data or control information (entered through keyboard, computer screen, voice)
–Output: data, recommended activities, warnings (produced through the software interface)
Interaction with software
–Data
13 Input/Output Failure Modes (characteristic: definition; failure modes)
–Amount: the total number or quantity of input or output. Failure modes: “Too much” and “Too little”, for instance the omission or the repetition of an input or output.
–Load: the quantity that can be carried at one time by a specified input or output medium. Failure mode: “Overload”.
–Value: the value taken by the input or output quantity. Failure mode: “Incorrect value”.
–Time: the point at which the input or output occurs. Failure modes: “Premature (too early)”, “Delayed (too late)” and “Omitted (no input/output within the time interval allowed)”.
–Rate: the frequency at which the input is sent or the output is received. Failure modes: “Too fast” and “Too slow”.
–Duration: the time period during which the input or output lasts. Failure modes: “Too long” and “Too short”.
–Range: the limits of the input/output quantity. Failure mode: “Out of range”.
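As a rough illustration of how these characteristics might be checked, the sketch below classifies a single observed input or output against a requirement for the Time, Range and Duration characteristics. The `IOSpec` and `IOEvent` types, their fields and the numeric limits are hypothetical and are not part of the original taxonomy. Amount, Load, Rate and Value are left out because they need a stream of events or a reference value to compare against.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IOSpec:
    """Hypothetical requirements for one input or output signal."""
    earliest: float      # start of the allowed time window (s)
    latest: float        # end of the allowed time window (s)
    low: float           # lower range limit
    high: float          # upper range limit
    min_duration: float  # shortest acceptable duration (s)
    max_duration: float  # longest acceptable duration (s)


@dataclass
class IOEvent:
    """One observed input or output."""
    time: float
    value: float
    duration: float


def classify(spec: IOSpec, event: Optional[IOEvent]) -> List[str]:
    """Return the failure modes from the table that the event exhibits."""
    if event is None:
        # Nothing arrived within the allowed time interval.
        return ["Omitted"]
    modes = []
    if event.time < spec.earliest:
        modes.append("Premature (too early)")
    elif event.time > spec.latest:
        modes.append("Delayed (too late)")
    if not spec.low <= event.value <= spec.high:
        modes.append("Out of range")
    if event.duration > spec.max_duration:
        modes.append("Too long")
    elif event.duration < spec.min_duration:
        modes.append("Too short")
    return modes


spec = IOSpec(earliest=1.0, latest=2.0, low=0.0, high=5.0,
              min_duration=0.1, max_duration=0.5)
print(classify(spec, IOEvent(time=0.4, value=7.2, duration=0.3)))
# ['Premature (too early)', 'Out of range']
print(classify(spec, None))  # ['Omitted']
```

A single event can exhibit several failure modes at once, which is why the function returns a list and not a single label.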
14 Support Failure Modes
CPU failures
–lead to degraded functionality or loss of function of the software.
Memory failures
–induce failures due to resource competition, resource shortage, or unavailability of resources.
Peripheral device failures
–failures of the printer, input devices, display, network, disk, tapes or other devices
–lead directly to software malfunction.
Shared resource failures
–“deadlock” and “synchronization”
15 Environmental Impact Factors Environmental impact factors include:
–interference with electronic or other signals, barometric pressure, low gravity, fires, floods, snow, temperature, air conditioning, saline atmosphere, humidity, natural disasters, etc.
Environmental impact factors can be divided into:
–immediate impact;
–insidious impact.
16 Conclusions and Further Studies We have established a list of software failure modes that can be used to identify potential contributions of software in PRA. These failure modes need to be accounted for in the PRA model as:
–Initiating Events,
–Intermediate Events, or
–End States.
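To make the three roles concrete, the sketch below represents one accident scenario as an initiating event, a chain of intermediate events and an end state, with a software failure mode appearing as one of the intermediate events. It is a minimal illustration and not the method proposed here: the event names and all probabilities are invented, and each intermediate probability is assumed to be conditional on the events that precede it in the chain.

```python
from math import prod

# Hypothetical scenario: every name and number is an illustrative assumption.
scenario = {
    # (description, probability per mission)
    "initiating_event": ("Sensor sends a spurious signal", 1e-2),
    # (description, conditional probability given the preceding events)
    "intermediate_events": [
        ("Software accepts the signal unchecked (incorrect realization)", 0.1),
        ("Operator does not override in time", 0.5),
    ],
    "end_state": "Loss of mission",
}


def scenario_probability(scenario):
    """Multiply the initiating-event probability by each branch probability."""
    _, p_init = scenario["initiating_event"]
    branches = [p for _, p in scenario["intermediate_events"]]
    return p_init * prod(branches)


p = scenario_probability(scenario)
print(f"{scenario['end_state']}: {p:.1e}")  # Loss of mission: 5.0e-04
```

Placing the software failure mode in the chain makes its contribution to the end-state probability explicit, which is the kind of accounting the list above calls for.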
17 Conclusions and Further Studies Our current research is focused on answering the three remaining questions of PRA. Further work will apply our approach to an example control system: the Guidance, Navigation and Control (GNC) system of the International Space Station (ISS).