Software Reliability Engineering By Jackie Wadzinski
The Patriot Missile
- Used to destroy incoming Iraqi Scud missiles
- Hailed for effectiveness
- Operated for 100 consecutive hours
- 28 American soldiers killed
- Cause: software failure
The Patriot Missile: A Learning Experience
- The software can be redesigned
- A new Patriot Missile can be built
- The fate of the 28 soldiers remains the same
- The moral: software engineers need to find a way to engineer reliability into software
Objectives
- Definition of Software Reliability
- Importance of Reliability Engineering
- Why Reliability Engineering is Difficult
- Reliability Engineering Processes
  - Weibull
  - Musa
  - Monte Carlo
- Conclusion
What is Software Reliability?
- IEEE definition: “The ability of a system or component to perform its required functions under stated conditions for a specified period of time.”
- The definition allows for a “just right” level of reliability for software
- Software reliability and hardware reliability have the same definition
Why is Software Reliability Important?
- Manager view
  - Reliable software means satisfied customers
  - Reliable software means repeat customers
  - Reliable software is ethical
  - Legal liability
- Customer view
  - Reliable software saves time
  - Reliable software increases efficiency
Why Software Reliability is Difficult to Calculate
- Without considering program evolution, the failure rate is statistically nonexistent
- Design defects, from which failures arise, have many possible causes
Why Software Reliability is Difficult to Calculate (continued)
- Errors can occur without warning
- Software quality cannot be improved by using identical software components
- Periodic restarts can sometimes help fix problems
- Errors are caused by incorrect logic, incorrect statements, or incorrect input data
- Software may require infinite testing
- Software reliability models do not always fit the data points well
Overview
- There are many models to choose from when calculating software reliability
- This presentation focuses on three:
  - Weibull Failure Time Model
  - Musa’s Basic Execution Time Model
  - Monte Carlo Simulation
- Each model has its own strengths and limitations
Weibull Failure Time
About the Weibull Failure Model
- Used to model failure processes of hardware
- One of the first models to be applied to software reliability modeling
- Flexible: accommodates increasing, decreasing, or constant failure rates
Weibull Failure Model
- Assumptions:
  - There is a fixed number of faults in the software being tested
  - Faults are counted as they are detected in successive time intervals ((t=0, t1), (t1, t2), ...)
- Limitations:
  - Flexibility allows for a greater chance of making the wrong assumption
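The flexibility attributed to the Weibull model can be illustrated with a minimal sketch of the standard two-parameter Weibull hazard rate, h(t) = (shape / scale) * (t / scale)^(shape - 1). The function names, the `trend` helper, and the sample times are assumptions made for this illustration and are not taken from the slides.

```python
import math

def weibull_hazard(t, shape, scale=1.0):
    """Instantaneous failure rate h(t) of a two-parameter Weibull distribution."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def weibull_reliability(t, shape, scale=1.0):
    """Probability of surviving past time t: R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))

def trend(shape, times=(0.5, 1.0, 2.0)):
    """Classify how the hazard rate changes over a few sample times (illustrative helper)."""
    rates = [weibull_hazard(t, shape) for t in times]
    if all(math.isclose(r, rates[0]) for r in rates):
        return "constant"
    return "increasing" if rates[-1] > rates[0] else "decreasing"

if __name__ == "__main__":
    # shape < 1, shape == 1, and shape > 1 give the three failure-rate behaviours.
    for shape in (0.5, 1.0, 2.0):
        print(f"shape={shape}: failure rate is {trend(shape)}")
```

A shape parameter below 1 gives a decreasing failure rate, exactly 1 gives a constant rate (the exponential case), and above 1 gives an increasing rate. Picking the wrong shape is the "wrong assumption" risk listed under limitations.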
Weibull Failure Model Example
- Notice how the model follows the actual data
Musa
About Musa’s Basic Execution Time Model
- Developed by John Musa of AT&T Bell Laboratories
- One of the first models to use the actual execution time of software components rather than calendar time
- Time between failures is expressed in terms of CPU time
Musa’s Basic Execution Time Model
- Uses a Poisson distribution
- Model assumptions:
  - The execution times between failures are exponentially distributed
  - The hazard rate for a single fault is constant
- Limitations:
  - Assumes new faults are not introduced after correction
  - Assumes the number of faults decreases over time
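As a rough illustration of these assumptions, the sketch below evaluates the commonly quoted closed forms of the basic execution time model: expected failures mu(tau) = nu0 * (1 - exp(-lambda0 * tau / nu0)) and failure intensity lambda(tau) = lambda0 * exp(-lambda0 * tau / nu0), where tau is CPU time. The parameter values are hypothetical and chosen only for the demonstration; they do not come from the slides.

```python
import math

def expected_failures(tau, lambda0, nu0):
    """Mean number of failures experienced after tau units of execution (CPU) time."""
    return nu0 * (1.0 - math.exp(-lambda0 * tau / nu0))

def failure_intensity(tau, lambda0, nu0):
    """Failures per unit of execution time after tau units of execution time."""
    return lambda0 * math.exp(-lambda0 * tau / nu0)

# Hypothetical parameters, for illustration only.
LAMBDA0 = 10.0  # initial failure intensity (failures per CPU hour)
NU0 = 100.0     # total failures expected over unlimited execution time

if __name__ == "__main__":
    for tau in (0.0, 5.0, 10.0, 50.0):
        mu = expected_failures(tau, LAMBDA0, NU0)
        lam = failure_intensity(tau, LAMBDA0, NU0)
        print(f"tau={tau:5.1f} CPU h  expected failures={mu:7.2f}  intensity={lam:6.3f}")
```

The intensity falls steadily as execution time accumulates, which reflects the assumption that faults are removed and no new ones are introduced.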
Musa’s Basic Execution Time Model Example
- Notice how the model follows the actual data
Monte Carlo Simulation
About Monte Carlo Simulation
- Developed in the 1940s as part of the atomic bomb program
- Named after Monte Carlo, Monaco, because the city’s casinos featured games of chance like dice and roulette
- Today Monte Carlo simulations are used in many applications, including physics, finance, and system reliability
Monte Carlo Simulation
- Used for very complex problems that are difficult to solve or for which no solution exists
- Uses statistics to mathematically model real-life processes and then estimates the probability of possible outcomes
- Involves fitting a curve to a process and then using the fitted curve to model the process over time
- Dice example
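The slide's dice example is not reproduced in the text, so the following is only a guess at its spirit: a minimal sketch that estimates the probability of rolling a given total with two fair dice by repeated random trials. The function name, the target total of 7, and the trial count are assumptions made for this illustration.

```python
import random

def estimate_sum_probability(target, trials, seed=0):
    """Estimate P(two fair dice sum to target) by simulating many rolls."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    hits = 0
    for _ in range(trials):
        if rng.randint(1, 6) + rng.randint(1, 6) == target:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    estimate = estimate_sum_probability(target=7, trials=100_000)
    print(f"estimated P(sum = 7) = {estimate:.4f}, exact = {1 / 6:.4f}")
```

Here the exact answer (6 of 36 outcomes) is known, which makes it easy to see how close the simulated estimate gets. The same approach carries over to problems where no exact answer is available.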
Monte Carlo Simulation Process
- Determine a probability function
  - Weibull distribution: best for the failure process
  - Lognormal distribution: best for the repair process
- Determine the random number generator, the source of random numbers distributed uniformly on the unit interval
- Determine a sampling rule for selecting samples for the model, given a unit interval of random numbers
- Record a count of successes and failures
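The four steps above can be sketched as follows, under stated assumptions: a Weibull distribution for time to failure, Python's `random.Random` as the uniform generator, inverse-transform sampling as the sampling rule, and a "success" defined as surviving a fixed mission time. The mission time and Weibull parameters are hypothetical, and the repair (lognormal) side is left out to keep the sketch short.

```python
import math
import random

def sample_weibull(rng, shape, scale):
    """Sampling rule: map a uniform number on [0, 1) to a Weibull failure time."""
    u = rng.random()
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)

def simulate_mission(trials, mission_time, shape, scale, seed=0):
    """Count runs that survive past mission_time (successes) and those that do not."""
    rng = random.Random(seed)  # the uniform random number generator
    successes = failures = 0
    for _ in range(trials):
        if sample_weibull(rng, shape, scale) > mission_time:
            successes += 1
        else:
            failures += 1
    return successes, failures

if __name__ == "__main__":
    # Hypothetical values: 100-hour mission, Weibull shape 1.5, scale 200 hours.
    ok, bad = simulate_mission(50_000, mission_time=100.0, shape=1.5, scale=200.0)
    exact = math.exp(-((100.0 / 200.0) ** 1.5))
    print(f"simulated reliability = {ok / (ok + bad):.4f}, analytic = {exact:.4f}")
```

In this simple case the analytic reliability is available for comparison. The simulation becomes more useful once several such distributions are combined and no closed form exists.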
Monte Carlo Example
- Select a random location within the rectangle
- If the selected location is blue, record a hit
- Repeat 10,000 times
- Blue Area = (Hits / 10,000) * Area of Rectangle
- Note: the standard error in the result is inversely proportional to the square root of the sample size
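A minimal sketch of this hit-or-miss procedure follows. The slide's blue region is not available here, so a quarter disc of radius 1 inside a unit square stands in for it. That shape and the helper names are assumptions for illustration only.

```python
import math
import random

def estimate_area(is_inside, width, height, samples=10_000, seed=0):
    """Hit-or-miss estimate: (hits / samples) * area of the bounding rectangle."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x = rng.uniform(0.0, width)
        y = rng.uniform(0.0, height)
        if is_inside(x, y):  # the point landed in the "blue" region
            hits += 1
    return hits / samples * width * height

def in_quarter_disc(x, y):
    """Stand-in for the blue region: quarter disc of radius 1 centred at the origin."""
    return x * x + y * y <= 1.0

if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        area = estimate_area(in_quarter_disc, 1.0, 1.0, samples=n)
        print(f"samples={n:>9}: estimate={area:.4f}, exact={math.pi / 4:.4f}")
```

Because the standard error shrinks with the square root of the sample size, a hundredfold increase in samples is expected to narrow the typical error only about tenfold.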
Monte Carlo Software Example
- An arbitrary three-component subsystem
- The failure probability of each component is given in the diagram
- If the first component fails, then the second is checked
- If the second component fails, then the third is checked
- If the third component fails, then the entire subsystem fails
Monte Carlo Software Example (continued)
- The actual failure probability of the subsystem is:
- The results of the simulation are:
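The slide's diagram values and results are not available in this text, so the sketch below uses hypothetical component failure probabilities (0.1, 0.2, 0.3) and assumes the components fail independently. It follows the described logic, in which the subsystem fails only when all three components fail in turn, and compares the simulated estimate with the product of the three probabilities.

```python
import random

def subsystem_fails(rng, failure_probs):
    """Check components in order; the subsystem fails only if every one fails."""
    for p in failure_probs:
        if rng.random() >= p:
            return False  # this component worked, so the subsystem survives
    return True

def estimate_failure_probability(failure_probs, trials=200_000, seed=0):
    """Fraction of simulated runs in which the whole subsystem failed."""
    rng = random.Random(seed)
    failures = sum(subsystem_fails(rng, failure_probs) for _ in range(trials))
    return failures / trials

if __name__ == "__main__":
    probs = (0.1, 0.2, 0.3)  # hypothetical values, not taken from the slide's diagram
    exact = probs[0] * probs[1] * probs[2]
    simulated = estimate_failure_probability(probs)
    print(f"exact = {exact:.4f}, simulated = {simulated:.4f}")
```

With independent components the exact value is simply the product of the three failure probabilities, which gives a reference point for judging how close the simulation gets.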
Conclusion
Conclusion
- Engineering reliable software is important to both the engineer and the end user
- Engineering reliable software is not an easy task to accomplish
- There are methods available for measuring reliability
- Each method has its strengths and weaknesses
- At this time, no one method is superior
Questions
References
- Pai, Ganesh. Survey of Software Reliability Models. Fall 2002.
- Korver, Brian. The Monte Carlo Method and Software Reliability Theory. Portland State University Computer Science, Portland, Oregon, 1994.
- Lyu, Michael R., Editor. Handbook of Software Reliability Engineering. IEEE Computer Society Press, McGraw-Hill, 1996.
- Vouk, Mladen A. Software Reliability Engineering. Tutorial presented at the Annual Reliability and Maintainability Symposium, 1998.
- Pham, Hoang. Software Reliability. Springer-Verlag, 2000.