2007 MIT BAE Systems Fall Conference: October 30-31
Software Reliability Methods and Experience
Dave Dwyer
USA – E&IS
baesystems.com

Overview and outline
Definitions
Similarities and differences: hardware and software reliability
Foundations of Musa’s models reviewed
– Trachtenberg (Trachtenberg, Martin. “The Linear Software Reliability Model and Uniform Testing,” IEEE Transactions on Reliability, 1985, pp 8-16)
– Downs (Downs, Thomas. “An Approach to the Modeling of Software Testing with Some Applications,” IEEE Transactions on Software Engineering, Vol. SE-11, No. 4, April 1985, pp )
Instantaneous failure rate, a.k.a. failure intensity
– Hardware - Duane, Codier
– Software - analogous derivation
Testing results
SW reliability calculator

SW reliability defined
Software reliability defined:
– The probability of failure-free operation for a specified time in a specified environment for a specified purpose (“Software Engineering”, 5th edition, I. Sommerville, Addison-Wesley, 1995)
– The probability of failure-free operation of a computer program for a specified time in a specified environment (“Software Reliability”, Musa, Iannino, Okumoto, McGraw-Hill, 1987)
– We will use MTBF or its reciprocal, the failure rate λ

HW vs. SW reliability
The hardware reliability discipline gave impetus to designing safety margins into the stresses, both mechanical and electrical
But margins of safety don’t mean much in software because it doesn’t wear out
Software has ‘x’ failures per million unique executions [if ‘y’ executions/hour, then ‘xy’ failures/million hours]
Once a process has been successfully executed, that identical process is not going to fail in the future

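The unit conversion in the bracketed note can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names and the sample values of x and y are hypothetical, not figures from the presentation.

```python
def failures_per_million_hours(x_per_million_executions, executions_per_hour):
    """Convert 'x' failures per million unique executions into failures
    per million hours, given 'y' executions per hour.

    (x / 1e6 failures/execution) * (y executions/hour) * 1e6 = x * y
    """
    return x_per_million_executions * executions_per_hour


def mtbf_hours(x_per_million_executions, executions_per_hour):
    """MTBF is the reciprocal of the failure rate (here, in hours)."""
    rate_per_million_hours = failures_per_million_hours(
        x_per_million_executions, executions_per_hour
    )
    if rate_per_million_hours <= 0:
        raise ValueError("failure rate must be positive to compute an MTBF")
    return 1_000_000 / rate_per_million_hours


# Hypothetical values: x = 2 failures per million executions, y = 500 executions/hour
rate = failures_per_million_hours(2, 500)
mtbf = mtbf_hours(2, 500)
print(rate, "failures per million hours; MTBF =", mtbf, "hours")
```

With these made-up inputs the rate is 2 × 500 = 1000 failures per million hours, which corresponds to an MTBF of 1000 hours.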
Martin Trachtenberg (1985):
Simulation testing showed that:
– Testing the functions of the software system in a random or round-robin order and fixing the failures gives linearly decaying system error rates
– Testing and fixing each function exhaustively one at a time gives flat system error rates
– Testing and fixing different functions at widely different frequencies gives exponentially decaying system error rates [operational profile testing], and
– Testing strategies that result in linearly decaying error rates tend to require the fewest tests to detect a given number of errors
– Testing to the operational profile gives the lowest time to reach an operational MTBF

Downs’s ‘Pure’ approach reflected the nature of software (1985)
The execution of a sequence of M paths
The actual number of paths affected by a fault is treated as a random variable ‘c’
Not all paths are equally likely to be executed
$\lambda_j = (N - j)\,\phi$, where:
– N = the total number of faults,
– j = the number of corrected faults,
– $\phi = -r \log(1 - c/M)$,
– r = the number of paths executed/unit time

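As a rough illustration of the Downs rate expression, the failure rate after j corrected faults is (N – j) times a per-fault rate of -r log(1 – c/M). The sketch below makes a simplifying assumption that is not in the slide: it treats c as a fixed number rather than a random variable. The function names and sample values are hypothetical.

```python
import math


def downs_phi(r, c, M):
    """Per-fault failure rate: phi = -r * log(1 - c/M).

    r: paths executed per unit time
    c: paths affected by a fault (ASSUMPTION: fixed, not random)
    M: total number of paths
    """
    if not 0 <= c < M:
        raise ValueError("need 0 <= c < M")
    # log1p(-x) computes log(1 - x) accurately when x is small
    return -r * math.log1p(-c / M)


def downs_failure_rate(j, N, r, c, M):
    """Failure rate after j of the N initial faults have been corrected."""
    if not 0 <= j <= N:
        raise ValueError("need 0 <= j <= N")
    return (N - j) * downs_phi(r, c, M)


def downs_failure_rate_linear(j, N, r, c, M):
    """Small-c/M approximation: (N - j) * r * (c / M)."""
    return (N - j) * r * c / M


# Hypothetical values: 100 faults, 1000 paths/hour, 2 of 1,000,000 paths per fault
for j in (0, 50, 100):
    exact = downs_failure_rate(j, N=100, r=1000.0, c=2, M=1_000_000)
    approx = downs_failure_rate_linear(j, N=100, r=1000.0, c=2, M=1_000_000)
    print(j, exact, approx)
```

When c is much smaller than M, -log(1 – c/M) is very close to c/M, so the two functions nearly coincide and the rate falls in a straight line as faults are corrected, reaching zero at j = N.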
Downs’s execution path parameters
[Figure: execution paths 1, 2, 3, ..., M branching from Start, with faults x1, x2, ..., xN; 2 paths affected by x1, 1 path affected by x2]
‘N’ total faults initially
‘M’ total paths
‘c’ paths affected by an arbitrary fault

Our data analysis approach
Cumulative 8-hour test shifts are recorded
Failures plotted:
– All
– First instance
The last data point will be put at the end of the test time
Only integration and system test data

Failure rate is proportional to failure number, Downs: $\lambda_j \approx (N - j)\,r\,(c/M)$

Failure rate plots against failure number for a range of non-uniform testing profiles, $M_1$, $M_2$ paths and $N_1$, $N_2$ initial faults in those paths
‘Concave’ or logarithmic plots

Instantaneous failure intensity derivation ~ Duane’s for hardware
[Figure: two side-by-side derivations, “Instantaneous for HW” and “Instantaneous for SW”]
Same approach, similar result

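The equations on this slide did not survive, so the following is only the standard Duane derivation for hardware, not a reproduction of the slide. The symbols are assumptions: $n(T)$ is the number of failures by cumulative test time $T$, $K$ is a constant, and $\alpha$ is the growth slope. The software derivation is described as analogous and is not reconstructed here.

```latex
\begin{align}
\lambda_c(T) &= \frac{n(T)}{T} = K\,T^{-\alpha}
  && \text{(cumulative failure rate, Duane postulate)} \\
n(T) &= K\,T^{1-\alpha} \\
\lambda_i(T) &= \frac{dn}{dT} = (1-\alpha)\,K\,T^{-\alpha} = (1-\alpha)\,\lambda_c(T)
  && \text{(instantaneous failure rate)} \\
\mathrm{MTBF}_i(T) &= \frac{\mathrm{MTBF}_c(T)}{1-\alpha}
\end{align}
```

In this form the instantaneous MTBF is the cumulative MTBF divided by $(1-\alpha)$, so on a log-log plot the two lines are parallel.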
Background – test example
Console operation and operating profile
Necessity of distinguishing failure priorities:
– Priority 1: “Prevents mission essential capability”
– Priority 2: “Adversely affects mission essential capability with no alternative workaround”
– Priority 3: “Adversely affects mission essential capability with alternative workaround”
Work shifts varied over test duration: 1-3/day
Calculation of failure intensity

Corrective action for Priority 2 failures suspended while Priority 1 failures corrected

Codier, Duane 1964 RAMS HW reliability growth
Ref. Appendix B, Notes on Plotting (Codier, Ernest O., “Reliability Growth in Real Life”, Proceedings, 1968 Annual Symposium on Reliability, New York, IEEE, January 1968, pp )
– 1. “The latter points, having more information content, must be given more weight than earlier points” (Trachtenberg, too)
– 2. “The normal curve-fitting procedures of drawing the line through the ‘center of gravity’ of all the points should not be used”
– 3. “Start the line on the last data point and seek the region of highest density of points to the left [right for Musa plots] of it”

How do I draw a growth line through the points on a reliability growth plot?
Is there one point that is most important?
– Yes, the last point represents the cumulative MTBF to date; it has the most degrees of freedom
Should the trend line go through that point?
– Yes, it has the best measure of cumulative MTBF
Would an Excel trend line go through that point?
– No, it’s just a least squares fit with all points weighted the same
What is the least important point?
– The first; it has the fewest degrees of freedom

Questions: Drawing a line through the points (cont.)
If the line goes through the last point, what else should it go through?
– The center of density of the other points (ref. back to Duane, Codier)
What is the center of density?
– The center of density is where the center of mass would be if “The latter points …[are]… given more weight than earlier points”

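One way to turn this plotting rule into a calculation is sketched below. It anchors the line at the last point and sends it through a weighted centroid of the earlier points on log-log axes. The weighting scheme (weight equal to the point's position, so later points count more) is an assumption made here for illustration; Codier's notes describe a graphical judgment, not a formula. The function names and the sample data are hypothetical.

```python
import math


def growth_line(times, cum_mtbf, weights=None):
    """Fit log10(MTBF) = intercept + slope * log10(T) through the LAST point
    and the weighted centroid ("center of density") of the earlier points.

    times:    cumulative test times, strictly increasing and positive
    cum_mtbf: cumulative MTBF observed at each time
    weights:  one weight per earlier point; ASSUMPTION: defaults to 1, 2, 3, ...
              so that later points carry more weight
    """
    if len(times) != len(cum_mtbf) or len(times) < 2:
        raise ValueError("need at least two (time, MTBF) pairs")
    xs = [math.log10(t) for t in times]
    ys = [math.log10(m) for m in cum_mtbf]
    x_last, y_last = xs[-1], ys[-1]
    xs_rest, ys_rest = xs[:-1], ys[:-1]
    if weights is None:
        weights = list(range(1, len(xs_rest) + 1))
    if len(weights) != len(xs_rest):
        raise ValueError("need one weight per point before the last")
    total = sum(weights)
    x_c = sum(w * x for w, x in zip(weights, xs_rest)) / total
    y_c = sum(w * y for w, y in zip(weights, ys_rest)) / total
    slope = (y_last - y_c) / (x_last - x_c)
    intercept = y_last - slope * x_last
    return slope, intercept


def predict_mtbf(slope, intercept, t):
    """Cumulative MTBF read off the fitted line at cumulative time t."""
    return 10 ** (intercept + slope * math.log10(t))


# Hypothetical data in cumulative test hours and cumulative MTBF hours
times = [80, 160, 400, 800, 1600]
mtbf = [20, 24, 40, 47, 64]
slope, intercept = growth_line(times, mtbf)
print("growth slope:", slope)
print("line value at last point:", predict_mtbf(slope, intercept, times[-1]))
```

Unlike an ordinary least squares trend line, this line always reproduces the last observed cumulative MTBF, and the earlier points only influence its slope.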
Example - Priority 1 data plotted

Point estimates vs. instantaneous

The formula for calculation of $\lambda_i$ correlates with interval estimates of failure intensity

Most recent data plot

A calculator has been developed for BAE Systems SW reliability practice

Priority 1 data graph

Questions?
Anybody want a grad course in SW Reliability? I need 5 more students
Rivier College can do that through teleconference (
You will solve a real no charge to your department (except tuition)