Software Reliability (CS 560, Lecture 15)
Software Testing
Two main goals of software testing:
- Defect testing: a means of increasing software reliability.
  - Tests to find and correct faults: boundary testing, code coverage testing.
  - Tests to improve software performance: load testing, stress testing, endurance testing.
- Reliability testing: a means of gaining confidence that the software is sufficiently reliable.
  - Increases the probability of failure-free software operation for a specified period of time, in a specified environment.
Hardware Reliability
Let's first discuss hardware reliability. This graph is known as the bathtub curve for hardware reliability: the x-axis is time and the y-axis is failure rate. When hardware is new, it starts out with a high failure rate; this is the burn-in phase of the hardware life cycle, during which the hardware is tested thoroughly and its failure rate continually decreases. The hardware then enters its useful-life phase, where it operates reliably for a period of time. Eventually the hardware and all its components begin to wear out and become more and more unreliable, which is why the failure rate rises faster and faster.
Hardware Reliability
- Failure rate is very high during the burn-in period.
- Testing of all components reduces the number of faults.
- The hardware enters its useful life with a small number of faults.
- Over time it wears out, and the failure rate quickly increases.
Software Reliability Modeling (Documentation)
Software Reliability
This is a revised bathtub curve used to model software reliability over time. It starts out like hardware reliability, with a high failure rate that decreases as the software is tested and debugged. Software then enters its useful life, which usually includes many upgrades. These upgrades or changes usually introduce new faults and failures, so after each upgrade the software must be tested again to reduce the failure rate. After a while, the software reaches a state of obsolescence in which upgrades no longer affect the failure rate as much, because the system has been tested so extensively.
Terminology
- Error: an action that results in software containing a fault.
- Fault avoidance: the ability to produce fault-free (bug-free) software.
- Fault detection (testing and verification): detecting faults (bugs) before the system is put into operation, or when they are discovered after release.
- Fault tolerance: building systems that continue to operate when problems (bugs, overloads, bad data, etc.) occur.
- Failure: any observable divergence of software behavior from user needs/requirements.
- Failure intensity: the number of failures per unit of time; one way of expressing reliability.
Errors, Faults, and Failures
Reliability Metrics (Documentation)
- Probability of Failure on Demand (POFOD): the probability that the system will fail when a service request is made. Ex: POFOD = 0.001 means the service fails on average once per 1,000 requests.
- Rate of Occurrence of Failures (ROCOF): reflects the rate of failure in the system; useful when the system has to process a large number of similar requests. Ex: ROCOF = 0.02 means two failures per 100 operational time units.
- Mean Time to Failure (MTTF): measures the time between observable system failures.
- Mean Time Between Failures (MTBF): used to compute availability. Availability = MTBF / (MTBF + MTTR), where MTTR = Mean Time to Repair.
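A minimal sketch of the availability formula in C++ (the MTBF and MTTR values are hypothetical, chosen only to illustrate the arithmetic):

    #include <cstdio>

    // Availability = MTBF / (MTBF + MTTR)
    double availability(double mtbf, double mttr) {
        return mtbf / (mtbf + mttr);
    }

    int main() {
        // Hypothetical: 500 hours between failures, 2 hours to repair.
        std::printf("Availability = %.4f\n", availability(500.0, 2.0));  // 0.9960
        return 0;
    }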
Mean Time Between Failures (Documentation)
Reliability is quantified using Mean Time Between Failures (MTBF). A correct understanding of this metric is important: a power supply with an MTBF of 40k hours doesn't mean each unit should last for an average of 40k hours. The statistical average becomes the true average as the number of samples increases.
T = total time
R = number of failures
θ = T / R = MTBF
MTBF example: Suppose 3 software components are used for 100 hours each. During that time, 4 failures occur. Calculate MTBF.
θ = (3 × 100) / 4 = 75 hours/failure
Calculating Reliability (Documentation)
Previous MTBF example: Suppose 3 software components are used for 100 hours each. During that time, 4 failures occur.
θ = (3 components × 100 hours) / (4 failures) = 75 hours/failure
Reliability can be calculated as:
R(t) = e^(−λt)
where λ = 1/θ = failure rate, t = time interval, and e ≈ 2.718.
R(100) = e^(−100 × (1/75)) = 2.718^(−1.33) = 26.4%
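These two calculations as a small C++ sketch, using the example's numbers (3 components, 100 hours, 4 failures):

    #include <cmath>
    #include <cstdio>

    int main() {
        // MTBF: total operating time divided by number of failures (θ = T / R).
        double totalTime = 3 * 100.0;        // 3 components for 100 hours each
        double failures  = 4.0;
        double mtbf = totalTime / failures;  // 75 hours/failure

        // Reliability over an interval t: R(t) = e^(-λt), with λ = 1 / MTBF.
        double lambda = 1.0 / mtbf;          // failure rate
        double t = 100.0;                    // time interval in hours
        double reliability = std::exp(-lambda * t);

        std::printf("MTBF = %.0f hours/failure\n", mtbf);           // 75
        std::printf("R(%.0f) = %.1f%%\n", t, reliability * 100.0);  // 26.4%
        return 0;
    }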
Software vs. Hardware Reliability
Software: Failures are primarily due to design faults. Repairs are made by modifying the code.
Hardware: Failures are caused by deficiencies in design, production, and maintenance.

Software: No wear-out phenomena; errors can occur without warning.
Hardware: Failures are caused by wear or energy/environmental attributes. Sometimes a warning is available before a failure occurs.

Software: There is no equivalent of preventive maintenance for software.
Hardware: Repairs can be made that make the hardware more reliable.

Software: Reliability is not time dependent. Failures occur when the logic path the program takes contains an error.
Hardware: Reliability is time related. Failure rates can be decreasing, constant, or increasing with respect to operating time.
Software vs. Hardware Reliability
Software: External environmental conditions do not affect software reliability. Internal conditions, such as insufficient memory or inappropriate clock speeds, do affect software reliability.
Hardware: Reliability is related to environmental conditions.

Software: Reliability can't be predicted from knowledge of design, usage, or stress factors.
Hardware: Reliability can, theoretically, be predicted from design factors and physical attributes.

Software: Reliability can't be improved through redundancy of software; redundancy will simply replicate the same error.
Hardware: Reliability can usually be improved through redundant hardware.

Software: Failure rates of software components are not predictable.
Hardware: Failure rates of hardware components are somewhat predictable according to known usage patterns.
Good Practice Guidelines for Increasing Software Reliability (Documentation)
1. Limit the visibility of information in a program
2. Check all inputs for validity
3. Provide a handler for all exceptions
4. Minimize the use of error-prone constructs
5. Provide restart capabilities
6. Check array bounds
7. Include timeouts when calling external components
8. Name all constants that represent real-world values
(1) Limit the visibility of information in a program
Components should only be allowed access to the data they need for their implementation. You can control visibility by using abstract data types where the data representation is private, allowing access to the data only through predefined operations such as get() and put().
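A minimal C++ sketch of this guideline (the BoundedCounter type and its limits are illustrative, not from the lecture):

    #include <stdexcept>

    // The data representation is private; clients can only use the
    // predefined operations and never touch the raw value directly.
    class BoundedCounter {
    public:
        explicit BoundedCounter(int max) : max_(max) {}

        void put(int value) {                  // controlled write access
            if (value < 0 || value > max_)
                throw std::out_of_range("value outside counter range");
            value_ = value;
        }

        int get() const { return value_; }     // controlled read access

    private:
        int value_ = 0;   // hidden representation
        int max_;
    };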
(2) Check all inputs for validity
All programs take inputs from their environment, and assumptions are made about these inputs. Many programs behave unpredictably when presented with unusual inputs, which may be a threat to the security of the system. Inputs should be checked for validity before processing (see the sketch after this list).
Validity checks for input:
- Range checks: the input falls within a known range.
- Size checks: the input does not exceed some maximum size.
- Representation checks: the input does not include characters that should not be part of its representation.
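A sketch of the three kinds of checks in C++, for a hypothetical username and age field (the limits and allowed characters are assumptions):

    #include <cctype>
    #include <string>

    // Size and representation checks on a username.
    bool validUsername(const std::string& name) {
        // Size check: assumed 3 to 32 characters.
        if (name.size() < 3 || name.size() > 32) return false;

        // Representation check: only letters, digits, and underscores.
        for (unsigned char c : name) {
            if (!std::isalnum(c) && c != '_') return false;
        }
        return true;
    }

    // Range check on a numeric input (assumed 0 to 120 for an age).
    bool validAge(int age) {
        return age >= 0 && age <= 120;
    }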
(3) Provide a handler for all exceptions
A program exception is an error or some unexpected event, such as a power failure or a hardware component going offline. Software that detects exceptions needs many additional statements to be added, which introduces significant overhead and is itself potentially error-prone. Depending on the scope of the error, exception handling may be accomplished using interrupts.
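A minimal C++ sketch of the guideline (readConfig and its failure mode are hypothetical):

    #include <fstream>
    #include <iostream>
    #include <stdexcept>
    #include <string>

    // Hypothetical helper that can fail in an anticipated way.
    std::string readConfig(const std::string& path) {
        std::ifstream in(path);
        if (!in) throw std::runtime_error("cannot open " + path);
        std::string line;
        std::getline(in, line);
        return line;
    }

    int main() {
        try {
            std::cout << readConfig("app.conf") << '\n';
        } catch (const std::runtime_error& e) {
            // Handle the anticipated failure: fall back to defaults.
            std::cerr << "warning: " << e.what() << ", using defaults\n";
        } catch (...) {
            // Catch-all handler so no exception escapes unhandled.
            std::cerr << "unexpected error\n";
            return 1;
        }
        return 0;
    }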
Exception handling
(4) Minimize the use of error-prone constructs
Program faults are usually a consequence of human error. Programmers can lose track of the relationships between the different parts of the system, so extra attention is needed to produce fault-free or fault-tolerant software when using the following constructs.
Error-Prone Constructs
- Unconditional branch (goto) statements: allow program flow between scopes without using the specified scope interface; difficult to understand and debug.
- Floating-point numbers: inherently imprecise, since values are stored as approximations; this may lead to invalid comparisons, and comparing floats for equality can be difficult (see the sketch below).
- Pointers: pointer errors can be devastating; a pointer may be set incorrectly and point to the wrong memory address, and pointers also make bounds checking difficult to implement.
- Dynamic memory allocation: run-time allocation can cause memory overflow if memory is not de-allocated (a memory "leak").
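For the floating-point hazard, a short C++ sketch of why exact equality fails and a common tolerance-based workaround:

    #include <cmath>
    #include <cstdio>

    int main() {
        double a = 0.1 + 0.2;   // stored as an approximation
        double b = 0.3;

        std::printf("a == b: %s\n", a == b ? "true" : "false");  // false!

        // Compare against a small tolerance instead of exact equality.
        const double eps = 1e-9;
        std::printf("|a - b| < eps: %s\n",
                    std::fabs(a - b) < eps ? "true" : "false");  // true
        return 0;
    }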
Error-Prone Constructs
- Parallelism: can result in timing errors because of unforeseen interactions between parallel processes; these problems can't be detected during code reviews and may not occur during testing.
- Recursion: recursive logic may be difficult to follow, with hard-to-detect errors; errors in recursion can cause memory overflow as the program stack fills up (see the sketch below).
- Interrupts: force a transfer of control to another section of code; an interrupt can cause a critical operation to be terminated.
- Inheritance: code is not localized, which can result in unexpected behaviour when changes are made and makes the code harder to understand.
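For the recursion hazard, an illustrative C++ sketch (the function is hypothetical) of how a base case that is unreachable for some inputs fills the stack:

    // The base case stops the recursion; if it is unreachable for some
    // inputs, every call adds a stack frame until the stack overflows.
    long sumTo(int n) {
        if (n == 0) return 0;      // base case: only reached for n >= 0
        return n + sumTo(n - 1);   // sumTo(-1) recurses forever
    }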
Error-Prone Constructs
- Aliasing: using more than one name to refer to the same variable, e.g. two pointers with different names pointing to the same memory location (see the sketch below).
- Unbounded arrays: buffer overflow failures can occur if there is no bounds checking on arrays; this is a security vulnerability.
- Default input processing: an input action that occurs irrespective of the input. This can cause problems if the default action is to transfer control elsewhere in the program; incorrect or malicious input can then trigger a program failure.
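For the aliasing hazard, a tiny C++ sketch (variable names are illustrative):

    #include <cstdio>

    int main() {
        int balance = 100;
        int* account = &balance;   // two names...
        int* audit   = &balance;   // ...for the same memory location

        *account -= 30;            // an update through one alias...
        std::printf("audit sees %d\n", *audit);   // ...shows up through the other: 70
        return 0;
    }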
(5) Provide restart capabilities
For systems with long transactions or user interactions, a restart-after-failure capability should be provided. Users should not have to redo everything that they've finished previously:
- Keep copies of forms so that users don't have to fill them in again if there is a problem.
- Save state periodically and restart from the saved state. Microsoft Office products have this feature.
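A minimal sketch of periodic state saving in C++ (the checkpoint file name and state layout are assumptions):

    #include <fstream>
    #include <iterator>
    #include <string>

    // Periodically write in-progress work to disk so that, after a failure,
    // the program can restart from the saved state instead of from scratch.
    void saveCheckpoint(const std::string& draftText) {
        std::ofstream out("draft.checkpoint", std::ios::trunc);
        out << draftText;
    }

    // On startup, restore the saved state if a checkpoint exists.
    std::string loadCheckpoint() {
        std::ifstream in("draft.checkpoint");
        return std::string(std::istreambuf_iterator<char>(in),
                           std::istreambuf_iterator<char>());
    }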
(6) Check array bounds
Some programming languages, such as C, allow addressing of a memory location outside of the range allowed by an array. This can lead to a "buffer overflow" vulnerability, where attackers write executable code into memory outside the bounds of the array. If your language does not include bounds checking, you should always check that an array access is within the bounds of the array, as in the manual bounds-checking sketch below.
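A sketch of manual bounds checking in C++ (the buffer size and accessor are illustrative):

    #include <cstddef>
    #include <stdexcept>

    constexpr std::size_t kBufferSize = 64;
    int buffer[kBufferSize];

    // Check every index before the access instead of trusting the caller;
    // an out-of-range index raises an error rather than corrupting memory.
    int readBuffer(std::size_t index) {
        if (index >= kBufferSize)
            throw std::out_of_range("buffer index out of bounds");
        return buffer[index];
    }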
(7) Include timeouts when calling external components
In a distributed system, the failure of a remote computer can be "silent", so programs expecting services from that computer may never receive the service, or any indication that there has been a failure. To avoid this, you should always include timeouts on all calls to external components, including container-to-container communication. After a defined time period has elapsed without a response, your system should assume failure and attempt to recover.
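A C++ sketch of a bounded wait using std::future (callRemoteService is a hypothetical stand-in for a real remote call; the sleep simulates a hung remote computer):

    #include <chrono>
    #include <future>
    #include <iostream>
    #include <string>
    #include <thread>

    // Hypothetical remote call that never responds in time.
    std::string callRemoteService() {
        std::this_thread::sleep_for(std::chrono::seconds(5));
        return "data";
    }

    int main() {
        // Run the external call asynchronously so the wait can be bounded.
        std::future<std::string> reply =
            std::async(std::launch::async, callRemoteService);

        // After 2 seconds with no response, assume failure and recover.
        if (reply.wait_for(std::chrono::seconds(2)) == std::future_status::ready) {
            std::cout << "reply: " << reply.get() << '\n';
        } else {
            std::cerr << "timeout: remote service assumed failed; recovering\n";
            // ...fall back to a cached result, retry, or report the failure...
        }
        return 0;   // note: the future's destructor waits for the task to finish
    }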
(8) Name all constants that represent real-world values
Always give names to constants that reflect real-world values (such as tax rates) rather than using their numeric values, and always refer to them by name. This removes "magic numbers". Ex:
for i from 1 to 52 do: ...
becomes:
int deckSize = 52
for i from 1 to deckSize do: ...
This means that when these "constants" change, you only have to make the change in one place in your program.
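The same example as idiomatic C++ (the constant name is illustrative):

    #include <iostream>

    constexpr int kDeckSize = 52;   // named real-world constant, no magic number

    int main() {
        for (int i = 0; i < kDeckSize; ++i) {
            std::cout << i << ' ';  // process card i
        }
        std::cout << '\n';
        return 0;
    }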
Reliability Metrics
Reliability: the probability of failure not occurring in operational use.
Traditional measures for software systems:
- Mean time between failures (MTBF)
- Mean time to failure (MTTF)
- Availability (uptime), which depends on mean time to repair (MTTR)
Market measures:
- Complaints
- Customer retention
Applying Software Reliability Engineering (Documentation)
Essential components of SRE:
- Establish reliability goals
- Develop an operational profile
- Plan and execute tests
- Document test results / modify the software system to improve reliability
Applying Software Reliability Engineering (Documentation)
Establish reliability goals. Ex:
- The system will be sufficiently reliable if 10 (or fewer) errors occur in 10k transactions.
- The system can tolerate a spike in demand for X minutes without dramatically decreasing performance.
- The reliability of components X, Y, and Z over unit time should be greater than 95%.
- The customer can tolerate no more than two operational errors per hour of use.
Applying Software Reliability Engineering (Documentation)
Develop an operational profile:
- A characterization of system/component usage.
- Used to develop tests as if the product were in the field.
- Differs from functional/non-functional requirements by assigning components different priorities based on their probability of use.
- Reliability tests are allocated based on these probabilities of use (see the sketch below).
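A small C++ sketch of allocating a test budget in proportion to probability of use (the profile entries and budget are hypothetical):

    #include <cstdio>

    struct Operation {
        const char* name;
        double probabilityOfUse;   // fraction of field usage; sums to 1.0
    };

    int main() {
        // Hypothetical operational profile for a banking system.
        Operation profile[] = {
            {"check balance", 0.60},
            {"transfer funds", 0.30},
            {"close account", 0.10},
        };
        int totalTests = 1000;     // overall reliability-test budget

        // Allocate tests in proportion to how often each operation is used.
        for (const Operation& op : profile) {
            std::printf("%-15s -> %4.0f tests\n",
                        op.name, op.probabilityOfUse * totalTests);
        }
        return 0;
    }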
Applying Software Reliability Engineering (Documentation)
Plan and execute tests:
- Defect testing
- Reliability testing
Applying Software Reliability Engineering (Documentation)
Document test results / modify the software system to improve reliability:
- Record all test data in the software documentation (tables, graphs, etc.).
- Use test-generated data to improve the reliability of the software system.
Key Factors for Reliable Software
- A programming style that emphasizes simplicity and readability; this helps with maintenance and with finding/removing errors.
- Software tools that restrict or detect errors: strongly typed languages, source control systems, debuggers.
- Systematic verification at all stages of development: requirements, system architecture, program design, implementation, and user testing.
- MTBF, MTTF, MTTR, failure rate, and reliability statistics should be calculated for all major software components.