CSE3308/CSC3080 Software Engineering: Analysis and Design
Lecture 7B: Reliability
CSE3308/CSC3080/DMS/2000/17
Monash University - School of Computer Science and Software Engineering
Lecture Outline
- What is reliability?
- Failures and Faults
- Why is reliability desirable?
- Good Enough Software
- Measuring reliability
- Specifying reliability
- Achieving a reliable system
What is reliability?
- A formal definition: the probability of failure-free operation of a computer program in a specified environment for a specified time
- An informal definition: how well the system's users think it provides the required services
- For a system to be reliable, both the informal and the formal definitions must be satisfactorily met
  - e.g. an aeroplane navigation system may have a very low probability of failure, but even one failure may make it unreliable in the view of the pilot and the passengers
Aspects of reliability
- Reliability cannot be defined in an absolute manner
- Reliability can only be defined in relation to a particular operational context
- The relationship between the faults in a software product and the reliability of that product is very complex
- To properly consider the reliability of a piece of software, the impact of each fault must be assessed
Faults and Failures
- A fault is a static software characteristic which causes a failure to occur
- A failure corresponds to unexpected run-time behaviour observed by a user of the system
- Faults don't necessarily cause failures
- If a user doesn't notice a failure, is it a failure?
Faults and Failures (2)
- Reliability is related to the probability that a fault will cause a failure during operational use
- One study found that removing 60% of the faults in a product increased its reliability by only 3%
- Many faults will only cause failures after hundreds or thousands of months of use
- This does not mean such faults can always be safely ignored: it was feared that Y2K faults might cause catastrophic failures after many years of reliable operation
Types of failures
- Transient: occurs only with certain inputs
- Permanent: occurs with all inputs
- Recoverable: the system can recover without operator intervention
- Unrecoverable: operator intervention is needed to recover from the failure
- Non-corrupting: the failure does not corrupt system state or data
- Corrupting: the failure corrupts system state or data
Why is reliability desirable?
- Reliability is only one of many desirable system characteristics
- Ensuring reliability can be very expensive
- Example: Bell Laboratories reported that it took 8 years to move software availability on one system from 99.9% to 99.98%
- Reliability often conflicts with other system characteristics such as efficiency
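For perspective on the Bell Laboratories figures: 99.9% availability allows roughly 0.001 × 8760 ≈ 8.8 hours of downtime per year, while 99.98% allows about 1.75 hours, so eight years of effort bought roughly seven hours of additional uptime per year.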
The penalties of reliability
- Reliability increases costs through:
  - redundant hardware
  - additional design work
  - additional implementation work
  - validation overheads
- It can also decrease the efficiency of the product, because redundant code is needed to handle exceptions
The prize of reliability
- Unreliable software isn't used
- Unreliable systems are hard to improve
- System failure costs may be very high (e.g. the Westpac disaster)
- The costs of loss of data may be very high
- Inefficiency, by contrast, is predictable and can be worked around
Good enough software
- A very old concept, recently promulgated in the software industry
- The idea: the reliability and quality of software should be as low as possible without stopping your customers from purchasing the software
- First-mover benefits are taken to outweigh any advantage from increased reliability
- Many business software organisations use the idea
- Not an idea one wants to see move into the safety- and mission-critical systems field
Measuring reliability
- Most of the techniques are derived from hardware reliability metrics
- The problem is that hardware is far more likely to fail through wear than through design and implementation defects
- Software does not wear out; its failures come from design and implementation defects
- It is still worthwhile to consider the techniques derived from hardware reliability
Reliability Acronyms
- MTBF - Mean Time Between Failures
- MTTF - Mean Time To Failure
- MTTR - Mean Time To Repair
- MTBF = MTTF + MTTR
- Many people consider MTBF to be far more useful than measuring fault rate per LOC
- Availability = MTTF / (MTTF + MTTR) x 100%
- Availability is very important in any continuously running system
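As a quick illustration of how these definitions fit together, the following sketch computes MTTF, MTTR, MTBF and availability from hypothetical logs of uptime and repair durations (the figures are invented, not from the lecture):

```python
# Illustrative only: hypothetical uptime and repair durations, in hours.
uptimes = [120.0, 96.5, 200.0, 150.0]   # hours of failure-free operation
repairs = [1.5, 2.0, 0.5, 1.0]          # hours spent restoring service

mttf = sum(uptimes) / len(uptimes)       # Mean Time To Failure
mttr = sum(repairs) / len(repairs)       # Mean Time To Repair
mtbf = mttf + mttr                       # Mean Time Between Failures
availability = mttf / (mttf + mttr) * 100

print(f"MTTF = {mttf:.1f} h, MTTR = {mttr:.2f} h, MTBF = {mtbf:.1f} h")
print(f"Availability = {availability:.2f}%")
```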
Other reliability metrics
- MTBF, while better than fault rates, still has problems
- Many software failures are transient and recoverable, so MTBF is not really a good measure of reliability
- We need measures that capture whether a software system will be available to meet a demand
- We may need different measures for different parts of the system; there is often no single best measure of reliability
Other reliability metrics (2)
- POFOD - Probability Of Failure On Demand
  - A measure of the likelihood that the system will fail when a service request is made
  - A POFOD of 0.001 means that 1 out of every 1000 service requests will fail
- ROCOF - Rate of Occurrence Of Failure
  - A measure of the frequency with which unexpected behaviour is likely to occur
  - A ROCOF of 2/100 means 2 failures are likely to occur in each 100 operational time units
  - Also called failure intensity
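A minimal sketch of how POFOD and ROCOF might be estimated from observed counts; the request and failure figures below are invented for illustration:

```python
# Illustrative estimation of POFOD and ROCOF from hypothetical counts.
requests_made = 250_000        # total service requests observed
failed_requests = 230          # requests that ended in failure
operational_hours = 5_000      # total operational time observed
failures_observed = 12         # failures seen in that period

pofod = failed_requests / requests_made          # probability of failure on demand
rocof = failures_observed / operational_hours    # failures per operational hour

print(f"POFOD ~ {pofod:.4f}  (about 1 failure per {1/pofod:.0f} requests)")
print(f"ROCOF ~ {rocof:.4f} failures per hour (failure intensity)")
```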
Reliability measurements
- Number of system failures for a given number of inputs
- Time between system failures
- Number of transactions between failures
- Time to restart after a failure
- Time may be measured as:
  - raw execution time
  - calendar time
  - number of transactions
Reliability Specification
- We need to be able to express reliability requirements in a quantifiable and verifiable manner
- Specifications such as the following are of no use:
  - "The software shall be as reliable as possible"
  - "The software shall exhibit no more than N faults per 1000 lines"
- Reliability is a dynamic property and therefore cannot be expressed in terms of source code
- We can never know whether all the faults have been removed from the source code
Establishing a reliability specification
- For each identified sub-system:
  - identify the different types of system failure
  - analyse the consequences of each failure
- Partition the failures into different classes
- For each failure class identified:
  - define a reliability metric which is appropriate
  - it is not necessary to use the same metric for different classes of failure
- Realise that some reliability requirements cannot be validated, e.g. a specification which says that over the lifetime of the system a particular event will never occur
Examples of a reliability specification for an ATM
Statistical Testing
- A software testing process used to test the reliability of software rather than to discover faults:
  - Determine the operational profile of the system, i.e. the probable pattern of usage of the system
  - Select or generate a set of test data corresponding to the operational profile
  - Apply the test cases to the program, recording the amount of execution time between failures, using appropriate time units
  - After a statistically significant number of failures have been observed, the software reliability can be computed
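The sketch below illustrates the idea under simplifying assumptions: an operational profile given as input-class probabilities, tests drawn at random from that profile, and MTTF estimated from the recorded inter-failure times. The profile, the stand-in system and its failure rates are all invented for illustration.

```python
import random

# Hypothetical operational profile: input class -> probability of occurrence.
operational_profile = {"withdrawal": 0.6, "balance_query": 0.3, "transfer": 0.1}

def system_under_test(input_class: str) -> bool:
    """Stand-in for the real system: returns True on success, False on failure."""
    failure_rates = {"withdrawal": 0.001, "balance_query": 0.0005, "transfer": 0.01}
    return random.random() > failure_rates[input_class]

classes = list(operational_profile)
weights = [operational_profile[c] for c in classes]

inter_failure_times = []
time_since_last_failure = 0
for _ in range(100_000):                       # each test counts as one time unit
    test_input = random.choices(classes, weights)[0]
    time_since_last_failure += 1
    if not system_under_test(test_input):
        inter_failure_times.append(time_since_last_failure)
        time_since_last_failure = 0

if inter_failure_times:
    mttf = sum(inter_failure_times) / len(inter_failure_times)
    print(f"{len(inter_failure_times)} failures observed, estimated MTTF ~ {mttf:.0f} time units")
```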
Difficulties of Statistical Testing
- Operational profile uncertainty
- High cost of generating the operational profile
- Statistical uncertainty when high reliability is specified
- It is very hard to generate a valid operational profile for new systems which don't correspond to an existing system
- As a result, the reliability measurements can themselves be unreliable
- Still a very valuable tool for specifying and measuring the reliability of a system
Achieving a reliable system
- Three basic strategies to achieve reliability:
  - Fault Avoidance: build fault-free systems from the start
  - Fault Tolerance: build facilities into the system that let it continue when faults cause system failures
  - Fault Detection: use software validation techniques to discover faults before the system is put into operation
- For most systems, fault avoidance and fault detection suffice to provide the required level of reliability
Implementing Fault Avoidance
- Availability of a formal and unambiguous system specification
- Adoption of a quality philosophy by developers: developers should be expected to write bug-free programs
- Adoption of information hiding and encapsulation
- Production of readable programs and specifications
- Use of a strongly typed language
Implementing Fault Avoidance (2)
- Restrictions on the use of error-prone constructs, e.g.:
  - pointers
  - floating point numbers
  - dynamic memory allocation
  - recursion
  - parallelism
  - interrupts
Implementing Fault Tolerance
- Even if we could somehow build a fault-free system, we would still need fault tolerance in critical systems
- Fault-free does not mean failure-free
- Fault-free means that the system correctly meets its specification
- The specification may be incomplete, faulty or unaware of a requirement of the environment
- We can never conclusively prove that a system is fault-free
Aspects of Fault Tolerance
- Failure Detection
  - The system must be able to detect that the current state has caused, or will cause, a failure
- Damage Assessment
  - The system must detect what damage the failure has caused
- Fault Recovery
  - The system must change its state to a known "safe" state
  - It can correct the damaged state (forward error recovery - harder)
  - It can restore a previous known "safe" state (backward error recovery - easier)
- Fault Repair
  - Modifying the system so that the failure does not recur
  - Many software faults are transient and need no repair; normal processing can resume after fault recovery
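A minimal sketch of backward error recovery, assuming a system whose state can be copied cheaply: a checkpoint is taken before a risky operation and restored if the operation fails. The operation and state shown are hypothetical.

```python
import copy

def transfer(state: dict, amount: int) -> None:
    """Hypothetical risky operation that may leave the state inconsistent."""
    state["source"] -= amount
    if amount > 100:                      # simulated failure part-way through
        raise RuntimeError("transfer failed after debiting the source account")
    state["destination"] += amount

state = {"source": 500, "destination": 200}
checkpoint = copy.deepcopy(state)         # save a known "safe" state

try:
    transfer(state, 150)
except RuntimeError:
    state = checkpoint                    # backward error recovery: roll back
    print("Failure detected; state restored to the last checkpoint:", state)
```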
Implementing Fault Tolerance
- Hardware: Triple-Modular Redundancy (TMR)
  - The hardware unit is replicated three (or more) times
  - The output from the three units is compared
  - If one unit fails, its output is ignored
  - The Space Shuttle is a classic example
[Diagram: Machine 1, Machine 2 and Machine 3 feed their outputs to an output comparator]
Implementing Fault Tolerance (2)
- Using software
- N-Version Programming
  - Have multiple teams build different versions of the software, then execute them in parallel
  - Assumes the teams are unlikely to make the same mistakes
  - Not necessarily a valid assumption if the teams all work from the same specification
- Recovery Blocks
  - Each program component includes a test to check whether the component has executed successfully
  - Alternative code backs up and repeats the operation if it fails
  - Similar to assertions and exceptions
- Both approaches assume that the specification is correct
N-Version Programming
[Diagram: Version 1, Version 2 and Version 3 feed their outputs to an output comparator]
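The output comparator used in TMR and N-version programming is essentially a majority voter. A minimal software sketch, with three hypothetical versions of the same computation (one deliberately faulty):

```python
from collections import Counter

def version_1(x: float) -> float: return x * x
def version_2(x: float) -> float: return x ** 2
def version_3(x: float) -> float: return x + x   # faulty variant: should compute x squared

def majority_vote(results):
    """Return the value produced by a majority of versions, or raise if there is none."""
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority agreement between versions")
    return value

x = 4.0
outputs = [version_1(x), version_2(x), version_3(x)]
print("Voted output:", majority_vote(outputs))    # the disagreeing version is outvoted
```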
Recovery Blocks
[Diagram: try Algorithm 1 and run the acceptance test; if the test fails, retry with Algorithm 2, then Algorithm 3; continue execution if the acceptance test succeeds; signal an exception if all algorithms fail]
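A minimal sketch of a recovery block, assuming a primary algorithm, two alternatives and an acceptance test; the names and the deliberate fault in the primary algorithm are invented for illustration:

```python
def acceptance_test(data, result) -> bool:
    """Check the result against a property the specification guarantees."""
    return sorted(data) == result

def primary_sort(data):        # primary algorithm, faulty on purpose
    return list(data)          # forgets to sort

def alternate_sort_1(data):    # first alternative
    return sorted(data)

def alternate_sort_2(data):    # second alternative
    result = list(data)
    result.sort()
    return result

def recovery_block(data):
    for algorithm in (primary_sort, alternate_sort_1, alternate_sort_2):
        result = algorithm(list(data))    # each attempt works on a fresh copy (backward recovery)
        if acceptance_test(data, result):
            return result                 # continue execution: acceptance test succeeded
    raise RuntimeError("all algorithms failed the acceptance test")

print(recovery_block([3, 1, 2]))   # the primary fails the test, the first alternative succeeds
```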