CSE3308/CSC3080 - Software Engineering: Analysis and Design. Lecture 7B.1: Reliability. CSE3308/CSC3080/DMS/2000/17.

CSE3308/CSC3080 - Software Engineering: Analysis and Design, Lecture 7B.1
Software Engineering: Analysis and Design - CSE3308
Reliability
CSE3308/CSC3080/DMS/2000/17
Monash University - School of Computer Science and Software Engineering

Lecture 7B.2: Lecture Outline
- What is reliability?
- Failures and Faults
- Why is reliability desirable?
- Good Enough Software
- Measuring reliability
- Specifying reliability
- Achieving a reliable system

Lecture 7B.3: What is reliability?
- A formal definition
  - The probability of failure-free operation of a computer program in a specified environment for a specified time
- An informal definition
  - How well the system users think it provides the required services
- For a system to be reliable, both the formal and the informal definitions must be satisfactorily met
  - e.g. an aeroplane navigation system may have a very low probability of failure, but even one failure may make it unreliable in the view of the pilot and the passengers

Lecture 7B.4: Aspects of reliability
- Reliability cannot be defined in an absolute manner
- Reliability can only be defined in relation to a particular operational context
- The relationship between the faults in a software product and the reliability of that product is very complex
- To properly consider the reliability of a piece of software, the impact of each fault must be assessed

Lecture 7B.5: Faults and Failures
- A fault is a static software characteristic which causes a failure to occur
- A failure corresponds to unexpected run-time behaviour observed by the user of the system
- Faults don’t necessarily cause failures
- If a user doesn’t notice a failure, is it a failure?

Lecture 7B.6: Faults and Failures (2)
- Reliability is related to the probability that a fault will cause a failure while in operational use
- One study found that removing 60% of the faults in a product increased reliability by only 3%
- Many faults will only cause failures after hundreds or thousands of months of use
- Such faults cannot necessarily be safely ignored, though
  - It was feared that Y2K faults might cause catastrophic failures after many years of reliable operation

Lecture 7B.7: Types of failures
- Transient - Occurs only with certain inputs
- Permanent - Occurs with all inputs
- Recoverable - System can recover without operator intervention
- Unrecoverable - Operator intervention needed to recover from failure
- Non-Corrupting - Failure does not corrupt system state or data
- Corrupting - Failure corrupts system state or data

Lecture 7B.8: Why is reliability desirable?
- Reliability is only one of many desirable system characteristics
- Ensuring reliability can be very expensive
  - Example - Bell Laboratories reported that it took 8 years to move software availability on one system from 99.9% to 99.98%
- Reliability often conflicts with other system characteristics such as efficiency
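
To put the Bell Laboratories figures in concrete terms, the arithmetic below converts an availability percentage into expected downtime per year. It is only an illustrative sketch: the function name and the 365-day year are assumptions made here, not part of the lecture.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def annual_downtime_hours(availability_percent: float) -> float:
    """Expected hours of downtime per year for a given availability percentage."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * HOURS_PER_YEAR


if __name__ == "__main__":
    for availability in (99.9, 99.98):
        hours = annual_downtime_hours(availability)
        print(f"{availability}% available: about {hours:.2f} hours down per year")
```

Under these assumptions, the move from 99.9% to 99.98% cuts expected downtime from roughly 8.76 hours to roughly 1.75 hours per year.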

Lecture 7B.9: The penalties of reliability
- Increases costs through:
  - redundant hardware
  - additional design
  - additional implementation work
  - validation overheads
  - decreased efficiency of the product due to the need for redundant code to handle exceptions

Lecture 7B.10: The prize of reliability
- Unreliable software isn’t used
- Unreliable systems are hard to improve
- System failure costs may be very high (e.g. the Westpac disaster)
- Costs of loss of data may be very high
- Inefficiency is predictable and can be worked around

Lecture 7B.11: Good enough software
- A very old concept, recently promulgated in the software industry
- The reliability and quality of software should be as low as possible without stopping your customers from purchasing the software
- First-mover benefits outweigh any advantage from increased reliability
- Many business software organisations use the idea
- Not an idea one wants to see move into the safety-critical and mission-critical systems field

Lecture 7B.12: Measuring reliability
- Most of the techniques are derived from hardware reliability metrics
- The problem is that hardware is far more likely to fail through wear than through design and implementation defects
- Software doesn’t wear; its failures come from design and implementation defects
- Still worthwhile to consider the techniques derived from hardware reliability

Lecture 7B.13: Reliability Acronyms
- MTBF - Mean Time Between Failures
- MTTF - Mean Time To Failure
- MTTR - Mean Time To Repair
- MTBF = MTTF + MTTR
  - Many people consider MTBF to be far more useful than measuring fault rate per LOC
- Availability = MTTF / (MTTF + MTTR) x 100%
  - Very important in any continuously running system
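
The two formulas on this slide can be checked with a few lines of code. This is a minimal sketch; the function names and the sample figures (998 hours to failure, 2 hours to repair) are invented for illustration.

```python
def mtbf(mttf: float, mttr: float) -> float:
    """Mean Time Between Failures = Mean Time To Failure + Mean Time To Repair."""
    return mttf + mttr


def availability_percent(mttf: float, mttr: float) -> float:
    """Availability = MTTF / (MTTF + MTTR) x 100%."""
    if mttf < 0 or mttr < 0 or mttf + mttr == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf / (mttf + mttr) * 100.0


if __name__ == "__main__":
    # Hypothetical system: runs 998 hours between repairs, each repair takes 2 hours.
    print("MTBF (hours):", mtbf(998, 2))
    print(f"Availability: {availability_percent(998, 2):.1f}%")
```

With these made-up figures the MTBF is 1000 hours and the availability is about 99.8%. Both inputs must use the same time unit.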

Lecture 7B.14: Other reliability metrics
- MTBF, while better than fault rates, still has problems
- Many software failures are transient and recoverable, so MTBF is not really a good measure of reliability
- Need measures which address whether a software system will be available to meet a demand
- We may need to use different measures for different parts of the system; often there is no one best measure of reliability

Lecture 7B.15: Other reliability metrics (2)
- POFOD - Probability Of Failure On Demand
  - Measure of the likelihood that the system will fail when a service request is made
  - A POFOD of 0.001 means that 1 out of every 1000 service requests will fail
- ROCOF - Rate Of Occurrence Of Failure
  - Measure of the frequency with which unexpected behaviour is likely to occur
  - A ROCOF of 2/100 means 2 failures are likely to occur in every 100 operational time units
  - Also called failure intensity
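
Both metrics are simple ratios over observed data. The sketch below shows how they might be computed from test or field logs; the function names are assumptions made for this example.

```python
def pofod(failed_requests: int, total_requests: int) -> float:
    """Probability Of Failure On Demand: failed service requests / all service requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return failed_requests / total_requests


def rocof(failures: int, operational_time_units: float) -> float:
    """Rate Of Occurrence Of Failure: failures per operational time unit."""
    if operational_time_units <= 0:
        raise ValueError("operational_time_units must be positive")
    return failures / operational_time_units


if __name__ == "__main__":
    print("POFOD:", pofod(1, 1000))   # 1 failure in 1000 service requests
    print("ROCOF:", rocof(2, 100))    # 2 failures in 100 operational time units
```

POFOD suits systems where service requests are occasional; ROCOF suits systems that handle requests continuously, where the choice of time unit (execution time, calendar time or transactions) has to be stated alongside the figure.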

Lecture 7B.16: Reliability measurements
- Number of system failures for a given number of inputs
- Time between system failures
- Number of transactions between failures
- Time to restart after failure
- Time may be measured as
  - raw execution time
  - calendar time
  - number of transactions

Lecture 7B.17: Reliability Specification
- Need to be able to express reliability requirements in a quantifiable and verifiable manner
- Specifications such as the following are irrelevant
  - The software shall be as reliable as possible
  - The software shall exhibit no more than N faults per 1000 lines
- Reliability is dynamic and therefore can’t be expressed in terms of source code
- We can never know if all the faults have been removed from source code

Lecture 7B.18: Establishing a reliability specification
- For each identified sub-system
  - identify the different types of system failure
  - analyse the consequences of the failure
- Partition the failures into different classes
- For each failure class identified
  - define a reliability metric which is appropriate
  - it is not necessary to use the same metric for different classes of failure
- Realise that some reliability specifications cannot be validated
  - e.g. a reliability specification which says that over the lifetime of the system an event will never occur

Lecture 7B.19: Examples of a reliability specification for an ATM

Lecture 7B.20: Statistical Testing
- A software testing process used to test the reliability of software rather than to discover faults
  - Determine the operational profile of the system, i.e. the probable pattern of usage of the system
  - Select or generate a set of test data corresponding to the operational profile
  - Apply the test cases to the program, recording the amount of execution time between failures, using appropriate time units
  - After a statistically significant number of failures have been observed, the software reliability can be computed
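
The four steps above can be sketched in code. Everything named here is hypothetical: the operational profile, the input classes and the toy system with a seeded fault are invented for illustration, and "time" is measured as the number of test runs between failures.

```python
import random
from typing import Callable, Dict, List


def generate_tests(profile: Dict[str, float], count: int, rng: random.Random) -> List[str]:
    """Draw input classes with the relative frequencies given by the operational profile."""
    classes = list(profile)
    weights = [profile[name] for name in classes]
    return rng.choices(classes, weights=weights, k=count)


def runs_between_failures(system: Callable[[str], bool], tests: List[str]) -> List[int]:
    """Run every test; return the number of runs up to and including each failure."""
    intervals: List[int] = []
    runs_since_failure = 0
    for input_class in tests:
        runs_since_failure += 1
        if not system(input_class):  # the system returns False on an observed failure
            intervals.append(runs_since_failure)
            runs_since_failure = 0
    return intervals


def mean_runs_to_failure(intervals: List[int]) -> float:
    """Average number of runs between failures (an MTTF-style estimate)."""
    if not intervals:
        raise ValueError("no failures observed, so no estimate can be made")
    return sum(intervals) / len(intervals)


if __name__ == "__main__":
    # Hypothetical operational profile: share of each kind of request in normal use.
    profile = {"withdraw": 0.70, "balance": 0.25, "statement": 0.05}

    def toy_system(input_class: str) -> bool:
        return input_class != "statement"  # seeded fault: rare request type always fails

    tests = generate_tests(profile, 10_000, random.Random(42))
    intervals = runs_between_failures(toy_system, tests)
    print("failures observed:", len(intervals))
    print("mean runs to failure:", round(mean_runs_to_failure(intervals), 1))
```

Because the faulty input class makes up about 5% of the profile, the estimate should come out at roughly 20 runs between failures. The same fault would look far less harmful under a profile in which that request type is rarer, which is why the profile has to match real usage.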

Lecture 7B.21: Difficulties of Statistical Testing
- Operational profile uncertainty
- High costs of generating the operational profile
- Statistical uncertainty when high reliability is specified
- Very hard to generate a valid operational profile for new systems which don’t correspond to an existing system
- Reliability measurements are themselves unreliable
- Still a very valuable tool in specifying and measuring the reliability of a system

Lecture 7B.22: Achieving a reliable system
- Three basic strategies to achieve reliability
- Fault Avoidance
  - Build fault-free systems from the start
- Fault Tolerance
  - Build facilities into the system to let the system continue when faults cause system failures
- Fault Detection
  - Use software validation techniques to discover faults prior to the system being put into operation
- For most systems, fault avoidance and fault detection suffice to provide the required level of reliability

Lecture 7B.23: Implementing Fault Avoidance
- Availability of a formal and unambiguous system specification
- Adoption of a quality philosophy by developers. Developers should be expected to write bug-free programs
- Adoption of information hiding and encapsulation
- Production of readable programs/specifications
- Use of a strongly-typed language

Lecture 7B.24: Implementing Fault Avoidance (2)
- Restrictions on use of error-prone constructs, e.g.
  - pointers
  - floating point numbers
  - dynamic memory allocation
  - recursion
  - parallelism
  - interrupts
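
As one illustration of why floating point numbers appear on this list, the sketch below shows an exact equality test giving a surprising answer, together with a tolerance-based comparison. The helper name and the tolerance value are assumptions chosen for this example.

```python
import math


def nearly_equal(a: float, b: float, tolerance: float = 1e-9) -> bool:
    """Compare two floats within a tolerance instead of testing exact equality."""
    return math.isclose(a, b, rel_tol=tolerance, abs_tol=tolerance)


if __name__ == "__main__":
    total = 0.1 + 0.2
    print(total == 0.3)              # False: binary floats cannot represent 0.1 exactly
    print(nearly_equal(total, 0.3))  # True: the difference is far below the tolerance
```

A coding standard might ban exact comparisons of this kind outright, or require decimal or integer arithmetic where exact values matter, such as money.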

Lecture 7B.25: Implementing Fault Tolerance
- Even if we somehow build a fault-free system, we still need fault tolerance in critical systems
- Fault-free does not mean failure-free
- Fault-free means that the system correctly meets its specifications
- Specifications may be incomplete or faulty or unaware of a requirement of the environment
- Can never conclusively prove that a system is fault-free

Lecture 7B.26: Aspects of Fault Tolerance
- Failure Detection
  - System must be able to detect that the current state of the system has caused a failure or will cause a failure
- Damage Assessment
  - System must detect what damage the system failure has caused
- Fault Recovery
  - System must change the state of the system to a known “safe” state
  - Can correct the damaged state (forward error recovery - harder)
  - Can restore to a previous known “safe” state (backward error recovery - easier)
- Fault Repair
  - Modifying the system so that the failure does not recur
  - Many software failures are transient and need no repair; normal processing can resume after fault recovery
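
Backward error recovery can be sketched as "checkpoint, attempt, check, roll back". The example below is only one simple way to do this, assuming the whole system state fits in a dictionary; the function and parameter names are invented for illustration.

```python
import copy
from typing import Callable, Dict


def run_with_backward_recovery(
    state: Dict[str, int],
    operation: Callable[[Dict[str, int]], None],
    is_valid: Callable[[Dict[str, int]], bool],
) -> bool:
    """Apply operation to state; restore the checkpoint if a failure is detected.

    Returns True if the result was kept, False if the state was rolled back.
    """
    checkpoint = copy.deepcopy(state)   # previous known "safe" state
    try:
        operation(state)
        failed = not is_valid(state)    # failure detection by checking the new state
    except Exception:
        failed = True                   # failure detection by exception
    if failed:
        state.clear()
        state.update(checkpoint)        # backward error recovery
        return False
    return True


if __name__ == "__main__":
    account = {"balance": 100}          # hypothetical state

    def overdraw(s: Dict[str, int]) -> None:
        s["balance"] -= 150             # leaves the state invalid

    kept = run_with_backward_recovery(account, overdraw, lambda s: s["balance"] >= 0)
    print(kept, account)                # the failed update is rolled back
```

Forward error recovery would instead try to repair the damaged state in place, which needs knowledge of what the correct state should be and is why the slide calls it harder.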

Lecture 7B.27: Implementing Fault Tolerance
- Hardware - Triple-Modular Redundancy (TMR)
  - Hardware unit is replicated three (or more) times
  - Output is compared from three units
  - If one unit fails, its output is ignored
  - Space Shuttle is a classic example
[Figure: Machine 1, Machine 2 and Machine 3 feed an Output Comparator]

Lecture 7B.28: Implementing Fault Tolerance (2)
- Using Software
- N-Version programming
  - Have multiple teams build different versions of the software and then execute them in parallel
  - Assumes teams are unlikely to make the same mistakes
  - Not necessarily a valid assumption if teams all work from the same specification
- Recovery Blocks
  - Each program component includes a test to check if the component has executed successfully
  - Has alternative code to back up and repeat the operation if it fails
  - Similar to assertions and exceptions
- Both assume that the specification is correct

Lecture 7B.29: N-Version Programming
[Figure: Version 1, Version 2 and Version 3 feed an Output Comparator]
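
The output comparator for N-version programming can be sketched as a majority vote over the versions' results. This is a simplified illustration, not a description of any real system: the versions run one after another rather than in parallel, outputs must be hashable, and the three sample versions (one with a seeded fault) are invented.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the output produced by a strict majority of versions, else raise."""
    if not outputs:
        raise ValueError("no outputs to compare")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among version outputs")
    return value


def run_n_versions(versions: Sequence[Callable[[int], int]], argument: int) -> int:
    """Run every version on the same input and compare their outputs."""
    return majority_vote([version(argument) for version in versions])


if __name__ == "__main__":
    # Three hypothetical, independently written versions of "square a number".
    def version_1(x: int) -> int:
        return x * x

    def version_2(x: int) -> int:
        return x ** 2

    def version_3(x: int) -> int:
        return x * 2  # seeded fault: doubles instead of squaring

    print(run_n_versions([version_1, version_2, version_3], 3))  # faulty output is outvoted
```

For the input 2 all three versions return 4, so the fault in the third version causes no failure at all. If two versions shared the same mistake, the vote would confirm the wrong answer, which is the weakness noted for teams working from the same specification.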

Lecture 7B.30: Recovery Blocks
[Figure: Try Algorithm 1 and apply the Acceptance Test (test for success). Continue execution if the acceptance test succeeds. If the acceptance test fails, retry with Algorithm 2 and then Algorithm 3, retesting each time. Signal an exception if all algorithms fail.]
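
The recovery block scheme can be sketched as a loop over alternative algorithms guarded by an acceptance test. The code below is a minimal illustration under stated assumptions: the algorithms are pure functions, so there is no shared state to restore before a retry, and the sorting example with a deliberately faulty primary algorithm is invented.

```python
from collections import Counter
from typing import Callable, List, Sequence


def recovery_block(
    algorithms: Sequence[Callable[[List[int]], List[int]]],
    acceptance_test: Callable[[List[int], List[int]], bool],
    data: List[int],
) -> List[int]:
    """Try each algorithm in turn until one passes the acceptance test."""
    for algorithm in algorithms:
        try:
            result = algorithm(list(data))  # each attempt gets a fresh copy of the input
        except Exception:
            continue                        # a crash counts as a failed attempt; retry
        if acceptance_test(data, result):
            return result                   # test succeeded: continue execution
    raise RuntimeError("all algorithms failed the acceptance test")


def is_sorted_permutation(original: List[int], result: List[int]) -> bool:
    """Acceptance test: result is in order and holds exactly the original items."""
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return in_order and Counter(original) == Counter(result)


if __name__ == "__main__":
    def faulty_sort(items: List[int]) -> List[int]:
        return items                        # seeded fault: returns the input unsorted

    def fallback_sort(items: List[int]) -> List[int]:
        return sorted(items)

    print(recovery_block([faulty_sort, fallback_sort], is_sorted_permutation, [3, 1, 2]))
```

Unlike N-version programming, only one algorithm runs in the normal case, so the extra cost is paid only when the acceptance test fails. The scheme is only as good as the acceptance test, which must be simpler and more trustworthy than the algorithms it checks.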