
8. Fault Tolerance in Software 8.1 Introduction
Is it true that a program that has once performed a given task as specified will continue to do so? Yes, provided that none of the following change:
- the inputs
- the computing environment
- the user requirements

8. Fault Tolerance in Software 8.1 Introduction
Consistency of failure rates over time. Federal Reserve Funds Transfer Program, active 12 hours/day, 5 days/week.

8. Fault Tolerance in Software 8.1 Introduction
Failure rates of Command and Control Systems.
According to the Data and Analysis Center for Software (DACS), fault density (the number of faults per 1000 lines of code) ranges from 10 to 50 for “good” SW and from 1 to 5 after intensive testing using automated tools.
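As a small worked example of the fault density measure, the sketch below computes faults per 1000 lines of code. The function name and the project figures are hypothetical and chosen only for illustration; they are not DACS data.

```python
def fault_density(faults: int, lines_of_code: int) -> float:
    """Return the number of faults per 1000 lines of code."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return faults * 1000 / lines_of_code


# Hypothetical project: 120 known faults in 8,000 lines of code.
print(fault_density(120, 8_000))  # 15.0
```

A value of 15 would fall inside the 10 to 50 range quoted for “good” SW before intensive testing.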

8. Fault Tolerance in Software 8.1 Introduction
Consequences of SW failure:
- Attendees have personal experience with incorrect billing and lost airline or hotel reservations.
- More serious errors are reported in the media, such as the disruption of phone service to over 20 million customers during the summer of 1991 due to a coding error in a new-generation digital switch.
- The most serious consequences are related to real-time applications, such as those involving spacecraft: the launch failure of Mariner I (1962), the destruction of a French meteorological satellite in 1968, several problems during the Apollo missions in the early 1970s, the NASA Space Shuttle, the fly-by-wire Airbus A320, the Russian satellite “Mars”, and the satellite launcher Ariane.

8. Fault Tolerance in Software 8.1 Introduction
Causes of SW failure:
- Malfunction of a process, e.g., exception handling, timeout computation, or a design error (solution: check the outputs and the timer);
- Erroneous control sequence (solution: set an upper limit on loop iterations);
- Data entry error (solution: use error-detecting codes and type checks on input data).
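The remedy for an erroneous control sequence, an upper limit on loop iterations, can be sketched as follows. The routine (a Newton iteration for a square root), its name, and the default limit of 50 are assumptions made for illustration only.

```python
def bounded_sqrt(x: float, tolerance: float = 1e-12, max_iterations: int = 50) -> float:
    """Newton's method for sqrt(x), guarded by an upper limit on loop iterations."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        return 0.0
    guess = x if x >= 1 else 1.0
    for _ in range(max_iterations):  # the loop can never run away
        next_guess = 0.5 * (guess + x / guess)
        if abs(next_guess - guess) <= tolerance * next_guess:
            return next_guess
        guess = next_guess
    # Reaching this point means the control sequence did not terminate as designed.
    raise RuntimeError("iteration limit exceeded")
```

If the limit is reached, the caller receives an explicit error instead of a hung process, which gives the surrounding system a chance to react.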

8. Fault Tolerance in Software 8.3 Dealing with Faulty Programs Robustness
The minimum requirement is that the program will properly handle inputs that are out of range, or of a different type or format than defined, without degrading its performance of functions not dependent on the nonstandard input.
When input data are found not to comply with the program specification:
- a new input may be requested;
- the last acceptable value of a variable can be used; or
- a predefined default can be assigned.

8. Fault Tolerance in Software 8.3 Dealing with Faulty Programs Robustness
In general, robustness checks are used to test:
- the function of a process (e.g., by checking the outputs);
- the control sequence (e.g., by setting an upper limit on loop iterations);
- the input data (e.g., by using error-detecting codes and type checks).
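The input-data checks and the three fallback options for nonstandard input might be combined as in the following sketch. All names (`parse_in_range`, `robust_input`, `request_new`) and the order in which the fallbacks are tried are assumptions, not something the slides prescribe.

```python
from typing import Callable, Optional


def parse_in_range(raw: object, low: float, high: float) -> Optional[float]:
    """Return raw as a float if its type/format is valid and it lies in [low, high]."""
    try:
        value = float(raw)  # type and format check
    except (TypeError, ValueError):
        return None
    return value if low <= value <= high else None  # range check (NaN is rejected too)


def robust_input(raw: object, low: float, high: float,
                 last_good: Optional[float] = None, default: float = 0.0,
                 request_new: Optional[Callable[[], object]] = None) -> float:
    """Validate raw; on failure fall back to a new input, the last good value, or a default."""
    value = parse_in_range(raw, low, high)
    if value is None and request_new is not None:
        value = parse_in_range(request_new(), low, high)  # option 1: request a new input
    if value is None and last_good is not None:
        value = last_good                                 # option 2: last acceptable value
    if value is None:
        value = default                                   # option 3: predefined default
    return value
```

For example, `robust_input("abc", 0, 100, last_good=37.0)` returns `37.0`, while functions that do not depend on this input are unaffected.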

8. Fault Tolerance in Software 8.3 Dealing with Faulty Programs Temporal Redundancy
Temporal Redundancy consists of the reexecution of a program when an error is encountered. The error may involve faulty data (as detected by robustness checks), faulty execution (e.g., accessing protected memory), or incorrect output (as detected by Acceptance Tests).
Reexecution will clear errors that arose from temporary circumstances that are no longer present when a new pass through the program is taken, e.g., busy or noisy communication channels, full buffers, or power supply transients.

8. Fault Tolerance in Software 8.3 Dealing with Faulty Programs Temporal Redundancy
When the error persists, Fault Containment Procedures must be triggered by the system.
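A minimal sketch of temporal redundancy: the task is reexecuted a bounded number of times, and if the error persists a fault containment hook is called. The names `run_with_reexecution`, `PersistentFault`, and `contain`, as well as the default of three attempts, are hypothetical.

```python
from typing import Callable, Optional


class PersistentFault(Exception):
    """Raised when reexecution did not clear the error."""


def run_with_reexecution(task: Callable[[], object], max_attempts: int = 3,
                         contain: Optional[Callable[[Exception], None]] = None):
    """Reexecute task on error; trigger fault containment if the error persists."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return task()  # a transient fault may be gone on the next pass
        except Exception as error:
            last_error = error
    if contain is not None:
        contain(last_error)  # e.g. isolate the failing task (assumed hook)
    raise PersistentFault(f"failed {max_attempts} times") from last_error


def make_flaky_channel(failures: int) -> Callable[[], str]:
    """Simulate a busy channel that fails a given number of times, then succeeds."""
    remaining = [failures]

    def send() -> str:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ConnectionError("channel busy")
        return "ack"

    return send


print(run_with_reexecution(make_flaky_channel(2)))  # ack
```

With two transient failures the third pass succeeds; a channel that fails on every attempt ends in `PersistentFault` after the containment hook has run.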

8. Fault Tolerance in Software 8.3 Dealing with Faulty Programs Software Diversity
SW Diversity permits uninterrupted system operation in the presence of program faults through multiple implementations of a given functional process, and it is therefore particularly applicable to real-time control systems. It is divided into two categories:
- Static SW Fault Tolerance: N-Version Programming
- Dynamic SW Fault Tolerance: Recovery Block

8. Fault Tolerance in Software 8.3 Dealing with Faulty Programs Software Diversity
Static SW Fault Tolerance: N-Version Programming
- In principle, a given task is executed by several programs (consecutively on the same machine), and the result is accepted only if a specified number of programs agree within specified limits. The same computer performs the comparison and selects the results to be propagated to the external system.
- In practice, the programs are executed concurrently, and therefore multiple computers are required to implement this technique.

8. Fault Tolerance in Software 8.3 Dealing with Faulty Programs Software Diversity
Dynamic SW Fault Tolerance: Recovery Block
A single program is executed and the result (including intermediate results) is subjected to an Acceptance Test.

8. Fault Tolerance in Software 8.3 Dealing with Faulty Programs Software Diversity
- The term STATIC is used because the selection of the acceptable result does not affect the subsequent execution of the programs.
- The term DYNAMIC is used because the selection between the original and alternate program is made during execution, based on the outcome of the Acceptance Test.

8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity N-Version Programming
N-Version Programming is defined as the independent generation of N ≥ 2 functionally equivalent programs, called versions, from the same initial specification.
For N = 2, fault masking is not provided, and upon disagreement among the versions 3 alternatives are available:
- Retry or restart (in this case fault containment rather than FT is provided);
- Transition to a predefined “safe state”, possibly followed by later retries;
- Reliance on one of the versions, either designated in advance as more reliable or selected by a diagnostic program (in the latter case the technique takes on some aspects of dynamic redundancy).
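With only two versions, a disagreement can be detected but not masked. The sketch below combines two of the alternatives listed above, retry followed by a transition to a safe state. The function name, the numeric tolerance, and the single retry are assumptions for illustration.

```python
from typing import Callable


def two_version_result(version_a: Callable[[float], float],
                       version_b: Callable[[float], float],
                       x: float, safe_state: float,
                       tolerance: float = 1e-9, retries: int = 1) -> float:
    """Accept a result only if both versions agree; otherwise retry, then use the safe state."""
    for _ in range(retries + 1):
        a, b = version_a(x), version_b(x)
        if abs(a - b) <= tolerance:
            return a  # agreement: propagate the result
    return safe_state  # persistent disagreement: predefined safe state
```

For instance, `two_version_result(lambda x: 2 * x, lambda x: x + x, 3.0, safe_state=0.0)` returns `6.0`, whereas two versions that never agree yield the safe state `0.0`.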

8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity N-Version Programming
For N > 2, majority voting logic can be implemented. For N = 3, the following are required:
I. Three independent programs, each furnishing identical output formats;
II. An acceptance program that evaluates the outputs of (I) and selects the result to be furnished as the N-version output;
III. A driver (process controller) that invokes (I) and (II) and furnishes the N-version output to other programs or the physical system.
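The three elements (independent versions, an acceptance program, and a driver) might look as follows for numeric outputs. This is a sketch under stated assumptions: the versions run consecutively in one process, "agreement within specified limits" is modelled as an absolute tolerance, and the names `vote` and `n_version_driver` are hypothetical.

```python
import math
from typing import Callable, Optional, Sequence


def vote(outputs: Sequence[float], tolerance: float, quorum: int) -> Optional[float]:
    """Acceptance program: return a value backed by at least `quorum` agreeing outputs."""
    for candidate in outputs:
        agreeing = sorted(o for o in outputs if abs(o - candidate) <= tolerance)
        if len(agreeing) >= quorum:
            return agreeing[len(agreeing) // 2]  # middle value of the agreeing group
    return None  # no majority


def n_version_driver(versions: Sequence[Callable[[float], float]], x: float,
                     tolerance: float = 1e-6) -> Optional[float]:
    """Driver: invoke every version, then hand the outputs to the voter."""
    outputs = []
    for version in versions:
        try:
            outputs.append(version(x))
        except Exception:
            pass  # a version that aborts simply casts no vote
    return vote(outputs, tolerance, quorum=len(versions) // 2 + 1)


# Three hypothetical versions of a square root; the third one is faulty.
versions = [math.sqrt, lambda x: x ** 0.5, lambda x: x / 2]
print(n_version_driver(versions, 9.0))  # 3.0
```

The faulty third version is outvoted by the two that agree. Catching exceptions inside the driver keeps an aborting version from preventing the vote.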

8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity N-Version Programming
Experiment carried out at UCLA (1978):
- 7 separate versions of the application program;
- From these, 12 3-version sets were constructed;
- Each set was subjected to 32 test cases, yielding 384 total tests.
One of the conclusions: in cases where a single faulty version resulted in incorrect execution, the OS of the computer intervened before the program reached the voting stage. Most later N-version experiments overcame this problem by incorporating acceptance tests for abort conditions and precluding the intervention of the OS under these conditions.
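The test count can be checked with a few lines of arithmetic. The number of possible 3-version sets is computed here only for context; the slides state only that 12 sets were used.

```python
import math

versions = 7
possible_sets = math.comb(versions, 3)  # distinct 3-version sets that 7 versions allow
sets_used = 12                          # sets actually constructed in the experiment
test_cases_per_set = 32
total_tests = sets_used * test_cases_per_set

print(possible_sets, total_tests)  # 35 384
```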

8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity N-Version Programming
Results of an Early N-Version Programming Experiment.

8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity Recovery Block
The Recovery Block represents the Dynamic Redundancy approach to SW fault tolerance. It consists of 3 SW elements:
- a primary routine, which executes critical SW functions;
- an acceptance test, which tests the output of the primary routine after every execution;
- at least one alternate routine, which performs the same function as the primary routine (but may be less capable or slower) and is invoked by the acceptance test upon detection of a failure.

8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity Recovery Block
The basic structure is:
    Ensure T
    By P
    Else by Q
    Else Error
where T is the acceptance test condition that is expected to be met by successful execution of either the primary routine P or the alternate routine Q. The structure is easily expanded to accommodate several alternates Q1, Q2, Q3, ..., Qn.
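The “Ensure T By P Else by Q Else Error” structure can be sketched as below. Giving every routine its own copy of the input state is an assumption added here so that a failed routine cannot corrupt the data seen by the next one; the names and the sorting example are hypothetical.

```python
import copy
from typing import Any, Callable, Sequence


class RecoveryBlockError(Exception):
    """The 'Else Error' branch: no routine satisfied the acceptance test."""


def recovery_block(acceptance_test: Callable[[Any], bool],
                   routines: Sequence[Callable[[Any], Any]], state: Any) -> Any:
    """Ensure acceptance_test by routines[0], else by routines[1], ..., else error."""
    for routine in routines:  # P first, then Q1, Q2, ..., Qn
        try:
            result = routine(copy.deepcopy(state))  # each routine starts from the same state
        except Exception:
            continue  # an aborting routine counts as a failed test
        if acceptance_test(result):
            return result
    raise RecoveryBlockError("all routines failed the acceptance test")


data = [3, 1, 2, 3]


def sorted_and_complete(result: list) -> bool:
    """T: the output is in order and has as many items as the input."""
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return in_order and len(result) == len(data)


def primary(xs: list) -> list:
    return sorted(set(xs))  # faulty P: silently drops duplicates


def alternate(xs: list) -> list:
    return sorted(xs)  # Q: plain but correct


print(recovery_block(sorted_and_complete, [primary, alternate], data))  # [1, 2, 3, 3]
```

Here the primary fails the length check, so the alternate is invoked and its output is accepted. Note that the test judges the result on its own merits rather than by comparing versions.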

8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity Recovery Block
The differences between Recovery Block and N-Version Programming are:
- only a single implementation of the program is run at a time (in this case, P or Q);
- the acceptability of the results is decided by a test rather than by comparison with functionally equivalent alternate versions.

8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity Recovery Block
Real-time control applications require that results furnished by a program be both correct and timely. For this reason, the recovery block for a real-time program should incorporate a watchdog timer that initiates execution of Q if P does not produce an acceptable result within the allocated time.
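A watchdog-guarded recovery block might be approximated in Python as follows. This is only a sketch: the deadline is enforced with a worker thread and a timeout, the names are hypothetical, and, unlike a real timer interrupt, Python cannot forcibly stop the primary routine, which keeps running in the background after the deadline.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable


def timed_recovery_block(primary: Callable[[Any], Any], alternate: Callable[[Any], Any],
                         acceptance_test: Callable[[Any], bool],
                         arg: Any, deadline_s: float) -> Any:
    """Run P under a deadline; switch to Q on timeout, abort, or a failed test."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(primary, arg)
    try:
        result = future.result(timeout=deadline_s)  # the watchdog: wait at most deadline_s
        if acceptance_test(result):
            return result
    except FutureTimeout:
        pass  # watchdog expired: P was not timely
    except Exception:
        pass  # P aborted
    finally:
        executor.shutdown(wait=False)  # do not block on a late primary
    result = alternate(arg)
    if acceptance_test(result):
        return result
    raise RuntimeError("alternate routine also failed the acceptance test")


def slow_primary(x: float):
    time.sleep(0.3)  # simulates a primary that overruns its time slot
    return ("primary", x * x)


def quick_alternate(x: float):
    return ("alternate", x * x)
```

Calling `timed_recovery_block(slow_primary, quick_alternate, lambda r: r[1] >= 0, 4, deadline_s=0.05)` returns the alternate's result, while a deadline of one second lets the primary finish and be accepted.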

8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity Recovery Block
Recovery block for real-time application *. (Program flow under direction of the application module is shown in solid lines; timer-triggered interrupts are shown in dashed lines.)

8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity Recovery Block
Highlights:
- A single program is executed at any given time: no special demands on computer redundancy or computer architecture are made.
- The performance penalty in normal operation is small: the execution of the acceptance test.

8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity Recovery Block
Highlights:
- Storage requirements are expanded: in addition to the primary application program, the acceptance test and the backup program must also be available in memory.
- SW development cost is increased: two programs and the associated acceptance test need to be generated.

8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity Recovery Block
Details about the Basic Recovery Block Structure:
The Acceptance Test is divided into 2 separate parts, which are invoked before and after the execution of the primary routine.
Before:
- The first acceptance test checks the call format and parameters.
- The second acceptance test checks the validity of the input data. (When data errors are common, provision of an alternate data source may be considered; dashed lines indicate the backup data.)
After:
- The last acceptance test examines the output data.
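The before and after tests around the primary routine could be arranged as in this sketch, which also includes the optional backup data source. Every name (`guarded_call`, `AcceptanceFailure`, `scaled_mean`) and the sensor-style example are assumptions for illustration.

```python
from typing import Any, Callable, Dict


class AcceptanceFailure(Exception):
    """Raised when one of the acceptance tests rejects the call, the data, or the output."""


def guarded_call(routine: Callable[..., Any], params: Dict[str, Any],
                 primary_source: Callable[[], Any], backup_source: Callable[[], Any],
                 params_ok: Callable[[Dict[str, Any]], bool],
                 data_ok: Callable[[Any], bool],
                 output_ok: Callable[[Any], bool]) -> Any:
    # Before, test 1: call format and parameters.
    if not params_ok(params):
        raise AcceptanceFailure("invalid call parameters")
    # Before, test 2: validity of the input data, with a backup data source.
    data = primary_source()
    if not data_ok(data):
        data = backup_source()
        if not data_ok(data):
            raise AcceptanceFailure("no valid input data available")
    # Execute the primary routine.
    result = routine(data, **params)
    # After: examine the output data.
    if not output_ok(result):
        raise AcceptanceFailure("output rejected")
    return result


def scaled_mean(data, scale):
    """Hypothetical primary routine: scaled average of the readings."""
    return scale * sum(data) / len(data)
```

A raised `AcceptanceFailure` is the point at which a surrounding recovery block would switch to its alternate routine.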

8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity Recovery Block
Internal structure for primary application module.

8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity Recovery Block
The integration of application modules structured as recovery blocks into a fault-tolerant SW system is shown in the next figure.
* “Application Modules” and the decision diamond labeled “Return” together represent the structure shown in figure *.
- In the absence of failures of the recovery blocks, the process will always remain in the inner loop.
- If an abort is taken, the failure is recorded and some diagnostics may be performed.
- In case of a first failure in a recovery block, a retry may be initiated.
- If the failure persists, further execution of the task represented by the recovery block is suspended.
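The executive's handling of aborts (record, retry once, then suspend) can be sketched as a simple scheduling loop. The names `Abort` and `run_executive`, the immediate single retry, and the module names are assumptions, since the slides do not fix these details.

```python
from typing import Callable, Dict, List, Set, Tuple


class Abort(Exception):
    """Raised by a recovery block whose primary and alternate routines all failed."""


def run_executive(modules: Dict[str, Callable[[], None]],
                  cycles: int) -> Tuple[Set[str], List[Tuple[int, str, str]]]:
    """Run each application module once per cycle; retry on abort, suspend if it persists."""
    suspended: Set[str] = set()
    failure_log: List[Tuple[int, str, str]] = []
    for cycle in range(cycles):
        for name, module in modules.items():
            if name in suspended:
                continue
            try:
                module()  # normal case: stay in the inner loop
                continue
            except Abort as abort:
                failure_log.append((cycle, name, str(abort)))  # record the failure
            try:
                module()  # first failure: one retry
            except Abort as abort:
                failure_log.append((cycle, name, str(abort)))
                suspended.add(name)  # failure persists: suspend the task
    return suspended, failure_log


def make_transient(failures: int) -> Callable[[], None]:
    """Module that aborts a given number of times and then works."""
    remaining = [failures]

    def module() -> None:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise Abort("transient")

    return module


def always_abort() -> None:
    raise Abort("persistent")


modules = {"nav": lambda: None, "radio": make_transient(1), "pump": always_abort}
suspended, log = run_executive(modules, cycles=3)
print(suspended, len(log))  # {'pump'} 3
```

The transient fault in the hypothetical "radio" module is cleared by the retry, while "pump" is suspended after its retry fails, and the remaining modules keep running.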

8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity Recovery Block
Executive and application modules.