8. Fault Tolerance in Software

1 8. Fault Tolerance in Software 8.1 Introduction Is it true that a program that has once performed a given task as specified will continue to do so? Yes, provided that none of the following parameters change: the inputs; the computing environment; the user requirements.

2 8. Fault Tolerance in Software 8.1 Introduction Consistency of failure rates in time. Federal Reserve Funds Transfer Program, active 12 hours/day, 5 days/week.

3 8. Fault Tolerance in Software 8.1 Introduction Failure rates of Command and Control Systems. According to the Data and Analysis Center for Software (DACS), fault density (the number of faults per 1000 lines of code) ranges from 10 to 50 for “good” SW and from 1 to 5 after intensive testing using automated tools.

4 8. Fault Tolerance in Software 8.1 Introduction Consequences of SW failure: Most of the audience has personal experience with incorrect billing or lost airline or hotel reservations. More serious errors are reported in the media, such as the disruption of phone service to over 20 million customers during the summer of 1991 due to a coding error in a new-generation digital switch. The most serious consequences are related to real-time applications, such as those involving spacecraft: the launch failure of Mariner I (1962), the destruction of a French meteorological satellite in 1968, several problems during the Apollo missions in the early 1970s, the NASA Space Shuttle, the fly-by-wire Airbus A320, the Russian satellite “Mars”, the satellite launcher Ariane.

5 8. Fault Tolerance in Software 8.1 Introduction Causes of SW failure: Malfunction of a process, e.g. exception handling, timeout computation, design error (solution: check the outputs and the timer); Erroneous control sequence (solution: set an upper limit on loop iterations); Data entry error (solution: use error-detecting codes and type checks on input data).

6 8. Fault Tolerance in Software 8.3 Dealing with Faulty Programs 8.3.1 Robustness The minimum requirement is that the program will properly handle inputs that are out of range, or of a different type or format than defined, without degrading its performance of functions that do not depend on the nonstandard input. When input data are found not to comply with the program specification: a new input may be requested; the last acceptable value of a variable can be used; or a predefined default can be assigned.
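The three fallbacks listed on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the valid range, the default value, and the names `is_valid` and `robust_read` are hypothetical and do not come from the slides.

```python
# Hypothetical specification: a speed reading must be a number in [0, 300].
VALID_RANGE = (0.0, 300.0)
DEFAULT_SPEED = 0.0  # assumed predefined default


def is_valid(raw):
    """Type check plus range check against the assumed specification."""
    if isinstance(raw, bool) or not isinstance(raw, (int, float)):
        return False
    return VALID_RANGE[0] <= raw <= VALID_RANGE[1]


def robust_read(raw, last_good=None, request_again=None):
    """Return a usable value even when `raw` violates the specification."""
    if is_valid(raw):
        return float(raw)
    # Fallback 1: request a new input, if a source for one was supplied.
    if request_again is not None:
        retry = request_again()
        if is_valid(retry):
            return float(retry)
    # Fallback 2: reuse the last acceptable value of the variable.
    if last_good is not None:
        return last_good
    # Fallback 3: assign the predefined default.
    return DEFAULT_SPEED
```

The order in which the fallbacks are tried is a design choice of this sketch; the slide lists them as alternatives without ranking them.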

7 8. Fault Tolerance in Software 8.3 Dealing with Faulty Programs 8.3.1 Robustness In general, robustness checks are used to test: the function of a process (e.g., by checking the outputs); the control sequence (e.g., by setting an upper limit on loop iterations); the input data (e.g., by using error-detecting codes and type checks).
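A small sketch can combine the three kinds of check in one routine. The square-root example, the iteration limit, and the tolerances below are assumptions chosen for illustration, not values from the slides.

```python
MAX_ITERATIONS = 50  # assumed upper limit on loop iterations


def bounded_sqrt(x, tol=1e-12):
    """Newton iteration for sqrt(x) guarded by input, loop, and output checks."""
    # Input data check: type and range.
    if isinstance(x, bool) or not isinstance(x, (int, float)) or x < 0:
        raise ValueError("input must be a non-negative number")
    if x == 0:
        return 0.0
    guess = float(x) if x >= 1 else 1.0
    # Control sequence check: the loop cannot run forever.
    for _ in range(MAX_ITERATIONS):
        nxt = 0.5 * (guess + x / guess)
        converged = abs(nxt - guess) <= tol * nxt
        guess = nxt
        if converged:
            break
    else:
        raise RuntimeError("iteration limit exceeded")
    # Function check: verify the output against the defining property.
    if abs(guess * guess - x) > 1e-6 * max(1.0, x):
        raise RuntimeError("output check failed")
    return guess
```

The output check is cheap here because squaring is much simpler than extracting the root, which is the typical situation in which checking the outputs pays off.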

8 8. Fault Tolerance in Software 8.3 Dealing with Faulty Programs 8.3.2 Temporal Redundancy Temporal redundancy consists of the reexecution of a program when an error is encountered. The error may involve faulty data (as detected by robustness checks), faulty execution (e.g., accessing protected memory), or incorrect output (as detected by acceptance tests). Reexecution will clear errors that arose from temporary circumstances that are no longer present when a new pass through the program is taken, e.g., busy or noisy communication channels, full buffers, power supply transients.

9 8. Fault Tolerance in Software 8.3 Dealing with Faulty Programs 8.3.2 Temporal Redundancy When the error persists, Fault Containment Procedures must be triggered by the system.
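Reexecution followed by fault containment might look like the sketch below. The names `run_with_retry`, `PersistentFault`, and `make_flaky_channel`, as well as the limit of three attempts, are hypothetical; the slides do not specify how many retries are made or what the containment procedure does.

```python
class PersistentFault(Exception):
    """Raised when reexecution did not clear the error."""


def run_with_retry(task, max_attempts=3, contain=None):
    """Reexecute `task` on error; trigger containment if the error persists."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task()
        except Exception as exc:  # a transient error may vanish on the next pass
            last_error = exc
    if contain is not None:
        contain(last_error)  # hand the persistent error to fault containment
    raise PersistentFault(f"failed {max_attempts} times") from last_error


def make_flaky_channel(failures_before_success):
    """Simulate a busy channel that clears after a number of failed sends."""
    remaining = [failures_before_success]

    def send():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ConnectionError("channel busy")
        return "ack"

    return send
```

With `make_flaky_channel(2)` the third attempt succeeds; with a channel that stays busy longer than `max_attempts`, the `contain` callback is invoked once and `PersistentFault` is raised.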

10 8. Fault Tolerance in Software 8.3 Dealing with Faulty Programs 8.3.3 Software Diversity SW diversity permits uninterrupted system operation in the presence of program faults through multiple implementations of a given functional process, and it is therefore particularly applicable to real-time control systems. It is divided into two categories: Static SW Fault Tolerance: N-Version Programming; Dynamic SW Fault Tolerance: Recovery Block.

11 8. Fault Tolerance in Software 8.3 Dealing with Faulty Programs 8.3.3 Software Diversity Static SW Fault Tolerance: N-Version Programming  A given task is executed by several programs (consecutively on the same machine) and the result is accepted only if a specified number of programs agree within specified limits. The same computer performs the comparison and the selection of the results to be propagated to the external system.  In practice, the programs are executed concurrently, and therefore multiple computers are required to implement this technique.

12 8. Fault Tolerance in Software 8.3 Dealing with Faulty Programs 8.3.3 Software Diversity Dynamic SW Fault Tolerance: Recovery Block  A single program is executed and the result (including intermediate results) is subjected to an Acceptance Test.

13 8. Fault Tolerance in Software 8.3 Dealing with Faulty Programs 8.3.3 Software Diversity The term STATIC is used because the selection of the acceptable result does not affect the subsequent execution of the programs. The term DYNAMIC is used because the selection between the original and alternate program is made during execution based on the outcome of the Acceptance Test.

14 8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity 8.4.1 N-Version Programming Defined as the independent generation of N ≥ 2 functionally equivalent programs, called versions, from the same initial specification. For N = 2, fault masking is not provided, and upon disagreement among the versions 3 alternatives are available: Retry or restart (in this case fault containment rather than FT is provided); Transition to a predefined “safe state”, possibly followed by later retries; Reliance on one of the versions, either designated in advance as more reliable or selected by a diagnostic program (in the latter case the technique takes on some aspects of dynamic redundancy).

15 8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity 8.4.1 N-Version Programming For N > 2, majority voting logic can be implemented. For N = 3, the following are required: (i) Three independent programs, each furnishing identical output formats; (ii) An acceptance program that evaluates the outputs of (i) and selects the result to be furnished as the N-version output; (iii) A driver (process controller) that invokes (i) and (ii) and furnishes the N-version output to other programs or to the physical system.
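The three required elements for N = 3 can be sketched as follows. This is a toy illustration, not an implementation from the slides: the three square-root versions, the tolerance, and the names `vote` and `driver` are assumptions, and the versions run consecutively rather than on separate computers.

```python
import math


# (i) Three independently written versions with identical output format.
def version_a(x):
    return math.sqrt(x)


def version_b(x):
    return x ** 0.5


def version_c(x):
    if x == 0:
        return 0.0
    guess = x if x >= 1 else 1.0
    for _ in range(60):  # fixed-count Newton iteration
        guess = 0.5 * (guess + x / guess)
    return guess


# (ii) Acceptance program: select a result on which a majority agrees.
def vote(results, tolerance=1e-6, quorum=2):
    for candidate in results:
        agreeing = [r for r in results if abs(r - candidate) <= tolerance]
        if len(agreeing) >= quorum:
            return sum(agreeing) / len(agreeing)
    raise RuntimeError("no majority among versions")


# (iii) Driver: invoke the versions and the voter, furnish the output.
def driver(x, versions=(version_a, version_b, version_c)):
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass  # a crashed version loses its vote instead of aborting the run
    return vote(results)
```

Catching exceptions inside the driver keeps a single failing version from ending the run before the voting stage is reached.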

16 8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity 8.4.1 N-Version Programming Experiment carried out at UCLA (1978): 7 separate versions of the application program were written; from these, 12 3-version sets were constructed; each set was subjected to 32 test cases, yielding 384 total tests. One of the conclusions: in cases where a single faulty version resulted in incorrect execution, the OS of the computer intervened before the program reached the voting stage. Most later N-version experiments overcame this problem by incorporating acceptance tests for abort conditions and precluding the intervention of the OS under these conditions.

17 8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity 8.4.1 N-Version Programming Results of an Early N-Version Programming Experiment.

18 8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity 8.4.2 Recovery Block Represents the dynamic redundancy approach to SW fault tolerance. Consists of 3 SW elements: a primary routine, which executes critical SW functions; an acceptance test, which tests the output of the primary routine after every execution; at least one alternate routine, which performs the same function as the primary routine (but may be less capable or slower) and is invoked by the acceptance test upon detection of a failure.

19 8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity 8.4.2 Recovery Block The basic structure is: Ensure T By P Else by Q Else Error, where T is the acceptance test condition that is expected to be met by successful execution of either the primary routine P or the alternate routine Q. The structure is easily expanded to accommodate several alternates Q1, Q2, Q3, ..., Qn.
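The "Ensure T By P Else by Q Else Error" structure maps naturally onto a small higher-order function. The sketch below is an assumption-laden illustration: the helper `ensure`, the sorting example, and the deliberately faulty primary routine are all hypothetical.

```python
import copy


class RecoveryBlockError(Exception):
    """The final 'Else Error' branch: no routine satisfied T."""


def ensure(test, primary, *alternates):
    """Build a recovery block: Ensure T By P Else by Q1 ... Else Error."""
    def block(data):
        for routine in (primary, *alternates):
            snapshot = copy.deepcopy(data)  # each routine starts from clean input
            try:
                result = routine(snapshot)
            except Exception:
                continue  # a crash counts as a failed acceptance test
            if test(data, result):
                return result
        raise RecoveryBlockError("all routines failed the acceptance test")
    return block


def is_sorted_same_length(data, result):
    """Acceptance test T: output is ordered and has as many items as the input."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and len(result) == len(data)


def fast_sort(items):
    """Primary P with a seeded fault: duplicates are silently dropped."""
    return sorted(set(items))


def insertion_sort(items):
    """Alternate Q: slower but simpler."""
    out = []
    for item in items:
        pos = len(out)
        while pos > 0 and out[pos - 1] > item:
            pos -= 1
        out.insert(pos, item)
    return out


safe_sort = ensure(is_sorted_same_length, fast_sort, insertion_sort)
```

For `[2, 1]` the primary passes the test and its result is used; for `[3, 1, 3, 2]` the primary's output is too short, so the alternate's result is returned instead.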

20 8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity 8.4.2 Recovery Block The differences between Recovery Block and N-Version Programming are: only a single implementation of the program is run at a time (in this case: P or Q); the acceptability of the results is decided by a test rather than by comparison with functionally equivalent alternate versions.

21 8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity 8.4.2 Recovery Block Real-time control applications require that results furnished by a program be both correct and timely. For this reason, the recovery block for a real-time program should incorporate a watchdog timer, which initiates execution of Q if P does not produce an acceptable result within the allocated time.
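A watchdog can be approximated in Python with a future and a timeout. This is a rough sketch only: the function name, the deadline, and the stand-in routines are assumptions, and a Python thread cannot be forcibly stopped, so the late primary keeps running in the background, whereas a real-time system would use a timer interrupt.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def timed_recovery_block(primary, alternate, acceptance_test, deadline_s):
    """Run P under a watchdog; fall back to Q on timeout, crash, or bad result."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(primary)
    try:
        result = future.result(timeout=deadline_s)  # the watchdog deadline
        if acceptance_test(result):
            return result
    except FutureTimeout:
        pass  # watchdog expired: P was not timely
    except Exception:
        pass  # P crashed: treated like a failed acceptance test
    finally:
        executor.shutdown(wait=False)  # do not wait for a late primary
    result = alternate()
    if acceptance_test(result):
        return result
    raise RuntimeError("neither routine produced an acceptable result")


def slow_primary():
    time.sleep(0.2)  # simulated overrun of the allocated time
    return 42.0


def fast_primary():
    return 42.0


def coarse_alternate():
    return 41.0  # less precise, but timely


def plausible(value):
    return 40.0 <= value <= 45.0
```

With a 0.05 s deadline, `slow_primary` times out and the alternate's 41.0 is returned, while `fast_primary` delivers 42.0 directly.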

22 8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity 8.4.2 Recovery Block Recovery block for real-time application *. (Program flow under direction of the application module is shown in solid lines; timer-triggered interrupts are shown in dashed lines.)

23 8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity 8.4.2 Recovery Block Highlights: A single program is executed at any given time: no special demands on computer redundancy or computer architecture are made. The performance penalty in normal operation is small: the execution of the acceptance test.

24 8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity 8.4.2 Recovery Block Highlights: Storage requirements are expanded: in addition to the primary application program, the acceptance test and the backup program must also be available in memory. SW development cost is increased: two programs and the associated acceptance test need to be generated.

25 8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity 8.4.2 Recovery Block Details about the Basic Recovery Block Structure: The Acceptance Test is divided into separate tests, which are invoked before and after the execution of the primary routine. Before: The first acceptance test checks the call format and parameters. The second acceptance test checks the validity of the input data. (When data errors are common, provision of an alternate data source may be considered: dashed lines in the figure indicate the backup data.) After: The last acceptance test examines the output data.
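The before and after tests around the primary routine might be arranged as below. The temperature-averaging module, its limits, and all function names are hypothetical choices made for this sketch.

```python
def inputs_valid(readings):
    """Input data test: non-empty list of plausible temperatures (assumed limits)."""
    return (
        len(readings) > 0
        and all(isinstance(r, (int, float)) and not isinstance(r, bool) for r in readings)
        and all(-50.0 <= r <= 150.0 for r in readings)
    )


def primary_average(readings):
    """Primary routine: mean of the readings."""
    return sum(readings) / len(readings)


def output_valid(result, readings):
    """Output test: a mean must lie between the smallest and largest reading."""
    return min(readings) <= result <= max(readings)


def run_module(readings, backup_readings=None):
    # Before, test 1: call format and parameters.
    if not isinstance(readings, list):
        raise TypeError("readings must be a list")
    # Before, test 2: validity of the input data, with an alternate data source.
    if not inputs_valid(readings):
        if not isinstance(backup_readings, list) or not inputs_valid(backup_readings):
            raise ValueError("no valid input data available")
        readings = backup_readings
    result = primary_average(readings)
    # After: examine the output data.
    if not output_valid(result, readings):
        raise RuntimeError("output rejected by acceptance test")
    return result
```

In a full recovery block, the `RuntimeError` raised by the final test would be the signal to invoke the alternate routine rather than a terminal error.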

26 8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity 8.4.2 Recovery Block Internal Structure for primary application module.

27 8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity 8.4.2 Recovery Block The integration of application modules structured as recovery blocks into a fault-tolerant SW system is shown in the next figure. “Application Modules” and the decision diamond labeled “Return” together represent the structure shown in figure *. In the absence of failures of the recovery blocks, the process will always remain in the inner loop. If an abort is taken, the failure is recorded and some diagnostics may be performed. In case of a first failure in a recovery block, a retry may be initiated. If the failure persists, further execution of the task represented by the recovery block is suspended.
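The executive behaviour described on this slide (record the failure, retry once, suspend the task if the failure persists) can be sketched as a simple scheduling loop. The function `executive`, the log format, and the module names are assumptions for illustration.

```python
def executive(modules, cycles):
    """Run each application module once per cycle; retry once, then suspend."""
    failure_log = []   # recorded failures: (cycle, module name, attempt, error)
    suspended = set()  # tasks whose failure persisted
    for cycle in range(cycles):
        for name, module in modules.items():
            if name in suspended:
                continue
            for attempt in (1, 2):  # attempt 2 is the retry after a first failure
                try:
                    module()
                    break  # normal return: stay in the inner loop
                except Exception as exc:
                    failure_log.append((cycle, name, attempt, repr(exc)))
            else:
                suspended.add(name)  # the failure persisted through the retry
    return failure_log, suspended


def make_module(failures_before_success):
    """Hypothetical module whose recovery block aborts a fixed number of times."""
    remaining = [failures_before_success]

    def module():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("recovery block aborted")

    return module
```

A module that aborts once is retried and stays scheduled; one that aborts on the retry as well is suspended and skipped in later cycles.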

28 8. Fault Tolerance in Software 8.4 Design of Fault Tolerant Software Using Diversity 8.4.2 Recovery Block Executive and application modules.

