©Ian Sommerville 2004, Software Engineering, 7th edition. Chapter 20: Critical systems development 3

Slide 1: Critical systems development 3

Slide 2: Fault recovery and repair
- Forward recovery: apply repairs to a corrupted system state.
- Backward recovery: restore the system state to a known safe state.
- Forward recovery is usually application-specific; domain knowledge is required to compute possible state corrections.
- Backward error recovery is simpler. Details of a safe state are maintained, and this replaces the corrupted system state.

Slide 3: Forward recovery
- Corruption of data coding: error coding techniques that add redundancy to coded data can be used for repairing data corrupted during transmission.
- Redundant pointers: when redundant pointers are included in data structures (e.g. two-way lists), a corrupted list or filestore may be rebuilt if a sufficient number of pointers are uncorrupted. This is often used for database and file system repair.
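
The redundant-pointer idea can be sketched in a few lines. This is a minimal illustration, not code from the slides: the `Node` class and the helper names are hypothetical, and the sketch assumes that the forward (`next`) chain is intact and only the backward (`prev`) links may be corrupted.

```python
class Node:
    """One element of a two-way (doubly linked) list."""

    def __init__(self, value):
        self.value = value
        self.next = None
        self.prev = None


def build_list(values):
    """Build a two-way list from `values` and return its head node."""
    head = None
    tail = None
    for value in values:
        node = Node(value)
        if head is None:
            head = node
        else:
            tail.next = node
            node.prev = tail
        tail = node
    return head


def repair_prev_pointers(head):
    """Forward recovery: recompute every `prev` link from the `next` links.

    Assumes the forward chain is uncorrupted. Returns the number of
    backward links that had to be repaired.
    """
    repaired = 0
    previous = None
    node = head
    while node is not None:
        if node.prev is not previous:
            node.prev = previous
            repaired += 1
        previous = node
        node = node.next
    return repaired


def backward_values(head):
    """Walk to the tail via `next`, then collect values via `prev`."""
    tail = head
    while tail.next is not None:
        tail = tail.next
    values = []
    node = tail
    while node is not None:
        values.append(node.value)
        node = node.prev
    return values


head = build_list([1, 2, 3, 4])
head.next.next.prev = None           # simulate corruption of one back pointer
repaired_count = repair_prev_pointers(head)
```

The repair works only because each link is stored twice; if both the `next` and `prev` pointers of the same region were corrupted, there would not be enough redundancy left to rebuild the list.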

Slide 4: Backward recovery
- Transactions are a frequently used method of backward recovery. Changes are not applied until computation is complete. If an error occurs, the system is left in the state preceding the transaction.
- Periodic checkpoints allow the system to 'roll back' to a correct state.
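
A rough sketch of the transaction idea follows. It is an illustration under stated assumptions, not a real transaction manager: `run_transaction` and `transfer` are hypothetical names, the state is a plain dictionary, and any exception is treated as the error that triggers recovery.

```python
import copy


def run_transaction(state, operation):
    """Apply `operation` to a working copy and commit only on success.

    Returns (new_state, committed). On any exception the original state
    is returned unchanged, which is the backward-recovery step.
    """
    working = copy.deepcopy(state)
    try:
        operation(working)
    except Exception:
        return state, False
    return working, True


def transfer(amount):
    """Build an operation that moves `amount` from account 'a' to 'b'."""
    def operation(accounts):
        # Credit first, so that a failure would leave a half-done update
        # if the change were applied directly to the live state.
        accounts["b"] += amount
        if accounts["a"] < amount:
            raise ValueError("insufficient funds")
        accounts["a"] -= amount
    return operation


accounts = {"a": 100, "b": 50}
accounts, ok_small = run_transaction(accounts, transfer(30))
accounts, ok_large = run_transaction(accounts, transfer(500))
```

The second transfer fails after it has already credited account 'b' in the working copy, but the committed state still shows the balances left by the first transfer.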

Slide 5: Safe sort procedure
- A sort operation monitors its own execution and assesses if the sort has been correctly executed.
- It maintains a copy of its input so that if an error occurs, the input is not corrupted.
- Based on identifying and handling exceptions.
- Used in situations where the source code of components is not available, so the component (bubblesort in this case) cannot be trusted.
- Possible in this case as the condition for a 'valid' sort is known. However, in many cases it is difficult to write validity checks.

Slide 6: Safe sort 1

Slide 7: Safe sort 2
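
The code on the two safe sort slides is not part of this transcript. The following Python sketch shows the general shape of such a procedure and is not the original listing: `buggy_bubblesort` is a hypothetical stand-in for the untrusted component, `SortError` is an assumed exception name, and the validity check (ordered output that is a permutation of the input) assumes hashable elements.

```python
from collections import Counter


class SortError(Exception):
    """Raised when the untrusted sort fails or returns an invalid result."""


def buggy_bubblesort(items):
    """Hypothetical untrusted component: performs only one bubble pass."""
    for i in range(len(items) - 1):
        if items[i] > items[i + 1]:
            items[i], items[i + 1] = items[i + 1], items[i]


def safe_sort(items, sort=buggy_bubblesort):
    """Sort `items` in place, or restore the input and raise SortError."""
    saved = list(items)                       # copy of the input
    try:
        sort(items)
        in_order = all(items[i] <= items[i + 1]
                       for i in range(len(items) - 1))
        same_elements = Counter(items) == Counter(saved)
        if not (in_order and same_elements):
            raise SortError("result failed the validity check")
    except Exception as exc:
        items[:] = saved                      # put the original input back
        if isinstance(exc, SortError):
            raise
        raise SortError("sort component raised an exception") from exc


ok_data = [3, 1, 2]
safe_sort(ok_data)                            # one pass happens to suffice

bad_data = [4, 3, 2, 1]
try:
    safe_sort(bad_data)
    failed = False
except SortError:
    failed = True
```

The caller either receives a correctly sorted list or an exception with the input left exactly as it was, so a faulty component cannot silently corrupt the data.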

Slide 8: Fault tolerant architecture
- Defensive programming cannot cope with faults that involve interactions between the hardware and the software.
- Misunderstandings of the requirements may mean that checks and the associated code are incorrect.
- Where systems have high availability requirements, a specific architecture designed to support fault tolerance may be required.
- This must tolerate both hardware and software failure.

Slide 9: Hardware fault tolerance
- Depends on triple-modular redundancy (TMR).
- There are three replicated identical components that receive the same input and whose outputs are compared.
- If one output is different, it is ignored and component failure is assumed.
- This is based on two assumptions: most faults result from component failures rather than design faults, and there is a low probability of simultaneous component failure.
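
The effect of these assumptions can be made concrete with the standard 2-out-of-3 reliability calculation. This is a worked sketch under assumptions that are not stated on the slide: each component works with the same probability $R$, failures are statistically independent, and the output comparator itself never fails.

```latex
% System works if all three components work, or exactly two of three work.
R_{\mathrm{TMR}} = R^{3} + 3R^{2}(1 - R) = 3R^{2} - 2R^{3}

% Example with R = 0.9:
R_{\mathrm{TMR}} = 3(0.81) - 2(0.729) = 2.43 - 1.458 = 0.972
```

Under these assumptions TMR improves on a single component only when $R > 0.5$, and the whole calculation breaks down if the components share a common design fault, because their failures are then no longer independent.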

Slide 10: Hardware reliability with TMR

Slide 11: Output selection
- The output comparator is a (relatively) simple hardware unit.
- It compares its input signals and, if one is different from the others, it rejects it. Essentially, the selection of the actual output depends on the majority vote.
- The output comparator is connected to a fault management unit that can either try to repair the faulty unit or take it out of service.
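
The comparator's selection rule can be modelled in software as a simple majority vote. This is only a sketch of the logic, since the slide describes a hardware unit: the `vote` function is a hypothetical name, and the list of disagreeing unit indices stands in for the information passed to the fault management unit.

```python
from collections import Counter


def vote(outputs):
    """Return (majority_value, indices_of_disagreeing_units).

    Raises ValueError when no value is produced by a strict majority
    of the units.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority output")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


selected, suspect_units = vote([7, 7, 9])
```

Here the third unit's output is rejected and its index is reported, so that it could be repaired or taken out of service.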

Slide 12: Fault tolerant software architectures
- The success of TMR at providing fault tolerance is based on two fundamental assumptions:
  - The hardware components do not include common design faults;
  - Components fail randomly and there is a low probability of simultaneous component failure.
- Neither of these assumptions is true for software:
  - It isn't possible simply to replicate the same component, as the replicas would have common design faults;
  - Simultaneous component failure is therefore virtually inevitable.
- Software systems must therefore be diverse.

Slide 13: Design diversity
- Different versions of the system are designed and implemented in different ways. They therefore ought to have different failure modes.
- Different approaches to design (e.g. object-oriented and function-oriented):
  - Implementation in different programming languages;
  - Use of different tools and development environments;
  - Use of different algorithms in the implementation.

Slide 14: Software analogies to TMR
- N-version programming
  - The same specification is implemented in a number of different versions by different teams. All versions compute simultaneously and the majority output is selected using a voting system.
  - This is the most commonly used approach, e.g. in many models of the Airbus commercial aircraft.
- Recovery blocks
  - A number of explicitly different versions of the same specification are written and executed in sequence.
  - An acceptance test is used to select the output to be transmitted.

Slide 15: N-version programming

Slide 16: Output comparison
- As in hardware systems, the output comparator is a simple piece of software that uses a voting mechanism to select the output.
- In real-time systems, there may be a requirement that the results from the different versions are all produced within a certain time frame.
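
A minimal sketch of N-version execution with a voting comparator and a time frame is given below. It rests on several assumptions that are not in the slides: the versions are ordinary Python callables run in threads, `run_n_versions` and the three example versions are hypothetical, and a version that misses the deadline or raises an exception simply contributes no vote.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, wait


def run_n_versions(versions, argument, deadline_seconds):
    """Run all versions concurrently and vote on results that arrive in time."""
    pool = ThreadPoolExecutor(max_workers=len(versions))
    futures = [pool.submit(version, argument) for version in versions]
    done, _late = wait(futures, timeout=deadline_seconds)
    pool.shutdown(wait=False)                 # do not block on late versions
    results = [f.result() for f in done if f.exception() is None]
    if not results:
        raise RuntimeError("no version produced a result in time")
    value, count = Counter(results).most_common(1)[0]
    # The majority is measured against all versions, so missing results
    # count against agreement.
    if count * 2 <= len(versions):
        raise RuntimeError("no majority among the versions")
    return value


def version_a(n):
    """Integer square root using the standard library."""
    return math.isqrt(n)


def version_b(n):
    """Integer square root by linear search."""
    root = 0
    while (root + 1) * (root + 1) <= n:
        root += 1
    return root


def version_c(n):
    """Hypothetical faulty version that returns a wrong answer."""
    return n // 2


agreed = run_n_versions([version_a, version_b, version_c], 16, 1.0)
```

The faulty third version is outvoted by the other two. If two versions contained the same fault, the vote would select the wrong value, which is why the approach depends on the versions failing independently.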

Slide 17: N-version programming
- The different system versions are designed and implemented by different teams. It is assumed that there is a low probability that they will make the same mistakes. The algorithms used should be different, but may not be.
- There is some empirical evidence that teams commonly misinterpret specifications in the same way and choose the same algorithms in their systems.

Slide 18: Recovery blocks
- As with N-version programming, this approach relies on the existence of redundant diverse components.
- However, execution of components is sequential rather than concurrent.
- An explicit acceptance test, rather than a voting mechanism, is used to decide if an operation has been successful.

Slide 19: Recovery blocks

Slide 20: Recovery blocks
- These force a different algorithm to be used for each version, so they reduce the probability of common errors.
- However, the design of the acceptance test is difficult as it must be independent of the computation used.
- There are problems with this approach for real-time systems because of the sequential operation of the redundant versions. Time to complete the operation is therefore variable.
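
The recovery block scheme can be sketched as a loop over alternates guarded by an acceptance test. This is an illustration only: `recovery_block`, the two sort alternates and the acceptance test are hypothetical, and the sketch assumes that each attempt can start from a fresh copy of the input state.

```python
import copy
from collections import Counter


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in sequence on a fresh copy of `state`.

    Returns the first result that passes `acceptance_test`.
    """
    for alternate in alternates:
        working = copy.deepcopy(state)        # restore point for every attempt
        try:
            result = alternate(working)
        except Exception:
            continue                          # a crash counts as a failed attempt
        if acceptance_test(state, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def faulty_sort(items):
    """Hypothetical primary version with a fault: sorts in the wrong order."""
    return sorted(items, reverse=True)


def insertion_sort(items):
    """Alternate version that uses a different algorithm."""
    result = []
    for item in items:
        position = len(result)
        while position > 0 and result[position - 1] > item:
            position -= 1
        result.insert(position, item)
    return result


def is_sorted_permutation(original, result):
    """Acceptance test that does not depend on how the result was computed."""
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return in_order and Counter(result) == Counter(original)


output = recovery_block([3, 1, 2], [faulty_sort, insertion_sort],
                        is_sorted_permutation)
```

The number of alternates that run before one is accepted depends on which of them fail, which is the source of the variable completion time noted above.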

Slide 21: Problems with design diversity
- Teams are not culturally diverse so they tend to tackle problems in the same way.
- Characteristic errors
  - Different teams make the same mistakes. Some parts of an implementation are more difficult than others so all teams tend to make mistakes in the same place.
- Specification errors
  - If there is an error in the specification then this is reflected in all implementations;
  - This can be addressed to some extent by using multiple specification representations.

Slide 22: Specification dependency
- Both approaches to software redundancy are susceptible to specification errors. If the specification is incorrect, the system could fail.
- This is also a problem with hardware, but software specifications are usually more complex than hardware specifications and harder to validate.
- This has been addressed in some cases by developing separate software specifications from the same user specification.

Slide 23: Key points
- Forward recovery from a fault means using some alternative approach to carry out some operation. This may use, e.g., redundant data.
- Backward recovery means restoring the system to some known correct state. It is generally easier to implement than forward recovery.
- N-version programming and recovery blocks are alternative approaches to fault-tolerant architectures.

