CSE 8377 Software Fault Tolerance

CSE 8377 Motivation
- Software is becoming central to many life-critical systems
- Software is created by error-prone humans
- In the real world, software is executed by error-intolerant machines
- Software development and maintenance are affected more by budget and schedule concerns than by concern for reliability

CSE 8377 Faults and Failures
- A program is said to contain a fault if, for some input data, the output is incorrect
- Each execution of the program that produces incorrect output is an observed failure
- Related terms: error, bug, mistake, malfunction, defect, etc.

CSE 8377 Software Reliability
- Many measures have been proposed
  - One widely accepted measure is the number of faults remaining after release
  - Since software size varies, one could specify the remaining faults per given number of lines of code
  - However, the effect and rate of occurrence of the remaining faults vary, so a better definition is needed
- Software reliability is defined as the probability that the software will function without failure under given environmental conditions during a specified period of time
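The probabilistic definition above can be written compactly. This is only a notational sketch: the symbols T (time to the first failure), E (the environmental conditions), R, and λ are introduced here and do not appear in the slides. The second line is a special case that holds only under the added assumption of a constant failure intensity λ.

```latex
% Reliability over the interval [0, t] in environment E
R(t \mid E) = \Pr\{\, T > t \mid E \,\}

% Special case, assuming a constant failure intensity \lambda
R(t \mid E) = e^{-\lambda t}
```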

CSE 8377 Software vs. Hardware Reliability
- Software faults are all due to human errors in creating the software, whereas hardware faults are due to random phenomena such as aging, external intervention, etc.
- Software has no aging property: with the testing intensity held constant, the software failure intensity is also constant, whereas hardware follows a bathtub curve

CSE 8377 Comparison (cont’d)
- Once a software fault is removed, it will never cause the same failure again
  - Software reliability can be improved by testing, whereas hardware requires better materials, improved design, increased strength, etc.
- Software redundancy does not make sense unless it is multi-version

CSE 8377 Reliability Improvement
- Fault Avoidance
- Fault Detection and Removal - Reliability Growth
- Fault Tolerance

CSE 8377 Fault Tolerance
- Recovery Block Schemes
- N-version programming
- Self checking programs
- Exception handling

CSE 8377 Exception Handling
- Intended service
  - Standard post condition
  - Standard domain (SD)
  - Exception domain (anticipated as well as unanticipated)
- Exception handlers try to mask the exception at a higher level
- Default handlers are provided for unanticipated exceptions (UE)
- Handlers can use either forward or backward recovery; default handlers usually use backward recovery
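The difference between the two recovery styles can be sketched in a few lines. This is an illustration only: the names `OutsideStandardDomain`, `record_sqrt`, and the `state` layout are hypothetical. The anticipated exception is handled by forward recovery (the state is repaired and execution continues), while the default handler for anything unanticipated rolls the state back to a saved copy.

```python
import copy


class OutsideStandardDomain(Exception):
    """Anticipated exception: the input lies outside the standard domain."""


def record_sqrt(state, x):
    """Append x to the history and store its square root in state['last']."""
    saved = copy.deepcopy(state)      # copy kept for backward recovery
    try:
        state["history"].append(x)    # state is modified before any check
        if x < 0:
            raise OutsideStandardDomain(x)
        state["last"] = x ** 0.5
    except OutsideStandardDomain:
        # Forward recovery: repair the state and carry on with a substitute.
        state["history"][-1] = 0.0
        state["last"] = 0.0
    except Exception:
        # Default handler for unanticipated exceptions: backward recovery,
        # i.e. restore the state saved on entry.
        state.clear()
        state.update(saved)
    return state


state = {"history": [], "last": None}
record_sqrt(state, 9.0)     # standard domain
record_sqrt(state, -4.0)    # anticipated exception, forward recovery
record_sqrt(state, None)    # unanticipated TypeError, backward recovery
```

After the three calls the history holds `[9.0, 0.0]`: the negative input was replaced by the substitute value, and the `None` input left no trace because the state was rolled back.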

CSE 8377 Recovery Blocks
- Can view the progress of a sequential process as a sequence of basic operations
- Cannot check each basic operation: cost, difficulty in formulating checks
- Structured programs have blocks of code to simplify understanding of the functional description
- Choose blocks as the units for error detection and recovery

CSE 8377 RBS: Example
- Sorting program
  - Ensure A[j+1] ≥ A[j] for j = 1, 2, ..., n-1
  - by Sort A using quick sort
  - else by Sort A using insertion sort
  - else by Sort A using bubble sort
  - else ERROR
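The sorting example can be sketched as a small recovery-block driver. Everything below is an illustrative assumption rather than code from the course: the function names are hypothetical, the primary is a deliberately faulty quick sort (it drops duplicates of the pivot) so that the fallback path is exercised, and the acceptance test also checks that the result is a permutation of the input, which goes slightly beyond the ordering condition on the slide.

```python
from collections import Counter


class RecoveryBlockError(Exception):
    """Raised when every alternate fails the acceptance test."""


def acceptance_test(original, result):
    """Ordered (A[j+1] >= A[j]) and a permutation of the input."""
    ordered = all(result[j + 1] >= result[j] for j in range(len(result) - 1))
    return ordered and Counter(original) == Counter(result)


def faulty_quick_sort(a):
    """Hypothetical faulty primary: loses duplicates of the pivot."""
    if len(a) <= 1:
        return
    pivot = a[0]
    smaller = [x for x in a if x < pivot]
    larger = [x for x in a if x > pivot]
    faulty_quick_sort(smaller)
    faulty_quick_sort(larger)
    a[:] = smaller + [pivot] + larger


def insertion_sort(a):
    for i in range(1, len(a)):
        key, j = a[i], i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key


def bubble_sort(a):
    for i in range(len(a)):
        for j in range(len(a) - 1 - i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]


def recovery_block(a, alternates, test):
    """Try each alternate in order; restore the input after each failure."""
    checkpoint = list(a)                  # state saved on entry to the block
    for alternate in alternates:
        try:
            alternate(a)
            if test(checkpoint, a):
                return alternate.__name__
        except Exception:
            pass                          # a crash counts as a failed alternate
        a[:] = checkpoint                 # backward recovery before the next try
    raise RecoveryBlockError("all alternates failed the acceptance test")


data = [3, 1, 3, 2]
used = recovery_block(data, [faulty_quick_sort, insertion_sort, bubble_sort],
                      acceptance_test)
```

Here the primary returns `[1, 2, 3]`, the acceptance test rejects it, the list is restored, and the insertion sort alternate delivers the result.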

CSE 8377 Acceptance Test
- Function: ensure that the operation of the recovery block is satisfactory
- Should reference all variables accessible to the program, local and external
  - Same AT for all the alternates
  - Need not check absolute correctness: cost/complexity trade-off

CSE 8377 Alternates
- Primary alternate used more frequently
- Other alternates attempt less desirable operations
- Empty alternate: similar to forward error recovery?

CSE 8377 Restoration of States
- Keeping a copy of the entire system state is too costly
- Use recovery caches
- Usually recovery regions are constructed in a nested fashion
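One way to picture a recovery cache is as a stack of "prior value" tables, one per nested recovery region, that records a variable only the first time it is written inside a region. The class below is a minimal sketch under that assumption; `RecoveryCache` and its method names are hypothetical and not taken from any particular implementation.

```python
_MISSING = object()   # marks a variable that did not exist on region entry


class RecoveryCache:
    """Saves only the prior values of variables modified in each region."""

    def __init__(self, store):
        self.store = store      # the program state (a plain dict here)
        self.regions = []       # stack of {variable: prior value}

    def enter(self):
        """Open a (possibly nested) recovery region."""
        self.regions.append({})

    def write(self, key, value):
        """Write a variable, caching its prior value on first modification."""
        if self.regions and key not in self.regions[-1]:
            self.regions[-1][key] = self.store.get(key, _MISSING)
        self.store[key] = value

    def rollback(self):
        """Acceptance test failed: restore the state at region entry."""
        for key, old in self.regions.pop().items():
            if old is _MISSING:
                self.store.pop(key, None)
            else:
                self.store[key] = old

    def commit(self):
        """Acceptance test passed: hand prior values to the enclosing region."""
        saved = self.regions.pop()
        if self.regions:
            for key, old in saved.items():
                self.regions[-1].setdefault(key, old)


store = {"x": 1, "y": 2}
cache = RecoveryCache(store)
cache.enter()                 # outer region
cache.write("x", 10)
cache.enter()                 # inner region
cache.write("y", 20)
cache.write("z", 30)
cache.rollback()              # inner region fails: y restored, z removed
inner_failed = dict(store)
cache.enter()                 # second inner region
cache.write("y", 5)
cache.commit()                # inner region passes
cache.rollback()              # outer region fails: everything restored
```

Only three prior values are ever stored in this run, rather than a copy of the whole state at each region entry.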

CSE 8377 RBS Extensions
- Distributed execution of recovery blocks
- Consensus recovery blocks
- Retry blocks with data retry
- Self-Configuring Optimal programming

CSE 8377 DRBS
- Targeted for real-time applications
- Primarily for DCS and parallel computer systems
- Deals with both hardware and software faults
- Composed of two component technologies:
  - PSP (Pair of Self-checking Processing nodes)
  - Recovery Blocks

CSE 8377 DRBS
- Governing rule: the primary node tries to execute the primary try block whenever possible, whereas the shadow node tries to execute the alternate try block
  - Forward recovery (outside the block) is possible for both hardware and software faults
  - Minimal recovery time due to concurrency between the primary and shadow nodes
  - Minimal time overhead since the primary node does not wait for a status report from the shadow node
  - Flexibility in the ATs as in any RB scheme

CSE 8377 Design Parameters
- Mechanism to ensure input data consistency, so that the processors receive the same data
- Mechanisms to share AT results: the shadow node has to wait for the AT results from the primary node, whereas the primary does not wait for the AT results from the shadow node
- Mechanisms to reliably communicate the results
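The governing rule and the asymmetry in sharing AT results can be imitated on one machine with two threads standing in for the two nodes. This is a loose sketch under strong assumptions: real DRB nodes are separate processors with their own communication mechanisms, and the names `node`, `drb`, `first_element`, and `is_maximum` are hypothetical. Both nodes receive their own copy of the same input, and the shadow's result is consulted only after the primary's acceptance test has failed.

```python
from concurrent.futures import ThreadPoolExecutor


def node(try_block, acceptance_test, data):
    """Run one try block and apply the acceptance test to its result."""
    try:
        result = try_block(data)
        return acceptance_test(data, result), result
    except Exception:
        return False, None


def drb(primary_block, alternate_block, acceptance_test, data):
    """Primary and shadow run concurrently on identical input copies."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(node, primary_block, acceptance_test, list(data))
        shadow = pool.submit(node, alternate_block, acceptance_test, list(data))
        ok, result = primary.result()     # primary never inspects the shadow
        if ok:
            return "primary", result
        ok, result = shadow.result()      # shadow result used only on failure
        if ok:
            return "shadow", result
    # Note: leaving the `with` block joins both threads; a real DRB scheme
    # would not make the primary wait for the shadow in this way.
    raise RuntimeError("both nodes failed the acceptance test")


def first_element(data):
    """Hypothetical faulty primary try block for 'find the maximum'."""
    return data[0]


def is_maximum(data, result):
    """Acceptance test: result is an element and no element exceeds it."""
    return result in data and all(result >= x for x in data)


outcome = drb(first_element, max, is_maximum, [3, 9, 4])
```

Because the shadow has already been computing while the primary ran, its result is available for forward recovery as soon as the primary's acceptance test fails.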

CSE 8377 N-Version Programming
- Independent generation of n > 2 functionally equivalent programs from the same initial specification
- Comparison vectors (c-vectors) are generated by the programs at certain points
- The program variables to be included in the c-vectors and the cross-check points are specified in the initial specification
- Independent generation: programs developed by N different groups that do not interact

CSE 8377 Special Mechanisms
- Comparison vectors (c-vectors)
  - Data structure representing various program states
  - Contains two types of information
    (a) Comparison variables to be matched
    (b) Status flags, e.g. EOF, exception conditions, etc.
- Comparison status indicator
  - Indicates the action to be taken after matching
    (a) Continue
    (b) Terminate
    (c) Continue after modification

CSE 8377 Special Mechanisms (cont’d)
- Cross-check points
  - Points at which c-vectors are generated and used for matching
  - Actions depend on
    (a) whether all versions deliver results within time
    (b) whether the c-vectors agree or disagree
- Synchronization mechanisms
  - Used to synchronize the various versions
  - Signal the driver that a c-vector is ready (prevents matching before the c-vector is ready)
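A single cross-check point can be sketched as follows. The mapping of outcomes to the three status values is an interpretation made for this example, not a rule stated in the slides: unanimous agreement gives "continue", a strict majority gives "continue after modification" (the disagreeing version would be corrected with the majority state), and anything else gives "terminate". The three versions, one of which carries a seeded fault, and all names are hypothetical. Timing and synchronization are left out.

```python
from collections import Counter
from typing import NamedTuple


class CVector(NamedTuple):
    values: tuple   # comparison variables to be matched
    flags: tuple    # status flags, e.g. ("OK",) or an exception name


def mean_a(xs):
    return sum(xs) / len(xs)


def mean_b(xs):
    m = 0.0
    for i, x in enumerate(xs, start=1):
        m += (x - m) / i          # running mean
    return m


def mean_c(xs):
    return sum(xs) // len(xs)     # seeded fault: integer division


def make_cvector(version, data):
    """Run one version and package its state for the cross-check point."""
    try:
        return CVector((round(float(version(data)), 6),), ("OK",))
    except Exception as exc:
        return CVector((), (type(exc).__name__,))


def cross_check(cvectors):
    """Match the c-vectors and return (status indicator, agreed c-vector)."""
    winner, votes = Counter(cvectors).most_common(1)[0]
    if votes <= len(cvectors) // 2 or winner.flags != ("OK",):
        return "terminate", None
    if votes == len(cvectors):
        return "continue", winner
    return "continue after modification", winner


versions = (mean_a, mean_b, mean_c)
status, agreed = cross_check([make_cvector(v, [1, 2, 4]) for v in versions])
```

Rounding the comparison variable before matching is one simple way to keep harmless floating-point differences between versions from being counted as disagreement.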

CSE 8377 RBS vs. NVP
- In RBS, backward recovery through recovery caches
- In NVS (N-version software), forward recovery
- In RBS, if the error escapes the AT, no recovery action is initiated
- In NVS, if a majority of versions have the same fault, recovery will not be initiated
- In recovery blocks, production cost is low, since earlier versions of the software can be used as alternates
- Combination schemes are attractive

CSE 8377 Specification of member versions
- State the functional requirements completely and unambiguously, leaving the widest possible choice of implementation
- Inconsistencies, ambiguities, and omissions are likely to bias otherwise independent programming efforts
- Say only what to do; avoid the hows to eliminate dependencies

CSE 8377 Explicit Specification of Diversity
- Training, experience, and location of implementation personnel
- Application algorithms and data structures
- Programming languages
- Software development methods
- Programming tools and environments
- Testing methods and tools

CSE 8377 Modeling of fault-tolerant software
- Dependability, reliability, performability, timing, etc.
- Minimal tolerance requirements: single points of failure and multiple points of failure
- In RBS and NVS, correlated faults are critical factors

CSE 8377 N Self Checking Programming (NSCP)
Correspondence to hardware techniques:
  Hardware technique            Software technique
  Stand-by-and-sparing          RBS
  NMR                           NVS
  Active dynamic redundancy     NSCP

CSE 8377 NSCP
- One self-checking component is the acting component
- The other self-checking components are hot spares
- When the acting component fails, service is switched to one of the spares
- The variants may have different acceptance tests
  - Computation with different accuracy
  - Inverse functions
  - Exploit intermediate results, as in certification trails
- Input data consistency mechanism required
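The acting/hot-spare arrangement can be sketched with square-root variants that each carry their own check based on the inverse function (squaring the result). All names, tolerances, and the seeded fault are assumptions made for illustration. True hot spares would run in parallel; here every component is simply executed up front so that a spare's checked result is already available when the acting component fails.

```python
import math


class SelfCheckingComponent:
    """A variant bundled with its own acceptance test."""

    def __init__(self, name, variant, tolerance):
        self.name = name
        self.variant = variant
        self.tolerance = tolerance    # variants may check to different accuracy

    def run(self, x):
        """Return (passed, result); the check applies the inverse function."""
        try:
            result = self.variant(x)
            passed = abs(result * result - x) <= self.tolerance * max(1.0, x)
            return passed, result
        except Exception:
            return False, None


def halve(x):
    """Hypothetical faulty variant."""
    return x / 2


def newton_sqrt(x):
    guess = x if x > 1 else 1.0
    for _ in range(50):
        guess = 0.5 * (guess + x / guess)
    return guess


def nscp(components, x):
    """Deliver the acting component's result, switching to spares on failure."""
    outcomes = [(c.name, *c.run(x)) for c in components]   # all run "hot"
    for name, passed, result in outcomes:                   # acting first
        if passed:
            return name, result
    raise RuntimeError("every self-checking component failed")


components = [
    SelfCheckingComponent("acting", halve, 1e-9),
    SelfCheckingComponent("newton", newton_sqrt, 1e-9),
    SelfCheckingComponent("library", math.sqrt, 1e-6),
]
served_by, value = nscp(components, 16.0)
```

Unlike a recovery block, no rollback is needed on the switch: the spare has already computed its own result from the same input.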

CSE 8377 Application Layer Fault Tolerance
- Levels of fault tolerance:
  - Hardware
  - Operating systems
  - Application layer
    - A set of software components executing in the application layer of a computer system to detect and recover from faults that are not handled in the hardware and operating system layers
    - More efficient
    - More portable
    - Can be built as libraries and reusable software components

CSE 8377 Levels of A-layer Software Fault Tolerance
- Level 0: No tolerance to faults in the application layer
  - Manual restart may be necessary
  - The crash may leave inconsistent data
- Level 1: Automatic detection and restart
  - Detects the crash and restarts from the initial state
  - No intermediate checkpoints saved
  - Reprocessing may cause inconsistencies
  - Application availability higher than Level 0
- Level 2: Level 1 plus periodic checkpointing, logging, and recovery of internal state

CSE 8377 Levels of A-layer Software Fault Tolerance (cont’d)
- Level 3: Level 2 plus persistent data recovery
  - Persistent data replicated on a backup disk connected to the backup node
  - Consistency between the primary and backup storage maintained
- Level 4: Continuous operation without any interruption
  - The application layer cannot provide this guarantee; it needs replicated processing of the application in "hot" spares
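The step from Level 1 to Level 2 can be sketched with a supervisor loop that restarts a crashed task, where the task reloads its last checkpoint instead of starting from the initial state. This is a toy illustration under stated assumptions: the crash is simulated with an exception, the checkpoint is a JSON file, and the names `SimulatedCrash`, `save_checkpoint`, `run_once`, and `supervise` are hypothetical.

```python
import json
import os
import tempfile


class SimulatedCrash(Exception):
    """Stands in for an application crash."""


def save_checkpoint(path, state):
    """Write the checkpoint to a temporary file, then rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_once(items, path, crash_at):
    """Sum the items, resuming from the checkpoint if one exists."""
    state = {"next": 0, "total": 0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)              # Level 2: recover internal state
    for i in range(state["next"], len(items)):
        if i in crash_at:
            crash_at.discard(i)               # crash only once at this point
            raise SimulatedCrash(i)
        state["total"] += items[i]
        state["next"] = i + 1
        save_checkpoint(path, state)
    return state["total"]


def supervise(items, path, crash_at, max_restarts=3):
    """Level 1: detect the crash and restart automatically."""
    restarts = 0
    while True:
        try:
            return run_once(items, path, crash_at), restarts
        except SimulatedCrash:
            restarts += 1
            if restarts > max_restarts:
                raise


with tempfile.TemporaryDirectory() as workdir:
    total, restarts = supervise([1, 2, 3, 4],
                                os.path.join(workdir, "ckpt.json"), {2})
```

Without the checkpoint file the restarted run would reprocess items 0 and 1, which is the Level 1 behaviour whose reprocessing may cause inconsistencies.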
