Reliability and Fault Tolerance Setha Pan-ngum

Introduction From a survey by the American Society for Quality Control [1], the ten most important product attributes:

  Attribute                           Ave. score   Attribute           Ave. score
  Performance                         9.5          Ease of use         8.3
  Lasts a long time (reliability)     9.0          Appearance          7.7
  Service                             8.9          Brand name          6.3
  Easily repaired (maintainability)   8.8          Packaging/display   5.8
  Warranty                            8.4          Latest model        5.4

Introduction Major embedded-system requirements:
–Low failure rate, which leads to fault-tolerant design
–Graceful degradation

Failures, errors, faults Fault – a defect that causes malfunction:
–Hardware fault, e.g. a broken wire or stuck logic
–Software fault, e.g. a bug
Error – an unintended state caused by a fault, e.g. a software bug leads to a wrong calculation  wrong output
Failure – an error leads to system failure (the system operates differently from what was intended)

Causes of Failures
–Errors in specification or design
–Component defects
–Environmental effects

Errors in specification or design Probably the hardest to detect. Embedded-system development proceeds through:
–Specification
–Design
–Implementation
If the specification is wrong, every following step will be wrong, e.g. the unit-incompatibility rocket example.

Component defects Depend on the device. Electronic components can have defects from manufacturing and from wear and tear.

Operating environment Stresses:
–Temperature
–Moisture
–Vibration

Classification of failures Nature:
–Value – incorrect output
–Timing – correct output, but too late
Perception – as seen by users:
–Consistent – all users see the same result, e.g. a sensor reading stuck at ‘0’
–Inconsistent – users see different results, e.g. a floating sensor reading (say between 1–3 V, which could be read as ‘1’ or ‘0’). Inconsistent failures are also called malicious or Byzantine failures.

Classification of failures Effect:
–Benign – not serious, e.g. a broken TV
–Malign – serious, e.g. a plane crash
Oftenness:
–Permanent – broken equipment
–Transient – a loose wire, or processors under stress (EMI, power supply, radiation)
Transient failures occur far more often!

Example of transient failures From a report on the fire-control radar of F-16 fighters [3]:
–Pilots noticed a malfunction every 6 hrs
–Pilots requested maintenance every 31 hrs
–1/3 of the requests could be reproduced in the workshop
–Overall, fewer than 10% of the transient failures could be reproduced!

Types of errors Transient:
–Occur regularly, e.g. an electrical glitch causes a temporary value error
Permanent:
–E.g. a value corrupted by a transient fault is stored in a database, making the error permanent

Classification of faults Nature:
–By chance – a broken wire
–Intentional – a virus
Perception:
–Physical
–Design
Boundary:
–Internal – component breakdown
–External – faults caused by EMI

Classification of faults Origin:
–Development, e.g. in a program or device
–Operation, e.g. a user entering wrong input
Persistence:
–Transient – glitches caused by lightning
–Permanent – faults that need repair

Definitions Reliability R(t):
–The probability that a system performs its intended function in the specified environment up to time t.
Maintainability M(t):
–The probability that a system can be restored within t time units after a failure.
Availability A(t):
–The probability that the system is available to perform the specified service at time t (the fraction of time the system is working).

Reliability [4] > R(0) = 1, R(∞) = 0 > Failure density: f(t) = -dR(t)/dt > Failure rate: λ(t) = f(t)/R(t); λ(t)·dt is the conditional probability that the system fails in the interval dt, given that it was operational at the beginning of that interval > When λ(t) = λ is constant, R(t) = e^(-λt) and 1/λ = MTTF (Mean Time To Failure)
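A minimal numeric sketch of these formulas in Python (the failure-rate value is illustrative, not from the slides):

```python
import math

def reliability(t, lam):
    """R(t) = exp(-lambda * t), valid for a constant failure rate lambda."""
    return math.exp(-lam * t)

lam = 1e-4                      # illustrative: 1 failure per 10,000 hours
mttf = 1.0 / lam                # MTTF = 1/lambda when lambda is constant
print(reliability(5000, lam))   # ~0.607: probability of surviving 5,000 h
print(mttf)                     # 10000.0 hours
```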

Failure rate λ(t)
[Figure: the “bathtub curve” of λ(t) over time – early failures during burn-in, a long period of constant failure rate, and late failures during wear-out]

Failure rate vs. cost [4]
[Figure: λ(t) plotted against system cost]
US Air Force: within a given technology, the failure rate of electronic systems increases with increasing system cost.

Maintainability > Measured by the repair rate μ(t) > When μ(t) = μ is constant, M(t) = 1 - e^(-μt) and 1/μ = MTTR (Mean Time To Repair) > Preventive maintenance:
–If λ increases over time, it makes sense to replace the aging unit.
–If λ of different units evolves differently, preventive maintenance consists in replacing the “Smallest Replaceable Units” (SRUs) with growing λ.
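A quick numeric sketch of the constant-repair-rate case, using the corrected formula M(t) = 1 - e^(-μt) (the repair-rate value is illustrative):

```python
import math

mu = 0.5                          # illustrative repair rate: 0.5 repairs/hour
mttr = 1.0 / mu                   # MTTR = 1/mu = 2 hours
m_4h = 1.0 - math.exp(-mu * 4.0)  # probability of being repaired within 4 h
print(mttr, round(m_4h, 3))       # 2.0 0.865
```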

Reliability vs. Maintainability > Reliability and maintainability are, to a certain extent, conflicting goals. > Example: connectors

                    Plug   Solder
  Reliability       bad    good
  Maintainability   good   bad

> Inside an SRU, reliability must be optimized > Between SRUs, maintainability is important

Availability > A = MTTF / (MTTF + MTTR) > Good availability can be achieved either
–by a high MTTF, or
–by a small MTTR
A high system MTTF can be achieved by means of fault tolerance: the system continues to operate properly even when some components have failed. Fault tolerance also reduces the MTTR requirements.
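A worked example of the availability formula (the MTTF and MTTR figures are illustrative):

```python
mttf = 10_000.0                       # mean time to failure, hours
mttr = 2.0                            # mean time to repair, hours
availability = mttf / (mttf + mttr)   # A = MTTF / (MTTF + MTTR)
print(f"A = {availability:.5f}")      # A = 0.99980
```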

Fault tolerance Obtained through REDUNDANCY (more resources assigned to a task than strictly required). Redundancy can be used for:
–Fault detection
–Fault correction
It can be implemented at various levels:
–at component level
–at processor level
–at system level

Redundancy at component level Error detection/correction in memories:
–Error detection by a parity bit
–Error correction by multiple parity bits
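A minimal sketch of single-bit (even) parity detection; the function name and data are illustrative:

```python
def parity(bits):
    """Even-parity bit: 1 if the number of 1-bits is odd, else 0."""
    return sum(bits) % 2

word = [1, 0, 1, 1, 0, 0, 1, 0]
stored = parity(word)        # parity bit written alongside the data word
word[3] ^= 1                 # inject a single-bit fault
detected = parity(word) != stored
print(detected)              # True: the flipped bit is detected (not located)
```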

Redundancy at component level Stripe sets with parity (RAID)
[Figure: three disks; each block on one disk is the XOR of the corresponding blocks on the two other disks]
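A sketch of the XOR-parity idea behind the figure; short byte strings stand in for disk blocks:

```python
def xor_blocks(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

disk1 = b"\x12\x34\x56\x78"
disk2 = b"\x9a\xbc\xde\xf0"
disk3 = xor_blocks(disk1, disk2)      # parity disk: XOR of the other two

# If disk1 fails, its contents can be rebuilt from the survivors:
recovered = xor_blocks(disk3, disk2)
assert recovered == disk1
```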

Redundancy at component level Error detection in an ALU
[Figure: the ALU result is checked by a “proof by 9” (casting out nines) unit; a mismatch signals an error]
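A sketch of the “proof by 9” (casting out nines) check suggested by the figure; the numbers are illustrative:

```python
def checked_add(a, b, alu_result):
    """Flag an ALU fault when the mod-9 residues disagree.
    Note: errors that shift the result by a multiple of 9 go undetected."""
    return alu_result % 9 == (a % 9 + b % 9) % 9

print(checked_add(1234, 5678, 6912))   # True: correct sum passes the check
print(checked_add(1234, 5678, 6812))   # False: faulty result is flagged
```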

Redundancy in components Error detection:
–to correct transient errors by retry
–to avoid using corrupted data
Error correction:
–to correct transient errors on the fly
–to remain operational after a catastrophic component failure
–scheduled maintenance instead of urgent repair

Fault detection at processor level
[Figure: CPU 1 and CPU 2 execute the same task; a comparator signals an error when their outputs differ]

Fault correction at processor level
[Figure: CPU 1, CPU 2 and CPU 3 feed voting logic, which forwards the majority result]
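A minimal 2-out-of-3 majority-voter sketch matching the figure:

```python
def vote(a, b, c):
    """Return the majority of three replica outputs; None if all differ."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None   # triple disagreement: detectable but not correctable

print(vote(42, 42, 41))   # 42: the single faulty CPU is outvoted
```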

Replica Determinism A set of replicated RT objects is “replica determinate” if all objects of this set visit the same states at about the same time. “At about the same time” makes a concession to the finite precision of clock synchronization. Replica determinism is needed for:
–consistent distributed actions
–fault tolerance by active redundancy

Replica Determinism Lack of replica determinism makes voting meaningless. Example: an airplane on takeoff. Lack of replica determinism causes the faulty channel to win!

  System 1:  Take off  →  Accelerate engine
  System 2:  Abort     →  Stop engine
  System 3:  (fault)   →  Stop engine
  Majority:                Stop engine
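A sketch of why voting fails here: the two healthy channels sample at slightly different instants and land on opposite sides of the decision threshold, so the faulty channel decides the majority (threshold and values are illustrative):

```python
from collections import Counter

THRESHOLD = 100.0                      # illustrative takeoff/abort threshold

def decide(speed):
    return "accelerate" if speed >= THRESHOLD else "stop engine"

votes = [decide(100.2),    # healthy channel, sampled just above threshold
         decide(99.8),     # healthy channel, sampled just below threshold
         "stop engine"]    # faulty channel
majority, _ = Counter(votes).most_common(1)[0]
print(majority)            # "stop engine": the faulty channel wins the vote
```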

Fault correction at system level: hot stand-by
[Figure: SYSTEM 1 and SYSTEM 2 run in parallel; on error detection the output is switched to the stand-by system]
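A minimal hot-standby failover sketch based on a heartbeat timeout; all names and values here are illustrative assumptions, not from the slides:

```python
import time

def standby_monitor(primary_alive, take_over, timeout=0.5):
    """Promote the hot stand-by once the primary's heartbeat stops.
    A hot stand-by already mirrors the primary's state, so take_over()
    can resume service immediately."""
    last_beat = time.monotonic()
    while True:
        if primary_alive():                           # heartbeat seen
            last_beat = time.monotonic()
        elif time.monotonic() - last_beat > timeout:  # error detected
            take_over()
            return
        time.sleep(0.05)
```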

Fault correction at system level: cold stand-by
[Figure: SYSTEM 1 runs alone; on error detection SYSTEM 2 is started and recovers state from common memory]

Fault correction at system level: distributed common memory
[Figure: SYSTEM 1 and SYSTEM 2 with error detection and a distributed common memory]
In fact, each processor has access to the memory of the other, to keep a copy of the state of all critical processes.

Fault correction at system level: load sharing
[Figure: four identical systems share the workload through common memory]

Safety-critical systems
[Figure: SYS 1 to SYS 4 feed voting logic]
Fail once: still operational; fail twice: still safe.

Safety-critical systems But what happens in case of a software bug?

Space Shuttle computer system
[Figure: SYS 1 to SYS 5 feed voting logic]
(The fifth computer ran independently developed backup software, guarding against a common software bug.)

References 1. Ebeling C., An Introduction to Reliability and Maintainability Engineering, McGraw-Hill. 2. Krishna C., Real-Time Systems, McGraw-Hill. 3. Kopetz H., Real-Time Systems: Design Principles for Distributed Embedded Applications, Kluwer. 4. Tiberghien J., Real-Time System Fault Tolerance, lecture slides.