Fault Tolerant Computing


Acknowledgements
The following lectures are based on materials from the following sources:
– S. Kulkarni
– J. Rushby
– J. Knight

Objectives
Exposure to the area of critical systems
What it means to have a fault-tolerant system
Specification techniques for representing critical properties
How to design fault tolerance into a system

Reliability and Recovery
Reliability: the probability that a system has not failed by time t, given that it was operating properly at time 0.
Recovery: the process of restoring consistency after a failure.

Dependability
Dependability:
– How much one may rely on the quality of the services delivered
– Quality of service depends on:
o Correctness
o Continuity of service

Terms
Failure: malfunction
Fault: condition that might lead to failure
Error: an incorrect response that indicates a fault is present
Faults may be:
o permanent
o intermittent
o transient

Terms (cont’d)
Graceful degradation: the system is operational, but degraded, after faults
Fail-safe: system execution is safe after the fault
Stabilizing: the system recovers to a consistent state after the fault
Masking: the user of the system does not see any unintended behavior due to faults

Terms (cont’d)
Mean Time to Failure (MTTF)
– expected value of system failure time
Mean Time to Repair (MTTR)
– expected value of system repair time
Mean Time Between Failures (MTBF)
– expected time between successive failures
– MTBF = MTTF + MTTR
Fault Tolerance
– ability to continue operation after occurrence of faults
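The relation MTBF = MTTF + MTTR can be checked with a few lines of arithmetic. The sketch below is illustrative only: the figures are made up, and the steady-state availability ratio MTTF / (MTTF + MTTR) is a standard textbook quantity that the slides do not state.

```python
def mtbf(mttf_hours: float, mttr_hours: float) -> float:
    """Mean time between failures, using the slide's relation MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time the system is in service (not from the slides)."""
    return mttf_hours / mtbf(mttf_hours, mttr_hours)


# Hypothetical figures: the system runs 990 hours on average before failing
# and takes 10 hours on average to repair.
example_mttf = 990.0
example_mttr = 10.0

example_mtbf = mtbf(example_mttf, example_mttr)
example_availability = steady_state_availability(example_mttf, example_mttr)
print(example_mtbf, example_availability)
```

With these assumed numbers, one failure is expected every 1000 hours and the system is in service about 99% of the time.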

Design Decisions
Fault detection
Fault confinement
Fault diagnosis
Repair and/or reconfigure
Redundancy
– Hardware: extra hardware
– Information: redundancy bits
– Software: diagnosis software, extra software
– Temporal: re-execute software to recover from intermittent faults
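Temporal redundancy, the last item above, can be sketched as a retry loop. This is a minimal illustration, not a technique taken from the lectures: the names `run_with_temporal_redundancy` and `make_flaky_operation` are hypothetical, and an intermittent fault is modelled as a `RuntimeError`.

```python
def run_with_temporal_redundancy(operation, max_attempts=3):
    """Re-execute `operation` until it succeeds or the attempts run out.

    Returns (result, attempts_used). Raises RuntimeError if every attempt fails.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except RuntimeError as error:  # assumed stand-in for an intermittent fault
            last_error = error
    raise RuntimeError(f"still failing after {max_attempts} attempts") from last_error


def make_flaky_operation(failures_before_success):
    """Build an operation that fails a fixed number of times, then succeeds."""
    remaining = {"count": failures_before_success}

    def operation():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise RuntimeError("intermittent fault")
        return "ok"

    return operation


result, attempts_used = run_with_temporal_redundancy(make_flaky_operation(2))
print(result, attempts_used)
```

Re-execution only helps when the fault eventually goes away; a permanent fault exhausts the attempts and is reported to the caller.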

Safety vs Reliability
Reliability: concerns occurrence of failures
– System failures defined in terms of system services
Safety: concerns occurrence of accidents
– Unplanned events that result in death, injury, illness, damage, loss of property or environmental harm
– Defined in terms of external consequences

Types of Faults
Omission failure
– server omits to respond to an input (fail-silent failure)
Timing failure
– response is functionally correct, but untimely
– can be an early timing failure or a late timing failure (performance failure)
Response failure
– incorrect response
– if the output value is incorrect (value failure)
– if the state transition is incorrect (state transition failure)

Types of Faults (cont’d)
Crash failure
– if, after a first omission, a server omits to produce output until it restarts
Amnesia crash
– server restarts in a predefined initial state that does not depend on the inputs seen before the crash
Partial amnesia crash
– some part of the state is the same as before the crash; the rest is in a predefined initial state
Pause crash
– server restarts in the state it had before the crash
Halting crash
– crashed server never restarts
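The four crash types differ only in the state a server has when it restarts. The sketch below makes that concrete under some assumptions: server state is a dictionary, `INITIAL_STATE` is a made-up predefined initial state, and `state_after_restart` is a hypothetical helper.

```python
from enum import Enum


class CrashKind(Enum):
    AMNESIA = "amnesia"
    PARTIAL_AMNESIA = "partial amnesia"
    PAUSE = "pause"
    HALTING = "halting"


# Hypothetical predefined initial state of the server.
INITIAL_STATE = {"session": None, "counter": 0}


def state_after_restart(kind, state_before_crash, preserved_keys=()):
    """Return the state the server restarts in, or None if it never restarts."""
    if kind is CrashKind.HALTING:
        return None  # a halted server never restarts
    if kind is CrashKind.PAUSE:
        return dict(state_before_crash)  # same state as before the crash
    restarted = dict(INITIAL_STATE)  # amnesia: predefined initial state
    if kind is CrashKind.PARTIAL_AMNESIA:
        for key in preserved_keys:  # only this part of the state survives
            restarted[key] = state_before_crash[key]
    return restarted


before = {"session": "alice", "counter": 7}
print(state_after_restart(CrashKind.AMNESIA, before))
print(state_after_restart(CrashKind.PARTIAL_AMNESIA, before, preserved_keys=("counter",)))
print(state_after_restart(CrashKind.PAUSE, before))
print(state_after_restart(CrashKind.HALTING, before))
```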

Examples
OS crash followed by reboot in the initial state
Database server crash followed by recovery of a database state that reflects all transactions before the crash
Communication server that occasionally loses messages but does not delay messages (omission failure)
Excessive message transmission or message processing delay (communication performance failure)
Alteration of a message due to random noise during transmission (response failure)

Hierarchical Failure Masking
A failure of a certain type at a lower level can propagate as a different kind of failure at a higher level of abstraction.
– A value error at the physical layer (e.g., 2 bits corrupted) propagates as an omission error at the data link layer.
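The physical-layer example can be sketched with a checksum: the receiving layer detects the corrupted frame and drops it, so the layer above sees a missing frame instead of a wrong value. This is an illustrative sketch only. The frame layout (payload followed by a 4-byte CRC-32) and the function names are assumptions, not a real link-layer protocol.

```python
import zlib


def send_frame(payload: bytes) -> bytes:
    """Append a CRC-32 checksum to the payload (assumed frame layout)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def corrupt_bits(frame: bytes, bit_positions) -> bytes:
    """Model a value error at the physical layer by flipping the given bits."""
    damaged = bytearray(frame)
    for position in bit_positions:
        damaged[position // 8] ^= 1 << (position % 8)
    return bytes(damaged)


def receive_frame(frame: bytes):
    """Return the payload, or None if the checksum fails and the frame is dropped."""
    payload, checksum = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != checksum:
        return None  # the higher layer observes an omission, not a bad value
    return payload


clean = send_frame(b"hello")
damaged = corrupt_bits(clean, [0, 9])  # two corrupted bits
print(receive_frame(clean))
print(receive_frame(damaged))
```

The design choice is that the lower layer converts a failure it cannot repair into one the layer above already knows how to handle, for example by retransmission.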

Group Failure Masking
To ensure a service remains available to clients despite server failure, one can implement a group of redundant, physically independent servers.
The group masks the failure of a member.
Hierarchical masking requires:
– users to implement resource failure-masking attempts as exception handling code.
In group masking:
– individual member failures are entirely hidden from users by group management mechanisms.

Group Failure Masking (cont’d)
Group output is a function of the outputs of individual group members:
– fastest member
– distinguished member
– result of majority vote
A server group able to mask any k concurrent member failures is termed k-fault tolerant
– e.g., a primary/standby group of k servers with members ranked as primary, 1st backup, 2nd backup, ..., can mask k-1 failures.
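Two of the group output functions above can be sketched in a few lines. This is a minimal illustration under stated assumptions: member outputs are plain Python values, a failed member is represented by `None`, and the function names are hypothetical.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of members, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None


def primary_standby_output(ranked_outputs):
    """Return the output of the highest-ranked member that is still alive.

    `ranked_outputs` lists the primary first, then the backups in rank order;
    None marks a failed member (an assumption of this sketch).
    """
    for output in ranked_outputs:
        if output is not None:
            return output
    return None  # all k members failed, so the failure is not masked


# One faulty value among three members is outvoted.
voted = majority_vote([42, 42, 7])
# A group of k = 3 servers still answers after k - 1 = 2 member failures.
survivor = primary_standby_output([None, None, "reply"])
print(voted, survivor)
```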

Some Formalism: Programs
A program consists of:
– a finite set of variables
– a finite set of actions, each of the form guard => statement, where guard is a boolean expression over program variables and statement updates program variables
Modifications
– guards may contain receives from channels
– statements may contain sends/receives

Computation
A program computation is a "fair" sequence of steps, where in each step an action whose guard is true has its statement executed.
– In one step, multiple guards may be true.
– If the guard of some action is true continuously, then that action will eventually be chosen for execution.
Notes
– A program computation is a sequence of states.
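A program of guarded actions and a fair computation of it can be sketched as follows. The representation is an assumption of this sketch: a state is a dictionary, an action is a (guard, statement) pair of functions, and fairness is obtained by round-robin scheduling, which is one simple way to guarantee that a continuously enabled action is eventually chosen.

```python
def run_fairly(initial_state, actions, max_steps):
    """Execute guarded actions round-robin and return the sequence of states."""
    state = dict(initial_state)
    computation = [dict(state)]
    next_index = 0
    for _ in range(max_steps):
        for offset in range(len(actions)):
            index = (next_index + offset) % len(actions)
            guard, statement = actions[index]
            if guard(state):
                statement(state)  # one step: execute one enabled action
                computation.append(dict(state))
                next_index = (index + 1) % len(actions)
                break
        else:
            break  # no guard is true, so the computation ends here
    return computation


# Hypothetical two-variable program:
#   x < 3 => x = x + 1
#   y < 2 => y = y + 1
example_actions = [
    (lambda s: s["x"] < 3, lambda s: s.update(x=s["x"] + 1)),
    (lambda s: s["y"] < 2, lambda s: s.update(y=s["y"] + 1)),
]

example_computation = run_fairly({"x": 0, "y": 0}, example_actions, max_steps=10)
print(example_computation)
```

The returned list is the computation in the sense of the note above: a sequence of states, starting with the initial one.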

Specification
A specification is a set of sequences of states.
What does it mean for a program p to satisfy a specification sp from a set of states S?
– Every computation of p that starts from a state in S is in sp.

Examples of specifications
Let S be a predicate.
– Invariant: invariant(S) = {seq: S is true in each state of seq}
A sequence seq is in invariant(S) iff S is true in each state in seq.
– Closure: closed(S) = {seq: (Ai: i >= 0: ‘S is true in the ith state of seq’ => ‘S is true in the (i+1)th state of seq’)}
If S ever becomes true, it continues to be true.
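Both definitions can be checked mechanically on a finite sequence of states. The sketch below is only an approximation, since the slide's sequences may be infinite; the function names and the integer-valued example states are assumptions.

```python
def satisfies_invariant(predicate, seq):
    """True iff the predicate holds in every state of the finite sequence."""
    return all(predicate(state) for state in seq)


def satisfies_closed(predicate, seq):
    """True iff, whenever the predicate holds in state i, it holds in state i+1."""
    return all(
        predicate(seq[i + 1])
        for i in range(len(seq) - 1)
        if predicate(seq[i])
    )


def at_least_two(state):
    """Example predicate S over integer states (hypothetical)."""
    return state >= 2


growing = [0, 1, 2, 3]   # S becomes true and stays true
dipping = [0, 2, 1]      # S becomes true, then false again

print(satisfies_invariant(at_least_two, growing))  # S is false in the first states
print(satisfies_closed(at_least_two, growing))
print(satisfies_closed(at_least_two, dipping))
```

The `growing` sequence shows the difference between the two: it is in closed(S) but not in invariant(S), because S does not hold initially.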

Examples of specifications (cont’d)
Let R and S be predicates.
– leads-to: R leads-to S = {seq: (Ai: i >= 0: ‘R is true in the ith state of seq’ => (Ek: k >= i: ‘S is true in the kth state of seq’))}
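The leads-to definition translates directly into a check over a finite sequence. As a hedged sketch, it treats the end of the finite sequence as the end of the computation, so a pending R at the end counts as a violation; the state layout with `req` and `cs` flags is a made-up example.

```python
def satisfies_leads_to(r, s, seq):
    """True iff every state where r holds is followed (at index k >= i) by a state where s holds."""
    return all(
        any(s(seq[k]) for k in range(i, len(seq)))
        for i in range(len(seq))
        if r(seq[i])
    )


def requesting(state):
    return state["req"]


def in_critical_section(state):
    return state["cs"]


# Hypothetical trace of one process: it requests, waits, then enters.
served = [
    {"req": True, "cs": False},
    {"req": True, "cs": False},
    {"req": False, "cs": True},
]
starved = served[:2]  # the request is never granted

print(satisfies_leads_to(requesting, in_critical_section, served))
print(satisfies_leads_to(requesting, in_critical_section, starved))
```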

Examples of specifications (cont’d)
Mutual Exclusion
– invariant((j <> k) => ~(cs.j /\ cs.k))
– (Aj:: (req.j leads-to cs.j))
Leader Election
– invariant((j <> k) => ~(leader.j /\ leader.k))
– true leads-to (Ej:: leader.j)
Load Balancing
– true leads-to (Aj,k:: |load.j - load.k| <= bound)

Safety and Liveness
Safety specification
– A sequence "does nothing bad"
– No sequence has a bad prefix
Let sp be a specification.
– sp is a safety specification iff (As:: s ~element_of sp => (Ea: a is a prefix of s: (Ab:: ab ~element_of sp)))

Liveness Specification
Liveness specification
– A sequence "does something good"
– Every finite prefix has a good extension
Let sp be a specification.
– sp is a liveness specification iff (Aa:: (Eb:: ab element_of sp))

Faults
A fault is an action that can change the program state.
All faults (be they crash, fail-stop, omission, corruption, timing, Byzantine, intruders, or ...) can thus be viewed as perturbations on the system.

Faults (cont’d)
A program computation in the presence of faults is a sequence of steps where
– in each step either a program action executes or a fault action executes
– the program actions are fairly executed
– the fault occurrences are finite

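Representation of Faults
Communication faults
– Let c denote the sequence of messages on a channel.
– Let m1 and m2 be messages, and let seq be a sequence of messages.
Each communication fault is an action that rewrites c:
– Message loss: a message m1 is removed from c.
– Message duplication: a message m1 in c appears twice.
– Message reorder: two messages m1 and m2 in c change places.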
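The three communication faults can be written as functions that rewrite the channel contents. This is a sketch under assumptions: the channel c is modelled as a Python list, the faulty position is passed in as an index, and the exact notation used on the original slide is not recoverable from this transcript.

```python
def lose_message(channel, index):
    """Message loss: the message at `index` disappears from the channel."""
    return channel[:index] + channel[index + 1:]


def duplicate_message(channel, index):
    """Message duplication: the message at `index` appears twice in a row."""
    return channel[:index + 1] + channel[index:]


def reorder_messages(channel, index):
    """Message reorder: the messages at `index` and `index + 1` swap places."""
    return (
        channel[:index]
        + [channel[index + 1], channel[index]]
        + channel[index + 2:]
    )


# Hypothetical channel contents.
c = ["a", "m1", "m2", "z"]

print(lose_message(c, 1))
print(duplicate_message(c, 1))
print(reorder_messages(c, 1))
```

Each function returns a new list and leaves `c` untouched, which keeps the before and after states of the channel available for comparison.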

Representation of Faults (cont’d)
Amnesia/transient faults
– Let c denote all the variables of a process.
– true => c = ??

Representation of Permanent Faults
Fail-stop fault:
– Upon fail-stop, a process does nothing;
– it does not execute any action and
– it does not send any messages.
Introduce an auxiliary variable up.j at process j.
Add up.j to the guard of each action of j.
– If processes can detect the failure of other processes, then they can do so using the variable up.
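The construction above, adding up.j to every guard of process j, can be sketched with actions represented as (guard, statement) pairs of functions. The representation, the helper names, and the explicit fault action that sets up.j to false are assumptions of this sketch; the slide only names the auxiliary variable and the guard change.

```python
def add_fail_stop(actions, process_id):
    """Strengthen each guard of the process with its auxiliary variable up.j."""
    def strengthened(guard):
        return lambda state: state["up"][process_id] and guard(state)

    return [(strengthened(guard), statement) for guard, statement in actions]


def fail_stop_fault(process_id):
    """Fault action (assumed form): up.j => up.j = false."""
    def guard(state):
        return state["up"][process_id]

    def statement(state):
        state["up"][process_id] = False

    return guard, statement


def increment(state):
    state["count"] += 1


# Hypothetical process j with one action: count < 5 => count = count + 1
original_actions = [(lambda state: state["count"] < 5, increment)]
tolerant_actions = add_fail_stop(original_actions, "j")

state = {"up": {"j": True}, "count": 0}
action_guard, action_statement = tolerant_actions[0]
enabled_before_fault = action_guard(state)

fault_guard, fault_statement = fail_stop_fault("j")
fault_statement(state)  # process j fail-stops
enabled_after_fault = action_guard(state)
print(enabled_before_fault, enabled_after_fault)
```

Once up.j is false, no guard of j can be true, so the process executes no action and sends no message, which matches the fail-stop description.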

Representation of Permanent Faults (cont’d)
Byzantine faults:
– Introduce an auxiliary variable b.j at process j
– Add these actions as faults:
~b.j => b.j = true
b.j => state.j = ??

Goal of Fault-tolerance Design
Starting from some initial states, S:
– If the program executes alone, then the original specification, sp, is satisfied.
– If the program executes in the presence of faults, then the fault-tolerance specification, sp', is satisfied.
The fault-tolerance specification depends upon the type of the desired fault-tolerance, e.g.,
– for masking, sp' = sp
– for fail-safe, sp' = the safety specification of sp

Representation of Permanent Faults
Fault-tolerant systems are rarely designed from scratch!
– One needs to modify a fault-intolerant system to add fault-tolerance.
– Need to reuse the fault-intolerant program.
Fault-tolerant systems need to be modified to deal with new faults.
– Need for incremental design
Several activities need to be performed while developing fault-tolerant systems.
– manual or automated design, testing, verification, synthesis, ...
– desirable to have a unified framework that allows these activities to be performed.

Overall Design
[Figure: the design activities shown are Manual Design, Testing, Automated Synthesis, Theorem Proving, Model Checking and Refinement]

Overall Design (cont’d)
Should separate the concerns of functionality and fault-tolerance.
– Should use components that are responsible for fault-tolerance alone.
Should provide structural continuity while performing these tasks.
– Should be able to use the same components while performing the above tasks.