Failure Mode Assumptions and Assumption Coverage David Powell.

Fault Tolerance: Key Questions
– How may components fail? → Prevention strategies
– At what rate may they fail? → The amount of redundancy needed
– What are the important types of faults? → Types of redundancy needed
– What is the relation between dependability, redundancy, and faults? → General FT design guidelines

An FT Paradox/Dilemma
– More faults → more redundancy
– More redundancy → more possibility of faults
– More possibility of faults → ???

Solution: Some Key Steps
– Classify, quantify, and verify the assumptions

Types of Failures

Overview
– Single-user service: service model, potential errors
– Multiple-user service: service model, potential errors

Single-user Service Model
– Service items: $s_i$, $i = 1, 2, \ldots$
– Value of $s_i$: $vs_i$
– Observation time of $s_i$: $ts_i$
– Service model: $s_i = \langle vs_i, ts_i \rangle$
– Observed by an omniscient observer

Correctness Model
– Service item $s_i$ is correct iff $(vs_i \in SV_i) \wedge (ts_i \in ST_i)$
– $SV_i$ and $ST_i$ are respectively the specified sets of values and times for service item $s_i$

Potential Errors
– Arbitrary value error: $vs_i \notin SV_i$
– Noncode value error: $vs_i \notin CV$ ($CV$ defines a code)
– Arbitrary timing error: $ts_i \notin ST_i$
– Early timing error: $ts_i < \min(ST_i)$
– Late timing error: $ts_i > \max(ST_i)$
– Omission error: $ts_i = \infty$
– Impromptu error: (vs_i = ) (ts_i = )
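The error definitions above can be sketched as a small classifier. This is only an illustration, not code from the paper: the function names, the label strings, and the choice to model $ST_i$ as a closed interval `(t_min, t_max)` are all assumptions made here. The impromptu error is left out.

```python
import math

def is_correct(vs, ts, sv, st):
    """Correct iff the value is in SV_i and the time is in ST_i."""
    t_min, t_max = st
    return vs in sv and t_min <= ts <= t_max

def classify_item(vs, ts, sv, st, cv):
    """Return the set of error labels that apply to one item <vs, ts>.

    sv: specified value set SV_i (hypothetical input)
    st: specified time window ST_i as (t_min, t_max), an assumed encoding
    cv: set of valid code words CV (hypothetical input)
    """
    t_min, t_max = st
    errors = set()
    if vs not in sv:
        errors.add("arbitrary value")
    if vs not in cv:
        errors.add("noncode value")
    if not (t_min <= ts <= t_max):
        errors.add("arbitrary timing")
    if ts < t_min:
        errors.add("early timing")
    if ts > t_max:
        errors.add("late timing")
    if ts == math.inf:  # the item is never delivered
        errors.add("omission")
    return errors
```

Note that the labels nest: an omission is also a late timing error, and every early or late error is also an arbitrary timing error.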

Multi-user Service Model
– Service item: $s_i = \{s_i(1), s_i(2), \ldots, s_i(n)\}$
– Service model: $s_i(u) = \langle vs_i(u), ts_i(u) \rangle$, for all $i, u$
– New issue: "consistency"

Correctness Model
– $vs_i(u)$: the value of service item $i$ on process $u$
– $vs_i$: the value of service item $i$
– $SV_i$: the set of specified values for service item $i$
– $ts_i(u)$: the observation time of service item $i$ on process $u$
– $ST_i(u)$: the range of specified observation times of service item $i$ on process $u$
– uv: the time bound of related occurrences

Examples of Potential Errors
– Consistent value error
– Consistent timing error
– Semi-consistent value error

Failure Mode Assumptions
– An attempt to formalize the concept of an assumed failure mode
– Done by assertions on the sequences of service items delivered by a component

Examples of Value Error Assertions
– No value errors occur ($V_{none}$): $\forall i,\ vs_i \in SV_i$
– The only value errors that occur are noncode value errors ($V_n$): $\forall i,\ (vs_i \notin SV_i) \Rightarrow (vs_i \notin CV)$
– Arbitrary value errors can occur ($V_{arb}$): $\forall i,\ (vs_i \in SV_i) \vee (vs_i \notin SV_i)$
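As a rough illustration of how these assertions constrain a delivered sequence, the sketch below checks each one against a list of values. The helper names and the sample data are hypothetical; only the three predicates follow the slide.

```python
def holds_v_none(values, specified):
    """V_none: every delivered value is in its specified set SV_i."""
    return all(v in sv for v, sv in zip(values, specified))

def holds_v_n(values, specified, code):
    """V_n: any value outside SV_i is also outside the code CV."""
    return all(v in sv or v not in code for v, sv in zip(values, specified))

def holds_v_arb(values, specified):
    """V_arb: a tautology, so it holds for every sequence."""
    return all(v in sv or v not in sv for v, sv in zip(values, specified))

# Hypothetical data: three service items and a three-word code.
specified = [{1}, {2}, {3}]
code = {1, 2, 3}
detectable = [1, 2, 99]    # third value is wrong but not a code word
undetectable = [1, 2, 1]   # third value is wrong and still a code word
```

With this data, `detectable` violates $V_{none}$ but satisfies $V_n$, while `undetectable` satisfies only $V_{arb}$, which matches the ordering from the strongest assertion to the weakest.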

Examples of Timing Error Assertions
– No timing error occurs ($T_{none}$)
– The only timing errors are omission errors ($T_O$)
– The only timing errors are late timing errors ($T_L$)
– The only timing errors are early timing errors ($T_E$)
– Arbitrary timing errors can occur ($T_{arb}$)
– Permanent omission/crash ($T_p$)
– Bounded omission degree ($T_{Bk}$)

Timing Error Implications

Failure Mode Assertions (FMA)
– A complete FMA entails an assertion on errors occurring in both the value and time domains
– By taking the Cartesian product of the two domains, we get a family of FMAs
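The Cartesian product can be enumerated directly. The sketch below pairs the example value and timing assertions listed on the earlier slides; the string labels and the function name are made up for illustration, and the family used in the paper's implication graph may not include every pair.

```python
from itertools import product

# Labels for the example assertions named on the slides.
VALUE_ASSERTIONS = ["V_none", "V_n", "V_arb"]
TIMING_ASSERTIONS = ["T_none", "T_O", "T_L", "T_E", "T_arb", "T_p", "T_Bk"]

def fma_family(value_assertions, timing_assertions):
    """Pair every value assertion with every timing assertion."""
    return list(product(value_assertions, timing_assertions))

family = fma_family(VALUE_ASSERTIONS, TIMING_ASSERTIONS)
for value_assertion, timing_assertion in family:
    print(f"{value_assertion} AND {timing_assertion}")
```

Three value assertions and seven timing assertions give 21 candidate failure mode assertions.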

FMA Implication Graph

So what? The FMA classification and implication graph can serve as a guideline for designing families of FT algorithms that can process errors of increasing severity.

Assumption Coverage
– Establishes a link between the assumed component failure mode and system dependability
– The design of an FT system relies on the assumptions it makes
– The dependability of an FT system is related to the failure mode it assumes

Motivation
– Components may fail
– They may fail in a bad way, which leads to a violation of the system's assumptions
– The system, in turn, can fail
– Question: to what degree can a component FMA prove to be true in the real system?

The Coverage of the Assumption
– Definition: $P(X) = \Pr\{X = \text{true} \mid \text{component failed}\}$
– $P(V_{arb} \wedge T_{arb}) = 1$
– $P(V_{none} \wedge T_{none}) = 0$

Coverage of an FT system
$PS(X) = \Pr\{\text{correct error processing} \mid X = \text{true}\} \times \Pr\{X = \text{true} \mid \text{component failed}\}$
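The product form suggests a trade-off: a stronger assumption may make error processing easier but is less likely to hold. The sketch below only multiplies the two conditional probabilities from the formula. The function name and all numbers are hypothetical and are not taken from the paper.

```python
def system_coverage(p_processing, p_assumption):
    """PS(X) = Pr{correct error processing | X} * Pr{X | component failed}."""
    for p in (p_processing, p_assumption):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_processing * p_assumption

# Hypothetical numbers, chosen only to show the trade-off.
# Strong assumption: processing is near perfect, but the assumption
# holds in only 95% of component failures.
strong = system_coverage(0.999, 0.95)
# Arbitrary failure assumption: coverage 1 by definition, with an
# assumed lower chance of correct error processing.
weak = system_coverage(0.98, 1.0)
print(strong, weak)
```

With these made-up figures the weaker assumption yields the higher system coverage, but different numbers can reverse the ordering.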

Influence of Assumption Coverage on System Dependability: A Case Study

The System
– A system of n processors
– Connected via unidirectional message-passing bus
– Each processor carries out the same computation steps
– The result of each processing step is communicated to all other processors
– Each processor has a decision function (DF)
– The DF is applied to the results received from other processors …
– Each processor and its associated bus is viewed as a single component

Fail-Silent Processor-bus
A fail-silent processor
– Only has semi-consistent value errors
– Always produces messages on time
– Or ceases to produce messages forever
– If a message is delivered to a processor, it is delivered to all processors with a consistent fixed delay

Fail-Consistent Processor Bus
– Only semi-consistent value errors may occur
– Faulty processors may send erroneous values
– Consistent timing errors may occur

Fail-Uncontrolled Processor Bus
– Arbitrary timing errors
– Arbitrary value errors

Implications of Assumption Coverage
– Failure mode relations
– Coverage relations

Dependability Expressions From Markov Models
– $r = e^{-\lambda t}$
– $\lambda$ = failure rate

A Life-critical Application
– System reliability objective: R > over 10 hours
– Single processor reliability: $r = e^{-\lambda t}$, with $1/\lambda$ = 5 years
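For a sense of scale, the single-processor figure can be evaluated numerically. This sketch assumes a 365-day year and takes $t$ to be the 10-hour mission time; the function and constant names are illustrative.

```python
import math

HOURS_PER_YEAR = 365 * 24  # assumption: 365-day year

def reliability(mean_time_to_failure_hours, mission_hours):
    """r = exp(-lambda * t), with lambda = 1 / mean time to failure."""
    failure_rate = 1.0 / mean_time_to_failure_hours
    return math.exp(-failure_rate * mission_hours)

# 1/lambda = 5 years, t = 10 hours, as on the slide.
r = reliability(5 * HOURS_PER_YEAR, 10)
print(f"r = {r:.6f}, 1 - r = {1 - r:.3e}")
```

Under these assumptions a single processor has a reliability of about 0.99977 over 10 hours, so a failure probability of roughly 2.3 in 10,000.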

A Money-Critical Application
– It is about the availability of the system rather than its reliability
– Please look at the paper for more details

Unavailability vs. Coverage

Conclusion
– A formalism for describing component failure modes
– Multiplicity of value and timing errors
– The notion of assumption coverage
– The relation between dependability, availability, and assumption coverage

Thank you