CS 603 Failure Models April 12, 2002

Fault Tolerance in Distributed Systems

- Perfect world: no failures
  - We don't live in a perfect world
- Non-distributed system: crash, and you're dead
- Distributed system: redundancy
  - Should result in less down time
  - But does it?
- Distributed systems according to Butler Lampson: "A distributed system is a system in which I can't get my work done because a computer that I've never heard of has failed."

Analogy: Single- vs. Multi-Engine Airplanes

- If the engine quits on a single-engine airplane, you will land soon
- In a multi-engine airplane, you can still fly
- Fatal accidents per 100,000 hours flown (1997, private aircraft):
  - Single-engine: 1.47
  - Multi-engine: 1.92
  - (Airlines: 0.038)
- Why is multi-engine more dangerous?
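One way to see the paradox: with two independent engines, the chance that *some* engine fails during a flight nearly doubles, and the outcome then depends on how well the pilot handles the asymmetric-thrust emergency. A quick back-of-the-envelope calculation, with entirely made-up probabilities `p` and `q`:

```python
p = 0.001  # assumed probability that one engine fails during a flight

# Single engine: an engine failure forces an immediate off-airport landing.
single_engine_accident = p

# Two independent engines: P(at least one fails) = 1 - (1-p)^2 = 2p - p^2,
# which is almost twice the single-engine failure probability.
at_least_one_fails = 1 - (1 - p) ** 2
print(at_least_one_fails)  # ~0.001999

# Redundancy only helps if the emergency is handled well. If a twin
# engine-out is mishandled with (assumed) probability q, the multi-engine
# accident rate can exceed the single-engine rate.
q = 0.6
multi_engine_accident = at_least_one_fails * q
```

With these illustrative numbers the multi-engine accident probability (about 0.0012) exceeds the single-engine one (0.001), mirroring the statistics on the slide: redundancy raises the number of failure events, and only careful failure handling turns that into a net win.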

Problems with Distributed Systems

- Failure is more frequent
  - Many places to fail
  - More complex: more pieces
  - New challenges: asynchrony, communication
- Potential problems of failure are greater
  - Single system: everything stops
  - Distributed: some parts continue
    - Consistency!

First Step: Goals

- Availability: can I use it now?
- Reliability: will it be up as long as I need it?
- Safety: if it fails, what are the consequences?
- Maintainability: how easy is it to fix if it breaks?
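Availability and reliability are often quantified. A minimal sketch, assuming the standard MTTF/MTTR definitions and exponentially distributed failures (the numbers are illustrative, not from the lecture):

```python
import math

mttf_hours = 1000.0  # assumed mean time to failure
mttr_hours = 2.0     # assumed mean time to repair

# Steady-state availability: fraction of time the system is usable.
availability = mttf_hours / (mttf_hours + mttr_hours)
print(f"availability: {availability:.4%}")   # ~99.80%

# Reliability: probability of running failure-free for a whole mission of
# length t, assuming an exponential failure distribution.
t = 24.0
reliability = math.exp(-t / mttf_hours)
print(f"reliability over {t}h: {reliability:.4f}")
```

Note the distinction the slide draws: a system rebooted every hour for a minute has high availability but poor reliability for any task longer than an hour.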

Next Step: Failure Models

- Failure: the system doesn't give the desired behavior
  - Component-level failure (can compensate)
  - System-level failure (incorrect result)
- Fault: cause of a failure (component-level)
  - Transient: not repeatable
  - Intermittent: repeats, but (apparently) independently of system operations
  - Permanent: exists until the component is repaired
- Failure model: how the system behaves when it doesn't behave properly

Failure Model (Flaviu Cristian, 1991)

- Dependency: proper operation of a database depends on proper operation of its processor and disk
- Failure classification: type of response to a failure
- Failure semantics: state of the system after a given class of failure
- Failure masking: a high-level operation succeeds even if it depends on failed services

Failure Classification

- Correct: in response to inputs, behaves in a manner consistent with the service specification
- Omission failure: doesn't respond to an input
  - Crash: after the first omission failure, all subsequent requests result in omission failures
- Timing failure (early, late): correct response, but outside the required time window
- Response failure
  - Value: wrong output for the inputs
  - State transition: server ends in the wrong state

Crash Failure Types (based on recovery behavior)

- Amnesia: server recovers to a predefined state, independent of the operations before the crash
- Partial amnesia: some part of the state is as it was before the crash; the rest is reset to a predefined state
- Pause: recovers to the state before the omission failure
- Halting: never restarts
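The recovery behaviors above can be sketched as follows; the dictionary-valued state and all function names are assumptions made for illustration:

```python
INITIAL_STATE = {"counter": 0}   # the server's predefined start state

def recover_amnesia(state_before_crash):
    # Amnesia: ignore everything that happened before the crash.
    return dict(INITIAL_STATE)

def recover_partial_amnesia(state_before_crash, checkpointed_keys):
    # Partial amnesia: keep only the checkpointed parts; reset the rest.
    state = dict(INITIAL_STATE)
    for key in checkpointed_keys:
        if key in state_before_crash:
            state[key] = state_before_crash[key]
    return state

def recover_pause(state_before_crash):
    # Pause: resume exactly where the server left off.
    return dict(state_before_crash)

# Halting has no recovery function: the server simply never restarts.
```

For example, a server that logs every update to stable storage and replays the log on restart exhibits pause semantics; one that only reloads a periodic checkpoint exhibits partial amnesia.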

Failure Semantics

- Max delay on a link: d; max service time: p
  - Should get a response within 2d + p
- Assume omission failures only
  - If no response within 2d + p, resend the request
- What if performance failures are possible?
  - Must distinguish between the responses to sr and sr'
  - [Figure: timeline of client requests sr, sr' and server responses f(sr), f(sr')]
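The timeout rule above can be sketched in code, assuming omission failure semantics only. The constants standing in for d and p are made-up values, and the lossy link is simulated:

```python
import queue

D = 0.05              # assumed max one-way link delay (d)
P = 0.02              # assumed max service time (p)
TIMEOUT = 2 * D + P   # a correct server must answer within 2d + p

def call_with_retry(send, replies, request, max_tries=3):
    """Resend on timeout. Sound ONLY under omission semantics:
    if performance failures were possible, a late reply to the first
    copy could be mistaken for the reply to the resent copy."""
    for _ in range(max_tries):
        send(request)
        try:
            return replies.get(timeout=TIMEOUT)
        except queue.Empty:
            continue   # omission assumed: safe to resend the same request
    raise TimeoutError("no reply after retries")

# Simulate a link that drops the first copy of the request.
replies = queue.Queue()
attempts = []
def send(req):
    attempts.append(req)
    if len(attempts) >= 2:          # second copy gets through
        replies.put(("ok", req))

result = call_with_retry(send, replies, "sr")
```

The docstring is the whole point of the slide: the resend rule silently bakes the failure semantics of the server into the client's correctness argument.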

- The specification for a service must include
  - Failure-free (normal) semantics
  - Failure semantics (likely failure behaviors)
- Multiple semantics
  - Combine to give (weaker) semantics
  - Arbitrary failure semantics: the weakest possible
- Choice of failure semantics
  - Is this class of failure likely? (probability of each type of failure)
  - What is the cost of failure? Catastrophic?

Hierarchical Failure Masking

- Dependency: a higher level gets (at best) the failure semantics of the lower level
- Can compensate for lower-level failures to improve this
- Example: TCP fixes communication errors, so some failure semantics are not propagated to the higher level
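The TCP example can be mimicked with a retrying send wrapper: a lower layer masks omission failures so the layer above sees, at worst, a performance failure. The lossy-link model here is entirely hypothetical:

```python
def unreliable_send(msg, drop_pattern):
    """Hypothetical lossy link: returns True iff the message got through.
    drop_pattern is an iterator of booleans (True = this copy is dropped)."""
    return not next(drop_pattern)

def reliable_send(msg, drop_pattern, retries=5):
    """Masks the link's omission failures by retrying. The caller sees
    either success (possibly slow) or a single crash-like error, never
    the individual drops."""
    for attempt in range(retries):
        if unreliable_send(msg, drop_pattern):
            return attempt + 1   # number of tries needed (extra latency)
        # drop detected: retry transparently, like TCP retransmission
    raise ConnectionError("omission not maskable after retries")

# Link drops the first two copies, then delivers.
tries = reliable_send("hello", iter([True, True, False]))
```

Note what the masking costs: the higher level pays in latency (three tries here), which is exactly the "omission becomes performance failure" upgrade discussed on the next slide.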

Group Failure Masking

- Redundant servers
  - A failed server is masked by the others in the group
  - Allows the failure semantics of the group to be higher than the individuals'
- k-fault tolerant: the group can mask k concurrent group-member failures from clients
- May "upgrade" failure semantics
  - Example: the group detects a non-responsive server; another member picks up the slack
  - An omission failure becomes a performance failure
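The simplest form of group masking can be sketched as a client that tries each of k+1 replicas in turn, so up to k crashed members are hidden (at the price of extra latency, i.e., a performance failure). The replica callables below are illustrative:

```python
def group_call(replicas, request):
    """replicas: list of callables; a crashed member raises an exception.
    A group of k+1 members masks up to k crash failures from the client."""
    last_error = None
    for replica in replicas:
        try:
            return replica(request)   # first live member answers
        except Exception as err:
            last_error = err          # member failed; try the next one
    raise RuntimeError("all group members failed") from last_error

def dead(_request):
    raise ConnectionError("crashed")

def live(request):
    return f"reply:{request}"

# Two crashed members masked by a 3-member group (k = 2):
assert group_call([dead, dead, live], "x") == "reply:x"
```

This sketch covers crash failures only; masking value failures (a member that answers wrongly) requires voting over multiple replies instead of taking the first one.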

Additional Issues

- Tradeoff of failure probabilities / semantics. Which is better?
  - P[Catastrophic] = 0.01 and P[Performance] = 0.5
  - P[Catastrophic] = 0.1 and P[Performance] = 0.01
- Techniques for managing failure
- Recovery mechanisms
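The tradeoff question above can be made concrete with an expected-cost comparison; the cost figures are assumed purely for illustration:

```python
COST_CATASTROPHIC = 1_000_000.0   # assumed cost of one catastrophic failure
COST_PERFORMANCE = 100.0          # assumed cost of one performance failure

def expected_cost(p_catastrophic, p_performance):
    return (p_catastrophic * COST_CATASTROPHIC
            + p_performance * COST_PERFORMANCE)

a = expected_cost(0.01, 0.5)    # rare catastrophe, frequent slowness
b = expected_cost(0.1, 0.01)    # frequent catastrophe, rare slowness
print(a, b)                     # profile A: ~10050, profile B: ~100001
assert a < b                    # with these costs, profile A wins easily
```

The answer depends entirely on the cost ratio: if performance failures were nearly as expensive as catastrophic ones, the comparison could flip, which is why the slide poses it as an open question rather than stating a winner.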