Introduction to Fault Tolerance By Sahithi Podila.


Basic Concepts

• Fault tolerance in distributed systems is closely related to the notion of dependable systems.
• Dependability is a term that covers a set of useful requirements for distributed systems:
  1. Availability
  2. Reliability
  3. Safety
  4. Maintainability

Dependability
• Availability is the property that the system is ready to be used immediately, that is, ready to perform its functions on behalf of the user. A highly available system is most likely to be working at any given instant in time.
• Reliability is the property that the system can run continuously without failure. It is defined over a time interval rather than an instant: a highly reliable system is most likely to keep working without interruption for a relatively long period.
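The difference between the two properties is easy to see with numbers. The sketch below is illustrative only: the function name `availability` and the uptime/downtime figures are assumptions, not part of the slides. It computes availability as the fraction of time a system is up and compares a system that fails often but briefly with one that fails rarely but for a long time.

```python
def availability(uptime: float, downtime: float) -> float:
    """Fraction of total time the system is usable (same unit for both arguments)."""
    return uptime / (uptime + downtime)


# Hypothetical system A: goes down for one millisecond every hour (values in seconds).
# It fails 24 times a day, so it is unreliable, yet it is almost always available.
system_a = availability(uptime=3600 - 0.001, downtime=0.001)

# Hypothetical system B: never crashes, but is shut down two weeks a year (values in days).
# It runs for long intervals without failure, so it is reliable, yet less available.
system_b = availability(uptime=365 - 14, downtime=14)

print(f"A: {system_a:.6%} available, one failure per hour")
print(f"B: {system_b:.2%} available, one outage per year")
```

System A comes out above 99.9999 percent availability despite failing every hour, while system B is only about 96 percent available despite running for months without interruption.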

Dependability
• Safety means that nothing catastrophic happens when the system temporarily fails to operate correctly.
• Maintainability refers to how easily a failed system can be repaired.

Fault and Error
• A failure occurs when a system cannot provide one or more of its required services.
• An error is the part of the system's state that may lead to a failure.
• A fault is the cause of an error.

Fault Tolerance
• Fault tolerance is the ability of a system to keep providing its services even in the presence of faults.
• Types of fault:
  Transient: the fault occurs once and then disappears.
  Intermittent: the fault occurs, goes away, and then comes back at unpredictable times. Intermittent faults are difficult to diagnose.
  Permanent: the fault persists until the faulty component is replaced. Example: burnt-out chips.
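Because a transient fault occurs once and disappears, simply repeating the operation is often enough to mask it, whereas a permanent fault keeps failing no matter how often the operation is repeated. The following is a minimal sketch of that idea. `TransientFault`, `retry`, and `make_flaky` are hypothetical names introduced for illustration; they do not come from the slides.

```python
import time


class TransientFault(Exception):
    """Hypothetical exception standing in for a fault that may go away on retry."""


def retry(operation, attempts=3, delay_s=0.0):
    """Call operation() up to `attempts` times, returning the first successful result."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except TransientFault as err:
            last_error = err
            time.sleep(delay_s)  # give the fault time to disappear
    # Every attempt failed: the fault behaves like a permanent one.
    raise last_error


def make_flaky(failures):
    """Build a simulated operation that fails `failures` times, then succeeds."""
    remaining = {"count": failures}

    def operation():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise TransientFault("simulated glitch")
        return "ok"

    return operation


print(retry(make_flaky(1)))  # the single glitch is masked by the second attempt
```

An operation that fails more times than `attempts` allows still raises `TransientFault`, which is the point at which diagnosis and replacement of the faulty component are needed.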

Failure Models

Types of failure

Type of failure             Description
Crash failure               A server halts, but is working correctly until it halts
Omission failure            A server fails to respond to incoming requests
  Receive omission          A server fails to receive incoming messages
  Send omission             A server fails to send messages
Timing failure              A server’s response lies outside the specified time interval
Response failure            A server’s response is incorrect
  Value failure             The value of the response is wrong
  State transition failure  The server deviates from the correct flow of control
Arbitrary failure           A server may produce arbitrary responses at arbitrary times
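From a client's point of view, some of these failure types can be told apart by what comes back and when. The sketch below is a simplified illustration under strong assumptions: the client knows the expected value and the permitted response interval, and the names `Failure` and `classify` are hypothetical. In a real system a client usually cannot tell a crashed server from one that omits or delays a reply.

```python
from enum import Enum


class Failure(Enum):
    NONE = "no failure"
    OMISSION = "omission failure"
    TIMING = "timing failure"
    VALUE = "response failure (value failure)"


def classify(response, expected, elapsed_s, min_s, max_s):
    """Classify one request as seen by the client.

    response  -- the reply, or None if no reply arrived
    expected  -- the correct reply (assumed known, for illustration only)
    elapsed_s -- time until the reply arrived
    min_s, max_s -- the specified interval in which a reply must arrive
    """
    if response is None:
        return Failure.OMISSION  # the server did not respond to the request
    if not (min_s <= elapsed_s <= max_s):
        return Failure.TIMING    # the reply lies outside the specified interval
    if response != expected:
        return Failure.VALUE     # the reply arrived in time but its value is wrong
    return Failure.NONE


print(classify(None, 42, 0.0, 0.0, 1.0).value)
print(classify(42, 42, 2.5, 0.0, 1.0).value)
print(classify(41, 42, 0.3, 0.0, 1.0).value)
```

Arbitrary failures are deliberately left out: a server that produces arbitrary responses at arbitrary times can look like any of the cases above, including a correct reply.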

Redundancy

Failure Masking by Redundancy
• Three kinds of redundancy:
  Information redundancy: extra bits are added so that garbled bits can be recovered.
  Time redundancy: an action is performed and, if needed, performed again. Example: transactions, which can be redone after an abort.
  Physical redundancy: extra physical components are added so that the system can tolerate the malfunction of some components.
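As one concrete instance of information redundancy, a Hamming(7,4) code adds three parity bits to four data bits so that any single garbled bit can be located and corrected. The slides do not name a particular code, so the choice of Hamming(7,4) and the function names `encode` and `decode` are assumptions made for this sketch.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword.

    Codeword positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the garbled bit, or 0 if none.
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


word = encode([1, 0, 1, 1])
word[4] ^= 1          # garble one bit in transit
print(decode(word))   # the original data bits are recovered
```

The code handles exactly one garbled bit per codeword; two or more flipped bits are decoded incorrectly, so the amount of redundancy has to match the faults that are expected.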

Physical Redundancy
• Physical redundancy is a well-known technique for fault tolerance.
• The following example shows how physical redundancy achieves fault tolerance in an electronic circuit.

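Triple modular redundancy
• Triple modular redundancy (TMR) is a general technique for fault tolerance.
• Each device is replicated three times, and each stage is followed by voters. If two or three of a voter's inputs agree, its output equals that input; if all three inputs differ, the output is undefined.
• If device A1 fails, the circuit still works, because the voters still receive correct inputs from A2 and A3.
• A fault in voter V1 has the same effect as a fault in device B1: in both cases the output of B1 is wrong and is outvoted in the next stage.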

Reference: Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems: Principles and Paradigms. Second Edition.

Thank You