Introduction to Fault Tolerance


By Sandeep Karanam

Content: fault tolerance in distributed systems, failure models, failure masking.

Fault tolerance in distributed systems
Being fault tolerant is closely related to being a dependable system. Dependability is a term that covers a number of useful requirements for distributed systems: availability, reliability, safety, and maintainability.

Availability is the property that the system is ready to be used immediately on behalf of the user. A highly available system is most likely to be working at any given instant in time.
Reliability is the property that the system can run continuously without failure. It is defined over a time interval rather than an instant: a highly reliable system is likely to keep working without interruption for a relatively long period.
Safety means that when the system temporarily fails to operate correctly, nothing disastrous happens.
Maintainability refers to how easily a failed system can be repaired.
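The difference between availability and reliability is easier to see with numbers. The sketch below is illustrative only: the function name `availability` and the two scenarios are assumptions chosen for the example, using the common steady-state ratio of uptime to total time.

```python
def availability(uptime: float, downtime: float) -> float:
    """Fraction of total time during which the system is usable."""
    return uptime / (uptime + downtime)


# Scenario 1 (assumed): the system goes down for 1 millisecond every hour.
# It is almost always available, yet it never runs a full hour without
# failing, so its reliability is poor.
hour_s = 3600.0
glitch_s = 0.001
blinking = availability(hour_s - glitch_s, glitch_s)

# Scenario 2 (assumed): the system never crashes but is switched off for
# 2 weeks out of 52 every year. It is reliable while running, but its
# availability is noticeably lower.
seasonal = availability(50, 2)

print(f"1 ms outage per hour:  {blinking:.8f}")
print(f"2 weeks off per year:  {seasonal:.4f}")
```

The first system has an availability above 99.9999% and the second about 96%, even though the second is the one that runs for months without interruption.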

Fault and error
A fault is a physical defect, imperfection, or flaw that occurs in some hardware or software component. Examples are a short circuit between two adjacent interconnects, a broken pin, or a software bug.
An error is a deviation from correctness or accuracy in computation that occurs as a result of a fault. Errors are usually associated with incorrect values in the system state. For example, a circuit or a program computes an incorrect value, or incorrect information is received during data transmission.

Types of fault
Transient: these faults occur once and then disappear.
Intermittent: these faults occur, go away, and then reappear at irregular times. They are difficult to diagnose.
Permanent: these faults persist until the faulty component is diagnosed and repaired or replaced. Example: burnt-out chips.
Transient faults are the dominant type of fault in computer memories. For example, about 98% of RAM faults are transient. The causes of transient faults are mostly environmental, such as alpha particles, cosmic rays, electrostatic discharge, electrical power drops, overheating, or mechanical shock. Intermittent faults can be due to implementation flaws, aging and wear-out, or unexpected operating conditions.

Fault tolerance
Fault tolerance is the ability of a system to continue performing its intended function in spite of faults. It is needed because it is practically impossible to build a perfect system: as the complexity of a system increases, its reliability deteriorates drastically unless compensatory measures are taken.

Failure models
Crash failure: a server halts, but works correctly until it halts.
Omission failure: a server fails to respond to incoming requests, either because it fails to receive incoming messages or because it fails to send messages.
Timing failure: a server's response lies outside the specified time interval.
Response failure: a server's response is incorrect, for example because the value of the response is wrong.
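One way to keep the four failure models apart is to write them down as a classification rule. The following is a minimal sketch, not part of the original slides: `Observation`, `classify`, and their fields are hypothetical names, and the `halted` flag assumes an all-seeing observer (a real client usually cannot tell a crashed server from one that merely omits a reply).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Failure(Enum):
    NONE = "no failure"
    CRASH = "crash failure"
    OMISSION = "omission failure"
    TIMING = "timing failure"
    RESPONSE = "response failure"


@dataclass
class Observation:
    """Hypothetical record of one request as seen by an ideal observer."""
    halted: bool            # server stopped running
    reply: Optional[int]    # None means no reply was ever produced
    elapsed_s: float = 0.0  # time from request to reply, in seconds


def classify(obs: Observation, expected: int,
             min_s: float, max_s: float) -> Failure:
    """Map one observation to the failure model it matches."""
    if obs.halted:
        return Failure.CRASH
    if obs.reply is None:
        return Failure.OMISSION
    if not (min_s <= obs.elapsed_s <= max_s):
        return Failure.TIMING      # reply arrived outside the allowed interval
    if obs.reply != expected:
        return Failure.RESPONSE    # reply arrived in time but its value is wrong
    return Failure.NONE


print(classify(Observation(False, 42, 0.2), expected=42, min_s=0.0, max_s=1.0))
print(classify(Observation(False, 41, 0.2), expected=42, min_s=0.0, max_s=1.0))
```

The checks are ordered from the most severe condition to the least, so each observation is assigned exactly one model.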

Failure masking by redundancy
Information redundancy: extra bits are added so that garbled bits can be recovered.
Time redundancy: an action is performed and then, if needed, performed again. Example: transactions, where an aborted transaction can be redone.
Physical redundancy: extra physical components are added so that the system can tolerate the malfunction of some components.
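Two of these forms of redundancy are easy to sketch in code. The example below is an illustration under assumptions, not something taken from the slides: it uses a Hamming(7,4) code as one possible instance of information redundancy (three extra bits per four data bits, enough to correct any single garbled bit), and a hypothetical `retry` helper as an instance of time redundancy.

```python
def hamming74_encode(data):
    """Encode 4 data bits as 7 bits by adding 3 parity bits."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers codeword positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(code):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s4   # 0 means no error detected
    if error_pos:
        c[error_pos - 1] ^= 1          # flip the garbled bit back
    return [c[2], c[4], c[5], c[6]]


def retry(action, attempts=3):
    """Time redundancy: run `action` again if it raises, up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as exc:       # broad on purpose in this sketch
            last_error = exc
    raise last_error


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                            # garble one bit "in transit"
print(hamming74_decode(word))           # the original data bits are recovered
```

Retrying only helps when the fault is transient and the action is safe to repeat; a permanent fault would fail on every attempt.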

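Triple modular redundancy
Triple modular redundancy (TMR) is a general technique for fault tolerance based on physical redundancy. Each device is replicated three times, and each stage is followed by a triplicated voter. Each voter has three inputs and one output: if two or three of the inputs agree, the output equals that input; if all three inputs differ, the output is undefined. If device A1 fails, each voter still receives correct inputs from A2 and A3, so the circuit keeps working. The voters are triplicated because a voter can fail too: a fault in voter V1 has effectively the same consequence as a fault in B1, the device it feeds.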
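The voting rule can be sketched in a few lines. This is a minimal illustration, assuming devices are modelled as plain functions; `vote`, `tmr_stage`, and the sample devices are hypothetical names, and `None` stands in for an undefined output.

```python
def vote(a, b, c):
    """Majority voter: return the value at least two inputs agree on."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None  # all three inputs differ: output is undefined


def tmr_stage(replicas, inputs):
    """Run three replicas of a device, then three voters on their outputs.

    Returns one voted value per voter, ready to feed the next stage.
    """
    outputs = [device(x) for device, x in zip(replicas, inputs)]
    return [vote(*outputs) for _ in range(3)]


def healthy(x):
    return x + 1


def faulty(x):
    return -999  # models a failed replica such as A1


# One replica fails, yet every voter still outputs the correct value.
print(tmr_stage([faulty, healthy, healthy], [1, 1, 1]))
```

A single stage masks one faulty replica; if two replicas of the same stage fail with the same wrong value, the voters would pass that wrong value on.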

Reference: Andrew S. Tanenbaum and Maarten van Steen. Distributed Systems: Principles and Paradigms. Second Edition, 2007.
Thank you