Fault Tolerance CSC 8320 : AOS Class Presentation Shiraj Pokharel


Chapter 11.7: Fault Tolerance

Outline
- What is Fault Tolerance?
- Availability & Reliability
- Failure Models
- Process Resilience and Replication
- Case Study: Multicasting (Distributed Banking)
- CORBA, Fault Tolerant CORBA
- Example: Distributed Producer Consumer Problem
- Distributed Snapshots Algorithm
- Checkpointing
- Distributed Checkpointing
- Future Research

Fault Tolerance
- A distributed system (DS) should be fault-tolerant: it should continue functioning in the presence of faults.
- Fault tolerance is closely related to dependability.

Availability & Reliability
- Availability: a measure of whether a system is ready to be used immediately, i.e. the system is up and running at any given moment.
- Reliability: a measure of whether a system can run continuously without failure, i.e. the system continues to function for a long period of time.

Faults
- A system fails when it cannot meet its promises (specifications).
- An error is a part of the system state that may lead to a failure.
- A fault is the cause of an error.
- Fault tolerance: the system can provide its services even in the presence of faults.
- Types of faults:
  - Transient: appear once and disappear.
  - Intermittent: appear, disappear, and reappear (for example, a loose contact on a connector).
  - Permanent: appear and persist until repaired.

Failure Models

Type of failure                   Description
Crash failure                     A server halts, but is working correctly until it halts
Omission failure                  A server fails to respond to incoming requests
  Receive omission                A server fails to receive incoming messages
  Send omission                   A server fails to send messages
Timing failure                    A server's response lies outside the specified time interval
Response failure                  The server's response is incorrect
  Value failure                   The value of the response is wrong
  State transition failure        The server deviates from the correct flow of control
Arbitrary (Byzantine) failure     A server may produce arbitrary responses at arbitrary times

Process Resilience
- Mask process failures by replication.
- Organize processes into groups; a message sent to a group is delivered to all members.
- If a member fails, another should fill in.

Process Replication
- Replicate a process and place the replicas in one group.
- How many replicas do we create?
- A system is k fault-tolerant if it can survive and function even with k faulty processes.
- For crash failures (a faulty process halts, but works correctly until it halts): k+1 replicas, so that at least one replica survives.
- For Byzantine failures (a faulty process may produce arbitrary responses at arbitrary times): 2k+1 replicas, so that the k+1 correct replies outvote the k faulty ones.
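
The replica counts above can be illustrated with a small Python sketch. The function names `replicas_needed` and `vote` are hypothetical and chosen only for this example; the voting rule assumes a client that collects one reply per replica and accepts a value only when at least k+1 replicas agree on it.

```python
from collections import Counter


def replicas_needed(k, byzantine):
    """Replica count for a k fault-tolerant group: k+1 (crash) or 2k+1 (Byzantine)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return 2 * k + 1 if byzantine else k + 1


def vote(replies, k):
    """Return the value reported by at least k+1 replicas, or None if there is none."""
    if not replies:
        return None
    value, count = Counter(replies).most_common(1)[0]
    return value if count >= k + 1 else None
```

With k = 2 and five replicas, three matching replies are enough to mask two arbitrary ones; with fewer matching replies the sketch refuses to return a result.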

Reliable Group Communication
- Assumption: the group is static and processes do not fail.
- Reliable communication means delivering the message to all group members.
- Delivery may be in any order, or ordered.
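
As one possible illustration of ordered delivery, the sketch below implements per-sender FIFO ordering with a hold-back queue. This is an assumption made for the example: the slide does not say which ordering is meant, and the class name `FifoReceiver` and the per-sender sequence numbers are hypothetical.

```python
class FifoReceiver:
    """Delivers each sender's messages in sequence-number order (FIFO ordering)."""

    def __init__(self):
        self.next_seq = {}   # sender -> next expected sequence number
        self.held = {}       # sender -> {seq: payload} that arrived too early
        self.delivered = []  # (sender, payload) pairs in delivery order

    def receive(self, sender, seq, payload):
        expected = self.next_seq.get(sender, 0)
        if seq < expected:
            return  # duplicate of an already delivered message, ignore it
        pending = self.held.setdefault(sender, {})
        pending[seq] = payload
        # Deliver every message that is now in order; hold back the rest.
        while expected in pending:
            self.delivered.append((sender, pending.pop(expected)))
            expected += 1
        self.next_seq[sender] = expected
```

A message that arrives ahead of its predecessors stays in the hold-back queue until the gap is filled, so the application only ever sees each sender's messages in the order they were sent.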

A Distributed Banking Transaction
[Figure: a client runs transaction T against three branch servers; each server calls join to become a participant.]
Client transaction T:
  T = openTransaction
    a.withdraw(4);
    c.deposit(4);
    b.withdraw(3);
    d.deposit(3);
  closeTransaction
Participants: BranchX holds account A (a.withdraw(4)); BranchY holds account B (b.withdraw(3)); BranchZ holds accounts C and D (c.deposit(4), d.deposit(3)).

CORBA

Fault Tolerant CORBA
[Figure: fault-tolerant CORBA architecture]
The architecture also shows the use of message-level interceptors. In the Eternal system, each invocation is intercepted and passed to a separate replication component, which maintains the required consistency for an object group and ensures that messages are logged to enable recovery.

Producer Consumer problem
[Figure: processes p and q; p sends message m to q.]
Example:
1. p records its state before sending m.
2. p sends m to q.
3. q receives m and then records its state.
The recorded state: the sender (p) has no record of sending m, but the receiver (q) has a record of receiving it.
Result: the global state records the receive event but not the send event, violating the happens-before relation.

Cuts
[Figure: two cuts drawn across the process timelines: a consistent cut (a meaningful global state) and an inconsistent cut.]
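
A cut is consistent when every receive event it contains has its matching send event inside the cut as well. The following Python sketch checks that condition. The representation is an assumption made for the example: a cut is given as a frontier (how many events of each process lie inside it), and the function name `is_consistent_cut` is hypothetical.

```python
def is_consistent_cut(frontier, messages):
    """Check that every received message inside the cut was also sent inside it.

    frontier: process name -> number of that process's events inside the cut.
    messages: list of ((sender, send_index), (receiver, recv_index)) pairs,
              where indices count events on each process starting from 1.
    """
    def inside(event):
        process, index = event
        return index <= frontier.get(process, 0)

    return all(inside(send) for send, recv in messages if inside(recv))
```

For a message sent as the second event of p and received as the first event of q, the frontier `{"p": 1, "q": 1}` is inconsistent (receive without send), while `{"p": 2, "q": 0}` is consistent and simply leaves the message in transit.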

Distributed Snapshot Algorithm
- When a process finishes its local snapshot, it collects its local state and sends it to the initiator of the distributed snapshot.
- The initiator can then analyze the state.
- This is one algorithm for distributed global snapshots, but it is not particularly efficient for large systems.
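
The slide does not spell out how each local snapshot is taken. A minimal sketch, assuming the marker-based Chandy-Lamport scheme over FIFO channels, is shown below. All names (`Network`, `Process`, `run_example`) and the two-process money-transfer scenario are hypothetical and exist only for illustration.

```python
from collections import deque

MARKER = "MARKER"


class Network:
    """FIFO channels, one queue per (source, destination) pair."""

    def __init__(self):
        self.channels = {}

    def send(self, src, dst, msg):
        self.channels.setdefault((src, dst), deque()).append(msg)

    def deliver(self, src, dst, procs):
        msg = self.channels[(src, dst)].popleft()
        procs[dst].on_message(src, msg, self)


class Process:
    def __init__(self, name, state, peers):
        self.name = name
        self.state = state            # application state: an account balance
        self.peers = peers            # names of the other processes
        self.recorded_state = None    # local snapshot, once taken
        self.channel_log = {}         # sender -> messages recorded as in transit
        self.open_channels = set()    # incoming channels still being recorded

    def take_snapshot(self, network):
        # Record local state, start recording all incoming channels, send markers.
        self.recorded_state = self.state
        self.open_channels = set(self.peers)
        self.channel_log = {peer: [] for peer in self.peers}
        for peer in self.peers:
            network.send(self.name, peer, MARKER)

    def on_message(self, sender, msg, network):
        if msg == MARKER:
            if self.recorded_state is None:
                self.take_snapshot(network)
            self.open_channels.discard(sender)  # stop recording this channel
        else:
            self.state += msg
            if sender in self.open_channels:
                self.channel_log[sender].append(msg)


def run_example():
    net = Network()
    procs = {"p": Process("p", 10, ["q"]), "q": Process("q", 5, ["p"])}
    # p transfers 3 to q and q transfers 2 to p; both messages are in transit.
    procs["p"].state -= 3
    net.send("p", "q", 3)
    procs["q"].state -= 2
    net.send("q", "p", 2)
    procs["p"].take_snapshot(net)  # p initiates the snapshot
    for src, dst in [("p", "q"), ("p", "q"), ("q", "p"), ("q", "p")]:
        net.deliver(src, dst, procs)
    # Each process reports its recorded state and channel logs to the initiator.
    return {name: (pr.recorded_state, pr.channel_log) for name, pr in procs.items()}
```

In this scenario the recorded balances plus the one message recorded in transit add up to the original total of 15, which is what makes the snapshot a meaningful global state.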

Checkpointing
- We’ve discussed distributed snapshots.
- The most recent distributed snapshot in a system is also called the recovery line.

Independent Checkpointing
It is often difficult to find a recovery line in a system where every process just records its local state every so often; a domino effect (cascading rollback) can result.
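
The domino effect can be reproduced with a small Python sketch that searches for a recovery line among independently taken checkpoints. The data layout is an assumption for this example: each checkpoint stores the ids of all messages sent and received so far, entry 0 is the initial state, and the function name `find_recovery_line` is hypothetical.

```python
def find_recovery_line(checkpoints):
    """Return process -> index of the checkpoint it must roll back to.

    checkpoints: process -> list of (sent_ids, received_ids) sets, oldest first.
    Entry 0 of every list is the initial state with both sets empty.
    """
    line = {proc: len(cps) - 1 for proc, cps in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        all_sent = set()
        for proc, index in line.items():
            all_sent |= checkpoints[proc][index][0]
        for proc, index in line.items():
            received = checkpoints[proc][index][1]
            if not received <= all_sent:
                # Orphan message: recorded as received but not as sent.
                line[proc] = index - 1
                changed = True
    return line


# p checkpoints, sends m1; q receives m1, checkpoints, sends m2;
# p receives m2, checkpoints, sends m3; q receives m3, checkpoints.
DOMINO = {
    "p": [(set(), set()), (set(), set()), ({"m1"}, {"m2"})],
    "q": [(set(), set()), (set(), {"m1"}), ({"m2"}, {"m1", "m3"})],
}
```

Starting from the latest checkpoints in `DOMINO`, each rollback of one process orphans a message on the other, so the search cascades until q is back at its initial state.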

Coordinated Checkpointing
- To solve this problem, systems can implement coordinated checkpointing.
- We’ve discussed one algorithm for distributed snapshots, but it’s not particularly efficient for large systems.
- Another way is to use a two-phase blocking protocol (with a coordinator) to get every process to checkpoint its local state “simultaneously”.

Coordinated Checkpointing
- Make sure that processes are synchronized when taking the checkpoint.
- Two-phase blocking protocol:
  1. Coordinator multicasts CHECKPOINT_REQUEST.
  2. Each process takes a local checkpoint, delays further sends, and acknowledges to the coordinator (sending its state).
  3. Coordinator multicasts CHECKPOINT_DONE.
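
The two-phase blocking protocol can be sketched in Python as follows. This is a single-threaded illustration under simplifying assumptions (no message loss, no coordinator failure); the class `Participant`, its methods, and `coordinated_checkpoint` are hypothetical names, and multicast is modelled as a plain loop over the participants.

```python
class Participant:
    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.checkpoint = None
        self.blocked = False  # True while application sends are delayed
        self.outbox = []      # sends queued during the blocking phase
        self.sent = []        # messages actually handed to the network

    def send(self, msg):
        if self.blocked:
            self.outbox.append(msg)
        else:
            self.sent.append(msg)

    def on_checkpoint_request(self):
        self.checkpoint = self.state  # take the local checkpoint
        self.blocked = True           # delay further sends
        return (self.name, self.checkpoint)  # acknowledgement with state

    def on_checkpoint_done(self):
        self.blocked = False
        self.sent.extend(self.outbox)  # release the delayed sends
        self.outbox.clear()


def coordinated_checkpoint(participants):
    # Phase 1: multicast CHECKPOINT_REQUEST and collect every acknowledgement.
    acks = [p.on_checkpoint_request() for p in participants]
    # Phase 2: multicast CHECKPOINT_DONE so processes may send again.
    for p in participants:
        p.on_checkpoint_done()
    return dict(acks)
```

Because no process sends application messages between taking its checkpoint and receiving CHECKPOINT_DONE, no message can be recorded as received without also being recorded as sent, so the collected checkpoints form a recovery line.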

Future Research
- More robust handling of design faults in both hardware and software.
- Interdependent and intradependent design components need better coordination mechanisms.
- Better hardware and software design methodologies to minimize the scope of system faults.


Thank You