Fault Tolerance Chap 7.

Presentation transcript:

Fault Tolerance Chap 7

Index: Introduction to Fault Tolerance; Process Resilience; Reliable Client-Server Communication; Reliable Group Communication; Distributed Commit; Recovery; Summary

Basic Concepts
Dependable systems have four properties:
- Availability: the property that a system is ready to be used immediately
- Reliability: the property that a system can run continuously without failure
- Safety: if a system temporarily fails to operate correctly, nothing catastrophic happens
- Maintainability: how easily a failed system can be repaired

Failure Models
Different types of failures:
Type of failure             Description
Crash failure               A server halts, but is working correctly until it halts
Omission failure            A server fails to respond to incoming requests
  Receive omission          A server fails to receive incoming messages
  Send omission             A server fails to send messages
Timing failure              A server's response lies outside the specified time interval
Response failure            The server's response is incorrect
  Value failure             The value of the response is wrong
  State transition failure  The server deviates from the correct flow of control
Arbitrary failure           A server may produce arbitrary responses at arbitrary times

Failure Masking by Redundancy
Three kinds of redundancy can be used to mask faults: information redundancy, time redundancy, and physical redundancy.
Figure: triple modular redundancy.
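Triple modular redundancy is a form of physical redundancy: each unit is replicated three times and a voter forwards the majority output. The following is a minimal sketch of that idea, not taken from the slides; the functions `good` and `faulty` are hypothetical stand-ins for replicated units.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of outputs, or None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None

def tmr(replicas, x):
    """Run three replicas of the same computation and vote on their outputs."""
    return majority_vote([replica(x) for replica in replicas])

def good(x):
    return x * 2

def faulty(x):
    return -1  # hypothetical faulty unit: always produces a wrong value

result_one_fault = tmr([good, good, faulty], 21)     # the fault is masked
result_two_faults = tmr([good, faulty, faulty], 21)  # the voter is outvoted
```

With one faulty replica the voter still returns the correct value; with two faulty replicas that happen to agree, the wrong value wins, which is why three replicas mask only a single fault.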

Flat groups versus Hierarchical groups Communication in a flat group. Communication in a simple hierarchical group

Failure Masking and Replication
- Failure masking → use a group of identical processes
- A group of processes may be organized in a hierarchical fashion, with a primary and one or more backups
- An important issue with using process groups to tolerate faults is how much replication is needed

Agreement in Faulty Systems (1)
Figure: a red army of 5000 troops faces two blue armies of 3000 troops each, one commanded by Napoleon and one by Alexander; the blue armies try to coordinate an attack.
It is easy to show that Alexander and Napoleon will never reach agreement, no matter how many acknowledgements they send, because communication is unreliable.

Agreement in Faulty Systems (2)
The Byzantine generals problem for 3 loyal generals and 1 traitor.
(a) The generals announce their troop strengths (in units of 1 kilosoldiers).
(b) The vectors that each general assembles based on (a).
(c) The vectors that each general receives in step 3.

Agreement in Faulty Systems (3) The same as in previous slide, except now with 2 loyal generals and one traitor. Lamport proved that in a system with m faulty processes, agreement can be achieved only if 2m+1 correctly functioning processes are present, for a total of 3m+1.
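The vector exchange described on these two slides can be simulated in a few lines. This is a sketch under stated assumptions rather than the slides' own code: the traitor's behaviour (a different lie to every recipient) is made up for illustration, and `UNKNOWN` marks an element for which no majority exists.

```python
from collections import Counter

UNKNOWN = None

def byzantine_agreement(strengths, traitors):
    """Simulate the vector exchange; return each loyal general's result vector."""
    n = len(strengths)
    generals = range(n)

    # Steps 1-2: every general announces its strength and each general
    # assembles a vector. A traitor tells a different lie to every recipient.
    vector = {
        i: [strengths[j] if j not in traitors or j == i else f"lie-{j}-to-{i}"
            for j in generals]
        for i in generals
    }

    # Step 3: every general sends its vector to all others.
    # A traitor sends a different made-up vector to every recipient.
    def sent(sender, recipient):
        if sender in traitors:
            return [f"junk-{sender}-{recipient}-{k}" for k in generals]
        return vector[sender]

    # Step 4: per element, take the majority over the received vectors.
    results = {}
    for i in generals:
        if i in traitors:
            continue
        received = [sent(j, i) for j in generals if j != i]
        result = []
        for k in generals:
            value, count = Counter(v[k] for v in received).most_common(1)[0]
            result.append(value if count > len(received) // 2 else UNKNOWN)
        results[i] = result
    return results

four = byzantine_agreement([1, 2, 3, 4], traitors={2})   # 3 loyal, 1 traitor
three = byzantine_agreement([1, 2, 3], traitors={2})     # 2 loyal, 1 traitor
```

With four generals (3m+1 for m = 1) every loyal general ends up with the same vector, with only the traitor's element unknown. With three generals no element reaches a majority, so the loyal generals learn nothing.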

RPC Semantics in the Presence of Failures
Five classes of failure:
1. The client is unable to locate the server
2. The request message from the client to the server is lost
3. The server crashes after receiving a request
4. The reply message from the server to the client is lost
5. The client crashes after sending a request

RPC Semantics in the Presence of Failures
- The client is unable to locate the server → (solution) raise an exception (like divide by 0) → this destroys transparency
- The request message from the client to the server is lost → start a timer when sending the request; if the timer expires, the request message is sent again
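The timer-based handling of a lost request can be sketched as follows. The names `call_with_retry` and `make_lossy_channel` are hypothetical, and the transport is simulated: it raises `TimeoutError` for each request it "loses".

```python
def call_with_retry(send_request, request, max_attempts=3):
    """Send a request; resend it whenever the timer expires."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send_request(request), attempt
        except TimeoutError:
            continue  # timer expired: send the request again
    raise TimeoutError(f"no reply after {max_attempts} attempts")

def make_lossy_channel(losses):
    """Hypothetical transport that loses the first `losses` requests."""
    state = {"remaining": losses}

    def send_request(request):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TimeoutError("request lost")
        return f"reply to {request}"

    return send_request

# Two requests are lost; the third attempt gets through.
reply, attempts = call_with_retry(make_lossy_channel(2), "read(x)")
```

Bounding the number of attempts matters: when every attempt times out, the client cannot tell a lost request from a crashed server, so the sketch gives up and raises.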

Server Crashes (1)
A server in client-server communication: (a) normal case, (b) crash after execution, (c) crash before execution.
Three philosophies exist on what to do here:
- At-least-once semantics
- At-most-once semantics
- Do nothing (no guarantee)

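Server Crashes (2)
Different combinations of client and server strategies in the presence of server crashes (M = send completion message, P = print text, C = crash; events in parentheses never happen):
                        Server strategy M → P     Server strategy P → M
Client reissue strategy MPC    MC(P)  C(MP)       PMC    PC(M)  C(PM)
Always                  DUP    OK     OK          DUP    DUP    OK
Never                   OK     ZERO   ZERO        OK     OK     ZERO
Only when ACKed         DUP    OK     ZERO        DUP    OK     ZERO
Only when not ACKed     OK     ZERO   OK          OK     DUP    OK
OK = text is printed once, DUP = text is printed twice, ZERO = text is not printed at all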
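The entries of this table follow mechanically from two facts per scenario: whether the text was printed (P) before the crash (C), and whether the completion message (M) reached the client. A small sketch that derives them is given below; it assumes, as a simplification, that a reissued request is executed exactly once by the recovered server. The policy names are hypothetical labels for the four reissue strategies.

```python
def outcome(events, reissue):
    """Classify one crash scenario.

    events:  events that happened before the crash, e.g. "MP", "M", "P", "".
    reissue: client policy: "always", "never", "acked", or "not_acked".
    """
    printed = "P" in events   # the text was printed before the crash
    acked = "M" in events     # the completion message reached the client
    retry = {
        "always": True,
        "never": False,
        "acked": acked,
        "not_acked": not acked,
    }[reissue]
    prints = int(printed) + int(retry)   # a reissued request prints once more
    return {0: "ZERO", 1: "OK", 2: "DUP"}[prints]

# Scenario label -> events that actually happened (parenthesized ones did not).
SCENARIOS = {"MPC": "MP", "MC(P)": "M", "C(MP)": "",
             "PMC": "PM", "PC(M)": "P", "C(PM)": ""}

table = {policy: [outcome(ev, policy) for ev in SCENARIOS.values()]
         for policy in ("always", "never", "acked", "not_acked")}
```

No row of `table` consists only of OK entries: whatever the client and server strategies, some crash ordering leads to a duplicate or a missing print.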

Lost Reply Messages
- No reply → send the request once more (safe when the request is idempotent)
- Assign each request a sequence number, so that the server can tell a retransmission from a new request
Client Crashes
A client crash can leave an orphan: a computation that is still running on the server although no one is waiting for its result. Four ways to deal with orphans:
- Extermination: check the log and kill the orphan
- Reincarnation: when the client reboots, it broadcasts a message to all machines declaring the start of a new epoch
- Gentle reincarnation: like reincarnation, but an orphan is killed only if its owner cannot be found
- Expiration: each RPC is given a standard amount of time
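Sequence numbers let a server handle a retransmitted request without executing it twice, which matters when the operation is not idempotent. The sketch below is illustrative only; the `Server` class and its deposit-style operation are assumptions, not part of the slides.

```python
class Server:
    """Hypothetical server that filters retransmissions by sequence number."""

    def __init__(self):
        self.balance = 0
        self.replies = {}   # (client_id, seq) -> cached reply

    def handle(self, client_id, seq, amount):
        key = (client_id, seq)
        if key in self.replies:      # retransmission: do not re-execute
            return self.replies[key]
        self.balance += amount       # non-idempotent operation
        self.replies[key] = self.balance
        return self.balance

server = Server()
first = server.handle("c1", 1, 100)
retry = server.handle("c1", 1, 100)   # reply was lost, client resends seq 1
second = server.handle("c1", 2, 50)   # a genuinely new request
```

The resent request returns the cached reply, so the balance is updated once per sequence number rather than once per message received.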

Basic Reliable-Multicasting
A simple solution to reliable multicasting when all receivers are known and are assumed not to fail: (a) message transmission, (b) reporting feedback.
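One common way to realize this scheme is for the sender to keep every message in a history buffer until all known receivers have acknowledged it, and to retransmit from that buffer when a receiver reports a gap. The sketch below assumes that design; the class and method names are hypothetical.

```python
class Sender:
    """Keeps each message in a history buffer until every receiver ACKs it."""

    def __init__(self, receivers):
        self.receivers = set(receivers)
        self.history = {}    # seq -> message, kept for retransmission
        self.acks = {}       # seq -> set of receivers that have ACKed
        self.next_seq = 0

    def multicast(self, message):
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = message
        self.acks[seq] = set()
        return seq

    def on_ack(self, receiver, seq):
        if seq not in self.history:
            return                      # already released
        self.acks[seq].add(receiver)
        if self.acks[seq] == self.receivers:
            del self.history[seq]       # everyone has it: free the buffer slot
            del self.acks[seq]

    def on_nack(self, seq):
        """A receiver reported a missing message: return it for retransmission."""
        return self.history[seq]

sender = Sender({"r1", "r2"})
s0 = sender.multicast("m0")
sender.on_ack("r1", s0)
resent = sender.on_nack(s0)   # r2 missed m0 and asks for it again
sender.on_ack("r2", s0)
```

Because every receiver sends feedback to the sender, this approach works only for small groups; larger groups need the feedback-control schemes on the following slides.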

Nonhierarchical Feedback Control Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.
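Feedback suppression is usually implemented with random timers: each receiver that misses a message waits a random delay before multicasting its retransmission request, and cancels the request if it sees someone else's first. A rough sketch follows; the delays and the `latency` parameter (time for a request to reach the other receivers) are assumptions for illustration.

```python
import random

def nack_senders(delays, latency=0.0):
    """Receivers whose timer fires before the first request reaches them."""
    earliest = min(delays.values())
    return sorted(r for r, d in delays.items() if d <= earliest + latency)

# In practice each receiver would draw its delay at random, for example:
rng = random.Random(0)
random_delays = {r: rng.random() for r in ("r1", "r2", "r3", "r4")}

# Fixed delays (in seconds) make the behaviour easy to follow.
delays = {"r1": 0.30, "r2": 0.10, "r3": 0.12, "r4": 0.90}
ideal = nack_senders(delays)                 # only the first request is sent
realistic = nack_senders(delays, 0.05)       # r3 fires before r2's request arrives
```

With instantaneous delivery only one request is sent. With a nonzero network latency, receivers whose timers fire close together still send duplicates, so suppression reduces feedback without eliminating it.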

Hierarchical Feedback Control The essence of hierarchical reliable multicasting. Each local coordinator forwards the message to its children. A local coordinator handles retransmission requests.

Virtual Synchrony (1) The logical organization of a distributed system to distinguish between message receipt and message delivery

Virtual Synchrony (2) The principle of virtual synchronous multicast.

Message Ordering (1)
Unordered multicast (three processes in the same group): P1 sends m1 and then m2; P2 receives m1 first, while P3 receives m2 first.
FIFO-ordered multicast (four processes in the same group): P1 sends m1 and then m2; P3 sends m3 and then m4; P2 and P4 each receive m1 before m2 and m3 before m4, but may interleave the two senders' messages differently.
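FIFO-ordered delivery is commonly implemented with per-sender sequence numbers and a hold-back queue: a message is received when it arrives, but delivered only after all earlier messages from the same sender. The sketch below assumes that approach; `FifoReceiver` and the sequence numbering from 0 are illustrative choices.

```python
from collections import defaultdict

class FifoReceiver:
    """Delivers each sender's messages in the order they were sent."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # sender -> next expected number
        self.held = defaultdict(dict)      # sender -> {seq: message}
        self.delivered = []

    def receive(self, sender, seq, message):
        if seq < self.next_seq[sender]:
            return                         # duplicate of a delivered message
        self.held[sender][seq] = message
        # Deliver as many consecutive messages from this sender as possible.
        while self.next_seq[sender] in self.held[sender]:
            self.delivered.append(self.held[sender].pop(self.next_seq[sender]))
            self.next_seq[sender] += 1

p2 = FifoReceiver()
p2.receive("P1", 1, "m2")   # arrives early: held back
p2.receive("P3", 0, "m3")
p2.receive("P1", 0, "m1")   # releases m1, then m2
p2.receive("P3", 1, "m4")
```

After these four receipts `p2.delivered` is `["m3", "m1", "m2", "m4"]`: the order per sender is preserved, while messages from different senders are interleaved freely.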

Reliable causally-ordered multicast delivers messages so that potential causality between different messages is preserved. Total-ordered delivery is an additional, independent constraint:
Multicast                 Basic Message Ordering    Total-ordered Delivery?
Reliable multicast        None                      No
FIFO multicast            FIFO-ordered delivery     No
Causal multicast          Causal-ordered delivery   No
Atomic multicast          None                      Yes
FIFO atomic multicast     FIFO-ordered delivery     Yes
Causal atomic multicast   Causal-ordered delivery   Yes

Implementing Virtual Synchrony
Isis: point-to-point communication is reliable, using TCP.
(a) Process 4 notices that process 7 has crashed and sends a view change.
(b) Process 6 sends out all its unstable messages, followed by a flush message.
(c) Process 6 installs the new view when it has received a flush message from everyone else.

Two-phase Commit (1)
(a) The finite state machine for the coordinator in 2PC. (b) The finite state machine for a participant.
If a process crashes, other processes may wait indefinitely for a message → the protocol can easily fail → timeout mechanisms are used.

Two-phase Commit (2)
Outline of the steps taken by the coordinator in a two-phase commit protocol.
Actions by coordinator:
    write START_2PC to local log;
    multicast VOTE_REQUEST to all participants;
    while not all votes have been collected {
        wait for any incoming vote;
        if timeout {
            write GLOBAL_ABORT to local log;
            multicast GLOBAL_ABORT to all participants;
            exit;
        }
        record vote;
    }
    if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
        write GLOBAL_COMMIT to local log;
        multicast GLOBAL_COMMIT to all participants;
    } else {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
    }
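The coordinator outline above can be turned into a small runnable sketch. Participants are modelled as hypothetical callables that return a vote or raise `TimeoutError`; the local log is a plain list, and multicasting the final decision to the participants is left out for brevity.

```python
def two_phase_commit(participants, coordinator_vote="COMMIT"):
    """Coordinator side of 2PC. Returns (decision, log).

    participants: callables returning "VOTE_COMMIT" or "VOTE_ABORT",
    or raising TimeoutError when no vote arrives in time (an assumption
    of this sketch, standing in for real message passing).
    """
    log = ["START_2PC"]
    votes = []
    for request_vote in participants:       # stands in for VOTE_REQUEST
        try:
            votes.append(request_vote())
        except TimeoutError:
            log.append("GLOBAL_ABORT")      # a missing vote forces an abort
            return "GLOBAL_ABORT", log
    if coordinator_vote == "COMMIT" and all(v == "VOTE_COMMIT" for v in votes):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"
    log.append(decision)                    # log before announcing the decision
    return decision, log

def yes():
    return "VOTE_COMMIT"

def no():
    return "VOTE_ABORT"

def silent():
    raise TimeoutError("participant did not vote")

committed = two_phase_commit([yes, yes, yes])
aborted = two_phase_commit([yes, no, yes])
timed_out = two_phase_commit([yes, silent])
```

The decision is written to the log before it would be sent out, so a coordinator that crashes and recovers can read its log and resend the same decision.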

Three-phase Commit Finite state machine for the coordinator in 3PC Finite state machine for a participant

Recovery: Introduction
Goal: replace an erroneous state with an error-free state.
- Backward recovery: the system's state is recorded from time to time (checkpointing), and a recorded state is restored when things go wrong → checkpoints are combined with message logging
- Forward recovery: an attempt is made to bring the system into a correct new state from which it can continue to execute

Stable Storage
(a) Stable storage. (b) Crash after drive 1 is updated. (c) Bad spot.
Stable storage is well suited to applications that require a high degree of fault tolerance.
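Stable storage is typically built from a pair of drives that are updated in a fixed order and compared during recovery. The sketch below models the two cases named on the slide; the dictionary-based drives, the CRC checksum, and the `crash_after_first` flag are assumptions made for illustration.

```python
import zlib

def make_block(data):
    return {"data": data, "crc": zlib.crc32(data)}

def is_valid(block):
    return block is not None and zlib.crc32(block["data"]) == block["crc"]

def stable_write(drives, index, data, crash_after_first=False):
    """Update drive 1 first, then drive 2."""
    drives[0][index] = make_block(data)
    if crash_after_first:
        return                     # simulated crash before drive 2 is updated
    drives[1][index] = make_block(data)

def recover(drives, index):
    """Make both copies of a block valid and identical again."""
    first, second = drives[0].get(index), drives[1].get(index)
    if is_valid(first) and (not is_valid(second) or first != second):
        drives[1][index] = dict(first)     # drive 1 holds the newer copy
    elif is_valid(second) and not is_valid(first):
        drives[0][index] = dict(second)    # bad spot on drive 1

drives = [{}, {}]
stable_write(drives, 0, b"old")
stable_write(drives, 0, b"new", crash_after_first=True)
recover(drives, 0)                         # drive 2 catches up to b"new"
```

Writing drive 1 first is what makes recovery unambiguous: when the two copies differ and both checksums are valid, drive 1 must be the more recent one.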

Checkpointing A recovery line.

Independent Checkpointing The domino effect.

Message Logging Incorrect replay of messages after recovery, leading to an orphan process.

Summary (1)
- Fault tolerance is defined as the characteristic by which a system can mask the occurrence of, and recovery from, failures
- Redundancy is the key technique needed to achieve fault tolerance
- Reliable group communication is suitable for small groups
- Atomic multicasting can be precisely formulated in terms of a virtual synchronous execution model

Summary (2)
- Group membership change → agreement on the same list of members → using a commit protocol
- Recovery in fault-tolerant systems is invariably achieved by checkpointing combined with message logging
- Open question: the discussion of RPC failures only covers how to kill an orphan → why not reuse the orphan's result instead?

End of chapter 7. Thank you for joining us!