Kyung Hee University 1/33 Fault Tolerance Chap 7.



Kyung Hee University 2/33 Index
• Introduction to Fault Tolerance
• Process Resilience
• Reliable Client-Server Communication
• Reliable Group Communication
• Distributed Commit
• Recovery
• Summary

Kyung Hee University 3/33 Basic Concepts
Dependable systems have four properties:
• Availability: the property that a system is ready to be used immediately
• Reliability: the property that a system can run continuously without failure
• Safety: if a system temporarily fails to operate correctly, nothing catastrophic happens
• Maintainability: how easily a failed system can be repaired

Kyung Hee University 4/33 Failure Models
Different types of failures.

Type of failure               Description
Crash failure                 A server halts, but is working correctly until it halts
Omission failure              A server fails to respond to incoming requests
  Receive omission            A server fails to receive incoming messages
  Send omission               A server fails to send messages
Timing failure                A server's response lies outside the specified time interval
Response failure              The server's response is incorrect
  Value failure               The value of the response is wrong
  State transition failure    The server deviates from the correct flow of control
Arbitrary failure             A server may produce arbitrary responses at arbitrary times

Kyung Hee University 5/33 Failure Masking by Redundancy
Three kinds of redundancy can mask faults: information redundancy, time redundancy, and physical redundancy.
Figure: triple modular redundancy.
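Triple modular redundancy is a form of physical redundancy: a component is replicated three times and a voter passes on the value that at least two replicas agree on. The following is a minimal sketch of such a voter, not taken from the slides; the names `vote`, `tmr`, `good` and `faulty` are hypothetical.

```python
from collections import Counter

def vote(a, b, c):
    """Majority voter: return the value that at least two of the three inputs share."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        # All three inputs differ, so more than one replica must be faulty.
        raise ValueError("no majority among replicas")
    return value

def tmr(replicas, x):
    """Run three replicas of the same computation on x and vote on the results."""
    return vote(*(replica(x) for replica in replicas))

# Hypothetical replicas: two correct ones and one that always returns garbage.
good = lambda x: x * x
faulty = lambda x: -1

result = tmr([good, faulty, good], 7)  # the single faulty replica is outvoted
```

The voter masks one faulty replica out of three; if two replicas fail in different ways, no majority exists and the sketch raises an error rather than guessing.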

Kyung Hee University 6/33 Flat Groups versus Hierarchical Groups
a) Communication in a flat group
b) Communication in a simple hierarchical group

Kyung Hee University 7/33 Failure Masking and Replication
• Failure masking: having a group of identical processes
• A group of processes can be organized in a hierarchical fashion, with a primary and one or more backups
• An important issue with using process groups to tolerate faults is how much replication is needed

Kyung Hee University 8/33 Agreement in Faulty Systems (1)
Figure labels: Red Troop; Blue Troop commanded by Napoleon; Blue Troop commanded by Alexander; Attack.
• Because communication is unreliable, it is easy to show that Alexander and Napoleon will never reach agreement, no matter how many acknowledgements they send.

Kyung Hee University 9/33 Agreement in Faulty Systems (2)
The Byzantine generals problem for 3 loyal generals and 1 traitor.
a) The generals announce their troop strengths (in units of 1 kilosoldier).
b) The vectors that each general assembles based on (a).
c) The vectors that each general receives in step 3.

Kyung Hee University 10/33 Agreement in Faulty Systems (3) The same as in previous slide, except now with 2 loyal generals and one traitor. Lamport proved that in a system with m faulty processes, agreement can be achieved only if 2m+1 correctly functioning processes are present, for a total of 3m+1.
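The two scenarios on these slides can be replayed with a small simulation. This is a sketch under stated assumptions, not the slides' own code: generals are numbered from 0, the traitor tells each loyal general a different strength (`lies`) and later sends each of them a made-up vector (`forged_vectors`), and every loyal general takes a per-element majority over the vectors it receives from the others. All function and parameter names are hypothetical.

```python
from collections import Counter

UNKNOWN = None  # marks an element for which no majority exists

def majority(values):
    """Return the value held by a strict majority of `values`, else UNKNOWN."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else UNKNOWN

def byzantine_agreement(strengths, traitor, lies, forged_vectors):
    """Return the result vector computed by each loyal general.

    strengths:      true troop strength of each general
    traitor:        index of the single traitor
    lies:           receiver -> strength the traitor announced to that receiver
    forged_vectors: receiver -> vector the traitor sent to that receiver
    """
    n = len(strengths)
    # Announcement round: each loyal general i assembles a vector of announced values.
    vec = {
        i: [lies[i] if j == traitor else strengths[j] for j in range(n)]
        for i in range(n) if i != traitor
    }
    results = {}
    for i in vec:
        # Exchange round: general i receives one vector from every other general.
        received = [forged_vectors[i] if k == traitor else vec[k]
                    for k in range(n) if k != i]
        # Decision: take the majority of each element over the received vectors.
        results[i] = [majority([v[j] for v in received]) for j in range(n)]
    return results

def enough_processes(n, m):
    """Lamport's bound as quoted on the slide: n >= 3m + 1 processes for m faulty ones."""
    return n >= 3 * m + 1

four = byzantine_agreement([1, 2, 3, 4], traitor=2,
                           lies={0: "x", 1: "y", 3: "z"},
                           forged_vectors={0: list("abcd"), 1: list("efgh"), 3: list("ijkl")})
three = byzantine_agreement([1, 2, 3], traitor=2,
                            lies={0: "x", 1: "y"},
                            forged_vectors={0: list("abc"), 1: list("def")})
```

With four generals every loyal general ends up with the same vector, in which only the traitor's entry is unknown. With three generals no element reaches a majority, which matches the 3m+1 bound for m = 1.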

Kyung Hee University 11/33 RPC Semantics in the Presence of Failures
1. The client is unable to locate the server
2. The request message from the client to the server is lost
3. The server crashes after receiving a request
4. The reply message from the server to the client is lost
5. The client crashes after sending a request

Kyung Hee University 12/33 RPC Semantics in the Presence of Failures
1. The client is unable to locate the server
   • Solution: raise an exception (as for division by 0)
   • Drawback: this destroys transparency
2. The request message from the client to the server is lost
   • Start a timer when sending the request
   • If the timer expires before a reply arrives, the request message is sent again
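The timer-based handling of a lost request can be sketched as a retry loop. The example below is illustrative only: `LossyChannel` is a hypothetical stand-in for a network that silently drops the first few requests, and a dropped request is modelled by raising Python's built-in `TimeoutError` instead of actually waiting.

```python
class LossyChannel:
    """Hypothetical channel that loses the first `drops` requests it is given."""

    def __init__(self, drops):
        self.drops = drops
        self.sent = 0  # number of request messages put on the wire

    def request(self, payload, timeout):
        self.sent += 1
        if self.sent <= self.drops:
            # Models the client's timer expiring because no reply arrived.
            raise TimeoutError(f"no reply within {timeout}s")
        return "reply to " + payload

def call_with_retry(channel, payload, timeout=1.0, max_tries=5):
    """Send the request again each time the timer expires, up to max_tries times."""
    for _ in range(max_tries):
        try:
            return channel.request(payload, timeout)
        except TimeoutError:
            continue  # timer expired: retransmit the same request
    raise ConnectionError(f"gave up after {max_tries} attempts")

channel = LossyChannel(drops=2)
reply = call_with_retry(channel, "read block 7")
```

The client cannot tell a lost request from a lost reply or a slow server, so blind retransmission is only safe when the server can cope with receiving the same request more than once.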

Kyung Hee University 13/33 Server Crashes (1)
Three philosophies exist on what to do here:
• At-least-once semantics
• At-most-once semantics
• Do nothing
Figure: a server in client-server communication.
a) Normal case
b) Crash after execution
c) Crash before execution

Kyung Hee University 14/33 Server Crashes (2)
Different combinations of client and server strategies in the presence of server crashes.

Client                  Server strategy M → P       Server strategy P → M
Reissue strategy        MPC    MC(P)   C(MP)        PMC    PC(M)   C(PM)
Always                  DUP    OK      OK           DUP    DUP     OK
Never                   OK     ZERO    ZERO         OK     OK      ZERO
Only when ACKed         DUP    OK      ZERO         DUP    OK      ZERO
Only when not ACKed     OK     ZERO    OK           OK     DUP     OK
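The entries of this table can be derived mechanically. The sketch below assumes the usual reading of the notation, which the slide itself does not spell out: M is the server sending its completion message, P is the server performing the operation (printing text), C is the crash, and events in parentheses never happen because of the crash. It further assumes a reissued request is executed exactly once after the server recovers. The function and policy names are hypothetical.

```python
def outcome(done_before_crash, policy):
    """Classify how often the operation ran, given what finished before the crash.

    done_before_crash: string of events completed before the crash, e.g. "MP", "M", ""
    policy: "always", "never", "when_acked" or "when_not_acked"
    """
    performed = "P" in done_before_crash   # the operation ran before the crash
    acked = "M" in done_before_crash       # the client saw the completion message
    reissue = {
        "always": True,
        "never": False,
        "when_acked": acked,
        "when_not_acked": not acked,
    }[policy]
    # A reissued request runs once more on the recovered server (assumption).
    runs = int(performed) + int(reissue)
    return {0: "ZERO", 1: "OK", 2: "DUP"}[runs]

# Column order of the table: MPC, MC(P), C(MP), PMC, PC(M), C(PM).
SCENARIOS = ["MP", "M", "", "PM", "P", ""]
POLICIES = ["always", "never", "when_acked", "when_not_acked"]

table = {p: [outcome(s, p) for s in SCENARIOS] for p in POLICIES}
for policy, row in table.items():
    print(f"{policy:15s}", *row)
```

No row consists of OK only, which is the point of the slide: no combination of client and server strategy works correctly for every possible crash moment.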

Kyung Hee University 15/33 Lost Reply Messages and Client Crashes
Lost reply messages:
• No reply: send the request once more (safe if the request is idempotent)
• Assign each request a sequence number so that retransmissions can be recognized
Client crashes:
• A client crash can leave an orphan behind
• Extermination: check the log and kill the orphan
• Reincarnation: when the client reboots, it broadcasts a message to all machines declaring the start of a new epoch
• Gentle reincarnation: like reincarnation, but an orphan is killed only if its owner cannot be found
• Expiration: each RPC is given a standard amount of time to do its job
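Sequence numbers let a server recognize a retransmitted request and answer it from a cache instead of executing it twice. The following sketch shows one way this could look for a non-idempotent operation; the `Server` class, the deposit operation and the reply cache are illustrative assumptions, not something the slides prescribe.

```python
class Server:
    """Hypothetical server that filters duplicate requests by (client, sequence number)."""

    def __init__(self):
        self.balance = 0
        self.replies = {}  # (client_id, seq) -> reply already sent for that request

    def deposit(self, client_id, seq, amount):
        key = (client_id, seq)
        if key in self.replies:
            # Retransmission of a request that was already executed:
            # resend the cached reply, do not execute again.
            return self.replies[key]
        self.balance += amount  # not idempotent: must run at most once per request
        self.replies[key] = self.balance
        return self.balance

server = Server()
first = server.deposit("client-1", seq=1, amount=100)
retry = server.deposit("client-1", seq=1, amount=100)  # reply was lost, client resends
second = server.deposit("client-1", seq=2, amount=100)  # a genuinely new request
```

The cache grows without bound in this sketch; a real server would need some rule for discarding old entries.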

Kyung Hee University 16/33 Basic Reliable Multicasting
A simple solution to reliable multicasting when all receivers are known and are assumed not to fail.
a) Message transmission
b) Reporting feedback

Kyung Hee University 17/33 Nonhierarchical Feedback Control Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others.

Kyung Hee University 18/33 Hierarchical Feedback Control
The essence of hierarchical reliable multicasting.
a) Each local coordinator forwards the message to its children.
b) A local coordinator handles retransmission requests.

Kyung Hee University 19/33 Virtual Synchrony (1) The logical organization of a distributed system to distinguish between message receipt and message delivery

Kyung Hee University 20/33 Virtual Synchrony (2) The principle of virtual synchronous multicast.

Kyung Hee University 21/33 Message Ordering (1)
1. Unordered multicast

Process P1    Process P2     Process P3
sends m1      receives m1    receives m2
sends m2      receives m2    receives m1

2. FIFO-ordered multicast

Process P1    Process P2     Process P3     Process P4
sends m1      receives m1    receives m3    sends m3
sends m2      receives m3    receives m1    sends m4
              receives m2    receives m2
              receives m4    receives m4
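FIFO ordering only constrains messages from the same sender, so a receiver can enforce it with one counter per sender and a hold-back buffer. The sketch below assumes each sender numbers its messages 1, 2, 3, ...; the class and attribute names are hypothetical. It also illustrates the distinction between receiving a message and delivering it to the application.

```python
class FifoReceiver:
    """Delivers messages from each sender in the order that sender sent them."""

    def __init__(self):
        self.next_seq = {}   # sender -> next sequence number to deliver
        self.held = {}       # sender -> {seq: message} received but not yet delivered
        self.delivered = []  # messages handed to the application, in delivery order

    def receive(self, sender, seq, message):
        expected = self.next_seq.get(sender, 1)
        if seq < expected:
            return  # duplicate of an already delivered message
        buffer = self.held.setdefault(sender, {})
        buffer[seq] = message
        # Deliver every message that is now in sequence for this sender.
        while expected in buffer:
            self.delivered.append(buffer.pop(expected))
            expected += 1
        self.next_seq[sender] = expected

receiver = FifoReceiver()
receiver.receive("P1", 2, "m2")  # arrives early: held back until m1 shows up
receiver.receive("P4", 1, "m3")  # different sender: delivered at once
receiver.receive("P1", 1, "m1")  # releases m1 and then the held-back m2
receiver.receive("P4", 2, "m4")
```

Here the delivery order is m3, m1, m2, m4, which is the order shown for P3 in the FIFO table: m1 precedes m2 and m3 precedes m4, while messages from different senders may interleave freely.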

Kyung Hee University 22/33 Message Ordering (2)
3. Reliable causally-ordered multicast: delivers messages so that potential causality between different messages is preserved
4. Total-ordered delivery

Multicast                  Basic message ordering     Total-ordered delivery?
Reliable multicast         None                       No
FIFO multicast             FIFO-ordered delivery      No
Causal multicast           Causal-ordered delivery    No
Atomic multicast           None                       Yes
FIFO atomic multicast      FIFO-ordered delivery      Yes
Causal atomic multicast    Causal-ordered delivery    Yes

Kyung Hee University 23/33 Implementing Virtual Synchrony
Isis (point-to-point communication is reliable, using TCP).
a) Process 4 notices that process 7 has crashed and sends a view change
b) Process 6 sends out all its unstable messages, followed by a flush message
c) Process 6 installs the new view when it has received a flush message from everyone else

Kyung Hee University 24/33 Two-phase Commit (1)
• If a process crashes, other processes may wait indefinitely for a message
• The protocol can therefore easily fail, so timeout mechanisms are used
a) The finite state machine for the coordinator in 2PC.
b) The finite state machine for a participant.

Kyung Hee University 25/33 Two-phase Commit (2)
Outline of the steps taken by the coordinator in a two-phase commit protocol.

actions by coordinator:
    write START_2PC to local log;
    multicast VOTE_REQUEST to all participants;
    while not all votes have been collected {
        wait for any incoming vote;
        if timeout {
            write GLOBAL_ABORT to local log;
            multicast GLOBAL_ABORT to all participants;
            exit;
        }
        record vote;
    }
    if all participants sent VOTE_COMMIT and coordinator votes COMMIT {
        write GLOBAL_COMMIT to local log;
        multicast GLOBAL_COMMIT to all participants;
    } else {
        write GLOBAL_ABORT to local log;
        multicast GLOBAL_ABORT to all participants;
    }
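The coordinator outline can be turned into a small runnable sketch. To keep it self-contained, networking is replaced by assumptions: each participant is a plain callable that returns its vote or raises Python's built-in `TimeoutError`, the local log is a list, and the final multicast is represented by the returned decision. All names are hypothetical.

```python
def run_coordinator(participants, coordinator_vote="COMMIT"):
    """Run the coordinator side of two-phase commit; return (decision, log)."""
    log = ["START_2PC"]                      # write START_2PC to local log
    votes = []
    for ask_for_vote in participants:        # stands in for multicasting VOTE_REQUEST
        try:
            votes.append(ask_for_vote())     # wait for and record an incoming vote
        except TimeoutError:
            log.append("GLOBAL_ABORT")       # a missing vote forces an abort
            return "GLOBAL_ABORT", log
    if coordinator_vote == "COMMIT" and all(v == "VOTE_COMMIT" for v in votes):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"
    log.append(decision)                     # log the decision before announcing it
    return decision, log

# Hypothetical participants.
def says_commit():
    return "VOTE_COMMIT"

def says_abort():
    return "VOTE_ABORT"

def never_answers():
    raise TimeoutError("no vote received")

committed, commit_log = run_coordinator([says_commit, says_commit])
aborted, _ = run_coordinator([says_commit, says_abort])
timed_out, timeout_log = run_coordinator([says_commit, never_answers])
```

A single VOTE_ABORT or a single timeout is enough to abort the whole transaction, and the decision is written to the log before it is announced so that a recovering coordinator can tell what it had decided.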

Kyung Hee University 26/33 Three-phase Commit
a) Finite state machine for the coordinator in 3PC
b) Finite state machine for a participant

Kyung Hee University 27/33 Recovery: Introduction
• Goal: replace an erroneous state with an error-free state
• Backward recovery: restore a previously recorded correct state when things go wrong
   • Combines checkpoints and message logging
• Forward recovery: an attempt is made to bring the system into a correct new state from which it can continue to execute

Kyung Hee University 28/33 Stable Storage
Stable storage is well suited to applications that require a high degree of fault tolerance.
a) Stable storage
b) Crash after drive 1 is updated
c) Bad spot
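The two failure cases in the figure can be illustrated with a sketch of a mirrored pair of drives. It rests on assumptions the slide does not state: every block carries a checksum, drive 1 is always written before drive 2, and recovery compares the drives block by block. Drives are modelled as Python lists, and all function names are hypothetical.

```python
import zlib

def make_block(data):
    """Store data together with a CRC-32 checksum of it."""
    return {"data": data, "crc": zlib.crc32(data)}

def is_valid(block):
    return zlib.crc32(block["data"]) == block["crc"]

def stable_write(drive1, drive2, index, data, crash_after_first=False):
    """Write drive 1 first, then drive 2; optionally simulate a crash in between."""
    drive1[index] = make_block(data)
    if crash_after_first:
        return
    drive2[index] = make_block(data)

def recover(drive1, drive2):
    """Make both drives identical again after a crash or a bad spot."""
    for i in range(len(drive1)):
        b1, b2 = drive1[i], drive2[i]
        if is_valid(b1) and not is_valid(b2):
            drive2[i] = dict(b1)   # bad spot on drive 2: repair from drive 1
        elif is_valid(b2) and not is_valid(b1):
            drive1[i] = dict(b2)   # bad spot on drive 1: repair from drive 2
        elif is_valid(b1) and b1 != b2:
            drive2[i] = dict(b1)   # both readable but different: drive 1 is newer

# Case (b): crash after drive 1 is updated.
d1, d2 = [make_block(b"old")], [make_block(b"old")]
stable_write(d1, d2, 0, b"new", crash_after_first=True)
recover(d1, d2)

# Case (c): a bad spot corrupts the data on drive 1 without updating the checksum.
d1[0]["data"] = b"garbage"
recover(d1, d2)
```

Because drive 1 is always written first, a mismatch between two readable blocks can only mean that the crash hit between the two writes, so copying from drive 1 to drive 2 completes the interrupted update.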

Kyung Hee University 29/33 Checkpointing A recovery line.

Kyung Hee University 30/33 Independent Checkpointing The domino effect.

Kyung Hee University 31/33 Message Logging Incorrect replay of messages after recovery, leading to an orphan process.

Kyung Hee University 32/33 Summary (1)
• Fault tolerance is defined as the characteristic by which a system can mask the occurrence of, and recovery from, failures
• Redundancy is the key technique needed to achieve fault tolerance
• Reliable group communication is suitable for small groups
• Atomic multicasting can be precisely formulated in terms of a virtually synchronous execution model

Kyung Hee University 33/33 Summary (2)
• Group membership change: requires agreement on the same list of members, reached using a commit protocol
• Recovery in fault-tolerant systems is invariably achieved by checkpointing with message logging
• Open question: for RPC failures, only ways to kill an orphan are discussed; why not reuse the orphan instead?

Kyung Hee University 34/33 End of chapter 7 Thank you for joining us!