Real Time Multimedia Lab Fault Tolerance Chapter – 7 (Distributed Systems) Mr. Imran Rao Ms. NiuYu 22 nd November 2005.

Slides:



Advertisements
Similar presentations
Dr. Kalpakis CMSC621 Advanced Operating Systems Fault Tolerance.
Advertisements

Chapter 8 Fault Tolerance
1 CS 194: Distributed Systems Process resilience, Reliable Group Communication Scott Shenker and Ion Stoica Computer Science Division Department of Electrical.
L-15 Fault Tolerance 1. Fault Tolerance Terminology & Background Byzantine Fault Tolerance Issues in client/server Reliable group communication 2.
Computer Science Lecture 18, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.
CS 582 / CMPE 481 Distributed Systems Fault Tolerance.
Systems of Distributed Systems Module 2 -Distributed algorithms Teaching unit 3 – Advanced algorithms Ernesto Damiani University of Bozen Lesson 6 – Two.
Group Communications Group communication: one source process sending a message to a group of processes: Destination is a group rather than a single process.
Computer Science Lecture 17, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.
Fault Tolerance Chapter 7.
Non-blocking Atomic Commitment Aaron Kaminsky Presenting Chapter 6 of Distributed Systems, 2nd edition, 1993, ed. Mullender.
Distributed Systems CS Fault Tolerance- Part II Lecture 14, Oct 19, 2011 Majd F. Sakr, Mohammad Hammoud andVinay Kolar 1.
Chapter 7 Fault Tolerance Basic Concepts Failure Models Process Design Issues Flat vs hierarchical group Group Membership Reliable Client.
Distributed Systems CS Fault Tolerance- Part III Lecture 15, Oct 26, 2011 Majd F. Sakr, Mohammad Hammoud andVinay Kolar 1.
Fault Tolerance A partial failure occurs when a component in a distributed system fails. Conjecture: build the system in a such a way that continues to.
Last Class: Weak Consistency
1 Fault Tolerance Chapter 7. 2 Fault Tolerance An important goal in distributed systems design is to construct the system in such a way that it can automatically.
1 More on Distributed Coordination. 2 Who’s in charge? Let’s have an Election. Many algorithms require a coordinator. What happens when the coordinator.
Fault Tolerance Dealing successfully with partial failure within a Distributed System. Key technique: Redundancy.
1 ICS 214B: Transaction Processing and Distributed Data Management Distributed Database Systems.
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved DISTRIBUTED SYSTEMS.
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved Chapter 8 Fault.
1 Distributed Systems Fault Tolerance Chapter 8. 2 Course/Slides Credits Note: all course presentations are based on those developed by Andrew S. Tanenbaum.
Fault Tolerance. Agenda Overview Introduction to Fault Tolerance Process Resilience Reliable Client-Server communication Reliable group communication.
Chapter 19 Recovery and Fault Tolerance Copyright © 2008.
Distributed Transactions Chapter 13
Distributed Systems CS Fault Tolerance- Part III Lecture 19, Nov 25, 2013 Mohammad Hammoud 1.
Distributed Systems Principles and Paradigms Chapter 07 Fault Tolerance 01 Introduction 02 Communication 03 Processes 04 Naming 05 Synchronization 06 Consistency.
1 8.3 Reliable Client-Server Communication So far: Concentrated on process resilience (by means of process groups). What about reliable communication channels?
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved DISTRIBUTED SYSTEMS.
Fault Tolerance CSCI 4780/6780. Distributed Commit Commit – Making an operation permanent Transactions in databases One phase commit does not work !!!
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved Chapter 8 Fault.
COMP 655: Distributed/Operating Systems Summer 2011 Dr. Chunbo Chu Week 7: Fault Tolerance 11/13/20151Distributed Systems - COMP 655.
ICS362 – Distributed Systems
More on Fault Tolerance Chapter 7. Topics Group Communication Virtual Synchrony Atomic Commit Checkpointing, Logging, Recovery.
Fault Tolerance Chapter 7.
Fault Tolerance. Basic Concepts Availability The system is ready to work immediately Reliability The system can run continuously Safety When the system.
Kyung Hee University 1/33 Fault Tolerance Chap 7.
Reliable Communication Smita Hiremath CSC Reliable Client-Server Communication Point-to-Point communication Established by TCP Masks omission failure,
V1.7Fault Tolerance1. V1.7Fault Tolerance2 A characteristic of Distributed Systems is that they are tolerant of partial failures within the distributed.
Fault Tolerance Chapter 7. Failures in Distributed Systems Partial failures – characteristic of distributed systems Goals: Construct systems which can.
- Manvitha Potluri. Client-Server Communication It can be performed in two ways 1. Client-server communication using TCP 2. Client-server communication.
Chapter 11 Fault Tolerance. Topics Introduction Process Resilience Reliable Group Communication Recovery.
Revisiting failure detectors Some of you asked questions about implementing consensus using S - how does it differ from reaching consensus using P. Here.
Fault Tolerance Chapter 7. Basic Concepts Dependability Includes Availability Reliability Safety Maintainability.
Distributed Systems CS Fault Tolerance- Part II Lecture 18, Nov 19, 2012 Majd F. Sakr and Mohammad Hammoud 1.
Introduction to Fault Tolerance By Sahithi Podila.
1 CHAPTER 5 Fault Tolerance Chapter 5-- Fault Tolerance.
Fault Tolerance Chapter 7. Goal An important goal in distributed systems design is to construct the system in such a way that it can automatically recover.
PROCESS RESILIENCE By Ravalika Pola. outline: Process Resilience  Design Issues  Failure Masking and Replication  Agreement in Faulty Systems  Failure.
Fault Tolerance CSCI 4780/6780. RPC Semantics in Presence of Failures 5 types of exceptions Client cannot locate server Request to server is lost Server.
Fault Tolerance (2). Topics r Reliable Group Communication.
1 Fault Tolerance Chapter 8. 2 Basic Concepts Dependability Includes Availability Reliability Safety Maintainability.
Tanenbaum & Van Steen, Distributed Systems: Principles and Paradigms, 2e, (c) 2007 Prentice-Hall, Inc. All rights reserved DISTRIBUTED SYSTEMS.
Chapter 8 Fault Tolerance. Outline Introductions –Concepts –Failure models –Redundancy Process resilience –Groups and failure masking –Distributed agreement.
More on Fault Tolerance
Fault Tolerance Prof. Orhan Gemikonakli
Fault Tolerance Chap 7.
8.2. Process resilience Shreyas Karandikar.
Chapter 8 Fault Tolerance Part I Introduction.
Reliable group communication
Dependability Dependability is the ability to avoid service failures that are more frequent or severe than desired. It is an important goal of distributed.
Distributed Systems CS
Outline Announcements Fault Tolerance.
Distributed Systems CS
Distributed Systems CS
Distributed Systems CS
Fault Tolerance and Reliability in DS.
Distributed Systems - Comp 655
Last Class: Fault Tolerance
Presentation transcript:

Real Time Multimedia Lab Fault Tolerance Chapter – 7 (Distributed Systems) Mr. Imran Rao Ms. NiuYu 22 nd November 2005

Real Time Multimedia Lab Today ’ s Agenda Overview Introduction to Fault Tolerance Process Resilience Reliable Client-Server communication Reliable group communication Distributed commit Recovery Summary

Real Time Multimedia Lab Objectives After taking this presentation, one can –Define Fault Tolerance –Distinguish among different kinds of Faults –Apply Failure Masking Techniques Process Resilience Communication Failure Masking –What is Distributed Commit and Related Issues? –What are the Recovery Techniques?

Real Time Multimedia Lab Overview Introduction to Fault Tolerance –Basic Concepts –Failure Modes –Failure Masking Process Resilience –Design Issues Reliable Communication –P2P Communication –Client Server Communication (RPC, RMI) –Group Communication (Multicasting) Distributed Commit –Multi Phase Commit (Two & Three Phase) Recovery Techniques –Check Pointing, Message Logging

Real Time Multimedia Lab Basic Concepts (1/3) What is Failure? –System is said to be in failure state when it cannot meet its promise. Why do Failure occurs? –Failures occurs because of the error state of the system. What is the reason for Error? –The cause of an error is called a fault Is there some thing ‘Partial Failure’? Faults can be Prevented, Removed and Forecasted. Can Faults be Tolerated by a system also?

Real Time Multimedia Lab Basic Concepts (2/3) What characteristics makes a system Fault Tolerant? –Availability: System is ready to used immediately. –Reliability: System can run continuously without failure. –Safety: Nothing catastrophic happens if a system temporarily fails. –Maintainability: How easy a failed system can be repaired. –Dependability: ??? What is the reliability and availability of following systems? –If a system goes down for one millisecond every hour –If a System never crashes but is shut down for two weeks every August.

Real Time Multimedia Lab Basic Concepts (3/3) Classification of Faults –Transient: Occurs once and than disappears –Intermittent: Occurs, vanishes on its own accord, than reappears and so on –Permanent: They occurs and doesn’t vanish until fixed manually. Can you classify the Faults caused by following situations? –A flying bird obstructing the transmitting waves signals –A loosely connected power plug –Burnt out chips –Software Bugs Which Fault you think is more difficult to detect and why?

Real Time Multimedia Lab Faults in Distributed Systems If in a Distributed Systems some fault occurs, the error may by in any of –The collection of servers or –Communication Channel or –Even both However, malfunctioning server itself may not always be the fault we are looking for. Why? Dependency relations appear in abundance in DS. Hence, we need to classify failures to know how serious a failure actually is.

Real Time Multimedia Lab Failure Models Type of Failure Description Crash FailureA server halts, but is working correctly until it halts Omission Failure Receive omission Send omission A server fails to respond to incoming requests A server fails to receive incoming messages A server fails to send messages Timing FailureA server's response lies outside the specified time interval Response Failure Value State Transition The server's response is incorrect The value of the response is wrong The server deviates from the correct flow of control Arbitrary FailureA server may produce arbitrary responses at arbitrary times

Real Time Multimedia Lab Failure Masking by Redundancy (1/3) A system to be fault tolerant, the best it can do is try to hide the occurrence of failure from other processes Key technique to masking faults is to use Redundancy. –Information redundancy: Extra bits are added to allow recovery from garbled bits –Time redundancy: An action is performed, and then, if need be, it is performed again. –Physical redundancy: Extra equipment or processes are added Issue: How much redundancy is needed?

Real Time Multimedia Lab Failure Masking by Redundancy (2/3) Some Examples of Redundancy Schemes –Hamming Code –Transactions –Replicated Processes or Components –Aircraft has four engines, can fly with only three –Sports game has extra referee. –Mammals. Can you point out How?

Real Time Multimedia Lab Failure Masking by Redundancy (3/3) Triple modular redundancy: –If two or three of the input are the same, the output is equal to that input. –If all three inputs are different, the output is undefined. Figure: Fault Tolerance in Electronic Circuits

Real Time Multimedia Lab What's Next  We have studied –What are Faults? –What are the characteristics of a fault tolerant system? –How to distinguish between different types of faults? –What are the failure models? –How Failures can be Masked? What’s Next? –How Fault Tolerance is achieved in DS –How Process are made Resilient against Faults?

Real Time Multimedia Lab Process Resilience Problem: –How fault tolerance in distributed system is achieved, especially against Process Failures? Solution: –Replicating processes into groups. –Groups are analogous to Social Organizations. –Consider collections of process as a single abstraction –All members of the group receive the same message, if one process fails, the others can take over for it. –Process groups are dynamic and a Process can be member of several groups. –Hence we need some management scheme for groups.

Real Time Multimedia Lab Process Groups (1/2) Flat Group Advantage: Symmetrical and has no single point failure Disadvantage: Decision making is more complicated.  Voting Hierarchical Group Advantage: Make decision without bothering others Disadvantage: Lost coordinator  Entire group halts Flat Group vs. Hierarchical Group

Real Time Multimedia Lab Process Groups (2/2) Group Server (Client Server Model) –Straight forward, simple and easy to implement –Major disadvantage  Single point of failure Distributed Approach (P2P Model) –Broadcast message to join and leave the group –In case of fault, how to identify between a really dead and a dead slow member –Joining and Leaving must be synchronized  on joining send all previous messages to the new member –Another issue is how to create a new group? Group Membership

Real Time Multimedia Lab Failure Masking & Replication Replicate Process and organize them into groups Replace a single vulnerable process with the whole fault tolerant Group A system is said to be K fault tolerant if it can survive faults in K components and still meet its specifications. How much replication is needed to support K Fault Tolerance? –K+1 or 2K+1 ? Case: 1)If K processes stop, then the answer from the other one can be used.  K+1 2)If meet Byzantine failure, the number is  2K+1  Problem?

Real Time Multimedia Lab Agreement in Faulty Systems Why we need Agreements? Goal of Agreement –Make all the non-faulty processes reach consensus on some issue –Establish that consensus within a finite number of steps. Problems of two cases –Good process, but unreliable communication Example: Two-army problem –Good communication, but crashed process Example: Byzantine generals problem

Real Time Multimedia Lab Two-army problem Red Troop Blue Troop Command by Napoleon Blue Troop Command by Alexander  It is easy to show that Alexander and Napoleon will never reach agreement, no matter how many acknowledgements they send. (Due to unreliable communication). Attack 1. Let us attack at 6 AM. 2. Ok, that is good. 3. I got your message. 4. I knew that you got my message.

Real Time Multimedia Lab Byzantine generals problem The Byzantine generals problem for 3 loyal generals and1 traitor. a)The generals announce their troop strengths (in units of 1 thousand soldiers). b)The vectors that each general assembles based on (a) c)The vectors that each general receives in step 3.

Real Time Multimedia Lab Go forward one more step The same as in previous slide, except now with 2 loyal generals and one traitor. Lamport proved that in a system with m faulty processes, agreement can be achieved only if 2m+1 correctly functioning processes are present, for a total of 3m+1. More than two-thirds  agreement

Real Time Multimedia Lab Reliable client-server communication TCP masks omission failures –… by using ACKs & retransmissions … but it does not mask crash failures ! –E.g.: When a connection is broken, the client is only notified via an exception What about reliable point-to-point transport protocols ?

Real Time Multimedia Lab Five classes of failures in RPC Client is unable to locate server –Binding exception … at the expense of transparency Request message is lost –Is it safe to retransmit ? Allow server to detect it is dealing with a retry Server crashes after receiving a request Reply message is lost Client crashes after sending a request

Real Time Multimedia Lab Server Crashes (I) A server in client-server communication a)Normal case b)Crash after execution c)Crash before execution

Real Time Multimedia Lab Server Crashes (II) At-least-once semantics –Client keeps retransmitting until it gets a response At-most-once semantics –Give up immediately & report failure Guarantee nothing Ideal would be exactly-once semantics … no general way to arrange this !

Real Time Multimedia Lab Server Crashes (III) Print server scenario: –M: server ’ s completion message Server may send M either before or after printing –P: server ’ s print operation –C: server ’ s crash Possible event orderings: –M  P  C –M  C (  P) –P  M  C –P  C (  M) –C (  P  M) –C (  M  P)

Real Time Multimedia Lab Server Crashes (IV) Different combinations of client & server strategies in the presence of server crashes. ClientServer Strategy M -> PStrategy P -> M Reissue strategyMPCMC(P)C(MP)PMCPC(M)C(PM) AlwaysDUPOK DUP OK NeverOKZERO OK ZERO Only when ACKedDUPOKZERODUPOKZERO Only when not ACKedOKZEROOK DUPOK No combination of client & server strategy is correct for all cases !

Real Time Multimedia Lab Lost Reply Messages Is it safe to retransmit the request ? –Idempotent requests Example: Read a file ’ s first 1024 bytes Counterexample: money transfer order Assign sequence number to request –Server keeps track of client ’ s most recently received sequence # –… additionally, set a RETRANSMISSION bit in the request header

Real Time Multimedia Lab Client Crashes (I) Orphan computation: –No process waiting for the result Waste of resources (CPU cycles, locks) Possible confusion upon client ’ s recovery 4 alternative strategies proposed by Nelson (1981) Extermination: –Client keeps log of requests to be issued Upon recovery, explicitly kill orphans –Overhead of logging (for every RPC) –Problems with grand-orphans –Problems with network partitions

Real Time Multimedia Lab Client Crashes (II) Reincarnation: –Divide time up into epochs (sequentially numbered) –Upon reboot, client broadcasts start-of-epoch Upon receipt, all remote computations on behalf of this client are killed After a network partition, an orphan ’ s response will contain an obsolete epoch number  easily detected Gentle reincarnation: –Upon receipt of start-of-epoch, each server checks to see if it has any remote computations If the owner cannot be found, the computation is killed Expiration: –Each RPC is given a time quantum T to complete … must explicitly ask for another if it cannot finish in time After reboot, client only needs to wait a time T … How to select a reasonable value for T ?

Real Time Multimedia Lab Agenda Introduction to Fault Tolerance Process Resilience Reliable Client-Server communication Reliable group communication Distributed commit Recovery Summary

Real Time Multimedia Lab Basic Reliable-Multicasting Schemes A simple solution to reliable multicasting when all receivers are known & are assumed not to fail a)Message transmission b)Reporting feedback

Real Time Multimedia Lab Scalability in Reliable Multicasting The scheme described above can not support large numbers of receivers. Reason:  Feedback Implosion  Receivers are spread across a wide-area network Solution: Reduce the number of feedback messages that are returned to the sender. Model: Feedback suppression

Real Time Multimedia Lab Several receivers have scheduled a request for retransmission, but the first retransmission request leads to the suppression of others. Nonhierarchical Feedback Control

Real Time Multimedia Lab Hierarchical Feedback Control The essence of hierarchical reliable multicasting: a)Each coordinator forwards the message to its children. b)A coordinator handles retransmission requests.

Real Time Multimedia Lab Atomic Multicast We need to achieve reliable multicasting in the presence of process failures. Atomic multicast problem: a message is delivered to either all processors or to none at all all messages are delivered in the same order to all processes Virtually synchronous reliable multicasting offering totally-ordered delivery of messages is called atomic multicasting

Real Time Multimedia Lab Virtual Synchrony (I) The logical organization of a distributed system to distinguish between message receipt and message delivery

Real Time Multimedia Lab Virtual Synchrony (II) Reliable multicast guarantees that a message multicast to group view G is delivered to each nonfaulty process in G. If the sender of the message crashes during the multicast, the message may either be delivered to all remaining processes, or ignored by each of them. A reliable multicast with this property is said to be virtually synchronous All multicasts take place between view changes. A view change acts as a barrier across which no multicast can pass

Real Time Multimedia Lab Virtual Synchrony (III) The principle of virtual synchronous multicast.

Real Time Multimedia Lab Implementing Virtual Synchrony a)Process 4 notices that process 7 has crashed, sends a view change b)Process 6 sends out all its unstable messages, followed by a flush message c)Process 6 installs the new view when it has received a flush message from everyone else

Real Time Multimedia Lab Message Ordering ( I ) 1. Unordered multicast 2. FIFO-ordered multicast Process P1Process P2Process P3 sends m1receives m1receives m2 sends m2receives m2receives m1 Process P1Process P2Process P3Process P4 sends m1receives m1receives m3sends m3 sends m2receives m3receives m1sends m4 receives m2 receives m4

Real Time Multimedia Lab 3.Reliable causally-ordered multicast delivers messages so that potential causality between different messages is preserved 4.Total-ordered delivery MulticastBasic Message OrderingTotal-ordered Delivery? Reliable multicastNoneNo FIFO multicastFIFO-ordered deliveryNo Causal multicastCausal-ordered deliveryNo Atomic multicastNoneYes FIFO atomic multicastFIFO-ordered deliveryYes Causal atomic multicastCausal-ordered deliveryYes Message Ordering (II)

Real Time Multimedia Lab Agenda Introduction to Fault Tolerance Process Resilience Reliable Client-Server communication Reliable group communication Distributed commit Recovery Summary

Real Time Multimedia Lab Two-phase Commit (I) Process crashes  other processes may be indefinite waiting for a message  This protocol can easily fail  timeout mechanisms are used a)The finite state machine for the coordinator in 2PC. b)The finite state machine for a participant.

Real Time Multimedia Lab Failure handling in 2PC Participant times out waiting for coordinator ’ s Request-to- prepare –It decide to abort. Coordinator times out waiting for a participant ’ s vote –It decides to abort. A participant that voted Prepared times out waiting for the coordinator ’ s decision –It ’ s blocked. –Use a termination protocol to decide what to do. –Native termination protocol – wait until coordinator recovers. The coordinator times out waiting for ACK message –It must resolicit them, so it can forget the decision

Real Time Multimedia Lab Actions by coordinator while START _2PC to local log; multicast VOTE_REQUEST to all participants; while not all votes have been collected { wait for any incoming vote; if timeout { while GLOBAL_ABORT to local log; multicast GLOBAL_ABORT to all participants; exit; } record vote; } if all participants sent VOTE_COMMIT and coordinator votes COMMIT{ write GLOBAL_COMMIT to local log; multicast GLOBAL_COMMIT to all participants; } else { write GLOBAL_ABORT to local log; multicast GLOBAL_ABORT to all participants; }

Real Time Multimedia Lab write INIT to local log; wait for VOTE_REQUEST from coordinator; if timeout { write VOTE_ABORT to local log; exit; } if participant votes COMMIT { write VOTE_COMMIT to local log; send VOTE_COMMIT to coordinator; wait for DECISION from coordinator; if timeout { multicast DECISION_REQUEST to other participants; wait until DECISION is received; /* remain blocked */ write DECISION to local log; } if DECISION == GLOBAL_COMMIT write GLOBAL_COMMIT to local log; else if DECISION == GLOBAL_ABORT write GLOBAL_ABORT to local log; } else { write VOTE_ABORT to local log; send VOTE ABORT to coordinator; } Actions by participant

Real Time Multimedia Lab 3 Phase Commit Protocol (I) Problem of 2PC –In some cases, participants cannot reach a final decision –What case? Solution 1.No single state from which to make a transition to either COMMIT or ABORT. 2.No state in which not possible to make a final decision and from which a transition to COMMIT can be made

Real Time Multimedia Lab Three-phase Commit Protocol (II) Introduction of another Phase a)Finite state machine for the coordinator in 3PC b)Finite state machine for a participant

Real Time Multimedia Lab Failure handling in 3PC (I) Coordinator Timeout/Recovery –Coordinator WAIT: timeout –> ABORT PRECOMMIT: timeout –> COMMIT –Recovering participant?

Real Time Multimedia Lab Failure handling in 3PC (II) Participant Timeout/Recovery –Participant INIT: timeout -> ABORT READY/PRECOMMIT: timeout (coordinator crashed) -> Ask other participants –Any ABORT/COMMIT: follow it –Any INIT: ABORT (no PRECOMMIT possible?) –All(majority) READY: -> ABORT »Recovering participant = INIT: -> ABORT = READY: contact other participants = ABORT = PRECOMMIT: -> ABORT (cf. 2PL) –All(majority) PRECOMMIT: COMMIT »Recovering participant = READY = recover to: READY/PRECOMMIT/COMMIT

Real Time Multimedia Lab Agenda Introduction to Fault Tolerance Process Resilience Reliable Client-Server communication Reliable group communication Distributed commit Recovery Summary

Real Time Multimedia Lab Recovery-Introduction  Goal: replace an erroneous state with an error-free state  Backward recovery: to restore such a recorded state when things go wrong  Combining checkpoints and message logging  Forward recovery: an attempt is made to bring the system in a correct new state from which it can continue to execute

Real Time Multimedia Lab Stable Storage Stable storage is well suited to applications that require a high degree of fault tolerance a)Stable Storage b)Crash after drive 1 is updated c)Bad spot

Real Time Multimedia Lab Checkpointing A recovery line: the most recent distributed snapshot

Real Time Multimedia Lab Independent Checkpointing The domino effect.

Real Time Multimedia Lab Message Logging Incorrect replay of messages after recovery, leading to an orphan process.

Real Time Multimedia Lab Summarization (I)  Fault tolerance is defined as the characteristic by which a system can mask the occurrence and recovery from failures  There exist several types of failures : Crash failure, Omission failure, Timing failure, Response failure, Arbitrary/Byzantine failure  Redundancy is the key technique needed to achieve fault tolerance  Reliable group communication is suitable for small groups

Real Time Multimedia Lab Summarization (II)  Atomic multicasting can be precisely formulated in terms of a virtual synchronous execution model  Group membership change  agreement on the same list of members  using commit protocol  Recovery in fault-tolerant systems is invariably achieved by checkpointing with message logging

Real Time Multimedia Lab QUESTIONS & ANSWERS