

Fault Tolerance Chapter 7

Topics
– Basic Concepts
– Failure Models
– Redundancy
– Agreement and Consensus
– Client Server Communication
– Group Communication and Virtual Synchrony
– Atomic Commit, Recovery, Checkpointing

Basic Concepts
An important goal in distributed systems (DS) is to make the system resilient to failures of some of its components. Fault tolerance (FT) is frequently one of the reasons for making a system distributed in the first place.
Dependability includes:
– Availability
– Reliability
– Safety
– Maintainability

Goals
– Availability: Can I use it now? The probability that the system is up at any given time.
– Reliability: Will it stay up as long as I need it? The ability to run continuously without failure. If a system crashes briefly every hour, it may still have good availability (it is up most of the time) but poor reliability, because it cannot run for very long before crashing.
– Safety: If it fails, does nothing bad happen?
– Maintainability: How easy is it to fix if it breaks?
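A quick calculation makes the availability/reliability distinction concrete. The sketch below assumes, purely for illustration, that each hourly crash costs 1 ms of downtime; that figure and the names used are not from the slides.

```python
# Assumed numbers for illustration: one crash per hour, 1 ms of downtime per crash.
SECONDS_PER_HOUR = 3600.0
downtime_per_crash = 0.001  # seconds (assumption, not from the slides)


def availability(uptime: float, downtime: float) -> float:
    """Fraction of time the system is up."""
    return uptime / (uptime + downtime)


uptime_per_hour = SECONDS_PER_HOUR - downtime_per_crash
avail = availability(uptime_per_hour, downtime_per_crash)
longest_run = uptime_per_hour  # seconds of continuous operation between crashes

print(f"availability = {avail:.7f}")
print(f"longest failure-free run = {longest_run:.3f} s")
```

Under these assumptions the system is up more than 99.9999% of the time, yet it never runs for a full hour without failing, so a job that needs two uninterrupted hours can never finish.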

Definitions
FAULT: the cause of an error.
FAULT TOLERANCE: the system can continue to function even in the presence of faults.
Classification of faults:
– Transient faults: occur once, then disappear.
– Intermittent faults: occur, go away, then come back, go away again, and so on.
– Permanent faults: do not go away by themselves, like disk failures.

Failure Models
Different types of failures.
Type of failure | Description
Crash failure or fail-stop | A server halts, but is working correctly until it halts
Omission failure | A server fails to respond to incoming requests
   Receive omission | A server fails to receive incoming messages
   Send omission | A server fails to send messages
Timing failure | A server's response lies outside the specified time interval
Response failure | The server's response is incorrect
   Value failure | The value of the response is wrong
   State transition failure | The server deviates from the correct flow of control
Arbitrary or Byzantine failure | A server may produce arbitrary responses at arbitrary times

Network Failures
– Link failure (one-way or two-way): 5 can talk to 6, but 6 cannot talk to 5.
– Network partition: the network 1,2,3,4,5,6 is partitioned into 1,2,3,4 and 5,6.

Are The Models Realistic?
No, of course not!
Synchronous vs. asynchronous
– The asynchronous model is too weak (real systems have clocks, and “most” timing meets expectations, but with heavy tails)
– The synchronous model is too strong (real systems lack a way to implement synchronized rounds)
Failure types
– The crash (fail-stop) model is too weak (systems usually display some odd behavior before dying)
– The Byzantine model is too strong (it assumes an adversary of arbitrary speed who designs the “ultimate attack”)

Models: Justification
If we can do something in the asynchronous model, we can probably do it even better in a real network.
– Clocks and a-priori knowledge can only help.
If we can’t do something in the synchronous model, we can’t do it in a real network.
– After all, synchronized rounds are a powerful, if unrealistic, capability to introduce.
If we can survive Byzantine failures, we can probably survive a real distributed system.

Fault Tolerance Strategies
Redundancy
– Hardware, software, informational, temporal
Hierarchy
– Confinement of errors

Failure Masking by Redundancy
Triple modular redundancy (TMR): each component is triplicated, and voter circuits choose the majority of their inputs to determine the correct output, which masks a single faulty input.
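The voter's job can be sketched in software. This is a minimal illustration of majority voting over three replicas, not a model of the circuit in the figure; the names `vote`, `tmr` and `faulty_replica` are hypothetical.

```python
from collections import Counter


def vote(inputs):
    """Return the value held by a strict majority of inputs, or None if there is none."""
    value, count = Counter(inputs).most_common(1)[0]
    return value if count > len(inputs) // 2 else None


def tmr(component, x, faulty_replica=None):
    """Run three replicas of `component`; optionally corrupt one replica's output."""
    outputs = [component(x) for _ in range(3)]
    if faulty_replica is not None:
        outputs[faulty_replica] = "garbage"  # simulated fault in one replica
    return vote(outputs)


def double(x):
    return 2 * x


healthy = tmr(double, 21)
one_fault = tmr(double, 21, faulty_replica=1)
print(healthy, one_fault)
```

With one corrupted replica the two good outputs still outvote it. If all three outputs differ, `vote` returns `None` rather than guessing.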

Flat Groups versus Hierarchical Groups
a) Communication in a flat group.
b) Communication in a simple hierarchical group.

Identical Processes, Fail-stop
A system is K-fault tolerant if it can withstand faults in K components and still produce correct results.
Example: FT through replication, where each replica reports a result.
If the nodes in a DS are fail-stop and there are K+1 identical processes, then the system can tolerate K failures: the result comes from the remaining one.

Identical Processes, Byzantine Failures
If the K failures are Byzantine (with K-collusion), then 2K+1 processes are needed for K-fault tolerance.
Example: K processes can be faulty and "lie" about their result. (If they simply fail to report a result, that is not a problem.) If there are 2K+1 processes, at least K+1 will be correct and report the same correct answer. So by taking the result reported by at least K+1 processes (a majority), we get the correct answer.
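The two replica counts and the "at least K+1 matching reports" rule can be written down directly. This is a small sketch; the function names are hypothetical, and it assumes each replica reports at most one value.

```python
from collections import Counter


def replicas_needed(k, byzantine):
    """Replicas required to tolerate k faults: K+1 for fail-stop, 2K+1 for Byzantine."""
    return 2 * k + 1 if byzantine else k + 1


def decide(reports, k):
    """Accept a value only if at least k+1 replicas reported it.

    With at most 2k+1 reports, no two different values can both reach k+1,
    and k colluding liars can never reach the threshold on their own.
    """
    for value, count in Counter(reports).items():
        if count >= k + 1:
            return value
    return None


# K = 2: five replicas, two of which collude on the wrong answer 9.
lying = decide([7, 9, 7, 9, 7], k=2)
# Same setting, but one faulty replica stays silent instead of lying.
silent = decide([7, 7, 7, 9], k=2)
print(lying, silent)
```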

Agreement
Distributed agreement or "distributed consensus" is the fundamental problem in DS.
– Distributed mutual exclusion and election are basically getting processes to agree on something.
– Agreeing on time or on the update of replicated data are special cases of the distributed consensus problem.
Agreement sometimes means that one process proposes a value and the others agree on it, while consensus means that all processes propose values and all agree on some function of those values.

Consensus (Agreement)
There are M processes, P1, P2, …, PM, in a DS that are trying to reach agreement. A subset F of the processes is faulty. Each process Pi stores a value Vi. During agreement, each process calculates a value Ai.
At the end of the algorithm:
– All non-faulty processes reach a decision.
– For every pair of non-faulty processes Pi and Pj, Ai = Aj. This is the agreement value.
– The agreement value is a function of the initial values {Vi} of the non-faulty processes. The function is often max (as in the case of election), average, or one of the Vi. If all non-faulty processes have the same Vi, then that must be the agreement value.

Consensus: Easy Case: No Failures
No failures, synchronous, M processes.
If there can be no failures, reaching consensus is easy. Every process sends its value to every other process. All processes now have identical information, so all of them do the same calculation and come up with the same value. Each process needs to maintain an array of M values.
P1 has {1,2,3,4}
P2 has {1,2,3,4}
P3 has {1,2,3,4}
P4 has {1,2,3,4}

Consensus: Fail-stop
Fairly easy case: fail-stop, synchronous.
If faulty processes are fail-stop, reaching consensus is reasonably easy: all non-faulty processes send their values to all others. However, K of them may fail at some time during the process.
P1 has {1,2,3,4}
P2 has {1,2,3,4}
P3 has {x,2,3,4}
P4 has {x,2,3,4}

Consensus: Fail-stop
Solution: after all processes have sent their values to all others, every process broadcasts all the values it received (and who they came from). This continues for f+1 rounds, where f = |F|. Processes maintain a tree of values.
After the second round, P4 has:
– 1st round: {x,2,3,4}
– from P2: {1,2,3,4}
– from P3: {x,2,3,4}

Consensus: Fail-stop
If M=4 and f=1, then we need f+1=2 rounds to get consensus (previous example). Do we really need f+1 rounds?
Consider M=4, f=2:
– P1 crashes during the 1st round after sending only to P2.
– P2 crashes during the 2nd round after sending only to P3 (the next slide traces this case).
After the 1st round: P2: {1,2,3,4}, P3: {x,2,3,4}, P4: {x,2,3,4}

Consensus: Fail-stop
What do P3 and P4 see?
– Round 1: P2 has {1,2,3,4}; P3 and P4 each have {X,2,3,4}.
– Round 2: P2 sends {1,2,3,4} to P3 and dies; P3 now has {1,2,3,4}, while P4 still has {X,2,3,4}.
– Round 3: P3 relays what it received; P3 and P4 both have {1,2,3,4}.
If processes are fail-stop, we can tolerate any number of faulty processes; however, we need f+1 rounds.
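The M=4, f=2 scenario can be replayed with a small simulation of round-based flooding. It is a sketch under simplifying assumptions: rounds are perfectly synchronous, a crashing process reaches only the listed recipients in its final round, and each process keeps a flat map of known values instead of the tree of values mentioned earlier. The function name and the `crashes` format are hypothetical.

```python
def flood_consensus(values, crashes, rounds):
    """Simulate synchronous flooding with fail-stop crashes.

    values:  {pid: initial value}
    crashes: {pid: (round, recipients reached before dying)}
    Returns {pid: {origin: value}} for processes alive after `rounds` rounds.
    """
    known = {p: {p: v} for p, v in values.items()}
    alive = set(values)
    for r in range(1, rounds + 1):
        # Each sender transmits what it knew at the start of the round.
        snapshot = {p: dict(known[p]) for p in alive}
        for sender in sorted(alive):
            crash = crashes.get(sender)
            crashing = crash is not None and crash[0] == r
            targets = crash[1] if crashing else alive - {sender}
            for target in targets:
                if target in alive:
                    known[target].update(snapshot[sender])
        # Processes scheduled to crash in this round are gone afterwards.
        alive -= {p for p, (crash_round, _) in crashes.items() if crash_round == r}
    return {p: known[p] for p in sorted(alive)}


values = {1: 1, 2: 2, 3: 3, 4: 4}
crashes = {1: (1, {2}), 2: (2, {3})}  # P1 reaches only P2; P2 reaches only P3

after_two = flood_consensus(values, crashes, rounds=2)
after_three = flood_consensus(values, crashes, rounds=3)
print(after_two)
print(after_three)
```

After two rounds P3 knows P1's value but P4 does not, so they could decide differently. The third round (f+1 = 3) lets P3 relay it, and both survivors hold the same set.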

Difficult Case: Agreement with Byzantine Failures
We will look at agreement (single proposer) rather than consensus (all propose values).
– A faulty process may respond like a non-faulty process, so the non-faulty processes do not know who is faulty.
– A faulty process can send a fake value to throw off the calculation, and can send one value to some processes and a different value to others.
– The faulty process is an adversary and can see the global state: it has more information than the non-faulty nodes. However, it can only control the behavior of the faulty processes.

Variations on Byzantine Agreement
– A process always knows who sent a received message.
– Default value: some algorithms assume a default value (retreat) when there is no agreement.
– Oral messages: message content is controlled by the latest sender (relayer), so the receiver doesn’t know whether or not it was tampered with.
– Signed messages: messages can be authenticated with digital signatures. Assume faulty processes can send arbitrary messages, but they cannot forge signatures.

BA with Oral Messages (1)
The commanding general coordinates the other generals.
– If all loyal generals attack, victory is certain.
– If none attack, the Empire survives.
– If only some attack, the Empire is lost.
A gong keeps time.

BA with Oral Messages (2)
How it works:
– Disloyal generals have corrupt soldiers.
– Orders are distributed by exchange of messages; corrupt soldiers violate the protocol at will.
– But corrupt soldiers can’t intercept and modify messages between loyal generals.
– The gong sounds slowly: there is ample time for the exchange of messages.
– The commanding general sends his order. Then all other generals relay to everyone what they received.

BA with Oral Messages (3)
Limitations:
– Let t be the maximum number of faulty processes (disloyal generals).
– Byzantine agreement is not possible with fewer than 3t+1 processes.
– The same result holds for fault-tolerant consensus in the Byzantine model.

Byzantine Consensus Oral Messages (1)
The Byzantine generals problem for 3 loyal generals and 1 traitor.
a) The generals announce their troop strengths (in units of 1 kilosoldier) to all other generals.
b) The vectors that each general assembles based on (a).
c) The additional vectors that each general receives in the next round (all send what they received to all). Decide by majority.

Byzantine Consensus Oral Messages (2)
The same as in the previous slide, except now with 2 loyal generals and 1 traitor. Majority decision does not guarantee consensus.
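The two cases on these slides (3 loyal generals plus 1 traitor, then 2 loyal plus 1 traitor) can be replayed with a short simulation of the vector exchange. This is a sketch with one fixed, hypothetical traitor strategy (a different lie to every recipient and junk relayed vectors), so it illustrates the outcome rather than proving the 3t+1 bound.

```python
from collections import Counter


def majority(values):
    """Value held by a strict majority, or None (unknown) if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


def exchange_vectors(strengths, traitors):
    """strengths: {general: troop strength}; traitors: set of lying generals.

    Returns, for each loyal general, the vector it decides on.
    """
    generals = sorted(strengths)
    # Steps (a) and (b): general g assembles what every general told it.
    # A traitor tells each recipient something different.
    vector = {
        g: {s: (f"lie-{s}-to-{g}" if s in traitors else strengths[s]) for s in generals}
        for g in generals
    }
    result = {}
    for g in generals:
        if g in traitors:
            continue
        # Step (c): g receives the assembled vector of every other general;
        # a traitor relays junk instead of its real vector.
        received = [
            {j: f"junk-{s}-{g}-{j}" for j in generals} if s in traitors else vector[s]
            for s in generals
            if s != g
        ]
        result[g] = {j: majority([v[j] for v in received]) for j in generals}
    return result


four = exchange_vectors({1: 1, 2: 2, 3: 3, 4: 4}, traitors={3})
three = exchange_vectors({1: 1, 2: 2, 3: 3}, traitors={3})
print(four)
print(three)
```

With four generals, every loyal general recovers the loyal strengths and marks the traitor's entry as unknown. With three generals, each loyal general holds one honest and one junk vector, so no entry reaches a majority.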

BA with Signed Messages (1)
A faulty process can send arbitrary messages, but cannot forge signatures. All messages are digitally signed for authentication.
Assume at most f faulty nodes.
At the start, the coordinator sends a signed message to each node.
In round i, each process:
– endorses (authenticates) and forwards all messages received in round i-1

BA with Signed Messages (2)
At round f+1, either:
– one value is endorsed by at least f+1 nodes: decide on it (majority)
– else, the coordinator is faulty
If the coordinator is faulty:
– either abort,
– or retry after a leader election to choose a new coordinator
f+1 rounds are proven to be necessary and sufficient. There must be at least f+2 processes.
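The endorse-and-forward step and the "endorsed by at least f+1 nodes" check can be sketched as follows. One loud assumption: Python's standard `hmac` module stands in for digital signatures, with a shared `KEYS` table that any verifier can read. That is not unforgeable the way real public-key signatures are; it only makes the bookkeeping visible. `KEYS`, `sign`, `endorse` and `valid_endorsers` are hypothetical names.

```python
import hashlib
import hmac

# Hypothetical key table; a real system would use per-process private keys.
KEYS = {pid: f"secret-{pid}".encode() for pid in range(4)}


def sign(pid, payload):
    """Stand-in signature of `payload` by process `pid`."""
    return hmac.new(KEYS[pid], payload, hashlib.sha256).digest()


def endorse(pid, value, chain):
    """Append pid's endorsement of (value, endorsements so far) to the chain."""
    payload = repr((value, chain)).encode()
    return chain + [(pid, sign(pid, payload))]


def valid_endorsers(value, chain):
    """Return the set of distinct endorsers if every link verifies, else None."""
    signers = set()
    for i, (pid, signature) in enumerate(chain):
        payload = repr((value, chain[:i])).encode()
        if not hmac.compare_digest(signature, sign(pid, payload)):
            return None
        signers.add(pid)
    return signers


f = 1
chain = endorse(0, "attack", [])     # coordinator signs its order
chain = endorse(1, "attack", chain)  # a node endorses and forwards it

endorsers = valid_endorsers("attack", chain)
accepted = endorsers is not None and len(endorsers) >= f + 1
tampered = valid_endorsers("retreat", chain)
print(accepted, tampered)
```

Changing the value after it has been endorsed breaks every link in the chain, which is the property the signed-message variant relies on.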

Consensus in Asynchronous Systems
All of the preceding agreement and consensus algorithms assume a synchronous system: they work by sending messages in rounds or phases.
What about consensus in an asynchronous system? Provably impossible for any deterministic algorithm, even if only one process can crash, and therefore also impossible in the Byzantine model [FLP1985].

Client-Server Communications
Possible problems:
1. The client is unable to locate the server.
2. The request message from client to server gets lost.
3. The server crashes after receiving the request.
4. The reply message from the server to the client is lost.
5. The client crashes after sending the request.

Client-Server Communications
Possible solutions:
1. Client cannot locate server: the client reports an exception to the user.
2. Request message lost: use timeouts and message numbers.
3. Server crashes: the client cannot distinguish cases 2, 3, and 4. What to do? Application dependent.
4. Reply lost: as in case 3, time out and try again (resend the original request and hope that it is recognized as a duplicate and that the reply is sent again).
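The "timeouts and message numbers" idea for cases 2 and 4 can be sketched with a server that caches replies by message number and a client that resends the same number until a reply arrives. It is a minimal in-process illustration: a lost reply is modeled by the hypothetical `drop_reply_on` set of attempt numbers, and no real network or timer is involved.

```python
class Server:
    """Executes each numbered request once and caches the reply."""

    def __init__(self):
        self.replies = {}    # (client_id, seq) -> cached reply
        self.executions = 0  # how many times work was actually performed

    def handle(self, client_id, seq, request):
        key = (client_id, seq)
        if key not in self.replies:  # first time this message number is seen
            self.executions += 1
            self.replies[key] = f"done:{request}"
        return self.replies[key]     # duplicates get the cached reply


def call(server, client_id, seq, request, drop_reply_on=(), max_tries=5):
    """Resend the same (client_id, seq) until a reply arrives."""
    for attempt in range(1, max_tries + 1):
        reply = server.handle(client_id, seq, request)
        if attempt in drop_reply_on:
            continue  # reply lost: the client's timer expires and it retries
        return reply, attempt
    raise TimeoutError("no reply after retries")


server = Server()
reply, attempts = call(server, "c1", 1, "debit 10", drop_reply_on={1, 2})
print(reply, attempts, server.executions)
```

The request reaches the server three times but is executed only once, because the retries carry the same message number and are answered from the cache.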

Client-Server Communications
5. Client crashes before the reply is received: resources are locked up and orphan processes may exist. Upon recovery, release resources and kill processes?
– Solution 1, "log and exterminate": keep a log of activity and write it to stable storage before sending each request. Drawback: the expense of writing to disk.
– Solution 2, "reincarnation": release everything, kill local processes, and broadcast a message to kill orphans associated with this process.
– Solution 3, "gentle reincarnation": a remote process is killed only if its owner cannot be found.
– Solution 4, "expiration": remote processes get a timeout value; if it is not renewed, they can be killed.
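Solution 4 ("expiration") amounts to a lease table on the server side. The sketch below passes the current time in explicitly so the behavior is easy to follow; the class and method names are hypothetical, and "killing" a process is modeled as removing it from the table.

```python
class OrphanTable:
    """Server-side table of remote processes, each with an expiry time."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.expires_at = {}

    def start(self, proc, now):
        """Register a remote process with a fresh timeout."""
        self.expires_at[proc] = now + self.ttl

    def renew(self, proc, now):
        """Extend the timeout; only a live owner renews, expired entries stay dead."""
        if proc in self.expires_at and now < self.expires_at[proc]:
            self.expires_at[proc] = now + self.ttl
            return True
        return False

    def reap(self, now):
        """Kill (here: remove and return) every process whose timeout has passed."""
        dead = sorted(p for p, t in self.expires_at.items() if t <= now)
        for p in dead:
            del self.expires_at[p]
        return dead


table = OrphanTable(ttl=10)
table.start("a", now=0)
table.start("b", now=0)
renewed = table.renew("a", now=8)  # a's owner is alive; b's owner has crashed
first_reap = table.reap(now=12)    # b expired at time 10
second_reap = table.reap(now=20)   # a expired at time 18, never renewed again
print(renewed, first_reap, second_reap)
```

The cost of this approach is that a live client must keep renewing, and an orphan can keep holding resources for up to one timeout period after its owner crashes.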