Chapter 12 Consensus (Fault Tolerance)

Presentation transcript:

1 Chapter 12 Consensus (Fault Tolerance)

2 Reliable Systems
Distributed processing creates faster systems by exploiting parallelism, but it can also improve reliability by replicating a computation on several processors.
A reliable system can be:
Fail-safe, if one or more failures do not cause damage to the system or to its users, and/or
Fault-tolerant, if it continues to fulfill its requirements even if there are one or more failures.

3 Typical Architectures for a Reliable System

4 The Problem Statement
A group of Byzantine armies is surrounding an enemy city. The balance of force is such that if all armies attack together, they can capture the city; otherwise, they must all retreat to avoid defeat. The generals of the armies have reliable messengers who successfully deliver any message sent from one general to another. However, some of the generals may be traitors endeavoring to bring about the defeat of the Byzantine armies.
Devise an algorithm so that all loyal generals come to a consensus on a plan.
The final decision should be almost the same as a majority vote of their initial choices; if the vote is tied, the final decision is to retreat.

5 The Problem Statement (Cont.)
In distributed systems, the generals are the nodes and the messengers model the communication channels.
Generals may fail (by being traitors), but the messengers are assumed to be reliable.
Models for node failures:
Crash failures: a traitor (failed node) simply stops sending messages at an arbitrary point during the execution of the algorithm.
Byzantine failures: a traitor can send arbitrary messages, not just the messages required by the algorithm.

6 Consensus – One-round Algorithm
The values for planType are A for attack and R for retreat.
Each general chooses a plan, sends its plan to the other generals and receives their plans.
The final plan is the majority vote among all plans, both the general's own plan and the plans received from the others.
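The one-round rule can be sketched in a few lines of Python. This is only an illustration: the names `majority` and `one_round_decision` are hypothetical, a missing message is modelled as `None`, and ties are resolved in favour of retreat, which is the tie convention used elsewhere in these slides.

```python
ATTACK, RETREAT = "A", "R"

def majority(plans):
    """Majority vote over the plans that actually arrived.

    Assumption: a missing message is None and is simply not counted;
    a tie resolves to retreat.
    """
    received = [p for p in plans if p is not None]
    return ATTACK if received.count(ATTACK) > received.count(RETREAT) else RETREAT

def one_round_decision(own_plan, received_plans):
    """Final plan of one general: vote over its own plan and the plans received."""
    return majority([own_plan, *received_plans])

# A general that chose attack and received one attack and one retreat.
decision = one_round_decision(ATTACK, [ATTACK, RETREAT])
print(decision)
```

With all messages delivered and every general loyal, each general votes over the same set of plans, so all of them reach the same decision.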

7 Messages Sent in a One-round Algorithm
Generals Zoe and Leo are loyal; Basil is a traitor.
Basil and Zoe choose to attack; Leo chooses to retreat.
Messages are exchanged, but Basil crashes after sending an attack message to Leo. Zoe receives no message from Basil.
Zoe sees only her own attack plan and Leo's retreat plan. The vote is tied, and ties are resolved in favour of retreat (the cautious choice), so Zoe decides to retreat.
Leo sees two attack plans and his own retreat plan, so he decides to attack by majority vote.
If a general crashes, it can cause the remaining loyal generals to fail to come to a consensus.
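The scenario on this slide can be replayed as a small script. It is a sketch under stated assumptions: the set `delivered` lists the messages that arrive (Basil's message to Zoe is the one lost to the crash), and `majority` and `decide` are illustrative names, not part of the original material.

```python
def majority(plans):
    """Majority vote; a tie resolves to retreat."""
    return "A" if plans.count("A") > plans.count("R") else "R"

plans = {"Basil": "A", "Zoe": "A", "Leo": "R"}

# (sender, receiver) pairs that are actually delivered.
# Basil crashes after sending to Leo, so ("Basil", "Zoe") is absent.
delivered = {
    ("Basil", "Leo"),
    ("Zoe", "Basil"), ("Zoe", "Leo"),
    ("Leo", "Basil"), ("Leo", "Zoe"),
}

def decide(general):
    """Vote over the general's own plan plus every plan it received."""
    votes = [plans[general]]
    votes += [plans[sender] for sender, receiver in sorted(delivered)
              if receiver == general]
    return majority(votes)

decisions = {general: decide(general) for general in ("Zoe", "Leo")}
print(decisions)  # Zoe retreats on a 1-1 tie, Leo attacks on a 2-1 vote
```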

8 The Byzantine Generals Algorithm
The one-round algorithm does not take into account the fact that certain generals are loyal. Leo should somehow be able to attribute more weight to the plan received from loyal Zoe than to the plan received from the traitor Basil.
In a distributed system an individual node cannot know the identities of the traitors directly; rather, it must ensure that the plans of the traitors cannot cause the loyal generals to fail to reach consensus.

9 Algorithm in Brief
The Byzantine Generals algorithm sends messages twice:
In the first round, each general sends its own plan.
In the second round, each general sends what it received from the other generals.
Loyal generals relay exactly what they received, so that if there are enough loyal generals, they can reach a consensus.

10 Byzantine Generals Algorithm
In the first round, each general sends its plan and receives the plans of the others. At the end of the round, each general holds the plan of every general.
In the second round, each general relays these plans to the other generals (a plan is not relayed back to the general who sent it) and receives the plans relayed by the others.
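One plausible reading of the two-round voting is sketched below. It assumes a two-stage vote: for every other general, a vote over the plan received directly plus the relayed copies, and then a final vote over the general's own plan and those per-general results. The function names, the dictionary layout, and the choice to skip a general about whom nothing at all was received are assumptions of this sketch.

```python
def majority(plans):
    """Majority vote; a tie resolves to retreat."""
    return "A" if plans.count("A") > plans.count("R") else "R"

def final_plan(own_plan, direct, reported):
    """Decide after two rounds.

    direct[g]               plan received from g in round one (None if nothing arrived)
    reported[(reporter, g)] what `reporter` claimed g's plan was in round two
    """
    votes = [own_plan]
    for general, plan in direct.items():
        relayed = [p for (_, subject), p in reported.items() if subject == general]
        evidence = [p for p in [plan, *relayed] if p is not None]
        if evidence:  # assumption: a general we know nothing about casts no vote
            votes.append(majority(evidence))
    return majority(votes)

# Leo's view when nobody fails: Basil and Zoe chose A, Leo chose R.
leo = final_plan(
    "R",
    direct={"Basil": "A", "Zoe": "A"},
    reported={("Zoe", "Basil"): "A", ("Basil", "Zoe"): "A"},
)
print(leo)
```

Voting per general first means a single conflicting report about a general can be outvoted by the other copies of that general's plan, provided enough of them exist.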

11 Two Loyal, One Traitor – Crash Failure
Same scenario as in the one-round algorithm: Basil (the traitor) crashes after sending his first-round message to Leo, but before sending it to Zoe.
The second column holds the first-round plans. Zoe gets Leo's plan and nothing from the crashed Basil; Leo has all the plans.
The third and fourth columns hold the second-round plans. Zoe: no report from Basil (crashed) and Basil's A relayed by Leo; Leo does not send his own plan R again. Leo: no report from Basil, and nothing from Zoe, whose own plan was already sent in the first round.
Majority voting: Basil has crashed; Zoe decides to attack; Leo decides to attack.
The two loyal generals reach a consensus.

12 Another Scenario
Basil, the traitor, sends all his first-round messages and his second-round report to Leo before crashing.
Second round: Leo receives Zoe's attack plan from Basil and Basil's attack plan from Zoe; Zoe receives no report from Basil and Basil's attack plan from Leo.
Majority voting: both decide to attack.
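Both crash scenarios can be reproduced with a small simulation. This is a hedged sketch, not the algorithm as given in the slides: it assumes that each general sends in the fixed order Basil, Leo, Zoe, that a crash is modelled as a budget of messages the crashing general manages to send, and that the vote is taken per general first and then over the results. All function and variable names are illustrative.

```python
def majority(plans):
    """Majority vote; a tie resolves to retreat."""
    return "A" if plans.count("A") > plans.count("R") else "R"

def simulate(plans, crasher, sends_before_crash):
    """Run both rounds; `crasher` delivers only its first `sends_before_crash` messages."""
    order = list(plans)  # assumption: every general sends in this fixed order
    remaining = sends_before_crash

    def delivered(sender):
        nonlocal remaining
        if sender != crasher:
            return True
        if remaining == 0:
            return False
        remaining -= 1
        return True

    # Round 1: direct[receiver][sender] is the plan received from sender.
    direct = {g: {} for g in order}
    for sender in order:
        for receiver in order:
            if receiver != sender and delivered(sender):
                direct[receiver][sender] = plans[sender]

    # Round 2: relay every received plan, except back to the general it came from.
    reported = {g: {} for g in order}
    for sender in order:
        for receiver in order:
            if receiver == sender:
                continue
            for subject, plan in direct[sender].items():
                if subject != receiver and delivered(sender):
                    reported[receiver].setdefault(subject, []).append(plan)

    decisions = {}
    for g in order:
        if g == crasher:
            continue
        votes = [plans[g]]
        for other in order:
            if other == g:
                continue
            evidence = list(reported[g].get(other, []))
            if other in direct[g]:
                evidence.append(direct[g][other])
            if evidence:  # no evidence about a general means no vote for it
                votes.append(majority(evidence))
        decisions[g] = majority(votes)
    return decisions

plans = {"Basil": "A", "Leo": "R", "Zoe": "A"}
early_crash = simulate(plans, "Basil", sends_before_crash=1)  # only Basil -> Leo arrives
late_crash = simulate(plans, "Basil", sends_before_crash=3)   # round one plus one report
print(early_crash, late_crash)
```

In both runs Zoe and Leo decide to attack, matching the outcomes described for the two crash scenarios.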

13 Byzantine Failures with Three Generals (One-round Algorithm)
Basil, the traitor, sends a retreat message to Zoe and an attack message to Leo.
The one-round algorithm fails to reach consensus, just as in the crash failure case.

14 Byzantine Failures with Three Generals (Two-round Algorithm)
In the first round, Basil sends an A message to both Zoe and Leo.
In the second round, he correctly reports to Zoe that Leo's plan is R, but erroneously reports to Leo that Zoe's plan is R.
Leo decides to retreat (ties are broken in favour of retreat), while Zoe decides to attack, so again there is no consensus.
The algorithm is not correct for three generals of whom one is a traitor.
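The tally behind this failure can be written out explicitly. The sketch below assumes the two-stage vote (first per general, then over the results), with Zoe's initial plan A and Leo's initial plan R as implied by Basil's reports; the table layout and names are illustrative.

```python
def majority(plans):
    """Majority vote; a tie resolves to retreat."""
    return "A" if plans.count("A") > plans.count("R") else "R"

# Each entry: [plan received directly, plan relayed by the third general].
zoe_table = {
    "Basil": ["A", "A"],  # Basil said A; Leo relays Basil's A
    "Leo":   ["R", "R"],  # Leo said R; Basil truthfully reports R
}
leo_table = {
    "Basil": ["A", "A"],  # Basil said A; Zoe relays Basil's A
    "Zoe":   ["A", "R"],  # Zoe said A; Basil falsely reports R, giving a tie
}

def decide(own_plan, table):
    """Vote per general first, then over the general's own plan and those results."""
    return majority([own_plan] + [majority(evidence) for evidence in table.values()])

zoe_decision = decide("A", zoe_table)
leo_decision = decide("R", leo_table)
print(zoe_decision, leo_decision)
```

A single false report is enough here because Leo has only two copies of Zoe's plan, so the lie produces a tie that falls to retreat.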

15 Byzantine Failures with Four Generals
John, Basil and Leo are loyal generals; Zoe is the traitor.
Zoe sends first-round messages of R to Basil and Leo and A to John. These messages are relayed correctly by the loyal generals, and Basil has the table shown on the left.
For Zoe's plan, the vote is 2-1 in favor of R.
So, if the loyal generals choose the same plan initially, the final decision will be this plan, regardless of the actions of the traitor.
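The four-general claim can be checked by brute force over the traitor's possible messages. This is a sketch under several assumptions that are not in the slides: Zoe always sends either A or R (silence is not modelled), the vote is taken per general first and then over the results, and the loyal generals relay honestly. All names are illustrative.

```python
from itertools import product

LOYAL = ["John", "Basil", "Leo"]
CHOICES = ("A", "R")  # assumption: the traitor always sends something
PAIRS = [(g, h) for g in LOYAL for h in LOYAL if g != h]

def majority(plans):
    """Majority vote; a tie resolves to retreat."""
    return "A" if plans.count("A") > plans.count("R") else "R"

# Basil's evidence about Zoe: R directly, A relayed by John, R relayed by Leo.
zoe_plan_seen_by_basil = majority(["R", "A", "R"])

def decisions(loyal_plans, round1, round2):
    """round1[g]: plan Zoe sends to g. round2[(g, h)]: what Zoe tells g about h."""
    # Every loyal general holds the same three claims about Zoe's plan:
    # its own copy and the two copies relayed honestly by the others.
    zoe_vote = majority([round1[g] for g in LOYAL])
    result = {}
    for g in LOYAL:
        votes = [loyal_plans[g], zoe_vote]
        for h in LOYAL:
            if h != g:
                # h's own message, the honest relay of it, and Zoe's claim about it
                votes.append(majority([loyal_plans[h], loyal_plans[h], round2[(g, h)]]))
        result[g] = majority(votes)
    return result

def traitor_cannot_break_consensus():
    """Try every combination of loyal plans and traitor messages."""
    for plans in product(CHOICES, repeat=len(LOYAL)):
        loyal_plans = dict(zip(LOYAL, plans))
        for r1 in product(CHOICES, repeat=len(LOYAL)):
            for r2 in product(CHOICES, repeat=len(PAIRS)):
                outcome = set(decisions(loyal_plans,
                                        dict(zip(LOYAL, r1)),
                                        dict(zip(PAIRS, r2))).values())
                if len(outcome) != 1:
                    return False  # loyal generals disagree
                if len(set(plans)) == 1 and outcome != {plans[0]}:
                    return False  # unanimous initial plan was overturned
    return True

print(zoe_plan_seen_by_basil, traitor_cannot_break_consensus())
```

Under these assumptions, each loyal general's plan is supported by two honest copies against at most one lie, and all loyal generals see the same three claims about Zoe, so they all vote over the same values.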

16 Consensus
Crash failures: consensus is reached in t+1 rounds, where t is the number of traitors.
Byzantine failures: if more than two-thirds of the generals are loyal, there is a solution regardless of the messages issued by the traitorous generals. If one-third or more of the generals are traitors, there is no solution. In the case of one traitor, there is a solution for four generals and none for three. That is, the total number of generals must be at least 3t+1, where t is the number of traitors.
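The bounds on this slide are simple arithmetic, which a few helper functions can make concrete. The function names are hypothetical and only restate the t+1 and 3t+1 formulas.

```python
def crash_rounds(traitors):
    """Rounds needed to reach consensus with `traitors` crash failures: t + 1."""
    return traitors + 1

def min_generals(traitors):
    """Fewest generals that tolerate `traitors` Byzantine failures: 3t + 1."""
    return 3 * traitors + 1

def max_traitors(generals):
    """Most Byzantine traitors that `generals` can tolerate: largest t with 3t + 1 <= n."""
    return (generals - 1) // 3

for n in (3, 4, 7):
    print(n, "generals tolerate", max_traitors(n), "Byzantine traitor(s)")
```

Three generals tolerate no Byzantine traitor and four tolerate one, which agrees with the three-general and four-general cases on the slide.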