“Revisiting Fault Diagnosis Agreement in a New Territory” S. C. Wang and K. Q. Yan, Operating Systems Review, April 2004, pp. 41–61. An extension of the.

Presentation transcript:

“Revisiting Fault Diagnosis Agreement in a New Territory” S. C. Wang and K. Q. Yan, Operating Systems Review, April 2004, pp. 41–61. An extension of the Byzantine Generals algorithm, and hot off the press.

Agreement Problem
In the Byzantine Generals problem, a commanding general issues an “order” and all loyal lieutenant generals must agree on the same order.
A related subproblem is the consensus problem: each processor has its own initial value and must communicate with all other processors so that the healthy processors reach a common value.

Consensus constraints
- All the healthy processors agree on a common value (Consensus)
- If there exists a common initial value v_i among ALL the processors, then all the healthy processors must agree on v_i
Most protocols for solving Byzantine Agreement or consensus are fault-masking protocols: they reach consensus without the fault affecting the outcome.
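The two constraints above can be written as a small check. This is a minimal sketch, not code from the paper: the function name `satisfies_consensus` and the dictionary-based representation of initial and decided values are assumptions made for illustration.

```python
def satisfies_consensus(initial, decided, healthy):
    """Check the two consensus constraints for one protocol run.

    initial: dict mapping every processor id to its initial value
    decided: dict mapping processor id to the value it decided on
    healthy: set of ids of the healthy processors
    (All three structures are hypothetical, chosen for this sketch.)
    """
    decisions = {decided[p] for p in healthy}

    # Constraint 1: all healthy processors agree on one common value.
    agreement = len(decisions) <= 1

    # Constraint 2: if ALL processors start with the same value v,
    # the healthy processors must decide on v.
    initials = set(initial.values())
    validity = True
    if len(initials) == 1:
        (v,) = initials
        validity = decisions <= {v}

    return agreement and validity


# Common initial value 0, every healthy processor decides 0.
print(satisfies_consensus({1: 0, 2: 0, 3: 0}, {1: 0, 2: 0, 3: 0}, {1, 2, 3}))
```

The check only looks at the decisions of healthy processors, which matches the wording of both constraints.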

Fault Diagnosis Agreement (FDA)
The goal is to make each healthy processor able to detect and locate the faulty components in the distributed system.
- ALL the healthy processors identify a common set of faulty components in the process of reaching consensus (Agreement)
- No healthy component is falsely detected as faulty by any healthy processor (Fairness)
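The Agreement and Fairness properties can be checked mechanically once each healthy processor has reported its set of suspected components. The sketch below assumes a hypothetical representation (a dict of reported sets plus the true fault set, which only an outside observer would know); it is not taken from the paper.

```python
def satisfies_fda(detected, truly_faulty):
    """Check the FDA Agreement and Fairness properties.

    detected: dict mapping each healthy processor id to the set of
              components it reports as faulty (hypothetical structure)
    truly_faulty: set of components that really are faulty
    """
    reports = list(detected.values())

    # Agreement: every healthy processor reports the same set.
    agreement = all(r == reports[0] for r in reports)

    # Fairness: no report contains a component that is actually healthy.
    fairness = all(r <= truly_faulty for r in reports)

    return agreement and fairness


# Components are named by hypothetical link labels such as "P1-P5".
reports = {1: {"P1-P5", "P1-P4"}, 2: {"P1-P5", "P1-P4"}}
print(satisfies_fda(reports, {"P1-P5", "P1-P4"}))
```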

Paper assumes dual failure mode on the network
Most previous papers assume that the faulty components are processors only and that the network is fault-free. Here we assume that the processors are fault-free and that the network may be faulty.
Most other papers also assume that faults are malicious only. Here we assume a dual failure mode:
- Malicious faults (a random value is sent), and
- Dormant faults (no value is sent, as in a crash, or a stuck-at value is sent).
A healthy processor is assumed to be able to detect components with dormant faults.

Assumptions
- A synchronous distributed system whose processors are reliable during the protocol execution
- Faults such as a crash, a stuck-at value, noise, or an intruder may interfere with message transmission
- An n-processor fully connected network with m malicious faults and d dormant faults, where m <= ceiling[(n - d - 3)/2]
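The fault bound in the last assumption is easy to evaluate for concrete network sizes. A minimal sketch follows; the helper names `max_malicious` and `within_bound` are made up for this example.

```python
import math


def max_malicious(n, d):
    """Largest m allowed by m <= ceiling[(n - d - 3)/2]."""
    return math.ceil((n - d - 3) / 2)


def within_bound(n, m, d):
    """True if m malicious and d dormant faults satisfy the stated bound."""
    return 0 <= m <= max_malicious(n, d)


# Five processors with one dormant fault leave room for one malicious fault.
print(max_malicious(5, 1))
print(within_bound(5, 1, 1), within_bound(5, 2, 1))
```

For n = 5 and d = 1 the bound evaluates to ceiling[1/2] = 1, which is the fault mix (one dormant, one malicious) used in the five-processor example of this deck.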

Dual Fault Detection Consensus (DFDC) Algorithm
Three phases:
- Message exchange phase
- Decision making phase
- Fault detection phase
The message exchange phase and the decision making phase are similar to OM(1) in the Byzantine Generals paper. They produce a matrix of information at each processor, MAT_i, which is used to construct a majority vector, MAJ_i.
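The step from MAT_i to MAJ_i can be sketched as a row-wise majority vote. The sample matrix is the MAT_1 that appears in the five-processor example of this deck. Two details are assumptions of this sketch, not statements from the slides: the dormant marker `x` is ignored when counting votes, and a row with no strict majority yields `None`.

```python
from collections import Counter

DORMANT = "x"  # marker for a value lost to a dormant fault


def majority(values):
    """Strict majority among the non-dormant values, or None if there is none."""
    votes = [v for v in values if v != DORMANT]
    if not votes:
        return None
    value, count = Counter(votes).most_common(1)[0]
    return value if 2 * count > len(votes) else None


def majority_vector(mat):
    """Build MAJ_i by taking the majority of each row of MAT_i."""
    return [majority(row) for row in mat]


# Rows of MAT_1 as they appear in the deck's example.
MAT_1 = ["0001x", "0001x", "0000x", "0111x", "x110x"]
print(majority_vector(MAT_1))  # ['0', '0', '0', '1', '1']
```

Under these assumptions the result matches the MAJ_1 column shown next to MAT_1 in the example (0, 0, 0, 1, 1).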

Fault detection phase
Each processor sends its MAT_i to every other processor. Each healthy processor i uses the matrices it receives to find the faults:
- Take the majority value in each position of the received matrices to get FDMAT_i
- If no majority exists for the (i, j)th position, use the negative value of the (i, j)th position of the MAT_j that was sent
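The construction of FDMAT_i can be sketched as a position-wise vote over the received matrices. This is a rough illustration under several assumptions that the slides do not confirm: values are the bits `0` and `1`, the "negative value" is read as the bit complement, the dormant marker `x` is ignored when voting, and a position where every entry is dormant stays `x`. The function names are hypothetical, and the later step that locates faults from FDMAT_i is not shown because the slides do not describe it.

```python
from collections import Counter

DORMANT = "x"  # marker for a value lost to a dormant fault


def negate(bit):
    """Complement a bit; any other symbol (such as 'x') is returned unchanged."""
    return {"0": "1", "1": "0"}.get(bit, bit)


def build_fdmat(mats):
    """Position-wise majority over the received matrices.

    mats: dict mapping sender index j (0-based) to the matrix received
          from P_j, each matrix given as a list of equal-length strings.
    """
    n = len(next(iter(mats.values())))
    fdmat = []
    for i in range(n):
        row = []
        for j in range(n):
            votes = [m[i][j] for m in mats.values() if m[i][j] != DORMANT]
            top = Counter(votes).most_common(1)
            if top and 2 * top[0][1] > len(votes):
                # A strict majority exists at position (i, j).
                row.append(top[0][0])
            elif j in mats:
                # No majority: fall back to the complement of MAT_j's entry.
                row.append(negate(mats[j][i][j]))
            else:
                row.append(DORMANT)
        fdmat.append("".join(row))
    return fdmat


received = {
    0: ["010", "101", "000"],
    1: ["010", "111", "000"],
    2: ["010", "101", "00x"],
}
print(build_fdmat(received))  # ['010', '101', '000']
```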

Example: five processors
Initial values: V1 = 0, V2 = 0, V3 = 0, V4 = 1, V5 = 1 (P1 = 0, P2 = 0, P3 = 0, P4 = 1, P5 = 1).
[Figure: network of processors P1 to P5, with components labeled "dormant faulty" and "malicious faulty".]

Vectors received after the first round (columns V1 to V5; only two rows of the table survive in this transcript):
0 0 0 1 x
x 1 1 1 1

MAT_1 and MAJ_1 at processor P1:
MAT_1        MAJ_1
0 0 0 1 x    0
0 0 0 1 x    0
0 0 0 0 x    0
0 1 1 1 x    1
x 1 1 0 x    1

[MAT_2,3 / MAJ_2,3 and MAT_4 / MAJ_4: shown on the following slides for the same network; the matrix contents did not survive extraction.]
[MAT_5 / MAJ_5: only the fragment "x001x0 X X X X" survives.]

Fault detection phase with processor P1 (matrices received by P1):
MAT from P1: 0001x, 0001x, 0000x, 0111x, x110x
MAT from P2: 0001x, x1111 (remaining rows missing)
MAT from P3: 0001x, x1111 (remaining rows missing)
MAT from P4: (not recoverable)
MAT from P5: xxxxx, xxxxx, xxxxx, xxxxx, xxxxx
FDMAT: 0001x, x1111 (remaining rows missing)