The Consensus Problem in Fault Tolerant Computing


The Consensus Problem in Fault Tolerant Computing Sajayasree K K ME(CSE) E0 245 Fault Tolerant Computing

The Problem The consensus problem is to form an agreement among the fault-free members of the resource population on a quantum of information in order to maintain the performance and integrity of the system.

Organisation: Background, Different approaches, Problem formulation, The PMC model, The Byzantine Agreement, Fault Classification, Testing, Conclusion

Background What is the need for consensus? Connect computer resources to get a system with greater power and availability than any of its parts. The reverse can happen if faulty elements are allowed to corrupt the system.

Two Approaches How to overcome the inadvertent or malicious spread of information by the faulty segment of the population? Diagnose the fault: System Diagnosis (Preparata et al., 1967). Contain the fault: Byzantine Generals (Lamport et al., 1982).

General Problem Formulation General layered approach to fault management. Layers named in the figure: Reconfiguration; Fault diagnosis or masking; Reliable communication; Synchronization; Unreliable communication medium.

General Problem Formulation General NMR system, with distributed and central voting. Problems: performance and cost.
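
The voting step of an NMR system can be illustrated with a minimal sketch. It assumes a single central voter that takes a strict majority over the module outputs; the function name `majority_vote` and the sample values are hypothetical and do not come from the slides.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value reported by a strict majority of modules, else None.

    `outputs` is the list of results from the N redundant modules.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    # A strict majority needs more than half of the modules to agree.
    return value if count > len(outputs) // 2 else None


if __name__ == "__main__":
    # Triple modular redundancy (N = 3): one faulty module is outvoted.
    print(majority_vote([42, 42, 7]))
    # No majority: the voter cannot mask the disagreement.
    print(majority_vote([1, 2, 3]))
```

In a distributed-voting arrangement each module would run the same function on the values it received, which is one source of the performance and cost problems noted on the slide.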

The PMC Model Proposed in 1967 by Preparata, Metze and Chien. Each processor tests another processing element (PE). The tests form a graph and their outcomes form a syndrome. Conditions: (1) All failures are hard, or permanent, failures. (2) A fault-free processor is always able to determine accurately the condition of the PE it is testing. (3) A faulty processor produces unreliable test results. (4) No more than t PEs may be faulty.
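
The four conditions can be turned into a small brute-force diagnosis sketch. This is an illustration, not the algorithm from the original paper: the ring of five PEs (A to E), the encoding of a syndrome as a dictionary, and the names `consistent` and `diagnose` are all assumptions made for the example.

```python
from itertools import combinations


def consistent(faulty, syndrome):
    """Check whether a candidate fault set explains the syndrome.

    `syndrome` maps (tester, tested) to 1 (test failed) or 0 (test passed).
    """
    for (tester, tested), outcome in syndrome.items():
        if tester in faulty:
            continue  # a faulty tester may report anything (condition 3)
        expected = 1 if tested in faulty else 0  # condition 2
        if outcome != expected:
            return False
    return True


def diagnose(nodes, syndrome, t):
    """Return every fault set of size at most t consistent with the syndrome."""
    found = []
    for k in range(t + 1):
        for combo in combinations(sorted(nodes), k):
            if consistent(set(combo), syndrome):
                found.append(set(combo))
    return found


if __name__ == "__main__":
    # Assumed ring: A tests B, B tests C, C tests D, D tests E, E tests A.
    # B is faulty, so A reports 1; B's own report (here 0) is unreliable.
    syndrome = {("A", "B"): 1, ("B", "C"): 0, ("C", "D"): 0,
                ("D", "E"): 0, ("E", "A"): 0}
    print(diagnose("ABCDE", syndrome, t=1))  # [{'B'}]
```

Exhaustive search is only practical for small systems; it is used here because it follows the conditions directly.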

The PMC Model [Figure: test graph over PEs A, B, C, D and E, with test outcomes labelled 1 and x.]

The Byzantine Agreement Started with the work of Wensley et al. in 1978 on Software Implemented Fault Tolerance (SIFT). The number of PEs, n, must be greater than 3t, where t is the number of faulty elements. Each processor has a secret value. Values are exchanged by messages. Interactive consistency requires two properties. Consistency: each fault-free PE should form an identical vector of values whose ith element corresponds to the ith processor in the system. Meaningfulness: a vector element corresponding to a fault-free processor should be the actual secret value of that processor.
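
The bound n > 3t and the two interactive consistency properties can be written as simple checks. The sketch below only verifies the outcome of an agreement round; it does not implement the message exchange itself. The function names and the sample values are hypothetical.

```python
def can_tolerate(n, t):
    """Bound from the slide: n PEs can tolerate t faulty ones only if n > 3t."""
    return n > 3 * t


def check_interactive_consistency(vectors, secrets, faulty):
    """Return (consistency, meaningfulness) for the fault-free PEs.

    vectors[i] is the vector formed by PE i, secrets[j] is the secret value
    of PE j, and `faulty` is the set of indices of faulty PEs.
    """
    fault_free = [i for i in range(len(secrets)) if i not in faulty]
    # Consistency: all fault-free PEs hold the same vector.
    consistency = all(vectors[i] == vectors[fault_free[0]] for i in fault_free)
    # Meaningfulness: entries for fault-free PEs equal their real secrets.
    meaningfulness = all(vectors[i][j] == secrets[j]
                         for i in fault_free for j in fault_free)
    return consistency, meaningfulness


if __name__ == "__main__":
    secrets = [10, 20, 30, 40]
    faulty = {3}
    agreed = [10, 20, 30, 0]  # entry for the faulty PE may be any common value
    vectors = {0: agreed, 1: agreed, 2: agreed}
    print(can_tolerate(4, 1), check_interactive_consistency(vectors, secrets, faulty))
```

Note that the entry for a faulty PE is unconstrained by meaningfulness; it only has to be the same in every fault-free vector.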

An Example

Byzantine Generals Problem The Byzantine Generals Problem was introduced by Lamport, Shostak and Pease in 1982. A Byzantine commanding general, who has surrounded the enemy with his many armies, each led by a lieutenant general, wishes to organize a concerted plan of action, i.e., to attack or to retreat.

Fault Classification Analysing the characteristics of a faulty processor leads to the proposal of fault models. A fault model defines the behavior of a PE once it has become faulty. System diagnosis: a description of test results given the status of the tester and the tested PE. Byzantine agreement: a description of the limitations of a faulty processor. In general, the more constraints in the fault model, the easier it is to form consensus.

Fault Classification: A failure in System Diagnosis
Interactions of a faulty PE
Model: PMC. Group: Symmetric Invalidation. Description: Faulty PEs report unreliable results. Non-faulty PEs always produce correct results.
Model: BGM. Group: Asymmetric Invalidation. Description: A faulty PE would always test faulty regardless of the condition of the testing PE.
Model: HK1, HK2. Group: Reflexive and Irreflexive Invalidation. Description: A faulty PE will always report a non-faulty PE as being faulty.

Test Validity Models

Fault Classification: A failure in System Diagnosis
Transient: Resulting from the system's environment.
Intermittent: Internal to the system. Will not occur consistently.
Permanent: Internal to the system. Will always produce errors when exercised.

Fault Classification: A failure in Byzantine Agreement In the worst case, faulty PEs are assumed to work with complete knowledge of the state of the system (the adversary model). Limitations of the adversary model: defining algorithms that work only for this model can be limiting and impractical. So another classification of faults is introduced, in which a stronger class is a subset of a weaker class.

Fault Classification: A failure in Byzantine Agreement
Fail-Stop Fault: Faulty PE ceases operation and alerts other PEs.
Crash Fault: Occurs when a PE loses its internal state or halts.
Omission Fault: Occurs when a PE fails to meet a deadline or begin a task.
Timing Fault: Occurs when a PE completes a task either before or after its specified time frame.
Incorrect Computation Fault: Occurs when a PE fails to produce correct results.
Authenticated Byzantine Fault: Arbitrary or malicious fault. Cannot imperceptibly alter an authenticated message.
Byzantine Fault: Every fault is possible. The universal set.
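
The nesting of fault classes can be sketched as an ordered enumeration. The sketch assumes that the classes nest in exactly the order they are listed on this slide, with Fail-Stop as the most constrained class and Byzantine as the universal set; the names `FaultClass` and `covers` are hypothetical.

```python
from enum import IntEnum


class FaultClass(IntEnum):
    """Fault classes in assumed nesting order (smaller value = more constrained)."""
    FAIL_STOP = 1
    CRASH = 2
    OMISSION = 3
    TIMING = 4
    INCORRECT_COMPUTATION = 5
    AUTHENTICATED_BYZANTINE = 6
    BYZANTINE = 7


def covers(assumed, actual):
    """True if an algorithm built for `assumed` faults also handles `actual` ones.

    Under the nesting assumption, each class contains every class before it.
    """
    return actual <= assumed


if __name__ == "__main__":
    # An algorithm that tolerates Byzantine faults also tolerates crash faults.
    print(covers(FaultClass.BYZANTINE, FaultClass.CRASH))
    # An algorithm that only tolerates crash faults does not cover timing faults.
    print(covers(FaultClass.CRASH, FaultClass.TIMING))
```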

Fault Classification: A failure in Byzantine Agreement [Figure: nested fault classes, from Fail-Stop to Byzantine Fault.]

Testing
Self-Testing: Testing performed by every PE on itself in a series of self-tests. A tests B by a simple request to get status.
Comparison-Testing: A test consists of performing an action and comparing the result of that action with what is expected.
Group Testing: A test may only be able to determine whether a group of PEs is faulty or not. Reaching single-PE resolution might require multiple tests.
Time Domain Testing: Testing of PEs with respect to time. If a PE fails to complete a task or exchange a message in the specified time, an error has occurred.
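
Two of these test types are simple enough to sketch directly. The example below assumes that a PE's task can be modelled as a Python callable; the function names, and the injectable `clock` parameter used to make the timing check testable, are hypothetical.

```python
import time


def comparison_test(action, expected):
    """Comparison testing: perform the action and compare with the expected result."""
    return action() == expected


def time_domain_test(action, deadline_s, clock=time.monotonic):
    """Time domain testing: pass only if the action finishes within deadline_s seconds."""
    start = clock()
    action()
    return (clock() - start) <= deadline_s


if __name__ == "__main__":
    print(comparison_test(lambda: 2 + 2, 4))        # correct result
    print(comparison_test(lambda: 2 + 2, 5))        # incorrect result
    print(time_domain_test(lambda: sum(range(10)), deadline_s=1.0))
```

A real tester would also need to handle a PE that never responds, for example by running the action with a timeout rather than waiting for it to return.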

Conclusion Despite their different characteristics, Byzantine agreement and system diagnosis have very similar goals, namely to produce a correct agreement despite a number of faults. Showing the similarities of both approaches allows future research to draw from both areas rather than continuing apart.

References
Michael Barborak, Miroslaw Malek and Anton Dahbura, “The Consensus Problem in Fault-Tolerant Computing”, ACM Computing Surveys, Vol. 25, No. 2, June 1993.
Michael Fischer, Nancy Lynch and Michael Paterson, “Impossibility of Distributed Consensus with One Faulty Process”, Journal of the ACM, April 1985.
PODC Influential Paper Award 2001, http://www.podc.org/influential/2001.html