The Consensus Problem in Fault Tolerant Computing


1 The Consensus Problem in Fault Tolerant Computing
Sajayasree K K ME(CSE) E Fault Tolerant Computing

2 The Problem
The consensus problem is to form an agreement among the fault-free members of the resource population on a quantum of information, in order to maintain the performance and integrity of the system.

3 Organisation
- Background
- Different approaches
- Problem formulation
- The PMC model
- The Byzantine Agreement
- Fault Classification
- Testing
- Conclusion

4 Background
What is the need for consensus?
Connect computer resources to get a system with greater power and availability than any of its parts. The reverse can happen if faulty elements are allowed to corrupt the system.

5 Two Approaches
How to overcome the inadvertent or malicious spread of information by the faulty segment of the population?
- Diagnose the fault: System Diagnosis (Preparata et al., 1967)
- Contain the fault: Byzantine Generals (Lamport et al., 1982)

6 General Problem Formulation
[Figure: general layered approach to fault management. Layers shown: reconfiguration; fault diagnosis or masking; reliable communication; synchronization; unreliable communication medium.]

7 General Problem Formulation
[Figure: general NMR system, with distributed and central voting.]
Problems: performance, cost.
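The slide shows the NMR voter only as a figure. As a minimal sketch, assuming the voter accepts a value only when a strict majority of the N modules produce it (the function name `majority_vote` is illustrative, not from the slides), central voting could look like this:

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of modules, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None

# Triple modular redundancy (N = 3): a single faulty module is masked.
print(majority_vote([42, 42, 7]))  # 42
# No strict majority: the voter cannot mask the disagreement.
print(majority_vote([1, 2, 3]))    # None
```

In a distributed-voting arrangement, each module would run the same function on the outputs it receives from the others, which is where the performance and cost problems listed on the slide come from.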

8 The PMC Model
Proposed in 1967 by Preparata, Metze and Chien.
Each processing element (PE) tests another PE. Construct a test graph and a syndrome.
Conditions:
- All failures are hard (permanent) failures.
- A fault-free processor is always able to determine accurately the condition of the PE it is testing.
- A faulty processor produces unreliable test results.
- No more than t PEs may be faulty.
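The conditions above can be turned into a small consistency check. The following is a rough sketch, not the authors' diagnosis algorithm: it assumes a syndrome is a mapping from (tester, tested) pairs to 0 (pass) or 1 (fail), and it finds candidate fault sets by brute-force enumeration. The names `consistent` and `diagnose` and the five-PE ring example are hypothetical.

```python
from itertools import combinations

def consistent(fault_set, syndrome):
    """True if fault_set could have produced the syndrome under PMC rules."""
    for (tester, tested), outcome in syndrome.items():
        # Only fault-free testers are trusted; faulty testers may report anything.
        if tester not in fault_set and outcome != int(tested in fault_set):
            return False
    return True

def diagnose(units, syndrome, t):
    """All fault sets of size at most t that are consistent with the syndrome."""
    return [set(c)
            for k in range(t + 1)
            for c in combinations(units, k)
            if consistent(set(c), syndrome)]

# Hypothetical ring of five PEs, each testing its successor; B is faulty.
units = ["A", "B", "C", "D", "E"]
syndrome = {
    ("A", "B"): 1,  # fault-free A correctly reports B as faulty
    ("B", "C"): 1,  # faulty B gives an unreliable result
    ("C", "D"): 0,
    ("D", "E"): 0,
    ("E", "A"): 0,
}
print(diagnose(units, syndrome, t=1))  # [{'B'}]
```

Enumeration is exponential in t, so this only illustrates what "consistent with the syndrome" means; it is not a practical diagnosis procedure.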

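9 The PMC Model
[Figure: example test graph over PEs A, B, C, D and E, with test outcomes labelled on the edges (labels 1 and x).]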

10 The Byzantine Agreement
Started by the work of Wensley et al. on Software Implemented Fault Tolerance (SIFT).
The number of PEs (n) must be greater than 3t, where t is the number of faulty elements.
Each processor has a secret value. Values are exchanged by messages.
Interactive consistency:
- Consistency: each fault-free PE should form an identical vector of values whose ith element corresponds to the ith processor in the system.
- Meaningfulness: a vector element corresponding to a fault-free processor should be the actual secret value of that processor.
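The two interactive consistency conditions and the n > 3t bound can be checked mechanically once the vectors are known. This is a sketch under assumed data shapes (a dict from PE index to the vector that PE formed, plus a list of true secret values); the function names are illustrative only.

```python
def byzantine_bound_ok(n, t):
    """The bound stated above: n must be greater than 3t."""
    return n > 3 * t

def interactive_consistency(vectors, secrets, faulty):
    """Return (consistency, meaningfulness) over the fault-free PEs.

    vectors: PE index -> vector of values that PE formed
    secrets: true secret value of each PE, by index
    faulty:  set of indices of faulty PEs
    """
    good = [i for i in range(len(secrets)) if i not in faulty]
    consistency = all(vectors[i] == vectors[good[0]] for i in good)
    meaningfulness = all(vectors[i][j] == secrets[j] for i in good for j in good)
    return consistency, meaningfulness

print(byzantine_bound_ok(4, 1))  # True
print(byzantine_bound_ok(3, 1))  # False

secrets = [10, 20, 30, 40]
faulty = {3}
# Fault-free PEs agree on every element, including the faulty PE's slot.
vectors = {i: [10, 20, 30, None] for i in (0, 1, 2)}
print(interactive_consistency(vectors, secrets, faulty))  # (True, True)
```

Note that consistency constrains the slot of the faulty PE as well: the fault-free PEs must agree on some value for it, even though no "correct" value exists.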

11 An Example

12 Byzantine Generals Problem
The Byzantine Generals Problem was introduced by Lamport, Shostak and Pease in 1982. A Byzantine commanding general, who has surrounded the enemy with his many armies, each led by a lieutenant general, wishes to organize a concerted plan of action, i.e., to attack or to retreat.
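The slide states the problem but not a solution. As an illustration, here is a minimal sketch of the recursive oral-message algorithm OM(m) from the Lamport, Shostak and Pease paper. Two things are assumptions made for the simulation and are not from the slides: a traitor sends alternating orders depending on the recipient's position, and ties default to retreat.

```python
from collections import Counter

ATTACK, RETREAT = "attack", "retreat"

def majority(values):
    """Most common value; ties fall back to RETREAT (assumed default)."""
    top = Counter(values).most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return RETREAT
    return top[0][0]

def om(m, commander, lieutenants, value, faulty):
    """Simulate OM(m); returns lieutenant -> value it settles on."""
    received = {}
    for idx, lt in enumerate(lieutenants):
        if commander in faulty:
            # Assumed traitor behaviour: alternate orders by recipient position.
            received[lt] = ATTACK if idx % 2 == 0 else RETREAT
        else:
            received[lt] = value
    if m == 0:
        return received
    # Each lieutenant relays what it received, acting as commander in OM(m-1).
    relayed = {lt: {} for lt in lieutenants}
    for i in lieutenants:
        others = [lt for lt in lieutenants if lt != i]
        for j, v in om(m - 1, i, others, received[i], faulty).items():
            relayed[j][i] = v
    return {j: majority([received[j]] +
                        [relayed[j][i] for i in lieutenants if i != j])
            for j in lieutenants}

# n = 4, t = 1: loyal commander 0, traitorous lieutenant 3.
d = om(1, 0, [1, 2, 3], ATTACK, faulty={3})
print(d[1], d[2])  # attack attack

# n = 4, t = 1: traitorous commander, all lieutenants loyal.
d2 = om(1, 0, [1, 2, 3], ATTACK, faulty={0})
print(set(d2.values()))  # {'attack'}
```

In both runs the loyal lieutenants end up with the same order, and when the commander is loyal that order is the commander's.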

13 Fault Classification
Analysing the characteristics of a faulty processor leads to the proposal of fault models. A fault model defines the behavior of a PE once it has become faulty.
- System diagnosis: a description of test results given the status of the tester and the tested PE.
- Byzantine agreement: a description of the limitations of a faulty processor.
In general, the more constraints in the fault model, the easier it will be to form consensus.

14 Fault Classification: A failure in system Diagnosis
Interactions of a faulty PE (Model; Group; Description):
- PMC; Symmetric Invalidation; Faulty PEs report unreliable results. Non-faulty PEs always produce correct results.
- BGM; Asymmetric Invalidation; A faulty PE would always test faulty regardless of the condition of the testing PE.
- HK1, HK2; Reflexive and Irreflexive Invalidation; A faulty PE will always report a non-faulty PE as being faulty.
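The difference between the PMC and BGM rows can be made concrete by listing which test outcomes each model allows. This is a sketch covering only those two models as described above; the function name and the 0/1 encoding (0 = pass, 1 = fail) are assumptions, and the HK models are left out.

```python
def possible_outcomes(model, tester_faulty, tested_faulty):
    """Set of outcomes (0 = pass, 1 = fail) a single test may produce."""
    if model not in ("PMC", "BGM"):
        raise ValueError("only PMC and BGM are sketched here")
    if not tester_faulty:
        # A fault-free tester always reports the true condition.
        return {int(tested_faulty)}
    if model == "BGM" and tested_faulty:
        # BGM: a faulty PE always tests faulty, whoever tests it.
        return {1}
    # Otherwise a faulty tester's result is unreliable.
    return {0, 1}

for model in ("PMC", "BGM"):
    print(model, possible_outcomes(model, tester_faulty=True, tested_faulty=True))
```

The only case where the two models differ is a faulty tester testing a faulty PE: PMC allows either outcome, BGM allows only a failed test.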

15 Test Validity Models

16 Fault Classification: A failure in system Diagnosis
Fault type; Description:
- Transient; Resulting from the system's environment.
- Intermittent; Internal to the system. Will not occur consistently.
- Permanent; Internal to the system. Will always produce errors when exercised.

17 Fault Classification: A failure in Byzantine Agreement
Adversary model: in the worst case, faulty PEs are assumed to work with complete knowledge of the state of the system.
Limitations of the adversary model: defining algorithms that work only for this model can be limiting and impractical. So another classification of faults is introduced, in which a stronger class is a subset of a weaker class.

18 Fault Classification: A failure in Byzantine Agreement
Fault class; Description:
- Fail-Stop Fault; Faulty PE ceases operation and alerts other PEs.
- Crash Fault; Occurs when a PE loses its internal state or halts.
- Omission Fault; Occurs when a PE fails to meet a deadline or begin a task.
- Timing Fault; Occurs when a PE completes a task either before or after its specified time frame.
- Incorrect Computation Fault; Occurs when a PE fails to produce correct results.
- Authenticated Byzantine Fault; Arbitrary or malicious fault. Cannot imperceptibly alter an authenticated message.
- Byzantine Fault; Every fault is possible. The universal set.
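If the classes are read as nested sets, with each class containing the ones listed before it, the hierarchy can be encoded as an ordered enumeration. This is only a sketch: the nesting order is assumed to follow the order of the table, and `tolerates` is a hypothetical helper name.

```python
from enum import IntEnum

class FaultClass(IntEnum):
    # Assumed nesting: each class is a subset of every class listed after it.
    FAIL_STOP = 1
    CRASH = 2
    OMISSION = 3
    TIMING = 4
    INCORRECT_COMPUTATION = 5
    AUTHENTICATED_BYZANTINE = 6
    BYZANTINE = 7

def tolerates(designed_for, observed):
    """An algorithm built for a class also covers every class nested inside it."""
    return observed <= designed_for

print(tolerates(FaultClass.BYZANTINE, FaultClass.CRASH))  # True
print(tolerates(FaultClass.CRASH, FaultClass.TIMING))     # False
```

This mirrors the point made about constraints: an algorithm that only needs to handle fail-stop faults has the easiest job, while one designed for Byzantine faults covers everything.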

19 Fault Classification: A failure in Byzantine Agreement
[Figure: nested fault classes, from Fail-Stop to Byzantine Fault.]

20 Testing
Test type; Description:
- Self-Testing; Testing performed by every PE on itself in a series of self-tests. A tests B by a simple request to get status.
- Comparison Testing; A test consists of performing an action and comparing the result of that action with what is expected.
- Group Testing; A test may only be able to determine whether a group of PEs is faulty or not. Reaching single-PE resolution might require multiple tests.
- Time Domain Testing; Testing of PEs with respect to time. If a PE fails to complete a task or exchange a message in the specified time, an error has occurred.
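Comparison testing and time domain testing are simple enough to sketch directly. In this illustration the tested PE is stood in for by a plain callable, and both function names are hypothetical. The timing check measures elapsed time after the call returns, so it assumes the PE eventually replies; a real system would need a proper timeout.

```python
import time

def comparison_test(action, expected):
    """Pass (True) if performing the action yields the expected result."""
    return action() == expected

def time_domain_test(action, deadline_s):
    """Pass (True) if the action completes within deadline_s seconds."""
    start = time.monotonic()
    action()
    return time.monotonic() - start <= deadline_s

print(comparison_test(lambda: 2 + 2, 4))                    # True
print(comparison_test(lambda: 2 + 2, 5))                    # False
print(time_domain_test(lambda: None, 1.0))                  # True
print(time_domain_test(lambda: time.sleep(0.05), 0.01))     # False
```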

21 Conclusion
Despite their different characteristics, Byzantine agreement and system diagnosis have very similar goals, namely to produce a correct agreement despite a number of faults. Showing the similarities between the two approaches allows future research to draw from both areas rather than continuing apart.

22 References
- Michael Barborak, Miroslaw Malek and Anton Dahbura, “The Consensus Problem in Fault-Tolerant Computing”, ACM Computing Surveys, Vol. 25, No. 2, June 1993.
- Michael Fischer, Nancy Lynch and Michael Paterson, “Impossibility of Distributed Consensus with One Faulty Process”, Journal of the ACM, April 1985. PODC Influential Paper Award 2001.

