Distributed Algorithms – 2g1513 Lecture 9 – by Ali Ghodsi Fault-Tolerance in Distributed Systems.


2 Fault-Tolerance
- So far, we have assumed that the system is reliable
- Fault-tolerant computing with sequential algorithms on a single processor is limited
- Distributed systems have the partial-failure property
  - As the number of components (e.g. processes) increases:
    - It becomes extremely likely that some component will fail
    - It becomes extremely unlikely that all components will fail
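
The partial-failure property can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that each of n components fails independently with the same probability p. Neither the independence assumption nor the value of p comes from the slides, and the function names are hypothetical.

```python
def prob_some_failure(n: int, p: float) -> float:
    """Probability that at least one of n independent components fails."""
    return 1.0 - (1.0 - p) ** n


def prob_total_failure(n: int, p: float) -> float:
    """Probability that all n independent components fail."""
    return p ** n


if __name__ == "__main__":
    p = 0.01  # hypothetical per-component failure probability
    for n in (1, 10, 100, 1000):
        # The first probability grows towards 1, the second shrinks towards 0.
        print(n, prob_some_failure(n, p), prob_total_failure(n, p))
```

Under these assumptions, adding components pushes the chance of some failure towards 1 while the chance of total failure falls off exponentially, which is why partial failure is the case worth designing for.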

3 Example
- The PASS (Primary Avionics Software System), developed by IBM in 1981, was used in the space shuttle
  - It could have been done on one computer
  - But 4 separate processes were used for fault-tolerance
- Voting on the outcome

4 Space Shuttle DS hardware

5 Two main approaches to Fault-tolerance
The two main approaches to fault-tolerance in distributed systems are
- Robust algorithms
- Stabilizing algorithms (sometimes called self-stabilizing)

6 Fault-tolerance: Robust Algorithms
Robust algorithms
- Correct processes should continue behaving correctly in spite of failures
- Tolerate failures by using replication and voting
- Never wait for all processes (unlike the Tree algorithm and the Echo algorithm), because processes could fail
- Usually deal with permanent faults
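
Replication and voting can be sketched as a majority vote over whatever replies have arrived. This is a minimal illustration, not a full protocol: the function name `majority_vote` and its parameters are hypothetical, and the sketch assumes a value counts only if a strict majority of all replicas reported it.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional


def majority_vote(replies: Iterable[Hashable], n_replicas: int) -> Optional[Hashable]:
    """Return the value reported by a strict majority of all replicas, else None.

    Replies from crashed replicas are simply absent, so the caller never
    has to wait for every replica to answer.
    """
    counts = Counter(replies)
    if not counts:
        return None
    value, votes = counts.most_common(1)[0]
    # Compare against the total number of replicas, not the number of replies.
    return value if votes > n_replicas // 2 else None
```

For example, with 4 replicas of which one has crashed, three matching replies are enough to return a result, while two matching replies are not.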

7 Fault-tolerance: Stabilizing Algorithms
Stabilizing algorithms
- Correct processes might be affected by failures, but will eventually become correct
- The system can start in any state (possibly faulty), but should eventually resume correct behavior
- Usually deal with transient faults
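
A classic example of a stabilizing algorithm, not taken from these slides, is Dijkstra's K-state token ring: started in an arbitrary state, the ring eventually reaches a state with exactly one privilege (token) and keeps it. The sketch below assumes a central scheduler that fires one privileged process per step and K greater than the number of processes; the function names are hypothetical.

```python
import random
from typing import List


def privileged(x: List[int]) -> List[int]:
    """Indices of the processes that currently hold a privilege (token)."""
    n = len(x)
    result = [0] if x[0] == x[n - 1] else []
    result += [i for i in range(1, n) if x[i] != x[i - 1]]
    return result


def step(x: List[int], i: int, k: int) -> None:
    """Let privileged process i make its move, updating x in place."""
    if i == 0:
        x[0] = (x[0] + 1) % k   # the distinguished process increments modulo k
    else:
        x[i] = x[i - 1]         # every other process copies its left neighbour


def run(x: List[int], k: int, steps: int, rng: random.Random) -> List[int]:
    """Fire one randomly chosen privileged process per step."""
    for _ in range(steps):
        # At least one process is always privileged, so choice() never fails.
        step(x, rng.choice(privileged(x)), k)
    return x
```

Starting from any assignment of values in the range 0 to K-1, repeatedly calling `run` should leave `privileged(x)` with a single entry, which illustrates the "start in any state, eventually resume correct behavior" property.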

8 Failure models and their hierarchies
- Initially dead processes (benign fault)
  - A subset of the processes never start
- Crash model (benign fault)
  - A process functions properly according to its local algorithm until a certain point, where it stops indefinitely
  - It never restarts
- Byzantine behavior (malign fault)
  - The process may execute any possible local algorithm
  - It may send and receive arbitrary messages

9 Failure hierarchy
- A dead process is a special case of a crashed process
  - The case where the process crashes before it starts executing
- A crashed process is a special case of a Byzantine process
  - The case where the Byzantine process crashes and then stays in that state for all future transitions

10 How good is a robust algorithm?
- Byzantine-robust implies crash-robust
- Crash-robust implies initially-dead-robust

11 Typical Guarantees
It turns out that robust algorithms can typically tolerate:
- fewer than N/2 benign failures for N processes
- fewer than N/3 malign failures for N processes
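
As a worked example of these bounds, the sketch below computes the largest number of failures that fits under each one. It assumes the bounds are read as strict (fewer than N/2 and fewer than N/3), which is the usual formulation but is an interpretation of the slide; the function names are hypothetical.

```python
def max_benign_failures(n: int) -> int:
    """Largest f with f < n/2 (assumed strict bound for benign failures)."""
    return (n - 1) // 2


def max_malign_failures(n: int) -> int:
    """Largest f with f < n/3 (assumed strict bound for malign failures)."""
    return (n - 1) // 3


if __name__ == "__main__":
    for n in (3, 4, 7, 10):
        print(n, max_benign_failures(n), max_malign_failures(n))
```

Under this reading, 4 processes tolerate one failure of either kind, while 3 processes tolerate one benign failure but no malign one.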

12 Decision Problems
- Robust algorithms typically try to solve some decision problem
  - Each correct process irreversibly “decides”

13 Requirements
Three requirements on decision problems:
- Termination
  - All correct processes eventually decide
- Consistency
  - A constraint on the decisions of different processes:
    - Consensus problem: all decisions should be equal
    - Election: all decisions except one should be the same
- Non-triviality
  - Fixed trivial outputs (e.g. always decide “yes”) are excluded
  - Processes should need to communicate to be able to solve the problem
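
The termination and consistency requirements can be checked mechanically on the outcome of a single finished run, as in the sketch below for the consensus problem. It is an illustration only: the function name and the representation of a run (a map from process name to decided value, with None meaning "has not decided") are assumptions. Non-triviality is a property of the algorithm over all runs, so it cannot be checked on one run and is left out.

```python
from typing import Dict, Hashable, Optional, Set


def check_consensus_run(
    decisions: Dict[str, Optional[Hashable]],
    correct: Set[str],
) -> Dict[str, bool]:
    """Check termination and consistency for one finished run.

    decisions maps each process to its decided value, or None if undecided.
    Only processes in `correct` are constrained; faulty ones are ignored.
    """
    decided = [decisions.get(p) for p in correct]
    termination = all(v is not None for v in decided)
    # Consensus: all decisions that were made by correct processes are equal.
    consistency = len({v for v in decided if v is not None}) <= 1
    return {"termination": termination, "consistency": consistency}
```

A run in which a crashed process never decides still passes both checks, as long as that process is not listed as correct.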

14 Consensus Problems
Consensus problems are very important in distributed systems
- Distributed databases
  - All processes must agree whether to commit or abort a transaction
  - If any process says abort, all processes should abort
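
The commit rule above can be written as a one-line decision function. This sketch shows only the rule applied to a complete set of votes; it is not a fault-tolerant commit protocol, and the function name and the string encoding of votes are hypothetical.

```python
from typing import Iterable


def commit_decision(votes: Iterable[str]) -> str:
    """Return "commit" only if every vote is "commit"; otherwise "abort"."""
    votes = list(votes)
    # An empty vote set is treated as abort, a conservative assumption.
    if votes and all(v == "commit" for v in votes):
        return "commit"
    return "abort"
```

The hard part, which the rule hides, is getting every process to apply it to the same set of votes when some processes may crash.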

15 Fundamental Result: FLP85
- Reliable Broadcast
  - All correct processes deliver the same set of messages
  - The set only contains messages from correct processes
- Atomic Broadcast (reliable)
  - A reliable broadcast which guarantees that every process receives its messages in the same order as all the other processes
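
The difference between the two broadcast guarantees can be shown by checking delivery logs. The sketch below assumes each correct process has a complete log of the messages it delivered, in delivery order; the function names and the log representation are hypothetical.

```python
from typing import Dict, Hashable, List


def delivers_same_set(logs: Dict[str, List[Hashable]]) -> bool:
    """Reliable broadcast check: every process delivered the same set."""
    sets = [frozenset(log) for log in logs.values()]
    return all(s == sets[0] for s in sets)


def delivers_same_order(logs: Dict[str, List[Hashable]]) -> bool:
    """Atomic broadcast check: every process delivered the same sequence."""
    seqs = list(logs.values())
    return all(seq == seqs[0] for seq in seqs)
```

Two processes that deliver messages a and b in opposite orders satisfy the first check but not the second.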

16 Atomic Broadcast ⇒ Consensus
- Given a reliable atomic broadcast, we can implement a consensus algorithm
  - Let every node broadcast either 0 or 1
  - Decide on the first number that is received
  - Since every correct process will receive the messages in the same order, they will all decide on the same value
- Solving reliable atomic broadcast is equivalent to solving consensus
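
The reduction from consensus to atomic broadcast can be simulated in a few lines. In this sketch the atomic broadcast layer is a stand-in that simply picks one arbitrary total order and hands the same sequence to every process; a real implementation is exactly the hard part. All names are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Delivery = Tuple[str, int]  # (sender, broadcast value)


def atomic_broadcast(proposals: Dict[str, int], rng: random.Random) -> List[Delivery]:
    """Stand-in for atomic broadcast: choose one total order of all messages."""
    order = list(proposals.items())
    rng.shuffle(order)  # the order is arbitrary, but there is only one
    return order


def decide_all(proposals: Dict[str, int], rng: random.Random) -> Dict[str, int]:
    """Every process decides on the first value in its delivery sequence."""
    order = atomic_broadcast(proposals, rng)
    # Each process gets its own copy of the same totally ordered sequence.
    delivered = {p: list(order) for p in proposals}
    return {p: delivered[p][0][1] for p in proposals}
```

Because every process sees the same first message, all decisions coincide, and the decided value is always one that some process actually proposed.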

17 Fischer, Lynch, Paterson 1983/85
- Consensus cannot be solved in the asynchronous model
  - Even with the possibility of just one process crashing
- Most influential paper award, PODC 2001

18 Reminder, order of events
- Intuition
  - The order in which two applicable events are executed is not important!
- Order Theorem
  - Let $e_p$ and $e_q$ be two events on two different processors $p$ and $q$ which are both applicable in configuration $\gamma$. Then $e_p$ can be applied to $e_q(\gamma)$, and $e_q$ can be applied to $e_p(\gamma)$.
  - Moreover, $e_p(e_q(\gamma)) = e_q(e_p(\gamma))$.

19 Definitions
- A sequence of events $\sigma = (e_1, e_2, \ldots, e_k)$ is applicable in configuration $\gamma$ if $e_1$ is applicable in $\gamma$, $e_2$ is applicable in $e_1(\gamma)$, ...
- If the resulting configuration is $\delta$, we write $\sigma(\gamma) = \delta$ or $\gamma \rightsquigarrow \delta$
- If $\sigma$ only contains events of a subset of the processes $P$, we write $\gamma \rightsquigarrow_P \delta$

20 Order of sequences
- Diamond Theorem
  - Let sequences $\sigma_1$ and $\sigma_2$ be applicable in configuration $\gamma$, and let no process participate in both $\sigma_1$ and $\sigma_2$. Then $\sigma_2$ is applicable in $\sigma_1(\gamma)$, $\sigma_1$ is applicable in $\sigma_2(\gamma)$, and $\sigma_1(\sigma_2(\gamma)) = \sigma_2(\sigma_1(\gamma))$
- Proof
  - By induction using the order theorem
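
The Diamond Theorem can be tried out on a toy model of an asynchronous system. In the sketch below a configuration is a pair of per-process histories and a multiset of messages in transit, and an event lets one process optionally receive one message and optionally send one. This model and all names in it are assumptions made for illustration, not the formal model used in the lecture.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Event:
    proc: str                               # process taking the step
    receive: Optional[int] = None           # payload consumed, if any
    send: Optional[Tuple[str, int]] = None  # (destination, payload), if any


# Configuration: (history of events per process, messages in transit)
Config = Tuple[Dict[str, Tuple[Event, ...]], Counter]


def applicable(e: Event, config: Config) -> bool:
    """A receive needs its message in transit; other steps always apply."""
    _, transit = config
    return e.receive is None or transit[(e.proc, e.receive)] > 0


def apply_event(e: Event, config: Config) -> Config:
    """Return the configuration reached by applying e (inputs are not mutated)."""
    if not applicable(e, config):
        raise ValueError(f"{e} is not applicable")
    states, transit = config
    new_states, new_transit = dict(states), Counter(transit)
    if e.receive is not None:
        new_transit[(e.proc, e.receive)] -= 1
    if e.send is not None:
        new_transit[e.send] += 1
    new_states[e.proc] = states[e.proc] + (e,)
    return new_states, +new_transit  # unary plus drops zero counts


def apply_sequence(events: Sequence[Event], config: Config) -> Config:
    """Apply the events one after another, left to right."""
    for e in events:
        config = apply_event(e, config)
    return config
```

With one sequence containing only events of p and another containing only events of q, applying them in either order from the same starting configuration gives equal results. Two events of the same process generally do not commute in this model, because the per-process history records their order.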

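21 Illustration of the Diamond Theorem
[Figure: diamond diagram. From $\gamma$, $\sigma_1$ leads to $\sigma_1(\gamma)$ and $\sigma_2$ leads to $\sigma_2(\gamma)$; applying the other sequence from either configuration reaches $\delta = \sigma_2(\sigma_1(\gamma)) = \sigma_1(\sigma_2(\gamma))$]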

22 Summary
- Distributed systems are attractive when it comes to fault-tolerance
- Two main methods to achieve fault-tolerance
  - Robust algorithms
  - Stabilizing algorithms
- Different failure models
  - Initially dead
  - Crashed
  - Byzantine
- Consensus is important
  - Transactions
  - Atomic broadcast
- Consensus is impossible in asynchronous systems with one possible fault