Composition Model and its code. bound:=bound+1.


Processes
–Processes and messages
–Automata

Distributed Algorithms/Protocols
A set of automata, one per process
Specification of correctness:
–Safety: no bad thing ever happens
–Liveness: eventually something good happens

Links

Models of Synchrony
–Synchronous
–Asynchronous
–Partially synchronous

Synchronous System
Characterized by three properties:
1. Synchronous processing: known upper bound on the time taken by a process to execute a basic step
2. Synchronous communication: known upper bound on the time taken by a message to reach its destination
3. Synchronous physical clocks: known upper bound on the drift of a local clock with respect to real time

Services Provided in Synchronous Systems
–Timed failure detection
–Measurement of transit delay
–Coordination based on time
–Worst-case performance guarantees (e.g., response time of a service in case of failures)
–Synchronized clocks
The major problem is the coverage of the synchrony assumption: it is difficult to build a system in which the timing assumptions hold with high probability.

Partial Synchrony
Distributed systems are generally synchronous most of the time, and then experience bounded periods of asynchrony. One way to capture partial synchrony is "eventual synchrony": there is an unknown time t after which the system becomes synchronous. This assumption captures the fact that the system does not always behave synchronously. It does not mean that:
–after t the whole system (including hardware, software and network components) becomes synchronous forever, or
–the system starts asynchronous and then, after some (possibly long) time, becomes synchronous.

What do we expect from partial synchrony? A period of synchrony long enough for the distributed algorithm to terminate.

Failure Detection
Partial synchrony requires abstract timing assumptions. Two choices:
–put the assumptions in the system model itself (including links and processes), or
–encapsulate the timing assumptions in a separate abstraction: the failure detector.

Failure Detectors
A software module to be used together with the process and link abstractions. It encapsulates the timing assumptions of a (partially or fully) synchronous system: the stronger the timing assumptions, the more accurate the information provided by the failure detector. Described by two properties:
–Accuracy
–Completeness
Note: manipulating time inside a protocol/algorithm is complex, and the correctness proofs become very involved.

Perfect Failure Detector (P)
Assume a synchronous system and the crash-stop abstraction. Using its own clock and the bounds of the synchrony model, a process can infer whether another process has crashed.

Perfect Failure Detector (P)
Characterized by two properties:
–Strong completeness: every process that crashes is eventually detected by every correct process
–Strong accuracy: no process is detected before it crashes

Discuss message pattern and correctness (note that links are perfect)
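The inference above can be sketched in a few lines of Python. This is an illustrative simulation, not the course's algorithm: it assumes a heartbeat period and a known transmission bound DELTA, so a process whose last heartbeat is older than PERIOD + DELTA ticks must have crashed; all names and constants are made up for the example.

```python
DELTA = 2   # assumed known upper bound (in ticks) on heartbeat transmission delay
PERIOD = 1  # assumed heartbeat period

def perfect_fd(heartbeat_times, now, all_procs):
    """Return the set of processes detected as crashed.

    In a synchronous system an alive process's heartbeat can never be
    more than PERIOD + DELTA ticks old, so detection is never wrong.
    """
    detected = set()
    for q in all_procs:
        last = heartbeat_times.get(q, -float("inf"))
        if now - last > PERIOD + DELTA:
            detected.add(q)
    return detected
```

For example, at tick 10 a process last heard from at tick 4 is detected, while one heard from at tick 9 is not.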

Eventually Perfect Failure Detector (◊P)
Assume partial synchrony: there is an (unknown) time t after which crashes can be accurately detected. Before t the system behaves as an asynchronous one, and the failure detector may make mistakes in that period, considering correct processes as crashed. The notion of detection therefore becomes suspicion.

Example
Use timeouts to suspect processes that did not send expected messages. A suspicion may be wrong: a process p may suspect another process q because the chosen timeout was too short. p is ready to reverse its judgement as soon as it receives a message from q (also updating the timeout value). If q has actually crashed, p never changes its judgement again.

Discuss message pattern and correctness (note that links are fair-lossy)

Correctness
–Strong completeness: if a process crashes, it stops sending messages. The process will therefore be suspected by every correct process, and no process will ever revise that judgement.
–Eventual strong accuracy: after time t the system becomes synchronous, i.e., after that time a message sent by a correct process p to another process q is delivered within a bounded time. If p was wrongly suspected by q, then q will revise its suspicion.
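The timeout-and-revise mechanism of ◊P can be sketched as follows. This is a minimal illustration under assumed names (not the course's code): on a timeout a process is suspected; when a message from a suspected process arrives, the suspicion is reversed and the timeout for that process is increased, so after the synchrony time t the timeout eventually exceeds the real delay and mistakes stop.

```python
class EventuallyPerfectFD:
    def __init__(self, procs, initial_timeout=1):
        self.timeout = {q: initial_timeout for q in procs}
        self.last_heard = {q: 0 for q in procs}
        self.suspected = set()

    def on_message(self, q, now):
        self.last_heard[q] = now
        if q in self.suspected:        # the suspicion was a mistake:
            self.suspected.discard(q)  # reverse the judgement and
            self.timeout[q] *= 2       # use a longer timeout next time

    def on_tick(self, now):
        for q, t in self.timeout.items():
            if now - self.last_heard[q] > t:
                self.suspected.add(q)
```

Doubling the timeout on every mistake is one arbitrary choice; any rule that grows the timeout unboundedly yields eventual strong accuracy.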

Eventual Leader Election (Ω)
Agree on a process that has not failed; this process can act as a coordinator.

Eventual Leader Election (Ω)
Using the crash-stop process abstraction, Ω is obtained directly from ◊P by applying a deterministic rule to the processes that are not suspected by ◊P, e.g., trust the process with the highest identifier among all processes not suspected by ◊P. Assume the existence of at least one correct process (otherwise Ω cannot be built).
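The deterministic rule above is one line of code. A minimal sketch (names are illustrative): given the current suspected set from ◊P, every process trusts the highest-identifier unsuspected process; since all processes eventually agree on the suspected set, they eventually agree on the leader.

```python
def leader(all_procs, suspected):
    """Trust the highest-id process not suspected by the local ◊P module."""
    candidates = [p for p in all_procs if p not in suspected]
    return max(candidates) if candidates else None
```

Any deterministic rule (lowest id, lexicographic order, ...) works equally well, as long as every process applies the same one.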

Eventual Leader Election (Ω)
Assuming crash-recovery and partial synchrony. Under this assumption, a correct process is:
1. a process that never crashes, or
2. a process that crashes but eventually recovers and never crashes again.

Ω with crash-recovery, fair-lossy links and timeouts

Epoch number: counts how many times a process has crashed and recovered.
possible: the list of "non-crashed" processes known to process p, the candidates to be leader.
select(possible): a deterministic function returning one process among all those in possible (the same at every process). Deterministic rule: return the process with the lowest epoch number, and among the ones with the same epoch number, the one with the lowest identifier. This rule is needed to avoid electing as leader a process that goes up and down infinitely many times.
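The select rule above can be sketched directly (an illustrative encoding, not the course's code): representing possible as a map from process id to epoch number, the leader is the minimum under the lexicographic order (epoch, id).

```python
def select(possible):
    """possible: dict mapping process id -> epoch number.

    Deterministic rule: lowest epoch number first (prefer stable
    processes), ties broken by lowest identifier.
    """
    return min(possible, key=lambda p: (possible[p], p))
```

Because the comparison is on the pair (epoch, id), a process that keeps crashing and recovering keeps raising its epoch number and eventually loses to any stable process.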

Correctness
Eventual accuracy: assume by contradiction that a correct process pi permanently trusts a faulty process pj. Two cases:
1. pj eventually crashes and never recovers: it stops sending heartbeats, so pi eventually stops trusting it.
2. pj keeps crashing and recovering forever:
 a. if pi stops receiving messages from pj (pj stays up only for finite periods of time), pi eventually stops trusting it;
 b. if pi keeps receiving messages from pj, then pj's epoch number increases forever, so eventually some process with a smaller epoch number is selected instead.
In every case pi stops trusting pj: a contradiction.

Reliable Broadcast

Best-Effort Broadcast (beb)
beb ensures the delivery of messages as long as the sender does not fail. If the sender fails, processes may disagree on whether or not to deliver the message (perfect links do not ensure delivery if the sender fails!).
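A minimal sketch of beb on top of perfect point-to-point links (illustrative names and parameters): the sender simply sends the message to every process, one link at a time. If the sender crashes mid-loop, only a prefix of the processes receives the message, which is exactly the disagreement described above.

```python
def beb_broadcast(sends_before_crash, procs, msg, delivered):
    """sends_before_crash: number of sends the sender completes before
    crashing (len(procs) if it does not crash). delivered maps each
    process to the list of messages it has beb-delivered."""
    for i, p in enumerate(procs):
        if i >= sends_before_crash:  # sender crashed: remaining sends lost
            break
        delivered.setdefault(p, []).append(msg)
```

With sends_before_crash = 2 and three destinations, the third process never delivers the message even though the links themselves are perfect.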

Regular Reliable Broadcast
A stronger form of reliability than beb.
Liveness: agreement.
Implementation: relaying messages.

Regular Reliable Broadcast
Uses a perfect failure detector P. The algorithm is lazy: it retransmits a message only if the sender has been detected as crashed.

Eager Reliable Broadcast
Does not use any failure detector. It is eager in the sense that it forwards any delivered message.
Performance:
–Best case: 1 communication step and N messages
–Worst case: N communication steps and N² messages
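The eager algorithm can be sketched as a small simulation (assumed names, not the course's pseudocode): every process relays a message to everyone the first time it delivers it, so even if the original sender crashed after reaching only some processes, every correct process ends up delivering.

```python
def eager_rb(n, crashed, initial_recipients):
    """Simulate eager reliable broadcast of one message among processes
    0..n-1. initial_recipients: processes the (possibly crashed) sender
    reached. Returns the set of processes that deliver the message."""
    delivered = set()
    queue = list(initial_recipients)
    while queue:
        p = queue.pop()
        if p in crashed or p in delivered:
            continue
        delivered.add(p)           # deliver on first receipt...
        queue.extend(range(n))     # ...and eagerly relay to everyone
    return delivered
```

For instance, with 4 processes where the sender (process 0) crashed after reaching only process 1, relaying still brings the message to processes 2 and 3.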

Uniform Reliable Broadcast
A stronger form of reliability than regular reliable broadcast: agreement on every message delivered by any process, crashed or not! Uniformity states that the set of messages delivered by a correct process is a superset of the ones delivered by a faulty one.

Assumption: majority of correct processes
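The role of the majority assumption can be sketched as a single delivery predicate (an illustrative fragment, not the full algorithm): a process urb-delivers a message only after seeing it relayed by a strict majority of processes. Since a majority must contain at least one correct process, that process keeps relaying, so no process — not even one that crashes right after delivering — can deliver a message that correct processes will miss.

```python
def can_urb_deliver(n, relayed_by):
    """relayed_by: set of processes known to have relayed the message.
    Deliver only once a strict majority of the n processes has relayed it,
    guaranteeing at least one correct process holds the message."""
    return len(relayed_by) > n // 2
```

With n = 5, three relays suffice (3 > 2), while two do not.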

Probabilistic Broadcast
Messages are delivered, say, 99% of the time: not fully reliable. Targets large and dynamic groups, where acks make reliable broadcast non-scalable.

Ack Implosion and Ack Trees
Problems:
–a process spends all its time handling acks (ack implosion)
–maintaining the tree structure

Probabilistic Broadcast

Gossip Dissemination
A process sends a message to a set of k randomly chosen processes. A process receiving the message for the first time forwards it to a set of k randomly chosen processes (this operation is also called a round). The algorithm performs a maximum number of r rounds.
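The gossip scheme above can be sketched as a simulation (illustrative parameter names; a real implementation would use actual message passing). Process 0 starts the gossip; each newly infected process forwards to k random peers, for at most r rounds; a process forwards only on its first receipt.

```python
import random

def gossip(n, k, r, seed=0):
    """Simulate gossip among processes 0..n-1 with fanout k and at most
    r rounds. Returns the set of processes that received the message."""
    rng = random.Random(seed)
    infected = {0}                       # process 0 starts the gossip
    frontier = [0]                       # processes that forward this round
    for _ in range(r):
        next_frontier = []
        for p in frontier:
            for q in rng.sample(range(n), k):
                if q not in infected:    # first receipt: deliver + forward
                    infected.add(q)
                    next_frontier.append(q)
        frontier = next_frontier
    return infected
```

With k well below n and r logarithmic in n, this delivers to almost all processes with high probability, trading full reliability for scalability exactly as described above.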