How to Choose a Timing Model? Idit Keidar and Alexander Shraer Technion – Israel Institute of Technology

How do you survive failures and achieve high availability?

Replication

State Machine Replication
Replicas are identical deterministic state machines.
Processing operations in the same order ⇒ they remain consistent.
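As a toy illustration (hypothetical code, not from the talk), two deterministic replicas that apply the same operations in the same order necessarily end in the same state:

# Minimal illustration: deterministic replicas that process the same
# operations in the same order end up in the same state.

class CounterReplica:
    """A trivial deterministic state machine: a single integer register."""
    def __init__(self):
        self.value = 0

    def apply(self, op):
        # op is ("add", n) or ("set", n); applying an op is deterministic
        kind, n = op
        if kind == "add":
            self.value += n
        elif kind == "set":
            self.value = n
        return self.value

ops = [("set", 5), ("add", 3), ("add", -1)]   # the agreed-upon order
r1, r2 = CounterReplica(), CounterReplica()
for op in ops:                                # both replicas apply the same sequence
    r1.apply(op)
    r2.apply(op)
assert r1.value == r2.value == 7              # identical states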

Consensus
Building block for state machine replication.
Each process has an input and should decide on an output so that:
– Agreement: decisions are the same
– Validity: the decision is the input of some process
– Termination: eventually all correct processes decide
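As an illustration (a hypothetical checker, not part of the talk), the three properties can be checked on a finished run as follows:

# Hypothetical sanity checker for the consensus properties on a finished run.
# inputs[i] is process i's input; decisions[i] is its decision (None if it
# crashed before deciding); correct is the set of processes that did not crash.

def check_consensus(inputs, decisions, correct):
    decided = [v for v in decisions if v is not None]
    agreement = len(set(decided)) <= 1                            # all decisions equal
    validity = all(v in inputs for v in decided)                  # each decision is some input
    termination = all(decisions[i] is not None for i in correct)  # correct processes decide
    return agreement and validity and termination

# Example: processes 0..2, process 2 crashed before deciding.
print(check_consensus(inputs=[0, 1, 1], decisions=[1, 1, None], correct={0, 1}))  # True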

Basic Model
– Message passing
– Channels between every pair of processes; channels do not create, duplicate, or alter messages (integrity)
– Failures
What about timing?

Synchronous Model
Very convenient for algorithms
– understanding performance
– early decision with no/few failures

Synchronous Model: Limitation
Requires very conservative (long) timeouts
– in practice, the average latency can be about 100× smaller than the maximum latency [Cardwell, Savage, Anderson 2000], [Bakr-Keidar 2002]

Asynchronous Model
– Unbounded message delay
– Much more practical
– Fault-tolerant consensus is impossible [FLP85]

Eventually Stable (Indulgent) Models
Initially asynchronous
– for an unbounded period of time
Eventually reach stabilization
– GST (Global Stabilization Time)
– following GST, certain assumptions hold
Examples
– ES (Eventual Synchrony): all links are ◊timely [Dwork, Lynch, Stockmeyer 88]
– failure detectors: Ω (eventual leader), ◊S [Chandra, Toueg 96], [Chandra, Hadzilacos, Toueg 96]

Why Eventual Stabilization?
Because "eventually" formally models "most of the time" (in stable periods).
In practice, stability does not have to last forever, just "long enough" for the algorithm (T_A).
T_A depends on our synchrony assumptions!

Our Goals
1. Understand the relationship between:
– assumptions (number of timely links, with or without Ω, etc.), and
– the performance of algorithms that exploit them,
in runs that eventually satisfy these assumptions
– unlike stable runs in previous work
and only these assumptions
– unlike synchronous runs in previous work
2. Understand how message complexity affects performance.

Reminder – GIRAF [Keidar & Shraer 06]
General Round-based Algorithm Framework
– Organizes algorithms into rounds
– Separates algorithm logic from the waiting condition
– Does not require rounds to be synchronized among processes
– Allows messages to arrive in any round
– Can capture any oracle model of [CHT 96]
– Can express models that cannot be expressed in RRFD [Gafni 98]

GIRAF – The Generic Algorithm
Algorithm for process p_i:
  upon receive m:
    add m to M (msg buffer)
  upon end-of-round:                                // waiting condition controlled by the environment
    FD ← oracle(k)
    if (k = 0) then ⟨out_msg, Dest⟩ ← initialize(FD)
    else ⟨out_msg, Dest⟩ ← compute(k, M, FD)        // "your pet algorithm here"
    k ← k+1
    enable sending of out_msg to Dest
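A minimal, hypothetical Python rendering of this round skeleton for one process (not the authors' code; oracle, initialize, compute, waiting_condition, send, and recv_pending are placeholder callbacks supplied by a concrete algorithm and environment):

# Hypothetical, simplified rendering of the GIRAF round skeleton for one process.
# The environment decides when "end of round" fires; a plug-in waiting_condition
# stands in for that here.

def run_giraf_process(oracle, initialize, compute, waiting_condition, send, recv_pending):
    k = 0
    M = []                                  # message buffer: all messages received so far
    while True:
        M.extend(recv_pending())            # "upon receive m: add m to M"
        if not waiting_condition(k, M):     # end-of-round is controlled by the environment
            continue
        fd = oracle(k)                      # query the oracle (e.g., a failure detector)
        if k == 0:
            out_msg, dest = initialize(fd)
        else:
            out_msg, dest = compute(k, M, fd)   # the algorithm's own logic
        k += 1
        send(out_msg, dest)                 # enable sending of out_msg to Dest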

Defining Properties in GIRAF
The environment can have
– perpetual properties, φ
– eventual properties, ◊φ
In every run r, there exists a round GSR(r).
GSR(r) – the first round from which:
– no process fails
– all eventual properties hold in each round

Example Communication Properties
Timely link in round k: p_d receives the round-k message of p_s in round k
– if p_d is correct, and p_s executes round k (end-of-round_s occurs in round k)
j-source: the same j timely outgoing links in every round
j-source_v: j timely outgoing links in every round (can vary in each round)
j-destination: the same j timely incoming links from correct processes in every round
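To make the definitions concrete, here is a small hypothetical helper (not from the talk) that, given which directed links were timely in one round, reports each process's number of timely outgoing and incoming links, i.e., the quantities bounded by the j-source_v and j-destination-style properties:

# Hypothetical illustration of the per-round link counts behind j-source / j-destination.
# timely[s][d] is True if p_d received p_s's round-k message within round k (s != d).

def link_counts(timely, correct):
    n = len(timely)
    out_links = {s: sum(1 for d in range(n) if d != s and d in correct and timely[s][d])
                 for s in range(n)}
    in_links = {d: sum(1 for s in range(n) if s != d and s in correct and timely[s][d])
                for d in range(n)}
    return out_links, in_links

# Example round with 3 correct processes; p0 -> p1 and p0 -> p2 timely, p1 -> p0 timely.
timely = [[False, True, True],
          [True, False, False],
          [False, False, False]]
out_links, in_links = link_counts(timely, correct={0, 1, 2})
print(out_links)  # {0: 2, 1: 1, 2: 0} -> p0 is a 2-source in this round
print(in_links)   # {0: 1, 1: 1, 2: 1} -> every process is a 1-destination in this round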

Example Oracle Properties
leader: ∃ a correct process p_i s.t. ∀ round k and ∀ process p_j: oracle_j(k) = i
– the range of oracle() is Π (the set of processes)
Ω failure detector: ◊leader

Timing Models
ES (Eventual Synchrony) [Dwork et al. 88]
– all links between correct processes are ◊timely (from some round onward, the link delivers messages in the round they were sent)
– consensus in 3 rounds (optimal) [Dutta et al. 04]
◊AFM (All-From-Majority), simplified:
– every correct process is ◊majority-destination_v (a majority of ◊timely incoming links) and ◊majority-source_v (a majority of ◊timely outgoing links); the subscript v means the majority can change in each round
– consensus in 5 rounds [Keidar & Shraer 06]
◊LM (Leader and Majority):
– Ω (from some round onward, one process is trusted by all: the leader); the leader is ◊n-source, and every correct process is ◊majority-destination_v
– consensus in 3 rounds [Keidar & Shraer 06]

New Model: ◊WLM
Ω, and the leader is ◊n-source and ◊majority-destination_v
– unlike all processes in ◊LM
– similar to [Malkhi et al. 05], a little stronger
Previous Work
– Most Ω-based algorithms wait for a majority in each round
– Paxos [Lamport 98] can make progress in ◊WLM
– takes a constant number of rounds in ES
– but how many rounds without ES?

Paxos Run in ES
The Ω leader issues ("prepare", BallotNum) messages; a prepare may be answered "no" by a process that has seen a higher ballot, but once a majority answers "yes" the leader sends (Commit, BallotNum, v_1) and all processes decide v_1.
BallotNum – the number of attempts to decide initiated by leaders.

Paxos in ◊WLM (without ES)
Even after GSR, each of the leader's ("prepare", b) messages can be rejected ("no") by a process that has already seen a higher ballot number, forcing the leader to retry with a larger ballot round after round. Commit takes O(n) rounds!

New Consensus Algorithm for ◊WLM
– Tolerates unbounded periods of asynchrony
– A minority of processes can crash
– Message efficient: O(n) stable-state message complexity
– Achieves global decision in 4 rounds if the leader is stable before GSR, 5 otherwise

Our ◊WLM Algorithm in a Nutshell
Commit with increasing ballot numbers; decide on a value committed by a majority (like Paxos, etc.)
– Challenge: a process does not know all ballots, so how can it choose the new ballot to be the highest one? Solution: use the round number as the ballot.
– Challenge: rounds are wasted if a prepare/commit fails. Solution: pipeline prepares and commits, trying in each round.
– Challenge: do processes really need to say "no"? Solution: support the leader's prepare even when holding a higher ballot number.
– Challenge: a higher ballot number may reflect a later decision, so won't agreement be compromised? Solution: a new field, "trustMe", ensures that a supported leader does not miss real decisions: it is set in round k+1 if a majority trusted the leader in round k.

Example Run
Starting from GSR: all processes Prepare (with trustMe not set), then all Commit (which did not yet lead to a decision); the leader then decides, and in the following round all processes decide.

Comparing The Models

Probabilistic Analysis
Each link is timely with probability p in each round
– Independent and Identically Distributed (IID) Bernoulli random variables
Other simplifying assumptions:
– synchronous rounds
– no failures
A good starting point for understanding behavior in real systems.
For each model M, calculate:
– P_M – the probability that the requirements of M hold in a round
– the expected number of rounds until the requirements of M hold long enough
– E(D_M) – the expected number of rounds until (global) decision in M
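A rough Monte-Carlo sketch of this style of analysis (hypothetical code under the simplified IID assumptions above; the exact majority counting for the ◊WLM-style requirement is simplified, and it only estimates the per-round probability P_M, not E(D_M)):

# Hypothetical Monte-Carlo estimate of the per-round probability P_M under the
# IID model: each directed link is timely with probability p, independently.
import random

def round_is_ES(n, p):
    # ES requirement: all n*(n-1) directed links are timely this round.
    return all(random.random() < p for s in range(n) for d in range(n) if s != d)

def round_is_WLM_like(n, p, leader=0):
    # Simplified ◊WLM-style requirement: the leader's outgoing links are all
    # timely and at least half of its incoming links are timely (the paper's
    # exact majority counting may differ).
    out_ok = all(random.random() < p for d in range(n) if d != leader)
    in_ok = sum(random.random() < p for s in range(n) if s != leader) >= n // 2
    return out_ok and in_ok

def estimate(pred, trials=100_000, **kw):
    return sum(pred(**kw) for _ in range(trials)) / trials

n, p = 8, 0.97
print("P_ES  ~", estimate(round_is_ES, n=n, p=p))        # all links must be timely
print("P_WLM ~", estimate(round_is_WLM_like, n=n, p=p))  # leader-centric requirement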

Comparing the Models (IID)
Expected number of rounds for global decision (n = 8). ES requires 350 rounds even for p = 0.97.

LAN Measurements
How frequent is a stable round in each model?
– compare the measured P_M to the IID prediction
For IID: p = fraction of timely messages (over all rounds)
– Example: for timeout = 0.1 ms, p = 0.7; for timeout = 0.2 ms, p = 0.976
ES is slightly better in practice (a slow round)
◊AFM is slightly worse in practice (a slow node)
◊WLM and ◊LM are better in practice (good leader)
◊WLM rounds are the most frequent!

GIRAF Implementation for WAN
Some round synchronization is needed for all models
– in a LAN, computers often have synchronized clocks
A simple algorithm to implement GIRAF:
– L_i[j]: the average latency between n_i and n_j, as measured by n_i (pings)
– timeout: an input parameter

Receiver thread:
  upon receive m:
    add m to M (msg buffer)
    if m belongs to round k_j > k_i, notify the sender thread

Sender thread:
  in each round: send message to peers; wait for timeout; compute next round msg.
  upon notify: stop waiting and jump to round k_j (round duration: timeout – L_i[j])
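A hypothetical, self-contained sketch of this round-synchronization idea (not the authors' implementation; the network is stubbed out with an in-process queue, and the timeout – L_i[j] latency compensation is omitted for brevity):

# Hypothetical sketch of GIRAF round synchronization over an unsynchronized network.
# A sender loop paces rounds with a timeout; the receiver cuts the wait short and
# jumps ahead when it sees a message from a higher round.
import threading, queue

class RoundSync:
    def __init__(self, timeout, send_to_peers):
        self.timeout = timeout
        self.send_to_peers = send_to_peers   # callback: (round, payload) -> None
        self.inbox = queue.Queue()           # stands in for the network receive path
        self.k = 0                           # current round number
        self.M = []                          # message buffer
        self.skip = threading.Event()        # set when we must jump to a later round
        self.jump_to = 0

    def receiver(self):
        while True:
            k_j, m = self.inbox.get()        # "upon receive m"
            self.M.append((k_j, m))          # add m to M
            if k_j > self.k:                 # message from a future round:
                self.jump_to = k_j
                self.skip.set()              # notify the sender thread

    def sender(self, compute):
        while True:
            self.send_to_peers(self.k, compute(self.k, self.M))  # send round-k message
            if self.skip.wait(self.timeout): # wait for timeout, or until notified
                self.skip.clear()
                self.k = self.jump_to        # jump to round k_j (latency compensation omitted)
            else:
                self.k += 1                  # timeout expired: advance to the next round

A real deployment would start receiver() and sender() in separate threads and wire send_to_peers and inbox to actual sockets.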

WAN Measurements
Questions:
– How frequent is a stable round in each model? (P_M)
– For each model M, measure the time and number of rounds until global decision in M
– How to set the timeout?
The experiment: 33 runs for each timeout, 300 rounds per run
– a run is represented by the average over 15 different points in the run
Asynchronous node startup
– rounds before the model's first stable round are not considered

Question 1: Stable Rounds (P_M)
Up to 99% of messages arrive by timeout = 350 ms.
– Waiting for 100% requires orders of magnitude longer [Cardwell et al. 98]
◊LM is sensitive to a single slow node: in some runs P_◊LM = 95%, in others P_◊LM = 15%
◊AFM is consistently low: around 40%
ES is consistently rare for small timeouts; occasionally good for larger timeouts (sensitive to individual slow messages)
◊WLM rounds are the most frequent (15% better than ◊LM for 160 ms), with the lowest variance!

Question 2: Global Decision
◊WLM is best for timeouts < 180 ms, and the same as the others for higher timeouts.
The choice of leader matters... with a bad leader, use ◊AFM.

Question 3: Choosing the Timeout
Tradeoff:
– longer timeouts: more stable rounds and fewer rounds until decision, but each round takes longer, so the total decision time can grow
– shorter timeouts: more rounds, each one shorter
– the values are right for our system and might be different for yours
With their respective optimal timeouts, ◊WLM is just 80 ms worse.

Conclusions
◊WLM – a new timing model
A new algorithm for ◊WLM
– tolerates unbounded periods of asynchrony
– O(n) stable-state message complexity
– achieves global decision in 4 or 5 rounds
Thanks to the weak stability requirements, our algorithm's performance is better than or comparable to that of algorithms that take fewer rounds
– even though those algorithms send more messages (Ω(n²))