Timeliness, Failure Detectors, and Consensus Performance Idit Keidar and Alexander Shraer Technion – Israel Institute of Technology
Keidar & Shraer, Technion, Israel PODC 2006 Basic Model Message passing Links between every pair of processes –do not create, duplicate or alter messages (integrity) Process and link failures
Eventually Stable (Indulgent) Models
Initially asynchronous
  –for an unbounded period of time
Eventually reach stabilization
  –GST (Global Stabilization Time)
  –following GST, certain assumptions hold
Examples
  –ES (Eventual Synchrony): starting from GST, all links have a bound on message delay [Dwork, Lynch, Stockmeyer 88]
  –failure detectors [Chandra, Toueg 96], [Chandra, Hadzilacos, Toueg 96]
Indulgent Models: Research Trend
Weaken post-GST assumptions as much as possible [Guerraoui, Schiper 96], [Aguilera et al. 03, 04], [Malkhi et al. 05]
Weaker = better?
Indulgent Models: Research Trend
You only need ONE machine with eventually ONE timely link. Buy the hardware to ensure it, set the timeout accordingly, and EVERYTHING WILL WORK.
Consensus with Weak Assumptions
[Figure label: Network]
Why isn’t anything happening??? Don’t worry! It will eventually happen!
What’s Going On?
In practice, bounds just need to hold “long enough” for the algorithm to finish (time T_A)
But T_A depends on our synchrony assumptions
  –with weak assumptions, T_A might be unbounded
For practical systems, eventual completion of the job is not enough!
Our Goal
Understand the relationship between:
  –assumptions (1 timely link, failure detectors, etc.) that eventually hold
  –performance of algorithms that exploit these assumptions, and only them
Challenge: How do we understand the performance of asynchronous algorithms that make very different assumptions?
Typical Metric: Count “Rounds”
Algorithms normally progress in rounds, though rounds are not synchronized among processes
at process p_i:
  forever do
    send messages
    receive messages while (!some conditions)
    compute…
Previous work:
  –look at synchronous runs (every message takes exactly the same fixed time)
  –count rounds or message delays [Keidar, Rajsbaum 01], [Dutta, Guerraoui 02], [Guerraoui, Raynal 04], [Dutta et al. 03], etc.
Are All “Rounds” the Same?
Algorithm 1 waits for messages from a majority that includes a pre-defined leader in each round
  –takes 3 rounds
Algorithm 2 waits for messages from all (unsuspected) processes in each round
  –e.g., group membership
  –takes 2 rounds
GIRAF: General Round-based Algorithm Framework
Inspired by Gafni’s RRFD, generalizes it
Organize algorithms into rounds
Separate algorithm logic from waiting condition
Waiting condition defines model
Allows reasoning about lower and upper bounds for rounds of different types
GIRAF – The Generic Algorithm
Algorithm for process p_i:
  upon receive m
    add m to M (msg buffer)
  upon end-of-round                                (waiting condition controlled by env.)
    FD ← oracle(k)
    if (k = 0) then out_msg ← initialize(FD)       (your pet algorithm here)
    else out_msg ← compute(k, M, FD)
    k ← k+1
    enable sending of out_msg to all
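The generic algorithm above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the names `GirafProcess`, `make_max_process`, and `run_lockstep` are hypothetical, messages are tagged with the round in which they are sent, and the toy environment ends rounds in lock-step and delivers every message, which is far stronger than anything GIRAF requires.

```python
class GirafProcess:
    """One process p_i running the generic round loop (illustrative sketch)."""

    def __init__(self, pid, oracle, initialize, compute):
        self.pid = pid
        self.k = 0                    # current round number
        self.M = []                   # message buffer: (round, sender, payload)
        self.oracle = oracle          # oracle(pid, k) -> failure detector output
        self.initialize = initialize  # algorithm logic for round 0
        self.compute = compute        # algorithm logic for rounds k > 0

    def receive(self, rnd, sender, payload):
        # upon receive m: add m to M
        self.M.append((rnd, sender, payload))

    def end_of_round(self):
        # Called by the environment, which controls the waiting condition.
        fd = self.oracle(self.pid, self.k)
        if self.k == 0:
            out_msg = self.initialize(fd)
        else:
            out_msg = self.compute(self.k, self.M, fd)
        self.k += 1
        return self.k, out_msg        # out_msg is sent to all in the new round


def make_max_process(pid, value):
    """Hypothetical 'pet algorithm': propose a value, then forward the maximum seen."""
    return GirafProcess(
        pid,
        oracle=lambda pid, k: None,   # no failure detector in this toy example
        initialize=lambda fd: value,
        compute=lambda k, M, fd: max(payload for _, _, payload in M),
    )


def run_lockstep(procs, rounds):
    """Toy environment (assumption): all rounds end together, all messages arrive."""
    last = []
    for _ in range(rounds):
        outgoing = [(p.pid, *p.end_of_round()) for p in procs]
        for sender, rnd, payload in outgoing:
            for p in procs:
                p.receive(rnd, sender, payload)
        last = [payload for _, _, payload in outgoing]
    return last
```

For example, `run_lockstep([make_max_process(i, v) for i, v in enumerate([3, 7, 5])], 2)` has every process send its own value first and the maximum it has seen second. Swapping in a different environment changes the model without touching `initialize` or `compute`, which is the separation the slide describes.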
GIRAF’s Generality
Does not require rounds to be synchronized among processes
Can capture any oracle model
  –general failure detector model of [CHT96]
  –leader oracle + majority in each round
Messages can arrive in any round
  –allows for untimely albeit reliable links
Defining Properties in GIRAF
Environment can have
  –perpetual properties
  –eventual properties
In every run r, there exists a round GSR(r)
GSR(r): the first round from which:
  –no process fails
  –all eventual properties hold in each round
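To make the definition of GSR(r) concrete, here is a small sketch that computes it for a finite prefix of a run. Real runs are infinite, so treating a finite trace as the whole run is an assumption of this example, and the function name `gsr` and the boolean-per-round encoding are hypothetical.

```python
def gsr(trace):
    """Return the first round k such that every round from k onward is 'good'.

    trace[k] is True when, in round k, no process fails and all eventual
    properties hold. Returns None if the last round of the trace is not good.
    """
    first_good = None
    # Walk backwards from the end and stop at the first bad round.
    for k in range(len(trace) - 1, -1, -1):
        if not trace[k]:
            break
        first_good = k
    return first_good
```

Note that an isolated good round (round 1 in `[False, True, False, True, True]`) does not count: stabilization starts only at round 3, after the last bad round.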
Defining Timeliness
Timely link in round k: p_d receives the round k message of p_s in round k
  –if p_d is correct, and p_s executes round k (end-of-round_s occurs in round k)
Time-free!
Some Results: Context
Consensus problem
Global decision time metric
  –time until all correct processes decide
Message passing
Crash failures
  –t 1 processes
◊LM Model: Leader and Majority
Nothing required before GSR
In every round k ≥ GSR
  –every correct process receives a round k message from a majority of processes, one of which is the Ω-leader
Practically requires much shorter timeouts than Eventual Synchrony [Bakr, Keidar]
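The ◊LM round condition can be written as a simple predicate over who received whose message. The sketch below is illustrative only: the function name `satisfies_lm_round` and the encoding of a round as a dictionary of sender sets are assumptions, and "majority" is taken to mean strictly more than half of all n processes.

```python
def satisfies_lm_round(received, correct, leader, n):
    """Check the ◊LM condition for a single round k >= GSR.

    received[p] is the set of processes whose round-k message p received
    in round k; correct is the set of correct processes; leader is the
    process output by the Ω oracle; n is the total number of processes.
    """
    majority = n // 2 + 1
    return all(
        len(received[p]) >= majority and leader in received[p]
        for p in correct
    )
```

With n = 5 and leader 0, a round in which every correct process hears from processes {0, 1, 2} satisfies the condition, while a round in which some correct process hears from {1, 2, 3} does not, even though that is a majority.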
◊LM: Previous Work
Most Ω-based algorithms wait for majority in each round (not ◊LM)
Paxos [Lamport 98] works for ◊LM
  –takes a constant number of rounds in Eventual Synchrony (ES)
  –but how many rounds without ES?
Paxos Run in ES
[Figure: message diagram of a Paxos run with the Ω Leader. Labels shown: (“prepare”, 2), no, yes, (“prepare”, 21), yes, (Commit, 21, v1), decide v1.]
BallotNum: number of attempts to decide initiated by leaders
Paxos in ◊LM (w/out ES)
[Figure: message diagram of a Paxos run with the Ω Leader over rounds GSR, GSR+1, GSR+2, GSR+3. Labels shown: BallotNum, 2, (“prepare”, 2), (“prepare”, 9), (“prepare”, 14), ok, no (5), no (8), ok, no (13).]
Commit may take Ω(n) rounds!
What Can We Hope For?
Tight lower bound for ES: 3 rounds from GSR [DGK05]
◊LM is weaker than ES
One might expect consensus to take longer in ◊LM than in ES
Result 1: Don't Need ES
Leader and majority can give you the same performance!
Algorithm that matches the lower bound for ES!
Our ◊LM Algorithm in a Nutshell
Commit with increasing ballot numbers, decide on value committed by majority
  –like Paxos, etc.
Challenge: we don’t know all ballots, so how do we choose the new one to be the highest?
Solution: choose it to be the round number
Challenge: rounds are wasted if a prepare/commit fails
Solution: pipeline prepares and commits: try in each round
Challenge: do they really need to say no?
Solution: support the leader’s prepare even if you have a higher ballot number
  –challenge: a higher number may reflect a later decision! Won’t agreement be compromised?
  –solution: new field “trustMe” ensures the supported leader doesn't miss real decisions
Example Run: GSR=
[Figure: example run with the Ω Leader. Round labels: GSR+1, GSR, GSR. Labels shown: All PREPARE with !trustMe, All COMMIT 101, All DECIDE, Did not lead to decision.]
Question 2: ◊S and Ω Equivalent?
◊S and Ω are equivalent in the “classical” sense [Chandra, Hadzilacos, Toueg 96]
  –weakest for consensus
◊S: eventually (from GSR onward),
  –all faulty processes are suspected by every correct process
  –there exists one correct process that is not suspected by any correct process
Can we substitute Ω with ◊S in ◊LM?
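The two ◊S clauses above can be checked on a snapshot of failure detector outputs taken in one round at or after GSR. This is a hedged sketch: the function name `satisfies_eventual_s` and the dictionary encoding of suspicion lists are assumptions made for illustration, not part of the talk.

```python
def satisfies_eventual_s(suspects, correct, faulty):
    """Check the ◊S clauses on one snapshot of failure detector outputs.

    suspects[p] is the set of processes that correct process p currently
    suspects; correct and faulty are disjoint sets of process ids.
    """
    # Completeness: every correct process suspects every faulty process.
    completeness = all(faulty <= suspects[p] for p in correct)
    # Accuracy: some correct process is suspected by no correct process.
    accuracy = any(
        all(q not in suspects[p] for p in correct)
        for q in correct
    )
    return completeness and accuracy
```

Unlike an Ω output, a snapshot that passes this check need not name a single agreed leader: correct processes may still wrongly suspect each other, as long as at least one correct process is spared by everyone.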
Result 2: ◊S and Ω not that Equivalent
Consensus takes linear time from GSR
By reduction to the mobile failure model [Santoro, Widmayer 89]
Result 3: Do We Need Oracles?
Timely communication with majority suffices!
◊AFM (All-From-Majority), simplified:
  –in every round k ≥ GSR, every correct process p receives a round k message from a majority of processes, and p’s message reaches a majority of processes
Decision in 5 rounds from GSR
  –first constant-time algorithm w/out oracle or ES
  –idea: information passes to all nodes in 2 rounds
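The "2 rounds" idea follows from majority intersection: after one round a process's information is held by a majority, and in the next round every process hears from a majority, which must overlap the first. The simulation below is a sketch under simplifying assumptions: all n processes are correct, each message carries everything its sender knows, and the helper names (`afm_round`, `is_afm`, `spread`) and the rotated delivery pattern are hypothetical choices for generating rounds that satisfy the simplified ◊AFM condition.

```python
import random


def afm_round(n, rng):
    """Build one delivery pattern: received[d] = senders that d hears from.

    Processes are placed on a random ring and each hears from itself and the
    next n // 2 processes, so everyone hears from a majority and everyone's
    message reaches a majority.
    """
    m = n // 2 + 1
    ring = list(range(n))
    rng.shuffle(ring)
    return {ring[i]: {ring[(i + j) % n] for j in range(m)} for i in range(n)}


def is_afm(received, n):
    """Check the simplified ◊AFM condition for one round (all processes correct)."""
    m = n // 2 + 1
    hears_majority = all(len(senders) >= m for senders in received.values())
    reaches_majority = all(
        sum(1 for d in received if s in received[d]) >= m for s in range(n)
    )
    return hears_majority and reaches_majority


def spread(n, rounds, seed=0):
    """Return know[p]: the set of processes whose initial information p holds."""
    rng = random.Random(seed)
    know = {p: {p} for p in range(n)}
    for _ in range(rounds):
        received = afm_round(n, rng)
        assert is_afm(received, n)
        # Each receiver merges what its timely senders knew at the round start.
        know = {d: know[d].union(*(know[s] for s in received[d])) for d in range(n)}
    return know
```

Calling `spread(5, 2)` leaves every process knowing all five initial values, whereas `spread(5, 1)` leaves each process with only the three values it heard directly.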
Result 4: Can We Assume Less?
◊MFM: Majority from Majority
  –the rest receive a message from a minority
Only a little missing for ◊AFM
Stronger than models in literature [Aguilera et al. 03, 04], [Malkhi et al. 05]
Bounded time from GSR impossible!
Conclusions
Which guarantees should one implement?
  –weaker ≠ better: some previously suggested assumptions are too weak
  –sometimes a little stronger = much better: worth longer timeouts / better hardware
  –ES is not essential: not worth longer timeouts / better hardware
  –future: more models, bounds to explore
GIRAF