Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2007. Lecture 9: Paxos.

Presentation transcript:

Principles of Reliable Distributed Systems
Lecture 9: Paxos
Spring 2007
Prof. Idit Keidar

Material
Paxos Made Simple, Leslie Lamport, ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001)

Issues in the Real World I/III
Problem: Sometimes messages take longer than expected
Solution 1: Use longer timeouts
  – Slow convergence
Solution 2: Assume asynchrony
  – FLP
Solution 3: Assume eventual synchrony or unreliable failure detectors
  – See last week: MR Algorithm

Issues in the Real World II/III
Problem: Sometimes messages are lost
Solution 1: Use retransmissions
  – In case of transient partitions, a huge backlog can build up; catching up may take forever
  – More congestion, long message delays for extensive periods
Solution 2: Allow message loss
  – 2 Generals
Solution 3: Assume eventually reliable links
  – That’s what we’ll do today

Issues in the Real World III/III
Problem: Processes may crash and later recover (aka crash-recovery model)
Solution 1: Store information on stable storage (disk) and retrieve it upon recovery
  – What happens to messages arriving when they’re down?
  – See previous slide

MR and Unreliable Links
From MR Algorithm Phase II: wait for (r,e) from n-t processes
Transient message loss violates liveness
What if we move to the next round in case we can’t get n-t responses for too long?
  – Notice the next line in MR: if any non-⊥ value e received then val ← e

What If MR Didn’t Wait …
[Figure: message diagram over processes 1, 2, …, n. Labels: (1, v1); est = ⊥; (2, v2); “no waiting”; “no change of val”; (1, v1), “decide v1”; (1, ⊥), “will decide v2”.]

What Did We Learn?
Do not get stuck in a round
  – Move on upon timeout
  – Move on upon hearing that others moved on
But, a new leader before proposing a decision value must learn any possibly decided value (must check with a majority)

Paxos: Main Principles
Use “leader election” module
  – If you think you’re leader, you can start a new “ballot” (the Paxos name for a round)
Always join the newest ballot you hear about
  – Leave old ballots in the middle if you need to
Two phases:
  – First learn outcomes of previous ballots from a majority
  – Then propose a new value, and get a majority to endorse it

Leader Election Failure Detector
Ω – Leader
  – Outputs one trusted process
  – From some point, all correct processes trust the same correct process
Can easily implement ◊S
Is the weakest for consensus [Chandra, Hadzilacos, Toueg 96]

Ω Implementations
Easiest: use ◊P implementation
  – In eventual synchrony model
  – Output lowest id non-suspected process
Ω is implementable also in some situations where ◊P isn’t
Optimizations possible
  – Choose “best connected”, strongest, etc.
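The “lowest id non-suspected process” rule can be sketched as below. This is a minimal illustration, assuming the ◊P module exposes its current suspect set; the function name `omega_leader` and its parameters are hypothetical and do not come from the lecture.

```python
from typing import Iterable, Optional, Set


def omega_leader(process_ids: Iterable[int], suspected: Set[int]) -> Optional[int]:
    """Return the lowest-id process that is not currently suspected.

    `suspected` stands in for the output of a ◊P failure detector
    (an assumption of this sketch). Returns None if every process
    is suspected.
    """
    trusted = [p for p in process_ids if p not in suspected]
    return min(trusted) if trusted else None


# Process 1 is suspected, so the trusted leader becomes process 2.
leader = omega_leader([1, 2, 3, 4], suspected={1})
```

Once ◊P stops making mistakes, every correct process computes the same suspect set and therefore the same leader, which is the property Ω asks for.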

Paxos: The Practicality
Overcomes message loss without retransmitting entire message history
Tolerates crash and recovery
Does not rotate through dead coordinators
Used in replicated file systems
  – Frangipani (DEC, early 90s)
  – Nowadays Microsoft

The Part-Time Parliament [Lamport 88,98,01]
Recent archaeological discoveries on the island of Paxos reveal that the parliament functioned despite the peripatetic propensity of its part-time legislators. The legislators maintained consistent copies of the parliamentary record, despite their frequent forays from the chamber and the forgetfulness of their messengers. The Paxon parliament’s protocol provides a new way of implementing the state-machine approach to the design of distributed systems.

Annotation of TOCS 98 Paper
This submission was recently discovered behind a filing cabinet in the TOCS editorial office. …the author is currently doing field work in the Greek isles and cannot be reached … The author appears to be an archeologist with only a passing interest in computer science. This is unfortunate; even though the obscure ancient Paxon civilization he describes is of little interest to most computer scientists, its legislative system is an excellent model for how to implement a distributed computer system in an asynchronous environment.

The Setting
The data (ledger) is replicated at n processes (legislators)
Operations (decrees) should be invoked (recorded) at each replica (ledger) in the same order
Processes (legislators) can fail (leave the parliament)
At least a majority of processes (legislators) must be up (present in the parliament) in order to make progress (pass decrees)
  – Why majority?
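One way to approach the “Why majority?” question is to observe that any two majorities share at least one process, so information held by one majority is always visible to the next. The check below is an illustrative sketch, not part of the slides; the helper names `majorities` and `all_majorities_intersect` are hypothetical.

```python
from itertools import combinations


def majorities(n: int):
    """Yield every subset of {1..n} containing more than n/2 processes."""
    procs = range(1, n + 1)
    for size in range(n // 2 + 1, n + 1):
        for subset in combinations(procs, size):
            yield set(subset)


def all_majorities_intersect(n: int) -> bool:
    """Check that every pair of majorities shares at least one process."""
    return all(a & b for a, b in combinations(list(majorities(n)), 2))
```

By contrast, two groups of exactly half the processes, such as {1, 2} and {3, 4} when n = 4, can be disjoint, so each could make progress without learning of the other.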

Eventually Reliable Links
There is a time after which every message sent by a correct process to a correct process eventually arrives
Usual failure-detector-based algorithms do not work
  – Homework question

The Paxos (Ω) Atomic Broadcast Algorithm
Leader based: each process has an estimate of who is the current leader
To order an operation, a process sends it to its current leader
The leader sequences the operation and launches a Consensus algorithm (Synod) to fix the agreement

The (Synod) Consensus Algorithm
Solves non-terminating consensus in asynchronous system
  – or consensus in a partial synchrony system
  – or consensus using an Ω failure detector
Overcomes transient crashes & recoveries and message loss
  – can be modeled as message loss

The Consensus Algorithm Structure
Two phases
Leader contacts a majority in each phase
There may be multiple concurrent leaders
Ballots distinguish among values proposed by different leaders
  – unique, locally monotonically increasing
  – correspond to rounds of ◊S-based algorithm [MR]
  – processes respond only to leader with highest ballot seen so far

Ballot Numbers
Pairs ⟨num, process id⟩
⟨n1, p1⟩ > ⟨n2, p2⟩
  – if n1 > n2
  – or n1 = n2 and p1 > p2
Leader p chooses unique, locally monotonically increasing ballot number
  – if latest known ballot is ⟨n, q⟩
  – p chooses ⟨n+1, p⟩
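The ballot ordering above is a lexicographic order on pairs, which Python tuples already provide. The sketch below represents a ballot as a `(num, process_id)` tuple; the alias `Ballot` and the helper `next_ballot` are names chosen for this illustration only.

```python
from typing import Tuple

# A ballot is a pair (num, process id); Python compares tuples
# lexicographically, which matches the order defined on the slide.
Ballot = Tuple[int, int]


def next_ballot(latest_known: Ballot, my_id: int) -> Ballot:
    """If the latest known ballot is (n, q), process p chooses (n + 1, p)."""
    num, _ = latest_known
    return (num + 1, my_id)


# Process 2 has seen ballot (4, 7) and picks a strictly larger one.
chosen = next_ballot((4, 7), my_id=2)
```

Because the process id is part of the pair, two leaders that pick the same `num` still end up with distinct, comparable ballots.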

The Two Phases of Paxos
Phase 1: prepare
  – If you trust yourself by Ω (i.e., believe you are the leader)
      Choose new unique ballot number
      Learn outcome of all smaller ballots from majority
Phase 2: accept
  – Leader proposes a value with its ballot number
  – Leader gets majority to accept its proposal
  – A value accepted by a majority can be decided

Paxos - Variables
BallotNum_i, initially ⟨0,0⟩
  Latest ballot p_i took part in (phase 1)
AcceptNum_i, initially ⟨0,0⟩
  Latest ballot p_i accepted a value in (phase 2)
AcceptVal_i, initially ⊥
  Latest accepted value (phase 2)

Paxos Phase I: Prepare - Leader
Periodically, until decision is reached do:
  if leader (by Ω) then
    BallotNum ← ⟨BallotNum.num+1, myId⟩
    send (“prepare”, BallotNum) to all
Goal: contact other processes, ask them to join this ballot, and get information about possible past decisions

Paxos Phase I: Prepare - Cohort
Upon receive (“prepare”, bal) from i
  if bal ≥ BallotNum then
    BallotNum ← bal
    send (“ack”, bal, AcceptNum, AcceptVal) to i
This is a higher ballot than my current, I better join it!
Tell the leader about my latest accepted value

Paxos Phase II: Accept - Leader
Upon receive (“ack”, BallotNum, b, val) from n-t
  if all vals = ⊥ then myVal = initial value
  else myVal = received val with highest b
  send (“accept”, BallotNum, myVal) to all /* proposal */
The value accepted in the highest ballot might have been decided, better propose this value
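The leader’s choice of `myVal` can be sketched as a small function over the collected “ack” contents. This is an illustrative sketch: `None` stands in for ⊥, and the names `choose_value`, `Ack`, and `Ballot` are assumptions of the example rather than names from the slides.

```python
from typing import Any, List, Optional, Tuple

Ballot = Tuple[int, int]                # (num, process id)
Ack = Tuple[Ballot, Optional[Any]]      # (AcceptNum, AcceptVal); None plays the role of ⊥


def choose_value(acks: List[Ack], initial_value: Any) -> Any:
    """Pick the value to propose after hearing from n-t processes.

    If nobody reports an accepted value, the leader is free to propose
    its own initial value; otherwise it must adopt the value that was
    accepted in the highest ballot.
    """
    accepted = [(b, v) for b, v in acks if v is not None]
    if not accepted:
        return initial_value
    # Compare on the ballot only, so values need not be orderable.
    _, val = max(accepted, key=lambda bv: bv[0])
    return val
```

For example, with acks `[((1, 1), "a"), ((2, 3), "b"), ((0, 0), None)]` the leader proposes `"b"`, since ⟨2,3⟩ is the highest reported ballot.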

Paxos Phase II: Accept - Cohort
Upon receive (“accept”, b, v)
  if b ≥ BallotNum then
    AcceptNum ← b; AcceptVal ← v /* accept proposal */
    send (“accept”, b, v) to all (first time only)
This is not from an old ballot
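The cohort’s state and its two handlers can be put together in a short sketch. It assumes ballots are `(num, process_id)` tuples, uses `None` for ⊥, and returns the message to send instead of performing network I/O; the class name `Cohort` and its method names are hypothetical. The “first time only” de-duplication of the echoed accept is left out for brevity.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple

Ballot = Tuple[int, int]  # (num, process id)


@dataclass
class Cohort:
    ballot_num: Ballot = (0, 0)        # latest ballot joined (phase 1)
    accept_num: Ballot = (0, 0)        # latest ballot a value was accepted in (phase 2)
    accept_val: Optional[Any] = None   # latest accepted value; None plays the role of ⊥

    def on_prepare(self, bal: Ballot):
        """Join the ballot if it is not older than the current one."""
        if bal >= self.ballot_num:
            self.ballot_num = bal
            return ("ack", bal, self.accept_num, self.accept_val)
        return None  # old ballot: ignore

    def on_accept(self, b: Ballot, v: Any):
        """Accept the proposal unless it comes from an old ballot."""
        if b >= self.ballot_num:
            self.accept_num = b
            self.accept_val = v
            return ("accept", b, v)  # to be sent to all
        return None
```

A cohort that has acked ballot ⟨2,2⟩ will return `None` for an accept carrying ⟨1,1⟩, which is the promise that the “ack” represents.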

Paxos – Deciding
Upon receive (“accept”, b, v) from n-t
  decide v
  periodically send (“decide”, v) to all
Upon receive (“decide”, v)
  decide v
Why don’t we ever “return”?
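The decision rule can be sketched by counting distinct senders of matching accept messages. This is a minimal sketch under stated assumptions: values are hashable, each message carries a sender id, and the periodic re-sending of “decide” is omitted; the class `Decider` and its methods are hypothetical names.

```python
from collections import defaultdict
from typing import Any, Dict, Optional, Set, Tuple

Ballot = Tuple[int, int]  # (num, process id)


class Decider:
    def __init__(self, n: int, t: int) -> None:
        self.threshold = n - t
        # Distinct senders seen for each (ballot, value) proposal.
        self.senders: Dict[Tuple[Ballot, Any], Set[int]] = defaultdict(set)
        self.decision: Optional[Any] = None

    def on_accept(self, sender: int, b: Ballot, v: Any) -> Optional[Any]:
        """Decide v once n-t distinct processes sent ("accept", b, v)."""
        self.senders[(b, v)].add(sender)
        if self.decision is None and len(self.senders[(b, v)]) >= self.threshold:
            self.decision = v
        return self.decision

    def on_decide(self, v: Any) -> Any:
        """Adopt a decision announced by another process."""
        if self.decision is None:
            self.decision = v
        return self.decision
```

Duplicate accepts from the same sender are counted once, so retransmissions cannot push a proposal over the n-t threshold by themselves.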

In Failure-Free Synchronous Runs
[Figure: message flow among processes 1, 2, …, n. Labels: (“prepare”, ⟨1,1⟩); (“ack”, ⟨1,1⟩, ⟨0,0⟩, ⊥); (“accept”, ⟨1,1⟩, v1); (“accept”, ⟨1,1⟩, v1); decide v1.]
Simple Ω implementation always trusts process 1

Correctness: Agreement
Follows from Lemma 1: If a proposal (b,v) is accepted by a majority of the processes, then for every proposal (b’, v’) with b’ > b, it holds that v’ = v.

Proving Agreement Using Lemma 1
Let v be a decided value. The first process that decides v receives n-t accept messages for v with some ballot b, i.e., (b,v) is accepted by a majority.
No other value is accepted with the same b. Why?
Let (b1, v1) be the first proposal accepted by n-t. By Lemma 1, v1 is the only possible decision value.

To Prove Lemma 1
Use Lemma 2 (invariant): If a proposal (b,v) is sent, then there is a set S consisting of a majority such that either
  – no p in S accepts a proposal ranked less than b (all vals = ⊥); or
  – v is the value of the highest-ranked proposal among proposals ranked less than b accepted by processes in S (myVal = received val with highest b).

To Prove Lemma 2
A process can accept a proposal numbered b if and only if it has not responded to a prepare request having a number greater than b.
The “ack” response to “prepare” is a promise not to accept lower-ballot proposals in the future.

Termination
Assume no loss for a moment.
Once there is one correct leader:
  – It eventually chooses the highest ballot number
  – No other process becomes a leader with a higher ballot
  – All correct processes “ack” its prepare message and “accept” its accept message and decide

What About Message Loss?
Does not block in case of a lost message
  – Phase 1 can start with new rank even if previous attempts never ended
Conditional liveness: If n-t correct processes including the leader can communicate with each other then they eventually decide
Holds with eventually reliable links

Optimization
Allow process 1 (only!) to skip Phase 1
  – Initialize BallotNum to ⟨1,1⟩
  – Propose its own initial value
2 steps in failure-free synchronous runs
2 steps for repeated invocations with the same leader
  – Common case

Atomic Broadcast by Running A Sequence of Consensus Instances

The Setting
Data is replicated at n servers
Operations are initiated by clients
Operations need to be performed at all correct servers in the same order
  – state-machine replication

Client-Server Interaction (Benign Version)
Leader-based: each process (client/server) has an estimate of who is the current leader
A client sends a request to its current leader
The leader launches the Paxos consensus algorithm to agree upon the order of the request
The leader sends the response to the client

Failure-Free Message Flow
[Figure: message-flow diagram between client C and servers S1, S2, …, Sn. Labels: request; Phase 1: (“prepare”), (“ack”); Phase 2: (“accept”); response.]

Observation
In Phase 1, no consensus values are sent:
  – Leader chooses largest unique ballot number
  – Gets a majority to “vote” for this ballot number
  – Learns the outcome of all smaller ballots from this majority
In Phase 2, leader proposes either its own initial value or latest value it learned in Phase 1

Message Flow: Take 2
[Figure: message-flow diagram between client C and servers S1, S2, …, Sn. Labels: Phase 1: (“prepare”), (“ack”); request; Phase 2: (“accept”); response.]

Optimization
Run Phase 1 only when the leader changes
  – Phase 1 is called “view change” or “recovery mode”
  – Phase 2 is the “normal mode”
Each message includes BallotNum (from the last Phase 1) and ReqNum
  – e.g., ReqNum = 7 when we’re trying to agree what the 7th operation to invoke on the state machine should be
Respond only to messages with the “right” BallotNum

Paxos Atomic Broadcast: Normal Mode
Upon receive (“request”, v) from client
  if (I am not the leader) then forward to leader
  else /* propose v as request number n */
    ReqNum ← ReqNum + 1
    send (“accept”, BallotNum, ReqNum, v) to all
Upon receive (“accept”, b, n, v) with b = BallotNum
  /* accept proposal for request number n */
  AcceptNum[n] ← b; AcceptVal[n] ← v
  send (“accept”, b, n, v) to all (first time only)
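The normal-mode handlers can be sketched with one object per role. The sketch assumes Phase 1 has already fixed a common `BallotNum`, models “send to all” by returning the message, and omits forwarding by non-leaders and the “first time only” check; the class names `NormalModeLeader` and `NormalModeCohort` are hypothetical.

```python
from typing import Any, Dict, Optional, Tuple

Ballot = Tuple[int, int]               # (num, process id)
Accept = Tuple[str, Ballot, int, Any]  # ("accept", BallotNum, ReqNum, value)


class NormalModeLeader:
    def __init__(self, ballot_num: Ballot) -> None:
        self.ballot_num = ballot_num
        self.req_num = 0

    def on_request(self, v: Any) -> Accept:
        """Assign the next request number to v and propose it."""
        self.req_num += 1
        return ("accept", self.ballot_num, self.req_num, v)


class NormalModeCohort:
    def __init__(self, ballot_num: Ballot) -> None:
        self.ballot_num = ballot_num
        self.accept_num: Dict[int, Ballot] = {}  # AcceptNum[n]
        self.accept_val: Dict[int, Any] = {}     # AcceptVal[n]

    def on_accept(self, b: Ballot, n: int, v: Any) -> Optional[Accept]:
        """Accept request number n only if it carries the current ballot."""
        if b != self.ballot_num:
            return None
        self.accept_num[n] = b
        self.accept_val[n] = v
        return ("accept", b, n, v)  # to be sent to all
```

Each request number `n` behaves like its own consensus instance, which is why the accepted ballot and value are stored per `n`.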

Recovery Mode
The new leader must learn the outcome of all the pending requests that have smaller BallotNums
  – The “ack” messages include AcceptNums and AcceptVals of all pending requests
For all pending requests, the leader sends “accept” messages
What if there are holes?
  – e.g., leader learns of request number 13 and not of 12
  – fill in the gaps with dummy “do nothing” requests
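Filling holes with dummy requests can be sketched as follows. The sketch assumes the new leader has collected the pending requests into a dictionary keyed by request number; the `NOOP` marker and the function `fill_holes` are hypothetical names used only for this illustration.

```python
from typing import Any, Dict, List

NOOP = "no-op"  # hypothetical marker for a dummy "do nothing" request


def fill_holes(pending: Dict[int, Any], first: int) -> List[Any]:
    """Return the requests numbered first..max(pending), with gaps replaced by NOOP.

    `pending` maps request numbers to the values learned from "ack"
    messages; any number in the range that is missing becomes a NOOP.
    """
    if not pending:
        return []
    last = max(pending)
    return [pending.get(n, NOOP) for n in range(first, last + 1)]


# The leader learned of requests 11 and 13 but not 12.
proposals = fill_holes({11: "a", 13: "c"}, first=11)
```

The leader would then send an “accept” for each entry, so that request 13 can be applied once the dummy request 12 has been agreed upon.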

Leslie Lamport’s Reflections
Inspired by my success at popularizing the consensus problem by describing it with Byzantine generals, I decided to cast the algorithm in terms of a parliament on an ancient Greek island. To carry the image further, I gave a few lectures in the persona of an Indiana-Jones-style archaeologist. My attempt at inserting some humor into the subject was a dismal failure.

The History of the Paper by Lamport
I submitted the paper to TOCS in 1990. All three referees said that the paper was mildly interesting, though not very important, but that all the Paxos stuff had to be removed. I was quite annoyed at how humorless everyone working in the field seemed to be, so I did nothing with the paper. A number of years later, a couple of people at SRC needed algorithms for distributed systems they were building, and Paxos provided just what they needed. I gave them the paper to read and they had no problem with it. So, I thought that maybe the time had come to try publishing it again.