Paxos
Lamport the archeologist and the “Part-time Parliament” of Paxos:
  – The Part-time Parliament, TOCS 1998
  – Paxos Made Simple, ACM SIGACT News
  – Paxos Made Live, PODC 2007
  – Paxos Made Moderately Complex, (Cornell)
  – ……..
CS 271

The Paxos Atomic Broadcast Algorithm
Thanks to Idit Keidar for slides
Asynchronous system with crash failures
Leader based: each process has an estimate of who is the current leader
To order an operation, a process sends it to the current leader
The leader sequences the operation and launches a Consensus algorithm to fix the agreement

The Consensus Algorithm Structure
Two phases
Leader contacts a majority in each phase
There may be multiple concurrent leaders
Ballots distinguish among values proposed by different leaders
  – Unique, locally monotonically increasing
  – Processes respond only to the leader with the highest ballot seen so far

Ballot Numbers
Pairs ⟨num, process id⟩
⟨n1, p1⟩ > ⟨n2, p2⟩
  – If n1 > n2
  – Or n1 = n2 and p1 > p2
Leader p chooses a unique, locally monotonically increasing ballot number
  – If the latest known ballot is ⟨n, q⟩, then p chooses ⟨n+1, p⟩
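
The ordering on ballot numbers is lexicographic, which Python tuples already provide. The following is a minimal sketch; the names `Ballot` and `next_ballot` are illustrative and do not come from the slides.

```python
from typing import NamedTuple


class Ballot(NamedTuple):
    """A ballot number: the pair (num, process id)."""
    num: int
    pid: int


def next_ballot(latest: Ballot, my_id: int) -> Ballot:
    """If the latest known ballot is (n, q), process p chooses (n+1, p)."""
    return Ballot(latest.num + 1, my_id)


# Tuples compare lexicographically: first by num, then by process id.
assert Ballot(2, 1) > Ballot(1, 9)
assert Ballot(1, 2) > Ballot(1, 1)
assert next_ballot(Ballot(3, 7), my_id=2) == Ballot(4, 2)
```

Because the process id is part of the pair, two leaders that pick the same num still end up with distinct, comparable ballots.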

The Two Phases of Paxos
Phase 1: prepare
  – If you believe you are the leader
      Choose new unique ballot number
      Learn outcome of all smaller ballots from majority
Phase 2: accept
  – Leader proposes a value with its ballot number
  – Leader gets majority to accept its proposal
  – A value accepted by a majority can be decided

Paxos - Variables
BallotNum_i, initially ⟨0,0⟩
  Latest ballot p_i took part in (phase 1)
AcceptNum_i, initially ⟨0,0⟩
  Latest ballot p_i accepted a value in (phase 2)
AcceptVal_i, initially ⊥
  Latest accepted value (phase 2)

Phase I: Prepare - Leader
Periodically, until decision is reached do:
  if leader then
    BallotNum ← ⟨BallotNum.num+1, myId⟩
    send (“prepare”, BallotNum) to all
Goal: contact other processes, ask them to join this ballot, and get information about possible past decisions

Phase I: Prepare - Cohort
Upon receive (“prepare”, bal) from i
  if bal ≥ BallotNum then
    BallotNum ← bal
    send (“ack”, bal, AcceptNum, AcceptVal) to i
Notes:
  – bal ≥ BallotNum: this is a higher ballot than my current one, so I had better join it
  – The “ack” tells the leader about my latest accepted value and the ballot it was accepted in
  – The “ack” is a promise not to accept ballots smaller than bal in the future
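
The cohort's prepare handler can be sketched as follows. This is an illustration under assumptions: the class name `Cohort`, the method name `on_prepare`, returning the reply instead of sending it, and using `None` for ⊥ are all choices made here, not part of the slides.

```python
from typing import Any, Optional, Tuple

Ballot = Tuple[int, int]  # (num, process id), compared lexicographically


class Cohort:
    def __init__(self) -> None:
        self.ballot_num: Ballot = (0, 0)   # latest ballot joined (phase 1)
        self.accept_num: Ballot = (0, 0)   # ballot of latest accepted value
        self.accept_val: Any = None        # None stands in for ⊥

    def on_prepare(self, bal: Ballot) -> Optional[tuple]:
        """Join ballot `bal` if it is not older than the current one.

        Returns the "ack" to send back to the leader, or None to stay silent.
        """
        if bal >= self.ballot_num:
            self.ballot_num = bal  # promise: ignore smaller ballots from now on
            return ("ack", bal, self.accept_num, self.accept_val)
        return None
```

A stale leader simply gets no reply: after joining ⟨1,1⟩, a prepare for ⟨0,5⟩ returns `None` and leaves the state unchanged.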

Phase II: Accept - Leader
Upon receive (“ack”, BallotNum, b, val) from majority
  if all vals = ⊥ then myVal = initial value
  else myVal = received val with highest b
  send (“accept”, BallotNum, myVal) to all /* proposal */
Note: the value accepted in the highest ballot might have been decided, so the leader had better propose this value
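
The leader's choice of myVal can be sketched as a small pure function. The name `choose_value` and the representation of each ack as an `(accept_num, accept_val)` pair, with `None` standing in for ⊥, are assumptions made for illustration.

```python
from typing import Any, List, Tuple

Ballot = Tuple[int, int]  # (num, process id)


def choose_value(acks: List[Tuple[Ballot, Any]], initial_value: Any) -> Any:
    """Pick the value to propose from a majority of acks.

    Each ack is (accept_num, accept_val); accept_val is None if the
    cohort has not accepted anything yet.
    """
    accepted = [(b, v) for (b, v) in acks if v is not None]
    if not accepted:
        # Nobody in the majority accepted anything: free to use own value.
        return initial_value
    # Otherwise adopt the value accepted in the highest ballot.
    return max(accepted, key=lambda bv: bv[0])[1]
```

For example, with acks `[((1, 1), "a"), ((2, 3), "b"), ((0, 0), None)]` the leader must propose `"b"`, whatever its own initial value is.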

Phase II: Accept - Cohort
Upon receive (“accept”, b, v)
  if b ≥ BallotNum then
    AcceptNum ← b; AcceptVal ← v /* accept proposal */
    send (“accept”, b, v) to all (first time only)
Note: b ≥ BallotNum means this is not from an old ballot
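
A minimal sketch of the cohort's accept handler follows. The class and method names, the `relayed` set used to implement "first time only", and the requirement that values be hashable are assumptions of this sketch. As on the slide, BallotNum itself is not updated here.

```python
from typing import Any, Optional, Set, Tuple

Ballot = Tuple[int, int]  # (num, process id)


class Cohort:
    def __init__(self) -> None:
        self.ballot_num: Ballot = (0, 0)
        self.accept_num: Ballot = (0, 0)
        self.accept_val: Any = None          # None stands in for ⊥
        self.relayed: Set[tuple] = set()     # proposals already re-broadcast

    def on_accept(self, b: Ballot, v: Any) -> Optional[tuple]:
        """Accept proposal (b, v) unless it comes from an old ballot.

        Returns the "accept" message to broadcast the first time a given
        proposal is accepted, and None otherwise.
        """
        if b >= self.ballot_num:
            self.accept_num = b
            self.accept_val = v
            if (b, v) not in self.relayed:
                self.relayed.add((b, v))
                return ("accept", b, v)
        return None
```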

Paxos – Deciding
Upon receive (“accept”, b, v) from n-t
  decide v
  periodically send (“decide”, v) to all
Upon receive (“decide”, v)
  decide v
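
The decision rule can be sketched as a counter over distinct senders. This is an illustration only: the name `DecisionTracker`, counting per (b, v) pair, and requiring distinct senders are assumptions about how "from n-t" is meant; n is the number of processes and t the number of tolerated crashes.

```python
from collections import defaultdict
from typing import Any, Optional, Tuple

Ballot = Tuple[int, int]  # (num, process id)


class DecisionTracker:
    def __init__(self, n: int, t: int) -> None:
        self.threshold = n - t
        self.senders = defaultdict(set)   # (b, v) -> set of sender ids
        self.decided: Optional[Any] = None

    def on_accept(self, sender: int, b: Ballot, v: Any) -> Optional[Any]:
        """Record an "accept"; decide v once n-t distinct senders sent (b, v)."""
        if self.decided is None:
            self.senders[(b, v)].add(sender)
            if len(self.senders[(b, v)]) >= self.threshold:
                self.decided = v
        return self.decided

    def on_decide(self, v: Any) -> Any:
        """A "decide" message from another process settles the value."""
        if self.decided is None:
            self.decided = v
        return self.decided
```

With n = 5 and t = 2, three accepts for the same proposal from different processes are needed; repeated accepts from one process do not count twice.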

In Failure-Free Execution
[Figure: message flow among processes 1, 2, …, n, in order: (“prepare”, ⟨1,1⟩); (“ack”, ⟨1,1⟩, ⟨0,0⟩, ⊥); (“accept”, ⟨1,1⟩, v1); (“accept”, ⟨1,1⟩, v1); decide v1]

Why is this phase needed? Performance?
[Figure: failure-free message flow among processes 1, 2, …, n, in order: (“prepare”, ⟨1,1⟩); (“ack”, ⟨1,1⟩, ⟨0,0⟩, ⊥); (“accept”, ⟨1,1⟩, v1); (“accept”, ⟨1,1⟩, v1)]

Failure-Free Execution
[Figure: client C and servers S1, S2, …, Sn. Messages in order: request; Phase 1 (“prepare”), (“ack”); Phase 2 (“accept”); response]

Observation
In Phase 1, no consensus values are sent:
  – Leader chooses largest unique ballot number
  – Gets a majority to “vote” for this ballot number
  – Learns the outcome of all smaller ballots
In Phase 2, leader proposes its own initial value or latest value it learned in Phase 1

Failure free execution
[Figure: client C and servers S1, S2, …, Sn, with messages request, (“prepare”), (“ack”), (“accept”), response, grouped into Phase 1 and Phase 2]

Optimization
Run Phase 1 only when the leader changes
  – Phase 1 is called “view change” or “recovery mode”
  – Phase 2 is the “normal mode”
Each message includes BallotNum (from the last Phase 1) and ReqNum
Respond only to messages with the “right” BallotNum

Paxos Atomic Broadcast: Normal Mode
Upon receive (“request”, v) from client
  if (I am not the leader) then forward to leader
  else /* propose v as request number n */
    ReqNum ← ReqNum + 1
    send (“accept”, BallotNum, ReqNum, v) to all
Upon receive (“accept”, b, n, v) with b = BallotNum
  /* accept proposal for request number n */
  AcceptNum[n] ← b; AcceptVal[n] ← v
  send (“accept”, b, n, v) to all (first time only)
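
Normal mode can be sketched as two handlers on a replica. This is a sketch under assumptions: the class `Replica`, its constructor parameters, the `("forward", …)` return value, and returning messages instead of sending them are all illustrative choices, not taken from the slides.

```python
from typing import Any, Dict, Optional, Tuple

Ballot = Tuple[int, int]  # (num, process id)


class Replica:
    def __init__(self, my_id: int, leader_id: int, ballot_num: Ballot) -> None:
        self.my_id = my_id
        self.leader_id = leader_id
        self.ballot_num = ballot_num      # fixed by the last Phase 1
        self.req_num = 0
        self.accept_num: Dict[int, Ballot] = {}
        self.accept_val: Dict[int, Any] = {}

    def on_request(self, v: Any) -> tuple:
        """Client request: forward it, or propose it under the next ReqNum."""
        if self.my_id != self.leader_id:
            return ("forward", self.leader_id, v)
        self.req_num += 1
        return ("accept", self.ballot_num, self.req_num, v)

    def on_accept(self, b: Ballot, n: int, v: Any) -> Optional[tuple]:
        """Accept request number n only if it carries the "right" ballot."""
        if b != self.ballot_num:
            return None
        first_time = self.accept_num.get(n) != b
        self.accept_num[n] = b
        self.accept_val[n] = v
        # Re-broadcast only the first time this request number is accepted.
        return ("accept", b, n, v) if first_time else None
```

Each request number is effectively its own consensus instance, all sharing the ballot established by the last Phase 1.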

Recovery Mode
The new leader must learn the outcome of all the pending requests that have smaller BallotNums
  – The “ack” messages include AcceptNums and AcceptVals of all pending requests
For all pending requests, the leader sends “accept” messages
What if there are holes?
  – e.g., leader learns of request number 13 and not of 12
  – fill in the gaps with dummy “do nothing” requests
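
The hole-filling step can be sketched as a merge over the acks a new leader receives. This is an illustration under assumptions: the function name `recover_log`, the `NOOP` marker, the representation of each ack as a dict from request number to `(accept_num, accept_val)`, and filling only between the lowest and highest reported request numbers are choices made here, not specified on the slides.

```python
from typing import Any, Dict, List, Tuple

Ballot = Tuple[int, int]  # (num, process id)
NOOP = "do nothing"       # dummy request used to fill holes


def recover_log(acks: List[Dict[int, Tuple[Ballot, Any]]]) -> Dict[int, Any]:
    """Merge pending requests from a majority of acks and fill the gaps.

    For each request number, keep the value accepted in the highest
    ballot; request numbers nobody reported become NOOP.
    """
    best: Dict[int, Tuple[Ballot, Any]] = {}
    for ack in acks:
        for n, (b, v) in ack.items():
            if n not in best or b > best[n][0]:
                best[n] = (b, v)
    if not best:
        return {}
    lo, hi = min(best), max(best)
    return {n: (best[n][1] if n in best else NOOP) for n in range(lo, hi + 1)}
```

The leader would then send an “accept” for every entry of the returned mapping, so that request 13 is not left waiting on a request 12 that may never have been proposed.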