Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring Principles of Reliable Distributed Systems Lecture 9: SMR with Paxos and Authenticated Byzantine Paxos Spring 2008 Prof. Idit Keidar
Material
–The Part-Time Parliament, Lamport, TOCS 1998
–Practical Byzantine Fault-Tolerance, Castro and Liskov, OSDI 1999
–The ABCDs of Paxos, Lampson, 2001
SMR (Atomic Broadcast) by Running a Sequence of Consensus Instances
Reminder: State-Machine Replication (SMR)
Data is replicated at n servers
Operations are initiated by clients
Operations need to be performed at all correct servers in the same order
Servers need to agree upon the sequence of operations
Reminder: Paxos
Algorithm for state machine replication with eventual synchrony (ES)
–Uses failure detector (leader election)
Overcomes transient crashes & recoveries and message loss
Main component: (one-shot) consensus protocol (aka Synod)
–Covered last week
The 2 Phases of Paxos (Synod)
[Figure: message flow among processes 1, 2, ..., n, with messages (“prepare”, b), (“ack”, b, n’, v’), (“accept”, b, v), (“accept”, b, v)]
Phase 1: Learn about smaller ballots
Phase 2: Majority accepts v
SMR: Client-Server Interaction
Leader-based: each process (client/server) has an estimate of who is the current leader
A client sends a request to its current leader
–E.g., “store X 100”
The leader runs the Paxos (Synod) consensus algorithm to agree on the place of the request in the sequence
–Input value: request + proposed sequence number
–E.g., “store X 100” is the 7th operation in the sequence
The leader sends the response to the client
–After invoking the operation on its copy
Consensus per Request Number
Many consensus instances are running at the same time, each for some request number, ReqNum
Each node has unbounded arrays
–AcceptNum[r], AcceptVal[r], r = 1,2, …
–AcceptVal holds the client’s requested operation, e.g., “store X 100”
Invoke operations on the state machine
–In order: AcceptVal[1], then AcceptVal[2], etc.
–After the respective consensus algorithm decides
–Then send outcome as response to client (leader only)
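The in-order invocation rule can be illustrated with a small sketch. This is only an illustration, not the lecture's code: the class name ReplicaLog, the apply callback, and the on_decide method are hypothetical names introduced here, and the consensus instances themselves are assumed to run elsewhere and report their decisions.

```python
from typing import Callable, Dict, List


class ReplicaLog:
    """Invokes decided operations on the state machine strictly in ReqNum order."""

    def __init__(self, apply: Callable[[str], str]) -> None:
        self.apply = apply                 # state-machine transition (hypothetical)
        self.decided: Dict[int, str] = {}  # ReqNum -> decided operation
        self.next_req = 1                  # lowest ReqNum not yet invoked

    def on_decide(self, req_num: int, op: str) -> List[str]:
        """Record the decision of one consensus instance.

        Returns the outcomes of all operations that became invocable,
        i.e. the contiguous run of decided requests starting at next_req.
        """
        self.decided[req_num] = op
        outcomes = []
        while self.next_req in self.decided:
            outcomes.append(self.apply(self.decided[self.next_req]))
            self.next_req += 1
        return outcomes
```

A decision for request 2 that arrives before the decision for request 1 is buffered; both operations are invoked once request 1 is decided.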
Failure-Free Message Flow
[Figure: client C sends a request to servers S1, S2, ..., Sn. Phase 1: (“prepare”, b), (“ack”, b, n’, v’). Phase 2: (“accept”, b, r, v), (“accept”, b, r, v). A response returns to C. Labels: “Client’s request”, “ReqNum”.]
Observation
In Phase 1, no new consensus values are sent:
–Leader chooses largest unique ballot number
–Gets a majority to join this ballot number
–Learns the outcome of all smaller ballots from this majority
In Phase 2, leader proposes either its initial value (request from client) or the latest value it learned in Phase 1
Failure-Free Message Flow
[Figure repeated: client C, servers S1, S2, ..., Sn. Phase 1: (“prepare”, b), (“ack”, b, n’, v’). Phase 2: (“accept”, b, r, v), (“accept”, b, r, v). Request from C, response to C.]
Message Flow: Take 2
[Figure: Phase 1 among servers S1, S2, ..., Sn: (“prepare”, b), (“ack”, b, n’, v’). Phase 2: request from client C to S1, then (“accept”, b, r, v), (“accept”, b, r, v) among S1, S2, ..., Sn, then response to C.]
Optimization
Run Phase 1 only when the leader changes
–Phase 1 is called “view change” or “recovery mode”
–Phase 2 is the “normal mode”
Each message includes BallotNum (from the last Phase 1) and ReqNum
–E.g., ReqNum = 7 when we’re trying to agree what the 7th operation to invoke on the state machine should be
Respond only to messages with the “right” BallotNum
Paxos Atomic Broadcast: Normal Mode
Upon receive (“request”, v) from client
  if (I am not the leader) then forward to leader
  else /* propose v as the next request number */
    ReqNum ← ReqNum + 1
    send (“accept”, BallotNum, ReqNum, v) to all
Upon receive (“accept”, b, r, v) with b = BallotNum
  /* accept proposal for request number r */
  AcceptNum[r] ← b; AcceptVal[r] ← v
  send (“accept”, b, r, v) to all (first time only)
Recovery Mode
Run once per ballot, for all request numbers
–No ReqNum in “prepare” and “ack” messages
The new leader must learn the outcome of all the pending requests that have smaller BallotNums
–The “ack” messages include AcceptNum[r] and AcceptVal[r] for all pending requests
For each of the pending requests, the leader sends an “accept” message
What if there are holes?
–E.g., leader learns of request number 13 and not of 12
–Fill in the gaps with dummy “do nothing” requests
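The hole-filling step can be sketched as follows. This is a minimal illustration under assumptions: the function name fill_holes, the NOOP marker, and the first_undecided parameter are hypothetical, and the pending map is taken to be what the new leader learned from the “ack” messages.

```python
from typing import Dict

NOOP = "do nothing"  # dummy request used to fill gaps (name is an assumption)


def fill_holes(pending: Dict[int, str], first_undecided: int) -> Dict[int, str]:
    """Return the requests the new leader re-proposes in recovery mode.

    pending maps request numbers learned in Phase 1 to their values.
    Every request number from first_undecided up to the highest learned
    number gets a value; numbers the leader did not learn get NOOP.
    """
    if not pending:
        return {}
    highest = max(pending)
    return {r: pending.get(r, NOOP) for r in range(first_undecided, highest + 1)}
```

For example, a leader that learned of requests 11 and 13 but not 12 would send “accept” messages for 11, 12 (as a dummy request), and 13.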
Practical Byzantine Fault-Tolerance
Aka Byzantine Paxos
Reminder: Byzantine Faults
Faulty processes can behave arbitrarily, i.e., they don’t have to follow the protocol. E.g.,
–Can suffer benign failures: crash, timing
–Can send bogus values in messages
–Can send messages at the wrong time
–Can send different messages to different processes, etc.
Captures software bugs, hacker intrusions
Reminder: Authenticated (Byzantine) Model
Authentication: The receiver of a message can ascertain its origin
–An intruder cannot masquerade as someone else
Integrity: The receiver of a message can verify that it has not been modified in transit
–An intruder cannot substitute a false message for a legitimate one
Nonrepudiation: A sender cannot falsely deny later that he sent a message
Byzantine Fault-Tolerant Consensus: Overview of Results
A synchronous t-resilient algorithm exists
–iff t < n with authentication and weak unanimity
–iff t < n/2 with authentication and strong unanimity
–iff t < n/3 without authentication
An eventually synchronous (ES) t-resilient algorithm exists
–iff t < n/3 with or without authentication
Homework problem: show the lower bound
Overcoming Byzantine Failures With 3t+1 Processes
Recall what we did for crash failures
–We gathered “votes” from a majority in every ballot
–Since every two majorities intersect, for every two ballots, at least one process votes in both
But now, a faulty process can lie about what it did in the other ballot
–We want a correct process in the intersection
–Since n-t ≥ 2t+1, two sets of size n-t have at least one correct process in common
–Gather n-t votes in a ballot, to ensure that for every two ballots, at least one correct process votes in both
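The intersection claim can be worked out explicitly. The derivation below is a short sketch; Q_1 and Q_2 are names introduced here for any two vote sets of size n-t among the n processes.

```latex
\begin{align*}
|Q_1 \cap Q_2| &= |Q_1| + |Q_2| - |Q_1 \cup Q_2| \\
               &\ge (n-t) + (n-t) - n = n - 2t \\
               &\ge (3t+1) - 2t = t+1 \qquad \text{(since } n \ge 3t+1\text{)}
\end{align*}
```

At most t of these t+1 common processes are faulty, so at least one of them is correct.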
Byzantine Paxos Setting
State machine replication
Structured like Paxos:
–Updates are sent to the current leader
–Leader uses a consensus algorithm to have all replicas agree on the order of updates
–Our focus today is the consensus algorithm
Used to implement BFS, a Byzantine fault-tolerant NFS
–Only 3% slower than un-replicated NFS
Model
n processes: {1,…,n}
Up to t Byzantine failures, t < n/3
–For simplicity, assume n = 3t+1
Authentication (PKI)
Reliable links, no recovery (for now)
Reminder: Classic Paxos Phase I
Periodically, until decision is reached do:
  if leader (by Ω) then
    BallotNum ← ⟨BallotNum.num+1, myId⟩
    send (“prepare”, BallotNum) to all
Upon receive (“prepare”, b) from i
  if b ≥ BallotNum then
    BallotNum ← b
    send (“ack”, b, AcceptNum, AcceptVal) to i
Reminder: Classic Paxos Phase II
Upon receive (“ack”, BallotNum, b, val) from n-t
  if all vals = ⊥ then myVal = initial value
  else myVal = received val with highest b
  send (“accept”, BallotNum, myVal) to all /* proposal */
Upon receive (“accept”, b, v) with b ≥ BallotNum
  AcceptNum ← b; AcceptVal ← v /* accept proposal */
  send (“accept”, b, v) to all (first time only)
How Can Byzantine Failures Cause Problems?
Safety Problems: Leader Can Lie
Problem 1: Leader can choose a value different from the highest accepted by n-t processes
–Solution: Leader can “prove” it is not lying by sending the signed “ack” (Phase 1) messages to all processes
Problem 2: If no previous ballot was accepted, leader can send different new values to different processes
Solution to the 2nd Problem
Before accepting a value proposed by the leader, verify that the value was proposed to “enough” processes
Byzantine Paxos Phases:
–Phase 1: Prepare
–Phase 2: Propose (echo leader’s proposal)
–Phase 3: Accept (now only if n-t proposed)
Add new variable: PropNum, initially 0
Safety Problems: Others Can Lie
Problem 3: Faulty processes can send invalid “accept” messages
–Solution: Wait for n-t = 2t+1 “accept” messages
Problem 4: Faulty processes can send invalid values with higher AcceptNums in “ack” messages
–Solution: A process can “prove” its value is valid by forwarding signed “propose” messages
–Add new variable: Proof, initially empty set
Liveness Problems
Problem 5: Faulty leader can deadlock the algorithm
–Solution: Propose a new leader when the current one does not deliver
–Use a rotating coordinator until one is correct; the leader is (BallotNum mod n)+1
Problem 6: Faulty processes may keep selecting new leaders all the time (livelock)
–Solution: Accept a new ballot only if t+1 processes propose a new leader
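The two liveness rules are simple enough to state as code. The sketch below is illustrative only; the function names leader_for and should_join_ballot are hypothetical, and message authentication is assumed to have been checked before the sender set is built.

```python
from typing import Set


def leader_for(ballot_num: int, n: int) -> int:
    """Rotating coordinator: process ids are 1..n, leader is (BallotNum mod n)+1."""
    return (ballot_num % n) + 1


def should_join_ballot(prepare_senders: Set[int], t: int) -> bool:
    """Join a new ballot only if t+1 distinct processes asked for it.

    With at most t faulty processes, t+1 distinct senders include at
    least one correct process, so faulty processes alone cannot force
    a leader change.
    """
    return len(prepare_senders) >= t + 1
```

With n = 4 and t = 1, ballots 0, 1, 2, 3, 4 map to leaders 1, 2, 3, 4, 1, and a single process asking for a new ballot is ignored.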
And Now For Our Feature Presentation
The Byzantine Paxos Consensus Algorithm
Byzantine Paxos Variables
Int BallotNum, initially 0
Int PropNum, initially 0
Int AcceptNum, initially 0
Value ∪ {⊥} AcceptVal, initially ⊥
Message Set Proof, initially empty
Define: Leader = (BallotNum mod n)+1
Byzantine Paxos Phase I: Prepare
Upon timeout on Leader
  BallotNum ← BallotNum + 1
  send (“prepare”, BallotNum) to all
Upon receive (“prepare”, b) from t+1
  if (b < BallotNum) then return
  if (b > BallotNum) then
    BallotNum ← b
    send (“prepare”, BallotNum) to all
  send (“ack”, b, AcceptNum, AcceptVal, Proof) to Leader
Byzantine Paxos Phase II: Propose
Upon receive (“ack”, BallotNum, b, val, proof) from n-t
  S = {received (signed) “ack” messages}
  if (all vals that have valid proofs in S are ⊥) then myVal ← init value
  else myVal ← val that has valid proof with highest b in S
  send (“propose”, BallotNum, myVal, S) to all
Upon receive (“propose”, BallotNum, v, S)
  if (BallotNum ≤ PropNum) then return
  if (v is not a valid choice given S) then return
  PropNum ← BallotNum
  send (“propose”, BallotNum, v, S) to all
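The leader's value-selection rule, which every receiver re-checks as “a valid choice given S”, can be sketched as below. This is a simplified illustration under assumptions: the Ack type and the names choose_value and is_valid_choice are hypothetical, None stands for ⊥, and signature and proof verification are abstracted into a precomputed proof_ok flag.

```python
from typing import List, NamedTuple, Optional


class Ack(NamedTuple):
    accept_num: int            # AcceptNum reported by the sender
    accept_val: Optional[str]  # AcceptVal reported by the sender (None = ⊥)
    proof_ok: bool             # whether the attached Proof verified (abstracted)


def choose_value(acks: List[Ack], init_value: str) -> str:
    """Pick the value to propose from the set S of n-t signed acks."""
    proven = [a for a in acks if a.accept_val is not None and a.proof_ok]
    if not proven:
        return init_value  # nothing provably accepted: free choice
    return max(proven, key=lambda a: a.accept_num).accept_val


def is_valid_choice(v: str, acks: List[Ack]) -> bool:
    """Receiver-side check: v must match the rule unless the choice was free."""
    proven = [a for a in acks if a.accept_val is not None and a.proof_ok]
    return not proven or v == max(proven, key=lambda a: a.accept_num).accept_val
```

An ack carrying a high AcceptNum but no valid proof is ignored, which is how the forwarded “propose” messages stop a faulty process from injecting a bogus value.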
Byzantine Paxos Phase III: Accept
Upon receive (“propose”, b, v, S) from n-t
  if (b < BallotNum) then return
  AcceptNum ← b; AcceptVal ← v
  Proof ← set of n-t signed “propose” messages
  send (“accept”, b, v) to all
Upon receive (“accept”, b, v) from n-t
  decide v
In Failure-Free Runs
[Figure: message flow among processes 1, 2, ..., n in four rounds: prepare, ack, propose, accept. Annotations: “All send prepare”, “All echo propose”.]
Some Optimizations
Prepare and its “ack” can be merged into one message round
Proofs don’t have to be sent with messages: processes can have the information to check the proofs locally because the original messages are multicast
Invariant
If proposals (b, v) and (b, v’) are accepted by correct processes i and j (possibly i = j), then v’ = v
Proof:
–An accepted proposal is proposed by n-t processes
–Two sets of n-t = 2t+1 processes have at least one correct process in common
–A correct process sends no more than one propose message with the same b
Lemma 1
If a proposal (b, v) is accepted by t+1 correct processes, then for every proposal (b’, v’) with b’ > b that is proposed by a correct process, v’ = v.
Again, follows from Lemma 2…
Lemma 2
If a proposal (b, v) is proposed by a correct process, then there is a set S including at least t+1 correct processes such that either
–(1) no correct p in S accepts a proposal ranked less than b; or
–(2) v is the value of the highest-ranked proposal among proposals ranked less than b accepted by correct processes in S
Proving Lemma 1 from Lemma 2
Assume (b, v) is accepted by t+1 correct processes, and consider the lowest-ranked proposal (b’, v’) with b’ > b proposed by a correct process
Since two sets of t+1 correct processes have at least one correct process in common, case (1) of Lemma 2 is impossible, and by case (2), v’ = v
Continue by induction on ballot number
Proving Agreement
Let v be a decided value. The first process that decides v receives n-t accept messages for v with some ballot b, i.e., (b, v) is accepted by at least t+1 correct processes
No other value is accepted by a correct process with the same b. Why?
Let (b1, v1) be the first proposal accepted by n-t processes
By Lemma 1, v1 is the only possible decision value
Liveness
Is the current leader making progress?
–If yes, some correct process decides. This process can periodically forward the “proof” for its decision to others so they will decide too.
–If not, all time out on the leader and start a new ballot.
Once there is a correct leader:
–The n-t correct processes will send all the needed messages.
–The t faulty processes will not be able to force a new ballot.
Atomic Broadcast: Issues
Leader can propose invalid client requests
Leader can refrain from proposing client requests
Leader can lie to client about response
Leader can refrain from sending client responses
Solution: clients cannot trust a single server
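One way a client can avoid trusting a single server, used in Castro and Liskov's PBFT, is to accept a response only once t+1 servers report the same result. The slide does not spell out this rule, so the sketch below is an assumption about how the solution could look; the function name accepted_response is hypothetical, and responses are assumed to be authenticated, one per server.

```python
from collections import Counter
from typing import Dict, Optional


def accepted_response(responses: Dict[int, str], t: int) -> Optional[str]:
    """Return a result vouched for by at least t+1 servers, or None.

    responses maps server id to the result that server reported.
    With at most t faulty servers, t+1 matching reports include at
    least one from a correct server, so the result can be trusted.
    """
    if not responses:
        return None
    value, votes = Counter(responses.values()).most_common(1)[0]
    return value if votes >= t + 1 else None
```

With t = 1, two matching reports are enough; a single report, or two conflicting ones, leaves the client waiting.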
Byzantine Message Flow
[Figure: client C sends a request to servers S1, S2, ..., Sn, followed by rounds prepare, ack, propose, propose, accept among S1, S2, ..., Sn, and then a response to C.]