Consensus and Consistency When you can’t agree to disagree.

Slides:



Advertisements
Similar presentations
Paxos and Zookeeper Roy Campbell.
Advertisements

CS 542: Topics in Distributed Systems Diganta Goswami.
CS 5204 – Operating Systems1 Paxos Student Presentation by Jeremy Trimble.
1 Indranil Gupta (Indy) Lecture 8 Paxos February 12, 2015 CS 525 Advanced Distributed Systems Spring 2015 All Slides © IG 1.
6.852: Distributed Algorithms Spring, 2008 Class 7.
Consensus Hao Li.
Consensus Algorithms Willem Visser RW334. Why do we need consensus? Distributed Databases – Need to know others committed/aborted a transaction to avoid.
CS 582 / CMPE 481 Distributed Systems Fault Tolerance.
CS 582 / CMPE 481 Distributed Systems
Eddie Bortnikov & Aran Bergman, Principles of Reliable Distributed Systems, Technion EE, Spring Principles of Reliable Distributed Systems Recitation.
Distributed Systems CS Case Study: Replication in Google Chubby Recitation 5, Oct 06, 2011 Majd F. Sakr, Vinay Kolar, Mohammad Hammoud.
Distributed Systems Fall 2009 Replication Fall 20095DV0203 Outline Group communication Fault-tolerant services –Passive and active replication Highly.
Distributed Systems 2006 Group Membership * *With material adapted from Ken Birman.
CHUBBY and PAXOS Sergio Bernales 1Dennis Kafura – CS5204 – Operating Systems.
In Search of an Understandable Consensus Algorithm Diego Ongaro John Ousterhout Stanford University.
CS 425 / ECE 428 Distributed Systems Fall 2014 Indranil Gupta (Indy) Lecture 19: Paxos All slides © IG.
1 ICS 214B: Transaction Processing and Distributed Data Management Distributed Database Systems.
State Machines CS 614 Thursday, Feb 21, 2002 Bill McCloskey.
Transaction. A transaction is an event which occurs on the database. Generally a transaction reads a value from the database or writes a value to the.
Commit Protocols. CS5204 – Operating Systems2 Fault Tolerance Causes of failure: process failure machine failure network failure Goals : transparent:
Distributed Deadlocks and Transaction Recovery.
Paxos Made Simple Jinghe Zhang. Introduction Lock is the easiest way to manage concurrency Mutex and semaphore. Read and write locks. In distributed system:
Fault Tolerance via the State Machine Replication Approach Favian Contreras.
Bringing Paxos Consensus in Multi-agent Systems Andrei Mocanu Costin Bădică University of Craiova.
An Introduction to Consensus with Raft
Practical Byzantine Fault Tolerance
Paxos A Consensus Algorithm for Fault Tolerant Replication.
Paxos: Agreement for Replicated State Machines Brad Karp UCL Computer Science CS GZ03 / M st, 23 rd October, 2008.
Replication (1). Topics r Why Replication? r System Model r Consistency Models – How do we reason about the consistency of the “global state”? m Data-centric.
Commit Algorithms Hamid Al-Hamadi CS 5204 November 17, 2009.
Distributed systems Consensus Prof R. Guerraoui Distributed Programming Laboratory.
Consensus and leader election Landon Cox February 6, 2015.
© Spinnaker Labs, Inc. Chubby. © Spinnaker Labs, Inc. What is it? A coarse-grained lock service –Other distributed systems can use this to synchronize.
CSci8211: Distributed Systems: Raft Consensus Algorithm 1 Distributed Systems : Raft Consensus Alg. Developed by Stanford Platform Lab  Motivated by the.
Implementing Replicated Logs with Paxos John Ousterhout and Diego Ongaro Stanford University Note: this material borrows heavily from slides by Lorenzo.
The Raft Consensus Algorithm Diego Ongaro and John Ousterhout Stanford University.
CSE 486/586, Spring 2012 CSE 486/586 Distributed Systems Paxos Steve Ko Computer Sciences and Engineering University at Buffalo.
Distributed Systems Lecture 9 Leader election 1. Previous lecture Middleware RPC and RMI – Marshalling 2.
CSE 486/586, Spring 2014 CSE 486/586 Distributed Systems Paxos Steve Ko Computer Sciences and Engineering University at Buffalo.
EEC 688/788 Secure and Dependable Computing Lecture 10 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
Consensus 2 Replicated State Machines, RAFT
The consensus problem in distributed systems
Distributed Systems – Paxos
Paxos and Raft (Lecture 21, cs262a)
Lecture 17: Leader Election
COS 418: Distributed Systems
View Change Protocols and Reconfiguration
Paxos and Raft (Lecture 9 cont’d, cs262a)
EECS 498 Introduction to Distributed Systems Fall 2017
Strong consistency and consensus
Implementing Consistency -- Paxos
CS434/534: Topics in Networked (Networking) Systems Distributed Network OS Yang (Richard) Yang Computer Science Department Yale University 208A Watson.
RAFT: In Search of an Understandable Consensus Algorithm
Fault-tolerance techniques RSM, Paxos
EEC 688/788 Secure and Dependable Computing
Consensus, FLP, and Paxos
EEC 688/788 Secure and Dependable Computing
EECS 498 Introduction to Distributed Systems Fall 2017
Our Final Consensus Protocol: RAFT
Replicated state machine and Paxos
View Change Protocols and Reconfiguration
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
The SMART Way to Migrate Replicated Stateful Services
EEC 688/788 Secure and Dependable Computing
Implementing Consistency -- Paxos
Presentation transcript:

Consensus and Consistency When you can’t agree to disagree

Consensus Why do applications need consensus? What does it mean to have consensus?

Consensus = consistency

Assumptions Processes can choose values Run at arbitrary speeds Can fail/restart at any time No adversarial model (not Byzantine) Messages can be duplicated or lost, and take arbitrarily long to arrive (but not forever)

How do we do this? General stages – New value produced at a process – New value(s) shared among processes – Eventually all processes agree on a new value Roles – Proposer (of new value) – Learner – Acceptor

Strawman solution Centralized approach – One node listens for new values from each process and broadcasts accepted values to others A A Proposers/ Listeners

Problems with this Single point of failure

Ok, so we need a group Multiple acceptors – New problem: consensus among acceptors Solution – If enough acceptors agree, then we’re set – Majority is enough (for accepting only one value) Why? – Only works for one value at a time

Which value to accept? P1: Accept the first proposal you hear – Ensures progress even if there is only one proposal What if there are multiple proposals? – P2: Assign numbers to proposals, pick highest- numbered value (n) chosen by majority of acceptors – In fact, every proposal accepted with number > n must have value v

But wait… What if new process wakes up with new value and higher number? – P1 says you have to accept it, but that violates P2 Need to make sure proposers know about accepted proposals – What about future accepted proposals? – Extract a promise not to accept

Proposing: Two phases Phase 1: Prepare – Ask for a promise not to accept a proposal with number greater than n – If acceptors haven’t seen a proposal with number > n, they make a promise not to accept one Phase 2: Accept – Proposer proposes accept for v, where v is value of highest number proposal, or anything if no proposal – Acceptor accepts if it hasn’t accepted proposal with number greater than n

Learning All acceptors notify all learners – Scalability problem Use super-learner to disseminate – Fault tolerance Use group of super-learners Values may be chosen w/o any learners learning – Lazy update at next chosen proposal

Liveness Consider this – P proposes n promise made – Q proposes n+1 promise made to q, promise to p invalid – P proposes n+2 promise made to p, promise to q invalid – … Use distinguished proposer – Only one to issue proposals

Paxos in action Pick leader Leader learns all previous chosen values Leader proposes new value(s) Failure get in the way – Leader may fail between actions – Even if some values are accepted out of order

Problems with Paxos Difficult to understand how to use it in a practical setting, or how it ensures invariants – Single-decree abstraction (one value decision) is hard to map to online system – Multi-Paxos (series of transactions) is even harder to fully grasp in terms of guarantees

Problems with Paxos Multi-Paxos not fully specified – Many have struggled to fill in blanks – Few are published (Chubby is an exception) Single decree not a great abstraction for practical systems – We need something for reconciling a sequence of transactions efficiently P2P model not great fit for used scenarios – Leaders can substantially optimize the algorithm

Raft: Making Paxos easier? Stronger leaders – Only leader propagates state changes Randomized timers for leader election – Simplifies conflict resolution Joint consensus – Availability during configuration changes (Subsequent slides from Raft people)

Replicated log => replicated state machine – All servers execute same commands in same order Consensus module ensures proper log replication System makes progress as long as any majority of servers are up Failure model: fail-stop (not Byzantine), delayed/lost messages March 3, 2013Raft Consensus AlgorithmSlide 18 Goal: Replicated Log addjmpmovshl Log Consensus Module State Machine addjmpmovshl Log Consensus Module State Machine addjmpmovshl Log Consensus Module State Machine Servers Clients shl

Two general approaches to consensus: Symmetric, leader-less: – All servers have equal roles – Clients can contact any server Asymmetric, leader-based: – At any given time, one server is in charge, others accept its decisions – Clients communicate with the leader Raft uses a leader: – Decomposes the problem (normal operation, leader changes) – Simplifies normal operation (no conflicts) – More efficient than leader-less approaches March 3, 2013Raft Consensus AlgorithmSlide 19 Approaches to Consensus

1.Leader election: – Select one of the servers to act as leader – Detect crashes, choose new leader 2.Normal operation (basic log replication) 3.Safety and consistency after leader changes 4.Neutralizing old leaders 5.Client interactions – Implementing linearizeable semantics 6.Configuration changes: – Adding and removing servers March 3, 2013Raft Consensus AlgorithmSlide 20 Raft Overview

At any given time, each server is either: – Leader: handles all client interactions, log replication At most 1 viable leader at a time – Follower: completely passive (issues no RPCs, responds to incoming RPCs) – Candidate: used to elect a new leader Normal operation: 1 leader, N-1 followers March 3, 2013 Raft Consensus Algorithm Slide 21 Server States FollowerCandidateLeader start timeout, start election receive votes from majority of servers timeout, new election discover server with higher term discover current server or higher term “step down”

Time divided into terms: – Election – Normal operation under a single leader At most 1 leader per term Some terms have no leader (failed election) Each server maintains current term value Key role of terms: identify obsolete information March 3, 2013Raft Consensus AlgorithmSlide 22 Terms Term 1Term 2Term 3Term 4Term 5 time ElectionsNormal OperationSplit Vote

March 3, 2013Raft Consensus AlgorithmSlide 23 Respond to RPCs from candidates and leaders. Convert to candidate if election timeout elapses without either: Receiving valid AppendEntries RPC, or Granting vote to candidate Followers Increment currentTerm, vote for self Reset election timeout Send RequestVote RPCs to all other servers, wait for either: Votes received from majority of servers: become leader AppendEntries RPC received from new leader: step down Election timeout elapses without election resolution: increment term, start new election Discover higher term: step down Candidates Each server persists the following to stable storage synchronously before responding to RPCs: currentTermlatest term server has seen (initialized to 0 on first boot) votedForcandidateId that received vote in current term (or null if none) log[]log entries Persistent State termterm when entry was received by leader indexposition of entry in the log commandcommand for state machine Log Entry Invoked by candidates to gather votes. Arguments: candidateIdcandidate requesting vote termcandidate's term lastLogIndexindex of candidate's last log entry lastLogTermterm of candidate's last log entry Results: termcurrentTerm, for candidate to update itself voteGrantedtrue means candidate received vote Implementation: 1.If term > currentTerm, currentTerm ← term (step down if leader or candidate) 2.If term == currentTerm, votedFor is null or candidateId, and candidate's log is at least as complete as local log, grant vote and reset election timeout RequestVote RPC Invoked by leader to replicate log entries and discover inconsistencies; also used as heartbeat. Arguments: termleader's term leaderIdso follower can redirect clients prevLogIndexindex of log entry immediately preceding new ones prevLogTermterm of prevLogIndex entry entries[]log entries to store (empty for heartbeat) commitIndexlast entry known to be committed Results: termcurrentTerm, for leader to update itself successtrue if follower contained entry matching prevLogIndex and prevLogTerm Implementation: 1.Return if term < currentTerm 2.If term > currentTerm, currentTerm ← term 3.If candidate or leader, step down 4.Reset election timeout 5.Return failure if log doesn’t contain an entry at prevLogIndex whose term matches prevLogTerm 6.If existing entries conflict with new entries, delete all existing entries starting with first conflicting entry 7.Append any new entries not already in the log 8.Advance state machine with newly committed entries AppendEntries RPC Raft Protocol Summary Initialize nextIndex for each to last log index + 1 Send initial empty AppendEntries RPCs (heartbeat) to each follower; repeat during idle periods to prevent election timeouts Accept commands from clients, append new entries to local log Whenever last log index ≥ nextIndex for a follower, send AppendEntries RPC with log entries starting at nextIndex, update nextIndex if successful If AppendEntries fails because of log inconsistency, decrement nextIndex and retry Mark log entries committed if stored on a majority of servers and at least one entry from current term is stored on a majority of servers Step down if currentTerm changes Leaders

Servers start up as followers Followers expect to receive RPCs from leaders or candidates Leaders must send heartbeats (empty AppendEntries RPCs) to maintain authority If electionTimeout elapses with no RPCs: – Follower assumes leader has crashed – Follower starts new election – Timeouts typically ms March 3, 2013Raft Consensus AlgorithmSlide 24 Heartbeats and Timeouts

Increment current term Change to Candidate state Vote for self Send RequestVote RPCs to all other servers, retry until either: 1.Receive votes from majority of servers: Become leader Send AppendEntries heartbeats to all other servers 2.Receive RPC from valid leader: Return to follower state 3.No-one wins election (election timeout elapses): Increment term, start new election March 3, 2013Raft Consensus AlgorithmSlide 25 Election Basics

Safety: allow at most one winner per term – Each server gives out only one vote per term (persist on disk) – Two different candidates can’t accumulate majorities in same term Liveness: some candidate must eventually win – Choose election timeouts randomly in [T, 2T] – One server usually times out and wins election before others wake up – Works well if T >> broadcast time March 3, 2013Raft Consensus AlgorithmSlide 26 Elections, cont’d Servers Voted for candidate A B can’t also get majority

Log entry = index, term, command Log stored on stable storage (disk); survives crashes Entry committed if known to be stored on majority of servers – Durable, will eventually be executed by state machines March 3, 2013Raft Consensus AlgorithmSlide 27 Log Structure 1 add jmp 1 cmp 1 ret 2 mov 3 div 3 shl 3 sub 1 add 3 jmp 1 cmp 1 ret 2 mov 1 add 3 jmp 1 cmp 1 ret 2 mov 3 div 3 shl 3 sub 1 add 1 cmp 1 add 3 jmp 1 cmp 1 ret 2 mov 3 div 3 shl leader log index followers committed entries term command

Client sends command to leader Leader appends command to its log Leader sends AppendEntries RPCs to followers Once new entry committed: – Leader passes command to its state machine, returns result to client – Leader notifies followers of committed entries in subsequent AppendEntries RPCs – Followers pass committed commands to their state machines Crashed/slow followers? – Leader retries RPCs until they succeed Performance is optimal in common case: – One successful RPC to any majority of servers March 3, 2013Raft Consensus AlgorithmSlide 28 Normal Operation

High level of coherency between logs: If log entries on different servers have same index and term: – They store the same command – The logs are identical in all preceding entries If a given entry is committed, all preceding entries are also committed March 3, 2013Raft Consensus AlgorithmSlide 29 Log Consistency 1 add jmp 1 cmp 1 ret 2 mov 3 div 4 sub 1 add 3 jmp 1 cmp 1 ret 2 mov

Each AppendEntries RPC contains index, term of entry preceding new ones Follower must contain matching entry; otherwise it rejects request Implements an induction step, ensures coherency March 3, 2013Raft Consensus AlgorithmSlide 30 AppendEntries Consistency Check 1 add 3 jmp 1 cmp 1 ret 2 mov 1 add 1 cmp 1 ret 2 mov leader follower add 3 jmp 1 cmp 1 ret 2 mov 1 add 1 cmp 1 ret 1 shl leader follower AppendEntries succeeds: matching entry AppendEntries fails: mismatch

At beginning of new leader’s term: – Old leader may have left entries partially replicated – No special steps by new leader: just start normal operation – Leader’s log is “the truth” – Will eventually make follower’s logs identical to leader’s – Multiple crashes can leave many extraneous log entries: March 3, 2013Raft Consensus AlgorithmSlide 31 Leader Changes log index term s1s1 s2s2 s3s3 s4s4 s5s5

Once a log entry has been applied to a state machine, no other state machine must apply a different value for that log entry Raft safety property: – If a leader has decided that a log entry is committed, that entry will be present in the logs of all future leaders This guarantees the safety requirement – Leaders never overwrite entries in their logs – Only entries in the leader’s log can be committed – Entries must be committed before applying to state machine March 3, 2013Raft Consensus AlgorithmSlide 32 Safety Requirement Committed → Present in future leaders’ logs Restrictions on commitment Restrictions on leader election

Can’t tell which entries are committed! During elections, choose candidate with log most likely to contain all committed entries – Candidates include log info in RequestVote RPCs (index & term of last log entry) – Voting server V denies vote if its log is “more complete”: (lastTerm V > lastTerm C ) || (lastTerm V == lastTerm C ) && (lastIndex V > lastIndex C ) – Leader will have “most complete” log among electing majority March 3, 2013Raft Consensus AlgorithmSlide 33 Picking the Best Leader unavailable during leader transition committed?

Case #1/2: Leader decides entry in current term is committed Safe: leader for term 3 must contain entry 4 March 3, 2013Raft Consensus AlgorithmSlide 34 Committing Entry from Current Term s1s1 s2s2 s3s3 s4s4 s5s AppendEntries just succeeded Can’t be elected as leader for term 3 Leader for term 2

Case #2/2: Leader is trying to finish committing entry from an earlier term Entry 3 not safely committed: – s 5 can be elected as leader for term 5 – If elected, it will overwrite entry 3 on s 1, s 2, and s 3 ! March 3, 2013Raft Consensus AlgorithmSlide 35 Committing Entry from Earlier Term s1s1 s2s2 s3s3 s4s4 s5s5 2 2 AppendEntries just succeeded Leader for term 4 3

For a leader to decide an entry is committed: – Must be stored on a majority of servers – At least one new entry from leader’s term must also be stored on majority of servers Once entry 4 committed: – s 5 cannot be elected leader for term 5 – Entries 3 and 4 both safe March 3, 2013Raft Consensus AlgorithmSlide 36 New Commitment Rules s1s1 s2s2 s3s3 s4s4 s5s Leader for term Combination of election rules and commitment rules makes Raft safe 3

Leader changes can result in log inconsistencies: March 3, 2013Raft Consensus AlgorithmSlide 37 Log Inconsistencies log index leader for term possible followers (a) (b) (c) (d) (e) (f) Extraneous Entries Missing Entries

March 3, 2013Raft Consensus Algorithm New leader must make follower logs consistent with its own – Delete extraneous entries – Fill in missing entries Leader keeps nextIndex for each follower: – Index of next log entry to send to that follower – Initialized to (1 + leader’s last index) When AppendEntries consistency check fails, decrement nextIndex and try again: Repairing Follower Logs log index leader for term followers (a) (b) nextIndex Slide 38

When follower overwrites inconsistent entry, it deletes all subsequent entries: March 3, 2013Raft Consensus AlgorithmSlide 39 Repairing Logs, cont’d log index leader for term follower (before) nextIndex 111 follower (after) 4

Deposed leader may not be dead: – Temporarily disconnected from network – Other servers elect a new leader – Old leader becomes reconnected, attempts to commit log entries Terms used to detect stale leaders (and candidates) – Every RPC contains term of sender – If sender’s term is older, RPC is rejected, sender reverts to follower and updates its term – If receiver’s term is older, it reverts to follower, updates its term, then processes RPC normally Election updates terms of majority of servers – Deposed server cannot commit new log entries March 3, 2013Raft Consensus AlgorithmSlide 40 Neutralizing Old Leaders

Send commands to leader – If leader unknown, contact any server – If contacted server not leader, it will redirect to leader Leader does not respond until command has been logged, committed, and executed by leader’s state machine If request times out (e.g., leader crash): – Client reissues command to some other server – Eventually redirected to new leader – Retry request with new leader March 3, 2013Raft Consensus AlgorithmSlide 41 Client Protocol

What if leader crashes after executing command, but before responding? – Must not execute command twice Solution: client embeds a unique id in each command – Server includes id in log entry – Before accepting command, leader checks its log for entry with that id – If id found in log, ignore new command, return response from old command Result: exactly-once semantics as long as client doesn’t crash March 3, 2013Raft Consensus AlgorithmSlide 42 Client Protocol, cont’d

System configuration: – ID, address for each server – Determines what constitutes a majority Consensus mechanism must support changes in the configuration: – Replace failed machine – Change degree of replication March 3, 2013Raft Consensus AlgorithmSlide 43 Configuration Changes

Cannot switch directly from one configuration to another: conflicting majorities could arise March 3, 2013Raft Consensus AlgorithmSlide 44 Configuration Changes, cont’d C old C new Server 1 Server 2 Server 3 Server 4 Server 5 Majority of C old Majority of C new time

March 3, 2013Raft Consensus AlgorithmSlide 45 Raft uses a 2-phase approach: – Intermediate phase uses joint consensus (need majority of both old and new configurations for elections, commitment) – Configuration change is just a log entry; applied immediately on receipt (committed or not) – Once joint consensus is committed, begin replicating log entry for final configuration Joint Consensus time C old+new entry committed C new entry committed C old C old+new C new C old can make unilateral decisions C new can make unilateral decisions

Additional details: – Any server from either configuration can serve as leader – If current leader is not in C new, must step down once C new is committed. March 3, 2013Raft Consensus AlgorithmSlide 46 Joint Consensus, cont’d time C old+new entry committed C new entry committed C old C old+new C new C old can make unilateral decisions C new can make unilateral decisions leader not in C new steps down here

1.Leader election 2.Normal operation 3.Safety and consistency 4.Neutralize old leaders 5.Client protocol 6.Configuration changes March 3, 2013Raft Consensus AlgorithmSlide 47 Raft Summary

For Thursday Ready Chubby paper – Shows example of trying to implement Paxos Read faulty process paper – The plot thickens with Byzantine failures