Published by Mariusz Jabłoński. Modified over 6 years ago.
1
Raft consensus Landon Cox April 11/16, 2018
2
How can things fall apart
Machines can get slow Machines can crash and reboot Machines can crash and die Network can become partitioned Machines can behave arbitrarily Easier Harder Fault tolerance: “Do not lose data.” Consistency: “Give sensible answers to reads.” Consistency depends on fault tolerance (hard to give sensible answers when data can disappear)
3
How can things fall apart
Machines can get slow Machines can crash and reboot Machines can crash and die Network can become partitioned Machines can behave arbitrarily Easier Harder Step 1: don’t lose data if machines crash and reboot
4
Transactions Fundamental to databases Several important properties
“ACID” (atomicity, consistency, isolation, durability) For now, only consider atomicity (all or nothing) BEGIN disk write 1 … disk write n END Called “committing” the transaction
5
Transactions: logging
1. Begin transaction
2. Append info about modifications to a log
3. Append “commit” to log to end x-action
4. Write new data to normal database
Single-sector write commits x-action (step 3)
Log: Begin Write1 … WriteN
What if we crash here? On reboot, discard uncommitted updates.
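The recovery rule on this slide can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical in-memory log of tuples (`"begin"`, `"write"`, `"commit"`); the record format and the `recover` name are not from the slides.

```python
# Hypothetical record format (an assumption for illustration):
#   ("begin", txid), ("write", txid, key, value), ("commit", txid)

def recover(log):
    """Rebuild the database after a reboot, keeping only committed writes."""
    # A transaction counts only if its "commit" record reached the log.
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    database = {}
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, _, key, value = rec
            database[key] = value
    return database

log = [
    ("begin", 1), ("write", 1, "x", 3), ("write", 1, "y", 1), ("commit", 1),
    ("begin", 2), ("write", 2, "x", 9),  # crash here: no commit for txid 2
]
recovered = recover(log)
```

Transaction 2 never wrote its commit record, so its update to `x` is discarded on reboot.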
6
How can things fall apart
Machines can get slow Machines can crash and reboot Machines can crash and die Network can become partitioned Machines can behave arbitrarily Easier Harder Step 1: don’t lose data if machines crash and reboot
7
How can things fall apart
Machines can get slow Machines can crash and reboot Machines can crash and die Network can become partitioned Machines can behave arbitrarily Easier Harder Step 1: don’t lose data if machines crash and reboot Step 2: don’t lose data if machines crash and die What has to happen if machines are not guaranteed to restart after a crash? Transactions have to commit at > 1 machine
8
Two-phase commit Besides the value of X, what else do nodes have to agree on? The identity of the coordinator! Replica X 1 Coordinator Replica X 1 Replica X 1
10
Two-phase commit Replica Coordinator C
What happens if the coordinator fails? Replicas have to agree on a new coordinator A process called “leader election” Replica Coordinator C Coordinator C Replica Coordinator C Replica Coordinator C
12
Two-phase commit Replica Coordinator C Coordinator Replica C
Can we use two-phase commit to agree on the coordinator? Two-phase commit requires a coordinator So to agree on one coordinator, need another coordinator, which requires agreement on yet another coordinator … We have a serious boot-strapping problem Replica Coordinator C Coordinator C Replica Coordinator C Replica Coordinator C
14
Paxos
ACM TOCS (Transactions on Computer Systems): Submitted: 1990. Accepted: 1998. Introduced:
16
???
17
v2.0
18
“Paxos Made Simple”
19
Butler W. Lampson Butler Lampson is a Technical Fellow at Microsoft Corporation and an Adjunct Professor at MIT. … He was one of the designers of the SDS 940 time-sharing system, the Alto personal distributed computing system, the Xerox 9700 laser printer, two-phase commit protocols, the Autonet LAN, the SPKI system for network security, the Microsoft Tablet PC software, the Microsoft Palladium high-assurance stack, and several programming languages. He received the ACM Software Systems Award in 1984 for his work on the Alto, the IEEE Computer Pioneer award in 1996 and von Neumann Medal in 2001, the Turing Award in 1992, and the NAE’s Draper Prize in 2004.
20
[Lampson 1995]
21
Barbara Liskov MIT professor 2008 Turing Award
“Viewstamped replication” PODC ’88 Very similar to Raft Different election process
22
At any moment, machine exists in a “state”
State machines At any moment, machine exists in a “state” What is a state? Should think of as a set of named variables and their values
23
State machines What is your state? 4 5 3 2 6 1 Client My state is “2”
Clients can ask a machine about its current state. What is your state? 4 5 3 2 6 1 Client My state is “2”
24
“actions” change the machine’s state
State machines “actions” change the machine’s state What is an action? Command that updates named variables’ values
25
“actions” change the machine’s state
State machines “actions” change the machine’s state Is an action’s effect deterministic? For our purposes, yes. Given a state and an action, we can determine next state w/ 100% certainty.
26
“actions” change the machine’s state
State machines “actions” change the machine’s state Is the effect of a sequence of actions deterministic? Yes, given a state and a sequence of actions, can be 100% certain of end state
27
Replicated state machines
Each state machine should compute same state, even if some fail. Client What is the state? What is the state? Client What is the state? Client
28
Replicated state machines
What has to be true of the actions that clients submit? Applied in same order Client Apply action c. Apply action a. Client Apply action b. Client
29
State machines How should a machine make sure it applies action in same order across reboots? Store them in a log! Action …
30
Replicated state machines
Can reduce problem of consistent, replicated states to consistent, replicated logs … … … …
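To make the reduction concrete, here is a small sketch in which an action is assumed to be a `(variable, value)` assignment and a replica's state is whatever results from replaying its log. The names `apply_action` and `replay` are illustrative, not from the slides.

```python
def apply_action(state, action):
    """Deterministic step: same state and action always give the same next state."""
    name, value = action
    next_state = dict(state)
    next_state[name] = value
    return next_state

def replay(log, initial=None):
    """Compute a replica's state by applying its log entries in order."""
    state = dict(initial or {})
    for action in log:
        state = apply_action(state, action)
    return state

log = [("x", 3), ("y", 1), ("y", 5)]
replica_a = replay(log)
replica_b = replay(list(log))                        # identical log, separate replica
reordered = replay([("y", 5), ("x", 3), ("y", 1)])   # same actions, different order
```

Replicas with identical logs end in identical states, while the reordered log ends with a different value for `y`. That is why agreeing on the log (contents and order) is enough.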
31
Replicated state machines
How to make sure that logs are consistent? Two-phase commit? … … … … …
32
Replicated state machines
What is the heart of the matter? Have to agree on the leader, outside of the logs. Leader=L Leader=L Apply action a. … … Client Leader=L Leader=L … …
33
Key elements of consensus
Leader election Who is in charge? Log replication What are the actions and what is their order? Safety What is true for all states, in all executions (including failures)? e.g., either we haven’t agreed or we all agree on the same value
35
Server states Three states: leader, follower, candidate F L C
36
Server state: follower
Passive state: respond to candidates and leaders F L C
37
Server state: leader Server handles all client requests F L
What should happen if a client sends a request to a follower? Follower forwards request to leader.
38
Server state: candidate
An intermediate state, used during elections F L C
39
Time divided into terms
Election Normal operation Election Normal operation Leader unknown Leader known What happened here? Election failed
40
Terms as a logical clock
All servers maintain the current term Terms increase monotonically Maintained as a logical clock Terms are exchanged whenever servers communicate What if A’s term is bigger than B’s? B updates its term to A’s What if A’s term is bigger and B thinks of itself as the leader? B reverts to a follower state What if A’s term is bigger, and it receives a request from B? A rejects B’s request B must be up to date to issue requests
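The three term rules above can be sketched as one message-handling function. This is a simplified illustration: the `Server` class and `receive` function are hypothetical names, and real messages carry more than a term.

```python
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    current_term: int = 0
    role: str = "follower"  # "follower", "candidate", or "leader"

def receive(receiver, sender_term):
    """Apply the term rules to an incoming message.

    Returns True if the message should be processed, False if it is
    rejected because the sender's term is stale.
    """
    if sender_term > receiver.current_term:
        # Sender is ahead: adopt its term and step down if necessary.
        receiver.current_term = sender_term
        receiver.role = "follower"
        return True
    if sender_term < receiver.current_term:
        # Sender is behind: it must catch up before issuing requests.
        return False
    return True

b = Server("B", current_term=2, role="leader")
accepted = receive(b, sender_term=3)   # A's term is bigger; B thought it was leader

a = Server("A", current_term=3)
stale_ok = receive(a, sender_term=2)   # request from a server with an older term
```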
41
Server state: follower
C Current term = 0 S1 F L C Current term = 0 S3 F L C Current term = 0 All servers start as followers. All servers have local timers. Note: no bootstrapping problem!
42
Server state: follower
C Current term = 0 S1 F L C Current term = 0 S3 F L C Current term = 0 Server remains follower as long as it receives periodic valid messages from a leader or candidate. Called a “heartbeat” message.
43
Server state: follower
C Current term = 0 S1 F L C Current term = 0 S3 F L C Current term = 0 What should server assume if no heartbeat? Assume no viable leader, start election.
44
Server state: follower
C Current term = 0 S1 F L C Current term = 0 S3 F L C Current term = 0 Who should the server nominate? How about itself? At least it knows that it’s running.
45
Server state: candidate
F L C Current term = 0 S1 F L C Current term = 1 S3 F L C Current term = 0 To start an election: Increment current term and set state to candidate
46
Server state: candidate
F L C Current term = 0 S1 F L C Current term = 1 S3 F L C Current term = 0 Need to collect votes. For whom should the server vote? Itself, of course! Major qualification: It’s running.
47
Server state: candidate
F L C Current term = 0 S1 Vote in term 1 F L C Current term = 1 S3 Votes S1=S1 S2=? S3=? Vote in term 1 F L C Current term = 0 How should S2, S3 respond to vote request?
48
Server state: candidate
F L C Current term = 1 S1 S1 in term 1 F L C Current term = 1 S3 Votes S1=S1 S2=? S3=? S1 in term 1 F L C Current term = 1 How should S2, S3 respond to vote request? Increment term, vote for S1 … why vote for S1? Our goal is consensus, and we know the collector voted for itself.
49
Server state: candidate
F L C Current term = 1 S1 F L C Current term = 1 S3 Votes S1=S1 S2=S1 S3=S1 F L C Current term = 1 What should S1 do next? Count votes (majority wins) Make itself the leader, start sending heartbeats.
50
Server state: candidate
F L C Current term = 1 Leader = S1 S1 S1 is leader F L C Current term = 1 S3 Leader = S1 S1 is leader F L C Current term = 1 Leader = S1
51
Server state: candidate
F L C Current term = 1 Leader = S1 S1 F L C Current term = 1 S3 Leader = S1 F L C Current term = 1 Leader = S1 How many faults can we tolerate? One. Need two of three to vote for the same new leader.
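The majority arithmetic behind "two of three" generalizes to any cluster size. The helper names below are illustrative assumptions, not part of the slides.

```python
def majority(cluster_size):
    """Smallest number of servers that is more than half the cluster."""
    return cluster_size // 2 + 1

def tolerated_faults(cluster_size):
    """Servers that can fail while a majority can still vote together."""
    return cluster_size - majority(cluster_size)

def wins_election(votes, candidate, cluster_size):
    """votes maps each voter's name to the server it voted for."""
    count = sum(1 for choice in votes.values() if choice == candidate)
    return count >= majority(cluster_size)

unanimous = {"S1": "S1", "S2": "S1", "S3": "S1"}
contested = {"S1": "S1", "S2": "S3", "S3": "S3"}
split = {"S1": "S1", "S2": "S2", "S3": "S3"}
```

With three servers, `majority(3)` is 2 and one fault is tolerated; with five, the majority is 3 and two faults are tolerated. In the `split` case no candidate reaches a majority, so the election fails.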
52
Server state: candidate
F L C Current term = 1 S1 Vote in term 1 F L C Current term = 1 S3 Votes S1=S1 S2=? S3=? Vote in term 1 F L C Current term = 1 C Votes S1=? S2=? S3=S3 F Who votes for whom if S1 and S3 both call elections?
53
Server state: candidate
F L C Current term = 1 S1 Vote in term 1 F L C Current term = 1 S3 Votes S1=S1 S2=S3 S3=S3 Vote in term 1 F L C Current term = 1 C Votes S1=S1 S2=S3 S3=S3 F Who votes for whom if S1 and S3 both call elections? S2 votes for the server that asked first
54
Server state: candidate
F L C Current term = 1 S1 F L C Current term = 1 S3 Votes S1=S1 S2=S3 S3=S3 F L C Current term = 1 C Votes S1=S1 S2=S3 S3=S3 F What does S1 do if it loses the election?
55
Server state: candidate
F L C Current term = 1 S1 F L C Current term = 1 S3 Votes S1=S1 S2=S3 S3=S3 F L C Current term = 1 C Votes S1=S1 S2=S3 S3=S3 F What does S1 do if it loses the election? Moves back to a follower state.
56
Server state: candidate
F L C Current term = 1 S3 is leader Leader = S3 S1 L Current term = 1 C S3 Leader = S3 F F L C Current term = 1 C Leader = S3 F
57
Server state: follower
C Current term = 0 S1 F L C Current term = 0 S3 F L C Current term = 0 Can all three servers nominate themselves?
58
Server state: candidate
F L C Current term = 1 Votes S1=S1 S2=S2 S3=S3 S1 F L C Current term = 1 S3 Votes S1=S1 S2=S2 S3=S3 F L C Current term = 1 Votes S1=S1 S2=S2 S3=S3 Can all three servers nominate themselves? Sure!
59
Server state: candidate
F L C Current term = 1 Votes S1=S1 S2=S2 S3=S3 S1 F L C Current term = 1 S3 Votes S1=S1 S2=S2 S3=S3 F L C Current term = 1 Votes S1=S1 S2=S2 S3=S3 Could this happen indefinitely?
60
Server state: candidate
F L C Current term = 1 Votes S1=S1 S2=S2 S3=S3 S1 F L C Current term = 1 S3 Votes S1=S1 S2=S2 S3=S3 F L C Current term = 1 Votes S1=S1 S2=S2 S3=S3 Could this happen indefinitely? Yes, there is no way to prevent this from happening. The worst possible thing is still possible.
61
Server state: candidate
F L C Current term = 1 Votes S1=S1 S2=S2 S3=S3 S1 F L C Current term = 1 S3 Votes S1=S1 S2=S2 S3=S3 F L C Current term = 1 Votes S1=S1 S2=S2 S3=S3 How do we make this less likely to occur?
62
Server state: candidate
F L C Current term = 1 Votes S1=S1 S2=S2 S3=S3 S1 F L C Current term = 1 S3 Votes S1=S1 S2=S2 S3=S3 F L C Current term = 1 Votes S1=S1 S2=S2 S3=S3 How do we make this less likely to occur? Randomize election timeouts i.e., wait a random period before new election
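A rough sketch of randomized timeouts follows. The 150 to 300 ms range is an assumed example value, and `election_timeout` and `first_to_time_out` are hypothetical helper names.

```python
import random

def election_timeout(rng, low_ms=150.0, high_ms=300.0):
    """Draw a fresh timeout uniformly at random (range is an assumption)."""
    return rng.uniform(low_ms, high_ms)

def first_to_time_out(servers, rng):
    """Give every server its own random timeout; the smallest one fires first."""
    timeouts = {name: election_timeout(rng) for name in servers}
    earliest = min(timeouts, key=timeouts.get)
    return earliest, timeouts

rng = random.Random(0)  # seeded only so the sketch is repeatable
earliest, timeouts = first_to_time_out(["S1", "S2", "S3"], rng)
```

Because the timeouts are unlikely to coincide, one server usually becomes a candidate and collects votes before the others time out. Split votes remain possible, just improbable, and each retry draws new timeouts.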
63
Server state: candidate
F L C Current term = 1 S1 F L C Current term = 1 S3 F L C Current term = 1 Can we ever have two leaders?
64
Server state: candidate
F L C Current term = 1 S1 F L C Current term = 1 S3 F L C Current term = 1 Can we ever have two leaders? Not in the same term: each server votes at most once per term, so with three nodes the votes either split or converge on one winner.
65
Key elements of consensus
Leader election Who is in charge? Log replication What are the actions and what is their order? Safety What is true for all states, in all executions (including failures)? e.g., either we haven’t agreed or we all agree on the same value
66
Each node maintains an action log.
Leader=L Each node maintains an action log. Each entry contains an action and a term. The term indicates when the leader received the action. Leader=L 1 x 3 1 y 1 Leader=L 1 x 3 1 y 1 1 x 3 1 y 1
67
When a request comes in, leader appends action to its log.
Leader=L When a request comes in, leader appends action to its log. Client y 5 Leader=L 1 x 3 1 y 1 Leader=L 1 x 3 1 y 1 1 y 5 1 x 3 1 y 1
68
Next, the leader tells other servers to append action.
Leader=L Next, the leader tells other servers to append action. Client y 5 1 y 5 Leader=L 1 x 3 1 y 1 Leader=L 1 y 5 1 x 3 1 y 1 1 y 5 1 x 3 1 y 1
69
The leader waits for confirmation that the entry was appended.
Leader=L The leader waits for confirmation that the entry was appended. Client y 5 ack Leader=L 1 x 3 1 y 1 1 y 5 Leader=L 1 x 3 1 y 1 1 y 5 1 x 3 1 y 1 1 y 5
70
As with leader election, majority rules.
As with leader election, majority rules. Entry is committed once the leader that received it has replicated the entry on a majority of servers (for now). Leader=L Client y 5 3/3 … commit! Leader=L 1 x 3 1 y 1 1 y 5 Leader=L 1 x 3 1 y 1 1 y 5 1 x 3 1 y 1 1 y 5
71
If action commits, leader updates state machine.
Client y 5 Leader=L 1 x 3 1 y 1 1 y 5 Leader=L 1 x 3 1 y 1 1 y 5 1 x 3 1 y 1 1 y 5
72
Leader reports success to client and other servers if action commits
Leader=L Client y 5 C C Leader=L 1 x 3 1 y 1 1 y 5 Leader=L 1 x 3 1 y 1 1 y 5 1 x 3 1 y 1 1 y 5
73
Can an action commit if one follower fails?
Action will still commit, since 2/3 are alive. Leader=L Leader=L 1 x 3 1 y 1 Leader=L 1 x 3 1 y 1 1 y 5 1 x 3 1 y 1 1 y 5
74
Can an action commit if both followers fail?
Can an action commit if both followers fail: (1) after adding entry to logs, and (2) before acking? Leader=L The leader will keep trying to append. If one server comes back, the action will eventually commit. Leader=L 1 x 3 1 y 1 1 y 5 Leader=L 1 x 3 1 y 1 1 y 5 1 x 3 1 y 1 1 y 5
75
How long might a client have to wait in this case?
How long might a client have to wait in this case? Leader=L A long time, i.e., until two machines append the action Leader=L 1 x 3 1 y 1 1 y 5 Leader=L 1 x 3 1 y 1 1 y 5 1 x 3 1 y 1 1 y 5
76
Can an action commit if the leader fails?
Can an action commit if leader fails: (1) after adding entry to log, and (2) before contacting followers? Leader=L Action will not commit. New entries will occur under next term. Leader=L 1 x 3 1 y 1 Leader=L 1 x 3 1 y 1 1 y 5 1 x 3 1 y 1
77
As with leader election, majority rules for commit (for now).
For now: entry is committed once the leader that received it has replicated the entry on a majority of servers (this will change in a bit) 1 2 3 4 5 6 7 8 Log index 1 x 3 y 1 y 9 2 x 2 3 x 0 y 7 x 5 x 4 Leader Followers
78
In this example, which committed entry has the highest index? Entry 7
1 2 3 4 5 6 7 8 Log index 1 x 3 y 1 y 9 2 x 2 3 x 0 y 7 x 5 x 4 Leader Followers
79
<sigh>This picture is far too orderly and easy to understand.
Committing a log entry also commits all entries in leader’s log with lower index. Note: the current leader’s index is the one that counts! 1 2 3 4 5 6 7 8 Log index 1 x 3 y 1 y 9 2 x 2 3 x 0 y 7 x 5 x 4 Leader Followers <sigh>This picture is far too orderly and easy to understand. No guarantee the world will look like this.</sigh>
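Under the "for now" rule, the highest committed index is the highest index stored on a majority of servers. A minimal sketch, assuming a hypothetical `match_index` map from server name to the last index at which that server's log agrees with the leader's (the follower values below are made up, since the figure's logs are not reproduced in this transcript):

```python
def commit_index(match_index, cluster_size):
    """Highest log index stored on a majority of servers (leader included)."""
    needed = cluster_size // 2 + 1
    # Sort from longest to shortest; the needed-th value is on >= needed servers.
    ranked = sorted(match_index.values(), reverse=True)
    return ranked[needed - 1]

# Hypothetical five-server cluster: leader has 8 entries, followers vary.
match_index = {"leader": 8, "f1": 5, "f2": 8, "f3": 2, "f4": 7}
highest = commit_index(match_index, cluster_size=5)
```

Here three servers hold index 7 or beyond, so entry 7 and every entry below it in the leader's log count as committed.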
80
This can be the state of the logs when a leader comes to power.
Each server has assigned each entry (1) a term, and (2) an index. Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 Leader (a) 1 1 1 4 4 5 5 6 6 (b) 1 1 1 4 (c) 1 1 1 4 4 5 5 6 6 6 6 Followers (d) 1 1 1 4 4 5 5 6 6 6 7 7 (e) 1 1 1 4 4 4 4 (f) 1 1 1 2 2 2 3 3 3 3 3
81
What is the current term?
We are in term >= 8. Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 Leader (a) 1 1 1 4 4 5 5 6 6 (b) 1 1 1 4 (c) 1 1 1 4 4 5 5 6 6 6 6 Followers (d) 1 1 1 4 4 5 5 6 6 6 7 7 (e) 1 1 1 4 4 4 4 (f) 1 1 1 2 2 2 3 3 3 3 3
82
Why aren’t there any entries for term 8?
Because the leader didn’t/hasn’t received any requests. Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 Leader (a) 1 1 1 4 4 5 5 6 6 (b) 1 1 1 4 (c) 1 1 1 4 4 5 5 6 6 6 6 Followers (d) 1 1 1 4 4 5 5 6 6 6 7 7 (e) 1 1 1 4 4 4 4 (f) 1 1 1 2 2 2 3 3 3 3 3
83
This can be the state of the logs when the leader comes to power.
Each server has assigned each entry (1) a term, and (2) an index. Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 Leader (a) 1 1 1 4 4 5 5 6 6 (b) 1 1 1 4 What’s wrong with the logs of (a) and (b)? They are missing log entries.
84
How might this have happened?
This can be the state of the logs when the leader comes to power. Each server has assigned each entry (1) a term, and (2) an index. Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 Leader (a) 1 1 1 4 4 5 5 6 6 (b) 1 1 1 4 How might this have happened? They could have gone offline and come back; (a) during term 6, (b) during term 4
85
This can be the state of the logs when the leader comes to power.
Each server has assigned each entry (1) a term, and (2) an index. Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 Leader What’s wrong with the logs of (c) and (d)? They have extra log entries. (c) 1 1 1 4 4 5 5 6 6 6 6 (d) 1 1 1 4 4 5 5 6 6 6 7 7
86
How might this have happened?
This can be the state of the logs when the leader comes to power. Each server has assigned each entry (1) a term, and (2) an index. Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 Leader How might this have happened? (c) was leader for term 6, added entry and crashed; (d) was leader for 7, added entries and crashed (c) 1 1 1 4 4 5 5 6 6 6 6 (d) 1 1 1 4 4 5 5 6 6 6 7 7
87
This can be the state of the logs when the leader comes to power.
Each server has assigned each entry (1) a term, and (2) an index. Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 Leader What’s wrong with the logs of (e) and (f)? They have extra log entries and missing log entries. (e) 1 1 1 4 4 4 4 (f) 1 1 1 2 2 2 3 3 3 3 3
88
This can be the state of the logs when the leader comes to power.
Each server has assigned each entry (1) a term, and (2) an index. Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 Leader How could this have happened to (f)? (f) was leader for term 2, added several entries and crashed. (f) quickly restarted and became leader for term 3, added more entries and crashed before any entries from terms 2 or 3 could commit. (e) 1 1 1 4 4 4 4 (f) 1 1 1 2 2 2 3 3 3 3 3
89
Goal: (eventually) converge on a sane state from logs like this.
Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 Leader (a) 1 1 1 4 4 5 5 6 6 (b) 1 1 1 4 (c) 1 1 1 4 4 5 5 6 6 6 6 Followers (d) 1 1 1 4 4 5 5 6 6 6 7 7 (e) 1 1 1 4 4 4 4 (f) 1 1 1 2 2 2 3 3 3 3 3
90
Servers have to keep track of committed log entry with highest index
Servers have to keep track of committed log entry with highest index. What are those here? Leader=L 2 for all three servers. Client y 5 Leader=L 1 x 3 1 y 1 1 2 Leader=L 1 x 3 1 y 1 1 y 5 1 2 3 1 x 3 1 y 1 1 2
91
Leader reports its highest index of committed action when forwarding request to followers.
Leader reports its highest index of committed action when forwarding request to followers. This is how followers update their state machines. Leader=L Client y 5 Term=1 y 5 High=2 Leader=L 1 x 3 1 y 1 1 2 Leader=L Term=1 y 5 High=2 1 x 3 1 y 1 1 y 5 1 2 3 1 x 3 1 y 1 1 2
92
Leader also reports highest index immediately preceding current append request.
Leader=L Client y 5 Term=1 y 5 High=2 Pred=2 Leader=L 1 x 3 1 y 1 1 2 Leader=L Term=1 y 5 High=2 Pred=2 1 x 3 1 y 1 1 y 5 1 2 3 1 x 3 1 y 1 1 2
93
Could this happen? Yes, if follower failed before it received action with index 2 from term 1 Leader=L Client y 5 Term=1 y 5 High=2 Pred=2 Leader=L 1 x 3 1 Leader=L Term=1 y 5 High=2 Pred=2 1 x 3 1 y 1 1 y 5 1 2 3 1 x 3 1 y 1 1 2
94
Should the recovered follower append the new action?
If it did, then it would have a different action in index 2 during term 1. Better to reject new action and append missing committed actions first. Leader=L Client y 5 Term=1 y 5 High=2 Pred=2 Leader=L 1 x 3 1 Leader=L Term=1 y 5 High=2 Pred=2 1 x 3 1 y 1 1 y 5 1 2 3 1 x 3 1 y 1 1 2
95
Yes, since we can still achieve majority w/o it.
Can new action commit while recovered server receives actions it missed while down? Leader=L Yes, since we can still achieve majority w/o it. Client y 5 Term=1 y 5 High=2 Pred=2 Leader=L 1 x 3 1 Leader=L Term=1 y 5 High=2 Pred=2 1 x 3 1 y 1 1 y 5 1 2 3 1 x 3 1 y 1 1 2
96
This gives us a very important system property:
Two entries in different logs w/ same index and term always have the same action. Leader=L Client y 5 Term=1 y 5 High=2 Pred=2 Leader=L 1 x 3 1 Leader=L Term=1 y 5 High=2 Pred=2 1 x 3 1 y 1 1 y 5 1 2 3 1 x 3 1 y 1 1 2
97
Two entries in different logs w/ same index and term always have the same action. Why?
Two entries in different logs w/ same index and term always have the same action. Why? Leader=L In a term, leader creates at most one entry with a given index. Followers catch up first when behind. Client y 5 Term=1 y 5 High=2 Pred=2 Leader=L 1 x 3 1 Leader=L Term=1 y 5 High=2 Pred=2 1 x 3 1 y 1 1 y 5 1 2 3 1 x 3 1 y 1 1 2
98
Log Matching Property Two entries in different logs with same index and term The entries store the same action The logs are identical in all preceding entries Does the matching property hold below? Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 Leader 1 1 1 4 4 5 5 6 6 6 7 7 1 1 1 4 4 4 4 Followers 1 1 1 2 2 2 3 3 3 3 3
99
Log Matching Property Two entries in different logs with same index and term The entries store the same action The logs are identical in all preceding entries Why is the first part always true? One leader/term If leader fails, term changes Leader inserts a new entry once at a given index NOTE: this is not enough to ensure the second part
100
Log Matching Property Two entries in different logs with same index and term The entries store the same action The logs are identical in all preceding entries Ensuring the second part requires extra check by follower Leader sends followers append(term, last_index, action) Follower must check that last entry has same term and last_index If not, the follower refuses the new entry Otherwise, follower may append new entry containing action
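The follower's extra check can be sketched as below. This is an illustrative simplification: `handle_append` is a hypothetical name, log entries are assumed to be `(term, action)` pairs, and the term of the preceding entry is passed explicitly as `prev_term`.

```python
def handle_append(log, prev_index, prev_term, entry):
    """Follower side of append. Log index 1 is stored at log[0].

    Refuses the entry unless the follower already holds an entry at
    prev_index whose term is prev_term.
    """
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1][0] != prev_term:
            return False  # missing or mismatched predecessor: refuse
    position = prev_index  # zero-based slot for the new entry
    if position < len(log):
        if log[position][0] == entry[0]:
            return True   # entry already present, nothing to do
        del log[position:]  # conflicting suffix is discarded
    log.append(entry)
    return True

# Recovered follower that missed the entry at index 2.
follower = [(1, "x 3")]
first = handle_append(follower, 2, 1, (1, "y 5"))   # predecessor missing
second = handle_append(follower, 1, 1, (1, "y 1"))  # catch up on index 2
third = handle_append(follower, 2, 1, (1, "y 5"))   # now the new action fits
```

The first request is refused because the follower has nothing at index 2. Once the missing entry is appended, the same request succeeds.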
101
Log Matching Property Two entries in different logs with same index and term The entries store the same action The logs are identical in all preceding entries Do logs always agree on indexes, terms? No, indexes at logs may have different actions and terms Goal of repairing logs is to increase entries w/ same index, term 1 1 1 4 4 5 5 6 6 6 7 7 1 1 1 4 4 4 4 1 1 1 2 2 2 3 3 3 3 3
102
How do we make the logs look more like one another?
Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 Leader (a) 1 1 1 4 4 5 5 6 6 (b) 1 1 1 4 (c) 1 1 1 4 4 5 5 6 6 6 6 Followers (d) 1 1 1 4 4 5 5 6 6 6 7 7 (e) 1 1 1 4 4 4 4 (f) 1 1 1 2 2 2 3 3 3 3 3
103
How do we make the logs look more like one another?
Raft: How do we make the logs look like the leader’s? Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 Leader (a) 1 1 1 4 4 5 5 6 6 (b) 1 1 1 4 (c) 1 1 1 4 4 5 5 6 6 6 6 Followers (d) 1 1 1 4 4 5 5 6 6 6 7 7 (e) 1 1 1 4 4 4 4 (f) 1 1 1 2 2 2 3 3 3 3 3
104
Repairing logs Leader Append-only Property
Leader’s log is treated as ground truth Leader never overwrites or deletes log entries Leader only appends to its log Basic idea: make the leader’s life simple How would a leader know that follower is missing entries? Follower refuses an append request (based on last_index) Leader must monitor each follower’s progress (next_index) Send log entries until followers are caught up
105
Leader initially assumes followers’ logs look like hers
next_index[a]=11 next_index[b]=11 next_index[c]=11 Leader initially assumes followers’ logs look like hers Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 Leader (a) 1 1 1 4 4 5 5 6 6 Followers (b) 1 1 1 4 (c) 1 1 1 4 4 5 5 6 6 6 6
106
next_index[a]=11 next_index[b]=11 next_index[c]=11 Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 8 Leader (a) 1 1 1 4 4 5 5 6 6 8 Prev={term 6, index 10} (b) 1 1 1 4 8 Prev={term 6, index 10} (c) 1 1 1 4 4 5 5 6 6 6 6 8 Prev={term 6, index 10}
107
next_index[a]=10 next_index[b]=10 next_index[c]=10 Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 8 Leader ✗ (a) 1 1 1 4 4 5 5 6 6 8 Prev={term 6, index 10} ✗ (b) 1 1 1 4 8 Prev={term 6, index 10} ✗ (c) 1 1 1 4 4 5 5 6 6 6 6 8 Prev={term 6, index 10}
108
next_index[a]=10 next_index[b]=10 next_index[c]=10 Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 8 Leader (a) 1 1 1 4 4 5 5 6 6 6 Prev={term 6, index 9} (b) 1 1 1 4 6 Prev={term 6, index 9} (c) 1 1 1 4 4 5 5 6 6 6 6 Prev={term 6, index 9} 6
109
What should happen to this log entry?
next_index[a]=10 next_index[b]=10 next_index[c]=10 Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 8 Leader ✓ (a) 1 1 1 4 4 5 5 6 6 6 Prev={term 6, index 9} ✗ (b) 1 1 1 4 6 Prev={term 6, index 9} ✓ (c) 1 1 1 4 4 5 5 6 6 6 Prev={term 6, index 9} 6 6 What should happen to this log entry? It should be deleted
110
What entry does the leader send to (a) next?
next_index[a]=11 next_index[b]=9 next_index[c]=11 Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 8 Leader (a) 1 1 1 4 4 5 5 6 6 6 (b) 1 1 1 4 (c) 1 1 1 4 4 5 5 6 6 6 What entry does the leader send to (a) next? Entry at index 11 (term 8); this succeeds.
111
What entry does the leader send to (b) next?
next_index[a]=11 next_index[b]=9 next_index[c]=11 Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 8 Leader (a) 1 1 1 4 4 5 5 6 6 6 (b) 1 1 1 4 (c) 1 1 1 4 4 5 5 6 6 6 What entry does the leader send to (b) next? Entry at index 9 (term 6); this fails and next_index[b] becomes 8. Eventually, (b) will accept entry from term 4 at index 5, and it will catch up with everyone else.
112
What entry does the leader send to (c) next?
next_index[a]=11 next_index[b]=9 next_index[c]=11 Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 8 Leader (a) 1 1 1 4 4 5 5 6 6 6 (b) 1 1 1 4 (c) 1 1 1 4 4 5 5 6 6 6 What entry does the leader send to (c) next? Entry at index 11 (term 8); this succeeds.
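The back-off loop on these slides can be simulated directly. In this sketch logs hold only terms (as in the figures), one entry is sent per request, and `follower_append` and `repair` are hypothetical names. A follower that matches at the preceding entry truncates any conflicting suffix before appending.

```python
def follower_append(log, prev_index, prev_term, term):
    """Accept the entry only if the log matches the leader's at prev_index."""
    if prev_index > 0 and (len(log) < prev_index or log[prev_index - 1] != prev_term):
        return False
    if prev_index < len(log) and log[prev_index] != term:
        del log[prev_index:]  # drop entries that conflict with the leader
    if prev_index == len(log):
        log.append(term)
    return True

def repair(leader_log, follower_log, next_index):
    """Send entries until the follower has caught up; returns requests sent."""
    requests = 0
    while next_index <= len(leader_log):
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1] if prev_index > 0 else 0
        ok = follower_append(follower_log, prev_index, prev_term,
                             leader_log[next_index - 1])
        next_index += 1 if ok else -1  # advance on success, back off on refusal
        requests += 1
    return requests

leader = [1, 1, 1, 4, 4, 5, 5, 6, 6, 6, 8]
a = [1, 1, 1, 4, 4, 5, 5, 6, 6]
b = [1, 1, 1, 4]
c = [1, 1, 1, 4, 4, 5, 5, 6, 6, 6, 6]
for follower in (a, b, c):
    repair(leader, follower, next_index=11)
```

After the loop, all three follower logs equal the leader's. (b) backs off until the request whose predecessor is the term-4 entry at index 4, and (c) loses its extra term-6 entry at index 11.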
113
next_index[a]=11 next_index[b]=9 next_index[c]=11 Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 8 Leader (a) 1 1 1 4 4 5 5 6 6 6 (b) 1 1 1 4 (c) 1 1 1 4 4 5 5 6 6 6 As described, this should make you uncomfortable. Why? We could choose a bad leader (e.g., one w/ an empty log). This could delete committed entries!
114
How do we prevent this from happening?
next_index[a]=11 next_index[b]=9 next_index[c]=11 Log index 1 2 3 4 5 6 7 8 9 10 11 12 1 1 1 4 4 5 5 6 6 6 8 Leader (a) 1 1 1 4 4 5 5 6 6 6 (b) 1 1 1 4 (c) 1 1 1 4 4 5 5 6 6 6 How do we prevent this from happening? We have to be smarter about electing a leader…
115
Refined leader election
Goal Leader must have all committed actions How actions are committed Leader accepts action from client Leader forwards action to followers Majority append must occur before next term If new term starts before majority append, no commit Commit cannot occur until later action commits
116
Refined leader election
Initial election criterion Any candidate can be elected Normally the candidate who starts the election wins Problem: leader can force stale log on followers What must be true of an elected leader? Leader must have all prior committed actions If leader is missing a committed action, it may be lost
117
Refined leader election
New election criterion Vote for candidate with most “up to date” log If voter’s log is more recent than candidate’s, vote self Which log is more up to date? (a) because its last entry is from term 4 (> term 3) If last log entry is from later term, log is more up to date (a) 1 1 1 4 4 4 4 (b) 1 1 1 2 2 2 3 3 3 3 3
118
Refined leader election
New election criterion Vote for candidate with most “up to date” log If voter’s log is more recent than candidate’s, vote self Which log is more up to date? (b) because its last term-4 entry has higher index When last log entries are from same term, higher index is more up to date (a) 1 1 1 4 4 4 4 (b) 1 1 1 4 4 4 4 4
119
Refined leader election
New election criterion Vote for candidate with most “up to date” log If voter’s log is more recent than candidate’s, vote self How to decide which log is more up to date If last log entry is from later term, log is more up to date When last log entries are from same term, higher index is more up to date
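The two comparison rules fit in one tuple comparison. In this sketch logs hold only terms, as in the figures, and the function names are illustrative.

```python
def last_entry(log):
    """(term, index) of the last entry; an empty log is (0, 0)."""
    return (log[-1], len(log)) if log else (0, 0)

def candidate_is_up_to_date(candidate_log, voter_log):
    """True if the voter may grant its vote to the candidate.

    Tuples compare the term first and the index second, which matches:
    later last term wins; on equal terms, the higher index wins.
    """
    return last_entry(candidate_log) >= last_entry(voter_log)

# The two examples from the preceding slides.
a1 = [1, 1, 1, 4, 4, 4, 4]
b1 = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]
a2 = [1, 1, 1, 4, 4, 4, 4]
b2 = [1, 1, 1, 4, 4, 4, 4, 4]
```

In the first pair (a) is more up to date despite being shorter, because its last entry is from term 4. In the second pair (b) wins on length.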
120
Refined leader election
Goal Leader must have all committed actions Recall how actions had been committed Leader accepts action from client Leader forwards action to followers If a majority of followers append action, it’s committed If action commits, leader returns to client If leader fails, future leaders can commit Unfortunately, there’s a problem …
121
Committing actions Log index 1 2 (a) 1 2 (b) 1 2 (c) 1 (d) 1 (e) 1
122
How was (e) elected if (b) has a more up-to-date log?
Committing actions Log index 1 2 (a) 1 2 How was (e) elected if (b) has a more up-to-date log? (b) 1 2 (c) 1 Votes from (c) and (d) (d) 1 (e) 1
123
Committing actions Log index 1 2 (a) 1 2 (b) 1 2 (c) 1 (d) 1 (e) 1 3
124
Committing actions Log index 1 2 (a) 1 2 (b) 1 2 (c) 1 (d) 1 (e) 1 3
125
What is the first thing that (a) will do?
Committing actions Log index 1 2 3 (a) 1 2 4 What is the first thing that (a) will do? (b) 1 2 (c) 1 Forward entry at index 3 to followers, but this will fail at (c), because (c) does not have entry at index 2 yet (d) 1 (e) 1 3
126
Which entry will (a) send to (c) first?
Committing actions Log index 1 2 3 (a) 1 2 4 Which entry will (a) send to (c) first? (b) 1 2 (c) 1 2 Entry at index 2 (d) 1 (e) 1 3
127
Has the entry at index 2 been committed?
Committing actions Log index 1 2 3 (a) 1 2 4 Has the entry at index 2 been committed? (b) 1 2 (c) 1 2 By our previous definition, yes. But note that it reached a majority of nodes in term 4, not term 2 (i.e., after it was created) (d) 1 (e) 1 3
128
Committing actions Log index 1 2 3 (a) 1 2 4 (b) 1 2 (c) 1 2 (d) 1 (e) 1 3
129
How could (e) have been elected?
Committing actions Log index 1 2 3 (a) 1 2 4 How could (e) have been elected? (b) 1 2 (c) 1 2 Votes from (b),(c), and (d) (d) 1 (e) 1 3
130
What is the first thing that (e) will do?
Committing actions Log index 1 2 3 (a) 1 2 4 What is the first thing that (e) will do? (b) 1 2 (c) 1 2 Forward entry at index 2 to followers (d) 1 (e) 1 3
131
What will happen to other logs’ entries at index 2?
Committing actions Log index 1 2 3 (a) 1 2 4 What will happen to other logs’ entries at index 2? (b) 1 2 3 (c) 1 3 2 They will be replaced by the leader’s entry from term 3 (d) 1 3 (e) 1 3
132
What will happen to (a)’s entry at index 3?
Committing actions Log index 1 2 3 (a) 1 3 2 4 What will happen to (a)’s entry at index 3? (b) 1 2 3 (c) 1 3 2 It will be deleted (d) 1 3 (e) 1 3
133
Committing actions
Log index: 1 2 3
(a) 1 3
(b) 1 3
(c) 1 3
(d) 1 3
(e) 1 3
So, are actions stored on a majority really committed?
No, only if they reach a majority in the term they’re created.
134
Can (e) win the election?
Committing actions
Log index: 1 2
(a) 1 2
(b) 1 2
(c) 1 2
(d) 1
(e) 1
Can (e) win the election?
No, only (d) will vote for it
135
Can (b) or (c) win the election?
Committing actions
Log index: 1 2
(a) 1 2
(b) 1 2
(c) 1 2
(d) 1
(e) 1
Can (b) or (c) win the election?
Yes, either can win.
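The two election questions above can be checked mechanically. This is a minimal sketch, assuming Raft's voting rule: a server grants its vote only if the candidate's log is at least as up to date as its own, comparing last term first and log length second. The helper names `at_least_as_up_to_date` and `can_win` are made up for illustration. Logs are modeled as lists of entry terms, and every server is assumed to be reachable and not to have voted for anyone else in that term.

```python
from typing import Dict, List


def at_least_as_up_to_date(candidate: List[int], voter: List[int]) -> bool:
    """Compare two logs (lists of entry terms) by (last term, length)."""
    candidate_key = (candidate[-1] if candidate else 0, len(candidate))
    voter_key = (voter[-1] if voter else 0, len(voter))
    return candidate_key >= voter_key


def can_win(candidate: str, logs: Dict[str, List[int]]) -> bool:
    """True if a strict majority (the candidate included) would grant a vote."""
    votes = sum(
        1 for log in logs.values()
        if at_least_as_up_to_date(logs[candidate], log)
    )
    return votes * 2 > len(logs)


# Logs from the slide: (a), (b), (c) hold terms 1 2; (d), (e) hold term 1 only.
logs = {"a": [1, 2], "b": [1, 2], "c": [1, 2], "d": [1], "e": [1]}
print(can_win("e", logs))  # False: only (d) and (e) itself would vote for it
print(can_win("b", logs))  # True: every server's log is no newer than (b)'s
```

Tuple comparison keeps the rule compact: a later last term always wins, and length only breaks ties between logs with the same last term.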
136
What will (b) send to (d) and (e) first?
Committing actions Log index 1 2 3 (a) 1 2 What will (b) send to (d) and (e) first? (b) 1 2 3 STOPPED (c) 1 2 3 Entry at index 2 (d) 1 2 3 (e) 1 2 3
137
Committing actions (say 2 reached majority in term 4)
Log index: 1 2 3
(a) 1 2 4
(b) 1 2
(c) 1 2
(d) 1
(e) 1 3
Can entry 2 ever commit?
Yes, if a subsequent entry (e.g., index 3, term 4) commits, all prior entries are also committed
138
Entry 3 is now committed, so entry 2 is as well
Committing actions (say 2 reached majority in term 4)
Log index: 1 2 3
(a) 1 2 4
(b) 1 2 4
(c) 1 2 4
(d) 1
(e) 1 3
Entry 3 is now committed, so entry 2 is as well
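The commit rule these slides arrive at can be written as a small function. This is a sketch under stated assumptions, not a full Raft implementation: the leader advances its commit point only to an entry that is stored on a majority and was created in the leader's current term, and everything before that entry is then committed with it. The names `advance_commit_index` and `match_index` are illustrative.

```python
from typing import Dict, List


def advance_commit_index(log_terms: List[int], match_index: Dict[str, int],
                         current_term: int, commit_index: int) -> int:
    """Return the leader's new commit index (1-based; 0 means nothing committed).

    log_terms[i - 1] is the term of the leader's entry at index i.
    match_index maps each server, leader included, to the highest index
    at which its log is known to match the leader's log.
    """
    cluster_size = len(match_index)
    for n in range(len(log_terms), commit_index, -1):
        replicas = sum(1 for m in match_index.values() if m >= n)
        on_majority = replicas * 2 > cluster_size
        if on_majority and log_terms[n - 1] == current_term:
            return n  # entries 1..n are now committed
    return commit_index


leader_log = [1, 2, 4]  # (a) is leader in term 4

# Entry 2 is on (a), (b), (c) but was created in term 2: not committed yet.
before = {"a": 3, "b": 2, "c": 2, "d": 1, "e": 1}
print(advance_commit_index(leader_log, before, current_term=4, commit_index=1))  # 1

# Entry 3 (term 4) reaches (b) and (c): entries 2 and 3 commit together.
after = {"a": 3, "b": 3, "c": 3, "d": 1, "e": 1}
print(advance_commit_index(leader_log, after, current_term=4, commit_index=1))  # 3
```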
139
Intertwining elections and log repair
Log index: 1 2 3 4 5 6 7 8 9 10 11 12
(a) 1 1 1 4 4 5 5 6 6 6
(b) 1 1 1 4 4 4 4
(c) 1 1 1 2 2 2 3 3 3 7
Is it possible for servers’ logs to reach this state? Why or why not?
140
Intertwining elections and log repair
Log index: 1 2 3 4 5 6 7 8 9 10 11 12
(a) 1 1 1 5 5 6 6 6 6
(b) 1 1 1 5 5 5 5
(c) 1 1 1 2 4
Is it possible for servers’ logs to reach this state? Why or why not?
141
Intertwining elections and log repair
Log index: 1 2 3 4 5 6 7 8 9 10 11 12
(a) 1 1 1 2 6
(b) 1 1 1 2 4
(c) 1 1 1 2 5
Is it possible for servers’ logs to reach this state? Why or why not?
142
Key elements of consensus
Leader election: Who is in charge?
Log replication: What are the actions and what is their order?
Safety: What is true for all states, in all executions? E.g., either we haven’t agreed or we all agree on the same value
143
Review: Log Matching Property
If two entries in different logs have the same index and term:
(1) The entries store the same action
(2) The logs are identical in all preceding entries
Critical for the Safety proof that follows
144
Log Matching Property
If two entries in different logs have the same index and term:
(1) The entries store the same action
(2) The logs are identical in all preceding entries
Why is the first part always true?
There is one leader per term; if the leader fails, the term changes
The leader inserts a new entry only once at a given index
So every action is assigned a unique term and index
NOTE: this is not enough to ensure the second part
145
Log Matching Property
If two entries in different logs have the same index and term:
(1) The entries store the same action
(2) The logs are identical in all preceding entries
Ensuring the second part requires an extra check by the follower
Leader sends followers append(term, last_index, action), where term and last_index describe the entry that immediately precedes the new one in the leader’s log
Follower must check that its last entry has the same term and last_index
If not, the follower refuses the new entry
Otherwise, the follower may append the new entry containing the action
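The follower's check can be sketched in a few lines. This is an illustration only: `Entry`, `follower_append`, `prev_index`, and `prev_term` are assumed names, with `prev_index` and `prev_term` playing the role of the slide's last_index and term. Discarding a conflicting suffix before appending is not spelled out on the slide; it is included here as an assumption based on the standard Raft description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    term: int
    action: str


def follower_append(log: List[Entry], prev_index: int, prev_term: int,
                    entry: Entry) -> bool:
    """Follower-side check. Indexes are 1-based; prev_index 0 means no previous entry."""
    if prev_index > len(log):
        return False  # the follower is missing the preceding entry
    if prev_index > 0 and log[prev_index - 1].term != prev_term:
        return False  # the preceding entry exists but has a different term
    if len(log) > prev_index:
        if log[prev_index].term == entry.term:
            return True  # the follower already stores this entry
        del log[prev_index:]  # assumption: drop the conflicting suffix
    log.append(entry)
    return True


# Follower (c) holds only the term-1 entry; leader (a) holds terms 1 2 4.
c_log = [Entry(1, "x")]
print(follower_append(c_log, 2, 2, Entry(4, "z")))  # False: index 2 is missing
print(follower_append(c_log, 1, 1, Entry(2, "y")))  # True: index 2 is appended
print(follower_append(c_log, 2, 2, Entry(4, "z")))  # True: index 3 now fits
```

The refusal in the first call is what forces the leader to back up and send the entry at index 2 first.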
146
Log Matching Property
If two entries in different logs have the same index and term:
(1) The entries store the same action
(2) The logs are identical in all preceding entries
Do logs always agree on indexes, terms?
No, the same index in different logs may hold different actions and terms
The goal of repairing logs is to increase the number of entries with the same index and term
Example logs:
1 1 1 4 4 5 5 6 6 6 7 7
1 1 1 4 4 4 4
1 1 1 2 2 2 3 3 3 3 3
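To make "entries with the same index and term" concrete, the sketch below counts how many leading entries two logs share and shows what a repaired follower log looks like. Treating the first example log as the leader's is an assumption made purely for illustration, and `matching_prefix` and `repair` are hypothetical helper names. Logs are lists of entry terms, so two entries match when their terms agree at the same index.

```python
from typing import List


def matching_prefix(log_a: List[int], log_b: List[int]) -> int:
    """Number of leading indexes at which both logs hold the same term."""
    count = 0
    for term_a, term_b in zip(log_a, log_b):
        if term_a != term_b:
            break
        count += 1
    return count


def repair(leader: List[int], follower: List[int]) -> List[int]:
    """Keep the follower's matching prefix, then copy the leader's remaining entries."""
    keep = matching_prefix(leader, follower)
    return follower[:keep] + leader[keep:]


top = [1, 1, 1, 4, 4, 5, 5, 6, 6, 6, 7, 7]
middle = [1, 1, 1, 4, 4, 4, 4]
bottom = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3]

print(matching_prefix(top, middle))  # 5
print(matching_prefix(top, bottom))  # 3
print(repair(top, middle) == top)    # True
```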
147
Safety: Leader Completeness Property
If a log entry is committed in a given term, that entry will be in the leaders’ logs in all future terms
Proof setup
Majorities are required to commit and to elect
What must be true of any two majorities? They must overlap
This will be the linchpin of the proof
Question: can a node that is missing committed entries be elected?
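The overlap claim is easy to confirm by brute force for a small cluster. The sketch below enumerates every majority of a five-node cluster and checks that each pair shares at least one node. The function names are illustrative; the general argument is simply that two sets that each hold more than half of the nodes cannot be disjoint.

```python
from itertools import combinations
from typing import List, Set


def majorities(nodes: List[str]) -> List[Set[str]]:
    """All subsets containing more than half of the nodes."""
    need = len(nodes) // 2 + 1
    return [
        set(group)
        for size in range(need, len(nodes) + 1)
        for group in combinations(nodes, size)
    ]


def all_majorities_overlap(nodes: List[str]) -> bool:
    """True if every pair of majorities shares at least one node."""
    groups = majorities(nodes)
    return all(g1 & g2 for g1 in groups for g2 in groups)


cluster = ["a", "b", "c", "d", "e"]
print(len(majorities(cluster)))         # 16
print(all_majorities_overlap(cluster))  # True
```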
148
Proof by contradiction
Assume that a node missing a committed entry e is elected
Entry e was committed in term T
A leader missing committed entry e is elected in term U
U is the earliest term after T whose leader is missing e
149
Proof by contradiction
Assume that a node missing a committed entry e is elected
Entry e was committed in term T
A leader missing committed entry e is elected in term U
U is the earliest term after T whose leader is missing e
Majorities are needed to commit and to elect
At least one node accepted e in term T and voted for the leader in U
Call this node “the voter”
150
Proof by contradiction
Assume that a node missing a committed entry e is elected
Entry e was committed in term T
A leader missing committed entry e is elected in term U
U is the earliest term after T whose leader is missing e
Majorities are needed to commit and to elect
At least one node accepted e in term T and voted for the leader in U
Call this node “the voter”
Is it possible that the voter obtained entry e after voting in U?
If so, the voter would have rejected e from the leader in term T, because by then its own term was already U > T
And we know that the voter accepted e in term T
So, we know that the voter obtained e before voting in term U
151
Proof by contradiction
Assume that a node missing a committed entry e is elected
Entry e was committed in term T
A leader missing committed entry e is elected in term U
U is the earliest term after T whose leader is missing e
Majorities are needed to commit and to elect
At least one node accepted e in term T and voted for the leader in U
Call this node “the voter”
The voter obtained e before voting for the leader in U
We know that the voter voted for the leader in U
So the log of the leader in U must have been at least as up to date as the voter’s log
152
Proof by contradiction
Assume that a node missing a committed entry e is elected
Entry e was committed in term T
A leader missing committed entry e is elected in term U
U is the earliest term after T whose leader is missing e
Majorities are needed to commit and to elect
At least one node accepted e in term T and voted for the leader in U
Call this node “the voter”
The voter obtained e before voting for the leader in U
The log of the leader in U must have been at least as up to date as the voter’s log
This will give us our contradictions
153
Proof by contradiction
The log of the leader in U must have been at least as up to date as the voter’s log
If the leader’s log was the same as the voter’s, then what must be true?
The leader’s log must have contained committed entry e
This contradicts our assumption, so the leader’s log must ≠ the voter’s
154
Proof by contradiction
The log of the leader in U must have been at least as up to date as the voter’s log
If the leader’s log was the same as the voter’s, then what must be true?
The leader’s log must have contained committed entry e
This contradicts our assumption, so the leader’s log must ≠ the voter’s
What if the leader’s log has the same last term as the voter’s?
Then the leader’s log must have been longer than the voter’s
If longer with the same last term, then there must be an entry from the last term in common
Log Matching Property: if entries have the same index and term, all preceding entries are the same
Thus, the leader’s log must contain entry e
This contradicts our assumption, so the leader’s last log term > the voter’s last log term
155
Proof by contradiction
The log of the leader in U must have been at least as up to date as the voter’s log
If the leader’s log was the same as the voter’s, then what must be true?
The leader’s log must have contained committed entry e
This contradicts our assumption, so the leader’s log must ≠ the voter’s
What if the leader’s log has the same last term as the voter’s?
Then the leader’s log must have been longer than the voter’s
If longer with the same last term, then there must be an entry from the last term in common
Log Matching Property: if entries have the same index and term, all preceding entries are the same
Thus, the leader’s log must contain entry e
This contradicts our assumption, so the leader’s last log term > the voter’s last log term
What if the leader’s last log term is later than the voter’s last log term?
156
[Figure: logs of Leader(U), Voter, and Leader(L) over log indexes 1 to 12. The last entry in Leader(U)’s log has term L. Entry e: term T at index 6 in Voter’s log]
Last term in Leader(U)’s log is L, and L > the last term in Voter’s log
What does this tell us about the relationship of L and T?
L > T, since Voter has entry e from T
157
[Figure: logs of Leader(U), Voter, and Leader(L) over log indexes 1 to 12. The last entry in Leader(U)’s log has term L. Entry e: term T at index 6 in Voter’s log, now also shown in Leader(L)’s log]
Last term in Leader(U)’s log is L, and L > the last term in Voter’s log
Do we know whether the leader for L, Leader(L), has e?
Yes, it must, since we assumed that Leader(U) was the first without e
158
[Figure: logs of Leader(U), Voter, and Leader(L) over log indexes 1 to 12. Leader(L)’s log now also shows the term-L entry that ends Leader(U)’s log. Entry e: term T at index 6 in Voter’s and Leader(L)’s log]
Last term in Leader(U)’s log is L, and L > the last term in Voter’s log
Do we know whether Leader(L) has the last entry in Leader(U)’s log?
Yes, it must, since updates only flow from leaders to followers
159
[Figure: logs of Leader(U), Voter, and Leader(L) over log indexes 1 to 12. Leader(U)’s log now shows entry e (term T) before its last entry (term L). Entry e: term T at index 6 in Voter’s and Leader(L)’s log]
Last term in Leader(U)’s log is L, and L > the last term in Voter’s log
Finally, apply the Log Matching Property to infer that Leader(U) has e.
Since Leader(L) and Leader(U) both have the last entry in Leader(U)’s log at the same index and term, all preceding entries are identical.
And since Leader(L) has e and T < L, Leader(U) must have e.
160
[Figure: logs of Leader(U), Voter, and Leader(L) over log indexes 1 to 12. Entry e: term T at index 6 in Voter’s, Leader(L)’s, and Leader(U)’s log ✓]
Last term in Leader(U)’s log is L, and L > the last term in Voter’s log
Finally, apply the Log Matching Property to infer that Leader(U) has e.
Since Leader(L) and Leader(U) both have the last entry in Leader(U)’s log at the same index and term, all preceding entries are identical.
And since Leader(L) has e and T < L, Leader(U) must have e.
161
Proof by contradiction
The log of the leader in U must have been at least as up to date as the voter’s log
If the leader’s log was the same as the voter’s, then what must be true?
The leader’s log must have contained committed entry e
This contradicts our assumption, so the leader’s log must ≠ the voter’s
What if the leader’s log has the same last term as the voter’s?
Then the leader’s log must have been longer than the voter’s
If longer with the same last term, then there must be an entry from the last term in common
Log Matching Property: if entries have the same index and term, all preceding entries are the same
Thus, the leader’s log must contain entry e
This contradicts our assumption, so the leader’s last log term > the voter’s last log term
What if the leader’s last log term is later than the voter’s last log term?
The last term in the leader’s log is L, and L > T since the voter’s log contains entry e
The leader in L must have had e, since the leader in U is the first not to have it
By the Log Matching Property, the log of the leader in U must also contain e
162
Proof by contradiction
Assume that a node missing a committed entry e is elected
Entry e was committed in term T
A leader missing committed entry e is elected in term U
U is the earliest term after T whose leader is missing e
Majorities are needed to commit and to elect
At least one node accepted e in term T and voted for the leader in U
Call this node “the voter”
The voter obtained e before voting for the leader in U
The log of the leader in U must have been at least as up to date as the voter’s log
Every case leads to a contradiction, so nodes missing committed entries cannot be elected
Committed entries are always preserved across terms/elections
163
One final issue: cluster membership changes
Membership is part of the cluster’s configuration
What is the easy way to handle this?
Take the cluster down
Change configuration on all machines
Restart the cluster
Why isn’t this ideal?
Would rather not take the availability hit
Manually editing configuration can be dangerous
164
Should never have two leaders
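Problem: cannot guarantee when the change happens
What makes this period dangerous? Disjoint majorities
[Figure: timeline for servers (a) to (e). Servers (a), (b), and (c) move from the old configuration to the new one; servers (d) and (e) only have the new configuration. Horizontal axis: time]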
165
Should never have two leaders
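Problem: cannot guarantee when the change happens
How can you have disjoint majorities? Old (a,b), New (c,d,e)
[Figure: timeline for servers (a) to (e). Servers (a), (b), and (c) move from the old configuration to the new one; servers (d) and (e) only have the new configuration. Horizontal axis: time]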
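The danger can be reproduced with plain set arithmetic. This is a minimal sketch that assumes, based on the figure, that the old configuration is servers (a), (b), (c) and the new one is all five servers. `is_majority` is an illustrative helper, not part of any library.

```python
from typing import Set


def is_majority(servers: Set[str], config: Set[str]) -> bool:
    """True if `servers` contains more than half of the members of `config`."""
    return len(servers & config) * 2 > len(config)


c_old = {"a", "b", "c"}            # assumed old configuration
c_new = {"a", "b", "c", "d", "e"}  # assumed new configuration

old_side = {"a", "b"}       # servers still deciding under the old configuration
new_side = {"c", "d", "e"}  # servers already deciding under the new configuration

print(is_majority(old_side, c_old))  # True: 2 of 3
print(is_majority(new_side, c_new))  # True: 3 of 5
print(old_side & new_side)           # set(): no server in common
```

Because the two groups share no server, each could elect its own leader in the same term, which is exactly what must never happen.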
166
Raft’s approach: change configuration in two phases
First, “joint consensus”
Second, transition to the new configuration
Joint consensus rules:
Log entries are replicated to all servers in both configs
Any server, from either config, may serve as leader
Agreement requires majorities from both configs
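The last joint consensus rule can be sketched as a single predicate: a vote or a log entry counts as agreed only when it has a majority of the old configuration and, separately, a majority of the new one. The configurations below are assumed for illustration (three old servers growing to five), and `is_majority` and `joint_agreement` are hypothetical helper names.

```python
from typing import Set


def is_majority(servers: Set[str], config: Set[str]) -> bool:
    """True if `servers` contains more than half of the members of `config`."""
    return len(servers & config) * 2 > len(config)


def joint_agreement(acks: Set[str], c_old: Set[str], c_new: Set[str]) -> bool:
    """Agreement under joint consensus needs separate majorities of both configs."""
    return is_majority(acks, c_old) and is_majority(acks, c_new)


c_old = {"a", "b", "c"}            # assumed old configuration
c_new = {"a", "b", "c", "d", "e"}  # assumed new configuration

print(joint_agreement({"c", "d", "e"}, c_old, c_new))  # False: only 1 of 3 old servers
print(joint_agreement({"a", "b"}, c_old, c_new))       # False: only 2 of 5 new servers
print(joint_agreement({"a", "b", "d"}, c_old, c_new))  # True: 2 of 3 old, 3 of 5 new
```

Requiring both majorities means neither the old members nor the new members can decide anything on their own during the transition.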
167
Configuration change
Leaders initiate the configuration change from old to new
They send out a special log entry, C_old,new, describing both
As soon as a server logs C_old,new, it plays by joint consensus rules
For C_old,new to commit, it must reach a majority of C_old and a majority of C_new
If the leader fails, the new leader could be running C_old or C_old,new
The important thing is that the members of C_new cannot act unilaterally
In other words, a majority of C_old is still required for commits and elections
Eventually, C_old,new will commit (we may have to try several times)
If C_old,new commits, what must be true of future leaders?
By Leader Completeness, they must be running the C_old,new config
168
Configuration change: once C_old,new commits, we’re in good shape
Leader can now try to commit C_new
Leader logs and uses C_new immediately, i.e., it only requires a majority of C_new to commit entries
Nasty corner case: what if the leader isn’t in C_new? (i.e., it is managing a cluster of which it is not a member)
It’s fine! Majorities don’t have to include the leader
The leader continues to count votes and push updates
When and how should the leader exit?
Wait until C_new commits, then shut down
C_new will elect a new leader