CMPT 431 Lecture IX: Coordination And Agreement
CMPT 431 © A. Fedorova
A Replicated Service
[Figure: two clients connected over a network to replicated servers, a master and a slave. Legend: W = write, R = read; written data is replicated between the servers.]

A Need For Coordination And Agreement
[Figure: two clients connected over a network to a master server and a slave server.]
Must coordinate election of a new master
Must agree on a new master

Roadmap
Today we will discuss protocols for coordination and agreement
This is a difficult problem because of failures and the lack of a bound on message delay
We will begin with a strong set of assumptions (few failures), and then we will relax those assumptions
We will look at several problems requiring coordination and agreement: distributed mutual exclusion and election
Finally, we will learn that in an asynchronous distributed system consensus cannot be guaranteed if even one process may fail

Distributed Mutual Exclusion (DMTX)
Similar to a local mutual exclusion problem
Processes in a distributed system share a resource
Only one process can access a resource at a time
Examples:
  – File sharing
  – Sharing a bank account
  – Updating a shared database

Assumptions and Requirements
A synchronous system
Processes do not fail
Message delivery is reliable (exactly once)
Protocol requirements:
  – Safety: At most one process may execute in the critical section at a time
  – Liveness: Requests to enter and exit the critical section eventually succeed
  – Fairness: Requests to enter the critical section are granted in the order in which they were received

Evaluation Criteria of DMTX Algorithms
Bandwidth consumed
  – proportional to the number of messages sent in each entry and exit operation
Client delay
  – delay incurred by a process at each entry and exit operation
System throughput
  – the rate at which processes can access the critical section (number of accesses per unit of time)

DMTX Algorithms
We will consider the following algorithms:
  – Central server algorithm
  – Ring-based algorithm
  – An algorithm based on voting

The Central Server Algorithm
Performance:
  – Entering a critical section takes two messages (a request message followed by a grant message)
  – System throughput is limited by the synchronization delay at the server (the time between the release message to the server and the grant message to the next client)
Fault tolerance:
  – Does not tolerate failures
  – What if the client holding the token fails?

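The request, grant and release exchange described above can be sketched as a small in-memory model. This is a minimal illustration, not the lecture's implementation: the class name `CentralLockServer` and its methods are hypothetical, messages are modeled as method calls, and there is no real networking or failure handling.

```python
from collections import deque


class CentralLockServer:
    """Hypothetical in-memory model of the central server (no networking)."""

    def __init__(self):
        self.holder = None        # process currently holding the token
        self.waiting = deque()    # FIFO queue of pending requests

    def request(self, pid):
        """Handle a request message; return True if the grant is sent at once."""
        if self.holder is None:
            self.holder = pid
            return True
        self.waiting.append(pid)  # the grant is deferred until a release
        return False

    def release(self, pid):
        """Handle a release message; return the next process granted, if any."""
        if pid != self.holder:
            raise ValueError(f"{pid} does not hold the token")
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder


server = CentralLockServer()
first = server.request("p1")         # token is free: granted immediately
second = server.request("p2")        # token is busy: request is queued
third = server.request("p3")
next_holder = server.release("p1")   # the oldest queued request is granted
```

The FIFO queue is what gives the fairness property in this sketch. If the holder never calls `release`, every queued process waits forever, which is the fault-tolerance question raised on the slide.
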
A Ring-Based Algorithm
Processes are arranged in a ring
There is a communication channel from process p_i to process p_((i+1) mod N)
They continuously pass the mutual exclusion token around the ring
A process that does not need to enter the critical section (CS) passes the token along
A process that needs to enter the CS retains the token; once it exits the CS, it passes the token on
No fault tolerance
Excessive bandwidth consumption

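The token passing above can be simulated in a few lines. This is a sketch under simplifying assumptions: the function name `ring_schedule` is hypothetical, each process wants the CS at most once, and one token hop counts as one message.

```python
def ring_schedule(n, wants_cs, start=0, laps=1):
    """Pass the token around a ring of n processes for `laps` full laps.

    wants_cs: indices of processes that want to enter the CS once.
    Returns (order in which processes entered the CS, token messages sent).
    """
    pending = set(wants_cs)
    order = []
    holder = start
    messages = 0
    for _ in range(laps * n):
        if holder in pending:
            order.append(holder)      # holder keeps the token and enters the CS
            pending.discard(holder)   # ...then exits and gives the token up
        holder = (holder + 1) % n     # token goes to the next process in the ring
        messages += 1
    return order, messages


order, messages = ring_schedule(n=5, wants_cs={3, 1}, start=2)
idle_order, idle_messages = ring_schedule(n=4, wants_cs=set())
```

The second call shows the bandwidth cost: the token still makes four hops in a lap even though no process wants the critical section.
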
Maekawa’s Voting Algorithm
To enter a critical section, a process must receive permission from a subset of its peers
Processes are organized in voting sets
A process is a member of M voting sets
All voting sets are of equal size (for fairness)

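Maekawa’s Voting Algorithm (cont)
[Figure: four processes p1, p2, p3, p4 with overlapping voting sets.]
Any two voting sets intersect, which guarantees mutual exclusion: a process in the intersection grants permission to only one requester at a time
To avoid deadlock, requests to enter the critical section must be ordered
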
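One simple way to obtain equal-sized voting sets that pairwise intersect is a grid layout: place k*k processes in a k-by-k grid and let each process's voting set be its row plus its column. This construction is an assumption chosen for illustration and is not taken from the slides; the function name `grid_voting_sets` is hypothetical.

```python
from itertools import combinations


def grid_voting_sets(k):
    """Voting set of process p = all processes in p's row or column of a k x k grid."""
    sets = {}
    for p in range(k * k):
        row, col = divmod(p, k)
        same_row = {row * k + c for c in range(k)}
        same_col = {r * k + col for r in range(k)}
        sets[p] = same_row | same_col
    return sets


voting_sets = grid_voting_sets(3)

# Every pair of voting sets shares at least one process.
all_overlap = all(voting_sets[a] & voting_sets[b]
                  for a, b in combinations(voting_sets, 2))

# All sets have the same size (2k - 1), and every process sits in 2k - 1 sets.
sizes = {len(s) for s in voting_sets.values()}
memberships = {sum(p in s for s in voting_sets.values()) for p in voting_sets}
```

With k = 3 each process needs permission from 5 of the 9 processes, and each process belongs to M = 5 voting sets.
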
Elections
Election algorithms are used when a unique process must be chosen to play a particular role:
  – Master in a master-slave replication system
  – Central server in the DMTX protocol
We will look at the bully election algorithm
The bully algorithm tolerates fail-stop failures
But it works only in a synchronous system with reliable messaging

The Bully Election Algorithm
All processes are assigned identifiers
The system always elects a coordinator with the highest identifier:
  – Each process must know all processes with higher identifiers than its own
Three types of messages:
  – election: a process begins an election
  – answer: a process acknowledges the election message
  – coordinator: an announcement of the identity of the elected process

The Bully Election Algorithm (cont.)
Initiation of election:
  – Process p1 detects that the existing coordinator p4 has crashed and initiates the election
  – p1 sends an election message to all processes with higher identifiers than its own
[Figure: p1 sends election messages to p2, p3 and p4.]

The Bully Election Algorithm (cont.)
What happens if there are no further crashes:
  – p2 and p3 receive the election message from p1, send back the answer message to p1, and begin their own elections
  – p3 sends answer to p2
  – p3 receives no answer message from p4, so after a timeout it elects itself as the leader (knowing it has the highest ID among the live processes) and sends the coordinator message
[Figure: election, answer and coordinator messages exchanged among p1, p2, p3 and p4.]

The Bully Election Algorithm (cont.)
What happens if p3 also crashes after sending the answer message but before sending the coordinator message?
In that case, p2 will time out while waiting for the coordinator message and will start a new election
[Figure: election and answer messages among p1, p2, p3 and p4; p2 starts a new election.]

The Bully Election Algorithm (summary)
The algorithm does not require a central server
It does not require knowing the identities of all the processes
It requires knowing the identities of processes with higher IDs
It survives crashes
It assumes a synchronous system (relies on timeouts)

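The message flow of the bully algorithm can be traced with a small simulation. This is a sketch, not the lecture's code: `bully_election` is a hypothetical name, a timeout is modeled as a crashed process simply never answering, and no process crashes while the election is running.

```python
def bully_election(ids, alive, initiator):
    """Trace a bully election; return (coordinator, list of messages).

    ids: all process identifiers; alive: identifiers that have not crashed.
    Each message is a tuple (kind, sender, receiver).
    """
    log = []
    started = set()
    pending = [initiator]
    coordinator = None
    while pending:
        p = pending.pop(0)
        if p in started:
            continue                       # p is already running an election
        started.add(p)
        answered = False
        for q in sorted(i for i in ids if i > p):
            log.append(("election", p, q))
            if q in alive:                 # a crashed q never answers (timeout)
                log.append(("answer", q, p))
                answered = True
                pending.append(q)          # q begins its own election
        if not answered:                   # no higher process is alive
            coordinator = p
            for q in sorted(i for i in ids if i < p):
                log.append(("coordinator", p, q))
    return coordinator, log


# The slides' scenario: p4 has crashed and p1 notices.
coordinator, log = bully_election(ids={1, 2, 3, 4}, alive={1, 2, 3}, initiator=1)

# If p3 is down as well, p2 wins instead.
fallback, _ = bully_election(ids={1, 2, 3, 4}, alive={1, 2}, initiator=1)
```

Only the highest live process receives no answer, so exactly one process announces itself as coordinator in each run of this model.
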
Consensus With General Failures
The algorithms we’ve covered so far tolerate only fail-stop failures
Let’s look at reaching consensus in the presence of more general failures:
  – Omission
  – Byzantine

Consensus
All processes agree on the same value (or set of values)
When do you need consensus?
  – Leader (master) election
  – Mutual exclusion
  – Transaction involving multiple parties (banking)
We will look at several variants of the consensus problem:
  – Consensus
  – Byzantine generals

System Model
There is a set of processes P_i
There is a set of values {v_0, …, v_(N-1)} proposed by the processes
Each process P_i decides on d_i
d_i belongs to the set {v_0, …, v_(N-1)}
Assumptions:
  – Synchronous system (for now)
  – Fail-stop failures
  – Byzantine failures
  – Reliable channels

Consensus
[Figure: Step 1, Propose: processes P1, P2 and P3 submit values v1, v2 and v3 to the consensus algorithm. Step 2, Decide: P1, P2 and P3 output decisions d1, d2 and d3.]
Courtesy of Jeff Chase, Duke University

Consensus (C)
P_i selects d_i from {v_0, …, v_(N-1)}
All P_i select the same v_k (make the same decision): d_i = v_k
Courtesy of Jeff Chase, Duke University

Conditions for Consensus
Termination: All correct processes eventually decide
Agreement: All correct processes select the same d_i
Integrity: If all correct processes propose the same v, then d_i = v

Consensus in a Synchronous System Without Failures
Each process p_i proposes a decision value v_i
All proposed v_i are sent around, so that each process knows all proposed v_i
Once all processes have received all proposed values, they apply the same function to them, such as minimum(v_1, v_2, …, v_N)
Each process p_i sets d_i = minimum(v_1, v_2, …, v_N)
Consensus is reached
What if processes fail? Can other processes still reach an agreement?

Consensus in a Synchronous System With Fail-stop & Omission Failures
We assume that at most f out of N processes fail
To reach consensus despite f failures, we must extend the algorithm to take f+1 rounds
At round 1: each process p_i sends its proposed v_i to all other processes and receives values from the other processes
At each subsequent round, process p_i sends the values that it has not sent before and receives new values
The algorithm terminates after f+1 rounds
Let’s see why it works…

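The f+1 round algorithm can be exercised with a small simulation. This is a sketch under stated assumptions: `sync_consensus` and the `crashes` schedule format are hypothetical, rounds are modeled as loop iterations, a crashing process reaches only the listed recipients in its final round, and the decision function is the minimum.

```python
def sync_consensus(proposals, f, crashes=None):
    """Run f+1 synchronous rounds; return decisions of processes that never crash.

    proposals: dict pid -> proposed value.
    crashes: dict pid -> (round, recipients reached in that round before crashing).
    """
    crashes = crashes or {}
    known = {p: {v} for p, v in proposals.items()}   # values each process holds
    sent = {p: set() for p in proposals}             # values already forwarded
    alive = set(proposals)
    for rnd in range(1, f + 2):
        inbox = {p: set() for p in proposals}
        for p in sorted(alive):
            new = known[p] - sent[p]                 # only values not sent before
            crash = crashes.get(p)
            crashing = crash is not None and crash[0] == rnd
            targets = crash[1] if crashing else alive - {p}
            for q in targets:
                inbox[q] |= new
            sent[p] |= new
        alive -= {p for p, (r, _) in crashes.items() if r == rnd}
        for p in alive:
            known[p] |= inbox[p]
    return {p: min(known[p]) for p in alive}


proposals = {"p1": 1, "p2": 5, "p3": 9}
# p1 crashes in round 1 after reaching only p2.
crashes = {"p1": (1, {"p2"})}

two_rounds = sync_consensus(proposals, f=1, crashes=crashes)
one_round = sync_consensus(proposals, f=0, crashes=crashes)
```

With f = 1 the second round lets p2 forward the value 1 to p3, so both decide 1. Running only one round against the same crash leaves p3 without that value, and the two survivors disagree.
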
Proof that Consensus is Reached
We will prove this by contradiction
Suppose that, when the algorithm terminates, some correct process p_i possesses a value that another correct process p_j does not possess
This must have happened because some other process p_k sent that value to p_i but crashed before sending it to p_j (or the message to p_j was lost)
The crash must have happened in round f+1 (the last round); otherwise, p_i would have sent that value to p_j in a later round
But why did p_j not receive that value in any of the previous rounds? There must have been a crash in every previous round: some process sent the value to some other processes but crashed before sending it to p_j
But this implies that there must have been at least f+1 failures
This is a contradiction: we assumed at most f failures

A Take-Away Point
If you cannot build a fully failproof algorithm...
Build an algorithm that is guaranteed to tolerate some number f of failures
Then build a system that has no more than f failures with high probability

Byzantine Generals Problem (BG)
Two types of generals: commander and subordinates
A commander proposes an action (v_i). Subordinates must agree
[Figure: the leader (commander) sends v_leader to each subordinate (lieutenant); every subordinate decides d_i = v_leader, d_j = v_leader.]
Courtesy of Jeff Chase, Duke University

Conditions for Consensus
Termination: All correct processes eventually decide
Agreement: All correct processes select the same d_i
Integrity: If the commander is correct, then all correct processes decide on the value that the commander proposed

Consensus in a Synchronous System With Byzantine Failures
Byzantine failure: a process can forward to another process an arbitrary value v
Byzantine generals: the commander...
  – says to one lieutenant that v = A
  – says to another lieutenant that v = B
We will show that consensus is impossible with only 3 generals if one of them is faulty
Pease et al. generalized this: consensus is impossible when N ≤ 3f, where N is the number of generals and f is the number of faulty generals

BG: Impossibility With Three Generals
[Figure: Scenario 1: the commander p1 is correct and sends 1:v to p2 and p3; p2 relays 2:1:v to p3; the faulty p3 relays 3:1:u to p2. Scenario 2: the faulty commander p1 sends 1:v to p2 and 1:u to p3; p2 relays 2:1:v and p3 relays 3:1:u. Faulty processes are shown shaded. “3:1:u” means “3 says 1 says u”.]
Scenario 1: p2 must decide v (by the integrity condition)
But p2 cannot distinguish between Scenario 1 and Scenario 2: it receives the same messages in both
If it decides to believe the commander, it will decide v in Scenario 2 as well
By symmetry, p3 will decide u in Scenario 2
p2 and p3 will have reached different decisions, which violates agreement

Solution With Four Byzantine Generals
We can reach consensus if there are 4 generals and at most 1 is faulty
Intuition: use the majority rule
[Figure: a correct process asks “Who is telling the truth?” Majority rules!]

Solution With Four Byzantine Generals
[Figure, left: lieutenant p3 is faulty. The commander p1 sends 1:v to p2, p3 and p4; p2 relays 2:1:v and p4 relays 4:1:v; p3 relays 3:1:u to p2 and 3:1:w to p4. Right: the commander p1 is faulty and sends 1:u to p2, 1:w to p3 and 1:v to p4; the lieutenants relay 2:1:u, 3:1:w and 4:1:v. Faulty processes are shown shaded.]
Round 1: The commander sends a value to each of the other generals (a correct commander sends the same v to all)
Round 2: The lieutenants exchange the values that they received from the commander
The decision is made based on majority

Solution With Four Byzantine Generals
[Figure: the scenario with faulty lieutenant p3. The commander p1 sends 1:v to p2, p3 and p4; p2 relays 2:1:v and p4 relays 4:1:v; p3 relays 3:1:u to p2 and 3:1:w to p4.]
p2 receives: {v, v, u}. Decides v
p4 receives: {v, v, w}. Decides v

Solution With Four Byzantine Generals
[Figure: the scenario with a faulty commander. p1 sends 1:u to p2, 1:w to p3 and 1:v to p4; the lieutenants relay 2:1:u, 3:1:w and 4:1:v.]
p2 receives: {u, w, v}. Decides NULL
p4 receives: {u, v, w}. Decides NULL
p3 receives: {w, u, v}. Decides NULL
The result generalizes to systems with N ≥ 3f + 1 (N is the number of processes, f is the number of faulty processes)

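The two rounds and the majority decision for four generals can be checked with a short simulation. This is a sketch, not the lecture's code: the names `decide` and `byzantine_round` are hypothetical, NULL is modeled as `None`, and a faulty lieutenant is modeled by a `lies` table that overrides what it relays to specific receivers.

```python
from collections import Counter


def decide(values):
    """Return the value held by a strict majority, or None (the slides' NULL)."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


def byzantine_round(commander_sends, lies=None):
    """Compute each lieutenant's decision after the two rounds.

    commander_sends: dict lieutenant -> value received from the commander (round 1).
    lies: dict (sender, receiver) -> value a faulty sender relays instead (round 2).
    """
    lies = lies or {}
    lieutenants = sorted(commander_sends)
    decisions = {}
    for p in lieutenants:
        received = [commander_sends[p]]
        for q in lieutenants:
            if q != p:
                # An honest q relays exactly what the commander told it.
                received.append(lies.get((q, p), commander_sends[q]))
        decisions[p] = decide(received)
    return decisions


# Faulty lieutenant p3 tells p2 "u" and p4 "w"; the commander is correct.
faulty_lieutenant = byzantine_round(
    {"p2": "v", "p3": "v", "p4": "v"},
    lies={("p3", "p2"): "u", ("p3", "p4"): "w"},
)

# Faulty commander sends a different value to each lieutenant.
faulty_commander = byzantine_round({"p2": "u", "p3": "w", "p4": "v"})
```

In the first run the correct lieutenants p2 and p4 both decide v. In the second run every lieutenant sees three different values and decides `None`, so the correct processes still agree.
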
Consensus in an Asynchronous System
In the algorithms we’ve looked at, consensus was reached by using several rounds of communication
The systems were synchronous, so each round always terminated
If a process did not receive a message from another process in a given round, it could assume that the other process was faulty
In an asynchronous system this assumption cannot be made!
Fischer, Lynch and Paterson (1985): no algorithm can guarantee consensus in an asynchronous system if even one process may crash
Intuition: a “failed” process may just be slow, and can rise from the dead at exactly the wrong time

Consensus in Practice
Real distributed systems are by and large asynchronous
How do they operate if consensus cannot be guaranteed?
Assume a synchronous system: use manual fault resolution if something goes wrong
Fault masking: assume that failed processes always recover, and define a way to reintegrate them into the group
  – If you haven’t heard from a process, just keep waiting…
  – A round terminates when every expected message is received
Failure detectors: construct a failure detector that can determine if a process has failed
  – A round terminates when every expected message is received, or the failure detector reports that its sender has failed

Failure Detectors
First problem: how to detect that a member has failed?
  – pings, timeouts, beacons, heartbeats
  – recovery notifications
Is the failure detector accurate?
  – Does it accurately detect failures?
Is the failure detector live?
  – Are there bounds on failure detection time?
In an asynchronous system, it is impossible for a failure detector to be both accurate and live

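A heartbeat-and-timeout detector of the kind listed above can be sketched as follows. The class `HeartbeatDetector` and its methods are hypothetical names, time is passed in explicitly as a number so the example is deterministic, and the timeout value is arbitrary.

```python
class HeartbeatDetector:
    """Suspect a process if no heartbeat arrived within `timeout` time units."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}          # pid -> time of the most recent heartbeat

    def heartbeat(self, pid, now):
        self.last_seen[pid] = now

    def suspected(self, now):
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}


detector = HeartbeatDetector(timeout=3)
detector.heartbeat("p1", now=0)
detector.heartbeat("p2", now=0)
detector.heartbeat("p1", now=4)

suspects_at_5 = detector.suspected(now=5)   # p2 has been silent for 5 units

# p2 was only slow: its heartbeat finally arrives.
detector.heartbeat("p2", now=6)
suspects_at_7 = detector.suspected(now=7)
```

The detector answers within a bounded time, so it is live, but the suspicion of p2 at time 5 turned out to be wrong, so it is not accurate. A longer timeout reduces false suspicions at the cost of slower detection.
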
Summary
Coordination and agreement are essential in real distributed systems
Real distributed systems are asynchronous
Consensus cannot be guaranteed in an asynchronous distributed system in the presence of failures
Nevertheless, people still build useful distributed systems that rely on consensus
Fault recovery and masking are used as mechanisms for helping processes reach consensus
Popular fault masking and recovery techniques are transactions and replication, the topics of the next few lectures