State Machines CS 614 Thursday, Feb 21, 2002 Bill McCloskey.


Introduction State machines provide fault-tolerance through replication. They consist of state variables and commands to change the state. Clients request the state machine to execute commands. [Figure: a client sends a command to the state machine, moving it from State A to State B.]

An Example: Memory
State variables:
  store: array[0..n] of word
Commands:
  read: command(loc: 0..n)
    send store[loc] to client
  write: command(loc: 0..n, value: word)
    store[loc] := value
Reads and writes values to and from storage.
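As a concrete sketch, the same machine in Python (the class and method names are illustrative, not from the paper):

    # Sketch of the memory state machine above (illustrative names).
    class MemoryStateMachine:
        def __init__(self, n):
            self.store = [0] * n          # state variable: store[0..n-1] of word

        def read(self, loc):
            return self.store[loc]        # read: send store[loc] to the client

        def write(self, loc, value):
            self.store[loc] = value       # write: store[loc] := value

    # usage
    m = MemoryStateMachine(16)
    m.write(3, 42)
    assert m.read(3) == 42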

Ordering of Commands The machine will have multiple clients. Commands from the same client must be executed in the order they were issued. Commands from different clients must be executed in an order determined by causality.

Fault-tolerance Replicas of the state machine are run on multiple processors for fault-tolerance. Replicas must start in the same initial state and must process the same set of requests in the same order. This is a consensus problem. Agreement: Every non-faulty replica receives every request. Order: Every non-faulty replica processes the requests in the same order. There are several ways of achieving these conditions…

Agreement Often, agreement is achieved using a Byzantine Agreement protocol: every non-faulty processor will receive each command. Clients can transmit commands to the replicas, or a single replica can serve as transmitter for the client. More efficient techniques can be used instead when only fail-stop failures occur. Other techniques are also possible, such as the Paxos algorithm (more later).

Order and Stability A request can be labeled with a unique ID (uid). The request is considered stable at a state machine replica when no request with a lower uid can still be received from a correct client. The replica must wait until a request is stable before executing it. A state machine therefore processes requests in uid order, so uid assignment must be consistent with the causality of requests (from earlier). Possible stability tests use Lamport clocks or real-time clocks. Replicas may also agree on a uid by running an agreement protocol.

Achieving Stability with Lamport Clocks Each message is marked with a logical timestamp, which serves as its uid. This satisfies the causality requirements. Clients must periodically make “null” requests. A request is stable at a replica when a request with a larger timestamp has been received from every client; then no lower uids can arrive. Requires FIFO channels (easy). Works in the presence of fail-stop failures.
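A minimal sketch of this stability test, assuming FIFO channels and periodic null requests (the function and argument names are invented for illustration):

    def is_stable(uid, latest_ts_from_client):
        # uid = (lamport_timestamp, client_id); latest_ts_from_client maps every
        # known client to the largest timestamp received from it so far.
        # With FIFO channels, no request with a smaller uid can still arrive
        # once every client has sent something with a larger timestamp.
        ts, _sender = uid
        return all(latest > ts for latest in latest_ts_from_client.values())

    # usage: request (5, "c1") is stable once every client has sent timestamp > 5
    print(is_stable((5, "c1"), {"c1": 9, "c2": 7}))   # True
    print(is_stable((5, "c1"), {"c1": 9, "c2": 4}))   # False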

Achieving Stability with Real-Time Clocks Real-time clock value, together with the identity of the sending process, is the uid. To satisfy causality, a client can make only one request per clock tick, and message delivery must take longer than the difference between clocks on different processors. Let Δ be the time for a request to reach every correct processor. A request is stable if its timestamp is at least Δ time units in the past, according to the local clock. This imposes a delay in processing.
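A corresponding sketch of the real-time-clock test (DELTA and the function name are invented; DELTA stands for the assumed maximum time for a request to reach every correct processor):

    import time

    DELTA = 0.050                      # assumed 50 ms delivery bound (illustrative)

    def is_stable_rt(request_timestamp, now=None):
        # A request is stable once its timestamp is at least DELTA in the past
        # according to the local clock, so no earlier request can still arrive.
        now = time.time() if now is None else now
        return now - request_timestamp >= DELTA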

Replica-Generated Uids Ordering using Lamport clocks requires all processors to communicate (null requests). Real-time clock ordering requires clock synchronization, which is also expensive. Alternatively, the replicas themselves can agree on a uid for each request. Each replica proposes a candidate uid, and the replicas then agree on a final uid, accepting the request. A client cannot issue a new request until its previous request has been accepted, which guarantees causality.

Implementing Replica-Generated Uids The final uid is always at least the candidate uid. A request r’ seen after a request r has been accepted has a higher candidate uid than the final uid of r. A new candidate uid is one greater than any candidate or final uid seen so far, plus i/N (for replica i of N) to make it unique. Each replica broadcasts its candidate uid. The final uid is selected as the maximum of all the uids received.
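A sketch of one way to generate such uids for replica i of N (the function names and the floor-plus-one rule are illustrative, not taken verbatim from the paper):

    import math

    def candidate_uid(max_uid_seen, i, N):
        # One greater than any candidate or final uid seen so far, plus i/N so
        # that no two replicas ever propose the same candidate (0 <= i < N).
        return math.floor(max_uid_seen) + 1 + i / N

    def final_uid(candidates):
        # The final uid is the maximum of the candidate uids broadcast.
        return max(candidates)

    # usage: 3 replicas, largest uid seen so far is 7.33
    cands = [candidate_uid(7.33, i, 3) for i in range(3)]   # 8.0, 8.33..., 8.66...
    print(final_uid(cands))                                  # 8.666...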

Paxos: Another Approach Lamport’s Paxon Synod is an agreement algorithm. It is efficient and practical. It assumes a partially synchronous model. Messages are not always delivered on time. Messages may be lost or duplicated. Processors may fail silently. Guarantees: Agreement: Everyone agrees on the same value. Validity: The chosen value was one of the candidates. Termination is not guaranteed.

Stability An execution fragment α is stable if: no processors fail or recover in α, no packets are lost or duplicated in α, and delivery of messages is on time. α is nice if it is stable and if a majority of processes are alive. We’ll see that Paxos terminates if there is an execution fragment which is nice for long enough.

Leader Election Paxos requires a leader to “run” the algorithm. Processes exchange “Alive” messages to try to detect failures. When the current leader fails, the live process with the largest processor ID is selected as the new leader. The failure detector doesn’t always work, so there may be multiple leaders or no leader. The algorithm may not terminate if there are always too many leaders.
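The leader rule itself is a one-liner; a sketch (alive_ids is whatever set of processes the failure detector currently believes to be alive, an invented name):

    def current_leader(alive_ids):
        # The live process with the largest ID acts as leader.
        return max(alive_ids)

    print(current_leader({2, 5, 9}))   # 9
    print(current_leader({2, 5}))      # 5, once process 9 is suspected to have failed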

Setup The algorithm operates in a sequence of rounds. Multiple rounds may be ongoing at the same time. In each round, the leader tries to get a majority for a certain value. Processes vote in each round. If a majority of processes vote in a round, then the value chosen is the one proposed by the leader. If too few processes vote in the round, it fails and a new one is started.

Rounds Each round is numbered with a tuple (l, r), where l is the process ID of the leader and r is the leader’s index for that round. Lexicographic ordering is used on the rounds. This way, round numbers are unique. Thus, each round has a unique value, since a leader only proposes one value, and a round only has one leader.
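In code such round numbers can be ordinary pairs; Python tuples, for example, already compare lexicographically (values here are illustrative):

    r1 = (3, 7)     # leader 3, its 7th round
    r2 = (5, 1)     # leader 5, its 1st round
    print(r1 < r2)  # True: ordered by leader ID first, then the leader's index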

Algorithm
1. The leader informs the other replicas that round R is starting.
2. Each replica finds the last round before R in which it voted and sends this vote to the leader.
3. The leader waits for these votes from a majority set (quorum) Q.
4. Based on these previous votes, the leader decides to propose a certain value v for the new round, and informs the replicas in Q.
5. Each replica may or may not vote in this round. If it chooses to vote, it sends its vote to the leader.

Algorithm (2) 6. If the leader receives a vote from every replica in Q, it informs everyone that v is the consensus value.
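A compressed, single-process sketch of one round seen from the leader’s side, under the assumptions of this section (the Replica class, run_round, and the promise bookkeeping are invented for illustration; round numbers are plain integers here rather than (l, r) pairs, messages become function calls, and the promise is slightly stronger than forbidding only the open interval (R', R), which is still safe):

    class Replica:
        def __init__(self):
            self.last_vote = None       # (round, value) of the most recent vote, or None
            self.promised = -1          # will not vote in any round below this

        def report(self, R):
            # Step 2: report the last vote cast before round R, and promise not
            # to vote in any round below R from now on.
            self.promised = max(self.promised, R)
            return self.last_vote

        def vote(self, R, v):
            # Step 5: vote unless forbidden by an earlier promise.
            if R >= self.promised:
                self.last_vote = (R, v)
                return True
            return False

    def run_round(R, leader_value, replicas):
        # Steps 1-3: ask for previous votes; here the first majority answers.
        quorum = replicas[: len(replicas) // 2 + 1]
        reports = [p.report(R) for p in quorum]

        # Step 4: keep the round anchored -- reuse the value of the latest
        # reported vote, or fall back to the leader's own value.
        voted = [r for r in reports if r is not None]
        v = max(voted)[1] if voted else leader_value

        # Steps 5-6: the round succeeds only if every member of the quorum votes.
        if all(p.vote(R, v) for p in quorum):
            return v            # Outcome(R, v) would now be broadcast
        return None             # round failed; a new round must be started

    replicas = [Replica() for _ in range(3)]
    print(run_round(1, "A", replicas))   # 'A'
    print(run_round(2, "B", replicas))   # 'A' again: later rounds keep the chosen value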

Voting Why would a replica ever decide not to vote? When it gives its last vote (which, say, was in round R’) to a leader in round R, it must not vote in any round between R’ and R, since that would invalidate the information it sent. This means that if leaders keep starting new rounds, everyone will be forbidden to vote and the algorithm will never terminate. If a majority of processes are forbidden to vote in a round, the round is dead. A dead round can never succeed.

Example of Paxos The leader/replica message exchange for one round:
Leader: chooses round R and asks for previous votes — sends Query(R) →
Replica: finds R’, the last round in which it voted — sends Report(R’) back, forbidding votes for rounds in (R’, R).
Leader: chooses value v to keep the round anchored — sends Vote(R, v) →
Replica: checks whether it is forbidden — if not, sends Voted(R, v) back.
Leader: if a majority voted, sends Outcome(R, v) to everyone.

Anchored Rounds Let v_R be the value that the leader proposes for round R. (We saw that this is well defined.) If no quorum is found, v_R = null. A round is anchored if all rounds before it are either dead or have the same value v_R. An anchored round stays anchored (stable). Paxos will be set up so that every round is anchored or has v_R = null. This implies that any two successful rounds have the same value…

Any Two Successful Rounds Have the Same Value
For all R, v_R = null or R is anchored
⇒ for all R, R’ ≤ R, if R’ is not dead then v_R = null or v_R = v_R’
⇒ for all R, R’ ≤ R, if R’ is successful then v_R = null or v_R = v_R’
⇒ any two successful rounds have the same value.
This is the essential property that we needed. It tells us that once there is agreement by a majority, all future rounds will agree on the same value. Now we need to ensure that all rounds will be anchored.

Anchoring the Rounds When the leader has received the most recent votes (and the values voted for) from a majority of replicas, it must propose a value for the current round R to keep it anchored. It looks back through previous rounds from R, skipping over those in which no value was reported; these rounds must be dead, since a majority chose not to vote in them. When it reaches a round R’ with a value, it chooses the same value for the new round. Since R’ was anchored, and all rounds between R and R’ are dead, R is anchored. If it finds no such R’, it can choose the value to be its initial value (given as part of the agreement problem).
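A small sketch of this value-selection rule (choose_value and the report format are invented names); the usage lines correspond to the quorums of the first example on the next slide:

    def choose_value(reports, initial_value):
        # reports maps each quorum member to its latest vote, a (round, value)
        # pair, or None if it never voted.
        voted = [rv for rv in reports.values() if rv is not None]
        if not voted:
            return initial_value      # no votes reported: all earlier rounds are dead
        return max(voted)[1]          # value of the most recent reported round R'

    # First example below: A last voted (1, 7), B voted (2, 8), C voted (3, 9).
    print(choose_value({"A": (1, 7), "B": (2, 8)}, 0))   # 8
    print(choose_value({"A": (1, 7), "C": (3, 9)}, 0))   # 9
    print(choose_value({"B": (2, 8), "C": (3, 9)}, 0))   # 9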

An Example
          Value   A voted   B voted   C voted
Round 1     7        X
Round 2     8                  X
Round 3     9                            X
All rounds are dead, so with complete information the leader could choose any value.
Q = {A,B}: Round 4 will use value 8
Q = {A,C}: Round 4 will use value 9
Q = {B,C}: Round 4 will use value 9

Another Example
          Value   A voted   B voted   C voted
Round 1     7        X
Round 2     8        X         X
Round 3     8                            X
Round 2 succeeds. Rounds 1 and 3 are dead.
Q = {A,B}: Round 4 will use value 8
Q = {A,C}: Round 4 will use value 8
Q = {B,C}: Round 4 will use value 8

Summary of Paxon Synod This completes the proof that Paxos is correct. Validity: Leaders always propose values that they were given or that were proposed before. Agreement: The leader sends the consensus result to everyone. Even if more rounds take place, they’ll always produce the same value. Now we have an agreement algorithm which works in a realistic environment, but which may not terminate when failures occur.

The Paxon Parliament Paxos agrees on a single value. For state machines, we need to agree on the commands to execute. We can consider a numbered list of commands which will be executed. The identity of these commands will be decided by consensus. A single leader will run an instance of Paxos for each index. For a finite number of indices, the leader is forced to pick commands based on previous voting. For the rest, it chooses commands as they come from the client.
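A toy sketch of this idea (ReplicatedLog and the decide callback are invented names; a real system would run a full Paxos instance per index rather than the stand-in used here):

    class ReplicatedLog:
        def __init__(self, decide):
            self.decide = decide      # decide(index, proposed_command) -> chosen command
            self.log = []             # log[i] = command chosen for index i

        def submit(self, command):
            index = len(self.log)
            chosen = self.decide(index, command)   # run consensus for this index
            self.log.append(chosen)
            return index, chosen

    # usage with a trivial stand-in for consensus (always accepts the proposal)
    log = ReplicatedLog(lambda i, cmd: cmd)
    print(log.submit("write(3, 42)"))   # (0, 'write(3, 42)')
    print(log.submit("read(3)"))        # (1, 'read(3)')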

Summary A client must wait for a command to reach consensus before requesting another, to satisfy causality. Read operations can be satisfied by checking the local state or by executing a read command, which guarantees proper ordering. Lamport proposes many other optimizations. Ideally, agreeing on a command takes 3n messages for n replicas.