The Raft Consensus Algorithm
Diego Ongaro and John Ousterhout, Stanford University

What is Consensus?
● Consensus: get multiple servers to agree on state
● Solutions typically handle a minority of servers failing
● Equivalent to master-slave replication that can recover from master failures safely and autonomously
● Used in building consistent storage systems
  ○ Top-level system configuration
  ○ Sometimes manages entire database state (e.g., Spanner)
● Examples: Chubby, ZooKeeper, Doozer

Raft: making consensus easier
● Consensus is widely regarded as difficult
  ○ Dominated by an algorithm called Paxos
● Raft is designed to be easier to understand
  ○ A user study showed that students learn Raft better
● 25+ implementations of Raft in progress on GitHub
  ○ Bloom, C#, C++, Clojure, Elixir, Erlang, F#, Go, Haskell, Java, JavaScript, OCaml, Python, Ruby

Single Server
[Figure: clients send a command (z←6) to a single server that stores a hash table: x=1, y=2, z=6]

Single Server
[Figure: the same setup, with the server modeled as a state machine that clients send commands to]

Single Server
[Figure: the server now records commands (x←3, y←2, x←1, z←6) in a log that feeds its state machine]

Goal: Replicated Log
● Replicated log ⇒ replicated state machine
  ○ All servers execute the same commands in the same order
● Consensus module ensures proper log replication
● System makes progress as long as any majority of servers are up
● Failure model: fail-stop (not Byzantine), delayed/lost messages
[Figure: clients send commands (z←6) to the servers' consensus modules; each server keeps a log (x←3, y←2, x←1, z←6) that drives its state machine]
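
The replicated-state-machine idea above is the foundation for everything that follows. The following minimal Go sketch (my own illustration, not code from the slides) shows why it works: two replicas that apply the same commands in the same order to a deterministic state machine end up in identical states.

```go
// Replicated state machine in miniature: every server applies the same
// commands, in log order, to a local key-value store.
package main

import "fmt"

// Command is one log entry's operation, e.g. "x <- 3".
type Command struct {
	Key   string
	Value int
}

// StateMachine is a deterministic key-value store.
type StateMachine struct {
	data map[string]int
}

func NewStateMachine() *StateMachine {
	return &StateMachine{data: map[string]int{}}
}

// Apply executes one committed command; determinism is what keeps replicas identical.
func (sm *StateMachine) Apply(c Command) {
	sm.data[c.Key] = c.Value
}

func main() {
	log := []Command{{"x", 3}, {"y", 2}, {"x", 1}, {"z", 6}}

	// Two replicas applying the same log in the same order end in the same state.
	a, b := NewStateMachine(), NewStateMachine()
	for _, c := range log {
		a.Apply(c)
		b.Apply(c)
	}
	fmt.Println(a.data, b.data) // map[x:1 y:2 z:6] map[x:1 y:2 z:6]
}
```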

Approaches to Consensus
Two general approaches to consensus:
● Symmetric, leader-less:
  ○ All servers have equal roles
  ○ Clients can contact any server
● Asymmetric, leader-based:
  ○ At any given time, one server is in charge; others accept its decisions
  ○ Clients communicate with the leader
● Raft uses a leader:
  ○ Decomposes the problem (normal operation, leader changes)
  ○ Simplifies normal operation (no conflicts)
  ○ More efficient than leader-less approaches

Raft Overview
1. Leader election:
  ○ Select one of the servers to act as leader
  ○ Detect crashes, choose a new leader
2. Normal operation (log replication):
  ○ Leader takes commands from clients, appends them to its log
  ○ Leader replicates its log to the other servers (overwriting inconsistencies)
3. Safety:
  ○ Committed entries must survive across leader changes
  ○ Define a commitment rule, constrain leader election

Server States
● At any given time, each server is in one of three states:
  ○ Leader: handles all client interactions and log replication (at most 1 viable leader at a time)
  ○ Follower: completely passive replica (issues no RPCs, responds to incoming RPCs)
  ○ Candidate: intermediate state used to elect a new leader
[Figure: state diagram: servers start as Follower; an election timeout turns a Follower into a Candidate; receiving votes from a majority of servers turns a Candidate into the Leader]
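
The three states and the transitions in the diagram can be captured in a few lines of Go. This is my own sketch (names are not from the slides); it encodes only the transitions the deck describes, including the step-down rule from the election slide.

```go
// The three Raft server states and their transitions, as a minimal sketch.
package main

import "fmt"

type State int

const (
	Follower  State = iota // passive: responds to RPCs, never issues them
	Candidate              // trying to win an election for a new term
	Leader                 // handles clients and replicates the log
)

func (s State) String() string {
	return [...]string{"Follower", "Candidate", "Leader"}[s]
}

// onElectionTimeout: a server that hears nothing starts (or restarts) an election.
func onElectionTimeout(s State) State {
	if s == Follower || s == Candidate {
		return Candidate
	}
	return s
}

// onMajorityVotes: a candidate that gathers votes from a majority becomes leader.
func onMajorityVotes(s State) State {
	if s == Candidate {
		return Leader
	}
	return s
}

// onValidLeaderRPC: any server that hears from a valid leader returns to follower.
func onValidLeaderRPC(State) State { return Follower }

func main() {
	s := Follower
	s = onElectionTimeout(s) // no heartbeats arrived
	s = onMajorityVotes(s)   // won the election
	fmt.Println(s)                   // Leader
	fmt.Println(onValidLeaderRPC(s)) // a newer leader appears: step down
}
```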

Terms
● Time is divided into terms:
  ○ Election
  ○ Normal operation under a single leader
● At most 1 leader per term
● Some terms have no leader (failed election)
● Each server maintains its current term value
● Key role of terms: identify obsolete information
[Figure: a timeline split into terms 1–5; each term begins with an election followed by normal operation, except one term that ends in a split vote and has no leader]
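
A minimal sketch (my own types and names) of how terms identify obsolete information: every RPC carries the sender's term; a server that sees a newer term adopts it and reverts to follower, and it rejects anything tagged with an older term.

```go
// Terms flag stale servers and stale messages.
package main

import "fmt"

type Server struct {
	currentTerm int
	state       string // "follower", "candidate", or "leader"
}

// observeTerm processes the term found in any incoming request or response.
// It returns false if the message is stale and should be rejected.
func (s *Server) observeTerm(rpcTerm int) bool {
	switch {
	case rpcTerm > s.currentTerm:
		s.currentTerm = rpcTerm // our information was obsolete
		s.state = "follower"    // step down, whatever we were doing
		return true
	case rpcTerm < s.currentTerm:
		return false // the sender is behind; reject its request
	default:
		return true
	}
}

func main() {
	s := &Server{currentTerm: 3, state: "leader"}
	fmt.Println(s.observeTerm(2), s.state) // false leader  (stale request rejected)
	fmt.Println(s.observeTerm(5), s.state) // true follower (newer term observed)
}
```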

Heartbeats and Timeouts
● Servers start up as followers
● Followers expect to receive RPCs from leaders or candidates
● If the election timeout elapses with no RPCs:
  ○ The follower assumes the leader has crashed
  ○ The follower starts a new election
  ○ Timeouts are typically a few hundred milliseconds
● Leaders must send heartbeats (empty AppendEntries RPCs) to maintain their authority

Election Basics
Upon election timeout:
● Increment current term
● Change to Candidate state
● Vote for self
● Send RequestVote RPCs to all other servers, and wait until one of the following happens:
  1. Receive votes from a majority of servers:
    ○ Become leader
    ○ Send AppendEntries heartbeats to all other servers
  2. Receive an RPC from a valid leader:
    ○ Return to follower state
  3. No one wins the election (election timeout elapses):
    ○ Increment term, start a new election
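
To make the sequence above concrete, here is a minimal, single-threaded Go sketch of a candidate running an election. It is my own illustration: the RequestVote RPC is stubbed as a callback, and a real implementation would contact peers concurrently under running timers.

```go
// One round of leader election, simplified.
package main

import "fmt"

type VoteReply struct {
	Term    int
	Granted bool
}

type Node struct {
	id          int
	currentTerm int
	votedFor    int // -1 means no vote cast in the current term
	state       string
	peers       []int
}

func (n *Node) startElection(requestVote func(peer, term, candidate int) VoteReply) {
	// Upon election timeout: new term, become candidate, vote for self.
	n.currentTerm++
	n.state = "candidate"
	n.votedFor = n.id
	votes := 1 // our own vote

	for _, p := range n.peers {
		reply := requestVote(p, n.currentTerm, n.id)
		if reply.Term > n.currentTerm {
			// Someone is ahead of us: adopt their term and step down.
			n.currentTerm, n.state, n.votedFor = reply.Term, "follower", -1
			return
		}
		if reply.Granted {
			votes++
		}
	}
	if votes > (len(n.peers)+1)/2 {
		n.state = "leader" // and immediately send AppendEntries heartbeats
		return
	}
	// Otherwise stay a candidate: either a valid leader's RPC demotes us to
	// follower, or another timeout starts a new election with a higher term.
}

func main() {
	n := &Node{id: 1, votedFor: -1, state: "follower", peers: []int{2, 3}}
	grantAll := func(peer, term, candidate int) VoteReply {
		return VoteReply{Term: term, Granted: true}
	}
	n.startElection(grantAll)
	fmt.Println(n.state, n.currentTerm) // leader 1
}
```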

Election Properties
● Safety: allow at most one winner per term
  ○ Each server gives out only one vote per term (persisted on disk)
  ○ Two different candidates can't both accumulate majorities in the same term
● Liveness: some candidate must eventually win
  ○ Choose election timeouts randomly from a fixed interval (e.g., 150–300 ms)
  ○ One server usually times out and wins the election before the others wake up
[Figure: a row of servers; the servers that voted for candidate A cannot also give candidate B a majority]
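
The liveness trick is just randomization. A minimal sketch follows; the 150–300 ms interval is the example range given in the Raft paper, not necessarily this deck.

```go
// Randomized election timeouts: each server picks its timeout uniformly from
// a fixed interval, so one server usually times out, wins the election, and
// starts sending heartbeats before the others even wake up.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// randomElectionTimeout picks a fresh timeout in [150 ms, 300 ms).
func randomElectionTimeout() time.Duration {
	const min, max = 150 * time.Millisecond, 300 * time.Millisecond
	return min + time.Duration(rand.Int63n(int64(max-min)))
}

func main() {
	for server := 1; server <= 3; server++ {
		fmt.Printf("server %d election timeout: %v\n", server, randomElectionTimeout())
	}
}
```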

Log Structure
● Log entry = index, term, command
● Log stored on stable storage (disk); survives crashes
[Figure: the leader's log and several followers' logs; each slot shows a log index, the term in which the entry was created, and its command (e.g., 1: x←3, 2: z←6, 3: y←9); the prefix stored on a majority of servers is marked as committed entries]
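
A log entry and the 1-based indexing from the figure translate directly into code. This is a minimal sketch with my own type names, not an excerpt from any Raft implementation.

```go
// Log layout: each entry stores the term in which it was created plus a
// state-machine command, addressed by 1-based index.
package main

import "fmt"

type LogEntry struct {
	Term    int    // term of the leader that created the entry
	Command string // opaque state-machine command, e.g. "x<-3"
}

type Log struct {
	entries []LogEntry // persisted to stable storage in a real server
}

func (l *Log) Append(e LogEntry) { l.entries = append(l.entries, e) }

func (l *Log) LastIndex() int { return len(l.entries) }

// TermAt returns the term of the entry at a 1-based index (0 for the empty prefix).
func (l *Log) TermAt(index int) int {
	if index == 0 {
		return 0
	}
	return l.entries[index-1].Term
}

func main() {
	var l Log
	l.Append(LogEntry{Term: 1, Command: "x<-3"})
	l.Append(LogEntry{Term: 1, Command: "y<-2"})
	l.Append(LogEntry{Term: 2, Command: "z<-6"})
	fmt.Println(l.LastIndex(), l.TermAt(3)) // 3 2
}
```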

Normal Operation
● Client sends a command to the leader
● Leader appends the command to its log
● Leader sends AppendEntries RPCs to its followers
● Once the new entry is safely committed:
  ○ The leader applies the command to its state machine and returns the result to the client
● Followers are caught up in the background:
  ○ The leader notifies followers of committed entries in subsequent AppendEntries RPCs
  ○ Followers apply committed commands to their own state machines
● Performance is optimal in the common case:
  ○ One successful RPC to any majority of servers
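
The normal-operation path can be sketched end to end in a few lines. The following is my own single-threaded illustration (the AppendEntries RPC is a stub, and there are no retries or background catch-up): append locally, replicate, commit once a majority stores the entry, then apply and answer the client.

```go
// Leader handling one client command, simplified.
package main

import "fmt"

type Entry struct {
	Term    int
	Command string
}

type Leader struct {
	term        int
	log         []Entry
	commitIndex int
	applied     []string // stand-in for the state machine
	peers       int      // number of other servers in the cluster
}

func (l *Leader) handleClientCommand(cmd string, appendEntries func(peer int, e Entry) bool) string {
	l.log = append(l.log, Entry{Term: l.term, Command: cmd})
	index := len(l.log)

	acks := 1 // the leader's own copy counts toward the majority
	for p := 0; p < l.peers; p++ {
		if appendEntries(p, l.log[index-1]) {
			acks++
		}
	}
	if acks > (l.peers+1)/2 {
		// Stored on a majority: committed. Apply and reply to the client;
		// followers learn the new commit index in later AppendEntries RPCs.
		l.commitIndex = index
		l.applied = append(l.applied, cmd)
		return "applied: " + cmd
	}
	return "not yet committed"
}

func main() {
	ld := &Leader{term: 2, peers: 4}                           // a 5-server cluster
	twoAck := func(peer int, e Entry) bool { return peer < 2 } // only 2 followers respond
	fmt.Println(ld.handleClientCommand("z<-6", twoAck))        // applied: z<-6 (3 of 5 copies)
}
```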

Log Consistency
High level of coherency between logs:
● If log entries on different servers have the same index and term:
  ○ They store the same command
  ○ The logs are identical in all preceding entries
● If a given entry is committed, all preceding entries are also committed
[Figure: two servers' logs whose entries carry the same index and term agree on their entire common prefix]

AppendEntries Consistency Check
● Each AppendEntries RPC contains the index and term of the entry preceding the new ones
● The follower must contain a matching entry; otherwise it rejects the request
● This implements an induction step that ensures coherency
[Figure: two examples: AppendEntries succeeds when the follower has a matching entry at the preceding index and term, and fails when the entry there has a mismatched term]
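
Here is a minimal Go sketch (my own types; the truncation is simplified relative to a production implementation) of the follower-side check: reject unless the log matches at the preceding index and term, otherwise overwrite from that point on.

```go
// Follower-side AppendEntries consistency check.
package main

import "fmt"

type Entry struct {
	Term    int
	Command string
}

// appendEntries returns true if the follower accepted the entries.
// prevIndex is 1-based; prevIndex == 0 means "appending from the start".
func appendEntries(log *[]Entry, prevIndex, prevTerm int, entries []Entry) bool {
	if prevIndex > len(*log) {
		return false // missing the preceding entry entirely
	}
	if prevIndex > 0 && (*log)[prevIndex-1].Term != prevTerm {
		return false // has an entry there, but from a conflicting term
	}
	// Matching prefix: drop anything after prevIndex and append the new entries.
	*log = append((*log)[:prevIndex], entries...)
	return true
}

func main() {
	follower := []Entry{{1, "x<-3"}, {1, "y<-2"}, {1, "x<-1"}}

	// Succeeds: the follower's entry at index 3 has term 1, matching prevTerm.
	fmt.Println(appendEntries(&follower, 3, 1, []Entry{{2, "z<-6"}})) // true

	// Fails: the leader claims the entry at index 4 has term 3; ours has term 2.
	fmt.Println(appendEntries(&follower, 4, 3, []Entry{{3, "y<-9"}})) // false
}
```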

Log Inconsistencies
Leader changes can result in temporary log inconsistencies:
[Figure: the new leader's log for its term shown against possible follower logs (a)–(f); some followers are missing entries, while others contain extraneous entries from earlier terms]

Repairing Follower Logs
● The leader keeps a nextIndex for each follower:
  ○ Index of the next log entry to send to that follower
● When the AppendEntries consistency check fails, the leader decrements nextIndex and tries again
● When a follower overwrites an inconsistent entry, it deletes all subsequent entries
[Figure: the leader backs nextIndex down until it reaches the point where follower (b)'s log matches, then overwrites the rest of that follower's log with its own entries]
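
A minimal sketch of that repair loop (my own code; the follower's acceptance test is the consistency check from the previous slide, and real leaders track nextIndex per follower across many RPCs rather than in one synchronous loop):

```go
// Backing up nextIndex until the follower's log matches, then overwriting.
package main

import "fmt"

type Entry struct{ Term int }

// syncFollower returns the nextIndex at which the follower finally accepted.
// tryAppend stands in for an AppendEntries RPC carrying the entries starting
// at nextIndex (with prevIndex = nextIndex-1); it reports accept or reject.
func syncFollower(leaderLog []Entry, tryAppend func(prevIndex, prevTerm int, entries []Entry) bool) int {
	nextIndex := len(leaderLog) + 1 // optimistic: just past the leader's log
	for nextIndex > 1 {
		prevIndex := nextIndex - 1
		prevTerm := leaderLog[prevIndex-1].Term
		if tryAppend(prevIndex, prevTerm, leaderLog[prevIndex:]) {
			return nextIndex
		}
		nextIndex-- // consistency check failed: back up and retry
	}
	tryAppend(0, 0, leaderLog) // nextIndex == 1: send the whole log
	return 1
}

func main() {
	leader := []Entry{{1}, {1}, {1}, {4}, {4}, {5}, {5}, {6}, {6}, {6}}
	follower := []Entry{{1}, {1}, {1}, {2}, {2}, {2}, {3}, {3}, {3}, {3}, {3}}

	accept := func(prevIndex, prevTerm int, entries []Entry) bool {
		if prevIndex > len(follower) || (prevIndex > 0 && follower[prevIndex-1].Term != prevTerm) {
			return false
		}
		follower = append(follower[:prevIndex], entries...) // overwrite the rest
		return true
	}
	fmt.Println(syncFollower(leader, accept)) // 4: the logs match through index 3
	fmt.Println(len(follower))                // 10: the follower now mirrors the leader
}
```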

Safety Requirement
Any two committed entries at the same index must be the same.
[Figure: the argument structure: restrictions on commitment plus restrictions on leader election guarantee that once a leader marks an entry committed, the entry is present in every future leader's log]

Picking Up-to-date Leader
● During elections, the candidate must have the most up-to-date log among the electing majority:
  ○ Candidates include log info in their RequestVote RPCs (length of log and term of last log entry)
  ○ A voting server denies its vote if its own log is more up-to-date
[Figure: two comparison cases: with different last terms, the log whose last entry has the later term is more up-to-date; with the same last term but different lengths, the longer log is more up-to-date]
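
The comparison in the figure reduces to two fields per log. A minimal sketch (names are my own):

```go
// "More up-to-date": compare last terms first, break ties by log length.
package main

import "fmt"

type LogInfo struct {
	LastTerm  int // term of the last log entry
	LastIndex int // length of the log
}

// moreUpToDate reports whether log a is strictly more up-to-date than log b.
// A voter with log a denies its vote to a candidate advertising log b when
// this returns true.
func moreUpToDate(a, b LogInfo) bool {
	if a.LastTerm != b.LastTerm {
		return a.LastTerm > b.LastTerm // later last term wins
	}
	return a.LastIndex > b.LastIndex // same last term: longer log wins
}

func main() {
	voter := LogInfo{LastTerm: 3, LastIndex: 5}
	candidate := LogInfo{LastTerm: 2, LastIndex: 7}
	fmt.Println(moreUpToDate(voter, candidate)) // true: vote denied
}
```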

Committing Entry from Current Term
● Case 1 of 2: the leader decides that an entry from its current term is committed
● Majority replication makes entry 3 safe:
[Figure: servers s1–s5; the leader for term 2 (s1) has just had AppendEntries succeed for the entry at index 3 on a majority; the servers that lack it cannot be elected leader for term 3, so once the leader marks the entry committed it is present in every future leader's log]

Committing Entry from Earlier Term
● Case 2 of 2: the leader is trying to finish committing an entry from an earlier term
● Entry 3 is not safely committed:
[Figure: servers s1–s5; the leader for term 4 (s1) has just had AppendEntries succeed for the term-2 entry at index 3 on a majority, yet s5 could still be elected leader for term 5 and overwrite that entry, so majority replication alone does not make it committed]

New Commitment Rules
● A new leader may not mark old entries committed until it has committed an entry from its current term.
● Once entry 4 is committed:
  ○ s5 cannot be elected leader for term 5
  ○ Entries 3 and 4 are both safe
● The combination of election rules and commitment rules makes Raft safe
[Figure: the leader for term 4 replicates entry 4 (from its own term) to a majority; committing entry 4 also commits the earlier entry 3]
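
Putting the rule into code: the leader only advances its commit index to entries from its current term, which commits any earlier entries as a side effect. This is a minimal sketch with my own field names (matchIndex plays the role it has in the Raft paper: the highest log index known to be stored on each other server).

```go
// Commitment rule: commit an index only if a majority stores the entry AND
// the entry carries the leader's current term.
package main

import "fmt"

type Entry struct{ Term int }

type LeaderState struct {
	currentTerm int
	log         []Entry // leader's log, 1-based indices
	matchIndex  []int   // highest index known replicated on each other server
	commitIndex int
}

// advanceCommitIndex finds the largest index that is stored on a majority and
// whose entry is from the current term, then commits everything through it.
func (l *LeaderState) advanceCommitIndex() {
	clusterSize := len(l.matchIndex) + 1 // peers plus the leader itself
	for n := len(l.log); n > l.commitIndex; n-- {
		if l.log[n-1].Term != l.currentTerm {
			continue // never directly commit an entry from an earlier term
		}
		count := 1 // the leader stores every entry in its own log
		for _, m := range l.matchIndex {
			if m >= n {
				count++
			}
		}
		if count > clusterSize/2 {
			l.commitIndex = n // entries 1..n, including older terms, are now committed
			return
		}
	}
}

func main() {
	// The term-2 entry at index 3 already sits on a majority, but only
	// committing the term-4 entry at index 4 commits both of them.
	l := &LeaderState{
		currentTerm: 4,
		log:         []Entry{{1}, {1}, {2}, {4}},
		matchIndex:  []int{4, 4, 2, 2}, // two followers have index 4; two lag behind
	}
	l.advanceCommitIndex()
	fmt.Println(l.commitIndex) // 4
}
```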

Raft Summary
1. Leader election
2. Normal operation
3. Safety
More:
● Many more details in the paper (membership changes, log compaction)
● Join the raft-dev mailing list
● Check out the 25+ implementations on GitHub
Diego