An Introduction to Consensus with Raft

Presentation transcript:

An Introduction to Consensus with Raft
Diego Ongaro and John Ousterhout, Stanford University
http://raftconsensus.github.io

Distributed Systems: availability or consistency?

Inside a Consistent System
TODO: eliminate the single point of failure. The options:
- An ad hoc algorithm (“This case is rare and typically occurs as a result of a network partition with replication lag.” Watch out for @aphyr.)
– OR –
- A consensus algorithm (built-in or library): Paxos, Raft, …
- A consensus service: ZooKeeper, etcd, consul, …

What is Consensus?
- Agreement on shared state (single system image)
- Recovers from server failures autonomously
  - Minority of servers fail: no problem
  - Majority fail: lose availability, retain consistency

Why Is Consensus Needed?
- Key to building consistent storage systems
- Top-level system configuration:
  - Which server is my SQL master?
  - What shards exist in my storage system?
  - Which servers store shard X?
- Sometimes used to replicate entire database state (e.g., Megastore, Spanner)

Goal: Replicated Log
[Figure: clients send commands (e.g., z←6) to servers; each server holds a consensus module, a log (x←3, y←2, x←1, z←6), and a state machine (x=1, y=2, z=6)]
- Replicated log ⇒ replicated state machine: all servers execute the same commands in the same order
- Consensus module ensures proper log replication
- System makes progress as long as any majority of servers are up
- Failure model: fail-stop (not Byzantine), delayed/lost messages
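
To make the replicated-state-machine idea concrete, here is a minimal Go sketch of a deterministic state machine applying the slide's example log in order; the Command and StateMachine names are illustrative, not from any real Raft library.

```go
package main

import "fmt"

// Command is one entry in the replicated log, e.g. "x <- 3".
type Command struct {
	Key   string
	Value int
}

// StateMachine is a deterministic key/value store.
type StateMachine struct {
	state map[string]int
}

// Apply executes one command against the state machine.
func (sm *StateMachine) Apply(c Command) {
	sm.state[c.Key] = c.Value
}

func main() {
	// The slide's example log: x<-3, y<-2, x<-1, z<-6.
	log := []Command{{"x", 3}, {"y", 2}, {"x", 1}, {"z", 6}}
	sm := &StateMachine{state: make(map[string]int)}
	// Same commands, same order => same final state on every replica.
	for _, c := range log {
		sm.Apply(c)
	}
	fmt.Println(sm.state) // map[x:1 y:2 z:6]
}
```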

Paxos Protocol
- Leslie Lamport, 1989
- Nearly synonymous with consensus
- Hard to understand: “The dirty little secret of the NSDI community is that at most five people really, truly understand every part of Paxos ;-).” – Anonymous NSDI reviewer
- Bad foundation for building systems: “There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system…the final system will be based on an unproven protocol.” – Chubby authors

Raft’s Design for Understandability
- We wanted the best algorithm for building real systems
  - Must be correct, complete, and perform well
  - Must also be understandable
- “What would be easier to understand or explain?”
  - Fundamentally different decomposition than Paxos
  - Less complexity in state space
  - Less mechanism

User Study
[Figures: quiz grades and survey results from the user study comparing the understandability of Raft and Paxos]

Raft Overview
1. Leader election
   - Select one of the servers to act as leader
   - Detect crashes, choose new leader
2. Log replication (normal operation)
   - Leader takes commands from clients, appends them to its log
   - Leader replicates its log to other servers (overwriting inconsistencies)
3. Safety
   - Only elect leaders with all committed entries in their logs

Server States
At any given time, each server is in exactly one of three states:
- Follower: completely passive replica (issues no RPCs, responds to incoming RPCs)
- Candidate: used to elect a new leader
- Leader: handles all client interactions, log replication
At most one viable leader at a time.
[State diagram: servers start as Follower; a Follower that times out starts an election and becomes a Candidate; a Candidate that receives votes from a majority of servers becomes Leader]
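
A minimal sketch of these states and transitions, assuming a simplified Server struct; all names here are illustrative, not from any real implementation.

```go
package raft

// State is the role a server currently plays.
type State int

const (
	Follower  State = iota // passive: issues no RPCs, only responds
	Candidate              // soliciting votes to become leader
	Leader                 // handles clients and log replication
)

// Server holds just the state relevant to these transitions.
type Server struct {
	state       State
	currentTerm int
}

// onElectionTimeout: a Follower (or Candidate) starts a new election.
func (s *Server) onElectionTimeout() {
	s.currentTerm++
	s.state = Candidate
	// ...vote for self, send RequestVote RPCs (see the election sketch below)
}

// onMajorityVotes: a Candidate that wins a majority becomes Leader.
func (s *Server) onMajorityVotes() {
	s.state = Leader
}

// onValidLeaderRPC: any server that hears from a valid leader is a Follower.
func (s *Server) onValidLeaderRPC() {
	s.state = Follower
}
```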

Terms
[Timeline figure: Term 1, Term 2, Term 3, …; each term begins with an election (possibly a split vote), usually followed by normal operation]
- Time divided into terms:
  - Election
  - Normal operation under a single leader
- At most one leader per term
- Each server maintains current term value
- Key role of terms: identify obsolete information
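
Continuing the Server sketch above, this is roughly how terms identify obsolete information: every RPC carries the sender's term, and a server that sees a higher term adopts it and steps down. The observeTerm helper is hypothetical, not from any real implementation.

```go
// observeTerm applies the standard term rule to an incoming RPC's term.
// It reports whether the sender's request should be processed at all.
func (s *Server) observeTerm(rpcTerm int) bool {
	if rpcTerm < s.currentTerm {
		return false // sender is stale: reject its request
	}
	if rpcTerm > s.currentTerm {
		s.currentTerm = rpcTerm // adopt the newer term
		s.state = Follower      // any Leader/Candidate role is now obsolete
	}
	return true
}
```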

Leader Election
Leaders send heartbeats to maintain authority. Upon election timeout, a server starts a new election (a sketch follows below):
1. Increment current term
2. Change to Candidate state
3. Vote for self
4. Send RequestVote RPCs to all other servers, then wait until either:
   - It receives votes from a majority of servers: become leader, send heartbeats to all other servers
   - It receives an RPC from a valid leader: return to Follower state
   - No one wins the election (election timeout elapses): increment term, start a new election
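
A sketch of this election procedure, continuing the Server sketch above; requestVoteFromAll is a hypothetical stand-in for sending RequestVote RPCs over the network.

```go
// requestVoteFromAll stands in for real RequestVote RPCs; a real
// implementation sends them in parallel and returns how many other
// servers granted their vote for this term.
func (s *Server) requestVoteFromAll(term int) int {
	return 0 // placeholder: no networking in this sketch
}

// startElection runs steps 1-4 from the slide for a cluster of the
// given size.
func (s *Server) startElection(clusterSize int) {
	s.currentTerm++     // 1. increment current term
	s.state = Candidate // 2. change to Candidate state
	votes := 1          // 3. vote for self
	votes += s.requestVoteFromAll(s.currentTerm) // 4. ask all other servers

	if votes > clusterSize/2 {
		s.state = Leader // majority reached: win, start sending heartbeats
		return
	}
	// Otherwise: either an RPC from a valid leader flips us back to
	// Follower, or the election timeout fires again and we retry in a
	// higher term.
}
```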

Leader Election Visualization
The Secret Lives of Data (http://thesecretlivesofdata.com) visualizes distributed algorithms, starting with Raft. Project by Ben Johnson (author of go-raft).

Randomized Timeouts
If we choose election timeouts randomly, one server usually times out and wins the election before the others wake up; see the sketch below.
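
For example, a minimal Go sketch of a randomized election timeout drawn uniformly from [150 ms, 300 ms), the range the Raft paper suggests; the exact range is a tuning choice.

```go
package main

import (
	"math/rand"
	"time"
)

// electionTimeout returns a random duration in [150ms, 300ms).
func electionTimeout() time.Duration {
	const base = 150 * time.Millisecond
	return base + time.Duration(rand.Int63n(int64(base)))
}

func main() {
	timer := time.NewTimer(electionTimeout())
	<-timer.C
	// Timeout fired: start an election. In a real server, a heartbeat
	// from a valid leader would have reset the timer before this point.
}
```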

Raft Paper
Also covers:
- Log replication
- Client interaction
- Cluster membership changes
- Log compaction
To appear at the 2014 USENIX Annual Technical Conference, June 19-20 in Philadelphia. Draft available on the Raft website.

Raft Implementations
Name               Language  Author(s)
kanaka/raft.js     JS        Joel Martin
go-raft            Go        Ben Johnson (Sky) and Xiang Li (CoreOS)
hashicorp/raft     Go        Armon Dadgar (HashiCorp)
LogCabin           C++       Diego Ongaro (Stanford)
ckite              Scala     Pablo Medina
peterbourgon/raft  Go        Peter Bourgon
rafter             Erlang    Andrew Stone (Basho)
barge              Java      Dave Rusek
py-raft            Python    Toby Burress
ocaml-raft         OCaml     Heidi Howard (Cambridge)
…

Best Logo: go-raft, by Brandon Philips (CoreOS)

Summary
- Consensus is key to building consistent systems
- Design for understandability
- Raft separates leader election from log replication:
  - Leader election uses voting and randomized timeouts
- More at http://raftconsensus.github.io:
  - Paper draft, other talks
  - 10 to 50+ implementations
  - raft-dev mailing list

Diego Ongaro, @ongardie