Consensus: Bridging Theory and Practice

Slides:



Advertisements
Similar presentations
Consensus on Transaction Commit
Advertisements

Paxos and Zookeeper Roy Campbell.
CS 542: Topics in Distributed Systems Diganta Goswami.
High throughput chain replication for read-mostly workloads
CS 5204 – Operating Systems1 Paxos Student Presentation by Jeremy Trimble.
The Raft Consensus Algorithm
Distributed Systems Overview Ali Ghodsi
CSE 486/586, Spring 2013 CSE 486/586 Distributed Systems Byzantine Fault Tolerance Steve Ko Computer Sciences and Engineering University at Buffalo.
The SMART Way to Migrate Replicated Stateful Services Jacob R. Lorch, Atul Adya, Bill Bolosky, Ronnie Chaiken, John Douceur, Jon Howell Microsoft Research.
Consensus Algorithms Willem Visser RW334. Why do we need consensus? Distributed Databases – Need to know others committed/aborted a transaction to avoid.
1 Principles of Reliable Distributed Systems Tutorial 12: Frangipani Spring 2009 Alex Shraer.
Computer Science Lecture 17, page 1 CS677: Distributed OS Last Class: Fault Tolerance Basic concepts and failure models Failure masking using redundancy.
Distributed Systems CS Case Study: Replication in Google Chubby Recitation 5, Oct 06, 2011 Majd F. Sakr, Vinay Kolar, Mohammad Hammoud.
In Search of an Understandable Consensus Algorithm Diego Ongaro John Ousterhout Stanford University.
CS 425 / ECE 428 Distributed Systems Fall 2014 Indranil Gupta (Indy) Lecture 19: Paxos All slides © IG.
Byzantine Fault Tolerance CS 425: Distributed Systems Fall Material drived from slides by I. Gupta and N.Vaidya.
PETAL: DISTRIBUTED VIRTUAL DISKS E. K. Lee C. A. Thekkath DEC SRC.
Presented by: Alvaro Llanos E.  Motivation and Overview  Frangipani Architecture overview  Similar DFS  PETAL: Distributed virtual disks ◦ Overview.
CS492: Special Topics on Distributed Algorithms and Systems Fall 2008 Lab 3: Final Term Project.
The DHCP Failover Protocol A Formal Perspective Rui FanMIT Ralph Droms Cisco Systems Nancy GriffethCUNY Nancy LynchMIT.
Cool ideas from RAMCloud Diego Ongaro Stanford University Joint work with Asaf Cidon, Ankita Kejriwal, John Ousterhout, Mendel Rosenblum, Stephen Rumble,
Bringing Paxos Consensus in Multi-agent Systems Andrei Mocanu Costin Bădică University of Craiova.
An Introduction to Consensus with Raft
Practical Byzantine Fault Tolerance
From Viewstamped Replication to BFT Barbara Liskov MIT CSAIL November 2007.
Paxos A Consensus Algorithm for Fault Tolerant Replication.
Paxos: Agreement for Replicated State Machines Brad Karp UCL Computer Science CS GZ03 / M st, 23 rd October, 2008.
Byzantine Fault Tolerance CS 425: Distributed Systems Fall 2012 Lecture 26 November 29, 2012 Presented By: Imranul Hoque 1.
Consensus and Consistency When you can’t agree to disagree.
CSE 60641: Operating Systems Implementing Fault-Tolerant Services Using the State Machine Approach: a tutorial Fred B. Schneider, ACM Computing Surveys.
Consensus and leader election Landon Cox February 6, 2015.
CSci8211: Distributed Systems: Raft Consensus Algorithm 1 Distributed Systems : Raft Consensus Alg. Developed by Stanford Platform Lab  Motivated by the.
Implementing Replicated Logs with Paxos John Ousterhout and Diego Ongaro Stanford University Note: this material borrows heavily from slides by Lorenzo.
The Raft Consensus Algorithm Diego Ongaro and John Ousterhout Stanford University.
PIROGUE, A LIGHTER DYNAMIC VERSION OF THE RAFT DISTRIBUTED CONSENSUS ALGORITHM Jehan-François Pâris, U. of Houston Darrell D. E. Long, U. C. Santa Cruz.
CSci8211: Distributed Systems: State Machines 1 Detour: Some Theory of Distributed Systems Supplementary Materials  Replicated State Machines Notion of.
Distributed Systems Lecture 9 Leader election 1. Previous lecture Middleware RPC and RMI – Marshalling 2.
Fundamentals of Fault-Tolerant Distributed Computing In Asynchronous Environments Paper by Felix C. Gartner Graeme Coakley COEN 317 November 23, 2003.
Consensus 2 Replicated State Machines, RAFT
BChain: High-Throughput BFT Protocols
Primary-Backup Replication
Distributed Systems – Paxos
Alternative system models
Paxos and Raft (Lecture 21, cs262a)
COS 418: Distributed Systems
View Change Protocols and Reconfiguration
Strong consistency and consensus
Providing Secure Storage on the Internet
Implementing Consistency -- Paxos
RAFT: In Search of an Understandable Consensus Algorithm
Distributed Systems, Consensus and Replicated State Machines
Principles of Computer Security
Replication Improves reliability Improves availability
CS 425 / ECE 428 Distributed Systems Fall 2017 Indranil Gupta (Indy)
EEC 688/788 Secure and Dependable Computing
From Viewstamped Replication to BFT
IS 651: Distributed Systems Fault Tolerance
EEC 688/788 Secure and Dependable Computing
Fault-Tolerant State Machine Replication
IS 651: Distributed Systems Final Exam
Our Final Consensus Protocol: RAFT
View Change Protocols and Reconfiguration
Log Structure log index term leader Server 1 command
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
EEC 688/788 Secure and Dependable Computing
The SMART Way to Migrate Replicated Stateful Services
EEC 688/788 Secure and Dependable Computing
Implementing Consistency -- Paxos
Presentation transcript:

Consensus: Bridging Theory and Practice Diego Ongaro PhD Defense

Consensus: Bridging Theory and Practice Introduction Consensus: agreement on shared state Store state consistently on several servers Must be available even if some servers fail Needed for consistent, fault-tolerant storage systems Top-level system configuration Sometimes used to replicate entire database state Consensus is widely regarded as difficult Raft: consensus algorithm designed for understandability May 27, 2014 Consensus: Bridging Theory and Practice

Replicated State Machines Clients z6 Log Consensus Module State Machine Log Consensus Module State Machine Consensus Module State Machine x 1 y 2 z 6 x 1 y 2 z 6 x 1 y 2 z 6 Servers Log x3 y2 x1 z6 x3 y2 x1 z6 x3 y2 x1 z6 Replicated log  replicated state machine All servers execute same commands in same order Consensus module ensures proper log replication System makes progress as long as any majority of servers are up Failure model: fail-stop (not Byzantine), delayed/lost messages May 27, 2014 Consensus: Bridging Theory and Practice

Consensus: Bridging Theory and Practice Motivation: Paxos “The dirty little secret of the NSDI community is that at most five people really, truly understand every part of Paxos ;-).” – NSDI reviewer “There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system…. the final system will be based on an unproven protocol.” – Chubby authors May 27, 2014 Consensus: Bridging Theory and Practice

Consensus: Bridging Theory and Practice Motivation: Paxos (2) Leslie Lamport, 1989 Theoretical foundations Hard to understand: Can’t separate phase 1 and 2, no intuitive meanings Bad problem decomposition for building systems Too low-level Implementations must extend published algorithm May 27, 2014 Consensus: Bridging Theory and Practice

Consensus: Bridging Theory and Practice Contributions Understandability Raft algorithm, designed for understandability Strong form of leadership Leader election algorithm using randomized timeouts User study to evaluate understandability Completeness Proof of safety and formal spec for core algorithm Cluster membership change algorithm Other components needed for complete and practical system May 27, 2014 Consensus: Bridging Theory and Practice

Design for Understandability Key considerations How hard is it to explain each alternative? How easy will it be for someone to completely understand the approach and its implications? General techniques Decomposing the problem Reducing state space complexity May 27, 2014 Consensus: Bridging Theory and Practice

Consensus: Bridging Theory and Practice Raft Components Leader election Select one of the servers to act as cluster leader Log replication (normal operation) Leader takes commands from clients, appends them to its log Leader replicates its log to other servers Safety Tie above components together to maintain consistency May 27, 2014 Consensus: Bridging Theory and Practice

RaftScope Visualization May 27, 2014 Consensus: Bridging Theory and Practice

Consensus: Bridging Theory and Practice Core Raft Review Leader election Heartbeats and timeouts to detect crashes Randomized timeouts to avoid split votes Majority voting to guarantee at most one leader per term Log replication (normal operation) Leader takes commands from clients, appends them to its log Leader replicates its log to other servers (overwriting inconsistencies) Built-in consistency check simplifies how logs may differ Safety Only elect leaders with all committed entries in their logs New leader defers committing entries from prior terms May 27, 2014 Consensus: Bridging Theory and Practice

Topics for Practical Systems Cluster membership changes Log compaction Client interaction May 27, 2014 Consensus: Bridging Theory and Practice

Cluster Membership Changes Grow/shrink cluster, replace nodes Agreement on change requires consensus Raft’s approach Switch to joint configuration: requires majorities from both old and new clusters Switch to new cluster Overlapping majorities guarantee safety Continues processing requests during change May 27, 2014 Consensus: Bridging Theory and Practice

Other Topics for Complete Systems Log compaction: snapshotting Client interaction How clients find the leader Optimizing read-only operations May 27, 2014 Consensus: Bridging Theory and Practice

Consensus: Bridging Theory and Practice Evaluation Understandability Is Raft easier to understand? Leader election performance How quickly does the randomized timeout approach elect a leader? Correctness Formal specification in TLA+ Proof of core algorithm’s safety Log replication performance One round of RPC from leader to commit log entry (same as Multi-Paxos, ZooKeeper) May 27, 2014 Consensus: Bridging Theory and Practice

Consensus: Bridging Theory and Practice User Study Intro Goal: evaluate Raft’s understandability quantitatively Two classrooms of students Taught them both Raft and Paxos Quizzed them to see which one they learned better Each student: Considered programming assignment: less data Raft lecture and quiz Paxos lecture and quiz Short survey May 27, 2014 Consensus: Bridging Theory and Practice

Consensus: Bridging Theory and Practice Quiz Results 43 participants 33 scored higher on Raft 15 had some prior Paxos experience Paxos mean 20.8 Raft mean 25.7 (+23.6%) May 27, 2014 Consensus: Bridging Theory and Practice

Consensus: Bridging Theory and Practice Survey Results Which would be easier to implement in a correct and efficient system? Which would be easier to explain to a CS grad student? For each question, 33 of 41 said Raft May 27, 2014 Consensus: Bridging Theory and Practice

Consensus: Bridging Theory and Practice Randomized Timeouts How much randomization is needed to avoid split votes? Conservatively, use random range ~10x network latency May 27, 2014 Consensus: Bridging Theory and Practice

Consensus: Bridging Theory and Practice Raft Implementations go-raft Go Ben Johnson (Sky) and Xiang Li (CoreOS) kanaka/raft.js JS Joel Martin hashicorp/raft Armon Dadgar (HashiCorp) rafter Erlang Andrew Stone (Basho) ckite Scala Pablo Medina kontiki Haskell Nicolas Trangez LogCabin C++ Diego Ongaro (Stanford) akka-raft Konrad Malawski floss Ruby Alexander Flatten CRaft C Willem-Hendrik Thiart barge Java Dave Rusek harryw/raft Harry Wilkinson py-raft Python Toby Burress … May 27, 2014 Consensus: Bridging Theory and Practice

Consensus: Bridging Theory and Practice Raft Implementations go-raft Go Ben Johnson (Sky) and Xiang Li (CoreOS) kanaka/raft.js JS Joel Martin hashicorp/raft Armon Dadgar (HashiCorp) rafter Erlang Andrew Stone (Basho) ckite Scala Pablo Medina kontiki Haskell Nicolas Trangez LogCabin C++ Diego Ongaro (Stanford) akka-raft Konrad Malawski floss Ruby Alexander Flatten CRaft C Willem-Hendrik Thiart barge Java Dave Rusek harryw/raft Harry Wilkinson py-raft Python Toby Burress … go-raft logo by Brandon Philips May 27, 2014 Consensus: Bridging Theory and Practice

Consensus: Bridging Theory and Practice Related Work Paxos Theoretical, difficult to apply “Our Paxos implementation is actually closer to the Raft algorithm than to what you read in the Paxos paper...” – Sebastian Kanthak, Spanner Viewstamped Replication, ZooKeeper Both leader-based Ad hoc in nature: did not fully explore design space More complex state spaces: more mechanism Each uses 10 message types, Raft has 4 ZooKeeper widely deployed but neither widely implemented May 27, 2014 Consensus: Bridging Theory and Practice

Summary: Contributions Understandability Raft algorithm, designed for understandability Strong form of leadership Leader election algorithm using randomized timeouts User study to evaluate understandability Completeness Proof of safety and formal spec for core algorithm Cluster membership change algorithm Other components needed for complete and practical system May 27, 2014 Consensus: Bridging Theory and Practice

Consensus: Bridging Theory and Practice Conclusions Consensus widely regarded as difficult Hope Raft makes consensus more accessible Easier to teach in classrooms Better foundation for building practical systems Burst of Raft-based systems is exciting Renewed interest in building consensus systems More off-the-shelf options becoming available Understandability should be a primary design goal May 27, 2014 Consensus: Bridging Theory and Practice

Consensus: Bridging Theory and Practice Acknowledgements May 27, 2014 Consensus: Bridging Theory and Practice

Consensus: Bridging Theory and Practice Acknowledgements May 27, 2014 Consensus: Bridging Theory and Practice

Consensus: Bridging Theory and Practice Questions raftconsensus.github.io May 27, 2014 Consensus: Bridging Theory and Practice