From Viewstamped Replication to BFT Barbara Liskov MIT CSAIL November 2007

Replication Goal: provide reliability and availability by storing information at several nodes

Today's talk: Viewstamped replication (failstop failures) and BFT (Byzantine failures). Characteristics: one-copy consistency, state machine replication, runs on an asynchronous network.

Failstop failures: nodes fail by crashing; a machine is either working correctly or it is doing nothing. Tolerating f such failures requires 2f+1 replicas, and any two operations must intersect in at least one replica. In general we want availability for both reads and writes, so read and write quorums are both f+1 nodes.
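
As a quick sanity check of the arithmetic on this slide, here is a minimal sketch (mine, not from the talk) that enumerates all read/write quorums of f+1 out of 2f+1 replicas and confirms that any two of them share at least one replica; by the pigeonhole argument, (f+1) + (f+1) = 2f+2 > 2f+1.

    from itertools import combinations

    def all_quorums_intersect(f: int) -> bool:
        """Every pair of (f+1)-sized quorums over 2f+1 replicas shares a replica."""
        replicas = range(2 * f + 1)      # 2f+1 replicas tolerate f crash failures
        quorums = list(combinations(replicas, f + 1))
        return all(set(q1) & set(q2) for q1, q2 in combinations(quorums, 2))

    # holds for every f we try: 2f+2 > 2f+1, so two quorums must overlap
    assert all(all_quorums_intersect(f) for f in range(1, 5))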

[Diagram sequence: Quorums. Three servers, one of which is unavailable; a client's write A reaches a quorum of two servers, and a later write B from another client reaches a quorum that overlaps it in at least one server.]

[Diagram: Concurrent Operations. Two clients issue writes A and B concurrently; different servers may apply them in different orders, so quorums alone do not determine an order.]

Viewstamped Replication. Viewstamped replication: a new primary copy method to support highly available distributed systems, B. Oki and B. Liskov, PODC 1988 (and Oki's thesis, May 1988). Replication in the Harp file system, B. Liskov, S. Ghemawat, et al., SOSP 1991. The part-time parliament, L. Lamport, TOCS 1998. Paxos made simple, L. Lamport, Nov. 2001.

Ordering Operations. Replicas must execute operations in the same order. This implies replicas will end up with the same state, assuming replicas start in the same state and operations are deterministic.
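
To make the implication concrete, here is a toy example (the key-value operations are my own, not from the talk): three replicas that start empty and apply the same deterministic operations in the same order necessarily end in the same state.

    from functools import reduce

    def apply_op(state: dict, op: tuple) -> dict:
        """Deterministic operation on a key-value state."""
        kind, key, value = op
        new = dict(state)
        if kind == "put":
            new[key] = value
        elif kind == "delete":
            new.pop(key, None)
        return new

    ops = [("put", "x", 1), ("put", "y", 2), ("delete", "x", None)]
    # Three replicas, same initial state, same operations in the same order.
    finals = [reduce(apply_op, ops, {}) for _ in range(3)]
    assert finals[0] == finals[1] == finals[2] == {"y": 2}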

Ordering Solution. Use a primary: it orders the operations, and the other replicas obey this order.

Views. The system moves through a sequence of views. The primary runs the protocol; the replicas watch the primary and do a view change if it fails.

[Diagram: Execution Model. Client and server each run an application on top of a Viewstamped Replication layer; the client application issues an operation and receives a result.]

Replica state. A replica id i (between 0 and N-1): replica 0, replica 1, … A view number v#, initially 0; the primary is the replica with id i = v# mod N. A log of entries, each with status = prepared or committed.
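
A rough sketch of this per-replica state in code; the class and field names are mine, and real VR keeps more state (for example a client table), so treat it as illustration only.

    from dataclasses import dataclass, field
    from enum import Enum

    class Status(Enum):
        PREPARED = "prepared"
        COMMITTED = "committed"

    @dataclass
    class LogEntry:
        op_number: int
        operation: str
        status: Status

    @dataclass
    class Replica:
        replica_id: int                            # i, between 0 and N-1
        n_replicas: int                            # N
        view_number: int = 0                       # v#, initially 0
        log: list = field(default_factory=list)    # log of LogEntry records

        def primary(self) -> int:
            # the primary of the current view is replica v# mod N
            return self.view_number % self.n_replicas

        def is_primary(self) -> bool:
            return self.replica_id == self.primary()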

[Diagram: Normal Case. All replicas are in view 3 with primary replica 0, and each log already holds operation Q at op number 7 (committed at the primary). Client 1 sends write A,3 to the primary.]

[Diagram: Normal Case, continued. The primary assigns A op number 8, records it as prepared, and sends prepare A,8,3 to the backups.]

[Diagram: Normal Case, continued. Replica 1 records A as prepared at op number 8 and replies ok A,8,3.]

[Diagram: Normal Case, continued. With A prepared at f+1 replicas, the primary records A as committed at op number 8, sends commit A,8,3, and returns the result to the client.]
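
The message flow in the diagrams above can be condensed into the following single-process sketch; real VR exchanges prepare/ok/commit messages over the network and handles retransmission and concurrent requests, and all names here are illustrative rather than taken from the papers.

    def run_normal_case(logs, view, operation, f):
        """logs[0] is the primary's log; logs[1:] are the backups' logs."""
        primary_log, backup_logs = logs[0], logs[1:]

        # 1. The primary assigns the next op number and records the op as prepared.
        op_number = len(primary_log) + 1
        primary_log.append([view, op_number, operation, "prepared"])

        # 2. The primary sends prepare <op, op_number, view> to the backups; each
        #    backup records the entry as prepared and answers ok <op_number, view>.
        oks = 0
        for log in backup_logs:
            log.append([view, op_number, operation, "prepared"])
            oks += 1

        # 3. Once the op is prepared at f+1 replicas (the primary plus f backups),
        #    it is committed: the primary executes it and replies to the client.
        if oks >= f:
            primary_log[-1][3] = "committed"
            return "result({}@{})".format(operation, op_number)
        raise RuntimeError("not enough backups reachable")

    logs = [[], [], []]                      # 2f+1 = 3 replicas, f = 1
    print(run_normal_case(logs, view=3, operation="A", f=1))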

View Changes. Used to mask primary failures. Replicas monitor the primary; the client sends its request to all replicas; a replica requests that the next primary do a view change.

Correctness Requirement. Operation order must be preserved by a view change for operations that are visible, i.e., executed by the server with the result received by the client.

Predicting Visibility. An operation could be visible if it has prepared at f+1 replicas; this is the commit point.

[Diagram: View Change. Replicas 0 and 1 both have A prepared at op number 8 on top of Q committed at 7; replica 2 has only Q at 7.]

[Diagram: View Change, continued. The primary (replica 0) fails.]

[Diagram: View Change, continued. A backup asks the next primary, replica 1, to do a view change to view 4 (do viewchange 4).]

[Diagram: View Change, continued. Replica 1 moves to view 4 (primary 1) and sends viewchange 4 to the other replicas.]

[Diagram: View Change, continued. Replica 2 moves to view 4 and replies vc-ok 4 along with its log; the new primary combines the logs so that prepared operations such as A at op number 8 are preserved.]

Double Booking. Sometimes more than one operation is assigned the same number: in view 3, operation A is assigned 8; in view 4, operation B is assigned 8. Viewstamps resolve this: an op number is paired with the view number in which it was assigned, so the assignment from the later view wins.
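
A tiny illustration of the idea (my representation, not the papers' exact format): pairing the op number with the view in which it was assigned makes slot 8 in view 4 unambiguously later than slot 8 in view 3.

    # a viewstamp: (view number, op number), compared lexicographically
    ViewStamp = tuple

    a = (3, 8)    # operation A, assigned op number 8 in view 3 (never committed)
    b = (4, 8)    # operation B, assigned op number 8 in view 4

    assert b > a  # the assignment from the later view wins slot 8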

[Diagram: Scenario. Replica 0 is still in view 3 with A prepared at op number 8, but it is cut off; replicas 1 and 2 are in view 4 with primary 1 and hold only Q at 7.]

[Diagram: Scenario, continued. Client 2 sends write B,4; the new primary (replica 1) assigns B op number 8 and records it as prepared.]

[Diagram: Scenario, continued. The new primary sends prepare B,8,4 and replica 2 also records B as prepared at op number 8; B at (view 4, op 8) supersedes A at (view 3, op 8), which never committed.]

Additional Issues State transfer Garbage collection of the log Selecting the primary

Improved Performance. Lower latency for writes (3 messages): replicas respond at prepare and the client waits for f+1 responses. Fast reads (one round trip): the client communicates just with the primary, using leases. Witnesses (preferred quorums): use f+1 replicas in the normal case.

Performance. [Figure 5-2: Nhfsstone benchmark with one group; SDM is the Software Development Mix.] From B. Liskov, S. Ghemawat, et al., Replication in the Harp File System, SOSP 1991.

BFT. Practical Byzantine Fault Tolerance, M. Castro and B. Liskov, SOSP 1999. Proactive Recovery in a Byzantine-Fault-Tolerant System, M. Castro and B. Liskov, OSDI 2000.

Byzantine Failures. Nodes fail arbitrarily: they may lie, and they may collude. Causes: malicious attacks, software errors.

Quorums. 3f+1 replicas are needed to survive f failures, and 2f+1 replicas form a quorum. This ensures that any two quorums intersect in at least one honest replica, and 3f+1 is the minimum in an asynchronous network.
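
The arithmetic behind these numbers, written out as a quick check (mine, not from the talk): with N = 3f+1 and quorums of 2f+1, any two quorums overlap in at least f+1 replicas, so even if all f faulty replicas sit in the overlap, at least one honest replica is in both quorums.

    for f in range(1, 10):
        n = 3 * f + 1                    # total replicas
        quorum = 2 * f + 1               # quorum size
        min_overlap = 2 * quorum - n     # smallest possible intersection of two quorums
        assert min_overlap == f + 1
        assert min_overlap - f >= 1      # at least one honest replica in common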

[Diagram: Quorums. Four servers; a client's write A reaches a quorum of three servers.]

[Diagram: Quorums, continued. A later write B reaches another quorum of three; the two quorums of 2f+1 overlap in at least f+1 servers, at least one of which is honest and has seen both writes.]

Strategy. The primary runs the protocol in the normal case; replicas watch the primary and do a view change if it fails. Key difference: replicas might lie.

[Diagram: Execution Model. As before, client and server applications sit on top of a BFT layer; the client application issues an operation and receives a result.]

Replica state. A replica id i (between 0 and N-1): replica 0, replica 1, … A view number v#, initially 0; the primary is the replica with id i = v# mod N. A log of entries, each with status = pre-prepared, prepared, or committed.

Normal Case Client sends request to primary or to all

Normal Case. The primary sends a pre-prepare message to all and records the operation in its log as pre-prepared. Why not a prepare message? Because the primary might be malicious.

Normal Case. Replicas check the pre-prepare and, if it is ok, record the operation in the log as pre-prepared and send prepare messages to all (all-to-all communication).

Normal Case. Replicas wait for 2f+1 matching prepares, record the operation in the log as prepared, and send a commit message to all. Trust the group, not the individuals.

Normal Case. Replicas wait for 2f+1 matching commits, record the operation in the log as committed, execute the operation, and send the result to the client.

Normal Case Client waits for f+1 matching replies

[Diagram: BFT message flow. Client, primary, and replicas 2-4 exchange messages in five phases: request, pre-prepare, prepare, commit, reply.]
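
The counting rules from the preceding slides, written as small predicates; the function names are mine, and message contents, authentication, and checkpointing are omitted, so this is only a sketch of the thresholds, not of the full protocol.

    def prepared(matching_prepares: int, f: int) -> bool:
        # slide rule: an operation is recorded as prepared after 2f+1 matching prepares
        return matching_prepares >= 2 * f + 1

    def committed(matching_commits: int, f: int) -> bool:
        # an operation is recorded as committed (and executed) after 2f+1 matching commits
        return matching_commits >= 2 * f + 1

    def client_accepts(matching_replies: int, f: int) -> bool:
        # the client accepts a result once f+1 matching replies arrive:
        # at least one of them is from an honest replica
        return matching_replies >= f + 1

    f = 1                                     # 3f+1 = 4 replicas
    assert prepared(3, f) and committed(3, f) and client_accepts(2, f)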

View Change. Replicas watch the primary and request a view change if it appears faulty. Commit point: when 2f+1 replicas have prepared. To change views, a replica sends a do-viewchange request to all; the new primary collects 2f+1 such requests and sends a new-view message carrying this certificate. The rest is similar to the viewstamped replication view change.
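
A very rough sketch of the new primary's side of this step (the function name and message layout are mine): it collects matching do-viewchange requests until it has a certificate of 2f+1 of them and then announces the new view.

    def maybe_send_new_view(do_viewchange_msgs, new_view, f):
        """Return a new-view message once 2f+1 do-viewchange requests for new_view arrive."""
        matching = [m for m in do_viewchange_msgs if m["view"] == new_view]
        if len(matching) >= 2 * f + 1:
            # the collected requests are the certificate carried in the new-view message
            return {"type": "new-view", "view": new_view, "certificate": matching}
        return None   # keep waiting

    msgs = [{"view": 4, "from": i} for i in range(3)]   # f = 1, so 2f+1 = 3 suffice
    assert maybe_send_new_view(msgs, new_view=4, f=1) is not None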

Additional Issues State transfer Checkpoints (garbage collection of the log) Selection of the primary Timing of view changes

Improved Performance. Lower latency for writes (4 messages): replicas respond at prepare and the client waits for 2f+1 matching responses. Fast reads (one round trip): the client sends to all replicas, they respond immediately, and the client waits for 2f+1 matching responses.

BFT Performance. [Table 2: Andrew 100 benchmark, elapsed time in seconds, comparing BFS-PK, BFS, and NFS-std.] From M. Castro and B. Liskov, Proactive Recovery in a Byzantine-Fault-Tolerant System, OSDI 2000.

Improvements. Batching: run the protocol once every K requests.
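
A minimal sketch of the batching idea (illustrative only; the class and parameter names are mine): buffer incoming requests and run one agreement instance per batch of K rather than one per request.

    class Batcher:
        def __init__(self, k, run_protocol):
            self.k = k                          # batch size K
            self.pending = []
            self.run_protocol = run_protocol    # e.g. one BFT agreement round

        def submit(self, request):
            self.pending.append(request)
            if len(self.pending) >= self.k:     # amortize protocol cost over K requests
                batch, self.pending = self.pending, []
                self.run_protocol(batch)

    batcher = Batcher(k=3, run_protocol=lambda batch: print("agree on", batch))
    for r in ["r1", "r2", "r3"]:
        batcher.submit(r)                       # runs the protocol once, after the third request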

Follow-on Work. BASE: Using abstraction to improve fault tolerance, R. Rodrigues et al., SOSP 2001. High throughput Byzantine fault tolerance, R. Kotla and M. Dahlin, DSN 2004. Beyond one-third faulty replicas in Byzantine fault tolerant systems, J. Li and D. Mazieres, NSDI 2007. Fault-scalable Byzantine fault-tolerant services, M. Abd-El-Malek et al., SOSP 2005. HQ replication: a hybrid quorum protocol for Byzantine fault tolerance, J. Cowling et al., OSDI 2006.

Papers in SOSP 07: Zyzzyva: speculative Byzantine fault tolerance; Tolerating Byzantine faults in database systems using commit barrier scheduling; Low-overhead Byzantine fault-tolerant storage; Attested append-only memory: making adversaries stick to their word; PeerReview: practical accountability for distributed systems.

Future Directions Keeping less state at 2f+1 or even f+1 replicas Reducing latency Improving scalability

From Viewstamped Replication to BFT Barbara Liskov MIT CSAIL November 2007