Practical Byzantine Fault Tolerance and Proactive Recovery


Practical Byzantine Fault Tolerance and Proactive Recovery
Miguel Castro and Barbara Liskov, ACM TOCS '02
Presented by: Imranul Hoque

Problem
- Computer systems provide crucial services
- Computer systems fail: natural disasters, hardware failures, software errors, malicious attacks
- Need highly available services
- Replicate to increase availability

Assumptions are a Problem
Before BFT vs. BFT:
- Behavior of faulty processes: fail-stop failures [Paxos, Viewstamped Replication] vs. Byzantine failures
- Synchrony: synchronous system [Rampart, SecureRing] vs. an asynchronous system! Safety holds without synchrony; liveness needs only an eventual time bound
- Number of faults: bounded vs. unbounded, provided fewer than one third of the replicas fail within any one time period, via proactive and frequent recovery [When will this not work?]

Contributions
- A practical replication algorithm: weak assumptions, good performance [Really?]
- Implementation: BFT, a generic replication toolkit, and BFS, a replicated file system
- Performance evaluation
Byzantine fault tolerance in an asynchronous environment.

Challenges
[figure: a client sends Request A and Request B to the service]

Challenges
[figure: two clients concurrently send 1: Request A and 2: Request B]

State Machine Replication
[figure: replicas execute requests in a fixed order: 1: Request A, 2: Request B]
How to assign sequence numbers to requests?

Primary-Backup Mechanism
[figure: in view 0, the primary orders the client requests: 1: Request A, 2: Request B]
What if the primary is faulty?
- Agreeing on sequence numbers
- Agreeing on changing the primary (view change)

Agreement
- Certificate: a set of messages from a quorum
- Algorithm steps are justified by certificates
- Quorums have at least 2f + 1 replicas
- Any two quorums intersect in at least one correct replica
[figure: overlapping Quorum A and Quorum B]
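
To make the quorum arithmetic concrete, here is a minimal Python sketch (the names are ours, not the paper's) that checks the intersection property for small f:

```python
# Sketch: quorum sizes in PBFT, assuming n = 3f + 1 replicas.
def quorum_properties(f: int) -> None:
    n = 3 * f + 1          # total replicas
    quorum = 2 * f + 1     # quorum size
    # Any two quorums overlap in at least quorum + quorum - n replicas.
    overlap = 2 * quorum - n
    # At most f of the overlapping replicas can be faulty.
    correct_in_overlap = overlap - f
    print(f"f={f}: n={n}, quorum={quorum}, "
          f"overlap>={overlap}, correct in overlap>={correct_in_overlap}")
    assert correct_in_overlap >= 1  # at least one correct replica in common

for f in range(1, 4):
    quorum_properties(f)
```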

Algorithm Components
- Normal case operation
- View changes
- Garbage collection
- Recovery
All have to be designed to work together.

Normal Case Operation
Three-phase algorithm:
- PRE-PREPARE picks the order of requests
- PREPARE ensures order within views
- COMMIT ensures order across views
Replicas remember messages in a log. Messages are authenticated; {.}σk denotes a message signed by replica k. The message exchange is quadratic in the number of replicas.

Pre-prepare Phase
[figure: the client sends request m to the primary (replica 0), which multicasts {PRE-PREPARE, v, n, m}σ0 to replicas 1-3; replica 3 has failed]

Prepare Phase
[figure: each replica that accepted the PRE-PREPARE multicasts a PREPARE message, e.g., {PREPARE, v, n, D(m), 1}σ1 from replica 1; replica 3 has failed]
A replica collects the PRE-PREPARE plus 2f matching PREPARE messages before moving on.

Commit Phase
[figure: after preparing m, each replica multicasts a COMMIT message, e.g., {COMMIT, v, n, D(m)}σ2 from replica 2; replica 3 has failed]
Once a replica collects 2f + 1 matching COMMIT messages, it executes the request and replies to the client.
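
Putting the three phases together, here is a minimal sketch of the normal-case bookkeeping at one replica, assuming authenticated point-to-point delivery; the class and message shapes are hypothetical, and view changes, water marks, and retransmission are omitted:

```python
# Sketch of PBFT normal-case state at one replica (hypothetical types;
# authentication, retransmission, and water marks are omitted).
from collections import defaultdict

class Replica:
    def __init__(self, f: int):
        self.f = f
        self.pre_prepared = {}            # (v, n) -> digest
        self.prepares = defaultdict(set)  # (v, n, d) -> sender ids
        self.commits = defaultdict(set)   # (v, n, d) -> sender ids

    def on_pre_prepare(self, v, n, d):
        # Accept at most one PRE-PREPARE per (view, sequence number).
        if (v, n) not in self.pre_prepared:
            self.pre_prepared[(v, n)] = d

    def on_prepare(self, v, n, d, sender):
        self.prepares[(v, n, d)].add(sender)

    def on_commit(self, v, n, d, sender):
        self.commits[(v, n, d)].add(sender)

    def prepared(self, v, n, d) -> bool:
        # PRE-PREPARE plus 2f matching PREPAREs from distinct replicas.
        return (self.pre_prepared.get((v, n)) == d
                and len(self.prepares[(v, n, d)]) >= 2 * self.f)

    def committed_local(self, v, n, d) -> bool:
        # Prepared, plus 2f + 1 matching COMMITs: safe to execute and reply.
        return (self.prepared(v, n, d)
                and len(self.commits[(v, n, d)]) >= 2 * self.f + 1)
```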

View Change
Provides liveness when the primary fails. Brief protocol:
- Timeouts trigger view changes
- Select the new primary (= view number mod (3f + 1))
- Replicas send a VIEW-CHANGE message along with the requests they have prepared so far
- The new primary collects 2f + 1 VIEW-CHANGE messages and constructs information about requests committed in previous views
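
A one-line sketch of the primary-selection rule above (the function name is ours):

```python
# Sketch: round-robin primary selection across views, with n = 3f + 1.
def primary(view: int, n: int) -> int:
    # A backup that times out on the current primary moves to view + 1,
    # which rotates the primary to the next replica id.
    return view % n

assert primary(view=5, n=4) == 1
```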

View Change Safety
Goal: no two different requests are committed with the same sequence number across views.
[figure: the quorum behind a committed certificate (m, v, n) and any view-change quorum intersect, so at least one correct replica in the view-change quorum holds a prepared certificate (m, v, n)]

Recovery
- A corrective measure for faulty replicas: proactive and frequent recovery
- All replicas can fail over the system's lifetime, as long as at most f fail within any one window
- Either a system administrator performs recovery, or it is automatic (to withstand network attacks), relying on a secure co-processor, read-only memory, and a watchdog timer
- Clients will not get a reply if more than f replicas are recovering at the same time

Sketch of Recovery Protocol
- Save state; reboot with correct code and restore state: the replica has correct code without losing its state
- Change keys for incoming messages: prevents the attacker from impersonating others
- Send recovery request r: others change their incoming keys when r executes
- Check state and fetch out-of-date or corrupt items: the replica has correct, up-to-date state

Performance
- Andrew benchmark: Andrew100 and Andrew500
- 4 machines: 600 MHz Pentium III
- 3 systems: BFS (based on BFT), NO-REP (BFS without replication), and NFS (the NFS-V2 implementation in Linux)
- No experiments with faulty replicas
- Scalability issue: only 4 and 7 replicas

Benchmark Results (without Proactive Recovery)
[figure: benchmark results]
No view changes and no faulty replicas! The paper does report some view change times, but those come from artificially triggered view changes.

Benchmark Results (with Proactive Recovery)
[figure: benchmark results over the recovery period]
Recovery is staggered!

Related Work
Fault tolerance:
- Fail-stop fault tolerance: Paxos (TR, 1989), Viewstamped Replication (PODC 1988)
- Byzantine fault tolerance:
  - Byzantine agreement: Rampart (TPDS 1995), SecureRing (HICSS 1998), BFT (TOCS '02), BASE (TOCS '03)
  - Byzantine quorums: Malkhi-Reiter (JDC 1998), Phalanx (SRDS 1998), Fleet (ToKDI '00), Q/U (SOSP '05)
  - Hybrid quorums: HQ Replication (OSDI '06)

Today's Talk
Replication in Wide Area Networks
Prof. Keith Marzullo, Distinguished Lectureship Series
Today @ 4 PM, 1404 SC

Questions?

Backup Slides
- Optimizations
- Proof of 3f + 1 optimality
- Garbage collection
- View change with MACs
- Detailed recovery protocol
- State transfer

Optimizations
- Using MACs instead of digital signatures
- Replying with a digest
- Tentative execution of read-write requests
- Tentative execution of read-only requests
- Piggybacking COMMIT on the next PRE-PREPARE
- Request batching
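
The first optimization replaces digital signatures with authenticators, i.e., vectors of MACs, one per receiving replica, computed with pairwise session keys. A minimal illustration using Python's standard hmac module; the session-key table is a hypothetical stand-in for the keys one sender shares with each receiver:

```python
# Sketch: a MAC authenticator, one MAC per receiving replica, computed
# with pairwise session keys (hypothetical key table for one sender).
import hashlib
import hmac

session_keys = {0: b"k0", 1: b"k1", 2: b"k2", 3: b"k3"}  # receiver id -> key

def authenticator(message: bytes) -> dict[int, bytes]:
    # One MAC per receiver, instead of one signature everyone can check.
    return {r: hmac.new(k, message, hashlib.sha256).digest()
            for r, k in session_keys.items()}

def verify(message: bytes, receiver: int, tag: bytes) -> bool:
    expected = hmac.new(session_keys[receiver], message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

tags = authenticator(b"PRE-PREPARE,v,n,D(m)")
assert verify(b"PRE-PREPARE,v,n,D(m)", 2, tags[2])
```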

Proof of 3f + 1 Optimality
- There are f faulty replicas
- The system should reply after n - f replicas respond, to ensure liveness
- Two quorums of n - f replicas intersect in at least n - 2f replicas
- Worst case: of those n - 2f, f are faulty
- The two quorums must have at least 1 non-faulty replica in common
- Therefore n - 3f >= 1, i.e., n >= 3f + 1
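
The same counting argument, written out as a short derivation:

```latex
% Quorum intersection argument for n >= 3f + 1
\begin{align*}
  |Q_1| = |Q_2| &= n - f
    && \text{(reply after } n - f \text{ responses)} \\
  |Q_1 \cap Q_2| &\ge |Q_1| + |Q_2| - n = n - 2f
    && \text{(inclusion--exclusion)} \\
  \#\{\text{correct in } Q_1 \cap Q_2\} &\ge n - 2f - f = n - 3f
    && \text{(at most } f \text{ faulty)} \\
  n - 3f &\ge 1 \;\Longrightarrow\; n \ge 3f + 1
    && \text{(need one correct replica in common)}
\end{align*}
```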

Garbage Collection
- Replicas should remove information about executed operations from their logs.
- Even if replica i has executed a request, it may not be safe to remove that request's information, because a subsequent view change may still need it.
- Use periodic checkpointing (after every K requests).

Garbage Collection
- When replica i produces a checkpoint, it multicasts {CHECKPOINT, n, d, i}αi, where n is the highest sequence number among the requests i has executed and d is a digest of the current state.
- Each replica collects such messages until it has a quorum certificate: 2f + 1 authenticated CHECKPOINT messages with the same n and d, from distinct replicas i.
- This is called the stable certificate: with it, all other replicas will be able to obtain a weak certificate proving that the stable checkpoint is correct if they need to fetch it.

Garbage Collection
- Once a replica has a stable certificate for a checkpoint with sequence number n, that checkpoint is stable.
- The replica can discard all entries in the log with sequence numbers less than n and all checkpointed states earlier than n.
- The checkpoint defines low and high water marks for sequence numbers: L = n and H = n + LOG, for a log of size LOG.
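
A minimal sketch of checkpoint stabilization and log truncation under these rules; the data structures and the LOG constant are illustrative, and message authentication is omitted:

```python
# Sketch: collecting CHECKPOINT messages into a stable certificate and
# truncating the log (hypothetical structures; authentication omitted).
from collections import defaultdict

LOG = 200  # log size, i.e., H - L

class CheckpointState:
    def __init__(self, f: int):
        self.f = f
        self.votes = defaultdict(set)  # (n, digest) -> replica ids
        self.low, self.high = 0, LOG   # water marks L and H
        self.log = {}                  # seqno -> logged messages

    def on_checkpoint(self, n, d, sender):
        self.votes[(n, d)].add(sender)
        if len(self.votes[(n, d)]) >= 2 * self.f + 1:  # stable certificate
            self.make_stable(n)

    def make_stable(self, n):
        # Discard log entries below the stable checkpoint and advance
        # the water marks: L = n, H = n + LOG.
        self.log = {s: m for s, m in self.log.items() if s >= n}
        self.low, self.high = n, n + LOG
```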

View Change with MACs
For each message m = {REQUEST, o, t, c}αc, a replica has logged:
- {PRE-PREPARE, v, n, D(m)}αp: m is pre-prepared at this replica
- {PREPARE, v, n, D(m), i}αi from 2f replicas: m is prepared at this replica
- {COMMIT, v, n, i}αi from 2f + 1 replicas: m is committed at this replica

View Change with MACs
At a high level:
- The primary of view v + 1 reads information about stable and prepared certificates from a quorum.
- It then computes the stable checkpoint and fills in the command sequence.
- It sends this information to the backups, and the backups verify it.

View Change with MACs
Each replica i maintains two sets, P and Q:
- P contains tuples {n, d, v} such that i has collected a prepared certificate for the request with digest d in view v with sequence number n, and there is no such request with sequence number n in a later view.
- Q contains tuples {n, d, v} such that i has pre-prepared the request with digest d in view v with sequence number n, and there is no such request with sequence number n in a later view.
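
One way to maintain P and Q so that only the entry for the latest view survives per sequence number; a sketch with hypothetical structures:

```python
# Sketch: the P and Q sets, keyed by sequence number so that only the
# entry for the latest view is retained (hypothetical structures).
P: dict[int, tuple[str, int]] = {}  # n -> (digest, view) of prepared requests
Q: dict[int, tuple[str, int]] = {}  # n -> (digest, view) of pre-prepared requests

def record_pre_prepared(n: int, d: str, v: int) -> None:
    if n not in Q or Q[n][1] < v:
        Q[n] = (d, v)

def record_prepared(n: int, d: str, v: int) -> None:
    if n not in P or P[n][1] < v:
        P[n] = (d, v)
    record_pre_prepared(n, d, v)  # a prepared request was also pre-prepared
```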

View Change with MACs
When replica i suspects that the primary of view v has failed, it:
- Enters view v + 1
- Multicasts {VIEW-CHANGE, v+1, h, C, P, Q, i}αi, where h is the sequence number of i's stable checkpoint and C is the set of checkpoints at i with their sequence numbers

View Change with MACs
- Replicas collect VIEW-CHANGE messages and acknowledge them to the primary p of view v + 1.
- A replica accepts a VIEW-CHANGE message only if the view numbers in its P and Q components are earlier than the new view number.
- The acknowledgment is {VIEW-CHANGE-ACK, v+1, i, j, d}μi, where i is the sender of the ack, j is the source of the VIEW-CHANGE message, and d is the digest of the VIEW-CHANGE message.

View Change with MACs
The primary p:
- Stores the VIEW-CHANGE message from replica i in S[i] once it gets 2f - 1 corresponding VIEW-CHANGE-ACKs; together these form a view-change certificate.
- Chooses from S a checkpoint and sets of requests, retrying whenever S is updated.

View Change with MACs
Select a checkpoint as the starting state for processing requests in the new view:
- Pick the checkpoint with the highest number h from the set of checkpoints that are known to be correct (at least f + 1 replicas vouch for them) and that have numbers higher than the low water mark of at least f + 1 non-faulty replicas (so ordering information is available).

View Change with MACs
Select a request to pre-prepare in the new view for each n between h and h + LOG:
- If some request m committed with sequence number n in an earlier view, choose m.
- If there is a quorum of replicas that did not prepare any request for n, p proposes a null command.
- Multicast the decision to the backups; each backup can verify it by running the same decision procedure.
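
A sketch of that per-sequence-number decision rule; the inputs stand in for information the primary extracts from its view-change certificates, and the certificate checks themselves are omitted:

```python
# Sketch: choosing what the new primary pre-prepares for each sequence
# number in (h, h + LOG] (hypothetical inputs; certificate checks omitted).
NULL = "no-op"

def choose_requests(h: int, LOG: int, committed: dict[int, str],
                    unprepared_quorum: set[int]) -> dict[int, str]:
    """committed maps n -> digest of a request known committed in an
    earlier view; unprepared_quorum holds sequence numbers for which a
    quorum of replicas prepared nothing."""
    decision = {}
    for n in range(h + 1, h + LOG + 1):
        if n in committed:
            decision[n] = committed[n]  # re-propose the committed request
        elif n in unprepared_quorum:
            decision[n] = NULL          # fill the gap with a null command
        # otherwise: wait for more VIEW-CHANGE information
    return decision
```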

Recovery Protocol
- Save state; reboot with correct code and restore state: the replica has correct code without losing its state
- Change keys for incoming messages: prevents the attacker from impersonating others
- Estimate the high water mark in a correct log (H): bounds the sequence numbers of bad information by H

Recovery Protocol
- Send recovery request r: others change their incoming keys when r executes, bounding the sequence numbers of forged messages by Hr (the high water mark in the log when r executes)
- Participate in the algorithm: may be needed to complete recovery if the replica is not faulty
- Check state and fetch out-of-date or corrupt items
- End: when checkpoint max(H, Hr) becomes stable, the replica has correct, up-to-date state
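
A high-level sketch of these steps at one replica; every method below is a hypothetical stub standing in for a mechanism described in the paper (secure co-processor reboot, key change, state transfer):

```python
# Sketch of proactive recovery at one replica (hypothetical stubs; each
# step mirrors the protocol outline above).
class RecoveringReplica:
    def __init__(self):
        self.high_water_mark = 0  # H, estimated from a correct log

    # --- hypothetical placeholders for mechanisms in the paper ---
    def save_and_restore_state(self): pass     # reboot with correct code
    def change_incoming_keys(self):   pass     # defeat impersonation
    def send_recovery_request(self):  return 0 # seqno Hr when r executes
    def fetch_state(self):            pass     # repair out-of-date items

    def recover(self) -> int:
        self.save_and_restore_state()       # correct code, state preserved
        self.change_incoming_keys()         # old keys become useless
        H = self.high_water_mark            # bounds bad info by H
        Hr = self.send_recovery_request()   # others change keys at Hr
        self.fetch_state()                  # bring state up to date
        # Recovery completes once checkpoint max(H, Hr) becomes stable.
        return max(H, Hr)
```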

State Transfer