The Byzantine Generals Problem. Leslie Lamport, Robert Shostak, and Marshall Pease. ACM TOPLAS 1982.
Practical Byzantine Fault Tolerance. Miguel Castro and Barbara Liskov. OSDI 1999.

Announcements
Dave is at HotNets; the TA (James Hendricks) gave this lecture.
Outline requested by 11/30
– Not graded, but less sympathy if you skip it.
– Goals: (1) ensure you pass, (2) help you cut down on the amount of effort spent.
– Feel free to give James drafts of your writeup at any time.

A definition
Byzantine: 1: of, relating to, or characteristic of the ancient city of Byzantium … 4b: intricately involved : labyrinthine
Lamport's reason for the name: "I have long felt that, because it was posed as a cute problem about philosophers seated around a table, Dijkstra's dining philosophers problem received much more attention than it deserves."

Byzantine Generals Problem
Concerned with (binary) atomic broadcast
– All correct nodes receive the same value
– If the broadcaster is correct, correct nodes receive the broadcast value
Can use broadcast to build consensus protocols (aka agreement)
– Consensus: think Byzantine fault-tolerant (BFT) Paxos

Synchronous, Byzantine world
[Figure: a 2x2 map of system models, synchronous vs. asynchronous on one axis and fail-stop vs. Byzantine on the other; the Byzantine Generals results sit in the synchronous, Byzantine quadrant.]

Cool note
Example Byzantine fault-tolerant system: the Seawolf submarine's control system.
Sims, J. T. Redundancy Management Software Services for Seawolf Ship Control System. In Proceedings of the 27th International Symposium on Fault-Tolerant Computing (FTCS '97), June 1997. IEEE Computer Society, Washington, DC, p. 390.
But it remains to be seen whether commodity distributed systems are willing to pay for so many replicas in a system.

First protocol: no crypto
Secure point-to-point links, but no crypto allowed
Protocol OM(m): recursive, exponential, all-to-all
– [Try to sketch protocol – see page 388; a coded sketch follows below]
– May be inefficient, but shows the 3f+1 bound is tight
– [Discuss: understand that this is for the synchronous setting without crypto!]
Need at least 3f+1 nodes to tolerate f faulty!
– See figures 1 and 2
– How to fix? Signatures (for example). Or hash commitments, one-time signatures, etc.
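
A minimal Python sketch of OM(m) in the spirit of page 388 is below. The traitor's lying strategy and the RETREAT default are illustrative assumptions; only the recursion and the majority vote follow the paper.

from collections import Counter

DEFAULT = "RETREAT"  # value used when there is no clear majority

def majority(values):
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return DEFAULT
    return counts[0][0]

def send(value, sender, receiver, traitors):
    # A traitor may send arbitrary values; this one lies based on the receiver's id.
    if sender in traitors:
        return "ATTACK" if receiver % 2 == 0 else "RETREAT"
    return value

def om(m, commander, lieutenants, value, traitors):
    """Return the value each lieutenant settles on after OM(m)."""
    received = {p: send(value, commander, p, traitors) for p in lieutenants}
    if m == 0:
        return received
    decisions = {}
    for p in lieutenants:
        # Each other lieutenant q relays what it heard via OM(m-1); p takes a majority.
        relayed = [om(m - 1, q, [r for r in lieutenants if r != q],
                      received[q], traitors)[p]
                   for q in lieutenants if q != p]
        decisions[p] = majority([received[p]] + relayed)
    return decisions

# n = 4, f = 1: one traitorous lieutenant cannot break agreement among the loyal ones.
print(om(1, commander=0, lieutenants=[1, 2, 3], value="ATTACK", traitors={3}))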

Second protocol: with crypto
Protocol SM(m)
– [Page 391, but can skip the protocol]
– Given signatures, do m rounds of signing what you think was said. Many messages (you don't need as many in the absence of faults).
– Shows agreement is possible for any number of faults tolerated
– [Discuss. Understand: synchronous, lots of messages, but possible.]
[Skip odd topologies. Note that a "signature" can be emulated for random (not malicious) faults.]
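
Here is a rough Python sketch of SM(m) in the same spirit, with "signatures" modeled as unforgeable sender chains rather than real crypto; the choice() rule and the traitorous commander's behavior are assumptions made for illustration.

from collections import deque

RETREAT = "RETREAT"

def choice(orders):
    # The deterministic rule all loyal lieutenants share: a unique order, else RETREAT.
    return next(iter(orders)) if len(orders) == 1 else RETREAT

def sm(m, n, commander_value, traitors, commander=0):
    lieutenants = [p for p in range(n) if p != commander]
    V = {p: set() for p in lieutenants}          # orders each lieutenant has accepted
    queue = deque()
    for p in lieutenants:                        # commander signs and sends its order;
        v = commander_value                      # a traitorous commander may send
        if commander in traitors:                # conflicting orders to different lieutenants
            v = "ATTACK" if p % 2 == 0 else RETREAT
        queue.append((p, v, (commander,)))       # (recipient, order, signature chain)
    while queue:
        p, v, chain = queue.popleft()
        if p in traitors or v in V[p]:           # traitorous lieutenants stay silent here
            continue
        V[p].add(v)
        if len(chain) <= m:                      # fewer than m lieutenant signatures so far:
            for q in lieutenants:                # countersign and forward to everyone
                if q != p and q not in chain:    # not already in the chain
                    queue.append((q, v, chain + (p,)))
    return {p: choice(V[p]) for p in lieutenants if p not in traitors}

# n = 4, traitorous commander: with signatures, all loyal lieutenants still agree.
print(sm(1, 4, "ATTACK", traitors={0}))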

Practical Byzantine Fault Tolerance: asynchronous, Byzantine
[Figure: the same synchronous/asynchronous vs. fail-stop/Byzantine map; PBFT targets the asynchronous, Byzantine quadrant.]

Practical Byzantine Fault Tolerance
Why async BFT?
BFT:
– Malicious attacks, software errors
– Need N-version programming?
– A faulty client can write garbage data, but can't make the system inconsistent (violate operational semantics)
Why async?
– A faulty network can violate timing assumptions
– But it can also prevent liveness
[For different liveness properties, see, e.g., Cachin, C., Kursawe, K., and Shoup, V. Random Oracles in Constantinople: Practical Asynchronous Byzantine Agreement Using Cryptography (extended abstract). In Proceedings of the Nineteenth Annual ACM Symposium on Principles of Distributed Computing (PODC '00), Portland, Oregon, July 2000. ACM, New York, NY.]

Distributed systems
Async BFT consensus: need 3f+1 nodes
– Sketch of proof: divide 3f nodes into three groups of f (left, middle, right), where the middle f are faulty. When left and middle talk, they must reach consensus (right may be crashed). Same for right and middle. A faulty middle can steer the two partitions to different values! [See Bracha, G. and Toueg, S. Asynchronous consensus and broadcast protocols. J. ACM 32, 4 (Oct. 1985).]
FLP impossibility: async consensus may not terminate
– Sketch of proof: the system starts in a "bivalent" state (it may still decide 0 or 1). At some point the system is one message away from deciding 0 or 1. If that message is delayed, another message may move the system away from deciding.
– Holds even when servers can only crash (not Byzantine)!
– Hence the protocol cannot always be live (but there exist randomized BFT variants that are probabilistically live). [See Fischer, M. J., Lynch, N. A., and Paterson, M. S. Impossibility of distributed consensus with one faulty process. J. ACM 32, 2 (Apr. 1985).]
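
The 3f+1 bound can also be made concrete with quorum arithmetic: in an asynchronous system a replica can only wait for n - f responses, and any two such quorums overlap in 2(n - f) - n replicas; that overlap is guaranteed to contain a correct replica only when it exceeds f, i.e., n >= 3f + 1. A tiny illustrative check (the helper name is made up):

def correct_in_overlap(n, f):
    quorum = n - f                  # the most replicas a node can safely wait for
    overlap = 2 * quorum - n        # minimum intersection of any two such quorums
    return overlap - f              # > 0: the intersection must contain a correct replica

for f in (1, 2, 3):
    for n in (3 * f, 3 * f + 1):
        ok = correct_in_overlap(n, f) > 0
        print(f"f={f}, n={n}: quorum overlaps guaranteed to contain a correct replica? {ok}")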

Aside: Linearizability
Linearizability (a "safety" condition) has two requirements:
– The result must be a valid sequential history
– If the completion of E1 precedes the invocation of E2 in reality, E1 must precede E2 in the history
Why it's nice: you can reason about the distributed system using a sequential specification
[Can give example on board if time; a small coded example follows below]
[See Herlihy, M. P. and Wing, J. M. Linearizability: a correctness condition for concurrent objects. ACM Trans. Program. Lang. Syst. 12, 3 (Jul. 1990).]
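
As a stand-in for the board example, here is a toy brute-force check of the two conditions on a three-operation register history; the representation and helper names are made up for illustration, not taken from Herlihy and Wing.

from itertools import permutations

# Each operation: (name, invocation time, response time, op, value).
history = [
    ("A", 0, 3, "write", 1),
    ("B", 1, 2, "write", 2),
    ("C", 4, 5, "read", 2),    # C is invoked after both writes complete
]

def legal(seq):
    # Valid sequential register history: every read returns the latest write.
    current = None
    for _, _, _, op, val in seq:
        if op == "write":
            current = val
        elif val != current:
            return False
    return True

def respects_real_time(seq):
    # If E1 completes before E2 is invoked, E1 must precede E2 in the history.
    pos = {e[0]: i for i, e in enumerate(seq)}
    return all(pos[a[0]] < pos[b[0]]
               for a in history for b in history if a[2] < b[1])

print(any(legal(p) and respects_real_time(p) for p in permutations(history)))
# True: write(1), write(2), read() -> 2 is a legal order that respects real time.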

Cryptography
Hash (aka message digest):
– Pre-image resistant: given hash(x), hard to find x
– Second pre-image resistant: given x, hard to find y ≠ x such that hash(x) = hash(y)
– Collision resistant: hard to find any x, y such that hash(x) = hash(y)
– Random oracle: the hash should behave like a random map (no structure)
– Assembly SHA-1 on a 3 GHz Pentium D: ~250 MB/s
– Brian Gladman, AMD64: SHA-1: 9.7, SHA-512: 13.4 (cycles/byte)
MACs: ~1 microsecond, > 400 MB/s (700 MB/s?)
Signatures: ~150 microseconds to 4 milliseconds
– 150 microseconds: ESIGN (Nippon Telegraph and Telephone)
– (Compare to Castro's 45 milliseconds on a Pentium Pro 200)
Castro & Liskov use 128-bit AdHash for checkpoints. Broken by Wagner in 2002 for keys less than 1600 bits. Moral of the story: beware new crypto primitives unless they reduce to older, more trusted primitives!
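
If you want to reproduce the hash/MAC part of these numbers on your own hardware, the standard library is enough (public-key signatures would need an extra library, so they are omitted); the buffer size and iteration count below are arbitrary choices.

import hashlib
import hmac
import time

buf = b"\x00" * (1 << 20)           # 1 MiB payload
key = b"k" * 32

def throughput(fn, iters=64):
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return iters * len(buf) / (time.perf_counter() - start) / 1e6   # MB/s

print("SHA-1    :", round(throughput(lambda: hashlib.sha1(buf).digest())), "MB/s")
print("SHA-512  :", round(throughput(lambda: hashlib.sha512(buf).digest())), "MB/s")
print("HMAC-SHA1:", round(throughput(lambda: hmac.new(key, buf, hashlib.sha1).digest())), "MB/s")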

Basic protocol
Plus the view change protocol and the checkpoint protocol.
Question: when is this not live?
Answer: during successive primary timeouts. (Compare to Q/U.)
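
A minimal sketch of the quorum bookkeeping a replica does in the normal case (pre-prepare, prepare, commit) follows; the field and method names are illustrative assumptions, digest/sender/primary checks are elided, and view change and checkpointing are omitted entirely.

from dataclasses import dataclass, field

@dataclass
class Replica:
    n: int                                   # total replicas, n = 3f + 1
    f: int                                   # faults tolerated
    log: dict = field(default_factory=dict)  # (view, seq) -> state for that slot

    def slot(self, view, seq):
        return self.log.setdefault((view, seq),
                                   {"pre_prepare": None, "prepare": set(), "commit": set()})

    def on_pre_prepare(self, view, seq, digest):
        # A real replica also checks that this came from the view's primary.
        self.slot(view, seq)["pre_prepare"] = digest

    def on_prepare(self, view, seq, digest, sender):
        self.slot(view, seq)["prepare"].add(sender)   # real check: digest matches pre-prepare

    def on_commit(self, view, seq, digest, sender):
        self.slot(view, seq)["commit"].add(sender)

    def prepared(self, view, seq):
        s = self.slot(view, seq)
        # Pre-prepare plus 2f matching prepares from distinct replicas.
        return s["pre_prepare"] is not None and len(s["prepare"]) >= 2 * self.f

    def committed_local(self, view, seq):
        # Prepared plus 2f+1 matching commits (possibly including its own).
        return self.prepared(view, seq) and len(self.slot(view, seq)["commit"]) >= 2 * self.f + 1

r = Replica(n=4, f=1)
r.on_pre_prepare(view=0, seq=1, digest="d")
for sender in (2, 3):
    r.on_prepare(0, 1, "d", sender)
for sender in (0, 2, 3):
    r.on_commit(0, 1, "d", sender)
print(r.prepared(0, 1), r.committed_local(0, 1))      # True True once quorums are reached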

Recent systems
[Show network patterns]
Q/U: 5f+1 replicas, 1 round trip (SOSP 2005)
HQ: 3f+1 replicas, 2 round trips (OSDI 2006)
Zyzzyva: 3f+1 replicas, 3 one-way latencies, but needs all 3f+1 responsive (SOSP 2007)

Evaluation
Only implemented the parts of the protocol that mattered for the evaluation/analysis
– Hint: not a bad idea for 712 projects!
The NFS loopback trick is pretty standard and a good idea for prototyping
Tolerates 1 fault; uses multicast.
BFT prototype: no disk writes
– NFS server: disk writes for some operations!
– Explanation: replication provides redundancy
– Is this a fair comparison? How about BFS vs. replicated NFS?