IS 698/800-01: Advanced Distributed Systems
Scalable Byzantine Fault Tolerance
Sisi Duan, Assistant Professor, Information Systems, sduan@umbc.edu

Outline
- The cost of scalability
- Available methods
- Steward
- Eyrie

The Cost of BFT/Permissioned Blockchains
- "Unfortunately, Byzantine agreement requires a number of messages quadratic in the number of participants, so it is infeasible for use in synchronizing a large number of replicas" (Pond: the OceanStore prototype)
- "Eventually batching cannot compensate for the quadratic number of messages of Practical Byzantine Fault Tolerance (PBFT)" (HQ Replication)
- "The communication overhead of Byzantine agreement is inherently large" (server-initiated agreement for general hierarchy wired/wireless networks)
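
To make the quadratic cost concrete, a back-of-the-envelope sketch (assumption: we count only the all-to-all PREPARE and COMMIT phases of a PBFT-style protocol and ignore batching and the client-facing messages):

```python
def pbft_messages(n: int) -> int:
    """Rough messages per agreement instance with n replicas, counting
    only the two all-to-all phases: 2 * n * (n - 1)."""
    return 2 * n * (n - 1)

for n in (4, 16, 64, 256):
    print(f"n={n:4d}  messages per instance ~ {pbft_messages(n):,}")
# n=4 -> 24 ... n=256 -> 130,560: this quadratic growth is why large
# replica groups are impractical without hierarchy or sharding.
```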

Available Techniques and Limits
- Denial-of-service attacks: a good node cannot afford to handle too many requests at a time
- Overlay network/hierarchy: who decides the roles and positions of the nodes?
- Amplifying randomness: select a small set of representative nodes that generate an agreed-upon, logarithmic-length string of mostly random bits
- Running elections: construct a small set of representatives

The Key to Scalable SMR
- Hierarchy: Steward. Amir, Yair, et al. "Scaling Byzantine fault-tolerant replication to wide area networks." DSN. IEEE, 2006.
- Partitions/sharding: Eyrie/Volery. Bezerra, Carlos Eduardo, Fernando Pedone, and Robbert van Renesse. "Scalable state-machine replication." DSN. IEEE, 2014.

Steward
- Hierarchy: multiple wide-area sites; each site acts as a group
- S sites, N nodes
- Clients can be located at any site
- Read requests can be performed locally
- Write requests must be totally ordered
- (Guess: what will it look like? What are the requirements?)

Steward Benefits
- Reduces wide-area message complexity from O(N^2) to O(S^2)
- Confines malicious replicas to their local site, enabling the use of a benign fault-tolerant algorithm over the WAN
- Read requests are performed locally
- Public keys of replicas need to be known only within their own site
How does the protocol work?
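
A hypothetical sizing example of that reduction (the figures N=80 and S=5 are assumed for illustration; in Steward only one threshold-signed message per site crosses the WAN at each step):

```python
N, S = 80, 5                # 80 replicas spread over 5 sites (assumed)

flat_bft = N * (N - 1)      # every replica talks to every other replica
steward  = S * (S - 1)      # one threshold-signed message per site pair

print(f"flat BFT wide-area messages ~ {flat_bft}")   # ~6320
print(f"Steward wide-area messages  ~ {steward}")    # ~20
```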

The Architecture
- Every site has a representative
- One leading site coordinates all the sites

The Normal Operations
- Client → local server → local site representative → leading site

The Normal Operations
- Leading site: runs the ASSIGN-SEQUENCE procedure
- Output: a Proposal, threshold-signed via THRESHOLD-SIGN
- The representative sends the output to the representatives of all the sites

The Normal Operations
Upon receiving a Proposal:
- Representative: forwards it to the servers in the site
- A server: generates an ACCEPT and THRESHOLD-SIGNs it
- Representative: combines the signature shares and sends the ACCEPT to the other sites
How many ACCEPT messages are good enough?

The Normal Operations
Upon receiving an ACCEPT from another site:
- Representative: forwards the ACCEPT to the local servers
- A server: commits when the Proposal plus the received ACCEPTs together cover a majority of sites
- Replies to the client
(See the sketch below.)
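
A minimal, non-cryptographic sketch of this normal-case flow. `threshold_sign` is a stub; authentication, retransmission, and the intra-site agreement inside ASSIGN-SEQUENCE are all elided, and the majority rule is simplified:

```python
from dataclasses import dataclass

S = 5                        # number of sites (assumed)
MAJORITY_SITES = S // 2 + 1  # 3 of 5

@dataclass
class Proposal:
    seq: int
    update: str
    site_sig: str            # threshold signature of the leading site

def threshold_sign(site_id: int, payload: str) -> str:
    """Stub for THRESHOLD-SIGN: in Steward, enough servers in a site
    combine partial signatures into one site-level signature."""
    return f"tsig(site={site_id},{payload})"

def assign_sequence(seq: int, update: str) -> Proposal:
    # ASSIGN-SEQUENCE runs inside the leading site (site 0 here); its
    # output is a Proposal carrying one threshold signature.
    return Proposal(seq, update, threshold_sign(0, f"{seq}:{update}"))

def site_accept(site_id: int, p: Proposal) -> str:
    # Each non-leading site threshold-signs an ACCEPT for the Proposal.
    return threshold_sign(site_id, f"ACCEPT:{p.seq}")

def can_commit(p: Proposal, accepts: list[str]) -> bool:
    # Proposal (leading site) + ACCEPTs must cover a majority of sites.
    return 1 + len(accepts) >= MAJORITY_SITES

p = assign_sequence(seq=1, update="write x=3")
accepts = [site_accept(s, p) for s in (1, 2)]  # two other sites answer
print(can_commit(p, accepts))                  # True: 3 of 5 sites
```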

A Few Concerns and Issues
- Which protocol should the leading site run?
- What is the THRESHOLD-SIGN configuration?
- What is the underlying failure model?

A Few Concerns and Issues
- What if the client does not get a response for a long time?
- What if the leading site fails?
- What if a site representative fails?

The Timeouts
- Local representative timer (T1): expires when no global progress takes place for a period of time
- Leading site representative timer (T2): for servers in the leading site only; T2 > (f+2) · max(T1)
- Leading site timer (T3): expires when no global progress is made; T3 = (f+3) · T2
- Client timer (T0): when it expires, the client broadcasts its request to all nodes
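
A small sketch of how the nested timers relate. The relations follow the slide; the base timeout T1 and the value of f are assumed placeholders:

```python
f = 1                      # Byzantine faults tolerated per site (assumed)
T1 = 2.0                   # local-representative timeout in seconds (assumed)

T2 = (f + 2) * T1 + 0.1    # must satisfy T2 > (f + 2) * max(T1); small slack
T3 = (f + 3) * T2          # leading-site timeout

print(f"T1={T1}s  T2={T2}s  T3={T3}s")   # T1=2.0s  T2=6.1s  T3=24.4s
# The nesting gives each inner protocol time to cycle through all
# candidate representatives before the outer timer fires.
```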

View Changes
- Local view change: change the site representative
- Global view change: change the leading site

View Changes
Construct collective state:
- Guarantees intra-site reconciliation so the site can safely make progress
- Generates a message reflecting the site's level of knowledge (used in a global view change)
Procedure:
- Site representative → all servers in the site: the sequence number (seq)
- Servers → representative: acknowledge with their full execution history
- Site representative → all: new view
- Servers: THRESHOLD-SIGN the message

View Changes
Local view change:
- The new representative invokes CONSTRUCT-COLLECTIVE-STATE
- It then invokes ASSIGN-SEQUENCE to replay all updates left pending by the view change
Global view change:
- After the leading site election, the representative of the new leading site generates a new-view message, threshold-signed by the site members
- It sends the message to all the site representatives
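
A schematic sketch of the collective-state construction. The merge here is a naive union of reports; the real protocol must gather histories from enough servers (e.g., 2f+1) and validate them to tolerate Byzantine reporters:

```python
def construct_collective_state(seq: int, histories: list[dict[int, str]]) -> dict[int, str]:
    """The new representative merges the execution histories reported by
    the servers in its site into one site-wide view up to seq."""
    merged: dict[int, str] = {}
    for h in histories:
        for s, update in h.items():
            if s <= seq:
                merged.setdefault(s, update)
    return merged

# Three servers report slightly different histories after a failure:
reports = [{1: "a", 2: "b"}, {1: "a"}, {1: "a", 2: "b", 3: "c"}]
print(construct_collective_state(seq=3, histories=reports))
# {1: 'a', 2: 'b', 3: 'c'} -- the view the site then THRESHOLD-SIGNs
# and uses when replaying pending updates.
```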

Evaluation
- Testbed: PlanetLab
- 5 sites, using up to 20 machines (3.2 GHz, 64-bit Intel Xeon)
- 16 machines for the leading site, 1 machine for each non-leading site

What Are the Issues?
- Safety
- Timers
- Failure model

Eyrie
Partition-based consensus (Bezerra, Pedone, and van Renesse, DSN 2014)
- S-SMR: Scalable State Machine Replication
- P partitions: P1, …, PP
- Application state V: each variable v in V must be assigned to at least one partition, part(v)
- Each partition Pi is replicated by the servers in group Si
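
A minimal sketch of the part(v) interface assumed by S-SMR. The CRC-based placement is an illustrative choice only; the model just requires that every variable is assigned to at least one partition:

```python
import zlib

P = 4                                     # number of partitions (assumed)

def part(v: str) -> int:
    """part(v): the partition responsible for variable v."""
    return zlib.crc32(v.encode()) % P

def involved_partitions(variables: list[str]) -> set[int]:
    """The partitions a command must be multicast to."""
    return {part(v) for v in variables}

print(involved_partitions(["x", "y"]))    # e.g. {1, 2}, depending on the hash
```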

Eyrie
To execute a command C:
- The client multicasts C to all partitions that hold variables accessed by C (assumption: the client knows which partitions these variables belong to)
- Upon receiving C, a server executes it if it holds all the variables in C
- Otherwise, it communicates with the servers in the other partitions to execute C
- An operation op is a read, a write, or a computation

The Procedure
- Linearizability: the effect of an operation is not reflected until the operation finishes; operations appear in a total order consistent with their real-time (completion) order

The Signal Process
For commands that involve more than one partition:
- Every partition is replicated, e.g., X is held by servers P1, P2, P4 and Y by P3, P5, P6
- When executing an operation, a server sends signal(C) to all the replicas involved in the command
- It waits until it receives signal(C) from at least one server in every other involved partition
- To tolerate f failures, how many replicas should each partition have? f+1 replicas

The More Concrete Procedure
- The client sends C to all the involved servers
- Upon receiving C, server s multicasts signal(C) to the others
- s buffers all incoming messages and waits for enough signal(C) messages
- s executes the command; the only operation that cannot always be done immediately and locally is read(v):
  - If variable v belongs to s's partition, s sends its value to the other involved nodes
  - Otherwise, s waits until the up-to-date value of v has been delivered
- Again, f+1 replicas per partition suffice (see the sketch below)
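
A toy sketch of one cross-partition execution. Atomic multicast, signaling, and value delivery are flattened into direct lookups, and the variable placement is assumed:

```python
# Toy layout: variable -> partition, and each partition's local store.
part_of = {"x": 0, "y": 1}                 # assumed placement, part(v)
state = [{"x": 5}, {"y": 7}]               # per-partition variable stores

def execute(cmd_vars: list[str], me: int) -> dict[str, int]:
    involved = {part_of[v] for v in cmd_vars}
    # signal(C): the real protocol waits for signal(C) from at least one
    # server in every other involved partition before executing C.
    pending_signals = involved - {me}      # modeled as already delivered
    # Local reads are immediate; each remote read waits for the owning
    # partition to deliver the up-to-date value (modeled as a lookup).
    return {v: state[part_of[v]][v] for v in cmd_vars}

print(execute(["x", "y"], me=0))           # {'x': 5, 'y': 7}
```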

Issues and Optimizations
- One result from one server in each partition is good enough
- We still need to maintain consistency among all the replicas (?)
Optimizations:
- Conservative caching: update the cache only after getting messages from the other servers
- Speculative caching: assume the cached value is up to date
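
An illustrative sketch of the two caching modes; the Cache class and its methods are hypothetical, not the paper's API:

```python
class Cache:
    """Hypothetical per-server cache of remote variables."""

    def __init__(self, speculative: bool):
        self.speculative = speculative
        self.values: dict[str, int] = {}

    def on_remote_update(self, v: str, value: int) -> None:
        # Conservative mode relies on this: the cache is refreshed only
        # from messages sent by the owning partition's servers.
        self.values[v] = value

    def read(self, v: str, fetch) -> int:
        if self.speculative and v in self.values:
            # Speculative mode: assume the cached value is up to date;
            # the command must be rolled back if validation later fails.
            return self.values[v]
        value = fetch(v)           # conservative path: wait for delivery
        self.values[v] = value
        return value

c = Cache(speculative=True)
c.on_remote_update("y", 7)
print(c.read("y", fetch=lambda v: 7))      # 7, served from the cache
```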

Hierarchy vs. Partition-Based SMR
- Number of nodes involved
  - Hierarchy: all the nodes still need to learn the results
  - Partition: only the nodes in the relevant partitions
- Total order of requests
  - Hierarchy: yes, and straightforward
  - Partition: only requests that might create conflicts are ordered against each other
- Bottleneck
  - Hierarchy: group communication
  - Partition: operations that involve multiple partitions