Prophecy: Using History for High-Throughput Fault Tolerance

Prophecy: Using History for High-Throughput Fault Tolerance Siddhartha Sen Joint work with Wyatt Lloyd and Mike Freedman Princeton University

Non-crash failures happen. It’s hard to model them; one way is to model them as Byzantine (malicious) faults and assume they occur independently across replicas.

Mask Byzantine faults: given this failure model, we try to mask failures so that clients of the service never see them.

Mask Byzantine faults: we replicate the service, and throughput suffers.

Mask Byzantine faults: the replicated service provides clients with linearizability (strong consistency), at a further cost in throughput.

Byzantine fault tolerance (BFT): low throughput; modifies clients; requires long-lived sessions.

Prophecy: high throughput + good consistency. No free lunch: it requires read-mostly workloads and slightly weakens consistency.

Byzantine fault tolerance (BFT): low throughput; modifies clients; long-lived sessions. Roadmap: D-Prophecy for traditional services, Prophecy for Internet services.

Traditional BFT reads: every replica executes the read against its application state, and the client checks that the responses agree.

A cache solution: put a cache in front of each replica’s application so reads can be answered without re-execution; the client still checks agreement.

A cache solution has problems: the cache is huge (it can grow as large as your disk) and invalidation is hard.

A compact cache: instead of caching full state, each replica records the requests it has seen and the responses it gave (req1 → resp1, req2 → resp2, req3 → resp3, …).

A compact cache: record sketches (hashes) instead of the full messages: sketch(req1) → sketch(resp1), sketch(req2) → sketch(resp2), sketch(req3) → sketch(resp3), …
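
To make the compact cache concrete, here is a minimal C++ sketch of the table this slide describes: fixed-size digests of requests mapped to digests of responses. The names (Sketch, CompactCache) and the use of std::hash as a stand-in digest are ours, not Prophecy's actual code.

```cpp
// Minimal sketch of a compact cache: store fixed-size digests of requests
// and responses instead of the full bytes.
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>

using Sketch = uint64_t;  // a real system would use a cryptographic digest

Sketch sketch(const std::string& bytes) {
    return std::hash<std::string>{}(bytes);  // stand-in for e.g. SHA-1
}

class CompactCache {
    std::unordered_map<Sketch, Sketch> table_;  // sketch(req) -> sketch(resp)
public:
    void record(const std::string& req, const std::string& resp) {
        table_[sketch(req)] = sketch(resp);
    }
    // Returns the cached response sketch for this request, if any.
    std::optional<Sketch> lookup(const std::string& req) const {
        auto it = table_.find(sketch(req));
        if (it == table_.end()) return std::nullopt;
        return it->second;
    }
};
```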

A sketcher: a small module at each replica maintains this sketch table. Sketches solve the big-cache problem; to solve the invalidation problem, we make sure to get the full response from one replica.

A sketcher: on a read, replicas can answer with either the full webpage or just a sketch of it.

Executing a read: one replica returns the full webpage while the others return only sketches; if the sketches agree with the full response, we get fast, load-balanced reads. (Suppose the responses don’t match: how could that happen?)

Executing a read: the service behind the sketchers can be any replicated state machine, such as a key-value store.

Executing a read: the sketchers observe requests and responses as they pass, so each replica maintains a fresh cache of sketches.
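
As a hedged illustration of the "Agree?" check on this read path: the client receives one full response plus sketches from the remaining replicas, and accepts the fast read only if every sketch matches the digest of the full response. A minimal sketch under our assumptions, reusing Sketch and sketch() from the block above:

```cpp
#include <string>
#include <vector>
// Reuses Sketch and sketch() from the compact-cache sketch above.

// Validate a fast read: `full` is the one full response; `sketches` are the
// digests returned by the other replicas. Accept only if all match; on any
// mismatch the client falls back to a full (slow) BFT read.
bool fast_read_agrees(const std::string& full,
                      const std::vector<Sketch>& sketches) {
    const Sketch want = sketch(full);
    for (Sketch s : sketches)
        if (s != want) return false;  // disagreement: retry via agreement
    return true;
}
```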

Did we achieve linearizability? NO!

Executing a read: the agreeing responses may all reflect state from before the latest write, so fast reads may be stale.

Load balancing: spread fast reads across the replicas; the probability that k successive reads are all stale is Pr(k stale) = g^k.
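
Reading the slide's formula concretely (the independence of successive attempts is our gloss, not stated on the slide): if any single load-balanced fast read is stale with probability g, then

```latex
\Pr[\text{$k$ consecutive fast reads are all stale}] = g^{k},
\qquad \text{e.g. } g = 0.1,\; k = 3 \;\Rightarrow\; g^{k} = 10^{-3}.
```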

D-Prophecy vs. BFT. Traditional BFT: each replica executes every read; linearizability. D-Prophecy (our name for this system): one replica executes each read; “delay-once” linearizability.

Byzantine fault tolerance (BFT): low throughput; modifies clients; long-lived sessions. We’ve solved the throughput problem for reads at the cost of slightly weakened consistency, but the other two problems remain, and they are bad for Internet services.

Key-exchange overhead (chart): for sessions containing only 8 requests, throughput is 1/10 of its maximum.

Internet services: applying the current design to Internet services leaves us with high per-session overhead and modified clients. We can solve both problems by introducing a proxy between the clients and the replica group.

A proxy solution: consolidate the per-replica sketchers into a single sketcher at a proxy between the clients and the replica group.

A proxy solution: the consolidated sketcher must be fail-stop, i.e., trusted not to be Byzantine.

A proxy solution: requiring a fail-stop sketcher is reasonable because we already trust middleboxes, and the sketcher is small and simple; our implementation required less than 3000 LOC.

Prophecy, executing a read: the trusted sketcher keeps a table mapping each request to the sketch of its response, s(q); a fast, load-balanced read checks a single replica’s answer against the cached sketch.

Prophecy: as in D-Prophecy, fast reads may be stale.
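
Putting the proxy's read path together, a minimal sketch under our reading of the design: try the fast path through one replica, and on a cache miss or sketch mismatch fall back to full agreement and refresh the table. The Fetch callbacks are hypothetical stand-ins for the real transport and PBFT client; CompactCache and sketch() come from the earlier block.

```cpp
#include <functional>
#include <string>
// Reuses Sketch, sketch(), and CompactCache from the earlier sketches.

using Fetch = std::function<std::string(const std::string&)>;

// Fast-read path at the trusted proxy (our reading of the design):
// try one replica and compare against the cached sketch; on a miss or
// mismatch, fall back to full BFT agreement and refresh the table.
std::string proxy_read(CompactCache& cache, const std::string& req,
                       const Fetch& read_one_replica,
                       const Fetch& bft_execute) {
    if (auto cached = cache.lookup(req)) {
        std::string resp = read_one_replica(req);
        if (sketch(resp) == *cached)
            return resp;  // fast path: only one replica executed the read
    }
    std::string resp = bft_execute(req);  // slow path: full agreement
    cache.record(req, resp);              // refresh the sketch table
    return resp;
}
```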

Delay-once linearizability: defined with no assumptions about the organization of system state or the consistency model of the replica group.

Delay-once linearizability: a read-after-write property over operation histories such as ⟨W, R, W, W, R, R, W, R⟩; again, no assumptions about the organization of system state or the consistency model of the replica group.

Example application: uploading embarrassing photos. (1) Remove colleagues from the ACL, (2) upload the photos, (3) refresh. Weak consistency may reorder these steps; under eventual consistency, updates may appear in different orders at different replicas. Delay-once linearizability preserves their order.

Byzantine fault tolerance (BFT): low throughput; modifies clients; long-lived sessions. D-Prophecy addresses the throughput problem; Prophecy addresses all three.

Implementation: modified PBFT, which is stable, complete, and competitive with Zyzzyva et al.; written in C++ with Tamer async I/O. Sketcher: 2000 LOC; PBFT library: 1140 LOC; PBFT client: 1000 LOC.

Evaluation: Prophecy vs. proxied-PBFT (proxied systems), and D-Prophecy vs. PBFT (non-proxied systems).

Evaluation of the proxied systems, Prophecy vs. proxied-PBFT. We will study: performance on “null” workloads; performance with a real replicated service; where the system bottlenecks and how to scale.

Basic setup: 100 clients, a concurrent sketcher, and a PBFT replica group.

Fraction of failed fast reads: a transition ratio of 1 means every read is preceded by a write; a ratio of 0 means reads are preceded only by other reads. The Alexa top sites show ratios below 15%.
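
For concreteness, one plausible reading of “transition ratio” consistent with this slide (our definition, not necessarily the paper's exact one): the fraction of reads whose immediately preceding operation was a write.

```cpp
#include <vector>

enum class Op { Read, Write };

// Transition ratio: fraction of reads immediately preceded by a write.
// Ratio 1.0 => every read follows a write; 0.0 => reads follow only reads.
double transition_ratio(const std::vector<Op>& trace) {
    int reads = 0, transitions = 0;
    for (size_t i = 1; i < trace.size(); ++i) {
        if (trace[i] != Op::Read) continue;
        ++reads;
        if (trace[i - 1] == Op::Write) ++transitions;
    }
    return reads ? static_cast<double>(transitions) / reads : 0.0;
}
```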

Small benefit on null reads

Apache webserver setup: clients → sketcher → replica group running Apache.

Large benefit on a real workload: 3.7x and 2.0x improvements on concurrent reads of a 1-byte webpage.

Benefit grows with work 94s (Apache) Null workloads are misleading!

Benefit grows with work

Single sketcher bottlenecks Concurrent reads of a 1-byte webpage

Scaling out: the first tier does load balancing, the second tier does consistent hashing; no more complex than how you’d scale out and partition a system today.
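
As a sketch of what the consistent-hashing tier could look like (illustrative, not Prophecy's code): each request key maps to the first sketcher clockwise on a hash ring, so adding or removing a sketcher remaps only a small fraction of keys.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Minimal consistent-hash ring for a second sketcher tier. Virtual nodes
// (vnodes) smooth the load across sketchers.
class HashRing {
    std::map<uint64_t, std::string> ring_;  // ring position -> sketcher id
    static uint64_t h(const std::string& s) {
        return std::hash<std::string>{}(s);
    }
public:
    void add(const std::string& sketcher, int vnodes = 64) {
        for (int i = 0; i < vnodes; ++i)
            ring_[h(sketcher + "#" + std::to_string(i))] = sketcher;
    }
    // The sketcher responsible for this request key: first node at or after
    // the key's hash, wrapping around the ring.
    const std::string& owner(const std::string& request_key) const {
        auto it = ring_.lower_bound(h(request_key));
        if (it == ring_.end()) it = ring_.begin();
        return it->second;
    }
};
```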

Scales linearly with replicas Concurrent reads of a 1-byte webpage

Summary: Prophecy is good for Internet services, giving fast, load-balanced reads; D-Prophecy is good for traditional services; Prophecy scales linearly while PBFT stays flat. Limitations: read-mostly workloads (our measurement study corroborates that these are common) and delay-once linearizability (still useful for many apps).

Thank You

Additional slides

Transitions: Prophecy is good for read-mostly workloads, but are transitions (writes between reads) rare in practice?

Measurement study: Alexa top sites, accessing each main page every 20 seconds for 24 hours.

Mostly static content

Mostly static content: only about 15% of accesses show a change.

Dynamic content: running Rabin fingerprinting on the transitions, 43% differ by a single contiguous change, where a “change” is an insertion, deletion, or substitution (i.e., edit distance 1). Sampling 4000 of these, over half were due to load-balancing directives or random IDs in links and function parameters.
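
The study's use of Rabin fingerprinting suggests comparing pages window by window; here is a minimal rolling-hash sketch in that spirit (constants and structure are ours; true Rabin fingerprints use an irreducible polynomial over GF(2)).

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hash every w-byte window of a page so two versions can be compared
// window-by-window to localize a single contiguous change. Arithmetic is
// mod 2^64 via unsigned wraparound; the base B is arbitrary.
std::vector<uint64_t> window_fingerprints(const std::string& s, size_t w) {
    const uint64_t B = 257;
    std::vector<uint64_t> out;
    if (w == 0 || s.size() < w) return out;
    uint64_t pow = 1;  // B^(w-1), used to drop the oldest byte
    for (size_t i = 0; i + 1 < w; ++i) pow *= B;
    uint64_t h = 0;
    for (size_t i = 0; i < w; ++i)
        h = h * B + static_cast<unsigned char>(s[i]);
    out.push_back(h);
    for (size_t i = w; i < s.size(); ++i) {
        h -= pow * static_cast<unsigned char>(s[i - w]);  // drop oldest byte
        h = h * B + static_cast<unsigned char>(s[i]);     // add newest byte
        out.push_back(h);
    }
    return out;
}
```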