Prophecy: Using History for High-Throughput Fault Tolerance Siddhartha Sen Joint work with Wyatt Lloyd and Mike Freedman Princeton University
Non-crash failures happen Model as Byzantine (malicious) It’s hard to model non-crash failures. One way we can model them… Independence
Mask Byzantine faults Service Clients Given this failure model, we then try to mask the failure so the client doesn’t see it. Clients Service
Mask Byzantine faults Throughput Clients Replicated service
Mask Byzantine faults Linearizability (strong consistency) Clients Throughput Linearizability (strong consistency) Clients Replicated service
Byzantine fault tolerance (BFT) Low throughput Modifies clients Long-lived sessions
Prophecy High throughput + good consistency No free lunch: Read-mostly workloads Slightly weakened consistency
Byzantine fault tolerance (BFT) Low throughput Modifies clients Long-lived sessions D-Prophecy Prophecy
Traditional BFT reads application Agree? … Clients Replica Group
A cache solution cache application Agree? … Clients Replica Group
A cache solution … Problems: Huge cache Invalidation Clients application Problems: Huge cache Invalidation Agree? … Your cache becomes as large as your disk Clients Replica Group
A compact cache … Clients Replica Group Requests Responses req1 resp1 application Requests Responses req1 resp1 req2 resp2 req3 resp3 … … Clients … Replica Group
A compact cache … Clients Replica Group Requests Responses Requests application Requests Responses Requests Responses sketch(req1) sketch(resp1) sketch(req2) sketch(resp2) sketch(req3) sketch(resp3) … … Clients … Replica Group
A sketcher … Clients Replica Group sketcher application That solves the big cache problem. To solve the invalidation problem, we will just make sure to get the full response from one replica. Clients Replica Group
A sketcher sketch webpage ………… ………… … Clients ………… Replica Group
Fast, load-balanced reads Executing a read sketch webpage ………… Agree? Fast, load-balanced reads ………… ………… … END: Suppose the responses don’t match… how could this happen? Clients ………… Replica Group
Executing a read … Clients Replica Group sketch webpage Agree? …………
Executing a read … key-value store replicated state machine Clients sketch webpage key-value store ………… replicated state machine ………… … Clients ………… Replica Group
Executing a read … Maintain a fresh cache Clients Replica Group sketch webpage Agree? ………… Maintain a fresh cache ………… ………… … Clients ………… Replica Group
Did we achieve linearizability? NO!
Executing a read … Clients Replica Group sketch webpage ………… ………… …………
Executing a read … Clients Replica Group sketch webpage Agree? …………
Executing a read … Fast reads may be stale Clients Replica Group sketch webpage ………… Agree? Fast reads may be stale ………… ………… … Clients ………… Replica Group
Load balancing … Pr(k stale) = gk Clients Replica Group sketch webpage ………… ………… Agree? Pr(k stale) = gk ………… … Clients ………… Replica Group
D-Prophecy vs. BFT … Traditional BFT: D-Prophecy: Each replica executes read Linearizability D-Prophecy: One replica executes read “Delay-once” linearizability Clients Replica Group … We call this system D-Prophecy
Byzantine fault tolerance (BFT) Low throughput Modifies clients Long-lived sessions D-Prophecy Prophecy We’ve solved the throughput problem for read requests at the cost of slightly weakened consistency But we still have two other problems, and these are bad for internet services
Key-exchange overhead 11% 3% Sessions containing only 8 requests, throughput is 1/10 of its maximum
Internet services … Clients Replica Group So if we apply our current design to internet services, we are faced with the problems of high per-session overhead, and modified clients. We can solve both of these problems by introducing a proxy in between the client and replica group … Clients Replica Group
Consolidate sketchers A proxy solution Consolidate sketchers Proxy Sketcher … Clients Replica Group
Sketcher must be fail-stop A proxy solution Sketcher must be fail-stop Sketcher … Clients Trusted Replica Group
A proxy solution … Trust middlebox already Small and simple Clients Sketcher must be fail-stop Trust middlebox already Small and simple Sketcher Our implementation required less than 3000 LOC … Clients Trusted Replica Group
Fast, load-balanced reads Prophecy Executing a read Fast, load-balanced reads ………… ………… ………… ………… q Sketcher ………… … Clients Trusted Req Resp s(q) ………… Replica Group
Prophecy … Fast reads may be stale Clients Replica Group Sketcher Req ………… ………… ………… ………… ………… ………… Sketcher … Clients Trusted Req Resp s(q) ………… ………… Replica Group
Delay-once linearizability No assumptions about organization of system state or consistency model of replica group
Delay-once linearizability Read-after-write property W, R, W, W, R, R, W, R No assumptions about organization of system state or consistency model of replica group
Delay-once linearizability Read-after-write property W, R, W, W, R, R, W, R No assumptions about organization of system state or consistency model of replica group
Example application Upload embarrassing photos Weak may reorder 1. Remove colleagues from ACL 2. Upload photos 3. (Refresh) Weak may reorder Delay-once preserves order Eventual consistency: updates may appear in different orders at different replicas
Byzantine fault tolerance (BFT) Low throughput Modifies clients Long-lived sessions D-Prophecy Prophecy
Implementation Modified PBFT C++, Tamer async I/O PBFT is stable, complete Competitive with Zyzzyva et. al. C++, Tamer async I/O Sketcher: 2000 LOC PBFT library: 1140 LOC PBFT client: 1000 LOC
Evaluation Prophecy vs. proxied-PBFT D-Prophecy vs. PBFT Proxied systems D-Prophecy vs. PBFT Non-proxied systems
Evaluation Prophecy vs. proxied-PBFT We will study: Proxied systems Performance on “null” workloads Performance with real replicated service Where system bottlenecks, how to scale
Basic setup Sketcher (concurrent) Clients (100) Replica Group (PBFT)
Fraction of failed fast reads Alexa top sites: < 15% Fraction of failed fast reads If you have a transition ratio of 1, then every read is being preceded by a write. If you have a transition ratio of 0, then reads are only preceded by other reads.
Small benefit on null reads
Apache webserver setup Sketcher Clients Replica Group
Large benefit on real workload 3.7x 2.0x Concurrent reads of a 1-byte webpage
Benefit grows with work 94s (Apache) Null workloads are misleading!
Benefit grows with work
Single sketcher bottlenecks Concurrent reads of a 1-byte webpage
Scaling out First tier does load balancing, second tier does consistent hashing… no more complex than how you’d scale-out and partition a system today
Scales linearly with replicas Concurrent reads of a 1-byte webpage
Summary Prophecy good for Internet services Fast, load-balanced reads D-Prophecy good for traditional services Prophecy scales linearly while PBFT stays flat Limitations: Read-mostly workloads (meas. study corroborates) Delay-once linearizability (useful for many apps)
Thank You
Additional slides
Transitions Prophecy good for read-mostly workloads Are transitions rare in practice?
Measurement study Alexa top sites Access main page every 20 sec for 24 hrs
Mostly static content
Mostly static content 15%
Dynamic content Rabin fingerprinting on transitions 43% differ by single contiguous change Sampled 4000 of them, over half due to: Load balancing directives Random IDs in links, function parameters “change” = insertion/deletion/substitution (i.e. edit distance 1)