1
I Can’t Believe It’s Not Causal
I Can’t Believe It’s Not Causal! Scalable Causal Consistency with No Slowdown Cascades
Syed Akbar Mehdi (UT Austin), Cody Littley (UT Austin), Natacha Crooks (UT Austin), Lorenzo Alvisi (UT Austin, Cornell University), Nathan Bronson (Facebook), Wyatt Lloyd (USC)
2
Causal Consistency: Great In Theory
Causal consistency sits between eventual consistency (higher performance) and strong consistency (stronger guarantees). Lots of exciting research has built scalable causal data stores, e.g., COPS [SOSP '11], Bolt-On [SIGMOD '13], ChainReaction [EuroSys '13], Eiger [NSDI '13], Orbe [SoCC '13], GentleRain [SoCC '14], Cure [ICDCS '16], TARDiS [SIGMOD '16].
3
Causal Consistency: But In Practice …
The middle child of consistency models. In reality, the largest web applications use eventual consistency, e.g., Espresso (LinkedIn), TAO (Facebook), and Manhattan (Twitter).
4
Key Hurdle: Slowdown Cascades
[Diagram: two panels contrasting the implicit assumption of current causal systems with the reality at scale]
5
Key Hurdle: Slowdown Cascades
[Diagram: the same contrast, annotated: to enforce consistency, current causal systems make writes wait, an implicit assumption that breaks down at scale]
6
Replicated and sharded storage for a social network
[Diagram: Datacenter A and Datacenter B] Replicated and sharded storage for a social network.
7
Writes causally ordered as W1 → W2 → W3
[Diagram: Read(W1), then W2 and W3, across Datacenter A and Datacenter B] The writes are causally ordered as W1 → W2 → W3.
8
Current causal systems enforce consistency as a datastore invariant
[Diagram: before applying W2 at Datacenter B, the system checks whether W1 has been applied and buffers W2 until it has; W3 is buffered behind W2 in the same way] Current causal systems enforce consistency as a datastore invariant.
9
[Diagram: W1 is delayed at Datacenter B, so W2 and W3 stay buffered behind it] Slowdown cascades affect all previous causal systems because they enforce consistency inside the data store. Alice's advisor unnecessarily waits for Justin Bieber's update despite not reading it.
10
Slowdown Cascades in Eiger (NSDI ‘13)
Replicated write buffers grow arbitrarily because Eiger enforces consistency inside the datastore
11
OCCULT: Observable Causal Consistency Using Lossy Timestamps
12
Observable Causal Consistency
Causal consistency guarantees that each client observes a monotonically non-decreasing set of updates (including its own) in an order that respects potential causality between operations. Key idea: don't implement a causally consistent data store; let clients observe a causally consistent data store.
13
How do clients observe a causally consistent datastore?
[Diagram: sharded storage replicated across Datacenter A and Datacenter B] How do clients observe a causally consistent datastore?
14
[Diagram: each shard has a master in one datacenter and a slave in the other] Writes are accepted only by master shards and then replicated asynchronously, in order, to slaves.
15
[Diagram: master/slave shard pairs with shardstamps 7, 4, and 8] Each shard keeps track of a shardstamp, which counts the writes it has applied.
16
[Diagram: shards with shardstamps 7, 4, and 8; each client holds a causal timestamp, e.g., Client 1 holds (4, 3, 2)] Causal timestamp: a vector of shardstamps, one per shard, which identifies a global state across all shards.
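To make the vector concrete, here is a minimal Python sketch of a causal timestamp and the element-wise-max merge the following slides rely on; the names and the three-shard example are illustrative, not taken from Occult's implementation.

```python
from typing import List

# A causal timestamp holds one shardstamp per shard (three shards here for illustration).
CausalTimestamp = List[int]

def merge(a: CausalTimestamp, b: CausalTimestamp) -> CausalTimestamp:
    """Element-wise max: the result reflects everything either timestamp has observed."""
    return [max(x, y) for x, y in zip(a, b)]

# A client that has observed state (7, 4, 2) and later state (6, 2, 5):
assert merge([7, 4, 2], [6, 2, 5]) == [7, 4, 5]
```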
17
[Diagram: Client 1 issues w(a) with its causal timestamp (4, 3, 2) attached to the write] Write protocol: causal timestamps are stored with objects to propagate dependencies.
18
[Diagram: the master's shardstamp is incremented from 7 to 8 and merged into the causal timestamps of a and of Client 1, which become (8, 3, 2)] Write protocol: the server's shardstamp is incremented and merged into the causal timestamps.
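A sketch of the master-side write path as the slides describe it, using the numbers from the example; the class and method names are assumptions made for illustration (Occult is built by modifying Redis Cluster, not this code), and the comment notes where the system clock could replace the simple increment.

```python
from typing import Dict, List, Tuple

def merge(a: List[int], b: List[int]) -> List[int]:
    return [max(x, y) for x, y in zip(a, b)]

class MasterShard:
    """A master shard: accepts writes and counts them with its shardstamp."""
    def __init__(self, shard_id: int):
        self.shard_id = shard_id
        self.shardstamp = 0
        # key -> (value, causal timestamp stored with the object)
        self.store: Dict[str, Tuple[object, List[int]]] = {}

    def write(self, key: str, value, client_ts: List[int]) -> List[int]:
        # Advance the shardstamp (a later slide notes the system clock can be used instead).
        self.shardstamp += 1
        # Merge it into the write's causal timestamp and store it with the object.
        deps = list(client_ts)
        deps[self.shard_id] = max(deps[self.shard_id], self.shardstamp)
        self.store[key] = (value, deps)
        return deps  # the client merges this into its own causal timestamp

# Client 1 holds (4, 3, 2) and writes a to shard 0, whose shardstamp is 7:
shard0 = MasterShard(shard_id=0)
shard0.shardstamp = 7
client_ts = merge([4, 3, 2], shard0.write("a", "v1", [4, 3, 2]))
assert shard0.store["a"][1] == [8, 3, 2] and client_ts == [8, 3, 2]
```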
19
Read Protocol: Always safe to read from master
[Diagram: a client issues r(a) to a's master shard] Read protocol: it is always safe to read from the master.
20
[Diagram: after r(a), a's causal timestamp (8, 3, 2) is merged into the reading client's causal timestamp] Read protocol: the object's causal timestamp is merged (element-wise max) into the client's causal timestamp.
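The client side of a master read is a one-line merge; the sketch below stubs the data store as a dictionary of (value, causal timestamp) pairs, and the specific numbers are illustrative.

```python
from typing import Dict, List, Tuple

def merge(a: List[int], b: List[int]) -> List[int]:
    return [max(x, y) for x, y in zip(a, b)]

def read_from_master(store: Dict[str, Tuple[object, List[int]]],
                     key: str, client_ts: List[int]):
    """Reading from the master is always safe: it has applied every write it accepted.
    The object's causal timestamp is folded into the client's, so anything the client
    does afterwards causally depends on this read."""
    value, obj_ts = store[key]
    return value, merge(client_ts, obj_ts)

# A client holding (6, 2, 5) reads a, which is stored with timestamp (8, 3, 2):
master_store = {"a": ("v1", [8, 3, 2])}
value, client_ts = read_from_master(master_store, "a", [6, 2, 5])
assert client_ts == [8, 3, 5]
```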
21
[Diagram: the client that read a then issues w(b); b is stored with causal timestamp (8, 5, 5), capturing the dependency on a] Read protocol: causal timestamp merging tracks causal ordering for writes that follow reads.
22
[Diagram: a and b replicate asynchronously to the slaves in Datacenter B] Replication: like eventual consistency; asynchronous and unordered, with writes applied immediately.
23
[Diagram: a's replication is delayed; b arrives and is applied immediately, and its slave sets its shardstamp from b's causal timestamp] Replication: slaves advance their shardstamps using the causal timestamp of a replicated write.
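A sketch of the slave-side replication handler under the assumptions above: a replicated write is applied as soon as it arrives, and the slave's shardstamp is advanced from the write's causal timestamp entry for its own shard. The class name is illustrative.

```python
from typing import Dict, List, Tuple

class SlaveShard:
    def __init__(self, shard_id: int):
        self.shard_id = shard_id
        self.shardstamp = 0
        self.store: Dict[str, Tuple[object, List[int]]] = {}

    def apply_replicated_write(self, key: str, value, deps: List[int]) -> None:
        # Applied immediately, in arrival order: no buffering behind missing
        # dependencies, hence no slowdown cascade.
        self.store[key] = (value, deps)
        # Advance the shardstamp using the write's entry for this shard.
        self.shardstamp = max(self.shardstamp, deps[self.shard_id])

# b (timestamp (8, 5, 5)) reaches the slave of shard 1 even though an earlier
# write to shard 0 is still delayed; the slave applies it right away.
slave1 = SlaveShard(shard_id=1)
slave1.apply_replicated_write("b", "v1", [8, 5, 5])
assert slave1.shardstamp == 5
```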
24
Read Protocol: Clients do consistency check when reading from slaves
[Diagram: the slave returns its shardstamp with the value, and the client compares it (≥ ?) against its own causal timestamp entry for that shard] Read protocol: clients do a consistency check when reading from slaves.
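The consistency check itself is a single comparison, done by the client with the shardstamp the slave returns alongside the value. A sketch (the function name is made up for illustration):

```python
from typing import List

def slave_read_is_consistent(slave_shardstamp: int,
                             client_ts: List[int], shard_id: int) -> bool:
    """Accept a slave read only if the slave has applied at least every write to
    this shard that the client has already (transitively) observed."""
    return slave_shardstamp >= client_ts[shard_id]

# A client holding (8, 5, 5): reading shard 1 from a slave at shardstamp 5 is fine,
# but reading shard 0 from a slave stuck at 7 is stale (7 >= 8 fails).
assert slave_read_is_consistent(5, [8, 5, 5], shard_id=1)
assert not slave_read_is_consistent(7, [8, 5, 5], shard_id=0)
```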
25
b’s dependencies are delayed, but we can read it anyway!
[Diagram: the client reads b from its local slave; the check passes (5 ≥ 5) even though b's dependency a has not yet reached a's slave] b's dependencies are delayed, but we can read it anyway!
26
Read Protocol: Clients do consistency check when reading from slaves
[Diagram: the client then reads a from the delayed slave; the check fails (7 ≥ 8 is false), so the shard is detected as stale] Read protocol: clients do a consistency check when reading from slaves.
27
Read Protocol: Resolving stale reads
[Diagram: the stale read of a must be resolved] Read protocol: resolving stale reads. Options: retry locally, or read from the master.
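Putting the check and the resolution together: retry against the local slave a few times, then fall back to the master, which is always safe but pays a cross-datacenter round trip. The retry count and backoff below are illustrative parameters, not values from the paper.

```python
import time
from typing import Dict, List, Tuple

def merge(a: List[int], b: List[int]) -> List[int]:
    return [max(x, y) for x, y in zip(a, b)]

class Shard:
    """Minimal stand-in for a master or slave shard."""
    def __init__(self, shardstamp: int, store: Dict[str, Tuple[object, List[int]]]):
        self.shardstamp, self.store = shardstamp, store

def read(key: str, shard_id: int, client_ts: List[int], slave: Shard, master: Shard,
         max_retries: int = 3, backoff_s: float = 0.005):
    for _ in range(max_retries):
        value, obj_ts = slave.store[key]
        if slave.shardstamp >= client_ts[shard_id]:   # consistency check
            return value, merge(client_ts, obj_ts)
        time.sleep(backoff_s)                          # let replication catch up, then retry
    value, obj_ts = master.store[key]                  # fall back to the master
    return value, merge(client_ts, obj_ts)

# The local slave of shard 0 is stale (shardstamp 7 < required 8), so the read
# falls back to the master and returns the fresh value.
stale_slave = Shard(7, {"a": ("old", [7, 3, 2])})
master = Shard(8, {"a": ("new", [8, 3, 2])})
value, ts = read("a", 0, [8, 5, 5], stale_slave, master, max_retries=1, backoff_s=0)
assert value == "new" and ts == [8, 5, 5]
```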
28
Causal Timestamp Compression
What happens at scale when the number of shards is, say, 100,000? Does the causal timestamp need 100,000 entries (e.g., (400, 234, 23, 87, 9, 102, 78, ...))?
29
Causal Timestamp Compression: Strawman
To compress down to n entries, conflate shardstamps whose shard ids are equal modulo n. Example: (1000, 89, 13, 209) compresses to (1000, 209). Problem: false dependencies.
30
Causal Timestamp Compression: Strawman
To compress down to n entries, conflate shardstamps whose shard ids are equal modulo n. Example: (1000, 89, 13, 209) compresses to (1000, 209). Problem: false dependencies. Solution: use the system clock as the next value of a shardstamp on a write, which decouples the shardstamp's value from the number of writes on each shard.
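A sketch of both ideas: the modulo compression of the strawman, and a clock-based shardstamp bump that keeps conflated entries close to each other in value. Helper names are illustrative.

```python
import time
from typing import List

def compress_modulo(full_ts: List[int], n: int) -> List[int]:
    """Conflate shardstamps whose shard ids are equal modulo n: slot i of the result
    covers shards i, i + n, i + 2n, ... and takes their maximum."""
    compressed = [0] * n
    for shard_id, stamp in enumerate(full_ts):
        compressed[shard_id % n] = max(compressed[shard_id % n], stamp)
    return compressed

assert compress_modulo([1000, 89, 13, 209], n=2) == [1000, 209]

def next_shardstamp(current: int) -> int:
    """On a write, jump to the system clock if it is ahead rather than just adding 1,
    so a shardstamp's value no longer depends on how many writes its shard has seen."""
    return max(current + 1, int(time.time()))
```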
31
Causal Timestamp Compression: Strawman
To compress from N down to n entries, conflate shardstamps whose shard ids are equal modulo n. Example: (1000, 89, 13, 209) compresses to (1000, 209). Problem: modulo arithmetic still conflates unrelated shardstamps.
32
Causal Timestamp Compression
Insight: recent shardstamps are the most likely to create false dependencies. Use high resolution for recent shardstamps and conflate the rest into a catch-all shardstamp. Example compressed causal timestamp:
Shard IDs:   45    89    34    402   123   * (catch-all)
Shardstamps: 4000  3989  3880  3873  3723  3678
33
Causal Timestamp Compression
Insight: recent shardstamps are the most likely to create false dependencies, so use high resolution for recent shardstamps and conflate the rest into a catch-all shardstamp. Result: about 0.01% false dependencies with just 4 shardstamps and 16K logical shards.
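One way to realize this "recent shards explicit, everything else conflated" structure is an n-entry timestamp that keeps the n-1 highest shardstamps tagged with their shard ids and folds the rest into a catch-all. The sketch below is an illustrative data structure, not Occult's exact representation.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class CompressedTimestamp:
    n: int                                                   # total entries, including the catch-all
    explicit: Dict[int, int] = field(default_factory=dict)   # shard id -> shardstamp
    catch_all: int = 0                                        # covers every shard not listed explicitly

    def get(self, shard_id: int) -> int:
        return self.explicit.get(shard_id, self.catch_all)

    def update(self, shard_id: int, shardstamp: int) -> None:
        self.explicit[shard_id] = max(self.get(shard_id), shardstamp)
        # Keep only the n-1 most recent shardstamps explicit; demote the oldest
        # into the catch-all, which only ever grows.
        while len(self.explicit) > self.n - 1:
            oldest = min(self.explicit, key=self.explicit.get)
            self.catch_all = max(self.catch_all, self.explicit.pop(oldest))

# With n = 4, tracking updates from many shards keeps the 3 most recent explicit.
ts = CompressedTimestamp(n=4)
for shard_id, stamp in [(123, 3723), (402, 3873), (34, 3880), (89, 3989), (45, 4000)]:
    ts.update(shard_id, stamp)
assert sorted(ts.explicit.values()) == [3880, 3989, 4000] and ts.catch_all == 3873
```

Reads of a shard with no explicit entry are checked against the catch-all, which is conservative: it can cause false dependencies, but never missed ones.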
34
Transactions in OCCULT
Scalable, causally consistent, general-purpose transactions
35
Properties of Transactions
(A) Atomicity. (B) Reads from a causally consistent snapshot. (C) No concurrent conflicting writes.
36
Properties of Transactions
(A) Observable atomicity. (B) Observably read from a causally consistent snapshot. (C) No concurrent conflicting writes.
37
Properties of Transactions
(A) Observable atomicity. (B) Observably read from a causally consistent snapshot. (C) No concurrent conflicting writes.
Properties of the protocol: no centralized timestamp authority (e.g., per datacenter); transactions are ordered using causal timestamps.
38
Properties of Transactions
(A) Observable atomicity. (B) Observably read from a causally consistent snapshot. (C) No concurrent conflicting writes.
Properties of the protocol: no centralized timestamp authority (e.g., per datacenter); transactions are ordered using causal timestamps; transaction commit latency is independent of the number of replicas.
39
Properties of Transactions
(A) Observable atomicity. (B) Observably read from a causally consistent snapshot. (C) No concurrent conflicting writes.
Three-phase protocol: 1. Read phase: buffer writes at the client. 2. Validation phase: the client validates A, B, and C using causal timestamps. 3. Commit phase: buffered writes are committed in an observably atomic way.
40
Alice and her advisor are managing lists of students for three courses
[Diagram: three sharded objects replicated across Datacenter A and Datacenter B: a = [], b = [Bob], c = [Cal]] Alice and her advisor are managing lists of students for three courses.
41
[Diagram: the same initial state, a = [], b = [Bob], c = [Cal], in both datacenters] Observable atomicity and causally consistent snapshot reads are enforced by the same mechanism.
42
Transaction T1: Alice adding Abe to course a
[Diagram: Start T1; r(a) = []; w(a = [Abe])] Transaction T1: Alice adds Abe to course a.
43
Transaction T1: After Commit
[Diagram: after Commit T1, a = [Abe] is stored at its master with an updated causal timestamp; the slave in Datacenter B still holds a = []] Transaction T1 after commit.
44
Transaction T2: Alice moving Bob from course b to course c
[Diagram: Start T2; r(b) = [Bob]; r(c) = [Cal]; w(b = []); w(c = [Bob, Cal])] Transaction T2: Alice moves Bob from course b to course c.
45
Atomicity through causality: Make writes dependent on each other
[Diagram: T2's buffered writes to b and c are about to commit] Observable atomicity: make the transaction's writes causally dependent on each other.
46
[Diagram: both of T2's writes receive the same commit timestamp] Observable atomicity: the same commit timestamp makes the writes causally dependent on each other.
47
[Diagram: after Commit T2, b = [] and c = [Bob, Cal] are stored at their masters, both carrying the same commit timestamp] Observable atomicity: the same commit timestamp makes the writes causally dependent on each other.
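A simplified sketch of the commit phase's key trick: compute one commit timestamp that dominates the client's causal timestamp and every written shard's new shardstamp, and store it with all of the transaction's writes. It elides the real protocol's coordination and failure handling; names are illustrative.

```python
from typing import Dict, List, Tuple

class MasterShard:
    def __init__(self):
        self.shardstamp = 0
        self.store: Dict[str, Tuple[object, List[int]]] = {}

def commit(buffered_writes: Dict[str, Tuple[int, object]],   # key -> (shard id, value)
           shards: Dict[int, MasterShard],
           client_ts: List[int]) -> List[int]:
    """Apply every buffered write with the same commit timestamp, which makes the
    writes causally dependent on one another (observable atomicity)."""
    commit_ts = list(client_ts)
    for shard_id, _ in buffered_writes.values():
        shards[shard_id].shardstamp += 1
        commit_ts[shard_id] = max(commit_ts[shard_id], shards[shard_id].shardstamp)
    for key, (shard_id, value) in buffered_writes.items():
        shards[shard_id].store[key] = (value, list(commit_ts))
    return commit_ts  # merged into the client's causal timestamp afterwards

# T2 writes b (shard 1) and c (shard 2); both are stored with the same timestamp,
# so a reader that observes c will notice if b's shard is still behind.
shards = {1: MasterShard(), 2: MasterShard()}
shards[1].shardstamp = shards[2].shardstamp = 1
ts = commit({"b": (1, []), "c": (2, ["Bob", "Cal"])}, shards, client_ts=[1, 1, 1])
assert shards[1].store["b"][1] == shards[2].store["c"][1] == ts == [1, 2, 2]
```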
48
Transaction writes replicate asynchronously
[Diagram: T2's write to c = [Bob, Cal] has replicated to Datacenter B] Transaction writes replicate asynchronously.
49
Transaction writes replicate asynchronously
[Diagram: replication of a = [Abe] and b = [] to Datacenter B is delayed, while c = [Bob, Cal] has already been applied there] Transaction writes replicate asynchronously.
50
Alice’s advisor reads the lists in a transaction
[Diagram: with a and b still delayed in Datacenter B, Alice's advisor starts transaction T3 there] Alice's advisor reads the lists in a transaction.
51
Alice’s advisor reads the lists in a transaction
[Diagram: T3 reads b from the local slave and, since b = [] has not yet replicated, gets the old value: r(b) = [Bob]] Alice's advisor reads the lists in a transaction.
52
[Diagram: T3's read set now holds b = [Bob] together with its causal timestamp] Transactions maintain a read set to validate atomicity and causally consistent snapshot reads.
53
[Diagram: T3 then reads c from the local slave and sees the new value, r(c) = [Bob, Cal]; the read set now holds b = [Bob] and c = [Bob, Cal] with their causal timestamps] Transactions maintain a read set to validate atomicity and reads from a causal snapshot.
54
[Diagram: validation compares the causal timestamps in T3's read set] Validation failure: c knows of more writes from the grey shard than had been applied at the time b was read.
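The failing check can be written directly over the read set, assuming each entry records the shard the object came from, the shardstamp that shard had at read time, and the object's causal timestamp; this is a simplification of the paper's validation, with illustrative names and numbers.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ReadSetEntry:
    shard_id: int             # shard the object was read from
    shardstamp_at_read: int   # that shard's shardstamp at the time of the read
    deps: List[int]           # the object's causal timestamp

def snapshot_is_consistent(read_set: Dict[str, ReadSetEntry]) -> bool:
    """Fail if any object read 'knows' of more writes to some shard than had been
    applied when another object was read from that shard."""
    return all(x.deps[y.shard_id] <= y.shardstamp_at_read
               for x in read_set.values() for y in read_set.values())

# T3 read b = [Bob] from shard 1 when only 1 write had been applied there, then read
# c = [Bob, Cal], whose timestamp says shard 1 has 2 writes: not a consistent snapshot.
t3_reads = {
    "b": ReadSetEntry(shard_id=1, shardstamp_at_read=1, deps=[1, 1, 1]),
    "c": ReadSetEntry(shard_id=2, shardstamp_at_read=2, deps=[1, 2, 2]),
}
assert not snapshot_is_consistent(t3_reads)
```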
55
Ordering Violation: Detected in the usual way. The red shard is stale!
[Diagram: T3 also reads a from the delayed local slave and gets r(a) = [], the stale value] Ordering violation: detected in the usual way; the red shard is stale.
56
Properties of Transactions
(A) Observable atomicity. (B) Observably read from a causally consistent snapshot. (C) No concurrent conflicting writes.
Three-phase protocol: 1. Read phase: buffer writes at the client. 2. Validation phase: the client validates A, B, and C using causal timestamps. 3. Commit phase: buffered writes are committed in an observably atomic way.
57
Properties of Transactions
(A) Observable atomicity. (B) Observably read from a causally consistent snapshot. (C) No concurrent conflicting writes.
Three-phase protocol: 1. Read phase: buffer writes at the client. 2. Validation phase: validate the read set to verify A and B; validate the overwrite set to verify C. 3. Commit phase: buffered writes are committed in an observably atomic way.
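A structural sketch of the validation phase as listed above: the read-set check (as in the earlier sketch) covers A and B, and an overwrite-set check covers C. The conflict check shown here, comparing the overwritten object's current causal timestamp with the one the transaction read, is a simplified OCC-style stand-in rather than the paper's exact rule.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ReadSetEntry:
    shard_id: int
    shardstamp_at_read: int
    deps: List[int]

def snapshot_is_consistent(read_set: Dict[str, ReadSetEntry]) -> bool:
    # Verifies A (observable atomicity) and B (causally consistent snapshot).
    return all(x.deps[y.shard_id] <= y.shardstamp_at_read
               for x in read_set.values() for y in read_set.values())

def no_conflicting_writes(overwrite_set: Dict[str, List[int]],
                          current_deps: Dict[str, List[int]]) -> bool:
    # Verifies C (simplified): each object about to be overwritten must still be
    # the version the transaction read, i.e. its causal timestamp is unchanged.
    return all(current_deps[key] == deps for key, deps in overwrite_set.items())

def validate(read_set: Dict[str, ReadSetEntry],
             overwrite_set: Dict[str, List[int]],
             current_deps: Dict[str, List[int]]) -> bool:
    return (snapshot_is_consistent(read_set)
            and no_conflicting_writes(overwrite_set, current_deps))

# An object whose timestamp is unchanged passes C; one bumped concurrently fails.
assert no_conflicting_writes({"b": [1, 1, 1]}, {"b": [1, 1, 1]})
assert not no_conflicting_writes({"b": [1, 1, 1]}, {"b": [1, 2, 2]})
```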
58
Evaluation
59
Evaluation Setup: Occult is implemented by modifying Redis Cluster, which also serves as the baseline. Evaluated on CloudLab with two datacenters (Wisconsin and South Carolina), 20 server machines (4 server processes per machine), and 16K logical shards. YCSB is the benchmark; the graphs shown here use a read-heavy (95% reads) workload with a Zipfian key distribution.
60
We show the cost of providing consistency guarantees.
61
Goodput Comparison
62
4 shardstamps per causal timestamp
[Graph: goodput comparison against the Redis Cluster baseline, with 4 shardstamps per causal timestamp; annotated differences of 8.7%, 31%, and 39.6%]
63
Effect of slow nodes on Occult Latency
64
Conclusions: Enforcing causal consistency in the data store is vulnerable to slowdown cascades. It is sufficient to ensure that clients observe causal consistency: lossy timestamps provide the guarantee while avoiding slowdown cascades. Observable enforcement can be extended to causally consistent transactions: making writes causally dependent on each other lets clients observe atomicity, and it also avoids slowdown cascades.