Transactional storage for geo-replicated systems
Yair Sovran, Russell Power, Marcos K. Aguilera, Jinyang Li
NYU and MSR SVC

Life in a web startup
Web apps need geo-replicated storage
[Figure: geo-replicated transactional storage]

Consistency vs. performance: existing tradeoffs
– Eventual consistency: less coordination, more anomalies
– Serializability: more coordination, fewer anomalies
– Snapshot isolation: a point between the two
Goals: maximize multi-site performance, have few anomalies

Our contribution
1. New semantics: Parallel Snapshot Isolation (PSI)
2. Walter: implementing PSI efficiently
   – Preferred site
   – Counting set
3. Application experience

Snapshot isolation
[Figure: timeline of storage state with transactions T1 and T2; operations shown: Read-X, Write-X, Commit, Read-Y, Write-Y, Commit]
Snapshot isolation's guarantees
1. Read snapshots from global timeline
2. Prohibit write-write conflict
3. Preserve causality

PSI avoids global transaction ordering
[Figure: Site1 and Site2 each keep their own timeline; operations shown for T1 and T2: Read-X, Write-X, Commit, Read-Y, Write-Y, Commit]
A transaction commits locally first, then propagates to remote sites. Walter achieves this efficiently.
Parallel snapshot isolation's guarantees
1. Read snapshots from per-site timeline (snapshot isolation: global timeline)
2. Prohibit write-write conflict
3. Preserve causality

PSI has few anomalies

Anomaly              Serializability  Snapshot Isolation  PSI  Eventual
dirty read           No               No                  No   Yes
non-repeatable read  No               No                  No   Yes
lost update          No               No                  No   Yes
short fork           No               Yes                 Yes  Yes
long fork            No               No                  Yes  Yes
conflicting fork     No               No                  No   Yes

PSI's anomaly
– Short fork (allowed by snapshot isolation): T1 and T2; T1 commits, T2 commits
– Long fork (disallowed by snapshot isolation): T1 and T2; T1 commits, T2 commits; T1 and T2 propagate to both sites

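A long fork can be illustrated with a toy model. The sketch below is not Walter's implementation: the `Site` class and its methods are hypothetical names, and each site is reduced to a dictionary plus the order in which it applied transactions. It shows T1 committing at one site and T2 at the other, with each site's readers seeing only the local transaction until both propagate.

```python
class Site:
    """Hypothetical stand-in for one site: local state plus local apply order."""

    def __init__(self, name):
        self.name = name
        self.state = {}      # object id -> value
        self.applied = []    # transaction ids, in the order this site applied them

    def apply(self, txid, writes):
        # Apply a committed transaction's writes to this site's timeline.
        self.state.update(writes)
        self.applied.append(txid)

    def snapshot(self):
        # A reader at this site sees the site's current local state.
        return dict(self.state)


site1, site2 = Site("Site1"), Site("Site2")

# T1 and T2 write different objects, so there is no write-write conflict.
t1 = ("T1", {"X": 1})
t2 = ("T2", {"Y": 1})

# Each transaction commits locally first.
site1.apply(*t1)
site2.apply(*t2)

# Readers before propagation: Site1 sees T1 but not T2, Site2 sees T2 but not T1.
view1 = site1.snapshot()
view2 = site2.snapshot()

# Later, both transactions propagate to the remote site.
site1.apply(*t2)
site2.apply(*t1)
```

The two sites converge to the same state but applied T1 and T2 in opposite orders, and the two intermediate views cannot both come from a single global timeline. That is the long fork that snapshot isolation disallows and PSI permits.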
Walter overview
[Figure: clients (C) at Site1 and Site2 issue Start_TX, Read, Write, and Commit_TX; the sites replicate data and coordinate for PSI]
Main challenge: avoid write-write conflict across sites
Walter's solution
1. Preferred site
2. Counting set

Technique #1: preferred site
Associate each user's data with a preferred site
– Common case: write at preferred site → fast commit
– Rare case: write at non-preferred site → cross-site 2-phase commit (slow commit)
[Figure: Alice at Site1 and Bob at Site2, with Alice's photos and Bob's photos present at both sites; a write at the preferred site is a fast commit, a write at the other site is a slow commit]

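The choice between the two commit paths can be sketched as a simple check. This is an illustration only, assuming a hypothetical `PREFERRED_SITE` table and a `WriteTx` record; neither name comes from Walter, and the actual commit protocols are not modeled.

```python
from dataclasses import dataclass


@dataclass
class WriteTx:
    site: str        # site where the transaction executes
    objects: list    # ids of the objects the transaction writes


# Hypothetical mapping from object id to its preferred site.
PREFERRED_SITE = {
    "alice/photos": "Site1",
    "bob/photos": "Site2",
}


def commit_path(tx):
    """Return 'fast' if every written object is preferred at tx.site, else 'slow'."""
    if all(PREFERRED_SITE[obj] == tx.site for obj in tx.objects):
        return "fast"    # common case: commit at the preferred site
    return "slow"        # rare case: cross-site 2-phase commit


alice_local = commit_path(WriteTx("Site1", ["alice/photos"]))
alice_remote = commit_path(WriteTx("Site1", ["bob/photos"]))
mixed = commit_path(WriteTx("Site1", ["alice/photos", "bob/photos"]))
```

A write touching even one object whose preferred site is elsewhere takes the slow path in this sketch, which is why assigning each user's data to the site that user normally writes from keeps the common case fast.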
Technique #2: counting set
Problem: some objects are modified from many sites
Counting set: a data type free of write-write conflict
[Figure: Alice at Site 1 and Bob at Site 2 both befriend Eve, each writing to the copy of Eve's friendlist at their own site]

Technique #2: counting set
[Figure: Alice at Site1 and Bob at Site2 both befriend Eve; one site applies add(Alice) and the other add(Bob) to Eve's friendlist, and both copies end up with Alice 1, Bob 1]
Add/del operations commute, so there is no need to check for write-write conflict
Caveat: application developers must deal with counts

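A minimal sketch of a counting set makes the commutativity concrete. The class and method names below are hypothetical, not Walter's API. The sketch also assumes that a count may drop to zero or below when a delete is applied before the matching add, which is one way the "deal with counts" caveat can arise.

```python
from collections import Counter


class CountingSet:
    """Hypothetical counting set: maps each element to an integer count."""

    def __init__(self):
        self.counts = Counter()

    def add(self, elem):
        self.counts[elem] += 1

    def delete(self, elem):
        # Assumption: counts are allowed to go negative.
        self.counts[elem] -= 1

    def count(self, elem):
        return self.counts[elem]

    def members(self):
        # One possible application policy: present means count > 0.
        return {e for e, c in self.counts.items() if c > 0}


def apply_ops(ops):
    """Apply a sequence of ('add' | 'del', element) operations to a fresh set."""
    cs = CountingSet()
    for op, elem in ops:
        (cs.add if op == "add" else cs.delete)(elem)
    return cs


# The same two operations arrive in different orders at the two sites.
site1 = apply_ops([("add", "Alice"), ("add", "Bob")])
site2 = apply_ops([("add", "Bob"), ("add", "Alice")])

# A delete that arrives before its add leaves the count at 0, not an error.
reordered = apply_ops([("del", "Eve"), ("add", "Eve")])
```

Because each operation is an increment or a decrement, the final counts depend only on which operations were applied, not on their order, so both sites end with Alice 1 and Bob 1.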
Site failure
Two options to handle a site failure
– Conservative: block writes whose preferred site failed
– Aggressive: re-assign preferred site elsewhere
Warning: committed but not-yet-replicated transactions may be lost

Application #1: WaltSocial
Wall and Friendlist are counting sets
[Screenshot: a WaltSocial wall with sample posts]
Befriend transaction
  A ← read Alice's profile
  B ← read Bob's profile
  Add A.uid to B.friendlist
  Add B.uid to A.friendlist
  Add "Alice is now friends with Bob" to A.wall
  Add "Bob is now friends with Alice" to B.wall

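The befriend transaction can be mirrored step by step in a small in-memory sketch. Everything here is an assumption for illustration: profiles are plain dictionaries, the friendlist and wall counting sets are modeled with `collections.Counter`, and transactional atomicity is not modeled at all.

```python
from collections import Counter


def new_profile(uid, name):
    # Hypothetical profile record; friendlist and wall act as counting sets.
    return {"uid": uid, "name": name, "friendlist": Counter(), "wall": Counter()}


def befriend(store, a_key, b_key):
    """Follow the befriend steps; a real system would run them in one transaction."""
    a = store[a_key]                      # A <- read Alice's profile
    b = store[b_key]                      # B <- read Bob's profile
    b["friendlist"][a["uid"]] += 1        # add A.uid to B.friendlist
    a["friendlist"][b["uid"]] += 1        # add B.uid to A.friendlist
    a["wall"][f'{a["name"]} is now friends with {b["name"]}'] += 1
    b["wall"][f'{b["name"]} is now friends with {a["name"]}'] += 1


store = {
    "alice": new_profile(1, "Alice"),
    "bob": new_profile(2, "Bob"),
}
befriend(store, "alice", "bob")
```

Every update in the sketch is an add to a counting set, so none of them can raise a write-write conflict regardless of which site the transaction runs at.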
Application #2: Twitter clone
Third-party app in PHP
Our port: switch storage backend from Redis to Walter
Each user's timeline is a counting set
Post-status transaction
  write status to new object O
  foreach f in user's followers
    add O to f's timeline_cset

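The post-status fan-out can be sketched the same way. The function below is a hypothetical illustration using plain dictionaries and `collections.Counter` for each timeline counting set; the id scheme and parameter names are assumptions, and it does not reflect the PHP application's code or Walter's API.

```python
import itertools
from collections import Counter

# Hypothetical id generator for new status objects.
_next_id = itertools.count(1)


def post_status(objects, followers, timelines, user, text):
    """Write the status to a new object, then add it to each follower's timeline."""
    oid = f"status:{next(_next_id)}"
    objects[oid] = {"author": user, "text": text}       # write status to new object O
    for f in followers.get(user, ()):                    # foreach f in user's followers
        timelines.setdefault(f, Counter())[oid] += 1     # add O to f's timeline_cset
    return oid


objects = {}
followers = {"alice": ["bob", "carol"]}
timelines = {}
oid = post_status(objects, followers, timelines, "alice", "hello")
```

The status itself is a brand-new object, and every timeline update is a counting-set add, so under these assumptions the transaction writes nothing that another site could be writing concurrently.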
Evaluation
Walter prototype
– Implemented in C++ with PHP binding
– Custom RPC library with Protocol Buffers
Testbed: Amazon EC2
– Extra-large instance
– Up to 4 sites (Virginia, California, Ireland, Singapore)
Full replication across sites

Walter scales
Read/write a 100-byte object
The reads' working set fits in memory
[Figure: Read and Write panels]

WaltSocial achieves low latency
A post-on-wall transaction reads 2 objects, writes 2 objects, and updates 2 counting sets

Walter lets ReTwis scale to >1 sites
[Figure: Read Timeline, Post status, and Follow user operations, comparing Redis, Walter (1-site), and Walter (2-site)]

Related work
Cloud storage systems
– Single-site: Bigtable, Sinfonia, Percolator
– No/limited transactions: Dynamo, COPS, PNUTS
– Synchronous replication: Megastore, Scatter
Replicated database systems
– Eager vs. lazy replication
– Escrow transactions: for numeric data
Conflict-free replicated data types
– Inspired counting sets

Conclusion
PSI is a good tradeoff for geo-replicated storage
– Allows fast commit with asynchronous replication
– Prohibits write-write conflict and preserves causality
Walter realizes PSI efficiently
– Preferred site
– Conflict-free counting set