When Scalability Meets Consistency: Genuine Multiversion Update-Serializable Partial Data Replication
Sebastiano Peluso, Pedro Ruivo, Paolo Romano, Francesco Quaglia and Luís Rodrigues
1st Euro-TM Workshop on Distributed Transactional Memory (WDTM 2012), Lisbon, Portugal
Talk Structure
- Motivation and related work
- The GMU protocol
- Experimental results
Motivation and related work
Distributed STMs
STMs are being employed in new scenarios:
- database caches in three-tier web apps (FénixEDU)
- HPC programming languages (X10)
- in-memory cloud data grids (Coherence, Infinispan)
New challenges: scalability and fault-tolerance, addressed via REPLICATION.
Full Replication
All sites store the whole data set. Full replication in transactional systems is a well-investigated problem.
Several solutions in the DBMS world:
- update anywhere-anytime-anyway solutions [SIGMOD96]
- deferred-update replication techniques [JDPD03, VLDB00]
- lazy techniques that relax consistency properties [SOSP07]
Solutions specific to DSTMs:
- efficient encoding of the read-set [PRDC09]
- communication/computation overlapping [NCA10]
- lease-based commits [Middleware10]
Partial Replication
Partial replication is a way to increase scalability: each site stores only a partial copy of the data.
Genuine partial replication schemes maximize scalability by ensuring that only the sites replicating a data item read or written by a transaction T exchange messages to execute/commit T.
Existing 1-Copy Serializable implementations enforce distributed validation of read-only transactions [SRDS10], which introduces considerable overheads in typical workloads.
Objectives
- Partially replicated DSTM
- Scalability and performance as first-class targets
- Find a sweet spot in the consistency/performance trade-off
Requirements
- Read-only transactions never abort or block
- Genuine certification mechanism
Issues with Partial Replication
Extending existing local multiversion (MV) STMs is not enough: local MV STMs rely on a single global counter to track version advancement.
Problem: the commit of a transaction would involve ALL nodes.
No genuineness = poor scalability.
GMU: Genuine Multiversion Update-serializable replication [ICDCS12]
Key concepts
- In the execution/commit phase of a transaction T, ONLY the nodes that store data items accessed by T are involved.
- Multiple versions are kept for each data item.
- Visible snapshots (the freshest consistent snapshots) are built taking into account:
  1. causal dependencies on transactions already committed when the transaction began;
  2. previous reads executed by the same transaction.
- Vector clocks are used to establish visible snapshots.
Main data structures (i)
For each node N:
- VCLog: sequence of vector clocks of “recently” committed transactions on N
- PrepareVC: a vector clock greater than or equal to the most recent vector clock in VCLog
Main data structures (ii)
For each transaction T:
- VC: a vector clock initialized with the most recent vector clock in the local VCLog, then updated:
  - upon reads during execution, to ensure that T observes the most recent serializable snapshot;
  - at commit time, to assign the final vector clock to the transaction (and to its write-set).
Main data structures (iii)
A chain of versions is kept per data item. For an item id, each version stores a value, a version number VN, and a pointer to the previous version, e.g. (newest first):
  (value: 2, VN: 8) -> (value: 1, VN: 5) -> (value: 0, VN: 2)
When a transaction T commits on node i, the new version's VN is entry i of T's commit vector clock.
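The three data structures above can be sketched as follows. This is a minimal illustration with assumed names (Version, Node, commit_version are not from the authors' code), showing how a committed version's VN is drawn from the committing transaction's vector clock entry for the local node.

```python
class Version:
    """One version in an item's chain: value, version number, link to older version."""
    def __init__(self, value, vn, previous=None):
        self.value, self.vn, self.previous = value, vn, previous

class Node:
    def __init__(self, node_id, n_nodes):
        self.node_id = node_id
        self.vclog = [[0] * n_nodes]      # VCs of recently committed transactions
        self.prepare_vc = [0] * n_nodes   # >= most recent VC in vclog
        self.store = {}                   # item id -> newest Version

    def commit_version(self, item, value, commit_vc):
        # Prepend the new version; its VN is the local entry of the commit VC.
        self.store[item] = Version(value, commit_vc[self.node_id],
                                   self.store.get(item))
        self.vclog.append(list(commit_vc))

# Rebuild the version chain from the slide on node 1 of a 3-node system:
n = Node(node_id=1, n_nodes=3)
n.commit_version("id", 0, [0, 2, 0])
n.commit_version("id", 1, [0, 5, 1])
n.commit_version("id", 2, [1, 8, 2])
assert (n.store["id"].value, n.store["id"].vn) == (2, 8)
assert n.store["id"].previous.vn == 5
```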
T reads id on node i: Rule 1
Informally: avoid reading remotely “too old” versions.
Formally: if this is the first read of T on node i, wait until VCLog_i.mostRecVC[i] >= T.VC[i].
This ensures that causal dependencies are enforced.
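Rule 1 boils down to a wait condition on node i's commit log. A sketch of that check (function name illustrative; the real implementation blocks rather than polling):

```python
def rule1_can_read(most_recent_vc_i, t_vc, i):
    """True once node i's freshest committed VC has caught up with
    transaction T's view of node i, so T's causal past is visible."""
    return most_recent_vc_i[i] >= t_vc[i]

# Node 1's freshest committed VC is (1,2,2); T started with VC (1,1,1):
assert rule1_can_read([1, 2, 2], [1, 1, 1], 1)       # 2 >= 1: read proceeds
# If T already depends on entry 3 for node 1, the read must wait:
assert not rule1_can_read([1, 2, 2], [1, 3, 1], 1)
```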
Rule 1 in action
Example (3 nodes; node 1 stores X, node 2 stores Y; all start with VC (1,1,1)):
- T0 writes X and Y, and commits with VC (1,2,2), producing versions X(2) and Y(2).
- T1 reads X on node 1 after T0's commit, so T1.VC becomes (1,2,2) and T1 observes X(2).
- When T1 then reads Y on node 2, Rule 1 makes it wait until node 2's most recent VC in VCLog reaches (1,2,2), so T1 observes Y(2) and not an older, causally inconsistent version.
T reads id on node i: Rule 2
Informally: maximize freshness by moving T's VC ahead in time “as much as possible” in the commit log.
Formally: if this is the first read of T on node i, select the most recent VC in i's commit log s.t. VC[j] <= T.VC[j] for each node j on which T has already read, then set T.VC = MAX{VC, T.VC}.
Note: this updates only the entries of T.VC for the nodes from which T has not read yet.
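The selection-and-merge step of Rule 2 can be sketched as below (names assumed; the log is scanned newest-first and merged entry-wise). The numbers reuse the "Rule 2 in action" example: T has already read on nodes 0 and 1 with VC (1,20,1), and node 2's log holds (1,1,11) and, newer, (1,21,21).

```python
def rule2_update_vc(vclog, t_vc, already_read):
    """Pick the most recent VC in this node's commit log that does not
    exceed T.VC on any node T already read from, then merge it into T.VC
    with an entry-wise max. Falls back to T.VC unchanged if none fits."""
    for vc in reversed(vclog):               # vclog ordered oldest -> newest
        if all(vc[j] <= t_vc[j] for j in already_read):
            return [max(a, b) for a, b in zip(vc, t_vc)]
    return list(t_vc)

vclog_node2 = [[1, 1, 11], [1, 21, 21]]
# (1,21,21) is rejected (21 > 20 on node 1); (1,1,11) is merged in:
assert rule2_update_vc(vclog_node2, [1, 20, 1], [0, 1]) == [1, 20, 11]
```

Only the entry for node 2 moved forward (1 to 11); the entries for nodes T already read from are untouched, matching the note above.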
Rule 2 in action
Example (node 1 stores X, node 2 stores Y; Y(11) was committed with VC (1,1,11)):
- T0 reads X(20) on node 1, so T0.VC becomes (1,20,1).
- Concurrently, T1 writes X and Y and commits with VC (1,21,21), producing X(21) and Y(21).
- When T0 then reads Y on node 2, it selects the most recent VC in node 2's VCLog compatible with its previous reads: (1,21,21) is too new (21 > 20 on node 1), so T0 picks (1,1,11) and merges it: T0.VC = (1,20,11).
- T0 therefore reads Y(11), consistent with its earlier read of X(20).
T reads id on node i: Rule 3
Informally: observe the most recent consistent version of id, given T's history (previous reads).
Formally: iterate over the versions of id and return the most recent one s.t. id.version.VN <= T.VC[i].
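Rule 3 is a walk down the version chain. A sketch (illustrative names), using the chain from the data-structures slide, (value 2, VN 8) -> (value 1, VN 5) -> (value 0, VN 2) on node 1:

```python
def rule3_select_version(versions, t_vc, i):
    """versions: (value, vn) pairs ordered newest -> oldest.
    Return the newest value whose VN does not exceed T.VC[i]."""
    for value, vn in versions:
        if vn <= t_vc[i]:
            return value
    raise LookupError("no visible version")

chain = [(2, 8), (1, 5), (0, 2)]
assert rule3_select_version(chain, [0, 6, 0], 1) == 1   # VN 8 too new, VN 5 <= 6
assert rule3_select_version(chain, [0, 9, 0], 1) == 2   # VN 8 <= 9
```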
Committing read-only transactions
Read-only transactions commit locally:
- no additional validation,
- no possibility of aborts,
- and they are never blocked, as in typical multiversion schemes.
Committing update transactions
Update transactions run 2PC.
Upon receiving the prepare message (participant side, node i):
- acquire read and write locks
- validate the read-set
- increase PrepareVC[i] and send PrepareVC back
If all replies are positive (coordinator side):
- build a commit vector clock
- broadcast the commit message
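The participant-side prepare step can be sketched as below. This is a hedged simplification (names assumed): lock acquisition and read-set validation are stubbed into a single flag, and the reply carries the proposed PrepareVC.

```python
def on_prepare(node_id, prepare_vc, read_set_valid):
    """Participant-side prepare: on success, bump the local PrepareVC
    entry and reply with the proposed vector clock; otherwise vote abort."""
    if not read_set_valid:          # lock/validation failure
        return None                 # vote abort
    prepare_vc[node_id] += 1        # propose the next local version number
    return list(prepare_vc)         # PrepareVC sent back to the coordinator

vc = [3, 7, 2]
assert on_prepare(1, vc, True) == [3, 8, 2]   # successful prepare on node 1
assert on_prepare(1, vc, False) is None       # failed validation -> abort vote
```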
Building the commit Vector Clock
A variant of Skeen's algorithm is used [SKEEN85]. It keeps track of the causal dependencies developed by:
- a transaction T during its execution,
- the most recently committed transactions at the nodes contacted by T.
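One plausible sketch of the coordinator-side step, assuming (as in Skeen-style agreement) that the commit vector clock is the entry-wise maximum of the participants' PrepareVC proposals; the exact construction in GMU may differ:

```python
def build_commit_vc(proposals):
    """Combine participants' proposed PrepareVCs into one commit VC
    by taking the entry-wise maximum."""
    return [max(entries) for entries in zip(*proposals)]

# Two participants propose (3,8,2) and (3,7,5); the commit VC dominates both:
assert build_commit_vc([[3, 8, 2], [3, 7, 5]]) == [3, 8, 5]
```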
Consistency criterion
GMU ensures Extended Update Serializability (EUS).
Update Serializability (US) ensures:
- 1-Copy Serializability (1CS) on the history restricted to committed update transactions;
- 1CS on the history restricted to committed update transactions plus any single read-only transaction;
- but it can admit non-1CS histories containing at least two read-only transactions.
Extended Update Serializability additionally guarantees US to executing transactions, analogously to opacity in STMs.
Experimental Results
Experiments on a private cluster
8-core physical nodes, TPC-C benchmark:
- 90% read-only transactions, 10% update transactions
- 4 threads per node
- moderate contention (15% abort rate at 20 nodes)
FutureGrid Experiments
All nodes are 2-core VMs deployed in the same site, TPC-C benchmark:
- 90% read-only transactions, 10% update transactions
- 1 thread per node
- low/moderate contention, even at 40 nodes
Thanks for your attention
References
[ICDCS12] Sebastiano Peluso, Pedro Ruivo, Paolo Romano, Francesco Quaglia, Luís Rodrigues. “When Scalability Meets Consistency: Genuine Multiversion Update-Serializable Partial Replication”. Proc. of the 32nd IEEE International Conference on Distributed Computing Systems, June 2012.
[JDPD03] Fernando Pedone, Rachid Guerraoui, André Schiper. “The Database State Machine Approach”. Journal of Distributed and Parallel Databases, vol. 14, issue 1, 71-98, July 2003.
[Middleware10] Nuno Carvalho, Paolo Romano, Luís Rodrigues. “Asynchronous lease-based replication of software transactional memory”. Proc. of the 11th ACM/IFIP/USENIX International Conference on Middleware, 2010.
[NCA10] Roberto Palmieri, Francesco Quaglia, Paolo Romano. “AGGRO: Boosting STM Replication via Aggressively Optimistic Transaction Processing”. Proc. of the 9th IEEE International Symposium on Network Computing and Applications, 20-27, 2010.
[PRDC09] Maria Couceiro, Paolo Romano, Nuno Carvalho, Luís Rodrigues. “D2STM: Dependable Distributed Software Transactional Memory”. Proc. of the 15th IEEE Pacific Rim International Symposium on Dependable Computing, 2009.
[SIGMOD96] Jim Gray, Pat Helland, Patrick O’Neil, Dennis Shasha. “The Dangers of Replication and a Solution”. Proc. of the 1996 ACM SIGMOD International Conference on Management of Data, vol. 25, issue 2, June 1996.
[SKEEN85] D. Skeen. Unpublished communication. Referenced in K. Birman, T. Joseph, “Reliable Communication in the Presence of Failures”, ACM Trans. on Computer Systems, 47-76, 1987.
[SOSP07] G. DeCandia et al. “Dynamo: Amazon’s Highly Available Key-value Store”. Proc. of the 21st ACM SIGOPS Symposium on Operating Systems Principles, 2007.
[SRDS10] Nicolas Schiper, Pierre Sutra, Fernando Pedone. “P-Store: Genuine Partial Replication in Wide Area Networks”. Proc. of the 29th IEEE Symposium on Reliable Distributed Systems, 2010.
[VLDB00] Bettina Kemme, Gustavo Alonso. “Don’t Be Lazy, Be Consistent: Postgres-R, A New Way to Implement Database Replication”. Proc. of the 26th International Conference on Very Large Data Bases, 2000.