CSCI5570 Large Scale Data Processing Systems NewSQL James Cheng CSE, CUHK Slide Ack.: modified based on the slides from Lianghong Xu
Calvin: fast distributed transactions for partitioned database systems Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, Daniel J. Abadi SIGMOD 2012
High-level Overview A transaction processing and replication layer Built on top of a generic, non-transactional data store Full ACID for distributed transactions Active, consistent replication Horizontal scalability Key technical feature Scalable distributed transaction processing via a deterministic locking mechanism that eliminates distributed commit protocols
The Problem Distributed transactions are expensive Agreement protocol Multiple roundtrips in 2-phase commit Much longer than transaction logic itself Limited scalability Locking Lock held during the entire transaction Possible deadlock
Consistent replication Many systems allow inconsistent replication Replicas can diverge, but are eventually consistent Dynamo, SimpleDB, Cassandra… Consistent replication: an emerging trend (for OLTP) Instant failover Increased latency, especially for geo-replication Cost is only in latency, not throughput or contention Lock contention: occurs whenever one process or thread attempts to acquire a lock held by another; the finer-grained the available locks, the less likely contention (e.g., locking a row rather than the entire table, or a cell rather than the entire row)
Goals of Calvin Eliminate 2-phase commit (scalability) Reduce lock duration (throughput) Provide consistent replication (consistency) Approach: Decoupling “transaction logic” from “heavy-lifting tasks” replication, locking, disk access … Deterministic concurrency control
Non-deterministic Database Systems Serial ordering cannot be pre-determined given the transaction inputs Determined at runtime, logically equivalent to some serial ordering Ordering can diverge across executions Due to different thread scheduling, network latencies, … Must abort on non-deterministic events, e.g., node failure
Serial Ordering in 2-phase Locking
Deterministic database systems Given the transactional inputs, serial ordering is pre-determined Acquire locks in the order of the pre-determined transactional input All replicas can be made to emulate the same serial execution order Database states are guaranteed not to diverge
Deterministic database systems Given transactional inputs, serial ordering is pre-determined Benefits: No agreement protocol No need to check node failures Recovery from other replicas No aborts due to non-deterministic events Consistent replication made easier Only need to replicate transactional inputs Execute same transaction inputs in replicating sites Disadvantage Reduced concurrency due to the pre-determined ordering
Architecture Client requests enter at the sequencers Sequencers: batch partial requests and replicate them consistently across replicas Schedulers: deterministic locking; issue local and remote record accesses Storage: any store exposing a “CRUD” (create, read, update, delete) interface Each replica (Replica A, Replica B) holds all partitions (Partition 1, Partition 2), one sequencer/scheduler/storage stack per partition
Sequencer and replication Transactional inputs are divided into batches, each falling within a 10-ms epoch Sequencer places transactions in each batch into a global sequence Nodes are organized into replication groups, each containing all replicas of a particular partition (e.g., partition 1 in replica A and partition 1 in replica B form a replication group) Two modes of replication asynchronous mode synchronous mode
Sequencer and replication Only transactional inputs need to be replicated Asynchronous replication one replica is designated as the master the master replica handles all transaction inputs each sequencer on the master propagates its batches to the slave sequencers in its replication group low latency, but complex failure recovery Synchronous replication all sequencers in a replication group use Paxos (ZooKeeper) to agree on a combined batch of transactions for each epoch higher latency, but throughput is not affected the contention footprint (i.e., total duration a transaction holds its locks) remains the same
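The epoch-based sequencing above can be sketched as follows. This is a toy illustration, not Calvin's actual code: the class and function names are invented, and the Paxos agreement step of synchronous mode is elided with a comment. Batches are totally ordered by (epoch, sequencer id), which yields the same deterministic global sequence at every replica.

```python
class Sequencer:
    """Toy sketch of a Calvin sequencer: collects client transaction
    requests into per-epoch batches (10 ms epochs in the paper) and
    tags each batch for deterministic global ordering."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.pending = []
        self.epoch = 0

    def submit(self, txn):
        self.pending.append(txn)

    def close_epoch(self):
        # In synchronous mode, the replication group would run Paxos
        # here to agree on one combined batch per epoch; we just emit
        # the local batch tagged with (epoch, node_id).
        batch = (self.epoch, self.node_id, list(self.pending))
        self.pending.clear()
        self.epoch += 1
        return batch


def global_order(batches):
    # Deterministic total order: sort batches by (epoch, sequencer id),
    # then concatenate each batch's transactions in arrival order.
    order = []
    for _epoch, _node, txns in sorted(batches, key=lambda b: (b[0], b[1])):
        order.extend(txns)
    return order


s1, s2 = Sequencer(1), Sequencer(2)
s1.submit("T1"); s2.submit("T2"); s1.submit("T3")
print(global_order([s1.close_epoch(), s2.close_epoch()]))  # ['T1', 'T3', 'T2']
```

Because every replica sorts by the same (epoch, sequencer id) key, replicas that receive the same batches derive the same serial order with no further coordination.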
Scheduler and deterministic locking Each node’s scheduler locks only the records stored locally Single-threaded lock manager Resembles 2-phase locking, but enforces the transaction ordering from the sequencers if transactions A and B both request locks on record R, and A precedes B in the sequence, then A must request its lock on R before B does, and the lock manager must grant the lock on R to A before B No deadlock Question: what if A and B come from two sequencers, X and Y, that are not in the same replication group? How are A and B ordered? It seems that X and Y do not talk to each other (or do they also use Paxos to agree on a global order?)
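The ordering rule above can be sketched with per-record FIFO queues. This is an illustrative model (names invented, single-threaded as in Calvin, no reader/writer lock modes): because every transaction enqueues on all its records in the global sequence order, grants follow that order and cycles, hence deadlocks, cannot arise.

```python
from collections import defaultdict, deque

class DeterministicLockManager:
    """Sketch of a Calvin-style single-threaded lock manager.
    Transactions request all their locks in global-sequence order;
    each record keeps a FIFO queue of waiters, so lock grants follow
    the pre-determined order and deadlock is impossible."""

    def __init__(self):
        self.queues = defaultdict(deque)  # record -> queue of txn ids

    def request(self, txn, records):
        # Must be called in the global serial order.
        for r in records:
            self.queues[r].append(txn)

    def holds_all(self, txn, records):
        # txn may execute once it is at the head of every queue.
        return all(self.queues[r][0] == txn for r in records)

    def release(self, txn, records):
        for r in records:
            assert self.queues[r][0] == txn
            self.queues[r].popleft()


lm = DeterministicLockManager()
lm.request("T1", ["A", "B"])   # T1 precedes T2 in the global sequence
lm.request("T2", ["B"])
print(lm.holds_all("T1", ["A", "B"]))  # True
print(lm.holds_all("T2", ["B"]))       # False until T1 releases B
lm.release("T1", ["A", "B"])
print(lm.holds_all("T2", ["B"]))       # True
```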
Deterministic locking Read A Read B Write A T1 Pre-determined serial ordering Read B Write B T2 Read C T3 T1 must precede T2 T1: Lw(A) T1: Lr(B) T2: Lw(B) T3: Lr(C) Memory A B C Disk
Worker execution A B C Read A Read B Write A T1 Pre-determined serial ordering Read B Write B T2 Read C T3 Active participant: write set is stored locally Passive participant: only read set is stored Memory A B C Disk Active participant Passive participant Passive participant
Worker execution A B C Read A Read B Write A T1 Pre-determined serial ordering Read B Write B T2 Read C T3 Memory A B C Disk Active participant
Problem of deterministic locking Non-deterministic locking Read A Read B Write A Read A Read B Write B Read C Write A Need to fetch from disk T1 Reordering Read B Write B T2 Read C T3 T2 cannot proceed because T1 is blocked waiting on disk access to A Even if T1 has NOT yet acquired the lock for B T3 can still proceed
Calvin’s solution Delay forwarding transaction requests and prefetch their data from disk to eliminate contention Hope: most (99% of) transactions execute without disk access Problems How long to delay? Need to estimate disk and network latency Tracking which records are in memory needs a scalable approach Future work
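The delay-and-prefetch idea can be sketched as a small decision function. Everything here is illustrative: the function name, the fixed latency estimate, and the in-memory-set argument are assumptions (the paper notes that estimating latency and tracking memory residency scalably are open problems).

```python
DISK_LATENCY_EST_MS = 5  # assumed fixed estimate; Calvin must tune this

def forwarding_delay(read_set, write_set, in_memory):
    """Sketch of Calvin's delay-and-prefetch scheme: if every record a
    transaction touches is memory-resident, forward it into the batch
    immediately; otherwise issue prefetches and delay forwarding by the
    estimated disk latency, so it rarely blocks while holding locks."""
    cold = (set(read_set) | set(write_set)) - in_memory
    for record in cold:
        pass  # here the sequencer would issue an async prefetch of `record`
    return DISK_LATENCY_EST_MS if cold else 0


print(forwarding_delay({"A"}, {"B"}, {"A", "B"}))  # 0: all records warm
print(forwarding_delay({"A", "C"}, set(), {"A"}))  # 5: C must be prefetched
```

An underestimate of the latency recreates the blocking problem above; an overestimate adds needless latency without hurting throughput, which is why the paper leans toward conservative estimates.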
Checkpointing Fault-tolerance made easier in Calvin Instant failover Only log transactional inputs, no REDO log Checkpointing for fast recovery Failure recovery from a recent state Rather than from scratch
Checkpointing Naive synchronous checkpointing Freeze one replica only for checkpointing Difficult to bring the frozen replica back up-to-date Asynchronous variation of Zigzag Zigzag 2 copies per record in each replica, 2x memory footprint 1 copy is up-to-date, the other is the latest checkpoint Calvin’s Zigzag variant Specify a point in the global serial order; two versions for each record: before: updated only by transactions preceding the point after: updated only by transactions after the point Once all transactions preceding the point have completed, the before version is immutable and is checkpointed The before version is then deleted, and the after version serves as the base for the next checkpoint
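The before/after versioning can be sketched per record. This is an illustrative model under the paper's description, not Calvin's implementation: class and method names are invented, and the "point" is a sequence number in the global serial order.

```python
class ZigzagRecord:
    """Sketch of Calvin's asynchronous Zigzag variant. Around a chosen
    checkpoint point in the global serial order, each record keeps a
    `before` version (written only by transactions preceding the point)
    and an `after` version (written only by later transactions)."""

    def __init__(self, value):
        self.before = value   # becomes the checkpoint candidate
        self.after = None     # set on the first post-point write

    def write(self, seq, point, value):
        if seq < point:
            self.before = value   # pre-point transaction
        else:
            self.after = value    # post-point transaction

    def read(self):
        # Readers see the newest state; fall back to `before` if no
        # post-point write has happened yet.
        return self.after if self.after is not None else self.before

    def checkpoint_and_advance(self):
        # Once all pre-point transactions complete, `before` is immutable:
        # persist it, then promote the current state as the base for the
        # next checkpoint period.
        snapshot = self.before
        self.before = self.read()
        self.after = None
        return snapshot


r = ZigzagRecord(1)
r.write(5, 10, 2)    # seq 5 precedes point 10: updates `before`
r.write(12, 10, 3)   # seq 12 follows the point: updates `after`
print(r.checkpoint_and_advance())  # 2: the state as of the point
print(r.read())                    # 3: post-point writes survive
```

Because pre-point and post-point writers touch disjoint versions, the checkpoint can be written out asynchronously while new transactions keep executing.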
Scalability Calvin scales (near-)linearly Throughput comparable to world-record results With much cheaper hardware
Scalability Figure 5: Total and per-node microbenchmark throughput, varying deployment size. Scales linearly for low-contention workloads Scales sub-linearly when contention is high Stragglers (slow machines, execution progress skew) Exacerbated under high contention
Scalability under high contention Slowdown is measured against running a perfectly partitionable, low-contention version of the same workload Calvin scales better than 2PC in the face of high contention