Calvin: Fast Distributed Transactions for Partitioned Database Systems
Presented by Dr. Greg Speegle, April 12, 2013
- Two-phase commit slow relative to local transaction processing
- CAP Theorem:
  - Option 1: Reduce availability
  - Option 2: Reduce consistency
- Goal: provide availability and consistency by changing transaction semantics
- Normal transaction execution:
  - Submit SQL statements
  - Subsequent operations dependent on results
- Deterministic transaction execution (contrasted in the sketch below):
  - Submit all requests before start
  - Example: auto-commit
  - Difficult for dependent execution
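A toy contrast of the two styles; the client API and request format are assumed for illustration, not from the talk:

```python
# Hypothetical client API, used only to contrast the two styles.

def interactive_transfer(client, src, dst, amount):
    # Normal execution: the later statements depend on the first result,
    # so the full transaction is not known when it starts.
    balance = client.query("SELECT balance FROM acct WHERE id = ?", src)
    if balance >= amount:
        client.execute("UPDATE acct SET balance = balance - ? WHERE id = ?",
                       amount, src)
        client.execute("UPDATE acct SET balance = balance + ? WHERE id = ?",
                       amount, dst)
    client.commit()

# Deterministic execution: the entire request (logic + arguments) is
# submitted before the transaction starts, like a one-shot procedure call.
deterministic_request = {
    "proc": "transfer",                      # pre-registered logic
    "args": {"src": 1, "dst": 2, "amount": 100},
}
```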
- Sequencing layer (per replica): creates universal transaction execution order
- Scheduling layer (per data store): executes transactions consistently with that order
- Storage layer: CRUD interface (sketched below)
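A minimal sketch of the layering, with interface names of my own choosing; the point is that the scheduler only needs a CRUD store beneath it:

```python
from abc import ABC, abstractmethod

class Storage(ABC):
    """Storage layer: any store exposing a CRUD interface will do."""

    @abstractmethod
    def create(self, key: str, value: bytes) -> None: ...
    @abstractmethod
    def read(self, key: str) -> bytes: ...
    @abstractmethod
    def update(self, key: str, value: bytes) -> None: ...
    @abstractmethod
    def delete(self, key: str) -> None: ...

class Scheduler:
    """Scheduling layer: one per data store; executes transactions
    consistently with the order produced by the sequencing layer."""

    def __init__(self, storage: Storage):
        self.storage = storage

    def run_batch(self, ordered_txns):
        for txn in ordered_txns:    # sequence order fixes the outcome
            txn.execute(self.storage)
```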
- Dataset partitioned; partitions are replicated
- One copy of each partition forms a replica
- All replicas of one partition form a replication group
- Master/slave within each replication group (for asynchronous replication)
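A small sketch of this layout under hash partitioning; node names and counts are made up:

```python
# Each key hashes to a partition; a replication group holds all
# replicas of one partition.

NUM_PARTITIONS = 4
REPLICATION_FACTOR = 3

def partition_of(key: str) -> int:
    return hash(key) % NUM_PARTITIONS

# replication_groups[p] lists the nodes storing copies of partition p;
# with asynchronous replication the first node acts as master.
replication_groups = {
    p: [f"node-{p}-{r}" for r in range(REPLICATION_FACTOR)]
    for p in range(NUM_PARTITIONS)
}

# A full replica of the database = one copy of each partition,
# e.g. the 0th copy from every group:
replica_0 = [replication_groups[p][0] for p in range(NUM_PARTITIONS)]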
- Requests (deterministic transactions) submitted locally
- Epoch: 10 ms group of requests
- Asynchronous replication: master receives all requests and determines order
- Synchronous replication: Paxos determines order
- Batch sent to scheduler (sketched below)
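A single-node sketch of the epoch loop; replication (master ordering or Paxos) is elided, and the API names are assumed. The returned ordered batch is what gets handed to the scheduler:

```python
import time

EPOCH_MS = 10  # batch window from the talk

class Sequencer:
    """Collect locally submitted requests for one epoch, then stamp a
    global order on the batch."""

    def __init__(self):
        self.pending = []
        self.epoch = 0

    def submit(self, txn):
        self.pending.append(txn)        # requests arrive locally

    def run_one_epoch(self):
        time.sleep(EPOCH_MS / 1000)     # close the 10 ms window
        batch, self.pending = self.pending, []
        self.epoch += 1
        # (epoch, position-in-batch) is the universal execution order.
        return [(self.epoch, i, txn) for i, txn in enumerate(batch)]
```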
- Logical concurrency control and recovery (e.g., no TIDs)
- Lock manager distributed: each node locks only the keys it stores locally
- Strict 2PL with changes:
  - If t0 and t1 conflict and t0 precedes t1 in sequence order, t0 acquires its locks before t1
  - All lock requests of a transaction are processed together, in sequence order (sketched below)
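A sketch of the deterministic locking idea, simplified to exclusive locks only. Because whole-transaction lock requests are enqueued in sequence order, a per-key FIFO queue grants conflicting locks in exactly that order:

```python
from collections import defaultdict, deque

class DeterministicLockManager:
    """Each node locks only keys it stores locally."""

    def __init__(self):
        self.queues = defaultdict(deque)   # key -> FIFO of waiting txns

    def request_all(self, txn_id, keys):
        """Called once per transaction, in sequence order: all of the
        transaction's lock requests are enqueued together."""
        for key in keys:
            self.queues[key].append(txn_id)

    def ready(self, txn_id, keys):
        """A transaction executes only after it holds every lock,
        i.e., it is at the head of every queue it waits on."""
        return all(self.queues[key][0] == txn_id for key in keys)

    def release_all(self, txn_id, keys):
        for key in keys:
            assert self.queues[key][0] == txn_id
            self.queues[key].popleft()
```

Since conflicting transactions always wait on strictly earlier ones in the sequence, the waits-for graph is acyclic: no deadlock.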
- Transaction executes after all locks acquired
- Read/write set analysis: local vs. remote
  - Read-only nodes are passive participants
  - Write nodes are active participants
- Perform local reads
- Distribute reads to active participants
- Collect remote read results
- Apply local writes (sketched below)
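A sketch of one node's execution phase; the network interface and the transaction attributes (read_set, write_set, participants, logic) are assumed names:

```python
def execute_at_node(txn, node_keys, storage, network):
    """Run one transaction at one participant node."""

    # 1. Perform local reads.
    local_reads = {k: storage.read(k)
                   for k in txn.read_set if k in node_keys}

    # 2. Distribute local reads to every active participant (writers).
    active = {n for n in txn.participants if txn.write_set & n.keys}
    for node in active:
        network.send(node, txn.id, local_reads)

    if not (txn.write_set & node_keys):
        return                          # passive participant: done

    # 3. Active participant: collect remote read results.
    all_reads = dict(local_reads)
    while len(all_reads) < len(txn.read_set):
        all_reads.update(network.recv(txn.id))

    # 4. Run the transaction logic and apply local writes.
    for key, value in txn.logic(all_reads).items():
        if key in node_keys:
            storage.update(key, value)
```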
- Deadlock free (acyclic waits-for graph)
- Dependent transactions:
  - Read-only reconnaissance query generates the read set
  - Transaction executed with the resulting read/write locks
  - Re-execute if the read set changed (sketched below)
  - Maximum conflict footprint under 2PL
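A sketch of the reconnaissance loop for dependent transactions; the database API names are assumed:

```python
def run_dependent_txn(txn, db):
    """Handle a transaction whose read/write set depends on data."""
    while True:
        # 1. Cheap read-only reconnaissance query predicts the read set.
        predicted = db.reconnaissance_read_set(txn)

        # 2. Submit the transaction with locks for the predicted set;
        #    execution reports the set it actually touched.
        result, actual = db.execute_with_locks(txn, predicted)

        # 3. If the data moved and the real set differs, re-execute.
        if actual == predicted:
            return result
```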
- Disk I/O problem: pausing t0 when I/O is required lets a t1 "jump ahead" of t0 (acquire a conflicting lock before t0)
- Solution: delay t0 in the sequencer but issue the data request immediately, so t1 precedes t0 in both sequence order and execution order (sketched below)
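A sketch of the delay-and-prefetch idea; the storage and sequencer APIs and the latency estimate are assumed:

```python
import threading

DISK_LATENCY_MS = 5   # assumed estimate of one disk fetch

def admit(txn, sequencer, storage):
    """If t0 needs cold data, request it now but delay t0's entry into
    the sequence, so a conflicting t1 lands ahead of t0 in both the
    sequence and the execution."""
    cold = [k for k in txn.read_set if not storage.in_memory(k)]
    if not cold:
        sequencer.submit(txn)
        return
    for key in cold:
        storage.prefetch(key)            # issue the data request now
    threading.Timer(DISK_LATENCY_MS / 1000,
                    sequencer.submit, args=(txn,)).start()
```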
- Logging requires only the ordered transaction input to restore after a failure
- At checkpoint time (a global epoch time):
  - Keep two versions of data, "before" and "after"
  - Transactions access the appropriate version
  - After all "before" transactions terminate, flush all data
  - Throw away the "before" version if an "after" exists (sketched below)
- Roughly 20% throughput impact
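A sketch of the two-version scheme, simplified (the real algorithm is more careful about concurrency); the idea is that the checkpoint captures a consistent snapshot without stopping transactions:

```python
class CheckpointedStore:
    """Keep 'before' and 'after' versions of each record around a
    checkpoint epoch."""

    def __init__(self, checkpoint_epoch):
        self.cp = checkpoint_epoch
        self.before = {}   # state as of the checkpoint
        self.after = {}    # writes by post-checkpoint transactions

    def write(self, txn_epoch, key, value):
        if txn_epoch < self.cp:
            self.before[key] = value
        else:
            self.after[key] = value   # 'before' stays frozen for the flush

    def read(self, txn_epoch, key):
        if txn_epoch >= self.cp and key in self.after:
            return self.after[key]
        return self.before.get(key)

    def flush(self):
        """Once all pre-checkpoint transactions terminate: flush the
        'before' versions, then drop any that have an 'after'."""
        snapshot = dict(self.before)   # stand-in for the disk write
        for key in self.after:
            self.before.pop(key, None)
        self.before.update(self.after)
        self.after.clear()
        return snapshot
```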
- TPC-C benchmark (order-placing transactions)
- Throughput scales linearly with number of machines
- Per-node throughput appears asymptotic
- At high contention, outperforms an RDBMS
- At low contention, worse performance
- Adds ACID capability to any CRUD system
- Achieves nearly linear scale-up
- Requires deterministic transactions