CSCI5570 Large Scale Data Processing Systems
NewSQL
James Cheng, CSE, CUHK
Slide acknowledgement: modified from the slides by Lianghong Xu
Calvin: Fast Distributed Transactions for Partitioned Database Systems
Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, Daniel J. Abadi. SIGMOD 2012.
High-level Overview
Calvin is a transaction processing and replication layer
- Built on a generic, non-transactional data store
- Full ACID for distributed transactions
- Active, consistent replication
- Horizontal scalability
Key technical feature
- Scalable distributed transaction processing via a deterministic locking mechanism that eliminates distributed commit protocols
The Problem
Distributed transactions are expensive
- Agreement protocol
  - Multiple round trips in two-phase commit (2PC)
  - Much longer than the transaction logic itself
  - Limited scalability
- Locking
  - Locks are held for the entire duration of the transaction
  - Possible deadlock
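The cost of the agreement protocol can be sketched as the classic two-phase commit message flow. The `Participant` class and method names below are illustrative toys, not any real system's API; the point is that each phase is a full round trip, and locks stay held until the second one finishes:

```python
class Participant:
    """Toy participant that always votes yes and records the outcome."""
    def __init__(self):
        self.state = "idle"
    def prepare(self, txn):
        self.state = "prepared"
        return True  # vote yes
    def commit(self, txn):
        self.state = "committed"
    def abort(self, txn):
        self.state = "aborted"

def two_phase_commit(participants, txn):
    # Phase 1 (one full round trip): every participant must vote yes.
    votes = [p.prepare(txn) for p in participants]
    if all(votes):
        # Phase 2 (a second round trip): locks are held until this completes.
        for p in participants:
            p.commit(txn)
        return True
    # Any "no" vote or timeout forces a global abort.
    for p in participants:
        p.abort(txn)
    return False

nodes = [Participant() for _ in range(3)]
print(two_phase_commit(nodes, "T1"))               # True
print(all(p.state == "committed" for p in nodes))  # True
```

Even in this toy, the transaction's locks span both phases; in a real deployment those two round trips often dominate the transaction logic itself.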
Consistent replication
- Many systems allow inconsistent replication
  - Replicas can diverge but are eventually consistent
  - Dynamo, SimpleDB, Cassandra, ...
- Consistent replication: an emerging trend (for OLTP)
  - Instant failover
  - Increased latency, especially for geo-replication
  - Cost is only in latency, not in throughput or contention
- Lock contention occurs whenever one process or thread attempts to acquire a lock held by another. The finer-grained the available locks, the less likely one process/thread will request a lock held by another (for example, locking a row rather than the entire table, or a cell rather than the entire row).
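The granularity point can be illustrated with a toy sketch (the table and row names are made up for illustration): two transactions contend exactly when their lock sets intersect, so coarser locks make transactions touching different rows conflict anyway.

```python
def conflicts(locks_a, locks_b):
    """Two transactions contend iff their lock sets intersect."""
    return bool(locks_a & locks_b)

# Table-level locking: both transactions lock the whole "accounts" table.
t1_table = {"accounts"}
t2_table = {"accounts"}

# Row-level locking: they lock disjoint rows of the same table.
t1_rows = {"accounts/row:17"}
t2_rows = {"accounts/row:42"}

print(conflicts(t1_table, t2_table))  # True: table locks contend
print(conflicts(t1_rows, t2_rows))    # False: disjoint row locks do not
```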
Goals of Calvin
- Eliminate two-phase commit (scalability)
- Reduce lock duration (throughput)
- Provide consistent replication (consistency)
Approach:
- Decouple "transaction logic" from "heavy-lifting tasks": replication, locking, disk access, ...
- Deterministic concurrency control
Non-deterministic Database Systems
- The serial ordering cannot be pre-determined from the transaction inputs
  - It is determined at runtime, logically equivalent to some serial ordering
  - The ordering can diverge across executions, due to different thread scheduling, network latencies, ...
- Abort on non-deterministic events, e.g., node failure
Serial Ordering in 2-phase Locking
Deterministic database systems
- Given the transactional inputs, the serial ordering is pre-determined
- Locks are acquired in the order of the pre-determined transactional input
- All replicas can be made to emulate the same serial execution order
- Database states are guaranteed not to diverge
Deterministic database systems
- Given the transactional inputs, the serial ordering is pre-determined
- Benefits:
  - No agreement protocol
  - No need to check for node failures; recovery proceeds from the other replicas
  - No aborts due to non-deterministic events
  - Consistent replication is made easier: only the transactional inputs need to be replicated, and the replicating sites execute the same transaction inputs
- Disadvantage:
  - Reduced concurrency due to the pre-determined ordering
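The "only replicate transactional inputs" point can be sketched in a few lines: two replicas that deterministically replay the same input sequence necessarily end in the same state. The `(key, delta)` transaction format below is an assumption for illustration, standing in for arbitrary deterministic transaction logic.

```python
def apply(state, txn):
    """Apply one deterministic transaction; here a txn is a (key, delta)
    increment, an illustrative stand-in for arbitrary deterministic logic."""
    key, delta = txn
    state[key] = state.get(key, 0) + delta

inputs = [("x", 5), ("y", 3), ("x", -2)]  # the agreed global sequence

replica_a, replica_b = {}, {}
for txn in inputs:  # each replica replays the same input sequence
    apply(replica_a, txn)
    apply(replica_b, txn)

print(replica_a == replica_b)  # True: states cannot diverge
print(replica_a)               # {'x': 3, 'y': 3}
```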
Architecture
[Architecture diagram] Client requests enter through sequencers (one per partition per replica; e.g., Replicas A and B each host Partitions 1 and 2) and are replicated consistently across replicas. Each sequencer forwards batches of partial requests to the schedulers, which perform deterministic locking and execute transactions against the storage layer through a "CRUD" (create, read, update, delete) storage interface, using local access for records in the same partition and remote access otherwise.
Sequencer and replication
- Transactional inputs are divided into batches, each falling into a 10-ms epoch
- The sequencer places the transactions in each batch into a global sequence
- Nodes are organized into replication groups, each containing all replicas of a particular partition (e.g., partition 1 in replica A and partition 1 in replica B form a replication group)
- Two modes of replication: asynchronous and synchronous
Sequencer and replication
- Only the transactional inputs need to be replicated
- Asynchronous replication
  - One replica is designated as the master
  - The master replica handles all transaction inputs
  - Each sequencer on the master propagates to the slave sequencers in its replication group
  - Low latency, but complex failure recovery
- Synchronous replication
  - All sequencers in a replication group use Paxos (ZooKeeper) to agree on a combined batch of transactions for each epoch
  - Higher latency, but throughput is not affected: the contention footprint (i.e., the total duration a transaction holds its locks) remains the same
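The epoch-based batching can be sketched as follows. The 10-ms epoch length is from the paper, but the function shape and request format are assumptions for illustration:

```python
EPOCH_MS = 10  # epoch length used by the sequencer (per the paper)

def batch_by_epoch(requests):
    """requests: list of (arrival_ms, txn). Groups requests into epochs
    and orders each epoch's batch by arrival, yielding a global sequence
    (epoch id, then position within the batch)."""
    batches = {}
    for arrival_ms, txn in sorted(requests):
        epoch_id = arrival_ms // EPOCH_MS
        batches.setdefault(epoch_id, []).append(txn)
    return batches

reqs = [(3, "T1"), (7, "T2"), (12, "T3"), (25, "T4")]
print(batch_by_epoch(reqs))  # {0: ['T1', 'T2'], 1: ['T3'], 2: ['T4']}
```

In synchronous mode, it is these per-epoch batches that the replication group's sequencers agree on via Paxos before any transaction executes.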
Scheduler and deterministic locking
- Each node's scheduler only locks records stored locally
- Single-threaded lock manager
- Resembles two-phase locking, but enforces the transaction ordering from the sequencers: if transactions A and B both request locks on record R, and A precedes B in the sequence, then A must request the lock on R before B, and the lock manager must grant the lock on R to A before B
- No deadlock
- Question: what if A and B come from two sequencers, X and Y, that are not in the same replication group? How are A and B ordered? It seems that X and Y do not talk to each other (or do they also use Paxos to agree on a global order?)
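A minimal sketch of such a single-threaded, order-preserving lock manager, assuming per-record FIFO queues (the class and method names are illustrative): because locks are always requested in the global sequence order, the circular-wait condition for deadlock can never arise.

```python
from collections import deque

class DeterministicLockManager:
    """Single-threaded lock manager: request() is called in the global
    sequence order, so lock queues can never form a wait cycle."""

    def __init__(self):
        self.queues = {}  # record id -> FIFO queue of txn ids

    def request(self, txn, records):
        """Enqueue txn for every record it touches, in sequence order.
        Returns True iff txn immediately holds all its locks."""
        for r in records:
            self.queues.setdefault(r, deque()).append(txn)
        return self.holds_all(txn, records)

    def holds_all(self, txn, records):
        """txn holds a lock iff it is at the head of the record's queue."""
        return all(self.queues[r][0] == txn for r in records)

    def release(self, txn, records):
        """Remove txn from its queues; successors move to the head."""
        for r in records:
            self.queues[r].remove(txn)

lm = DeterministicLockManager()
print(lm.request("T1", ["A", "B"]))  # True: T1 heads both queues
print(lm.request("T2", ["B"]))       # False: T1 still holds B
lm.release("T1", ["A", "B"])
print(lm.holds_all("T2", ["B"]))     # True: T2 now holds B
```

Note that the sketch sidesteps read/write lock modes for brevity; the ordering invariant is the same either way.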
Deterministic locking
[Example] Pre-determined serial ordering: T1 (Read A, Read B, Write A) precedes T2 (Read B, Write B), which precedes T3 (Read C). Since T1 must precede T2, the lock manager issues the lock requests T1: Lw(A), T1: Lr(B), T2: Lw(B), T3: Lr(C) in that order. Records A, B, and C are in memory, backed by disk.
Worker execution
[Example, continued] For the same pre-determined serial ordering (T1: Read A, Read B, Write A; T2: Read B, Write B; T3: Read C), each node participating in a transaction is either an active participant, if part of the transaction's write set is stored locally, or a passive participant, if only part of its read set is stored locally. Passive participants serve their local reads, and the active participants perform the writes.
Problem of deterministic locking
[Example] T1 (Read A, Read B, Write A) must fetch A from disk, so it blocks; under the pre-determined ordering, T2 (Read B, Write B) cannot proceed because it must follow T1, even though T1 has not yet acquired the lock on B. T3 (Read C) can still proceed. A non-deterministic locking scheme could simply reorder T2 before T1 while T1 waits on the disk access.
Calvin's solution
- Delay forwarding transaction requests, and prefetch data from disk, to eliminate the contention
  - Hope: most (99%) transactions execute without any access to disk
- Problems
  - How long to delay? Need to estimate disk and network latency
  - Tracking all in-memory records needs a scalable approach
  - Future work
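The delay-and-prefetch idea can be sketched as follows; the latency estimate, function shape, and names are assumptions for illustration, not Calvin's actual mechanism:

```python
EST_DISK_MS = 5  # assumed disk-fetch latency estimate

def admit(txn, read_set, in_memory, now_ms):
    """On receipt of txn at the sequencer, decide when to forward it.
    Returns (forward_at_ms, records_to_prefetch): cold records are
    prefetched, and forwarding is delayed by the estimated fetch time so
    the data is (hopefully) in memory when the scheduler grants locks."""
    cold = [r for r in read_set if r not in in_memory]
    if not cold:
        return now_ms, []              # all records hot: forward immediately
    return now_ms + EST_DISK_MS, cold  # delay while the prefetch runs

print(admit("T1", ["A", "B"], {"A"}, now_ms=100))  # (105, ['B'])
print(admit("T2", ["A"], {"A"}, now_ms=100))       # (100, [])
```

The two open problems on the slide map directly onto the two assumptions here: how accurate `EST_DISK_MS` can be made, and how to maintain `in_memory` scalably.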
Checkpointing
Fault tolerance is made easier in Calvin
- Instant failover
- Only the transactional inputs are logged; no REDO log
- Checkpointing for fast recovery
  - Failure recovery starts from a recent state rather than from scratch
Checkpointing
- Naive synchronous checkpointing
  - Freeze one replica only for checkpointing
  - Difficult to bring the frozen replica up to date
- Asynchronous variation of Zigzag
  - Zigzag: 2 copies per record in each replica (2x memory footprint); 1 copy is up to date, the other is the latest checkpoint
  - Calvin's Zigzag variant: specify a point in the global serial order, with two versions of each record:
    - before: updated only by transactions preceding the point
    - after: updated only by transactions after the point
  - Once all transactions preceding the point have completed, the before version is immutable and is checkpointed
  - The before version is then deleted, and the after version is used for the next checkpoint
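A minimal sketch of the before/after versioning, under the simplifying assumption that pre-point writes update both copies so that post-point transactions can see them (the class and function names are illustrative):

```python
class VersionedRecord:
    """Two copies per record, as in the Zigzag-style scheme: 'before' is
    updated only by transactions preceding the checkpoint point."""
    def __init__(self, value):
        self.before = value
        self.after = value

def apply_write(rec, value, txn_seq, point):
    if txn_seq < point:
        # Pre-point transaction: its effect belongs in the checkpoint and
        # (simplifying assumption) is copied through to the live version.
        rec.before = value
        rec.after = value
    else:
        # Post-point transaction: must not touch the checkpointed copy.
        rec.after = value

def checkpoint(db):
    """Once every transaction with seq < point has finished, the 'before'
    copies are immutable and can be written out as the checkpoint."""
    return {key: rec.before for key, rec in db.items()}

db = {"x": VersionedRecord(0)}
apply_write(db["x"], 1, txn_seq=5, point=10)   # pre-point write
apply_write(db["x"], 2, txn_seq=12, point=10)  # post-point write
print(checkpoint(db))  # {'x': 1}: the checkpoint excludes the later write
print(db["x"].after)   # 2: the live state includes it
```

Because the checkpoint only reads the frozen `before` copies, it can proceed asynchronously while post-point transactions keep updating `after`.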
Scalability
- Calvin scales (near-)linearly
- Throughput comparable to the world record, with much cheaper hardware
Scalability
Figure 5: Total and per-node microbenchmark throughput, varying deployment size.
- Scales linearly for low-contention workloads
- Scales sub-linearly when contention is high
  - Stragglers (slow machines, execution progress skew)
  - Exacerbated by high contention
Scalability under high contention
- Slowdown is measured against a perfectly partitionable, low-contention version of the same workload
- Calvin scales better than 2PC-based systems in the face of high contention