Spanner
COS 418: Distributed Systems, Lecture 16, Wyatt Lloyd
Slides adapted from Mike Freedman's, which are adapted from the Spanner OSDI talk

Why Google Built Spanner
- 2005 – BigTable [OSDI 2006]: eventually consistent across datacenters. Lesson: "don't need distributed transactions"
- 2008? – Megastore [CIDR 2011]: strongly consistent across datacenters, with an option for distributed transactions. Performance was not great…
- 2011 – Spanner [OSDI 2012]: strictly serializable distributed transactions. "We wanted to make it easy for developers to build their applications"

Spanner: Google’s Globally-Distributed Database OSDI 2012

Google's Setting
- Dozens of datacenters (zones)
- Per zone, 100-1000s of servers
- Per server, 100-1000 shards (tablets)
- Every shard replicated for fault tolerance (e.g., 5x)

Scale-out vs. Fault Tolerance
[Diagram: shards P and Q, each replicated across three servers]
- Every shard is replicated via MultiPaxos
- So every "operation" within a transaction across tablets is actually a replicated operation within a Paxos RSM
- Paxos groups can stretch across datacenters!

Read-Only Transactions
- Transactions that only read data
- Predeclared, i.e., the developer uses a READ_ONLY flag / interface
- Reads are the dominant operations
  - e.g., FB's TAO had 500 reads : 1 write [ATC 2013]
  - e.g., Google Ads (F1) on Spanner, from 1? DC in 24h: 21.5B reads, 31.2M single-shard transactions, 32.1M multi-shard transactions

Make Read-Only Txns Efficient
- Ideal: read-only transactions that are non-blocking
  - Arrive at shard, read data, send data back
  - Impossible with strict serializability (SNOW theorem, after the break!)
- Goal 1: Lock-free read-only transactions
- Goal 2: Non-blocking stale read-only txns

Disruptive idea: Do clocks really need to be arbitrarily unsynchronized? Can you engineer some max divergence?

TrueTime
- "Global wall-clock time" with bounded uncertainty; ε is the worst-case clock divergence
- Timestamps become intervals, not single values
[Timeline: TT.now() returns an interval [earliest, latest] of width 2ε around the current time]
- Consider an event e_now which invoked tt = TT.now():
  Guarantee: tt.earliest <= t_abs(e_now) <= tt.latest (see the sketch below)
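To make the interval guarantee concrete, here is a minimal Python sketch (not Google's implementation) of the TT interface: now() returns an interval of width 2ε that is assumed to contain absolute time, and after()/before() are the helpers the paper derives from it. The 7 ms default ε is just an illustrative value.

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # lower bound on absolute time (seconds)
    latest: float    # upper bound on absolute time (seconds)

class TrueTime:
    """Toy TrueTime: a local clock whose error is assumed bounded by epsilon."""

    def __init__(self, epsilon: float = 0.007):   # e.g., ~7 ms worst-case uncertainty
        self.epsilon = epsilon

    def now(self) -> TTInterval:
        t = time.time()                            # local clock reading
        # Guarantee (by the bounded-error assumption):
        #   earliest <= t_abs(e_now) <= latest, interval width 2*epsilon
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, s: float) -> bool:
        """True only if absolute time is definitely past s."""
        return self.now().earliest > s

    def before(self, s: float) -> bool:
        """True only if absolute time is definitely not yet s."""
        return self.now().latest < s
```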

TrueTime for Read-Only Txns
- Assign all transactions a wall-clock commit time (s)
- All replicas of all shards track how up-to-date they are with t_safe: all transactions with s < t_safe have committed on this machine
- Goal 1: Lock-free read-only transactions (sketched below)
  - Current time <= TT.now().latest
  - s_read = TT.now().latest
  - Wait until s_read < t_safe
  - Read data as of s_read
- Goal 2: Non-blocking stale read-only txns
  - Similar to the above, except explicitly choose a time in the past
  - (Trades away consistency for better performance, e.g., lower latency)
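A sketch, reusing the toy TrueTime above, of how a replica could serve both kinds of read-only transaction: pick s_read = TT.now().latest (or accept an explicit past timestamp), wait until t_safe has passed s_read, then read versioned data as of s_read. The Replica class and its version layout are illustrative, not Spanner's code.

```python
import time

class Replica:
    """Toy replica: multi-versioned key-value data plus a t_safe watermark."""

    def __init__(self, tt):
        self.tt = tt
        self.t_safe = 0.0        # every txn with commit time s < t_safe has applied here
        self.versions = {}       # key -> list of (commit_ts, value), sorted by commit_ts

    def read_at(self, key, ts):
        """Latest value of key with commit timestamp <= ts."""
        older = [v for (s, v) in self.versions.get(key, []) if s <= ts]
        return older[-1] if older else None

    def read_only_txn(self, keys):
        s_read = self.tt.now().latest      # fresh read: "now" as seen by TrueTime
        while self.t_safe <= s_read:       # lock-free, but may wait for t_safe to advance
            time.sleep(0.001)
        return {k: self.read_at(k, s_read) for k in keys}

    def stale_read_only_txn(self, keys, s_read):
        # Caller explicitly chooses a timestamp in the past; usually no waiting needed.
        while self.t_safe <= s_read:
            time.sleep(0.001)
        return {k: self.read_at(k, s_read) for k in keys}
```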

Timestamps and TrueTime
[Timeline: acquire locks; pick s > TT.now().latest; commit wait until TT.now().earliest > s; release locks]
- Picking s > TT.now().latest ensures that s, the commit time, is greater than the transaction's start time (just like read-only transactions)
- Commit wait, until TT.now().earliest > s, ensures that s is less than the transaction's end time
- Now a read-only transaction that starts after a transaction's end will definitely see it, without locks (see the writer-side sketch below):
  - txn_end < read_begin (definition)
  - read_begin <= s_read (picking of s_read)
  - s_txn < txn_end (commit wait)
  - => s_txn < s_read: the read-only transaction sees all required transactions, without locks!
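A writer-side sketch of the same argument, again using the toy TrueTime: pick s strictly above TT.now().latest when commit begins, and do not finish (make writes visible, release locks, acknowledge the client) until TT.now().earliest has passed s. The replicate and apply_and_release callbacks are stand-ins for Paxos logging and lock release.

```python
import time

def commit_with_wait(tt, replicate, apply_and_release):
    """Commit a read-write transaction at a TrueTime timestamp s.

    Targets the two invariants on the slide:
      - s > the transaction's start time (s is picked above TT.now().latest)
      - s < the transaction's end time   (we return only after TT.now().earliest > s)
    """
    s = tt.now().latest + 1e-6        # pick s strictly greater than TT.now().latest
    replicate(s)                       # e.g., log the commit record through Paxos
    while not tt.after(s):             # commit wait: absolute time must pass s ...
        time.sleep(0.001)
    apply_and_release(s)               # ... before writes become visible and locks release
    return s
```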

Commit Wait
- Enables efficient read-only transactions
- Cost: 2ε of extra latency
- Reduce/eliminate it by overlapping with:
  - Replication
  - Two-phase commit

Commit Wait and Replication
[Timeline: acquire locks; pick s; start consensus; commit wait done; achieve consensus; notify followers; release locks]
- Replication (via Paxos) is also needed; it can run in parallel with commit wait, and consensus often takes longer (see the sketch below)
- Must wait before notifying followers, to ensure their t_safe is correct
- Sufficient for single-shard transactions!
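A sketch of the overlap noted above: commit wait and Paxos replication run concurrently, so the slower of the two (often consensus) dominates latency. The paxos_replicate, notify_followers, and release_locks callbacks are hypothetical stand-ins, not Spanner's API.

```python
import threading
import time

def commit_overlapped(tt, paxos_replicate, notify_followers, release_locks):
    s = tt.now().latest + 1e-6              # pick the commit timestamp s

    replicated = threading.Event()
    def replicate():
        paxos_replicate(s)                   # achieve consensus on the commit record
        replicated.set()
    threading.Thread(target=replicate).start()

    while not tt.after(s):                   # commit wait, concurrent with consensus
        time.sleep(0.001)
    replicated.wait()                        # both must finish before going further

    notify_followers(s)                      # followers may now advance t_safe past s
    release_locks()
    return s
```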

Client-Driven Transactions (Multi-Shard)
Client runs 2PL with 2PC (a sketch of this flow follows the list below):
- Issues reads to the leader of each shard group, which acquires read locks and returns the most recent data
- Locally performs writes
- Chooses a coordinator from the set of leaders and initiates commit
- Sends a commit message to each leader, including the identity of the coordinator and the buffered writes
- Waits for commit from the coordinator
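A client-side sketch of this flow with hypothetical leader stubs (read, commit, and wait_for_commit are assumed RPCs, not Spanner's API): read through each shard's leader, buffer writes locally, pick a coordinator, then send commit messages naming the coordinator and carrying the buffered writes.

```python
def run_client_txn(leaders, read_keys, compute_writes):
    """leaders: dict shard -> leader stub; read_keys: dict shard -> list of keys."""
    # 1. Reads go to the leader of each shard group; the leader acquires
    #    read locks and returns the most recent data.
    snapshot = {}
    for shard, keys in read_keys.items():
        snapshot.update(leaders[shard].read(keys))

    # 2. Writes are buffered locally at the client until commit.
    buffered_writes = compute_writes(snapshot)      # shard -> {key: value}

    # 3. Choose a coordinator from the set of participant leaders.
    participants = sorted(set(read_keys) | set(buffered_writes))
    coordinator = participants[0]

    # 4. Commit message to each leader, including the coordinator's identity
    #    and that shard's buffered writes.
    for shard in participants:
        leaders[shard].commit(coordinator=coordinator,
                              writes=buffered_writes.get(shard, {}))

    # 5. Wait for the commit decision from the coordinator.
    return leaders[coordinator].wait_for_commit()
```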

Commit Wait and 2PC
On the commit message from the client, leaders acquire local write locks (a leader-side sketch follows the list below):
- If non-coordinator:
  - Choose a prepare timestamp > previous local timestamps
  - Log a prepare record through Paxos
  - Notify the coordinator of the prepare timestamp
- If coordinator:
  - Wait until hearing from the other participants
  - Choose a commit timestamp >= every prepare timestamp and > local timestamps
  - Log the commit record through Paxos
  - Wait out the commit-wait period
  - Send the commit timestamp to replicas, other leaders, and the client
- All apply at the commit timestamp and release locks
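A leader-side sketch of the timestamp rules in this list, with Paxos logging stubbed out by a paxos_log callback: participants choose a prepare timestamp above their previous local timestamps; the coordinator chooses s_c at least as large as every prepare timestamp, above its own local timestamps, and (per the earlier rule) above TT.now().latest, then commit-waits before announcing it. Class and method names are illustrative.

```python
import time

class ParticipantLeader:
    def __init__(self, tt, paxos_log):
        self.tt = tt
        self.paxos_log = paxos_log        # stand-in for durably logging a record via Paxos
        self.max_local_ts = 0.0           # highest timestamp this leader has used

    def prepare(self):
        # Prepare timestamp must exceed all previous local timestamps.
        sp = max(self.max_local_ts, self.tt.now().latest) + 1e-6
        self.max_local_ts = sp
        self.paxos_log("prepare", sp)
        return sp                          # reported to the coordinator

class CoordinatorLeader(ParticipantLeader):
    def commit(self, prepare_timestamps):
        # Commit ts >= every prepare ts, > local ts, and > TT.now().latest.
        sc = max(prepare_timestamps + [self.max_local_ts, self.tt.now().latest]) + 1e-6
        self.max_local_ts = sc
        self.paxos_log("commit", sc)
        while not self.tt.after(sc):       # commit wait before anyone learns the outcome
            time.sleep(0.001)
        return sc                          # sent to replicas, other leaders, and the client
```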

Commit Wait and 2PC: Timeline
[Build 1: Leaders TC, TP1, TP2 acquire local write locks after the client's reads (which took read locks and returned the most recent data); each computes its prepare timestamp sp]

[Build 2: The client performs writes locally, chooses a coordinator from the set of leaders, and sends a commit message to each leader including the coordinator's identity; participants start logging prepare records through Paxos, finish logging ("Prepared"), and send sp to the coordinator]

[Build 3: The coordinator computes the overall commit timestamp sc, logs the commit record through Paxos ("Committed"), finishes commit wait, then notifies participants and the waiting client; all apply at sc and release locks]

Example
[Diagram: transaction T2 ("Remove X from friend list", then "Risky post P") commits via coordinator TC with sp = 6, sc = 8 and participant TP ("Remove myself from X's friend list") with sp = 8, sc = 8. Versioned data over times <8, 8, 15: "My friends" is [X] then []; "My posts" gains [P]; "X's friends" is [me] then []]

Disruptive idea: Do clocks really need to be arbitrarily unsynchronized? Can you engineer some max divergence?

TrueTime Architecture
[Diagram: Datacenter 1, Datacenter 2, …, Datacenter n each run GPS timemasters (plus an atomic-clock timemaster); clients poll the timemasters and compute a reference interval [earliest, latest] = now ± ε]

TrueTime Implementation
- now = reference now + local-clock offset
- ε = reference ε + worst-case local-clock drift = 1 ms + 200 μs/sec
[Plot: ε over time; it resets at each 30-second timemaster poll (0 s, 30 s, 60 s, 90 s) and grows by up to +6 ms in between; see the worked example below]
- What about faulty clocks? Bad CPUs are 6x more likely than bad clocks, per 1 year of empirical data
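A small worked computation of the formula above, assuming a 30-second polling interval against the timemasters: ε starts at the 1 ms reference uncertainty and grows at the assumed 200 μs/sec worst-case drift, reaching about 7 ms just before the next poll (the +6 ms sawtooth in the plot).

```python
def epsilon_ms(seconds_since_poll, reference_eps_ms=1.0, drift_us_per_sec=200.0):
    """Worst-case TrueTime uncertainty since the last timemaster poll, in ms."""
    return reference_eps_ms + drift_us_per_sec * seconds_since_poll / 1000.0

# Sawtooth over one 30-second polling interval:
for t in (0, 10, 20, 30):
    print(f"{t:2d}s after poll: eps = {epsilon_ms(t):.1f} ms")
# -> 1.0 ms, 3.0 ms, 5.0 ms, 7.0 ms (then eps drops back at the next poll)
```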

Spanner
- Make it easy for developers to build apps!
- Reads are dominant; make them lock-free
  - TrueTime exposes clock uncertainty
  - Commit wait ensures transactions end after their commit time
  - Read at TT.now().latest
- Globally-distributed database: 2PL with 2PC over Paxos!