Spanner: Google’s Globally-Distributed Database

Slides:



Advertisements
Similar presentations
Remus: High Availability via Asynchronous Virtual Machine Replication
Advertisements

Spanner: Google’s Globally-Distributed Database James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,
Spanner: Google’s Globally-Distributed Database By - James C
D u k e S y s t e m s Time, clocks, and consistency and the JMM Jeff Chase Duke University.
Distributed databases
SPANNER: GOOGLE’S GLOBALLYDISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat,
Presented By Alon Adler – Based on OSDI ’12 (USENIX Association)
NoSQL Databases: MongoDB vs Cassandra
CMPT Dr. Alexandra Fedorova Lecture X: Transactions.
Scaling Distributed Machine Learning with the BASED ON THE PAPER AND PRESENTATION: SCALING DISTRIBUTED MACHINE LEARNING WITH THE PARAMETER SERVER – GOOGLE,
Google Bigtable A Distributed Storage System for Structured Data Hadi Salimi, Distributed Systems Laboratory, School of Computer Engineering, Iran University.
GentleRain: Cheap and Scalable Causal Consistency with Physical Clocks Jiaqing Du | Calin Iorgulescu | Amitabha Roy | Willy Zwaenepoel École polytechnique.
Database System Architectures  Client-server Database System  Parallel Database System  Distributed Database System Wei Jiang.
Spanner Lixin Shi Quiz2 Review (Some slides from Spanner’s OSDI presentation)
Distributed Databases
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung Google∗
Spanner: Google’s Globally-Distributed Database James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman,Sanjay Ghemawat,
1 Large-scale Incremental Processing Using Distributed Transactions and Notifications Written By Daniel Peng and Frank Dabek Presented By Michael Over.
Evaluating the Running Time of a Communication Round over the Internet Omar Bakr Idit Keidar MIT MIT/Technion PODC 2002.
CSC 536 Lecture 10. Outline Case study Google Spanner Consensus, revisited Raft Consensus Algorithm.
Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson,Jean Michel L´eon, Yawei Li, Alexander Lloyd, Vadim Yushprakh Megastore.
Transaction Communications Yi Sun. Outline Transaction ACID Property Distributed transaction Two phase commit protocol Nested transaction.
Large-scale Incremental Processing Using Distributed Transactions and Notifications Daniel Peng and Frank Dabek Google, Inc. OSDI Feb 2012 Presentation.
Distributed Database Systems Overview
Concurrency and Transaction Processing. Concurrency models 1. Pessimistic –avoids conflicts by acquiring locks on data that is being read, so no other.
Preventive Replication in Database Cluster Esther Pacitti, Cedric Coulon, Patrick Valduriez, M. Tamer Özsu* LINA / INRIA – Atlas Group University of Nantes.
INTRODUCTION TO DBS Database: a collection of data describing the activities of one or more related organizations DBMS: software designed to assist in.
XA Transactions.
1 Multiversion Reconciliation for Mobile Databases Shirish Hemanath Phatak & B.R.Badrinath Presented By Presented By Md. Abdur Rahman Md. Abdur Rahman.
Distributed DBMS, Query Processing and Optimization
Accounting in DataGrid HLR software demo Andrea Guarise Milano, September 11, 2001.
CSE 486/586, Spring 2014 CSE 486/586 Distributed Systems Google Spanner Steve Ko Computer Sciences and Engineering University at Buffalo.
Distributed Systems Lecture 5 Time and synchronization 1.
Distributed Databases – Advanced Concepts Chapter 25 in Textbook.

CS 540 Database Management Systems
Spanner: Google’s Globally Distributed Database
Spanner Storage insights
The SNOW Theorem and Latency-Optimal Read-Only Transactions
Operational & Analytical Database
LAB: Web-scale Data Management on a Cloud
CPS 512 midterm exam #1, 10/7/2016 Your name please: ___________________ NetID:___________ /60 /40 /10.
Outline Introduction Background Distributed DBMS Architecture
Introduction to NewSQL
Distributed Transactions and Spanner
The SNOW Theorem and Latency-Optimal Read-Only Transactions
MVCC and Distributed Txns (Spanner)
Replication and Consistency
Concurrency Control II (OCC, MVCC) and Distributed Transactions
EECS 498 Introduction to Distributed Systems Fall 2017
EECS 498 Introduction to Distributed Systems Fall 2017
I Can’t Believe It’s Not Causal
Spanner: Google’s Globally-Distributed Database
EECS 498 Introduction to Distributed Systems Fall 2017
Consistency Models.
Replication and Consistency
Concurrency Control II (OCC, MVCC)
Fundamentals of Databases
Outline Introduction Background Distributed DBMS Architecture
Atomic Commit and Concurrency Control
Concurrency Control II and Distributed Transactions
Decoupled Storage: “Free the Replicas!”
COS 418: Distributed Systems Lecture 16 Wyatt Lloyd
Distributed Database Management Systems
Concurrency control (OCC and MVCC)
Replication and Consistency
Performance Evaluation of a Communication Round over the Internet
Distributed Databases
Transactions, Properties of Transactions
Evaluating the Running Time of a Communication Round over the Internet
Presentation transcript:

Spanner: Google’s Globally-Distributed Database Wilson Hsieh representing a host of authors OSDI 2012

What is Spanner? Distributed multiversion database General-purpose transactions (ACID) SQL query language Schematized tables Semi-relational data model Running in production Storage for Google’s ad data Replaced a sharded MySQL database OSDI 2012

Example: Social Network Sao Paulo Santiago Buenos Aires San Francisco Seattle Arizona Brazil x1000 User posts Friend lists User posts Friend lists User posts Friend lists User posts Friend lists User posts Friend lists x1000 Moscow Berlin Krakow US London Paris Berlin Madrid Lisbon x1000 Russia Spain OSDI 2012

Overview Feature: Lock-free distributed read transactions Property: External consistency of distributed transactions First system at global scale Implementation: Integration of concurrency control, replication, and 2PC Correctness and performance Enabling technology: TrueTime Interval-based global time OSDI 2012

Why consistency matters Read Transactions Generate a page of friends’ recent posts Consistent view of friend list and their posts Why consistency matters Remove untrustworthy person X as friend Post P: “My government is repressive…” OSDI 2012

Single Machine Block writes Generate my page Friend1 post User posts Friend lists User posts Friend lists Friend2 post … Friend999 post Friend1000 post OSDI 2012

Multiple Machines Block writes User posts Friend lists User posts Friend1 post Friend2 post … Generate my page User posts Friend lists User posts Friend lists Friend999 post Friend1000 post OSDI 2012

Multiple Datacenters User posts Friend lists Friend1 post x1000 US Spain … Generate my page User posts Friend lists x1000 Friend999 post Brazil User posts Friend lists x1000 Friend1000 post Russia OSDI 2012

Version Management Transactions that write use strict 2PL Each transaction T is assigned a timestamp s Data written by T is timestamped with s Time <8 8 15 My friends [X] [] My posts [P] X’s friends [me] [] OSDI 2012

Synchronizing Snapshots Global wall-clock time == External Consistency: Commit order respects global wall-time order == Timestamp order respects global wall-time order given timestamp order == commit order OSDI 2012

Timestamps, Global Clock Strict two-phase locking for write transactions Assign timestamp while locks are held Acquired locks Release locks T Pick s = now() OSDI 2012

Timestamp Invariants Timestamp order == commit order Timestamp order respects global wall-time order T3 T4 OSDI 2012

TrueTime “Global wall-clock time” with bounded uncertainty TT.now() earliest latest 2*ε OSDI 2012

Timestamps and TrueTime Acquired locks Release locks T Pick s = TT.now().latest s Wait until TT.now().earliest > s Commit wait average ε average ε OSDI 2012

Commit Wait and Replication Start consensus Achieve consensus Notify slaves Acquired locks Release locks T Pick s Commit wait done OSDI 2012

Commit Wait and 2-Phase Commit Start logging Done logging Acquired locks Release locks TC Committed Notify participants of s Acquired locks Release locks TP1 Acquired locks Release locks TP2 Prepared Send s Compute s for each Commit wait done Compute overall s OSDI 2012

Example Remove X from my friend list Risky post P TC T2 sC=6 s=8 s=15 Remove myself from X’s friend list TP sP=8 s=8 Time <8 8 15 My friends [X] [] My posts [P] X’s friends [me] [] OSDI 2012

What Have We Covered? Lock-free read transactions across datacenters External consistency Timestamp assignment TrueTime Uncertainty in time can be waited out OSDI 2012

What Haven’t We Covered? How to read at the present time Atomic schema changes Mostly non-blocking Commit in the future Non-blocking reads in the past At any sufficiently up-to-date replica OSDI 2012

TrueTime Architecture GPS timemaster GPS timemaster GPS timemaster GPS timemaster Atomic-clock timemaster GPS timemaster Client Datacenter 1 Datacenter 2 … Datacenter n Compute reference [earliest, latest] = now ± ε OSDI 2012

TrueTime implementation now = reference now + local-clock offset ε = reference ε + worst-case local-clock drift ε +6ms drift rate: 200 μs/sec reference uncertainty time 0sec 30sec 60sec 90sec OSDI 2012

What If a Clock Goes Rogue? Timestamp assignment would violate external consistency Empirically unlikely based on 1 year of data Bad CPUs 6 times more likely than bad clocks OSDI 2012

Network-Induced Uncertainty OSDI 2012

What’s in the Literature External consistency/linearizability Distributed databases Concurrency control Replication Time (NTP, Marzullo) OSDI 2012

Future Work Improving TrueTime Building out database features Lower ε < 1 ms Building out database features Finish implementing basic features Efficiently support rich query patterns OSDI 2012

Conclusions Reify clock uncertainty in time APIs Known unknowns are better than unknown unknowns Rethink algorithms to make use of uncertainty Stronger semantics are achievable Greater scale != weaker semantics OSDI 2012

Thanks To the Spanner team and customers To our shepherd and reviewers To lots of Googlers for feedback To you for listening! Questions? OSDI 2012