
Spanner: Google's scalable, multi-version, globally-distributed, and synchronously-replicated database
Presented by Alon Adler – based on OSDI '12 (USENIX Association)

Why was Spanner born? Google already had BigTable and Megastore. Why not BigTable? It can't handle complex, evolving schemas, offers only eventual consistency across datacenters, and limits transactional scope to a single row. Why not Megastore? Poor write performance.

So, what is Spanner? At a high level of abstraction, it is a database that shards data across many sets of Paxos state machines in datacenters spread all over the world. Spanner is designed to scale up to millions of machines across hundreds of datacenters and trillions of database rows. Spanner maintains multiple replicas of each piece of data; replication is used for global availability, so applications can use Spanner for high availability even in the face of wide-area natural disasters.

So, what is Spanner? Spanner supports general-purpose ACID transactions (Atomicity, Consistency, Isolation, Durability); sometimes BigTable's eventual consistency isn't good enough. Spanner also provides a SQL-based query language, which gives applications the ability to handle complex schemas.

A Spanner deployment is called a "universe".
universemaster – shows the status of all zones.
placement driver – transfers data between zones.
location proxies – used by clients to locate the spanservers that hold the data they need.
zonemaster – allocates data to spanservers.
Thousands of spanservers per zone.

Spanserver Software Stack. Tables are sharded across rows into tablets (like BigTable). A tablet implements the mapping (key:string, timestamp:int64) -> string (sketched below). Each spanserver is responsible for 100–1000 tablets. A Paxos state machine on top of each tablet supports synchronous replication.
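To make the tablet's data model concrete, here is a minimal Python sketch of the (key, timestamp) -> value mapping; the class and method names are illustrative, not from the paper:

```python
import bisect

class Tablet:
    """Sketch of Spanner's tablet data model: (key, timestamp) -> value.
    Class and method names here are illustrative, not from the paper."""

    def __init__(self):
        self._ts = {}    # key -> sorted list of write timestamps
        self._val = {}   # (key, timestamp) -> value

    def write(self, key: str, timestamp: int, value: str) -> None:
        bisect.insort(self._ts.setdefault(key, []), timestamp)
        self._val[(key, timestamp)] = value

    def read(self, key: str, timestamp: int):
        """Return the newest version whose timestamp is <= the read timestamp."""
        ts_list = self._ts.get(key, [])
        i = bisect.bisect_right(ts_list, timestamp)
        return self._val[(key, ts_list[i - 1])] if i else None
```

Multi-versioning is what later lets replicas serve lock-free reads at a chosen timestamp (see "Serving Reads at a Timestamp" below).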

Paxos State Machine. The Paxos state machines implement a consistently replicated bag of mappings. The key-value mapping state of each replica is stored in its corresponding tablet. Writes must initiate the Paxos protocol at the leader. The set of replicas is collectively a Paxos group, and each replica can be located in a different datacenter.

Spanner's Features. As a globally-distributed database, Spanner provides several interesting features. Applications can specify constraints to control which datacenters contain which data (a hypothetical sketch of such a policy follows below):
How far data is from its users (to control read latency).
How far replicas are from each other (to control write latency).
How many replicas are maintained (to control durability, availability, and read performance).
In addition, Spanner has two features that are difficult to implement in a distributed database:
Externally-consistent reads and writes.
Globally-consistent reads across the database at a timestamp.
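As an illustration of those placement constraints only: the paper says applications can control these dimensions, but the field names below are invented, not Spanner's actual configuration syntax.

```python
# Hypothetical placement policy for one database; every field name here
# is invented for illustration, not taken from Spanner.
placement_policy = {
    "num_replicas": 5,                # durability / availability / read capacity
    "leader_region": "us-east",       # keep Paxos leaders near writers (write latency)
    "replica_regions":                # keep replicas near users (read latency)
        ["us-east", "us-west", "eu-west"],
}
```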

Spanner's Features. Why are externally-consistent reads and writes, and globally-consistent reads across the database at a timestamp, difficult to implement in a distributed database? Because we don't have a global "wall clock".

So, what can we do? Global "wall-clock" time == external consistency: commit order respects global wall-time order. So we transform the problem into two requirements: timestamp order respects global wall-time order, and timestamp order == commit order.
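Stated precisely, this is the paper's external-consistency invariant: if transaction T1 commits, in absolute time, before transaction T2 starts, then T1's commit timestamp must be smaller than T2's.

```latex
% External-consistency invariant: commit timestamps respect absolute time.
t_{abs}(e_1^{commit}) < t_{abs}(e_2^{start}) \;\Rightarrow\; s_1 < s_2
```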

Assigning timestamps to RW transactions. Transactions that write use two-phase locking (2PL). Each transaction T is assigned a timestamp s, and data written by T is timestamped with s. The timestamp is assigned while locks are held: acquire locks, pick s = now(), release locks.

Timestamp Invariants. Timestamp order == commit order. Timestamp order respects global wall-time order. (The slide illustrates this with a timeline of two transactions, T3 and T4.)

TrueTime API The key enabler of these properties (previous slide) is a new TrueTime API and its implementation. The API exposes clock uncertainty, and the guarantees on Spanner’s timestamps depend on the bounds that the implementation provides. The implementation keeps uncertainty small (generally less than 10ms) by using multiple modern clock references (GPS and atomic clocks).

TrueTime. "Global wall-clock time" with bounded uncertainty: TT.now() returns an interval [earliest, latest] whose width is 2ε.
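A minimal Python sketch of this interface follows. TT.now(), TT.after(t), and TT.before(t) are the methods the paper describes, but the fixed ε and the local clock below are stand-ins for the real GPS/atomic-clock machinery.

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float
    latest: float

class TrueTime:
    """Sketch of the TrueTime API; the real implementation derives epsilon
    from GPS and atomic-clock references, here it is a fixed stand-in."""

    def __init__(self, epsilon: float = 0.007):  # ~7 ms uncertainty, illustrative
        self.epsilon = epsilon

    def now(self) -> TTInterval:
        t = time.time()  # stand-in: the real system uses vetted clock references
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        """True once t has definitely passed."""
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        """True while t has definitely not yet arrived."""
        return self.now().latest < t
```

The guarantee is that the true absolute time always lies inside the returned interval: earliest <= t_abs <= latest.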

Timestamps and TrueTime. While holding locks, pick s = TT.now().latest. Then commit wait: wait until TT.now().earliest > s before releasing the locks. The wait is about 2ε on average (roughly ε from picking the latest endpoint, plus ε until the earliest endpoint passes s).
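A sketch of commit wait built on the TrueTime sketch above; locking and Paxos replication are deliberately omitted, and apply_write is a hypothetical callback.

```python
import time

def commit_write(tt: "TrueTime", apply_write) -> float:
    """Sketch of commit wait. Uses the TrueTime sketch from the previous
    slide; locks and Paxos are omitted."""
    s = tt.now().latest        # assign the commit timestamp under locks
    while not tt.after(s):     # commit wait: roughly 2*epsilon on average
        time.sleep(0.0005)
    apply_write(s)             # only now release locks / make the write visible
    return s
```

Because no one can observe the write until s is guaranteed to be in the past, any transaction that starts afterwards must receive a larger timestamp, which is exactly the invariant above.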

Operations. Spanner supports:
Read-write transactions.
Read-only transactions.
Snapshot reads.
A read-only transaction must be pre-declared as not having any writes. Reads in a read-only transaction execute at a system-chosen timestamp without locking, so incoming writes are not blocked. A snapshot read is a read in the past that executes without locking; the client can either specify a timestamp or provide an upper bound.

Reads within read-write transactions. Writes that occur in a transaction are buffered at the client until commit; as a result, reads in the transaction do not see their effects. The client issues reads to the leader replica of the appropriate group, which acquires read locks and then reads the most recent data. While a client transaction remains open, it sends keep-alive messages. When the client has completed all reads and buffered all writes, the write protocol begins (a client-side sketch follows below).
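A client-side sketch of this behavior; the leader object and its read_locked/commit methods are hypothetical stand-ins for the real RPCs.

```python
class ReadWriteTxn:
    """Client-side sketch of a read-write transaction; the leader object
    and its methods (read_locked, commit) are hypothetical stand-ins."""

    def __init__(self, leader):
        self.leader = leader
        self.write_buffer = {}  # writes held at the client until commit

    def write(self, key, value):
        # Buffered locally: this transaction's own reads will NOT see it.
        self.write_buffer[key] = value

    def read(self, key):
        # Sent to the leader replica, which takes read locks and returns
        # the most recent data.
        return self.leader.read_locked(key)

    def commit(self):
        # All reads done and writes buffered: kick off the write protocol.
        return self.leader.commit(self.write_buffer)
```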

RW transactions involving one Paxos group. Timeline for transaction T: acquire locks; pick s; start consensus; achieve consensus; commit wait done; notify slaves; release locks.

RW transactions involving more than one Paxos group – the 2PC protocol. Each participant leader (TP1, TP2) acquires locks, computes a prepare timestamp s for its group, and sends it to the coordinator (TC) once prepared. The coordinator acquires locks, computes the overall s from the participants' timestamps, logs the commit record (start logging / done logging), completes commit wait, marks the transaction committed, and notifies the participants of s; all groups then release their locks.
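The coordinator's timestamp rule can be sketched as follows, reusing the TrueTime sketch from earlier; logging and messaging are omitted. The overall s must be at least every participant's prepare timestamp and at least the coordinator's own TT.now().latest.

```python
import time

def coordinator_commit_ts(tt: "TrueTime", prepare_timestamps) -> float:
    """Sketch of the coordinator's 2PC timestamp rule. prepare_timestamps
    holds the prepare timestamps reported by the participant leaders;
    commit logging and participant messaging are omitted."""
    s = max(max(prepare_timestamps), tt.now().latest)
    while not tt.after(s):   # commit wait before announcing the commit
        time.sleep(0.0005)
    return s                 # now notify participants of s
```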

Example. Transaction TC: "remove X from my friend list" (sC=6), with participant TP: "remove myself from X's friend list" (sP=8); the overall commit timestamp is s=8. A later transaction T2 writes a risky post P at s=15.

Time          <8      8      15
My friends    [X]     []     []
My posts                     [P]
X's friends   [me]    []     []

Serving Reads at a Timestamp. Each replica maintains t_safe; a replica can satisfy a read at timestamp t if t <= t_safe.
t_safe = min(t_safe^Paxos, t_safe^TM).
t_safe^Paxos is the timestamp of the highest-applied Paxos write.
t_safe^TM is much harder: it is ∞ if there are no pending 2PC transactions, and otherwise the minimum of the prepare timestamps s^prepare_{i,g} over the transactions i prepared in group g.
Thus t_safe is the maximum timestamp at which reads are safe.
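As a sketch, the slide's formula in Python (names illustrative; the paper's exact definition differs in small details):

```python
def t_safe(t_safe_paxos: float, prepared_ts) -> float:
    """Sketch of the slide's formula. A replica may serve a read at t iff
    t <= t_safe(...). t_safe_paxos is the timestamp of the highest-applied
    Paxos write; prepared_ts holds the prepare timestamps of 2PC
    transactions prepared (but not yet committed) in this group."""
    t_safe_tm = min(prepared_ts) if prepared_ts else float("inf")
    return min(t_safe_paxos, t_safe_tm)
```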

Read-Only transactions. These execute in two phases: assign a timestamp s_read, then perform the reads as snapshot reads at s_read. The snapshot reads can execute at any replicas that are sufficiently up-to-date. The simple assignment s_read = TT.now().latest preserves external consistency, but such a timestamp may require the data reads at s_read to block if t_safe has not advanced sufficiently. To reduce the chances of blocking, Spanner should assign the oldest timestamp that preserves external consistency.

Read-Only transactions. Assigning a timestamp requires a negotiation phase between all of the Paxos groups involved in the read, so Spanner requires a scope expression that summarizes the keys that will be read.
If the scope's values are served by a single Paxos group: the client issues the read-only transaction to the group leader, the leader assigns s_read = LastTS() (the timestamp of the last committed write at that Paxos group), and the read executes at any up-to-date replica.
If the scope's values are served by multiple Paxos groups: s_read = TT.now().latest (which may wait for safe time to advance). A sketch follows below.
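A sketch of this decision, again reusing the TrueTime sketch; last_committed_ts below is a hypothetical stand-in for the paper's LastTS().

```python
def assign_s_read(tt: "TrueTime", groups) -> float:
    """Sketch of read-only timestamp assignment based on the scope.
    groups: the Paxos groups serving the keys named by the scope;
    last_committed_ts is a hypothetical stand-in for LastTS()."""
    if len(groups) == 1:
        # One group: its leader can serve at the last committed write's
        # timestamp, avoiding any wait.
        return groups[0].last_committed_ts()
    # Several groups: skip the negotiation round and take TT.now().latest,
    # which may force the reads to wait for t_safe to advance.
    return tt.now().latest
```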

Benchmarks. 50 Paxos groups, 2500 buckets, 4 KB reads or writes, datacenters 1 ms apart. Latency remains mostly constant as the number of replicas increases because Paxos executes in parallel at a group's replicas.

Benchmarks. All leaders explicitly placed in zone Z1.
Red line – killing a non-leader: no effect on read throughput.
Green line – killing leaders softly ("leader-soft"): leaders are given time to hand off leadership.
Blue line – killing leaders hard ("leader-hard"): no warning for the leaders.

Questions? Thanks!