1 SPANNER: GOOGLE'S GLOBALLY-DISTRIBUTED DATABASE James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, Dale Woodford. Google, Inc.

2 OUTLINE: Introduction, Implementation, Data Model, TrueTime, Concurrency Control, Conclusion

3 INTRODUCTION Spanner is the first multi-version, globally distributed, synchronously replicated database. It offers a Bigtable-like versioned key-value store together with Megastore-like schematized semi-relational tables. Interesting features: data replication can be configured at a fine grain by applications; reads and writes are externally consistent; reads across the database at a timestamp are globally consistent; and transactions are assigned globally meaningful commit timestamps even though they are distributed.

4 IMPLEMENTATION

5 A Spanner deployment is called a universe. Spanner is organized as a set of zones, which are the unit of administrative deployment; the set of zones is also the set of locations across which data can be replicated. Each zone has a zonemaster and between one hundred and several thousand spanservers: the zonemaster assigns data to spanservers, and spanservers serve data to clients. The universe master displays status information about all the zones (for debugging and more). The placement driver handles automated movement of data across zones by communicating with spanservers.
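A minimal sketch of this deployment hierarchy, using hypothetical Python dataclasses; the names (Universe, Zone, Spanserver) mirror the slide's vocabulary, not Spanner's actual internal types:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Spanserver:
    """Serves data (tablets) to clients."""
    server_id: str
    tablets: List[str] = field(default_factory=list)

@dataclass
class Zone:
    """Unit of administrative deployment and of data replication."""
    name: str
    zonemaster: str                                 # assigns data to spanservers
    spanservers: List[Spanserver] = field(default_factory=list)

@dataclass
class Universe:
    """A whole Spanner deployment."""
    name: str
    zones: Dict[str, Zone] = field(default_factory=dict)
    universe_master: str = "status-console"         # displays zone status for debugging
    placement_driver: str = "placement-driver"      # moves data across zones
```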

6 STATE MACHINE REPLICATION AND PAXOS State machine approach: 1. Place copies of the state machine on multiple, independent servers. 2. Receive client requests, interpreted as inputs to the state machine. 3. Choose an ordering for the inputs. 4. Execute inputs in the chosen order on each server. 5. Respond to clients with the output from the state machine. 6. Monitor replicas for differences in state or output.
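A compact sketch of the state-machine approach, assuming a toy key-value state machine and an already-agreed input order (the consensus machinery that produces the order is elided):

```python
class KVStateMachine:
    """Toy deterministic state machine: each input is a (key, value) write."""
    def __init__(self):
        self.state = {}

    def apply(self, inp):
        key, value = inp
        self.state[key] = value
        return ("ok", key, value)              # output returned to the client

def run_replicas(ordered_inputs, num_replicas=3):
    """Execute the same inputs, in the same order, on every replica."""
    replicas = [KVStateMachine() for _ in range(num_replicas)]
    outputs = []
    for inp in ordered_inputs:                 # step 4: execute in the chosen order
        outs = [replica.apply(inp) for replica in replicas]
        assert len(set(outs)) == 1             # step 6: replicas must not diverge
        outputs.append(outs[0])                # step 5: respond with the output
    return replicas, outputs
```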

7 An order of inputs may be defined using a voting protocol. The problem of voting for a single value by a group of independent entities is called consensus. Inputs may be ordered by their position in the series of consensus instances (consensus order). Paxos [6] is a protocol for solving consensus and may be used as the protocol for implementing consensus order. Paxos requires a single leader to ensure liveness: one of the replicas must remain leader long enough to achieve consensus on the next operation of the state machine, that is, long enough to move the system forward.
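A sketch of consensus order, assuming some consensus library has already decided at most one input per instance; inputs are applied strictly in instance order and never past a gap:

```python
def apply_in_consensus_order(decided, state_machine):
    """decided: dict mapping consensus instance number -> agreed input.

    Apply inputs in instance order, stopping at the first undecided
    instance so that no input runs before every earlier one is decided.
    """
    instance = 0
    while instance in decided:
        state_machine.apply(decided[instance])
        instance += 1
    return instance    # next instance still awaiting consensus
```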

8 SPANSERVER SOFTWARE STACK

9 Each tablet implements a bag of mappings similar to Bigtable's: (key:string, timestamp:int64) -> string. A tablet's state is stored in a set of B-tree-like files on Colossus. To support replication, each spanserver implements a single Paxos state machine on top of each tablet; the Paxos state machines are used to implement a consistently replicated bag of mappings. Writes must initiate the Paxos protocol at the leader, while reads access state directly from the underlying tablet at any replica that is sufficiently up-to-date. The set of replicas is collectively a Paxos group.
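A sketch of the tablet's bag of mappings as a multi-versioned in-memory map, with a hypothetical read that returns the newest version at or before a given timestamp (the on-disk B-tree files and the Paxos replication are elided):

```python
import bisect
from collections import defaultdict

class Tablet:
    """(key: str, timestamp: int) -> value: str, with multiple versions per key."""
    def __init__(self):
        self.versions = defaultdict(list)   # key -> list of (timestamp, value), kept sorted

    def write(self, key, timestamp, value):
        bisect.insort(self.versions[key], (timestamp, value))

    def read(self, key, timestamp):
        """Return the value of the newest version whose timestamp is <= timestamp."""
        best = None
        for ts, value in self.versions[key]:
            if ts <= timestamp:
                best = value
            else:
                break
        return best
```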

10 At each replica that is a leader, the spanserver implements a lock table for concurrency control; long-lived Paxos leaders are critical to supporting concurrency control efficiently. At each leader replica, the spanserver also implements a transaction manager to support distributed transactions. The replica whose transaction manager is used is the participant leader; the other replicas of the group are participant slaves. If a transaction involves more than one Paxos group, the group leaders must coordinate to perform two-phase commit: one participant group is chosen as the coordinator, its participant leader is the coordinator leader, and its slaves are the coordinator slaves.
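A sketch of a per-leader lock table for two-phase locking, assuming simple whole-key locks and transaction ids (the real lock table maps ranges of keys to lock states):

```python
class LockTable:
    """Maps keys to lock state; only used at the Paxos leader replica."""
    def __init__(self):
        self.locks = {}   # key -> (mode, set of transaction ids)

    def acquire(self, txn_id, key, mode):
        """mode is 'shared' or 'exclusive'; returns True if granted, False if the caller must wait."""
        held = self.locks.get(key)
        if held is None:
            self.locks[key] = (mode, {txn_id})
            return True
        held_mode, holders = held
        if mode == "shared" and held_mode == "shared":
            holders.add(txn_id)
            return True
        # Re-acquiring a lock this transaction already holds in the same mode is allowed.
        return txn_id in holders and held_mode == mode

    def release_all(self, txn_id):
        """Called at commit or abort: the shrinking phase of two-phase locking."""
        for key in list(self.locks):
            mode, holders = self.locks[key]
            holders.discard(txn_id)
            if not holders:
                del self.locks[key]
```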

11 DIRECTORIES AND PLACEMENT

12 A directory is a bucketing abstraction implemented on top of the key-value mappings; it allows applications to control the locality of their data. A directory is the unit of data placement: all data in a directory has the same replication configuration. Movedir is the background task used to move directories between Paxos groups; when data has to be transferred from one Paxos group to another, it is moved directory by directory.
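A sketch of directories as the unit of placement, with a hypothetical movedir that relocates a whole directory from one Paxos group to another (the incremental background copy the real movedir performs is elided):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Directory:
    prefix: str                  # all keys in the directory share this prefix
    replication: Dict[str, int]  # e.g. {"us-east": 2, "eu-west": 3}; same config for the whole directory

@dataclass
class PaxosGroup:
    group_id: str
    directories: List[Directory] = field(default_factory=list)

def movedir(directory: Directory, src: PaxosGroup, dst: PaxosGroup):
    """Background task: move an entire directory between Paxos groups."""
    src.directories.remove(directory)
    dst.directories.append(directory)
```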

13 DATA MODEL Spanner supports schematized semi-relational tables. The data model is layered on top of the directory-bucketed key-value mappings. An application creates one or more databases in a universe, and each database can contain an unlimited number of schematized tables. Spanner's data model is not purely relational, in that rows must have names (every table requires a set of primary-key columns). Clients must partition each database into one or more hierarchies of tables, with child tables interleaved under their parents.
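A sketch of how the hierarchy maps onto the key-value layer: child rows are keyed by their parent's primary key plus their own, so a parent row and its descendants sort together and can fall into the same directory. The table and column names below are illustrative only.

```python
# Rows stored under composite keys; sorting the keys places each user's
# albums immediately after the user row, which is what interleaving exploits.
rows = {
    ("Users", 1):              {"email": "a@example.com"},
    ("Users", 1, "Albums", 1): {"name": "holiday"},
    ("Users", 1, "Albums", 2): {"name": "work"},
    ("Users", 2):              {"email": "b@example.com"},
    ("Users", 2, "Albums", 1): {"name": "pets"},
}

def directory_of(key):
    """All rows rooted at one Users row form one directory (prefix = (table, uid))."""
    return key[:2]

for key in sorted(rows):
    print(directory_of(key), key)
```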

14

15 TRUETIME Notation: TTinterval - a time interval with bounded time uncertainty, [earliest, latest]. TTstamp - the type of the endpoints of a TTinterval. TT.now() - returns a TTinterval. TT.after(t) - true if t has definitely passed. TT.before(t) - true if t has definitely not arrived. t_abs(e) - the absolute time of an event e.
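A minimal Python sketch of the TrueTime API shape, assuming a known worst-case clock uncertainty EPSILON (in real deployments the bound comes from GPS and atomic-clock time masters, not a constant):

```python
import time
from dataclasses import dataclass

EPSILON = 0.007   # assumed uncertainty bound in seconds; a few milliseconds in practice

@dataclass
class TTInterval:
    earliest: float
    latest: float

class TrueTime:
    @staticmethod
    def now() -> TTInterval:
        t = time.time()                          # local clock reading
        return TTInterval(t - EPSILON, t + EPSILON)

    @staticmethod
    def after(t: float) -> bool:
        """True if absolute time t has definitely passed."""
        return t < TrueTime.now().earliest

    @staticmethod
    def before(t: float) -> bool:
        """True if absolute time t has definitely not arrived."""
        return t > TrueTime.now().latest
```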

16 WHAT GOOD DOES TRUETIME DO? Guarantee: for an invocation tt = TT.now(), tt.earliest <= t_abs(e_now) <= tt.latest, where e_now is the invocation event; in other words, the absolute time at which TT.now() was called is guaranteed to lie within the returned interval. This is what lets Spanner guarantee its correctness properties around concurrency control and implement features such as: I. externally consistent transactions, II. lock-free read-only transactions, III. non-blocking reads in the past. These features guarantee that an audit read at a timestamp t will see the effects of every transaction committed as of t.

17 CONCURRENCY CONTROL Timestamp management: Spanner supports read-write transactions, read-only transactions, and snapshot reads. Read-only transactions must be pre-declared as having no writes; reads in a read-only transaction execute at a system-chosen timestamp without locking, so that incoming writes are not blocked, and can execute on any replica that is sufficiently up-to-date. Snapshot reads: reads in the past that execute without locking; the client can either specify the timestamp or let Spanner choose one given an upper bound, and execution takes place on any replica that is sufficiently up-to-date.
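A sketch of a lock-free snapshot read against the Tablet sketch above, where the client either supplies a timestamp or lets the system pick one below an upper bound (the helper name and the timestamp negotiation are simplifications):

```python
def snapshot_read(tablet, key, timestamp=None, upper_bound=None):
    """Read at a fixed timestamp without taking any locks.

    If the client gives no timestamp, pick one no greater than the supplied
    upper bound (a stand-in for Spanner's own choice of read timestamp).
    """
    if timestamp is None:
        assert upper_bound is not None
        timestamp = upper_bound
    return tablet.read(key, timestamp)   # multi-versioned read; writes are never blocked
```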

18 For both read-only transactions and snapshot reads, commit is inevitable once a timestamp has been chosen. When a server fails, clients can internally continue the query on a different server by repeating the timestamp and the current read position.

19 DETAILS Spanner requires a scope expression for every read-only transaction: an expression that summarizes the keys that will be read by the entire transaction. If the scope's values are served by a single Paxos group, the client issues the read-only transaction to that group's leader; the leader assigns s_read and executes the read. Defining LastTS() as the timestamp of the last committed write at a Paxos group, the assignment s_read = LastTS() trivially satisfies external consistency. If the scope's values are served by multiple Paxos groups, Spanner simply has its reads execute at s_read = TT.now().latest.
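A sketch of this timestamp choice, assuming each group object exposes a hypothetical last_ts() accessor for LastTS() and a TrueTime object like the sketch above:

```python
def choose_read_timestamp(scope_groups, true_time):
    """scope_groups: the Paxos groups that serve the keys named by the scope expression."""
    if len(scope_groups) == 1:
        group = scope_groups[0]
        return group.last_ts()            # s_read = LastTS() at that single group
    # Multiple groups: skip a round of negotiation and read at now().latest.
    return true_time.now().latest
```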

20 Assigning timestamps to read-write transactions: Spanner uses two-phase locking, so timestamps can be assigned at any time after all locks have been acquired and before any lock has been released. For a given transaction, Spanner assigns it the timestamp that Paxos assigns to the Paxos write that represents the transaction commit. This depends on the monotonicity invariant: Spanner assigns timestamps to Paxos writes in monotonically increasing order, even across leaders. A single leader replica can trivially assign timestamps in monotonically increasing order; across leaders, the invariant is enforced by requiring that a leader only assign timestamps within the interval of its leader lease.
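A sketch of a leader assigning monotonically increasing timestamps within its lease, with hypothetical lease bounds and a toy tie-breaking increment:

```python
class LeaderTimestampOracle:
    """Assigns commit timestamps for Paxos writes at one leader replica."""
    def __init__(self, true_time, lease_start, lease_end):
        self.tt = true_time
        self.lease_end = lease_end
        self.last_assigned = lease_start

    def assign(self):
        # No less than TT.now().latest, and strictly greater than anything assigned before.
        s = max(self.tt.now().latest, self.last_assigned + 1e-6)
        assert s < self.lease_end, "timestamps may only be assigned within the leader lease"
        self.last_assigned = s
        return s
```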

21 Enforcing the external-consistency invariant: if the start of a transaction T2 occurs after the commit of a transaction T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1, i.e., t_abs(e1_commit) < t_abs(e2_start) implies s1 < s2. The protocol for executing transactions and assigning timestamps obeys two rules. Start: the coordinator leader assigns transaction Ti a commit timestamp si no less than the value of TT.now().latest. Commit wait: the coordinator leader ensures that clients cannot see any data committed by Ti until TT.after(si) is true.
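A sketch of the two rules together, reusing the TrueTime and oracle sketches above; commit_wait is shown as a busy-wait only for brevity, and release_locks_and_apply is a hypothetical placeholder:

```python
import time

def commit(txn, oracle, true_time):
    """Run by the coordinator leader after all of txn's locks are held."""
    s = oracle.assign()                    # Start rule: s >= TT.now().latest
    # Commit wait: no client may see txn's data until s is guaranteed to be in the past.
    while not true_time.after(s):
        time.sleep(0.001)
    release_locks_and_apply(txn, s)        # make writes visible at timestamp s, then release locks

def release_locks_and_apply(txn, s):
    ...                                    # placeholder for the Paxos apply and lock release
```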

22 DETAILS Reads within read-write transactions use wound-wait to avoid deadlocks.
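A sketch of the wound-wait rule, assuming each transaction carries the timestamp at which it started (older means a smaller timestamp):

```python
def on_lock_conflict(requester_ts, holder_ts):
    """Wound-wait decision when a lock request conflicts with a held lock.

    Returns 'wound' if the holder should be aborted (the requester is older),
    or 'wait' if the requester should wait (the requester is younger); waiting
    only on older transactions keeps the wait-for graph acyclic.
    """
    return "wound" if requester_ts < holder_ts else "wait"
```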

23 CONCLUSION: distributed transaction consistency; synchronous, strongly consistent replication across data centers; closes the gap between Bigtable and Megastore while preserving their positives.

24 THANK YOU!

