Spanner: the basics Jeff Chase CPS 512 Fall 2015.


“Globally distributed”: zones The placement driver handles automated movement of data across zones on the timescale of minutes. The placement driver periodically communicates with the spanservers to find data that needs to be moved, either to meet updated replication constraints or to balance load.

Spanner: the software stack Groups store many tablets, and use Paxos to replicate. TM needed only for multi-tablet transactions across groups. Lock table needed for 2PC.

Spanner: balancing load
Zones are units of “administrative deployment” and “physical isolation”. Tablet data is stored on replica groups spanning zones. Spanner offers primitives to control replication and migrate tablet data among tablets and groups, in order to:
– balance load
– improve locality (bring related items into the same tablet)
Why is load balancing needed?
– With K/V stores we presumed that randomly distributed keys lead to balanced load.
– Why is Spanner different?

Spanner: balancing load Directories (and fragments) are moved among groups (i.e., among tablets). A directory is “a set of contiguous keys that share a common prefix”.
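To make the definition of a directory concrete, here is a minimal sketch that groups a sorted key range by shared prefix. The key format (slash-separated path components) and the `prefix_len` parameter are assumptions made for illustration only; they are not Spanner's actual key encoding.

```python
from itertools import groupby


def directories(sorted_keys, prefix_len=2):
    """Group contiguous keys that share their first `prefix_len` components.

    Hypothetical key format: slash-separated strings such as "users/1/albums/2".
    """
    def prefix(key):
        return "/".join(key.split("/")[:prefix_len])

    # groupby only merges adjacent items, which matches "contiguous keys".
    return {p: list(group) for p, group in groupby(sorted_keys, key=prefix)}


keys = sorted([
    "users/1", "users/1/albums/1", "users/1/albums/2",
    "users/2", "users/2/albums/1",
])
dirs = directories(keys)
```

In this toy model each value in `dirs` is one unit that could be moved between groups as a whole.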


Spanner
Ignore for now that tablet storage is Paxos-replicated. Logically, each tablet store is a “participant” in 2PC. Transactions with 2PC/2PL. Is that it?
Key goal: lock-free read-only transactions.
Why lock-free read transactions?
– 2PL is too expensive for long reads.
– Why?
– Why do we need long reads?
Solution: multi-version storage with version timestamps.
– Drawbacks?

Multiversion Schemes (© Silberschatz, Korth and Sudarshan, Database System Concepts, 3rd Edition)
Multiversion schemes keep old versions of a data item to increase concurrency:
– Multiversion Timestamp Ordering
– Multiversion Two-Phase Locking
Each successful write results in the creation of a new version of the data item written. Timestamps label the versions. When a read(Q) operation is issued, the scheme selects an appropriate version of Q based on the timestamp of the transaction and returns the value of the selected version. Reads never have to wait, because an appropriate version is returned immediately.

Multiversion Timestamp Ordering
Each data item Q has a sequence of versions. Each version Q_k contains three data fields:
– Content: the value of version Q_k.
– W-timestamp(Q_k): the timestamp of the transaction that created (wrote) version Q_k.
– R-timestamp(Q_k): the largest timestamp of any transaction that successfully read version Q_k.
When a transaction T_i creates a new version Q_k of Q, Q_k's W-timestamp and R-timestamp are initialized to TS(T_i). The R-timestamp of Q_k is updated whenever a transaction T_j reads Q_k and TS(T_j) > R-timestamp(Q_k).

Multiversion Timestamp Ordering (Cont.)
The multiversion timestamp scheme presented next ensures serializability. Suppose that transaction T_i issues a read(Q) or write(Q) operation. Let Q_k denote the version of Q whose write timestamp is the largest write timestamp less than or equal to TS(T_i).
1. If transaction T_i issues a read(Q), then the value returned is the content of version Q_k.
2. If transaction T_i issues a write(Q) and TS(T_i) < R-timestamp(Q_k), then T_i is rolled back. Otherwise, if TS(T_i) = W-timestamp(Q_k), the contents of Q_k are overwritten; otherwise a new version of Q is created.
Reads always succeed. A write by T_i is rejected if some other transaction T_j that (in the serialization order defined by the timestamp values) should read T_i's write has already read a version created by a transaction older than T_i.
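The read and write rules of multiversion timestamp ordering can be sketched in a few lines. This is an illustrative single-item model, not code from the textbook: the names `Version`, `Item` and `Rollback`, integer timestamps, and the initial version at timestamp 0 are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Version:
    content: object
    w_ts: int  # W-timestamp: timestamp of the transaction that wrote it
    r_ts: int  # R-timestamp: largest timestamp of any transaction that read it


class Rollback(Exception):
    """Raised when a write arrives too late and the writer must roll back."""


@dataclass
class Item:
    # Assumption: every item starts with an initial version at timestamp 0.
    versions: list = field(default_factory=lambda: [Version(None, 0, 0)])

    def _select(self, ts):
        # Q_k: the version with the largest W-timestamp <= ts.
        return max((v for v in self.versions if v.w_ts <= ts),
                   key=lambda v: v.w_ts)

    def read(self, ts):
        qk = self._select(ts)
        qk.r_ts = max(qk.r_ts, ts)  # reads always succeed
        return qk.content

    def write(self, ts, value):
        qk = self._select(ts)
        if ts < qk.r_ts:
            # A younger transaction already read Q_k and should have seen this write.
            raise Rollback(f"write at {ts} conflicts with read at {qk.r_ts}")
        if ts == qk.w_ts:
            qk.content = value  # same transaction overwrites its own version
        else:
            self.versions.append(Version(value, ts, ts))


q = Item()
q.write(5, "a")
value_seen_by_10 = q.read(10)   # reads the version written at 5
try:
    q.write(7, "b")             # too late: transaction 10 already read version 5
    late_write_rejected = False
except Rollback:
    late_write_rejected = True
```

The rejected write at timestamp 7 is the case described in the last sentence of the slide: transaction 10 should have read it but has already read an older version.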

Multiversion Two-Phase Locking
Differentiates between read-only transactions and update transactions.
Update transactions acquire read and write locks, and hold all locks up to the end of the transaction. That is, update transactions follow rigorous two-phase locking.
– Each successful write results in the creation of a new version of the data item written.
– Each version of a data item has a single timestamp whose value is obtained from a counter, ts-counter, that is incremented during commit processing.
Read-only transactions are assigned a timestamp by reading the current value of ts-counter before they start execution; they follow the multiversion timestamp-ordering protocol for performing reads.

Multiversion Two-Phase Locking (Cont.)
When an update transaction wants to read a data item, it obtains a shared lock on it and reads the latest version. When it wants to write an item, it obtains an exclusive (X) lock on it; it then creates a new version of the item and sets this version's timestamp to ∞.
When update transaction T_i completes, commit processing occurs:
– T_i sets the timestamp on the versions it has created (the value comes from ts-counter, which is incremented during commit processing).
Read-only transactions that start after T_i commits will see the values updated by T_i. Read-only transactions timestamped before T_i commits will see a value before the updates by T_i.
Only serializable schedules are produced.

Deadlock Handling
Consider the following two transactions:
T1: write(X); write(Y)
T2: write(Y); write(X)
Schedule with deadlock:
T1                        T2
lock-X on X
write(X)
                          lock-X on Y
                          write(Y)
                          wait for lock-X on X
wait for lock-X on Y

Deadlock Handling
A system is deadlocked if there is a set of transactions such that every transaction in the set is waiting for another transaction in the set. Deadlock prevention protocols ensure that the system never enters a deadlock state. Some prevention strategies:
– Require that each transaction lock all its data items before it begins execution (predeclaration).
– Impose a partial ordering on all data items and require that a transaction lock data items only in the order specified by the partial order (graph-based protocol).
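The definition of a deadlocked set can be checked mechanically. The sketch below illustrates the definition only (it is detection, not one of the prevention protocols listed above) and assumes a simplified model in which each blocked transaction waits for exactly one other transaction; `find_deadlock` and the `waits_for` map are hypothetical names.

```python
def find_deadlock(waits_for):
    """Return a list of transactions that wait on each other in a cycle, or None.

    waits_for maps a blocked transaction to the single transaction it waits for.
    """
    for start in waits_for:
        seen = []
        t = start
        # Follow the chain of waits until it ends or revisits a transaction.
        while t in waits_for and t not in seen:
            seen.append(t)
            t = waits_for[t]
        if t in seen:
            # Everything from the first visit of t onward forms the cycle.
            return seen[seen.index(t):]
    return None


# T1 waits for T2's lock on Y while T2 waits for T1's lock on X.
cycle = find_deadlock({"T1": "T2", "T2": "T1"})
no_cycle = find_deadlock({"T1": "T2"})
```

Every transaction in the returned cycle is waiting for another member of the same set, which is exactly the condition in the definition.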

More Deadlock Prevention Strategies
The following schemes use transaction timestamps for the sake of deadlock prevention alone.
Wait-die scheme (non-preemptive):
– An older transaction may wait for a younger one to release a data item. Younger transactions never wait for older ones; they are rolled back instead.
– A transaction may die several times before acquiring the needed data item.
Wound-wait scheme (preemptive):
– An older transaction wounds (forces the rollback of) a younger transaction instead of waiting for it. Younger transactions may wait for older ones.
– There may be fewer rollbacks than in the wait-die scheme.
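The two schemes differ only in what happens when a requester finds a lock held by another transaction. A minimal sketch of the two decision rules, assuming that a smaller timestamp means an older transaction; the function names and the returned strings are illustrative.

```python
def wait_die(requester_ts, holder_ts):
    """Non-preemptive: an older requester waits; a younger requester dies."""
    if requester_ts < holder_ts:
        return "requester waits"
    return "requester rolls back"


def wound_wait(requester_ts, holder_ts):
    """Preemptive: an older requester wounds the holder; a younger one waits."""
    if requester_ts < holder_ts:
        return "holder rolls back"
    return "requester waits"


# Transaction 1 is older than transaction 2.
older_requests = (wait_die(1, 2), wound_wait(1, 2))
younger_requests = (wait_die(2, 1), wound_wait(2, 1))
```

In both schemes the older transaction is never the one rolled back, which is why restarting a victim with its original timestamp avoids starvation.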

Deadlock Prevention (Cont.)
In both the wait-die and wound-wait schemes, a rolled-back transaction is restarted with its original timestamp. Older transactions thus have precedence over newer ones, and starvation is hence avoided.
Timeout-based schemes:
– A transaction waits for a lock only for a specified amount of time. After that, the wait times out and the transaction is rolled back.
– Thus deadlocks are not possible.
– Simple to implement, but starvation is possible. It is also difficult to determine a good value for the timeout interval.

Spanner uses wound-wait for deadlock prevention

4 Choosing Timestamps
We now describe how Spanner assigns timestamps to R/W transactions,…, so as to avoid central control…[except by 2PC coordinator for each T]. The following two informal rules give insight on the selection of timestamps for [update or R/W] transactions in Spanner.
Rule 1: The timestamp for T is a real time after all the reads have returned and before the transaction releases any locks.
Rule 2: Each participant contributes a lower bound on the transaction timestamp T: The lower bound at each participant is greater than any timestamp it has written in the past locally.

2PC in Spanner (Rule 2)
[Diagram: 2PC message flow between the coordinator (TM/C) and the participants (RM/P): precommit or prepare, vote, decide, notify.]
– TM/C asks each RM/P: “commit or abort T?”
– Each RM/P votes: “commit with timestamp > s_i”.
– RMs pick a timestamp s for each update transaction T at prepare time, when T holds all of its locks. Everyone selects an s that exceeds the possible commit stamp of any previous T known.
– TM/C decides: “Commit with timestamp s > MAX(s_i), but wait until s is definitely past.”
– After the decision, RMs release locks, stamp writes, and expose writes.
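The timestamp choice in the 2PC exchange can be sketched as follows. This is a toy model under stated assumptions: timestamps are integers, `Participant` and `choose_commit_timestamp` are hypothetical names, and `now_latest` stands in for a clock reading. The wait “until s is definitely past” is left out here.

```python
class Participant:
    """Toy 2PC participant (RM/P) tracking the largest timestamp it has written."""

    def __init__(self, last_written=0):
        self.last_written = last_written

    def prepare(self):
        # Rule 2: the lower bound s_i exceeds any timestamp written locally.
        return self.last_written + 1

    def commit(self, s):
        # Stamp the writes with s; later transactions must propose a larger bound.
        self.last_written = s


def choose_commit_timestamp(participants, now_latest):
    """Coordinator (TM/C): pick s > max(s_i) that is also not behind the clock."""
    bounds = [p.prepare() for p in participants]
    return max(max(bounds) + 1, now_latest)


group = [Participant(last_written=10), Participant(last_written=25)]
s_behind_clock = choose_commit_timestamp(group, now_latest=20)   # bounds dominate
s_ahead_clock = choose_commit_timestamp(group, now_latest=100)   # clock dominates
```

Because each participant's bound exceeds everything it has written, the chosen s is larger than the timestamp of any earlier write that this transaction could have read or overwritten.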

Another view…

Jointly, these rules provide the following properties:
– From Rule 1, it follows that if transaction T2 starts after transaction T1 ends, then T2 must have a higher timestamp than T1.
– From Rule 2, it follows that if transaction T2 reads something that transaction T1 wrote, then T2 must have a higher timestamp than T1 (note that this can happen even if T2 starts before T1 ends).
– From Rule 2, it also follows that if transaction T2 overwrites something that transaction T1 previously wrote, then T2 must have a higher timestamp than T1.
Additionally, these rules mean that a server never has to block before replying when asked for data with a timestamp that is lower than the bound that the server proposed for any pending transaction (i.e. one that hasn’t yet committed).

Spanner: choosing timestamps Rule 1: The timestamp for T is a real time after all the reads have returned and before the transaction releases any locks. Why real time? Why can’t they be e.g. logical clocks? Choosing timestamps is easy on a single-server implementation of multi-version 2PL timestamping. But Spanner is a global system with 1000s of servers! Idea: use global wall clock time. How to keep clocks (loosely) synchronized? How to account for errors and drift? A: TrueTime.

TrueTime API

From Practical Uses of Synchronized Clocks Barbara Liskov, PODC 1991 It is worth noting that practical clock synchronization algorithms must provide efficient engineering solutions to a number of problems… e.g., how to avoid being mistaken about the time when a message containing a time value is delayed in the network. The algorithms that exist today are robust in the face of problems such as network congestion and links with widely varying delays. Network latency complicates clock synchronization from time servers. [from wikipedia]

From Practical Uses of Synchronized Clocks Barbara Liskov, PODC 1991 Clock synchronization algorithms synchronize clocks with some skew epsilon (E): They guarantee that if C1 and C2 are the clocks at two nodes of a network then at any instant the time at C1 differs from the time at C2 by no more than E. As mentioned, the synchronization property cannot be provided absolutely, but only with some very high probability. … Clock synchronization algorithms typically synchronize clocks with “real” time, i.e., at any moment a node’s clock differs from real time by no more than E/2. … At the root of such algorithms is a dependence on devices that sample universal time; such devices are attached to time servers, and the algorithm spreads the information about the current time from the servers to other nodes in the network.

Timestamp Invariants (OSDI 2012)
– Timestamp order == commit order
– Timestamp order respects global wall-time order
[Figure: timeline showing transactions T1, T2, T3 and T4.]

TrueTime
“Global wall-clock time” with bounded uncertainty.
[Figure: time axis on which TT.now() returns an interval from earliest to latest, of width 2*ε.]

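Timestamps and TrueTime
[Figure: timeline of transaction T from “Acquired locks” to “Release locks”, with s marked and intervals labeled “average ε” and “Commit wait”.]
– Pick s = TT.now().latest while holding the locks.
– Commit wait: wait until TT.now().earliest > s before releasing the locks.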
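Commit wait can be tried out against a simulated clock. The sketch below is a toy model, not the TrueTime implementation: `SimulatedTrueTime` advances a hand-driven “true time” and always reports an interval of half-width `epsilon` around it, and `commit_with_wait` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class TTInterval:
    earliest: float
    latest: float


class SimulatedTrueTime:
    """Toy clock: true time is advanced by hand; epsilon bounds the uncertainty."""

    def __init__(self, epsilon, start=0.0):
        self.epsilon = epsilon
        self.true_time = start

    def now(self):
        return TTInterval(self.true_time - self.epsilon,
                          self.true_time + self.epsilon)

    def advance(self, dt):
        self.true_time += dt


def commit_with_wait(tt, tick=1.0):
    """Pick s = TT.now().latest, then wait until TT.now().earliest > s."""
    s = tt.now().latest
    waited = 0.0
    while not tt.now().earliest > s:
        tt.advance(tick)  # stand-in for sleeping while the locks are still held
        waited += tick
    return s, waited


tt = SimulatedTrueTime(epsilon=4.0, start=100.0)
s, waited = commit_with_wait(tt)
```

In this model the wait ends only once every possible reading of the clock is past s, so any transaction that starts afterwards picks a larger timestamp.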

Commit Wait and Replication
[Figure: timeline of transaction T with the labels “Acquired locks”, “Pick s”, “Start consensus”, “Achieve consensus”, “Commit wait done”, “Notify slaves” and “Release locks”.]

5 Choosing Timestamps for Read-Only Transactions A read inside a read-only transaction…must return the latest value written in a transaction that ends before the read starts. The upper bound on TrueTime’s current clock reading suffices: any T that already ended has a lower timestamp [due to commit wait]. But: using this timestamp may cause the transaction to wait. So in the special case of a read-only query addressed to a single server, Spanner instead sends a special read request that tells the server to use the latest timestamp it has written locally to any data. If there is no conflicting transaction in progress, then that read can be answered immediately. Spanner also supports read-only transactions from arbitrary snapshot times, which are properly named snapshot-reads.
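A snapshot read needs no locks because it only selects among committed versions. Here is a minimal sketch of a multi-version cell read at a chosen timestamp; `VersionedCell`, `write` and `read_at` are hypothetical names, and integer commit timestamps are an assumption.

```python
import bisect


class VersionedCell:
    """Keeps every committed version of one value, ordered by commit timestamp."""

    def __init__(self):
        self._stamps = []  # sorted commit timestamps
        self._values = []  # values, parallel to _stamps

    def write(self, commit_ts, value):
        i = bisect.bisect_right(self._stamps, commit_ts)
        self._stamps.insert(i, commit_ts)
        self._values.insert(i, value)

    def read_at(self, ts):
        # Latest version whose commit timestamp is <= ts; None if none exists yet.
        i = bisect.bisect_right(self._stamps, ts)
        return self._values[i - 1] if i else None


cell = VersionedCell()
cell.write(10, "a")
cell.write(20, "b")
snapshot_15 = cell.read_at(15)  # sees the write at 10, not the one at 20
```

A read at a fixed timestamp returns the same answer no matter how many later writes commit, which is what makes reads from arbitrary snapshot times possible.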

Q: Why might a read-only T timestamped now.latest have to wait? How long might it have to wait?

Spanner and Paxos Long leader lease: let’s ignore Paxos “phase 1”. “Each write is stored both in transaction log and in Paxos log.” – What does that mean?

Paxos (from PMMC / paxos.systems) In Spanner Paxos groups, the leaders and acceptors are colocated with the replicas.

Phase 2+: Leading a Round
Congratulations! You were accepted to lead a round.
– Choose a “suitable value” v for this ballot.
– Command the acceptors to accept the value (2a).
– If a majority hear and obey, the round succeeds.
Did a majority respond (2b) and assent?
– Yes: tell everyone the round succeeded (3).
– No: move on, e.g., ask for another round.
[Diagram: messages between leader L and nodes N: 1a “Can I lead b?”, 1b “OK…”, 2a “v?”, 2b “OK”, 3 “v!”; repeat…]

Acceptors (Phase 2)
Commanded (2a) to accept v for current ballot?
– Accept v and log it in persistent memory.
– Discard any previously accepted value.
– Respond (2b) with accept, or deny (if ballot is old).
[Diagram: messages between leader L and nodes N: 1a “b?”, 1b, 2a “v?”, 2b, 3 “v!”; annotated “log” and “vote”.]
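The acceptor's side of the exchange fits in a small class. This is a sketch under assumptions: ballots are integers, the in-memory `log` list stands in for persistent storage, and `on_prepare` is included only so that an “old ballot” has a meaning.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0     # highest ballot this acceptor has promised or accepted
        self.accepted = None  # (ballot, value) most recently accepted, if any
        self.log = []         # stand-in for persistent storage

    def on_prepare(self, ballot):
        """1a -> 1b: promise to ignore older ballots and report any accepted value."""
        if ballot > self.promised:
            self.promised = ballot
            return ("ok", self.accepted)
        return ("deny", None)

    def on_accept(self, ballot, value):
        """2a -> 2b: accept and log the value unless the ballot is old."""
        if ballot < self.promised:
            return "deny"
        self.promised = ballot
        self.accepted = (ballot, value)  # discards any previously accepted value
        self.log.append(self.accepted)   # log before replying
        return "accept"


a = Acceptor()
promise = a.on_prepare(5)
stale = a.on_accept(4, "x")   # ballot 4 is older than the promise for ballot 5
fresh = a.on_accept(5, "v")
```

A later leader that prepares ballot 6 learns about `(5, "v")` from this acceptor's 1b reply.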

Leader: Choosing a Suitable Value
A majority of agents responded (1b): good. Did any accept a value for some previous ballot (2b)?
– Yes: they tell you the ballot ID and value. Find the most recent value that any responding agent accepted, and choose it for this ballot too.
– No: choose any value you want.
[Diagram: 1a “Can I lead round 5?”, 1b “OK, but I accepted 7 for round 4”, 2a “7?”, 2b “OK”, 3 “7!”]
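The leader's rule for a suitable value is short enough to write out. In this sketch each 1b reply is modeled as either `None` (nothing accepted yet) or a `(ballot, value)` pair; `choose_value` and `my_value` are hypothetical names.

```python
def choose_value(promises, my_value):
    """Pick the value for this ballot from a majority of 1b replies.

    promises: list whose entries are None or (ballot, value) pairs.
    """
    accepted = [p for p in promises if p is not None]
    if not accepted:
        return my_value  # nobody accepted anything: free choice
    # Otherwise adopt the value accepted in the most recent (highest) ballot.
    return max(accepted, key=lambda p: p[0])[1]


# The example from the slide: one agent already accepted 7 for round 4.
round_5_value = choose_value([None, (4, 7), None], my_value=9)
```

Adopting the most recently accepted value is what keeps a value that may already have been chosen from being replaced in a later round.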

A Paxos Round
[Diagram: 1a Propose “Can I lead b?”, 1b Promise “OK, but”, 2a Accept “v?”, 2b Ack “OK”, 3 Commit “v!”; annotations: self-appoint, wait for majority, log, safe.]
Where is the consensus “point of no return”?

Spanner and Paxos
What does ACID across the world here say about NoSQL?
Does this “break CAP”?
– E.g., could a group of Paxos groups ever partition?
– What happens if it does partition?
What about performance cost?
– Their view: transactions are worth it.
Can we do better?
– It’s an open research problem!
