1 Advanced Database Topics Copyright © Ellis Cohen Concurrency Control for Distributed Databases These slides are licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License. For more information on how you may use them, please see
Topics
Distributed Lock-Based Concurrency Control
Distributed Abort Protocol
Distributed Atomic Commit Protocols
Distributed Optimistic Concurrency Control
Distributed Lock-Based Concurrency Control
Sub-query Distribution
Suppose a coordinator wants to execute the query that lists the projects managed by the highest-paid employee:
    SELECT * FROM Projs WHERE pmgr = (SELECT empno FROM Emps WHERE sal = (SELECT max(sal) FROM Emps))
If subordinate S1 holds the Projs table and subordinate S2 holds the Emps table, then the coordinator will request S2 to execute the sub-query
    SELECT empno FROM Emps WHERE sal = (SELECT max(sal) FROM Emps)
It will get the result back (let's call it result), and then request S1 to execute (and return the results of) the sub-query
    SELECT * FROM Projs WHERE pmgr = result
Sub-transactions
Imagine a coordinator C has started a transaction TC, and is executing a query as part of TC.
–The coordinator divides the query up into sub-queries, which it sends to various subordinates.
–It labels each sub-query with TC, the identity of the main transaction.
When a subordinate S is passed a sub-query
–If it has not yet seen the label TC, it creates a local transaction TS (called a sub-transaction), and associates TS with TC.
–If it has seen TC before, it looks up the corresponding TS.
In either case, S runs the sub-query as part of the local sub-transaction TS.
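The lookup-or-create step a subordinate performs can be sketched as follows. This is a minimal illustration, not the behavior of any particular DBMS: the class name Subordinate, the label format "TC-42", and the generated sub-transaction ids are all hypothetical.

```python
from itertools import count


class Subordinate:
    """Hypothetical subordinate that maps main-transaction labels (TC) to local sub-transactions (TS)."""

    def __init__(self, name):
        self.name = name
        self._sub_txn_for = {}      # TC label -> local TS id
        self._next_id = count(1)    # source of fresh local ids

    def sub_transaction_for(self, tc_label):
        # First sub-query labelled TC: create a local sub-transaction TS.
        if tc_label not in self._sub_txn_for:
            self._sub_txn_for[tc_label] = f"{self.name}-TS{next(self._next_id)}"
        # Later sub-queries with the same label reuse the same TS.
        return self._sub_txn_for[tc_label]

    def run_sub_query(self, tc_label, sql):
        ts = self.sub_transaction_for(tc_label)
        # A real subordinate would execute `sql` inside `ts`; here we only report the pairing.
        return (ts, sql)


s2 = Subordinate("S2")
first = s2.run_sub_query("TC-42", "SELECT max(sal) FROM Emps")
second = s2.run_sub_query("TC-42", "SELECT empno FROM Emps WHERE sal = 5000")
other = s2.run_sub_query("TC-43", "SELECT * FROM Emps")
```

Both sub-queries labelled TC-42 run in the same local sub-transaction, while TC-43 gets its own.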
Centralized Locking
Each query & commit is funneled through a central Lock Manager site, which maintains all locks
Evaluation: Only supports table-level granularity (but predicate locks could achieve the effect of row-level granularity)
Cost Issue: Requires extra communication for each query
Reliability Issue: Single point of failure; a crash of the Lock Manager requires aborting all transactions and electing a new Lock Manager
Scalability Issue: The Lock Manager is a bottleneck
Note: Depending upon the pattern of communication, both reliability & scalability can be addressed via a hierarchy of lock managers
Distributed Deadlock Prevention
Each subordinate
–Locks its own DB objects
–Can make WAIT/WOUND/DIE decisions locally (requires transaction properties, e.g. timestamp or priority, passed with each sub-query)
WOUND or DIE
–Aborts the local sub-transaction
–Notifies the coordinator, which aborts the main transaction (if not already aborted) & informs the other subordinates (and, if hierarchical, notifies its parent coordinator)
Consider two transactions, T1 and T2, managed by different coordinators C1 and C2, that both try to lock the same resource. If T1's clock is set a year in the past, will it ever be wounded or die?
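A local WAIT/WOUND/DIE decision can be sketched with the two classic timestamp rules, Wait-Die and Wound-Wait. The function names and the use of plain integers as timestamps are assumptions made for illustration; a smaller timestamp means an older transaction, and ties are treated as "not older".

```python
def wait_die(requester_ts, holder_ts):
    """Wait-Die: an older requester waits; a younger requester dies (aborts itself)."""
    return "WAIT" if requester_ts < holder_ts else "DIE"


def wound_wait(requester_ts, holder_ts):
    """Wound-Wait: an older requester wounds (aborts) the holder; a younger requester waits."""
    return "WOUND" if requester_ts < holder_ts else "WAIT"


# T1's timestamp is far in the past, so it always looks older than T2.
T1_TS, T2_TS = 1, 1_000_000
t1_requests = (wait_die(T1_TS, T2_TS), wound_wait(T1_TS, T2_TS))
t2_requests = (wait_die(T2_TS, T1_TS), wound_wait(T2_TS, T1_TS))
```

Under these two rules, a transaction whose timestamp lags far behind everyone else's never receives DIE and is never the target of a WOUND, which is worth keeping in mind when answering the question above.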
Local & Global WFGs
Consider
–T1 locks A at site S1, requests B at site S2
–T2 locks B at site S2, requests A at site S1
Local WFGs (Wait-For Graphs)
–S1 knows: T2 → T1 (T2 waits for A, which T1 holds)
–S2 knows: T1 → T2 (T1 waits for B, which T2 holds)
Need to build the Global WFG to discover the cycle: T1 → T2 → T1
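Merging local wait-for graphs and searching the result for a cycle can be sketched as below. The dictionary representation (waiter mapped to the set of transactions it waits for) and the function names are assumptions for illustration only.

```python
def merge_wfgs(*local_wfgs):
    """Union the edges of several local wait-for graphs into one global graph."""
    global_wfg = {}
    for wfg in local_wfgs:
        for waiter, holders in wfg.items():
            global_wfg.setdefault(waiter, set()).update(holders)
    return global_wfg


def find_cycle(wfg):
    """Return a list of transactions forming a cycle, or None if the graph is acyclic."""
    visiting, done = [], set()

    def dfs(txn):
        if txn in visiting:                      # back edge: cycle found
            return visiting[visiting.index(txn):] + [txn]
        if txn in done:
            return None
        visiting.append(txn)
        for nxt in sorted(wfg.get(txn, ())):
            cycle = dfs(nxt)
            if cycle:
                return cycle
        visiting.pop()
        done.add(txn)
        return None

    for txn in sorted(wfg):
        cycle = dfs(txn)
        if cycle:
            return cycle
    return None


s1_wfg = {"T2": {"T1"}}   # at S1, T2 waits for T1
s2_wfg = {"T1": {"T2"}}   # at S2, T1 waits for T2
global_cycle = find_cycle(merge_wfgs(s1_wfg, s2_wfg))
```

Neither local graph contains a cycle on its own; only the merged graph reveals the deadlock.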
DDBMS Deadlock Detection
Timeout-based Deadlock Detection (Oracle)
–Subordinate detects local deadlocks via its local WFG
–Use timeouts to detect global deadlocks
Centralized Deadlock Detection
–Each subordinate regularly sends its local WFG to a central site, which informs the coordinator of any deadlock
–Can also do this hierarchically
Phantom Deadlock Problem
–Suppose the central site detects a deadlock between T1 and T2, and chooses to tell T1's coordinator to abort
–In the meantime, T2 is aborted for some other reason (e.g. T2's coordinator crashes), so aborting T1 was unnecessary
How could phantom deadlocks be avoided?
Distributed Deadlock Detection
Path Pushing Algorithm
–When the coordinator makes a sub-query for transaction T, pass along the sites at which T has already acquired locks
–If the sub-query causes a wait, and deadlock can't be detected locally, send (own & propagated) knowledge about the path to the sites at which T has acquired locks, as well as to other [higher numbered] waiting sites you know about
Distributed Abort Protocol
Distributed System Failures
Site failures
–Site crashes or is unable to respond to messages
Link failures
–Messages may be undeliverable, lost, or garbled, so an understandable response is not received
–Link failures can cause a network partition; some sites become unreachable from other sites
Failure detection
–Usually via timeouts (the time it takes for a remote site to respond to a message exceeds a threshold)
–If failure is suspected (a message timed out), a ping message can be sent to the site; if a ping response is received, the timeout period can be extended (but not indefinitely)
Distributed Algorithms
Because of failures, distributed algorithms are complicated. In designing distributed algorithms, we need to work out
–The messages that need to go back and forth between nodes, and how a node responds to each message, to accomplish the algorithm
–How to handle timeouts: what to do when a node expects a message, but doesn't receive it in a reasonable time
–How to handle recovery: what a node does on recovering, if it crashed while it was participating in the distributed algorithm
Aborting Distributed Transactions
To explore distributed algorithms, we'll consider distributed abort: how a coordinator gets all the subordinates to abort a transaction.
–Coordinator → each Subordinate: ABORT
–Each Subordinate → Coordinator: ABORT-ACK
First, what could make a coordinator start an ABORT?
Causes of Distributed Abort
Subordinate
–Raises an error in executing a sub-query
–Crashes (or appears to)
Coordinator
–Raises an error in executing a local sub-query
–Crashes (or appears to)
–Told to ROLLBACK (by application)
–Told to ABORT (e.g. by deadlock detection)
Standard Abort Protocol
COORDINATOR (Abort) (when it decides / is told to abort)
–Force Abort to log (with list of subordinates)
–Send ABORT to each subordinate
–Abort main transaction
SUBORDINATE (Abort) (when it receives an ABORT message)
–Force Abort to log (unless already aborted)
–Send ABORT-ACK to coordinator
–Abort own subtransaction (unless already aborted)
COORDINATOR (AbortComplete) (when it receives all ABORT-ACKs back)
–Write AllAbortsDone to log
Suppose it doesn't receive all ACKs back? Is ABORT-ACK even necessary?
Timeouts
SUBORDINATE (Waiting)
Subordinates can at any time send an INQUIRE message to the coordinator. If the response is
–ACTIVE: wait some more
–ABORT: do standard Abort action
–none: decide whether to abort or to wait some more
COORDINATOR (waiting for ABORT-ACK)
Regularly keep sending ABORT & wait for ABORT-ACK
Recovery
COORDINATOR
(on discovering Abort T in log, without corresponding AllAbortsDone)
–Send ABORTs to all subordinates (listed in the Abort entry)
(on discovering Start T in log, without corresponding Commit or Abort)
–Subordinates are unknown: answer INQUIREs
SUBORDINATE
(on discovering Abort T in log)
–Send ABORT-ACK to coordinator
(on discovering Start T in log, but no corresponding Commit or Abort)
–Send ABORT to coordinator (directs coordinator to abort transaction)
–Force Abort to log
–Abort own subtransaction
Are ABORT-ACK & AllAbortsDone necessary?
ABORT-ACK & AllAbortsDone
The ABORT-ACK message and the AllAbortsDone log entry are not strictly necessary. That's because subordinates can abort on their own (for any reason, but especially) if they don't hear from the coordinator.
ACKs and completion log entries are much more crucial when we talk about commit.
Distributed Atomic Commit Protocols (ACP)
Atomic Commit Protocols
Distributed Atomic Commit Protocols ensure atomicity & durability in distributed environments
–A transaction which executes at multiple sites must either be committed at all sites or aborted at all sites
–It is not acceptable to have a transaction committed at one site and aborted at another
2 Phase Commit (2PC)
–Industry standard protocol
3 Phase Commit (3PC)
–Extension of 2PC which reduces blocking when coordinator failures occur during the protocol
2PC Motivation
Suppose a transaction coordinator, with subordinates S1 and S2, is ready to commit (in particular, all sub-queries have finished successfully)
–The coordinator sends COMMIT messages for the transaction to S1 and S2.
–S1 commits its local subtransaction.
–S2 crashes just before receiving the COMMIT message (and before writing any local subtransaction state to stable storage), i.e. S2 aborts.
Problem
Need a way to ensure that once the coordinator has decided to commit & has started to send COMMIT messages, a subordinate crash does not cause that subtransaction to abort
Simplified 2 Phase Commit
1a) Coordinator → Subordinate: PREPARE
1b) Subordinate → Coordinator: YES
2a) Coordinator → Subordinate: COMMIT
2b) Subordinate → Coordinator: COMMIT-ACK
2PC Approach
PREPARE Phase:
–Coordinator sends PREPARE message to each subordinate
–Each subordinate prepares to commit by ensuring that the sub-transaction can be made locally durable (e.g. by forcing out log entries, including the Prepare log entry)
–Once the subordinate has prepared, it can commit even after it crashes, and it is not allowed to abort unless it knows the coordinator aborted the transaction
COMMIT Phase:
–Coordinator sends COMMIT only after all subordinates are prepared.
–The transaction is unalterably committed when the Commit entry is forced to the coordinator's log (because if the coordinator crashes, it can complete the commit on recovery)
Prepare Phase
COORDINATOR (Prepare) (when it decides / is told to commit)
–Force out log (with Prepare entry containing list of subordinates)
–Send PREPARE to each subordinate (with list of subordinates)
SUBORDINATE (Prepare) (when it receives a PREPARE message)
Decides whether it can commit (NO only if it is already aborting, or it uses optimistic concurrency and local validation fails)
NO
–Force Abort to log (unless already aborted)
–Send NO to coordinator
–Abort own subtransaction (unless already aborted)
YES
–Force out log with Prepare entry
–Send YES to coordinator
Period of Uncertainty
Once a subordinate answers YES to PREPARE
–The subordinate cannot unilaterally decide whether to commit or abort
–The subtransaction enters a period of uncertainty, not knowing whether the main transaction will ultimately commit or abort
–The subordinate must wait until the coordinator tells it which to do
Coordinator Commit Phase
The coordinator waits for all subordinates to respond
If any subordinate responds NO, or does not respond within the timeout period (possibly after sending PREPARE again), the coordinator
–Forces Abort to the log
–Sends ABORT to each subordinate that did not respond with a NO
–Aborts the main transaction
If all subordinates respond YES within the timeout period, the coordinator
–Forces Commit to the log (this is the moment at which the transaction is durably committed)
–Sends COMMIT to each subordinate
–Commits the main transaction
Subordinate Commit Phase
SUBORDINATE (receiving ABORT)
–Force Abort to log
–Abort own subtransaction
SUBORDINATE (receiving COMMIT)
–Force Commit to log
–Send COMMIT-ACK back to coordinator
COORDINATOR (receiving all COMMIT-ACKs)
–Writes CommitComplete to log
–If it times out waiting for a COMMIT-ACK from a subordinate, it will keep sending COMMITs
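The coordinator's decision rule in 2PC can be sketched as a small in-memory function. This is a simplification under stated assumptions: votes are handed in as a dictionary instead of arriving as messages, None stands for a subordinate that timed out, and a Python list stands in for the forced log. The function name and log tuple shapes are hypothetical.

```python
def two_phase_commit(votes, log):
    """Decide the outcome of a transaction from the subordinates' votes.

    votes: dict mapping subordinate name -> "YES", "NO", or None (None models a timeout).
    log:   list standing in for the coordinator's forced log.
    Returns (decision, messages), where messages maps each subordinate
    to the phase-2 message the coordinator sends it.
    """
    subordinates = list(votes)
    log.append(("Prepare", subordinates))        # forced before PREPAREs go out

    if all(vote == "YES" for vote in votes.values()):
        log.append(("Commit",))                  # the commit point
        return "COMMIT", {s: "COMMIT" for s in subordinates}

    log.append(("Abort",))
    # ABORT goes only to subordinates that did not already vote NO.
    return "ABORT", {s: "ABORT" for s, vote in votes.items() if vote != "NO"}


commit_log, abort_log = [], []
commit_result = two_phase_commit({"S1": "YES", "S2": "YES"}, commit_log)
abort_result = two_phase_commit({"S1": "YES", "S2": "NO", "S3": None}, abort_log)
```

A single NO or a single timeout is enough to abort. A subordinate that voted NO has already aborted locally, so it is not sent ABORT.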
Subordinate Timeouts
SUBORDINATE (waiting for Prepare/Abort)
Send an INQUIRE message to the coordinator. If the response is
–ACTIVE: wait some more
–ABORT: do standard Abort action
–PREPARING: do standard Prepare action
–none: decide whether to abort or to wait some more
SUBORDINATE (after Prepare)
Send an INQUIRE message to the coordinator. If the response is
–PREPARING: continue to wait
–ABORT: do standard Abort action
–COMMIT: do standard Commit action
–none: cannot make a unilateral decision! Must either wait or find out the transaction disposition in some other way (e.g. by using a Termination Protocol)
Recovery
COORDINATOR
(on discovering Commit T in log, without corresponding CommitComplete)
–Send COMMIT to all subordinates
(on discovering Prepare T in log, without corresponding Commit or Abort)
–Send ABORTs to all subordinates
SUBORDINATE
(on discovering Commit T in log)
–Send COMMIT-ACK to coordinator
(on discovering Prepare T in log)
–Send YES to coordinator
(on discovering Start T in log, but no corresponding Commit or Abort)
–Send ABORT to coordinator (directs coordinator to abort transaction)
–Force Abort to log
–Abort own subtransaction
Termination Protocol Motivation
A subordinate can get stuck in a period of uncertainty if
–The subordinate has already prepared
–Either (a) the coordinator crashed or (b) the coordinator & subordinate became disconnected before the coordinator could send ABORT or COMMIT to the subordinate.
However,
–Maybe the coordinator did get an ABORT or COMMIT message off to another subordinate.
–The subordinate might be able to proceed if it could check with the other subordinates!
Termination Protocol
Along with the PREPARE message, each subordinate gets a list of the other subordinates
If the coordinator does not respond to INQUIRE, the subordinate sends INQUIRE to (some or all of) the other subordinates.
Each other subordinate responds
–COMMIT: if it received COMMIT from the coordinator
–ABORT: if it aborted (e.g. it received ABORT from the coordinator, or it responded NO to PREPARE, or it didn't receive PREPARE and chose to abort)
–UNCERTAIN: otherwise
The subordinate commits or aborts if COMMIT or ABORT is received from any other subordinate; otherwise it remains uncertain (and occasionally retries INQUIREs to the coordinator & other subordinates)
Blocking problem: If all responses are UNCERTAIN or time out, a subordinate may have to wait for coordinator recovery or network repair
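How an uncertain subordinate combines the replies from its peers can be sketched as below. The function name and the use of None for a peer that timed out are assumptions. The sketch relies on the protocol's guarantee that COMMIT and ABORT are never both reported for the same transaction.

```python
def resolve_from_peers(responses):
    """Combine peer replies to INQUIRE into a local decision.

    responses: iterable of "COMMIT", "ABORT", "UNCERTAIN", or None (peer timed out).
    """
    responses = list(responses)
    if "COMMIT" in responses:
        return "COMMIT"       # some peer heard COMMIT from the coordinator
    if "ABORT" in responses:
        return "ABORT"        # some peer aborted or heard ABORT
    return "UNCERTAIN"        # blocked: keep retrying coordinator and peers


unblocked = resolve_from_peers(["UNCERTAIN", None, "COMMIT"])
blocked = resolve_from_peers(["UNCERTAIN", None])
```

The second call shows the blocking problem: with only UNCERTAIN replies and timeouts, the subordinate cannot decide.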
3PC Motivation
With 2PC, if a subordinate is uncertain, and every subordinate it can communicate with is uncertain, they ALL MUST WAIT.
With 3PC, if the group of communicating subordinates is a [weighted] majority of the participants, they can always proceed!
3PC
Extends 2PC to 3 phases: PREPARE, PRECOMMIT, COMMIT
A subordinate is uncertain after sending YES and before getting back PRECOMMIT
A [weighted] minority partition of the subordinates must wait for network repair.
A [weighted] majority partition of the subordinates
–Aborts if all are uncertain
–Else, if at least one has received PRECOMMIT, uses an election protocol to elect a new coordinator if necessary (e.g. the one with the highest IP address), which then continues with the protocol
A coordinator (original or elected) sends COMMIT when it gets PRECOMMIT-ACKs from a [weighted] majority of the subordinates
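The partition rule can be sketched as follows. This is a simplified illustration covering only the two subordinate states named above. The state names, the weights table, and both function names are hypothetical. The election uses the "highest IP address" example, compared numerically with the standard-library ipaddress module.

```python
import ipaddress


def partition_action(states, weights, total_weight):
    """Decide what a group of mutually reachable subordinates may do.

    states:       dict site -> "UNCERTAIN" or "PRECOMMITTED", for reachable sites only.
    weights:      dict site -> voting weight, for all participants.
    total_weight: sum of the weights of all participants.
    """
    reachable_weight = sum(weights[site] for site in states)
    if 2 * reachable_weight <= total_weight:
        return "WAIT"                      # minority partition: wait for repair
    if all(state == "UNCERTAIN" for state in states.values()):
        return "ABORT"                     # majority, nobody saw PRECOMMIT
    return "ELECT_AND_CONTINUE"            # majority, someone saw PRECOMMIT


def elect_coordinator(sites):
    """Pick the site with the numerically highest IP address."""
    return max(sites, key=ipaddress.ip_address)


weights = {"10.0.0.9": 1, "10.0.0.12": 1, "10.0.0.30": 1}
majority = {"10.0.0.9": "PRECOMMITTED", "10.0.0.12": "UNCERTAIN"}
action = partition_action(majority, weights, total_weight=3)
new_coordinator = elect_coordinator(majority)
```

Comparing the addresses as plain strings would pick "10.0.0.9" here, so the sketch parses them first.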
Distributed Optimistic Concurrency Control
Optimistic Concurrency Control
Assumes (optimistically) that a transaction will not have conflicts with other transactions, avoiding the overhead of locks.
Cache-Based: Reads all possible data from, and writes all data to, its client cache.
Validation-Based: When the transaction commits, writes all changes back to the DB server, but only after validating that the data it used during the transaction is still up-to-date.
Distributed Validation
Consider a distributed DB which uses server-managed client caches.
[Figure: transaction S accesses TblA at site A and TblB at site B; A and B each keep a cache for S]
When S commits, A & B will both receive PREPARE messages. They will each locally do validation for their respective sub-transactions, and only respond YES if validation succeeds.
Note: With a client-side cache, S would need, as part of PREPARE, to pass back to A & B the timestamps of the data items read from A and B respectively.
How can this be supported if cross-DB query processing (e.g. joins) is done at other nodes, and only the final results are passed back to S?
Distributed Ordering Problem
What if S and T want to commit at the same time, A receives S's PREPARE message first, and B receives T's PREPARE message first? The result can be non-serializable.
[Figure: transactions S and T both access TblA at site A and TblB at site B; each site keeps a separate cache for S and for T]
Non-Serializable Result
S
1) UPDATE AT SET a2 = a1 + 100
2) UPDATE BT SET b2 = b1
3) COMMIT
T
1) UPDATE BT SET b2 = b1 + 100
2) UPDATE AT SET a2 = a1
3) COMMIT
Assume initially a1=1, a2=2, b1=3, b2=4
There are two possible serial schedules
–S then T: a2=1, b2=103
–T then S: a2=101, b2=3
But suppose S & T execute in parallel, and send PREPAREs to A and B in parallel
–If A gets the PREPAREs & validates T after S, there are no R/W conflicts and both validations succeed: a2=1
–If B gets the PREPAREs & validates S after T, there are no R/W conflicts and both validations succeed: b2=3
The outcome (a2=1, b2=3) matches neither serial schedule.
When using Distributed Optimistic Concurrency Control, subordinates cannot independently order commits!
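The anomaly can be replayed in a few lines. The sketch assumes that S sets a2 to a1 + 100 and T sets b2 to b1 + 100, which is consistent with the serial results given above, and it models each site as simply installing the write sets in the order it validated them. All function names are hypothetical.

```python
INITIAL = {"a1": 1, "a2": 2, "b1": 3, "b2": 4}


def txn_S(snapshot):
    """Write set of S, computed from the values S read."""
    return {"a2": snapshot["a1"] + 100, "b2": snapshot["b1"]}


def txn_T(snapshot):
    """Write set of T, computed from the values T read."""
    return {"b2": snapshot["b1"] + 100, "a2": snapshot["a1"]}


def serial(order):
    """Run the transactions one after another on a single copy of the data."""
    db = dict(INITIAL)
    for txn in order:
        db.update(txn(db))
    return db


def parallel(order_at_A, order_at_B):
    """Both transactions read the initial state; each site installs writes in its own order."""
    db = dict(INITIAL)
    for txn in order_at_A:                      # site A owns a1, a2
        db.update({k: v for k, v in txn(INITIAL).items() if k.startswith("a")})
    for txn in order_at_B:                      # site B owns b1, b2
        db.update({k: v for k, v in txn(INITIAL).items() if k.startswith("b")})
    return db


s_then_t = serial([txn_S, txn_T])
t_then_s = serial([txn_T, txn_S])
mixed = parallel(order_at_A=[txn_S, txn_T], order_at_B=[txn_T, txn_S])
```

The mixed outcome differs from both serial outcomes, which is exactly the non-serializable result described above.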
Timestamped Cache Checking
Suppose all sites have access to the same global clock, and when S and T want to commit, they pass the current global time as part of their PREPARE messages (the PrepareTime)
[Figure: S and T both access TblA at site A, which keeps a cache for S and a cache for T]
Suppose T sends PREPARE to A after S does, and suppose A receives them in the same order.
When A receives T's PREPARE, its PrepareTime is larger than every PrepareTime A has already received, including S's.
A can do Timestamped Cache Checking: For every local data item that T read (i.e. that is in A's cache for T), check whether the version T read is the latest one (compare its read timestamp in the cache to the local DB's timestamp for it)
Out of Order Prepares
[Figure: S and T both access TblA at site A, which keeps a cache for S and a cache for T]
Suppose T sends PREPARE to A after S does, but A receives them in the opposite order.
A receives PREPARE for T first, validates it, and responds YES, and then receives PREPARE for S, with an earlier PrepareTime.
Problem: If S wrote something that T read, T read the wrong version of it; T should have read the version that S wrote. It is too late to fail validation for T, but we can fail validation for S.
Problem: If T wrote (and committed) something that S read, S read the wrong version of it. S should have read the data before T persisted it! Also fail validation for S in this case: T has already committed, but S should have committed first.
These checks must be done in addition to Timestamped Cache Checking
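The validation a site performs on PREPARE, combining the cache check with the two out-of-order checks, can be sketched as below. This is a simplified model under stated assumptions: one version timestamp per item, write sets installed by an explicit commit step that stamps them with the PrepareTime, and no cleanup of old validation records. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prepare:
    txn: str
    prepare_time: int      # PrepareTime carried in the PREPARE message
    reads: dict            # item -> version timestamp the transaction read
    writes: frozenset      # items the transaction wants to write


class Site:
    def __init__(self, versions):
        self.versions = dict(versions)   # item -> timestamp of the current version
        self.validated = []              # PREPAREs already answered YES

    def validate(self, p):
        # Timestamped cache check: every item read must still be the latest version.
        for item, seen in p.reads.items():
            if self.versions.get(item) != seen:
                return "NO"
        # Out-of-order checks against transactions with a later PrepareTime.
        for later in self.validated:
            if later.prepare_time > p.prepare_time:
                if p.writes & set(later.reads):    # later txn should have read p's version
                    return "NO"
                if later.writes & set(p.reads):    # p read something a later txn wrote
                    return "NO"
        self.validated.append(p)
        return "YES"

    def commit(self, p):
        # Install the write set, stamping each item with the PrepareTime.
        for item in p.writes:
            self.versions[item] = p.prepare_time


site = Site({"x": 0})
T = Prepare("T", prepare_time=20, reads={"x": 0}, writes=frozenset())
S = Prepare("S", prepare_time=10, reads={}, writes=frozenset({"x"}))
answers = (site.validate(T), site.validate(S))   # T's PREPARE arrives first
```

With the PREPAREs delivered out of order, T is validated and S is then rejected, matching the first problem described above.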
Loosely Synchronized Clocks
In fact, the sites of a distributed system generally do not all have access to a global clock.
Instead, they use a Distributed Time Service, which sends time messages between sites, and ensures that all clocks stay reasonably close to one another.
Increasing clock skew
–Will, at worst, cause the algorithm described to fail more validations unnecessarily (since more PREPAREs will appear to be received out of order), but
–Will not cause validation to incorrectly succeed.
Are out of order PREPAREs a problem for Timestamp-Based or Read-Consistent concurrency control?
Timestamp-Based Concurrency
Ordering does not affect the Timestamp-Based Concurrency Control Algorithm
Ordering is already taken into account
–Data items are marked with their read times as well as their write times.
–Timestamp-based checks effectively already do the appropriate validation based on order.
Increasing clock skew
–Simply causes more timestamp-based checks to fail