Distributed Commit Dr. Yingwu Zhu
Failures in a distributed system
Consistency requires agreement among multiple servers
– Is transaction X committed?
– Have all servers applied update X to a replica?
Achieving agreement in the presence of failures is hard
– Impossible to distinguish host failures from network failures
This class:
– all-or-nothing atomicity in distributed systems
Distributed Commit Problem
Some applications perform operations on multiple databases
– Transfer funds between two bank accounts
– Debiting one account and crediting another
We would like a guarantee that either all the databases get updated, or none does
Distributed Commit Problem (all-or-none semantics):
– Operation is committed when all participants can perform it
– Once a commit decision is reached, this requirement holds even if some participants fail and later recover
Transaction
A transaction behaves as one operation
Atomicity: all-or-none; if the transaction fails, no changes apply to the database
Consistency: there is no violation of the database integrity constraints
Isolation: partial results of incomplete transactions are hidden
Durability: the effects of committed transactions are permanent
Example
Clients want all-or-nothing transactions
– Transfer either happens or not at all
[Figure: client, Bank A, Bank B. Transfer $1000, from A: $3000, to B: $2000]
Strawman solution
[Figure: client, transaction coordinator, Bank A, Bank B. Transfer $1000, from A: $3000, to B: $2000]
Strawman solution
What can go wrong?
– A does not have enough money
– B’s account no longer exists
– B has crashed
– Coordinator crashes
[Figure: message timeline among client, transaction coordinator, Bank A, and Bank B: start, A=A-1000, B=B+1000, done]
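The "B has crashed" case can be made concrete with a small sketch. It assumes the $3000 and $2000 in the figure are starting balances; the names Bank and naive_transfer are hypothetical and only illustrate the strawman, not any real banking API.

```python
class Bank:
    """Toy account server; `up=False` simulates a crashed host."""

    def __init__(self, balance, up=True):
        self.balance = balance
        self.up = up

    def apply(self, delta):
        if not self.up:
            raise ConnectionError("bank is down")
        if self.balance + delta < 0:
            raise ValueError("insufficient funds")
        self.balance += delta


def naive_transfer(src, dst, amount):
    """Strawman coordinator: debit, then credit, with no voting and no undo."""
    src.apply(-amount)  # A = A - 1000
    dst.apply(+amount)  # B = B + 1000; if this fails, the debit already happened


bank_a = Bank(3000)
bank_b = Bank(2000, up=False)  # B has crashed
try:
    naive_transfer(bank_a, bank_b, 1000)
except ConnectionError:
    pass

# A was debited but B was never credited: the transfer is neither all nor nothing.
print(bank_a.balance, bank_b.balance)
```

The debit is applied before anyone knows whether the credit can succeed, which is exactly what the voting phase of 2PC is meant to prevent.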
One-Phase Commit
A coordinator tells all other processes (participants) whether or not to perform the operation in question
Problem:
– If one participant fails to perform the operation, it has no way to tell the coordinator
– This violates the all-or-none rule!
Two-Phase Commit (2PC) Overview
Assumes a coordinator that initiates the commit/abort
Each participant votes on whether it is ready to commit
– Until the commit actually occurs, the update is considered temporary (placed in temp area)
– The participant is permitted to discard a pending update
Until all participants vote “ok”, a participant can abort
Coordinator decides outcome and informs all participants
2PC: More Details
Operates in rounds
Coordinator assigns a unique identifier to each protocol run. How?
– Use logical clocks: a run identifier can be the process ID together with the value of the logical clock
Messages carry the identifier of the protocol run they are part of
Since lots of messages must be stored, garbage collection must be performed; the challenge is to determine when it is safe to remove the information
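One possible way to build such run identifiers is sketched below, assuming a Lamport-style logical clock. The class name RunIdGenerator and its methods are hypothetical, chosen only for illustration.

```python
class RunIdGenerator:
    """Issues (logical_clock, process_id) pairs as protocol-run identifiers."""

    def __init__(self, process_id):
        self.process_id = process_id
        self.clock = 0  # Lamport-style logical clock

    def observe(self, remote_clock):
        # On receiving a message, move past the sender's clock value.
        self.clock = max(self.clock, remote_clock) + 1

    def next_run_id(self):
        # Tick the clock for every new protocol run this process starts.
        self.clock += 1
        return (self.clock, self.process_id)


gen_a = RunIdGenerator(process_id=1)
gen_b = RunIdGenerator(process_id=2)

id_a = gen_a.next_run_id()
id_b = gen_b.next_run_id()   # same clock value as id_a, but a different process ID
gen_b.observe(id_a[0])       # B receives a message stamped with A's clock
id_b2 = gen_b.next_run_id()  # ordered after id_a
```

Two coordinators can draw the same clock value, so the process ID is what keeps the pair unique; the clock component keeps identifiers from one process increasing.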
Participant States
Initial state: p_i is not aware that the protocol started; the state ends when p_i has received ready_to_commit and is ready to send its Ok
Prepared to commit: p_i sent its Ok, saved the update in the temp area, and waits for the final decision (Commit or Abort) from the coordinator
Commit or abort state: p_i knows the final decision and must execute it
2PC State Transition (a) The finite state machine for the coordinator in 2PC. (b) The finite state machine for a participant. Timeout mechanism is used here for coordinator and participants. Coordinator blocked in “WAIT”, participant blocked in “INIT”
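Since the figure itself is not reproduced here, the two state machines can be approximated as transition tables. This is a sketch under assumptions: the state names follow the slides (INIT, WAIT, READY, COMMIT, ABORT), while the event labels and the helper names step and run are hypothetical.

```python
# (current state, event) -> next state
COORDINATOR = {
    ("INIT", "start"): "WAIT",              # multicast vote-request
    ("WAIT", "all-vote-commit"): "COMMIT",  # multicast global-commit
    ("WAIT", "vote-abort"): "ABORT",        # multicast global-abort
    ("WAIT", "timeout"): "ABORT",           # a vote never arrived
}

PARTICIPANT = {
    ("INIT", "vote-commit"): "READY",   # got vote-request, voted to commit
    ("INIT", "vote-abort"): "ABORT",    # got vote-request, voted to abort
    ("INIT", "timeout"): "ABORT",       # vote-request never arrived
    ("READY", "global-commit"): "COMMIT",
    ("READY", "global-abort"): "ABORT",
    # No ("READY", "timeout") entry: a participant cannot decide on its own here.
}


def step(fsm, state, event):
    """Return the next state; events with no entry leave the state unchanged."""
    return fsm.get((state, event), state)


def run(fsm, events, state="INIT"):
    for event in events:
        state = step(fsm, state, event)
    return state


print(run(COORDINATOR, ["start", "all-vote-commit"]))
print(run(PARTICIPANT, ["vote-commit", "timeout"]))
```

The missing READY-plus-timeout transition is the interesting part: a timeout in INIT or WAIT can safely lead to ABORT, but a participant that has already voted to commit has to learn the outcome from someone else.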
2PC Ideal world: coordinator and participants never fail. How does 2PC work?
2PC: Back to Reality Each participant can fail at any time Coordinator can fail at any time Question: how to make 2PC work?
Problem Solving Step-by-Step
Step 1: assume the coordinator never fails but a participant could fail at any time
Step 2: assume the coordinator could fail at any time
Step 1: what can go wrong on participants?
Initial state (INIT): p_i crashes before it receives the vote-request. It does not send its vote back, so the coordinator aborts the protocol (not enough votes are received).
– implemented by timeouts
Prepared to commit (READY): if p_i crashes before it learns the outcome, resources remain blocked. It is critical that a crashed participant learns the outcome of pending operations when it comes back: How?
COMMIT or ABORT state: if p_i crashes before executing the decision, it must retry the commit or abort until it completes, in spite of being interrupted by failures: How?
Group Discussions How to modify the 2PC to address the participant failures?
What modifications are needed?
A process must remember in what state it was before crashing (resume point)
A process must find out the outcome by contacting the coordinator (redeem order)
The coordinator must find out when a process has indeed completed the decision, since the process can crash before executing it (garbage coordinator)
2PC: Overcome Participant Failures
Coordinator:
– Multicast: vote-request
– Collect replies/votes
  All vote-commit => log ‘commit’ to ‘outcomes’ table and send commit
  Else => log ‘abort’ and send abort
– Collect acknowledgments
– Garbage-collect protocol outcome information
Participant:
– vote-request => log its vote and send vote-(commit/abort)
– commit => make changes permanent, send acknowledgment
– abort => delete temp area
– After failure: for each pending protocol, contact coordinator (or other participants) to learn outcome
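The participant-side recovery can be sketched as follows. The sketch assumes the participant's log and temp area live on storage that survives a crash (plain dictionaries stand in for it here) and that the coordinator never fails, as in Step 1. All class and method names are hypothetical.

```python
class Coordinator:
    def __init__(self):
        self.outcomes = {}  # run_id -> "commit" | "abort"; the 'outcomes' table

    def decide(self, run_id, votes):
        outcome = "commit" if all(v == "vote-commit" for v in votes) else "abort"
        self.outcomes[run_id] = outcome  # log the outcome before announcing it
        return outcome

    def query(self, run_id):
        return self.outcomes.get(run_id)


class Participant:
    def __init__(self):
        self.log = {}   # run_id -> state; assumed to survive a crash
        self.temp = {}  # run_id -> pending update; assumed to survive a crash
        self.data = {}  # committed data

    def on_vote_request(self, run_id, update, can_commit=True):
        if not can_commit:
            self.log[run_id] = "ABORT"
            return "vote-abort"
        self.temp[run_id] = update
        self.log[run_id] = "READY"  # log the vote before sending it
        return "vote-commit"

    def on_outcome(self, run_id, outcome):
        update = self.temp.pop(run_id, None)
        if outcome == "commit" and update is not None:
            self.data.update(update)  # make changes permanent
        self.log[run_id] = "COMMIT" if outcome == "commit" else "ABORT"
        return "ack"

    def recover(self, coordinator):
        # After a restart: every run still in READY is pending, so ask for its outcome.
        for run_id, state in list(self.log.items()):
            if state == "READY":
                outcome = coordinator.query(run_id)
                if outcome is not None:
                    self.on_outcome(run_id, outcome)


coord = Coordinator()
p = Participant()
vote = p.on_vote_request("run-1", {"B": 3000})
coord.decide("run-1", [vote])
# p crashes here, before the commit message arrives, and later restarts:
p.recover(coord)
print(p.log["run-1"], p.data)
```

The logged READY entry is the resume point: without it, the restarted participant would not know that it still owes the coordinator a commit or abort.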
Step 2: What if coordinator fails?
If coordinator crashed during first phase (WAIT):
– some participants will be ready to commit
– others will not be able to (they voted to abort)
– other processes may not know what the state is
If coordinator crashed during its decision or before sending it out:
– some processes will be in READY state
– some others will know the outcome
Group Discussions How to overcome the coordinator failures?
Improvement
If coordinator fails, processes are blocked waiting for it to recover
After the coordinator recovers, there are pending protocols that must be finished
Coordinator must remember its state before crashing
– Write INIT & WAIT state on permanent storage
– Write GLOBAL_COMMIT or GLOBAL_ABORT on permanent storage before sending the commit or abort decision to other processes, and push these operations through
Participants may see duplicated messages (due to message re-transmission by coordinator)
2PC: Overcome Coordinator Failures (1)
Coordinator:
Multicast: vote-request
Collect replies/votes
– All vote-commit => log ‘commit’ to ‘outcomes’ table, wait until safe on persistent storage and send commit
– Else => log ‘abort’, send abort
Collect acknowledgments
Garbage collect protocol outcome information
After failure: for each pending protocol in outcomes table
– Possibly re-transmit VOTE_REQUEST if in WAIT
– Send outcome (commit or abort)
– Wait for acknowledgments
– Garbage collect outcome information
2PC: Overcome Coordinator Failures (2)
Participant:
First time message received
– Vote-request: save to temp area and reply with its vote (commit/abort)
– Global_commit: make changes permanent
– Global_abort: delete temp area
Message is a duplicate (recovering coordinator)
– Send acknowledgment
After failure: for each pending protocol
– contact coordinator to learn outcome
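A rough sketch of how the coordinator's log and the participants' duplicate handling fit together is given below. It assumes the coordinator's log dictionary stands in for persistent storage, and it simulates a crash by delivering the outcome to only the first few participants; the parameter crash_after_sends and all class names are hypothetical.

```python
class Participant:
    def __init__(self):
        self.state = {}   # run_id -> "READY" | "COMMIT" | "ABORT"
        self.applied = 0  # how many times an update was made permanent

    def on_vote_request(self, run_id):
        if run_id not in self.state:  # first time: save to temp area and vote
            self.state[run_id] = "READY"
        return "vote-abort" if self.state[run_id] == "ABORT" else "vote-commit"

    def on_outcome(self, run_id, outcome):
        if self.state.get(run_id) in ("COMMIT", "ABORT"):
            return "ack"              # duplicate message: only acknowledge
        if outcome == "commit":
            self.applied += 1         # make changes permanent exactly once
        self.state[run_id] = "COMMIT" if outcome == "commit" else "ABORT"
        return "ack"


class Coordinator:
    def __init__(self, participants):
        self.participants = participants
        self.log = {}  # run_id -> "WAIT" | "commit" | "abort"; assumed durable

    def run(self, run_id, crash_after_sends=None):
        self.log[run_id] = "WAIT"
        votes = [p.on_vote_request(run_id) for p in self.participants]
        committed = all(v == "vote-commit" for v in votes)
        self.log[run_id] = "commit" if committed else "abort"  # log before sending
        self._send_outcome(run_id, limit=crash_after_sends)

    def _send_outcome(self, run_id, limit=None):
        outcome = self.log[run_id]
        acks = [p.on_outcome(run_id, outcome) for p in self.participants[:limit]]
        if len(acks) == len(self.participants):
            del self.log[run_id]  # everyone acknowledged: garbage-collect

    def recover(self):
        for run_id, entry in list(self.log.items()):
            if entry == "WAIT":
                self.run(run_id)             # re-transmit vote-request
            else:
                self._send_outcome(run_id)   # re-send the logged outcome


p1, p2 = Participant(), Participant()
coord = Coordinator([p1, p2])
coord.run("run-1", crash_after_sends=1)  # "crash" after informing only p1
coord.recover()                          # p1 sees a duplicate, p2 sees it first
print(p1.applied, p2.applied, coord.log)
```

Because the decision is logged before any message is sent, recovery can simply re-send it; the participants' duplicate check is what makes that re-send harmless.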
2PC: Coordinator Outline of the steps taken by the coordinator in 2PC protocol....
2PC: participant The steps taken by a participant process in 2PC.
2PC: decision query from other participants
State of Q    Action by P
COMMIT        Transition to COMMIT
ABORT         Transition to ABORT
INIT          Transition to ABORT
READY         Contact another participant*
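The table can be read as a lookup applied to each participant Q that P manages to reach. The sketch below assumes P is blocked in READY; the function names action_for and resolve and the labels ASK_ANOTHER and BLOCKED are hypothetical.

```python
# What P (blocked in READY) does after learning the state of another participant Q.
ACTION_BY_Q_STATE = {
    "COMMIT": "COMMIT",
    "ABORT": "ABORT",
    "INIT": "ABORT",
    "READY": "ASK_ANOTHER",
}


def action_for(q_state):
    return ACTION_BY_Q_STATE[q_state]


def resolve(q_states):
    """Query participants in turn until one of them settles the outcome."""
    for q_state in q_states:
        action = action_for(q_state)
        if action != "ASK_ANOTHER":
            return action
    return "BLOCKED"  # everyone reachable is in READY: wait for the coordinator


print(resolve(["READY", "INIT"]))
print(resolve(["READY", "READY"]))
```

A Q still in INIT has not voted yet, so no commit decision can have been taken and aborting is safe; a Q in READY knows no more than P does.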
Problem with 2PC
The crash of the coordinator may block participants from reaching a final decision until it recovers
– during the decision stage
– All participants are in READY state and cannot cooperatively reach the final decision!
– Solution: 3PC
Another example of 2PC
2 PC
Phase 1 (voting phase)
– (1) coordinator sends canCommit? to participants
– (2) participant replies with vote (Yes or No); before voting Yes, it prepares to commit by saving objects in temp area; if No, it aborts
Phase 2 (completion according to outcome of vote)
– (3) coordinator collects votes (including its own); if no failures and all Yes, sends doCommit to participants; otherwise, sends doAbort to participants
– (4) participants that voted Yes wait for doCommit or doAbort and act accordingly; they confirm their action to the coordinator by sending haveCommitted
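The four numbered steps can be sketched as a failure-free run. The message names canCommit?, doCommit, doAbort, and haveCommitted come from the slide; the Python method names, the state labels, and the function two_phase_commit are hypothetical.

```python
class Participant:
    def __init__(self, will_vote_yes=True):
        self.will_vote_yes = will_vote_yes
        self.state = "INIT"

    def can_commit(self):
        # Step (2): prepare (save objects in temp area) before voting Yes; abort on No.
        self.state = "PREPARED" if self.will_vote_yes else "ABORTED"
        return self.will_vote_yes

    def do_commit(self):
        # Step (4): act on the decision and confirm it.
        self.state = "COMMITTED"
        return "haveCommitted"

    def do_abort(self):
        self.state = "ABORTED"


def two_phase_commit(participants, coordinator_vote=True):
    # Phase 1, steps (1)-(2): ask every participant and collect the votes.
    votes = [p.can_commit() for p in participants]
    # Phase 2, step (3): commit only if every vote, including the coordinator's, is Yes.
    if coordinator_vote and all(votes):
        confirmations = [p.do_commit() for p in participants]
        return "committed", confirmations
    for p, voted_yes in zip(participants, votes):
        if voted_yes:
            p.do_abort()  # only Yes-voters are still waiting for a decision
    return "aborted", []


yes_a, yes_b, no_c = Participant(), Participant(), Participant(will_vote_yes=False)
result_ok = two_phase_commit([Participant(), Participant()])
result_bad = two_phase_commit([yes_a, yes_b, no_c])
print(result_ok, result_bad)
```

A participant that voted No has already aborted locally, which is why doAbort only needs to reach the Yes-voters.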
Communication in 2 PC
3 PC Overview
Remember that 2 PC blocks when coordinator crashes during the decision stage
– Participants are blocked until the coordinator recovers!
3 PC guarantees that the protocol will not block when only fail-stop failures occur
– Avoids the blocking of 2 PC
– Model is not realistic, but still interesting to look at
– Fail-stop: a process fails only by crashing, and crashes are accurately detectable
Requires a fourth round for garbage collection
3PC Key Idea
Introduces an additional round of communication and delay, the prepare-to-commit state, to ensure that the state of the system can always be deduced by a subset of alive participants that can communicate with each other
– before the commit, coordinator tells all participants that everyone sent Oks (ready_commit)
3PC: Coordinator
Coordinator:
Multicast: vote-request
Collect votes/replies
– All commit => log ‘precommit’ and send precommit
– Else => log ‘abort’, send abort
Collect acks from non-failed participants
– All ‘ready-commit’ => log commit and send global-commit
Collect acknowledgements
Garbage collect protocol outcome information
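A failure-free sketch of the coordinator's rounds is shown below. It assumes no participant crashes during the run, so every precommit is acknowledged, and it uses a plain list as a stand-in for the coordinator's log. The class and function names are hypothetical.

```python
class Participant:
    def __init__(self, vote="vote-commit"):
        self.vote = vote
        self.state = "INIT"

    def on_vote_request(self):
        self.state = "READY" if self.vote == "vote-commit" else "ABORT"
        return self.vote

    def on_precommit(self):
        self.state = "PRECOMMIT"
        return "ready-commit"

    def on_commit(self):
        self.state = "COMMIT"

    def on_abort(self):
        self.state = "ABORT"


def three_phase_commit(participants, log):
    # Round 1: multicast vote-request and collect the votes.
    votes = [p.on_vote_request() for p in participants]
    if any(v != "vote-commit" for v in votes):
        log.append("abort")
        for p in participants:
            p.on_abort()
        return "abort"
    # Round 2: everyone voted to commit, so announce that fact first (precommit).
    log.append("precommit")
    acks = [p.on_precommit() for p in participants]
    # Round 3: with all ready-commit acks in, log and send the global commit.
    if all(a == "ready-commit" for a in acks):
        log.append("commit")
        for p in participants:
            p.on_commit()
    return log[-1]


log_ok, log_bad = [], []
group = [Participant(), Participant()]
outcome_ok = three_phase_commit(group, log_ok)
outcome_bad = three_phase_commit([Participant(), Participant("vote-abort")], log_bad)
print(outcome_ok, log_ok, outcome_bad, log_bad)
```

The extra precommit round means no participant commits until every live participant knows that all votes were to commit.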
3PC: Participant
Participant: logs state on each message
– Vote-request: save to temp area and reply vote-(commit/abort)
– precommit: enter precommit state, send ack (ready-commit)
– commit: make changes permanent
– abort: delete temp area
After failure: collect participant state information
– All precommit or any committed: push forward the commit
– Else: push back the abort
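The after-failure rule is compact enough to write as a single function. This is a minimal sketch assuming the collected states are given as a list of strings; the function name recovery_decision is hypothetical.

```python
def recovery_decision(states):
    """Decide the outcome from the states collected from reachable participants."""
    if not states:
        raise ValueError("no participant state information collected")
    if any(s == "COMMIT" for s in states) or all(s == "PRECOMMIT" for s in states):
        return "commit"  # push forward the commit
    return "abort"       # push back the abort


print(recovery_decision(["PRECOMMIT", "PRECOMMIT"]))
print(recovery_decision(["COMMIT", "PRECOMMIT"]))
print(recovery_decision(["READY", "PRECOMMIT"]))
```

The empty-list check matters because `all()` over an empty sequence is true in Python, which would otherwise turn "no information" into a commit.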
3PC State Transition (a) The finite state machine for the coordinator in 3PC. (b) The finite state machine for a participant. Question: Now can participants be blocked in READY status?
Has 2PC’s problem been solved?
A participant can be blocked in
– INIT: abort (no participant in PRECOMMIT, why?)
– READY: if a majority of participants are in READY, safe to abort
– PRECOMMIT: if all participants are in PRECOMMIT, then COMMIT, otherwise safe to abort
In summary, if a participant is in READY, all crashed participants can only recover to INIT, ABORT, or PRECOMMIT, so surviving processes can always come to a final decision
Summary Slides
2 PC
Is a blocking protocol
Consists of a coordinator and participants.
1. Coordinator multicasts a VOTE_REQUEST message to all participants.
2. When a participant receives a VOTE_REQUEST message, it replies (unicast) with either VOTE_COMMIT or VOTE_ABORT. A VOTE_COMMIT response is essentially a contractual guarantee that it will be able to commit.
3. Coordinator collects all votes. If all are VOTE_COMMIT, then it multicasts a GLOBAL_COMMIT message. Otherwise, it will multicast a GLOBAL_ABORT message.
4. When a participant receives GLOBAL_COMMIT, it locally commits; if it receives GLOBAL_ABORT, it locally aborts.
2 PC FSMs
Where does the waiting/blocking occur?
– Coordinator-WAIT
– Participant-INIT
– Participant-READY
[Figure: coordinator and participant FSMs]
Two-Phase Commit Recovery
What happens in case of a crash? How do we detect a crash?
– If timeout in Coordinator-WAIT, then abort.
– If timeout in Participant-INIT, then abort.
– If timeout in Participant-READY, then need to find out if globally committed or aborted. Either just wait for the Coordinator to recover, or check with others.
[Figure: coordinator FSM with its wait state and participant FSM with its wait states]
Two-Phase Commit Recovery
If in Participant-READY, and we wish to check with others:
– If Q is in COMMIT, then commit. If Q is in ABORT, then abort.
– If Q is in INIT, then can safely abort.
– If all are in READY, nothing can be done (hence 3 PC).
[Figure: coordinator FSM with its wait state and participant FSM with its wait states]