Chapter 8 – Fault Tolerance
Section 8.5 Distributed Commit
Heta Desai
Dr. Yanqing Zhang
CSc 8320 - Advanced Operating Systems
October 14th, 2015

Today’s Presentation Outline
- Background and terms
- Problem Formulation
- Two-Phase Commit
- Three-Phase Commit
- Future work

Background
- What is the problem?
  - Atomic multicasting is one instance of a more general problem: distributed commit
- Why is it important?
  - Atomic multicast ensures that:
    (1) The correct addressees of every message agree either to deliver or not to deliver it
    (2) No two correct processes deliver any two messages in a different order

Distributed Commit
- Given a process group and an operation
  - The operation might or might not be committable at all processes
- Either everybody eventually commits or everybody eventually aborts
  - Even servers that crash and come back to life
- Consistency: all nodes see the same data at the same time.
- Availability: node failures do not prevent survivors from continuing to operate.

Distributed Commit
- Can we not just do this with Virtual Synchrony?
  - Coordinator multicasts vote request
  - All processes respond to request
  - Coordinator multicasts vote result
  - COMMIT iff all vote COMMIT
- This handles some error cases
- But what if a participant B crashes after it votes COMMIT but before the COMMIT result is broadcast, and then comes back to life?

What can go wrong?

Two Phase Commit (2PC)

The finite state machine for the coordinator. The finite state machine for the participant
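The two state-machine figures did not survive in this transcript. As a rough sketch, the transition tables below follow the standard textbook description of 2PC (Tanenbaum & Van Steen); the event names such as `commit_requested` and the helper `step` are hypothetical labels chosen for illustration, not taken from the slides.

```python
# (current state, event) -> (next state, message sent)
# State and message names follow the usual textbook 2PC description;
# the event labels are illustrative assumptions.
COORDINATOR_FSM = {
    ("INIT", "commit_requested"): ("WAIT", "VOTE_REQUEST"),
    ("WAIT", "all_vote_commit"): ("COMMIT", "GLOBAL_COMMIT"),
    ("WAIT", "any_vote_abort"): ("ABORT", "GLOBAL_ABORT"),
}

PARTICIPANT_FSM = {
    ("INIT", "vote_request_accepted"): ("READY", "VOTE_COMMIT"),
    ("INIT", "vote_request_refused"): ("ABORT", "VOTE_ABORT"),
    ("READY", "GLOBAL_COMMIT"): ("COMMIT", "ACK"),
    ("READY", "GLOBAL_ABORT"): ("ABORT", "ACK"),
}


def step(fsm, state, event):
    """Look up one transition; reject events that are not legal in `state`."""
    try:
        return fsm[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state} on {event}") from None
```

Note that READY is the only participant state from which both COMMIT and ABORT are reachable in one step, which is where the blocking problem discussed later comes from.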

Two Phase Commit (2PC)
- Failures
  - Crash and omission
  - Detect via timeouts
- Processes may recover
  - Need for logging states
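A minimal sketch of why states are logged: if a process appends each state to stable storage before acting on it, it can find out after a restart how far it got. The class `StateLog`, the file name, and the JSON line format are assumptions made for this example only.

```python
import json
import os
import tempfile


class StateLog:
    """Append-only state log kept in a file (hypothetical helper)."""

    def __init__(self, path):
        self.path = path

    def record(self, state):
        # Force the entry to disk before the caller sends any message.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"state": state}) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def last_state(self, default="INIT"):
        if not os.path.exists(self.path):
            return default
        with open(self.path, encoding="utf-8") as f:
            lines = [line for line in f if line.strip()]
        return json.loads(lines[-1])["state"] if lines else default


def demo():
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "participant.log")
        log = StateLog(path)
        log.record("INIT")
        log.record("READY")  # logged before VOTE_COMMIT would be sent
        # Simulate a crash and restart: a fresh object reads the same file.
        recovered = StateLog(path)
        return recovered.last_state()
```

Here `demo()` returns the last logged state, so a participant that crashed after voting knows it must not abort on its own.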

Two Phase Commit (2PC) - Perspective
- Coordinator's view
  - Blocks in WAIT = a participant may have failed
  - That participant might have voted ABORT, in which case a GLOBAL COMMIT would be wrong and irreversible
  - So the coordinator must do a GLOBAL ABORT
- Participant's view
  - Blocks in READY = the coordinator may have failed
  - Some participants may have already committed
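The coordinator's side of this reasoning can be sketched as follows, assuming votes arrive on an in-process `queue.Queue` as `(sender, vote)` pairs. The function names, the message strings, and the queue transport are illustrative assumptions; a real system would use network messages.

```python
import queue
import time


def collect_votes(inbox, expected, timeout_s):
    """Gather votes from `expected` participants until all arrive or time runs out."""
    votes = {}
    deadline = time.monotonic() + timeout_s
    while len(votes) < len(expected):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            sender, vote = inbox.get(timeout=remaining)
        except queue.Empty:
            break
        if sender in expected:
            votes[sender] = vote
    return votes


def coordinator_decide(votes, expected):
    """A missing vote is treated like VOTE_ABORT: only GLOBAL_ABORT is safe."""
    if any(p not in votes for p in expected):
        return "GLOBAL_ABORT"
    if all(votes[p] == "VOTE_COMMIT" for p in expected):
        return "GLOBAL_COMMIT"
    return "GLOBAL_ABORT"
```

A participant stuck in READY has no such easy way out, because it cannot tell whether a GLOBAL COMMIT has already been sent to others.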

Two Phase Commit (2PC)

Actions taken by a participant P when residing in state READY and having contacted another participant Q
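The table this caption belongs to is missing from the transcript. The sketch below encodes the rule as it is usually given in the textbook (Tanenbaum & Van Steen): P follows Q if Q has already decided, aborts if Q is still in INIT, and must ask someone else if Q is also in READY. The function names and the `CONTACT_ANOTHER` and `BLOCK` labels are assumptions made for this example.

```python
# State of Q -> action taken by P (which is in READY).
ACTION_BY_Q_STATE = {
    "COMMIT": "COMMIT",          # a global commit was decided; P follows
    "ABORT": "ABORT",            # a global abort was decided; P follows
    "INIT": "ABORT",             # Q never voted, so no commit can have happened
    "READY": "CONTACT_ANOTHER",  # Q knows no more than P does
}


def action_for_p(q_state):
    return ACTION_BY_Q_STATE[q_state]


def resolve(peer_states):
    """Ask peers in turn; block if every reachable peer is also in READY."""
    for q_state in peer_states:
        action = action_for_p(q_state)
        if action != "CONTACT_ANOTHER":
            return action
    return "BLOCK"
```

If every reachable participant is in READY, `resolve` returns `BLOCK`: nobody can decide until the coordinator recovers.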

Two Phase Commit – Bad State
- Yellow needs to wait for Blue or Green to come up again and inspect their log files.

Two Phase Commit (2PC)
- Two-Phase Commit has the problem that if the coordinator and one participant crash at a bad time, the entire system freezes until one of them is up again
- Getting a server up and running again typically involves human (a.k.a. very slow) intervention

Three Phase Commit
- Three-Phase Commit enhances Two-Phase Commit in that it is non-blocking in many more cases
- As long as the live participants can make a majority decision, they can continue on their own
  - Majority among all participants, not only the live ones
- If there are many participants, this makes it very unlikely that 3PC blocks
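The "majority among all" rule can be made concrete with a small check, assuming the total group size is known to every participant. The function name `has_majority` is hypothetical.

```python
def has_majority(live, total):
    """True if the live participants form a strict majority of ALL participants."""
    if not 0 <= live <= total:
        raise ValueError("live must be between 0 and total")
    return 2 * live > total
```

For example, with 5 participants in total, 3 live ones may continue on their own, while 2 of 4 may not, even though those 2 are all of the survivors.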

Three Phase Commit
- The states of the coordinator and each participant satisfy the following two conditions:
  1. There is no single state from which it is possible to make a transition directly to either a COMMIT or an ABORT state.
  2. There is no state in which it is not possible to make a final decision, and from which a transition to a COMMIT state can be made.
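The first condition can be checked mechanically on a transition graph. The sketch below assumes the usual textbook participant state machines, with PRECOMMIT as the extra 3PC state; the dictionaries and the function name are illustrative and not taken from the slides. The second condition is not checked here.

```python
# state -> set of states reachable in one transition (participant side)
TWO_PC_PARTICIPANT = {
    "INIT": {"READY", "ABORT"},
    "READY": {"COMMIT", "ABORT"},
    "COMMIT": set(),
    "ABORT": set(),
}

THREE_PC_PARTICIPANT = {
    "INIT": {"READY", "ABORT"},
    "READY": {"PRECOMMIT", "ABORT"},
    "PRECOMMIT": {"COMMIT"},
    "COMMIT": set(),
    "ABORT": set(),
}


def states_violating_condition_1(fsm):
    """Return states with direct transitions to both COMMIT and ABORT."""
    return sorted(
        state for state, successors in fsm.items()
        if {"COMMIT", "ABORT"} <= successors
    )
```

Under these assumptions, 2PC's READY state violates the condition, while the 3PC machine has no such state because PRECOMMIT sits between READY and COMMIT.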

Three Phase Commit

- So what if a failure occurs?
  - Need to be able to recover to a correct state
- Backward recovery
  - Bring the system backward to a correct, previous state (restore)
- Forward recovery
  - Bring the system forward to a correct, new state

