Chapter 8 – Fault Tolerance
Section 8.5 Distributed Commit
Heta Desai
Dr. Yanqing Zhang
Csc 8320 - Advanced Operating Systems
October 14th, 2015


Today’s Presentation Outline
• Background and terms
• Problem Formulation
• Two-Phase Commit
• Three-Phase Commit
• Future work

Background
• What is the problem?
• Atomic multicasting is one example of a more general problem known as distributed commit
• Why is it important?
• Atomic multicast ensures that:
  (1) The correct addressees of every message agree either to deliver it or not to deliver it
  (2) No two correct processes deliver any two messages in a different order

Distributed Commit
• Given a process group and an operation
• The operation might or might not be committable at all processes
• Either everybody eventually commits or everybody eventually aborts
• This includes servers that crash and come back to life
• Consistency: all nodes see the same data at the same time.
• Availability: node failures do not prevent survivors from continuing to operate.

Distributed Commit
• Can we not just do this with Virtual Synchrony?
• Coordinator multicasts vote request
• All processes respond to request
• Coordinator multicasts vote result
• COMMIT iff all vote COMMIT
• This handles some error cases
• But what if a participant B crashes after the votes for COMMIT are cast but before the COMMIT result is broadcast, and then comes back to life?
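The decision rule on this slide (COMMIT iff all vote COMMIT) can be sketched in a few lines of Python. This is only an illustration: the names `Vote` and `decide` are hypothetical, and treating an empty vote list as ABORT is an assumption, not something the slide states.

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "COMMIT"
    ABORT = "ABORT"


def decide(votes):
    """Return the global outcome: COMMIT only if every collected vote is COMMIT."""
    votes = list(votes)
    # Assumption: if no votes were collected at all, the safe outcome is ABORT.
    if votes and all(v is Vote.COMMIT for v in votes):
        return Vote.COMMIT
    return Vote.ABORT
```

A single ABORT vote is enough to abort the whole operation, which is why a participant that crashes around voting time is a problem: the coordinator cannot tell which way it voted.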

What can go wrong?

Two Phase Commit (2PC)

The finite state machine for the coordinator. The finite state machine for the participant
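The two state machines captioned above did not survive in this transcript. A minimal sketch of them as Python transition tables, assuming the state and message names used in Tanenbaum & Van Steen (INIT, WAIT, READY, VOTE_REQUEST, GLOBAL_COMMIT and so on), might look as follows. The event labels such as `all_vote_commit` and the helper `step` are hypothetical names chosen for this illustration.

```python
# Each entry maps (current_state, event) to (next_state, message_to_send).

COORDINATOR_2PC = {
    ("INIT", "start"): ("WAIT", "VOTE_REQUEST"),
    ("WAIT", "all_vote_commit"): ("COMMIT", "GLOBAL_COMMIT"),
    ("WAIT", "any_vote_abort"): ("ABORT", "GLOBAL_ABORT"),
}

PARTICIPANT_2PC = {
    ("INIT", "vote_request_accept"): ("READY", "VOTE_COMMIT"),
    ("INIT", "vote_request_refuse"): ("ABORT", "VOTE_ABORT"),
    ("READY", "GLOBAL_COMMIT"): ("COMMIT", "ACK"),
    ("READY", "GLOBAL_ABORT"): ("ABORT", "ACK"),
}


def step(table, state, event):
    """Look up one transition; reject events that are not legal in this state."""
    try:
        return table[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state} on {event}") from None
```

For example, `step(COORDINATOR_2PC, "INIT", "start")` moves the coordinator to WAIT and has it send VOTE_REQUEST. COMMIT and ABORT have no outgoing entries because they are final states.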

Two Phase Commit (2PC)
• Failures
  – Crash and omission
  – Detect via timeouts
• Processes may recover
  – Need for logging states
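The need for logging can be illustrated with a small sketch in which a process records each state before acting on it and, after a crash, resumes from the last logged state. The class `LoggedProcess` is hypothetical, and the in-memory list only stands in for stable storage such as a file on disk.

```python
class LoggedProcess:
    """A process that writes every state change to a log before acting on it."""

    def __init__(self, log=None):
        # Assumption: `log` stands in for stable storage that survives a crash.
        self.log = log if log is not None else []
        # On (re)start, resume from the last logged state, or INIT if none.
        self.state = self.log[-1] if self.log else "INIT"

    def move_to(self, new_state):
        # Log first, then change state, so the log never lags behind.
        self.log.append(new_state)
        self.state = new_state


# Simulated crash and recovery: only the log survives.
before_crash = LoggedProcess()
before_crash.move_to("READY")
recovered = LoggedProcess(log=before_crash.log)
```

After recovery the process knows it had already reached READY, so it must not unilaterally abort.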

Two Phase Commit (2PC) - Perspective
• Coordinator's view
  – Blocked in WAIT = a participant may have failed
  – That participant might have voted ABORT, in which case a GLOBAL COMMIT would be wrong and irreversible
  – So the coordinator must do a GLOBAL ABORT
• Participant's view
  – Blocked in READY = the coordinator may have failed
  – Some participants may have already committed
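The coordinator's side of this reasoning can be sketched with a timeout on vote collection. This is an illustration under assumptions: votes arrive on a `queue.Queue`, the function name `collect_votes` is hypothetical, and the timeout applies to each vote separately rather than to the whole round.

```python
import queue


def collect_votes(inbox, expected, timeout_s):
    """Coordinator in WAIT: decide the global outcome from incoming votes.

    Returns "GLOBAL_ABORT" as soon as a vote is missing or is VOTE_ABORT.
    """
    for _ in range(expected):
        try:
            vote = inbox.get(timeout=timeout_s)  # per-vote timeout
        except queue.Empty:
            # A participant may have failed, and it might have voted ABORT,
            # so committing would be unsafe.
            return "GLOBAL_ABORT"
        if vote == "VOTE_ABORT":
            return "GLOBAL_ABORT"
    return "GLOBAL_COMMIT"
```

The participant's side has no such easy way out: once in READY it cannot decide alone, because other participants may have already committed.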

Two Phase Commit (2PC)

Actions taken by a participant P when residing in state READY and having contacted another participant Q
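The table this caption refers to is missing from the transcript. The sketch below follows the corresponding table in Tanenbaum & Van Steen as I recall it (Q in COMMIT or ABORT: P follows; Q in INIT: P aborts; Q in READY: P contacts another participant). The function names and the `"BLOCK"` result are hypothetical labels for this illustration.

```python
def action_for_p(state_of_q):
    """Action taken by P (in READY) after learning the state of participant Q."""
    actions = {
        "COMMIT": "COMMIT",          # Q saw GLOBAL_COMMIT, so P can commit
        "ABORT": "ABORT",            # Q saw GLOBAL_ABORT, so P can abort
        "INIT": "ABORT",             # Q never voted, so no commit was decided
        "READY": "CONTACT_ANOTHER",  # Q knows no more than P does
    }
    return actions[state_of_q]


def resolve(states_of_others):
    """Ask the other participants in turn; block if all of them are in READY."""
    for state in states_of_others:
        action = action_for_p(state)
        if action != "CONTACT_ANOTHER":
            return action
    return "BLOCK"
```

If every reachable participant is in READY, P has to wait for the coordinator to recover, which is the blocking case of 2PC.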

Two Phase Commit – Bad State
• Yellow needs to wait for Blue or Green to come up again and inspect their log files!

Two Phase Commit (2PC)
• Two-Phase Commit has the problem that if the coordinator and one participant crash at a bad time, the entire system freezes until one of them is up again
• Getting a server up and running again typically involves human (a.k.a. very slow) intervention

Three Phase Commit
• Three-Phase Commit enhances Two-Phase Commit in that it is non-blocking in many more cases
• As long as the live participants can make a majority decision, they can continue on their own
• The majority is counted among all participants, not only the live ones
• If there are many participants, this makes it very unlikely that 3PC blocks
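The majority rule as stated on this slide can be written as a one-line check. This sketch only encodes the slide's wording; the function name `can_continue` is hypothetical.

```python
def can_continue(live, total):
    """True if the live participants form a strict majority of ALL participants."""
    if not 0 <= live <= total:
        raise ValueError("live must be between 0 and total")
    # Majority is measured against the full group, not only the survivors.
    return 2 * live > total
```

With 5 participants, 3 survivors can proceed but 2 cannot. With 4 participants, 2 survivors are not a majority.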

Three Phase Commit
• The states of the coordinator and each participant satisfy the following two conditions:
  – There is no single state from which it is possible to make a transition directly to either a COMMIT or an ABORT state.
  – There is no state in which it is not possible to make a final decision, and from which a transition to a COMMIT state can be made.
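The first condition can be checked mechanically on a transition graph. The sketch below assumes the textbook state names, including the extra PRECOMMIT state that 3PC adds between READY (or WAIT) and COMMIT. The slides themselves do not name that state, and the dictionaries and the function `violates_condition_1` are hypothetical.

```python
def violates_condition_1(next_states):
    """Return the states with direct transitions to both COMMIT and ABORT."""
    return sorted(
        state
        for state, successors in next_states.items()
        if {"COMMIT", "ABORT"} <= set(successors)
    )


# Direct successors of each non-final state.
PARTICIPANT_2PC_NEXT = {
    "INIT": ["READY", "ABORT"],
    "READY": ["COMMIT", "ABORT"],
}
PARTICIPANT_3PC_NEXT = {
    "INIT": ["READY", "ABORT"],
    "READY": ["PRECOMMIT", "ABORT"],
    "PRECOMMIT": ["COMMIT"],
}
COORDINATOR_3PC_NEXT = {
    "INIT": ["WAIT"],
    "WAIT": ["PRECOMMIT", "ABORT"],
    "PRECOMMIT": ["COMMIT"],
}
```

In 2PC the participant's READY state violates the condition, which is where blocking arises. In the 3PC tables no state does, because COMMIT is only reachable through PRECOMMIT.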

Three Phase Commit

• So what if a failure occurs?
  – Need to be able to recover to a correct state
• Backward recovery
  – Bring the system backward to a correct, previous state
  – Restore
• Forward recovery
  – Bring the system forward to a correct, new state
