
Termination and Recovery

[Figure 5.4: Worst-case execution of the resilient termination protocol. The figure's table of messages received at each site over five rounds does not survive extraction. In each round, the only site in a committable (C) state fails after sending a single message (note 1), while every other message carried is non-committable (N); each failure forces another round, so the protocol runs one round per site.]

Termination and Recovery

The second issue can lead to very subtle problems. Again, consider the scenario where Site 1 sends a committable message to Site 2 and then crashes. Site 2 sends out non-committable messages, receives the committable message from Site 1, commits, and then promptly fails. Now Site 3 receives a single non-committable message (from Site 2). Assume that Site 3 was not aware that Site 1 was up at the beginning of the protocol (a reasonable assumption). Then Site 3 would not suspect that the messages it received were inconsistent with those received by Site 2, and it would make an inconsistent commit decision.
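The divergence in this scenario can be sketched concretely. A minimal illustration (the decide() rule and the sites' message views are assumptions for illustration, not the protocol's actual code):

```python
# Sketch of the inconsistent-decision scenario (illustrative assumptions only).

def decide(received):
    """A site commits if it saw any committable (C) message, else aborts."""
    return "commit" if "C" in received else "abort"

# Site 1 (committable) sends its C message to Site 2 only, then crashes.
# Site 2 receives that C plus non-committable (N) messages, commits, then fails.
site2_view = ["C", "N", "N"]

# Site 3 receives only the single N that Site 2 sent before failing.
site3_view = ["N"]

assert decide(site2_view) == "commit"
assert decide(site3_view) == "abort"   # inconsistent with Site 2's decision
```

The point is that each site's view is locally plausible; nothing in Site 3's input reveals that it is inconsistent with what Site 2 saw.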

Termination and Recovery

On the other hand, it is always safe to commit the transaction during a round of "committable" messages, even when additional site failures are detected, because in a progressive protocol an operational site in a committable state never moves to a non-committable state. Therefore, a failed site can never influence its committable cohorts. The protocol is summarized in Figure 5.5. Committing a transaction may take only a single message round; however, aborting a transaction normally requires at least two message rounds. The first round is needed to establish the set of operational sites, because in general a site is not certain which sites are currently up.
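The round structure described above can be sketched as a decision function. This is a sketch under stated assumptions: the function name, the "C"/"N" state encoding, and the return values are all illustrative, not the book's protocol:

```python
def termination_decision(my_state, operational_known, received):
    """Sketch of one round of the termination protocol.

    It is always safe to commit during a round of committable ("C")
    messages; aborting first requires a round to learn which sites are up.
    """
    if my_state == "C" and all(m == "C" for m in received):
        return "commit"                  # may finish in a single message round
    if not operational_known:
        return "need another round"      # round 1: establish operational sites
    return "abort"                       # round 2: abort once the live set is known

assert termination_decision("C", False, ["C", "C"]) == "commit"
assert termination_decision("N", False, ["N", "C"]) == "need another round"
assert termination_decision("N", True, ["N", "N"]) == "abort"
```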

Termination and Recovery

Recovery Protocol
Protocols run at a failed site to complete all transactions outstanding at the time of failure.

Classes of failures:
1. Site failure
2. Lost messages
3. Network partitioning
4. Byzantine failures

Effects of failures:
1. Inconsistent database
2. Transaction processing is blocked
3. Failed component unavailable

Termination and Recovery

Independent Recovery
A recovering site makes a transition directly to a final state without communicating with other sites.

Lemma: If a protocol contains a local state S_i whose concurrency set C(S_i) contains both an abort and a commit state, the protocol is not resilient to an arbitrary failure of a single site.
- S_i cannot recover to commit, because other sites may be in abort.
- S_i cannot recover to abort, because other sites may be in commit.

Rule 1: For each intermediate state S: if C(S) contains a commit state, assign a failure transition from S to commit; otherwise, assign a failure transition from S to abort.

Termination and Recovery

Rule 2: For each intermediate state S_i: if t_j ∈ S(S_i) (the sender set of S_i) and t_j has a failure transition to a commit (abort), then assign a timeout transition from S_i to a commit (abort).

Theorem: Rules 1 and 2 are sufficient for designing protocols resilient to a single site failure.

Proof sketch: Let p be a consistent protocol and p′ be p augmented with the failure and timeout transitions of Rules 1 and 2. Suppose site 1 fails in state S_1 while site 2 is in state S_2, with S_1 ∈ S(S_2). Site 1 takes the failure transition f_1 assigned by Rule 1, while site 2 times out and takes the transition f_2 assigned by Rule 2. By construction, Rule 2 chooses f_2 to follow the failure transitions of the states in S(S_2), so f_1 and f_2 reach the same final state; if they reached different final states, the resulting global state would be inconsistent.
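Rules 1 and 2 can be applied mechanically once the concurrency sets C(s) and sender sets S(s) are known. A minimal sketch; the state names q1, q2 and all the sets below are assumptions for illustration, not a real commit protocol:

```python
def rule1_failure_transitions(C):
    """Rule 1: from intermediate state s, fail to commit iff C(s)
    contains a commit state; otherwise fail to abort."""
    return {s: ("commit" if "commit" in C[s] else "abort") for s in C}

def rule2_timeout_transitions(S, failure_to):
    """Rule 2: if some t in the sender set S(s) has a failure transition
    to commit (abort), add a timeout transition from s to commit (abort)."""
    return {s: failure_to[t] for s in S for t in S[s] if t in failure_to}

# Hypothetical protocol fragment (all sets assumed for illustration):
C = {"q1": {"abort", "q1", "q2"},      # C(q1) has no commit -> fail to abort
     "q2": {"q1", "q2", "commit"}}     # C(q2) contains commit -> fail to commit
S = {"q1": {"q1"}, "q2": {"q2"}}       # each state expects a message from a
                                       # cohort in the same state (assumed)

failure_to = rule1_failure_transitions(C)
assert failure_to == {"q1": "abort", "q2": "commit"}
assert rule2_timeout_transitions(S, failure_to) == {"q1": "abort", "q2": "commit"}
```

The design point: a timed-out site decides exactly as the failed site it was waiting on would decide, so the two transitions can never disagree.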

Termination and Recovery

Theorem: There exists no protocol using independent recovery that is resilient to arbitrary failures by two sites.

Proof sketch: Consider a run of the protocol as a sequence of global states G_0 → G_1 → ... → G_m, where exactly one site makes a transition between consecutive states, G_0 is the first global state (from which every failed site recovers to abort), and G_m leads to commit. Let G_k be the first global state in which some site j, if it fails, recovers to commit; in G_0 through G_{k-1}, every failed site recovers to abort. Only site j makes a transition between G_{k-1} and G_k, so every other site is in the same local state in both, and any other site that fails in G_k still recovers to abort. Hence if site j and any other site both fail in G_k, site j recovers to commit while the other recovers to abort, which is inconsistent. The same argument applies for every other site.

Termination and Recovery

Theorem: There exists no protocol resilient to a network partitioning when messages are lost.

Rules 3 and 4: isomorphic to Rules 1 and 2, under the correspondence
  undelivered message ↔ timeout
  timeout ↔ failure

Theorem: Rules 3 and 4 are necessary and sufficient for making protocols resilient to a partition in a two-site protocol.

Theorem: There exists no protocol resilient to multiple partitions.

Termination and Recovery

Definition: A protocol is synchronous within one state transition if no site ever leads another site by more than one state transition.

Fundamental non-blocking theorem: A protocol is non-blocking iff:
1. No local state S has a concurrency set C(S) containing both an abort and a commit state.
2. No non-committable state S has a concurrency set C(S) containing a commit state.

Lemma: A protocol that is synchronous within one state transition is non-blocking iff:
1. No local state is adjacent to both a commit and an abort state.
2. No non-committable state is adjacent to a commit state.
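The theorem's two conditions can be checked mechanically given the concurrency sets. A minimal sketch; the state names and the concurrency sets used for 2PC are assumptions for illustration:

```python
def is_nonblocking(C, committable):
    """Fundamental non-blocking theorem (sketch): a protocol is
    non-blocking iff no concurrency set contains both abort and commit,
    and no non-committable state's concurrency set contains commit."""
    for s, conc in C.items():
        if "abort" in conc and "commit" in conc:
            return False                 # condition 1 violated
        if s not in committable and "commit" in conc:
            return False                 # condition 2 violated
    return True

# A 2PC participant's "ready" state can be concurrent with both commit
# and abort, so 2PC is blocking (sets assumed for illustration).
two_pc = {"initial": {"initial", "abort"},
          "ready": {"ready", "commit", "abort"}}
assert not is_nonblocking(two_pc, committable=set())

# A protocol whose uncertain states never coexist with commit passes
# both conditions (a 3PC-like fragment, again assumed for illustration).
three_pc_like = {"ready": {"ready", "precommit", "abort"},
                 "precommit": {"precommit", "commit"}}
assert is_nonblocking(three_pc_like, committable={"precommit"})
```

This is exactly why 3PC inserts the buffering precommit state: it breaks the adjacency between the uncertain state and commit that makes 2PC blocking.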