Outline Announcements Fault Tolerance.

Outline
Announcements. Fault Tolerance.

COP 5611 - Operating Systems, November 27, 2018
Announcements
Class evaluation will take place at the beginning of next class. Please come on time so that we still have enough time to cover the material we need to cover. Discussions: Homework #4 and Quiz #2. Decisions: Final exam, open book or closed book? Lab 2, extension? Quiz #3: a week from today.

Motivations
A system is fault-tolerant if it can mask failures, that is, it continues to perform its specified function in the event of a failure, mainly through redundancy. Alternatively, it is fault-tolerant if it exhibits a well-defined failure behavior in the event of a failure. Distributed commit is an example: either all sites commit a particular operation or none of them does.

Fault Tolerance Through Redundancy
The key approach to fault tolerance is redundancy. There are three kinds of redundancy: information redundancy, time redundancy, and physical redundancy. A system can have multiple processes, multiple hardware components, and multiple copies of data.

Failure Resilient Processes
A process is resilient if it masks failures and guarantees progress despite a certain number of system failures. Backup processes: in this approach, each resilient process is implemented by a primary process and one or more backup processes. The state of the primary process is saved at intervals. If the primary terminates, one of the backup processes becomes active and takes over.

Failure Resilient Processes (cont.)
Replicated execution: several processes execute the same program concurrently. This can increase reliability and availability. It requires that all requests be processed in the same order at all processes. Non-idempotent operations need special handling, because executing the same request twice changes the result.
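The two requirements above (same request order everywhere, care with non-idempotent operations) can be illustrated with a minimal sketch. The `Replica` class, the request format, and the use of request ids to detect retransmissions are assumptions made for illustration; the slides do not prescribe a mechanism.

```python
class Replica:
    """One copy of a replicated counter service (hypothetical example)."""

    def __init__(self):
        self.value = 0
        self.applied = set()  # ids of requests that were already executed

    def apply(self, request_id, op, arg):
        # "add" and "mul" are non-idempotent: a retransmitted request must
        # be recognized and skipped rather than executed a second time.
        if request_id in self.applied:
            return self.value
        self.applied.add(request_id)
        if op == "add":
            self.value += arg
        elif op == "mul":
            self.value *= arg
        return self.value


def deliver_in_order(replicas, requests):
    """Apply the same request sequence, in the same order, at every replica."""
    for request_id, op, arg in requests:
        for replica in replicas:
            replica.apply(request_id, op, arg)


replicas = [Replica() for _ in range(3)]
# Request "r2" appears twice, as if a client had retried it.
deliver_in_order(replicas, [("r1", "add", 5), ("r2", "mul", 2), ("r2", "mul", 2)])
values = [r.value for r in replicas]
```

Because every replica sees the add before the multiply and ignores the retried request, all copies end in the same state. Swapping the order at one replica would make it diverge.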

Distributed Commit
The distributed commit problem involves having an operation performed by every member of a process group or by none at all. This is referred to as global atomicity. Commit protocols: given that each site has a recovery strategy at the local level, commit protocols ensure that all sites either commit or abort the transaction unanimously, even in the presence of multiple and repeated failures.

One-Phase Commit Protocol
One site is designated as the coordinator. The coordinator tells all the other processes whether or not to locally perform the operation in question. This scheme, however, is not fault tolerant.

Two-Phase Commit Protocol
In this protocol, one of the processes acts as the coordinator. The other processes are referred to as cohorts. Cohorts are assumed to be executing at different sites. Stable storage is available at each site. The write-ahead log protocol is used. The protocol has two phases.

[Figures: Two-Phase Commit Protocol (cont.), three slides, one labeled "Coordinator"]

Two-Phase Commit Protocol (cont.)
Handling site failures. Suppose the coordinator crashes before having written the COMMIT record: on recovery, the coordinator broadcasts an ABORT message to all the cohorts. Suppose the coordinator crashes after writing the COMMIT record but before writing the COMPLETE record: on recovery, the coordinator broadcasts a COMMIT message. Suppose the coordinator crashes after writing the COMPLETE record: on recovery, there is nothing to be done for the transaction.
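The three recovery cases for the coordinator amount to a lookup on what its stable log contains. A minimal sketch, assuming the log is available as a collection of record names (the function name and log representation are hypothetical):

```python
def coordinator_recovery_action(log_records):
    """Return the message a recovering coordinator broadcasts, or None.

    log_records: names of the records found in the coordinator's stable log
    for one transaction (a hypothetical representation).
    """
    if "COMPLETE" in log_records:
        # The transaction finished before the crash: nothing to do.
        return None
    if "COMMIT" in log_records:
        # The decision was made but may not have reached every cohort.
        return "COMMIT"
    # No COMMIT record was written, so the transaction is aborted.
    return "ABORT"
```

The order of the checks matters: a log containing COMPLETE also contains COMMIT, so COMPLETE has to be tested first.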

Two-Phase Commit Protocol (cont.)
Handling site failures, continued. If a cohort crashes in Phase I, the coordinator aborts the transaction because it does not receive a reply from the crashed cohort. If a cohort crashes in Phase II (after writing its UNDO and REDO logs), then on recovery the cohort checks with the coordinator whether to abort or to commit the transaction.
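The cohort side can be sketched in the same way. The slide only states that a cohort that crashed in Phase II asks the coordinator; the Phase I branch below (abort locally when the UNDO and REDO logs were never written) is an assumption that follows from the coordinator aborting when a reply is missing. The function name and the `ask_coordinator` callable are hypothetical.

```python
def cohort_recovery_action(log_records, ask_coordinator):
    """Decide what a recovering cohort does with an in-flight transaction.

    log_records: names of records found in the cohort's stable log.
    ask_coordinator: hypothetical callable returning "COMMIT" or "ABORT".
    """
    wrote_logs = "UNDO" in log_records and "REDO" in log_records
    if not wrote_logs:
        # Crash in Phase I: the coordinator never got a reply and aborted.
        # Assumption: the cohort can therefore abort locally.
        return "ABORT"
    # Crash in Phase II: the cohort cannot tell the outcome from its own
    # log, so it has to check with the coordinator.
    return ask_coordinator()
```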

Two-Phase Commit Protocol (cont.)
Limitation: it is a blocking protocol. Whenever the coordinator fails, cohort sites have to wait for its recovery. This is undesirable because these sites may be holding locks on resources. The protocol cannot be used if transactions must be resilient to site failures. This motivates non-blocking commit protocols.

Non-blocking Commit Protocols
To be non-blocking in the event of site failures, operational sites should be able to agree on the outcome of the transaction by examining their local states. Failed sites, upon recovery, must reach the same conclusion about the outcome of the transaction as the operational sites do. Independent recovery refers to the situation in which recovering sites can decide the final outcome of the transaction based solely on their local state.

[Figures: Three-Phase Commit Protocol; Three-Phase Commit Protocol for Single Site Failure]

Three-Phase Commit Protocol (cont.)
Phase I is identical to that of the two-phase commit protocol, except in the event of a site failure: if a cohort fails, the coordinator times out waiting for the Agreed message, aborts the transaction, and sends Abort messages to all the cohorts. Phase II: the coordinator sends a Prepare message to all the cohorts if all of them sent an Agreed message in Phase I. Otherwise, it sends an Abort message.

Three-Phase Commit Protocol (cont.)
Phase III: on receiving acknowledgments to the Prepare messages from all the cohorts, the coordinator sends a Commit message to all the cohorts. On receiving a Commit message, a cohort commits the transaction.
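The three phases can be put together in a small, failure-free-network sketch of the coordinator. Everything here is illustrative: the `Cohort` class, the message names `COMMIT_REQUEST` and `ACK`, and the use of `None` to model a reply that never arrives are assumptions. The slides also do not say what the coordinator does when an acknowledgment to Prepare is missing; the sketch aborts in that case, which is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cohort:
    """Hypothetical in-memory cohort; a None reply models a failed site."""
    name: str
    vote: Optional[str] = "AGREED"  # reply to the commit request
    ack: Optional[str] = "ACK"      # reply to the Prepare message
    received: List[str] = field(default_factory=list)

    def send(self, message):
        self.received.append(message)
        if message == "COMMIT_REQUEST":
            return self.vote
        if message == "PREPARE":
            return self.ack
        return None


def broadcast(cohorts, message):
    return [cohort.send(message) for cohort in cohorts]


def three_phase_commit(cohorts):
    # Phase I: collect Agreed messages; a missing reply counts as a timeout.
    votes = broadcast(cohorts, "COMMIT_REQUEST")
    if any(vote != "AGREED" for vote in votes):
        broadcast(cohorts, "ABORT")
        return "ABORT"
    # Phase II: every cohort agreed, so send Prepare and wait for acks.
    acks = broadcast(cohorts, "PREPARE")
    if any(ack != "ACK" for ack in acks):
        # Assumption: a missing acknowledgment leads to an abort.
        broadcast(cohorts, "ABORT")
        return "ABORT"
    # Phase III: all acknowledgments arrived, so commit everywhere.
    broadcast(cohorts, "COMMIT")
    return "COMMIT"
```

For example, `three_phase_commit([Cohort("a"), Cohort("b")])` commits, while a cohort created with `vote=None` forces an abort at the end of Phase I.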

Three-Phase Commit Protocol (cont.)
Theoretical results. Rules 1 and 2 are sufficient for designing commit protocols resilient to a single site failure during a transaction. There exists no protocol using independent recovery that is resilient to arbitrary failures by two sites. There exists no protocol resilient to network partitioning when messages are lost. There exists no protocol resilient to multiple network partitioning.

Voting Protocols
Distributed commit protocols are resilient to single site failures, but they are not resilient to multiple site failures, communication failures, or network partitioning. Voting protocols are more fault tolerant: they allow data accesses under network failures, multiple site failures, and message losses without compromising the integrity of the data. The basic idea is that each replica is assigned some number of votes, and a majority of votes must be collected before a process can access a replica.

Static Voting
System model. The replicas of files are stored at different sites. Every file access operation requires that an appropriate lock be obtained. The lock rule allows either "one writer and no readers" or "multiple readers and no writers". Every file is associated with a version number, which indicates the number of times the file has been updated. Version numbers are stored on stable storage. Every write operation updates the version number.
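The lock rule and the version number can be sketched for a single replica as follows. The class and method names are hypothetical, and the sketch ignores stable storage and concurrency between threads; it only shows which lock requests the rule admits.

```python
class ReplicaFile:
    """One replica of a file with the static-voting lock rule (sketch)."""

    def __init__(self):
        self.version = 0     # number of times the file has been updated
        self.readers = 0     # read locks currently held
        self.writer = False  # whether the write lock is held

    def lock_read(self):
        # Multiple readers are allowed, but only while there is no writer.
        if self.writer:
            return False
        self.readers += 1
        return True

    def unlock_read(self):
        self.readers -= 1

    def lock_write(self):
        # One writer is allowed, and only while there are no readers.
        if self.writer or self.readers > 0:
            return False
        self.writer = True
        return True

    def unlock_write(self):
        self.writer = False

    def write(self):
        # Every write updates the version number.
        if not self.writer:
            raise RuntimeError("write lock not held")
        self.version += 1
```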

Static Voting (cont.)
Basic idea. Every replica is assigned a certain number of votes. This information is stored on stable storage. A read or write operation is permitted if the requesting process collects a certain number of votes, called the read quorum or the write quorum respectively.
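A minimal sketch of the quorum check, assuming votes are kept in a dictionary keyed by site name (the site names, vote counts, and function names are made up for illustration). The validity conditions `r + w > v` and `w > v / 2` are the standard constraints for weighted voting; they are not stated in this transcript and are included here as an assumption about what the omitted figure slides cover.

```python
def collected_votes(reachable_sites, votes):
    """Total votes held by the replicas the requester can reach."""
    return sum(votes[site] for site in reachable_sites)


def may_read(reachable_sites, votes, read_quorum):
    return collected_votes(reachable_sites, votes) >= read_quorum


def may_write(reachable_sites, votes, write_quorum):
    return collected_votes(reachable_sites, votes) >= write_quorum


def quorums_are_valid(votes, read_quorum, write_quorum):
    """Standard constraints (assumed, not taken from the transcript)."""
    total = sum(votes.values())
    # r + w > v: every read quorum overlaps every write quorum.
    # 2w > v: two write quorums always overlap, so writes cannot diverge.
    return read_quorum + write_quorum > total and 2 * write_quorum > total


# Hypothetical assignment: site A holds 2 votes, B and C hold 1 each.
votes = {"A": 2, "B": 1, "C": 1}
r, w = 2, 3
```

With this assignment, a partition containing only site A can still read (2 votes) but cannot write, while a partition containing A and B can do both.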

[Figures: Static Voting (cont.), three slides]

[Figures: Vote Assignment; Vote Assignment Examples]

Reliable Communication
In a system using replicated data, it is important that data managers behave identically. The data managers are therefore required to have an identical view of the events. Atomic broadcast provides this.

Summary
Fault tolerance means masking failures or behaving in a well-defined way when failures occur. The key approach to failure masking is redundancy, as in failure-resilient processes. Distributed commit protocols guarantee global atomicity: either all sites commit an operation or none of them does.