DISTRIBUTED SYSTEMS II AGREEMENT - COMMIT (2-3 PHASE COMMIT) Prof Philippas Tsigas Distributed Computing and Systems Research Group.

"I can't find a solution, I guess I'm just too dumb." – Picture from Computers and Intractability, by Garey and Johnson

"I can't find an algorithm, because no such algorithm is possible." – Picture from Computers and Intractability, by Garey and Johnson

"I can't find an algorithm, but neither can all these famous people." – Picture from Computers and Intractability, by Garey and Johnson

Atomic commit protocols
 Transaction atomicity requires that, at the end, either all of a transaction's operations are carried out or none of them.
 In a distributed transaction, the client has requested operations at more than one server.
 One-phase atomic commit protocol:
– the coordinator tells the participants whether to commit or abort
– what is the problem with that?
– it does not allow one of the servers to decide to abort unilaterally; a server may have discovered a deadlock, or it may have crashed and been restarted
 Two-phase atomic commit protocol:
– designed to allow any participant to choose to abort a transaction
– phase 1: each participant votes; if it votes to commit, it is prepared and cannot change its mind; in case it crashes, it must save its updates in permanent storage
– phase 2: the participants carry out the joint decision; the decision may be commit or abort, and participants record it in permanent storage

Failure model for the commit protocols
 Failure model for transactions: this applies to the two-phase commit protocol.
 Commit protocols are designed to work in:
– a synchronous system: it is a system failure when a message does not arrive on time
– servers may crash, but a crashed server is replaced by a new process whose state is set from information saved in permanent storage and from information held by other processes
– messages may NOT be lost
– corrupt and duplicated messages are assumed to be removed
– no byzantine faults: servers either crash or they obey the requests sent to them
 2PC is an example of a protocol for reaching a consensus.
– Consensus cannot be reached in an asynchronous system if processes sometimes fail.
– However, 2PC does reach consensus under those conditions, because crash failures of processes are masked by replacing a crashed process with a new process whose state is set from information saved in permanent storage and from information held by other processes.

Operations for the two-phase commit protocol
Participant interface: canCommit?, doCommit, doAbort. Coordinator interface: haveCommitted, getDecision.
 canCommit?(trans) -> Yes / No
Call from coordinator to participant to ask whether it can commit a transaction. The participant replies with its vote. (A request with a reply.)
 doCommit(trans)
Call from coordinator to participant to tell the participant to commit its part of a transaction. (An asynchronous request, to avoid delays.)
 doAbort(trans)
Call from coordinator to participant to tell the participant to abort its part of a transaction. (An asynchronous request.)
 haveCommitted(trans, participant)
Call from participant to coordinator to confirm that it has committed the transaction. (An asynchronous request.)
 getDecision(trans) -> Yes / No
Call from participant to coordinator to ask for the decision on a transaction after it has voted Yes but has still had no reply after some delay. Used to recover from a server crash or delayed messages.
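These operations can be written down as a pair of interfaces. The following is a minimal Python sketch for illustration only; the class and method names beyond the five operations above are assumptions, not part of the protocol specification.

    # Hypothetical sketch of the 2PC operation interfaces.
    from abc import ABC, abstractmethod

    class Participant(ABC):
        @abstractmethod
        def can_commit(self, trans: str) -> bool:
            """canCommit?(trans) -> Yes/No: vote on the transaction.
            Before answering Yes the participant saves its updates in
            permanent storage (it is then 'prepared')."""

        @abstractmethod
        def do_commit(self, trans: str) -> None:
            """doCommit(trans): commit this participant's part (asynchronous)."""

        @abstractmethod
        def do_abort(self, trans: str) -> None:
            """doAbort(trans): abort this participant's part (asynchronous)."""

    class Coordinator(ABC):
        @abstractmethod
        def have_committed(self, trans: str, participant: Participant) -> None:
            """haveCommitted(trans, participant): confirmation of commit."""

        @abstractmethod
        def get_decision(self, trans: str) -> bool:
            """getDecision(trans) -> Yes/No: ask for the outcome after
            voting Yes and receiving no reply (crash/delay recovery)."""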

The two-phase commit protocol
Phase 1 (voting phase):
1. The coordinator sends a canCommit? request to each of the participants in the transaction.
2. When a participant receives a canCommit? request, it replies with its vote (Yes or No) to the coordinator. Before voting Yes, it prepares to commit by saving objects in permanent storage. If the vote is No, the participant aborts immediately.
Phase 2 (completion according to the outcome of the vote):
3. The coordinator collects the votes (including its own).
(a) If there are no failures and all the votes are Yes, the coordinator decides to commit the transaction and sends a doCommit request to each of the participants.
(b) Otherwise the coordinator decides to abort the transaction and sends doAbort requests to all participants that voted Yes.
4. Participants that voted Yes wait for a doCommit or doAbort request from the coordinator. When a participant receives one of these messages, it acts accordingly and, in the case of commit, makes a haveCommitted call as confirmation to the coordinator.
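Putting the two phases together, the coordinator's side of the protocol fits in a few lines. A minimal sketch, assuming the hypothetical Participant interface above; a real coordinator would also record each decision in permanent storage and apply timeouts to unresponsive participants.

    def run_two_phase_commit(coordinator_vote, participants, trans):
        """Sketch of the 2PC coordinator. Returns True on commit."""
        # Phase 1 (voting): collect a vote from every participant.
        votes = [p.can_commit(trans) for p in participants]
        # Phase 2 (completion): commit only if all votes, including
        # the coordinator's own, are Yes.
        if coordinator_vote and all(votes):
            for p in participants:
                p.do_commit(trans)
            return True
        # Otherwise abort. Only Yes voters need telling: a participant
        # that voted No has already aborted on its own.
        for p, vote in zip(participants, votes):
            if vote:
                p.do_abort(trans)
        return False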

Two-Phase Commit Protocol
[Figure: the coordinator sends canCommit? to each participant; participants answer yes or no; the coordinator then sends doCommit (all yes) or doAbort.]

TimeOut Protocol
At steps 2 and 3 no commit decision has been made yet, so it is OK to abort: the coordinator will either fail to collect all the commit votes or will itself vote for abort.

TimeOut Protocol
At step 4, a cohort that cannot communicate with the coordinator is uncertain:
– the coordinator may already have decided
– the cohort must block until communication is re-established
– it might ask the other cohorts for the decision
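The blocked cohort's behaviour can be expressed as a retry loop around getDecision. A sketch under the same assumptions as above (the retry interval and exception type are illustrative):

    import time

    def await_decision(coordinator, trans, retry_interval=1.0):
        """Sketch of a prepared cohort in the uncertain period.
        Having voted Yes it may neither commit nor abort on its own,
        so it blocks, asking for the decision until someone answers."""
        while True:
            try:
                return coordinator.get_decision(trans)  # True = commit
            except ConnectionError:
                # coordinator unreachable: stay blocked and retry
                # (a refined protocol would ask the other cohorts too)
                time.sleep(retry_interval)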

Restart Protocol
(Same message flow as the 2PC figure above.) If the recovering site:
– has decided, it just picks up from where it left off
– is a cohort that had not voted, it decides abort
– is a coordinator that had not decided, it decides abort
– is a cohort that crashed after voting commit, it must block until it discovers the decision

Blocking
Blocking can occur if, between steps 2 and 4:
– the coordinator crashes, or
– a cohort cannot communicate with the coordinator

Three-Phase Commit Protocol
[Figure: canCommit? / yes (a no vote leads to doAbort, as in 2PC); then preCommit / ack; then commit.]

Three-phase commit protocol
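The extra round can be sketched as follows, again on the hypothetical interfaces above, extended with an assumed pre_commit operation. The point of the preCommit/ack round is that no cohort commits while another is still uncertain, which removes the blocking window of 2PC at the price of more messages.

    def run_three_phase_commit(participants, trans):
        """Sketch of the 3PC coordinator; timeouts and acks elided."""
        # Phase 1: voting, exactly as in 2PC.
        if not all(p.can_commit(trans) for p in participants):
            for p in participants:
                p.do_abort(trans)
            return False
        # Phase 2: announce the tentative outcome and collect acks.
        # After this round every cohort knows that all votes were Yes.
        for p in participants:
            p.pre_commit(trans)   # assumed extension of Participant
        # Phase 3: actually commit.
        for p in participants:
            p.do_commit(trans)
        return True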

Performance of the two-phase commit protocol
 If there are no failures, 2PC involving N participants requires N canCommit? messages and N replies, followed by N doCommit messages.
– The cost in messages is proportional to 3N, and the cost in time is three rounds of messages. (For example, with N = 10 participants: 10 canCommit? messages, 10 votes and 10 doCommit messages, i.e. 30 messages in three rounds.)
– The haveCommitted messages are not counted.
 There may be arbitrarily many server and communication failures: 2PC is guaranteed to complete eventually, but it is not possible to specify a time limit within which it will be completed.
 Delays to participants in the uncertain state: some 3PC variants are designed to alleviate such delays, but they require more messages and more rounds for the normal case.

What to read from your book:
17.1 Introduction
17.2 Flat and nested distributed transactions
17.3 Atomic commit protocols
17.4 Concurrency control in distributed transactions
17.5 Distributed deadlocks
17.6 Transaction recovery

Two-phase commit protocol for nested transactions
 Recall Fig 13.1b: top-level transaction T and subtransactions T1, T2, T11, T12, T21, T22.
 A subtransaction starts after its parent and finishes before it.
 When a subtransaction completes, it makes an independent decision either to commit provisionally or to abort.
– A provisional commit is not the same as being prepared: it is a local decision and is not backed up in permanent storage.
– If the server crashes subsequently, its replacement will not be able to carry out a provisional commit.
 A two-phase commit protocol is needed for nested transactions:
– it allows servers of provisionally committed transactions that have crashed to abort them when they recover.

DISTRIBUTED SYSTEMS II REPLICATION Prof Philippas Tsigas Distributed Computing and Systems Research Group

Distributed Systems Course: Replication
18.1 Introduction to replication
18.2 System model and group communication
18.3 Fault-tolerant services
18.4 Highly available services
18.5 Transactions with replicated data

Introduction to replication
Replication of data: the maintenance of copies of data at multiple computers. Replication can provide:
 performance enhancement
– e.g. several web servers can have the same DNS name and the servers are selected in turn, to share the load
– replication of read-only data is simple, but replication of changing data has overheads
 fault-tolerant service
– guarantees correct behaviour in spite of certain faults (can include timeliness)
– if f of f+1 servers crash, then 1 remains to supply the service
– if f of 2f+1 servers have byzantine faults, then the remaining f+1 can outvote them and supply a correct service
 availability, which is hindered by
– server failures: replicate data at failure-independent servers, so that when one fails a client may use another
– network partitions and disconnected operation: users of mobile computers deliberately disconnect, and then on re-connection resolve conflicts

Availability is used for repairable systems
 It is the probability that the system is operational at any random time t.
 It can also be specified as the proportion of time that the system is available for use in a given interval (0, T).
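Written out (using MTTF and MTTR, mean time to failure and mean time to repair, as the assumed notation), the two readings are:

    A(0,T) = \frac{\text{up time in } (0,T)}{T},
    \qquad
    A = \lim_{T \to \infty} A(0,T) = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}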

Requirements for replicated data
 Replication transparency
– clients see logical objects (not several physical copies): they access one logical item and receive a single result
 Consistency
– specified to suit the application
– e.g. when a user of a diary disconnects, their local copy may become inconsistent with the others and will need to be reconciled when they connect again; but connected clients using different copies should get consistent results. These issues are addressed in Bayou and Coda.

A basic architectural model for the management of replicated data
[Figure 14.1: clients (C) send requests and receive replies via front ends (FE), which communicate with the replica managers (RM) that make up the service.]
 A collection of RMs provides a service to clients.
 Clients see a service that gives them access to logical objects, which are in fact replicated at the RMs.
 Clients request operations: those without updates are called read-only requests; the others are called update requests (they may include reads).
 Client requests are handled by front ends. A front end makes replication transparent.

System model
 Each logical object is implemented by a collection of physical copies called replicas.
– The replicas are not necessarily consistent all the time (some may have received updates that have not yet been conveyed to the others).
 We assume an asynchronous system where processes fail only by crashing, and we generally assume no network partitions.
 Replica managers
– an RM contains replicas on a computer and accesses them directly
– RMs apply operations to replicas recoverably, i.e. they do not leave inconsistent results if they crash
– objects are copied at all RMs, unless we state otherwise
– static systems are based on a fixed set of RMs
– in a dynamic system, RMs may join or leave (e.g. when they crash)
– an RM can be a state machine, which has the following properties:

State Machine: Semantic Characterization
 Outputs of a state machine are completely determined by the sequence of requests it processes, independent of time and of any other activity in the system.
 The characterization is deliberately vague about internal structure.

State Machine: Examples
A state machine – a memory server:

  Server:
    word store[N]
    read(int loc)            { send store[loc] to client }
    write(int loc, word val) { store[loc] = val }

  Client:
    memory.write(100, 4)
    memory.read(100)
    receive v from memory

Not a state machine – its output depends on the time the sensor is read, not only on the requests it processes:

  while true do
    read sensor
    q := compute adjustment
    send q to actuator
  end while

State Machine without Replication
[Figure: a single client sending requests to, and receiving responses from, a single state-machine server.]

Response Guarantees
1) Requests issued by a single client to a state machine are processed in the order issued (FIFO request delivery).
2) If request r to state machine s by client c1 could have caused request r' to s by client c2, then s processes r before r' (causal request delivery).

Requests are buffered until they become stable, and only then processed.

To make all replicas process the same sequence of requests:
1. Uniquely identify the requests.
2. Order the requests (without forgetting the guarantees that we expect).
3. Servers have to know when to service a request, i.e. when a request is stable.

When to process a request – Stability Detection
 3 methods:
– logical clocks
– real-time clocks
– server-generated ids

Logical Clocks
 Assign an integer T(e, p) to each event e at processor p:
– if e is an important (local) event, increment the local clock
– if e is the sending of a message, increment the local clock and attach it to the message
– if e is the receiving of a message with timestamp t, set the local clock to max(local clock, t) + 1
 Properties:
– for any two events, T(e, p) < T(e1, q) or vice-versa (ties are broken by processor id)
– if e could have caused e1, then T(e, p) < T(e1, q)
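A minimal sketch of these rules in Python (standard Lamport clocks; the (time, pid) pairs make the tie-breaking by processor id explicit, which is what the p < q < r ordering on the next slide refers to):

    class LamportClock:
        """Logical clock assigning T(e, p) to events at processor p."""
        def __init__(self, pid):
            self.pid = pid
            self.time = 0

        def local_event(self):
            self.time += 1                # important local event
            return (self.time, self.pid)

        def send_event(self):
            self.time += 1                # sending is an event
            return (self.time, self.pid)  # timestamp rides on the message

        def receive_event(self, msg_time):
            # advance past both the local clock and the message timestamp
            self.time = max(self.time, msg_time) + 1
            return (self.time, self.pid)

Comparing the (time, pid) tuples lexicographically gives a total order on requests that is consistent with causality.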

[Figure: timestamp ties are broken by processor identifier, with p < q < r.]

Synchronized Real-Time Clocks
– A message sent with uid t will be received no later than t + D by the receiver's local clock.
– The uids (clock readings) of different servers differ by at most D at any time.

Server-generated ids
 Clients first get an id from the server, then use that id to issue the request (the server acts as a sequencer).
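A sketch of the sequencer idea (all names are illustrative): the server hands out consecutive ids, and a request becomes stable exactly when every smaller id has already been processed.

    import itertools

    class Sequencer:
        """Server-generated ids: clients fetch an id, then attach it
        to their request; requests are served strictly in id order."""
        def __init__(self):
            self._ids = itertools.count(1)
            self._next = 1
            self._pending = {}        # id -> request, buffered until stable

        def new_id(self):
            return next(self._ids)

        def submit(self, req_id, request):
            self._pending[req_id] = request
            served = []
            while self._next in self._pending:   # drain the stable prefix
                served.append(self._pending.pop(self._next))
                self._next += 1
            return served             # requests that just became stable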

[Figures: the state machine replicated across servers, with a client's requests going to every replica and responses returned to the client.]

State Machine approach to Replication
Each RM:
 applies operations atomically
 has a state that is a deterministic function of its initial state and the operations applied to it
 starts identical to all other replicas and carries out the same sequence of operations
 must not have its operations affected by clock readings etc.

Replication
 Place a copy of the server state machine on multiple network nodes.
 How are requests communicated to the replicas? How are the replicas coordinated?
 Want:
– all replicas start in the same state
– all replicas receive the same set of requests
– all replicas process the same sequence of requests

Four phases in performing a request
 issue request: the FE either sends the request to a single RM, which passes it on to the others, or multicasts the request to all of the RMs
 coordination and agreement: the RMs decide whether to apply the request and decide on its ordering relative to other requests (according to FIFO, causal or total ordering); the RMs reach a consensus on the effect of the request (in Gossip, all RMs eventually receive updates)
 execution: the RMs execute the request (sometimes tentatively)
 response: one or more RMs reply to the FE, e.g. for high availability, give the first response to the client; to tolerate byzantine faults, take a vote
FIFO ordering: if an FE issues r then r', then any correct RM handles r before r'.
Causal ordering: if r happened-before r', then any correct RM handles r before r'.
Total ordering: if a correct RM handles r before r', then any correct RM handles r before r'.

Active replication for fault tolerance
 The RMs are state machines, all playing the same role and organised as a group.
– All start in the same state and perform the same operations in the same order, so that their state remains identical.
 If an RM crashes, it has no effect on the performance of the service, because the others continue as normal.
 It can tolerate byzantine failures, because the FE can collect and compare the replies it receives.
[Figure 14.5: clients (C) and front ends (FE), with each FE multicasting to the group of RMs.]
 An FE multicasts each request to the group of RMs; the RMs process each request identically and reply.
 This requires totally ordered reliable multicast, so that all RMs perform the same operations in the same order. What sort of system do we need to perform totally ordered reliable multicast?

Active replication – five phases in performing a client request
 Request: the FE attaches a unique id and uses totally ordered reliable multicast to send the request to the RMs. The FE can at worst crash; it does not issue requests in parallel.
 Coordination: the multicast delivers requests to all the RMs in the same (total) order.
 Execution: every RM executes the request. They are state machines and receive requests in the same order, so the effects are identical. The id is put in the response.
 Agreement: no agreement is required, because all RMs execute the same operations in the same order, due to the properties of the totally ordered multicast.
 Response: FEs collect responses from RMs. The FE may use one or more responses; if it is only trying to tolerate crash failures, it gives the client the first response.
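The FE side of these phases can be sketched as below (the multicast is simulated by calling every RM in the same order, which is the property a real totally ordered reliable multicast would guarantee; the RM interface is assumed). With crash failures only, the first reply suffices; to tolerate up to f byzantine RMs, the FE waits for f+1 identical replies.

    from collections import Counter

    def active_request(request, rms, f=0):
        """Sketch of an FE issuing one request under active replication."""
        replies = [rm.execute(request) for rm in rms]   # identical state machines
        if f == 0:
            return replies[0]        # crash faults only: first reply is correct
        # byzantine faults: accept a value only when f+1 RMs agree on it
        value, count = Counter(replies).most_common(1)[0]
        if count >= f + 1:
            return value
        raise RuntimeError("no value was reported by f+1 replica managers")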

The passive (primary-backup) model for fault tolerance
 There is at any time a single primary RM and one or more secondary (backup, slave) RMs.
 FEs communicate with the primary, which executes the operations and sends copies of the updated data to the backups.
 If the primary fails, one of the backups is promoted to act as the primary.
[Figure 14.4: clients (C) and front ends (FE) communicating with the primary RM, which propagates updates to the backup RMs.]
 The FE has to find the primary, e.g. after it crashes and another takes over.

Passive (primary-backup) replication – five phases in performing a client request
1. Request: an FE issues the request, containing a unique identifier, to the primary RM.
2. Coordination: the primary takes each request atomically, in the order in which it receives it relative to other requests. It checks the unique id; if it has already done the request, it re-sends the response.
3. Execution: the primary executes the request and stores the response.
4. Agreement: if the request is an update, the primary sends the updated state, the response and the unique identifier to all the backups. The backups send an acknowledgement.
5. Response: the primary responds to the FE, which hands the response back to the client.
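A sketch of the primary's side of these phases (class and method names are assumptions): the duplicate check in the coordination phase is what lets a retransmitted request be answered from the stored response rather than executed twice.

    class PrimaryRM:
        """Sketch of a passive-replication primary."""
        def __init__(self, state, backups):
            self.state = state
            self.backups = backups
            self.responses = {}       # unique request id -> stored response

        def handle(self, req_id, operation, is_update):
            if req_id in self.responses:        # coordination: already done?
                return self.responses[req_id]   # re-send the same response
            response = operation(self.state)    # execution
            if is_update:                       # agreement: update the backups
                for b in self.backups:
                    b.update(self.state, response, req_id)  # backups ack
            self.responses[req_id] = response
            return response                     # response phase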

Passive (primary-backup) replication (discussion)
 This system implements linearizability, since the primary sequences all the operations on the shared objects.
 If the primary fails, the system remains linearizable if a single backup takes over exactly where the primary left off, i.e.:
– the primary is replaced by a unique backup
– the surviving RMs agree on which operations had been performed at take-over
 View-synchronous group communication can achieve this:
– when the surviving backups receive a view without the primary, they use an agreed function to calculate which of them is the new primary
– the new primary registers with the name service
– view synchrony also allows the processes to agree on which operations were performed before the primary failed
– e.g. when an FE does not get a response, it retransmits the request to the new primary
– the new primary continues from phase 2 (coordination): it uses the unique identifier to discover whether the request has already been performed