
Consistency and Replication (3)

Topics Consistency protocols

Readings Van Steen and Tanenbaum: Section 6.5; Coulouris: Chapters 11 and 14

Introduction A consistency protocol describes an implementation of a specific consistency model. We will look at several architectures that can be used to support different consistency models, but first we present a basic architectural model.

A Basic Architectural Model for the Management of Replicated Data [Figure: clients (C) send requests and receive replies through front ends (FE), which forward them to the service's replica managers (RM).]

A Basic Architectural Model for the Management of Replicated Data A collection of replica managers provides a service to clients. The clients see a service that gives them access to objects (e.g., calendars or bank accounts) that are replicated. Each client's requests are handled by a component called a front end.

A Basic Architectural Model for the Management of Replicated Data The purpose of the front end is to hide the replication from the client process: client processes do not know how many replicas there are. A front end may be implemented in the client's address space, or it may be a separate process. The replicas coordinate with one another in order to execute each request consistently.

A Basic Architectural Model for the Management of Replicated Data Replica managers execute requests. One or more replicas may respond to the client (through the front end). A sketch of this model follows.
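
To make the front end's role concrete, here is a minimal sketch (the class and method names are mine, not from any particular system). Note that without a consistency protocol the replicas can diverge, which is exactly what the protocols below address.

```python
import random

class ReplicaManager:
    """A replica manager holding one copy of the service's objects."""
    def __init__(self):
        self.store = {}

    def read(self, key):
        return self.store.get(key)

    def write(self, key, value):
        self.store[key] = value

class FrontEnd:
    """Hides replication: the client sees one service, not N replicas."""
    def __init__(self, rms):
        self.rms = rms                      # the client never learns how many RMs exist

    def request(self, op, *args):
        rm = random.choice(self.rms)        # pick some RM to handle the request
        return getattr(rm, op)(*args)       # coordinating the RMs is the job of
                                            # the consistency protocols below

fe = FrontEnd([ReplicaManager(), ReplicaManager()])
fe.request("write", "x", 1)
print(fe.request("read", "x"))              # may print None or 1: no protocol yet!
```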

Primary-Based Protocols In primary-based protocols, each data item x in the data store has an associated primary, which is responsible for coordinating write operations on x. Primary-backup protocols:
- Read operations are performed on a locally available copy.
- Write operations are done at a fixed primary copy.
- The primary performs the update on its local copy of x and then forwards the update to all the other replicas (which are considered to be backups).

Primary-Based Protocols Primary-backup protocols (cont.):
- Each backup server performs the update as well and sends an acknowledgement back to the primary.
- When all backup servers have updated their local copies, the primary sends an acknowledgement back to the initiating process. This implements sequential consistency (see the sketch below).
- The primary RM is a performance bottleneck.
- The scheme can tolerate F failures given F+1 RMs.
- Sun NIS (yellow pages) uses passive replication: clients can contact primary or backup servers for reads, but only primary servers for updates.
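
A minimal sketch of the primary-backup write path described above, assuming synchronous, reliable delivery between primary and backups; the class and method names are illustrative, not from the source.

```python
class Replica:
    """A backup replica manager holding a local copy of each data item."""
    def __init__(self):
        self.store = {}

    def apply_update(self, key, value):
        self.store[key] = value
        return "ack"                      # the backup acknowledges to the primary

class Primary(Replica):
    """The primary coordinates all writes on the items it owns."""
    def __init__(self, backups):
        super().__init__()
        self.backups = backups

    def write(self, key, value):
        self.apply_update(key, value)     # 1. update the primary's local copy
        for b in self.backups:            # 2. forward the update to every backup
            assert b.apply_update(key, value) == "ack"
        return "ack"                      # 3. ack the client only after all backups did

backups = [Replica(), Replica()]
primary = Primary(backups)
primary.write("x", 42)                    # sequentially consistent write
print(backups[0].store["x"])              # reads go to any local copy -> 42
```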

The Primary-Backup Protocol [Figure: clients (C) send requests through front ends (FE) to the primary RM, which propagates updates to the backup RM.]

Replicated-Write Protocols In replicated-write protocols, write operations can be carried out at multiple replicas instead of only one (as in primary-based protocols). Operations need to be carried out in the same order everywhere. We discussed one approach for doing so that uses Lamport's timestamps. Using Lamport timestamps does not scale well in large distributed systems.

Replicated-Write Protocols An alternative approach to achieving total order is to use a central coordinator, sometimes called a sequencer (see the sketch below):
- Forward each operation to the sequencer.
- The sequencer assigns a unique sequence number and subsequently forwards the operation to all replicas.
- Operations are carried out in the order of their sequence number.
- Hmm. This resembles primary-based consistency protocols. Useful for sequential consistency.
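
A sketch of sequencer-based total ordering under the assumption of reliable delivery; every replica applies operations strictly in sequence-number order, holding back any operation that arrives early. Names are mine.

```python
import itertools

class Sequencer:
    """Assigns a globally unique, increasing sequence number to each operation."""
    def __init__(self, replicas):
        self.counter = itertools.count(1)
        self.replicas = replicas

    def submit(self, op):
        seq = next(self.counter)          # unique sequence number for this operation
        for r in self.replicas:
            r.enqueue(seq, op)            # forward to all replicas

class Replica:
    """Applies operations strictly in sequence-number order."""
    def __init__(self):
        self.pending = {}                 # seq -> op, held back until its turn
        self.next_seq = 1
        self.log = []

    def enqueue(self, seq, op):
        self.pending[seq] = op
        while self.next_seq in self.pending:   # apply in total order
            self.log.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

replicas = [Replica(), Replica()]
seq = Sequencer(replicas)
seq.submit("w(x)=1")
seq.submit("w(x)=2")
print(replicas[0].log == replicas[1].log)      # True: same order everywhere
```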

Replicated-Write Protocols The use of a sequencer does not solve the scalability problem. A combination of Lamport timestamps and sequencers may be necessary. The approach is summarized as follows:
- Each process has a unique identifier pi and keeps a sent-message counter ci. The process identifier and message counter uniquely identify a message.
- Active processes (sequencers) keep an extra counter ti, called the ticket number. A ticket is a triplet (pi, ti, (pj, cj)).

Replicated-Write Protocols Approach summary (cont.):
- An active process issues tickets for its own messages and for messages from its associated passive processes (processes that are not sequencers).
- Passive processes multicast their messages to all group processes, which then wait for a ticket stating the total order of each message.
- The ticket is sent by each passive process's sequencer.
- Lamport's totally ordered multicast algorithm is used among the sequencers to determine the order of update operations.
- When an operation is allowed, each sequencer sends the ticket to its associated passive processes. It is assumed that the passive processes receive these tickets in the order sent.

Replicated-Write Protocols Approach summary (cont.):
- If a sequencer terminates abnormally, then one of the passive processes associated with it can become the new sequencer.
- An election algorithm may be used to choose the new sequencer.

Replicated-Write Protocols Let's say that we have 6 processes: p1, p2, p3, p4, p5, p6. Assume that p1 and p2 are sequencers; p3 and p4 are associated with p1, and p5 and p6 are associated with p2. Let's say that p3 sends a message identified by (p3, 1). p1 generates a ticket as follows: (p1, 1, (p3, 1)). The ticket number is generated using the Lamport clock algorithm.

Replicated-Write Protocols Let's say that p5 sends a message identified by (p5, 1). p2 generates a ticket as follows: (p2, 1, (p5, 1)). Which update gets done first? Basically, p1 and p2 will apply Lamport's algorithm for totally ordered multicast. When an update operation is allowed to proceed, the sequencers send messages to their associated processes.
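
To pin down the ticket format used in the example above, here is a small sketch of ticket issuing. The (pi, ti, (pj, cj)) triplet follows the slides; the class names are mine.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MsgId:
    sender: str      # p_j, the originating (passive) process
    counter: int     # c_j, its sent-message counter

@dataclass(frozen=True)
class Ticket:
    issuer: str      # p_i, the sequencer that issued the ticket
    number: int      # t_i, drawn from the sequencer's Lamport clock
    msg: MsgId       # the message this ticket orders

class ActiveProcess:
    """A sequencer that issues tickets for its passive processes' messages."""
    def __init__(self, pid):
        self.pid = pid
        self.clock = 0                    # Lamport clock driving ticket numbers

    def issue(self, msg: MsgId) -> Ticket:
        self.clock += 1
        return Ticket(self.pid, self.clock, msg)

p1 = ActiveProcess("p1")
print(p1.issue(MsgId("p3", 1)))           # the ticket (p1, 1, (p3, 1))
```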

Gossip Architecture We have just studied some architectures for sequential consistency. What about causal consistency? The gossip architecture supports causally consistent lazy replication, which in essence respects the potential causality between read and write operations. Clients are allowed to communicate with each other, but they then have to exchange information on the operations they have performed on the data store. Between replicas, this exchange of information is done through gossip messages.

Gossip Architecture Each RMi maintains for its local copy the vector timestamp VAL(i):
- VAL(i)[i]: the total number of completed write requests that have been sent from a client to RMi.
- VAL(i)[j]: the total number of completed write requests that have been sent from RMj to RMi.
- This is referred to as the value timestamp; it reflects the updates that have been completed at the replica.
- This timestamp is attached to the reply of a read operation.

Gossip Architecture Each RMi maintains for its local copy the vector timestamp WORK(i), which represents the write operations that have been received (but not necessarily processed) at RMi:
- WORK(i)[i]: the total number of write requests that have been sent from a client to RMi, including those that have been completed by RMi.
- WORK(i)[j]: the total number of write requests that have been sent from RMj to RMi, including those that have been completed by RMi.
- This is referred to as the replica timestamp.
- This timestamp is attached to the reply of a write operation.

Gossip Architecture Each client keeps track of the writes that it has seen so far. A client C maintains a vector timestamp LOCAL(C), with LOCAL(C)[i] set to the most recent count of the writes seen at RMi (from C's viewpoint). This vector timestamp is attached to every request sent to a replica. Note that the client can contact a different replica each time it wants to read or write data. Two front ends may exchange messages directly; these messages also carry the timestamp represented by LOCAL(C).
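
The protocol below repeatedly compares and merges these vector timestamps. A minimal sketch of the two operations involved (the function names covered and merge are mine):

```python
def covered(dep, val):
    """Stability check: dep[j] <= val[j] for all j, i.e. the replica has
    already seen every write the request depends on."""
    return all(d <= v for d, v in zip(dep, val))

def merge(a, b):
    """Component-wise maximum: the result reflects both histories."""
    return [max(x, y) for x, y in zip(a, b)]

print(covered([1, 0, 0], [1, 0, 1]))   # True: the request can be processed
print(covered([1, 0, 2], [1, 0, 1]))   # False: the request must be held back
print(merge([1, 0, 0], [0, 0, 2]))     # [1, 0, 2]
```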

Gossip Architecture Write log (queue):
- Every write operation, when received by a replica, is recorded in the update log of the replica.
- There are two reasons for this: the update cannot be applied yet (it is held back), and it is uncertain whether the update has been received by all replicas.
- The entries are sorted by timestamp. A similar log is needed for read operations; this is referred to as the read log (or queue).

Gossip Architecture The executed-operation table:
- The same write operation may arrive at a replica both from a front end and in a gossip message from another replica.
- To prevent an update from being applied twice, the replica keeps a list of identifiers of the write operations that have been applied so far.

Gossip Architecture Processing a read request R from C (see the sketch below):
- Let DEP(R) be the timestamp associated with R. It is set to LOCAL(C).
- The request is sent to RMi (with DEP(R)), which stores the request in its read queue.
- The read request is processed if DEP(R)[j] <= VAL(i)[j] (for all j). This indicates that RMi has seen the same writes as the client.
- As soon as the read operation can be carried out, RMi returns the value of the requested data item to the client, along with VAL(i).
- LOCAL(C)[j] is adjusted to max{LOCAL(C)[j], VAL(i)[j]} for all j. This makes sense, since the value returned by the read is potentially the cumulative result of all previous writes.
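
A sketch of the read path just described, assuming a single data item per replica for brevity; covered implements the DEP(R) <= VAL(i) check, and all names are illustrative.

```python
def covered(dep, val):
    return all(d <= v for d, v in zip(dep, val))

def merge(a, b):
    return [max(x, y) for x, y in zip(a, b)]

class ReplicaManager:
    def __init__(self, n, i):
        self.i = i
        self.value = None                 # a single data item, for brevity
        self.VAL = [0] * n                # value timestamp: writes applied here

    def read(self, dep):
        """DEP(R) = LOCAL(C); reply only once the replica has caught up."""
        if not covered(dep, self.VAL):    # client saw writes we haven't applied:
            return None                   # hold the request in the read queue
        return self.value, list(self.VAL) # the reply carries VAL(i)

rm = ReplicaManager(3, 0)
rm.value, rm.VAL = 42, [1, 0, 0]          # replica 0 after one applied write
local = [0, 0, 0]                         # the client's LOCAL(C)
result = rm.read(local)
if result is not None:                    # otherwise: retry after gossip
    value, val_i = result
    local = merge(local, val_i)           # LOCAL(C)[j] = max{LOCAL(C)[j], VAL(i)[j]}
    print(value, local)                   # 42 [1, 0, 0]
```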

Gossip Architecture [Figure: performing a read operation at a local copy.]

Gossip Architecture Processing a write operation W from C:
- Let DEP(W) be the timestamp associated with W. It is set to LOCAL(C).
- When the request is received by RMi, it increments WORK(i)[i] by 1 but leaves the other entries intact. This is done so that WORK reflects that RMi has received the latest write request; at this point it is not yet known whether the write can be carried out.
- A timestamp ts(W) is derived from DEP(W) by setting ts(W)[i] to WORK(i)[i]; the other entries are as found in DEP(W).
- This timestamp is sent back as an acknowledgement to the client, which subsequently adjusts LOCAL(C) by setting each kth entry to max{LOCAL(C)[k], ts(W)[k]}.

Gossip Architecture Processing write operations (cont.):
- The write request W is processed if DEP(W)[j] <= VAL(i)[j] (for all j). This indicates that RMi has seen the same writes as the client. This is referred to as the stability condition.
- The write operation then takes place.
- What if there exists a j such that DEP(W)[j] > VAL(i)[j]? This would indicate that there was a write seen by the client that has not yet been seen by RMi.

Gossip Architecture Processing write operations (cont., see the sketch below):
- VAL(i) is adjusted by setting each jth entry to max{VAL(i)[j], ts(W)[j]}. Recall that ts(W)[j] is set to DEP(W)[j] for all j != i and to WORK(i)[i] for j = i (which had been incremented upon receiving the write request); the end result is that VAL(i)[i] is incremented by 1.
- The following two conditions are satisfied:
- All operations sent directly to RMi from other clients that preceded W have been processed: ts(W)[i] = VAL(i)[i] + 1.
- All write operations that W depends on have been processed: ts(W)[j] <= VAL(i)[j] for all j != i.
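
A sketch of the write path: receipt (WORK(i)[i] incremented, ts(W) derived and acked) followed by applying the writes that satisfy the stability condition. Names are mine, and per-item state is again elided for brevity.

```python
def covered(dep, val):
    return all(d <= v for d, v in zip(dep, val))

class ReplicaManager:
    def __init__(self, n, i):
        self.i = i
        self.VAL = [0] * n               # completed writes (value timestamp)
        self.WORK = [0] * n              # received writes (replica timestamp)
        self.log = []                    # held-back writes: (ts, dep, op)

    def receive_write(self, dep, op):
        """DEP(W) = LOCAL(C); log the write and ack with ts(W)."""
        self.WORK[self.i] += 1           # one more write received from a client
        ts = list(dep)
        ts[self.i] = self.WORK[self.i]   # ts(W): DEP(W) with the ith entry replaced
        self.log.append((ts, dep, op))
        return ts                        # the ack carries ts(W) back to the client

    def apply_stable(self):
        for entry in list(self.log):
            ts, dep, op = entry
            if covered(dep, self.VAL):   # stability condition: DEP(W) <= VAL(i)
                self.VAL = [max(v, t) for v, t in zip(self.VAL, ts)]
                self.log.remove(entry)

rm = ReplicaManager(3, 0)
ts = rm.receive_write([0, 0, 0], "w(x)=1")
rm.apply_stable()
print(ts, rm.VAL)                        # [1, 0, 0] [1, 0, 0]
```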

Gossip Architecture [Figure: performing a write operation at a local copy.]

Gossip Architecture When RMj receives a gossip message from RMi, it does the following (see the sketch below):
- RMj adjusts WORK(j) by setting each kth entry to max{WORK(i)[k], WORK(j)[k]}.
- RMj merges the write operations sent by RMi with its own.
- RMj applies those writes that have become stable, i.e., a write request W is processed once DEP(W)[k] <= VAL(j)[k] for all k. A write that originated at RMi and is processed at RMj causes VAL(j)[i] to be incremented by 1.
- A gossip message need not contain the entire log if it is certain that some of the updates have already been seen by the receiving replica.
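
A sketch of gossip processing under these rules, replaying part of the example that follows: replica 2 holds W2 back until gossip from replica 0 delivers W0. Names are mine.

```python
def covered(dep, val):
    return all(d <= v for d, v in zip(dep, val))

def merge(a, b):
    return [max(x, y) for x, y in zip(a, b)]

class ReplicaManager:
    def __init__(self, n, i):
        self.i = i
        self.VAL, self.WORK = [0] * n, [0] * n
        self.log = []                     # held-back writes: (ts, dep, op)
        self.applied = set()              # executed-operation table, keyed by ts

    def gossip_from(self, other):
        self.WORK = merge(self.WORK, other.WORK)   # absorb the sender's WORK
        for entry in other.log:                    # merge the sender's log with ours
            if entry not in self.log:
                self.log.append(entry)
        changed = True
        while changed:                    # applying one write may unblock another
            changed = False
            for ts, dep, op in self.log:
                if tuple(ts) not in self.applied and covered(dep, self.VAL):
                    self.VAL = merge(self.VAL, ts)
                    self.applied.add(tuple(ts))    # never apply the same write twice
                    changed = True

# Replica 2 holds W2 back until gossip from replica 0 delivers W0.
rm0 = ReplicaManager(3, 0)
rm0.VAL, rm0.WORK = [1, 0, 0], [1, 0, 0]
rm0.log = [([1, 0, 0], [0, 0, 0], "W0")]
rm2 = ReplicaManager(3, 2)
rm2.VAL, rm2.WORK = [0, 0, 1], [0, 0, 2]
rm2.log = [([1, 0, 2], [1, 0, 0], "W2")]
rm2.gossip_from(rm0)
print(rm2.VAL)                            # [1, 0, 2]: W0 and then W2 became stable
```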

Gossip Architecture (Example) Initial state: replicas 0, 1, and 2 each have VAL = (0,0,0) and WORK = (0,0,0); clients 0 and 1 each have LOCAL = (0,0,0).

Gossip Architecture (Example) Client 0 sends a write W0 to replica 0 with DEP(W0) = (0,0,0); all replica timestamps are still (0,0,0).

Gossip Architecture (Example) Replica 0 logs W0 with DEP(W0) = (0,0,0) and ts(W0) = (1,0,0), and updates WORK from (0,0,0) to (1,0,0); VAL is still (0,0,0).

Gossip Architecture (Example) Client 0 receives ack(ts(W0)) from replica 0 for its write; LOCAL changes from (0,0,0) to (1,0,0).

Gossip Architecture (Example) W0 is applied at replica 0 since DEP(W0) <= VAL; VAL at replica 0 changes to (1,0,0).

Gossip Architecture (Example) State after client 1 sends a write W1 to replica 2: replica 2 has DEP(W1) = (0,0,0) and ts(W1) = (0,0,1), with VAL = (0,0,1) and WORK = (0,0,1); client 1's LOCAL is (0,0,1).

Gossip Architecture (Example) Client 0 sends a write W2 to replica 2 with DEP(W2) = (1,0,0). Replica 2 assigns ts(W2) = (1,0,2) and updates WORK to (0,0,2), but W2 cannot be applied yet, since replica 2 has not seen the write done at replica 0 (DEP(W2) is not <= VAL = (0,0,1)).

Gossip Architecture (Example) An ack(ts(W2)) is returned to client 0, which then updates LOCAL from (1,0,0) to (1,0,2).

Gossip Architecture (Example) Replicas 0 and 2 exchange update-propagation (gossip) messages; WORK at both replicas is adjusted to (1,0,2).

Gossip Architecture (Example) Replica 0 has one write operation (W0) in its log; it is sent to replica 2 with DEP(W0). Replica 2 has write operation W1, which is sent to replica 0 with DEP(W1); replica 2 also sends W2 with DEP(W2).

Gossip Architecture (Example) Replica 2 can carry out W0 since DEP(W0) <= VAL; replica 0 can carry out W1 since DEP(W1) <= VAL.

Gossip Architecture (Example) VAL at replicas 0 and 2 is updated to (1,0,1).

Gossip Architecture (Example) W2 can now be executed at replica 2 since DEP(W2) <= VAL; W2 can also be applied at replica 0. Afterwards, VAL at both replicas is (1,0,2).

Summary There are good reasons to introduce replication. However, replication introduces consistency problems, and keeping replicas strictly consistent may severely degrade performance, especially in large-scale systems. Thus consistency is often relaxed. We have studied consistency models and the protocols that implement them.