Slides for Chapter 15: Replication
From Coulouris, Dollimore and Kindberg Distributed Systems: Concepts and Design Edition 4, © Addison-Wesley 2005
Outline
• Introduction
• System model and group communication
• Fault-tolerant services
• Case studies of highly available services: the gossip architecture, Bayou and Coda
• Transactions with replicated data
• Summary
Instructor's Guide for Coulouris, Dollimore and Kindberg Distributed Systems: Concepts and Design Edn. 4, © Pearson Education 2005
Introduction
Replication: the maintenance of copies of data at multiple computers. It is a key to the effectiveness of distributed systems because it can:
• Enhance performance
• Increase availability
• Provide fault tolerance
System model and group communication
System model: five phases are involved in the performance of a single request upon replicated objects:
• Request: the front end issues the request to one or more replica managers
• Coordination: the replica managers agree on an order for the request (FIFO ordering, causal ordering or total ordering)
• Execution
• Agreement
• Response
Group communication
Group communication involves two highly interrelated services:
• Multicast communication
• Group membership management
Figure 15.1 A basic architectural model for the management of replicated data
Clients (C) send requests and receive replies through front ends (FE), which communicate with one or more of the replica managers (RM) and make replication transparent.
Figure 15.2 Services provided for process groups
The figure shows group membership management (join, leave, fail) and multicast communication (send, group address expansion).
A process outside the group can send a message to the group without knowing the group's membership. The group communication service has to manage changes in the group's membership while multicasts take place concurrently.
The role of the group membership service
• Providing an interface for group membership changes: create or destroy groups, add or withdraw a process from a group
• Implementing a failure detector: exclude a process if it is suspected to have failed or to have become unreachable
• Notifying members of group membership changes
• Performing group address expansion: expanding a group identifier into the current group membership for delivery
IP multicast is a weak case of a group membership service. A full group membership service maintains group views: ordered lists of the current group members, each with a unique process identifier.
View delivery
For each group g, the group membership service delivers to any member process a series of views; for example, a member delivers a new view when a membership change occurs. An event is said to occur in a view v(g) at process p if, at the time of the event's occurrence, p has delivered v(g) but has not yet delivered the next view.
Some basic requirements for view delivery:
• Order: if a process p delivers view v(g) and then view v'(g), then no other process q ≠ p delivers v'(g) before v(g)
• Integrity: if process p delivers view v(g), then p ∈ v(g)
• Non-triviality: if process q joins a group and is or becomes indefinitely reachable from process p ≠ q, then eventually q is always in the views that p delivers
View-synchronous group communication
Guarantees provided by view-synchronous group communication:
• Agreement: correct processes deliver the same sequence of views (starting from the view in which they join the group), and the same set of messages in any given view
• Integrity: if a correct process p delivers message m, then it will not deliver m again; furthermore, p ∈ group(m) and the process that sent m is in the view in which p delivers m
• Validity (closed groups): correct processes always deliver the messages that they send
Figure 15.3 View-synchronous group communication
The figure shows processes p, q and r moving from view (p, q, r) to view (q, r) after p crashes: cases (a) and (b) are allowed, while cases (c) and (d) are disallowed. The scenario: p sends a message m while in view (p, q, r) but crashes soon after sending m, while q and r remain correct.
Fault-tolerant services
Correctness criteria for replicated objects: linearizability. A replicated shared object service is linearizable if, for any execution, there is some interleaving of the series of operations issued by all the clients that satisfies the following two criteria:
• The interleaved sequence of operations meets the specification of a (single) correct copy of the objects
• The order of operations in the interleaving is consistent with the real times at which the operations occurred in the actual execution
Example of an execution that is not linearizable:
Client 1: setBalanceB(x, 1)
Client 2: setBalanceA(y, 2); getBalanceA(y) → 2; getBalanceA(x) → 0
Here getBalanceA(x) returns 0 even though setBalanceB(x, 1) has already completed in real time, so no interleaving consistent with real time meets the single-copy specification.
Sequential consistency
A replicated shared object service is sequentially consistent if, for any execution, there is some interleaving of the series of operations issued by all the clients that satisfies the following two criteria:
• The interleaved sequence of operations meets the specification of a (single) correct copy of the objects
• The order of operations in the interleaving is consistent with the program order in which each individual client executed them
Example of an execution that is sequentially consistent but not linearizable:
Client 1: setBalanceB(x, 1), then later setBalanceA(y, 2)
Client 2: getBalanceA(y) → 0; getBalanceA(x) → 0
The interleaving getBalanceA(y) → 0; getBalanceA(x) → 0; setBalanceB(x, 1); setBalanceA(y, 2) satisfies both criteria, even though it contradicts the real-time order.
Passive (primary-backup) replication
• At any time there is a single primary replica manager and one or more secondary replica managers (backups)
• Front ends communicate only with the primary replica manager
• The primary replica manager executes the operations and sends copies of the updated data to the backups
• If the primary fails, one of the backups is promoted to act as the primary
This system obviously implements linearizability if the primary is correct. Why?
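As a rough illustration (not from the slides), the primary-backup flow can be sketched in Python; the class and method names here are invented for the example:

```python
class BackupRM:
    """A backup replica manager: it only applies state updates from the primary."""
    def __init__(self):
        self.state = {}

    def apply(self, update):
        key, value = update
        self.state[key] = value
        return "ack"

class PrimaryRM:
    """The primary executes operations, then pushes the resulting state
    change to every backup before replying to the front end."""
    def __init__(self, backups):
        self.state = {}
        self.backups = backups

    def set_balance(self, account, amount):
        self.state[account] = amount                  # execute at the primary
        for b in self.backups:                        # agreement: update all backups
            assert b.apply((account, amount)) == "ack"
        return "ok"                                   # respond only after all acks

backups = [BackupRM(), BackupRM()]
primary = PrimaryRM(backups)
primary.set_balance("x", 1)
# every backup now holds the same state as the primary
```

Because the primary sequences every operation and replies only after all backups acknowledge, any backup promoted after a primary crash already reflects every completed request.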
Figure 15.4 The passive (primary-backup) model for fault tolerance
Clients (C) connect through front ends (FE) to the primary replica manager (RM), which propagates updates to the backup replica managers.
Active replication
• The replica managers are state machines that play equivalent roles and are organized as a group
• Front ends multicast their requests to the group of replica managers
• All the replica managers process the request independently but identically, and reply
• If any replica manager crashes, there is no impact on the performance of the service
• Active replication achieves sequential consistency
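A minimal sketch of the state-machine idea, assuming requests reach every replica in the same total order (all names here are illustrative, not from the slides):

```python
class StateMachineRM:
    """Each replica manager is a deterministic state machine; given the same
    totally ordered request sequence, all replicas compute identical states."""
    def __init__(self):
        self.state = {}

    def apply(self, op):
        name, account, amount = op
        if name == "set":
            self.state[account] = amount
        elif name == "deposit":
            self.state[account] = self.state.get(account, 0) + amount
        return self.state.get(account)

# A totally ordered multicast is simulated by handing every replica the
# same list of requests in the same order.
requests = [("set", "x", 1), ("deposit", "x", 3), ("set", "y", 2)]
replicas = [StateMachineRM() for _ in range(3)]
for op in requests:
    replies = {rm.apply(op) for rm in replicas}
    assert len(replies) == 1   # all replicas return the identical reply
```

Since every replica holds the same state, losing any one of them leaves the remaining replicas able to answer every request.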
Figure 15.5 Active replication
Clients (C) connect through front ends (FE), which multicast each request to the whole group of replica managers (RM).
Case studies of highly available services
Replication techniques can be applied to make a service highly available. Three systems that provide highly available services are examined: the gossip architecture, Bayou and Coda.
The gossip architecture
• Data is replicated close to the points where groups of clients need it
• Replica managers exchange "gossip" messages periodically in order to convey the updates they have each received from clients
• The system makes two guarantees:
  – Each client obtains a consistent service over time: replica managers only ever provide a client with data that reflects at least the updates that the client has observed so far
  – Relaxed consistency between replicas: all replica managers eventually receive all updates, and they apply updates with ordering guarantees that make the replicas sufficiently similar to suit the needs of the application
Figure 15.6 Query and update operations in a gossip service
Clients issue queries (query, prev → val, new) and updates (update → update id) through front ends (FE) to replica managers (RM), which exchange gossip messages among themselves.
Figure 15.7 Front ends propagate their timestamps whenever clients communicate directly
In order to control the order in which operations are processed, each front end keeps a vector timestamp that reflects the version of the latest data values accessed by that front end. Clients exchange data both by accessing the same gossip service and by communicating directly with one another via their front ends, which propagate the vector timestamps along with the messages.
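The vector-timestamp bookkeeping can be sketched as follows; `vv_merge` and `vv_leq` are hypothetical helper names, and the timestamp values are examples only:

```python
def vv_merge(a, b):
    """Pairwise maximum of two vector timestamps (one entry per replica manager)."""
    return [max(x, y) for x, y in zip(a, b)]

def vv_leq(a, b):
    """a <= b iff every entry of a is <= the corresponding entry of b."""
    return all(x <= y for x, y in zip(a, b))

# A front end merges the timestamp returned with each reply into its own,
# so it can never later accept data older than what its client has seen.
fe_prev = [2, 0, 1]          # versions already observed by this front end
reply_new = [2, 3, 1]        # timestamp attached to a query reply
fe_prev = vv_merge(fe_prev, reply_new)

# A replica manager may answer a query only once its value timestamp
# dominates the front end's `prev` timestamp:
rm_value_ts = [2, 3, 2]
assert vv_leq(fe_prev, rm_value_ts)   # safe to answer this query
```

The same merge is applied when clients communicate directly: the sender's front end attaches its timestamp, and the receiver's front end merges it in.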
Figure 15.8 A gossip replica manager, showing its main state components
The main state components of a replica manager are: the value and its value timestamp, the update log and replica timestamp, the executed operation table, and the timestamp table holding timestamps received from the other replica managers. Front ends send updates (update id, update, prev); replica managers exchange gossip messages containing stable updates.
Figure 15.9 Committed and tentative updates in Bayou
The update log is divided into a committed prefix (ending at cN) and a tentative suffix (t0, t1, t2, ..., ti, ti+1, ...). Tentative update ti becomes the next committed update and is inserted after the last committed update cN.
Transactions with replicated data
One-copy serializability: the effect of transactions performed by clients on replicated objects should be the same as if they had been performed one at a time on a single set of objects.
Three replication schemes:
• Available copies with validation
• Quorum consensus
• Virtual partition
Architecture for replicated transactions
Figure 15.10 Transactions on replicated data
Transactions T and U, issued by a client through its front end, perform getBalance(A) and deposit(B, 3) against the replica managers holding copies of A and B.
In a read-one/write-all scheme, a read request can be performed by a single replica manager, whereas a write request must be performed by all the replica managers in the group.
ACID: Atomicity, Consistency, Isolation, Durability.
The two-phase commit protocol
• One of the replica managers acts as the coordinator
• First phase: the coordinator sends canCommit? to the other replica managers
• Second phase: the coordinator sends doCommit or doAbort to the other replica managers
Primary copy replication:
• The primary copy is used for transactions, with concurrency control at the primary
• The primary communicates with the other replica managers (backups, to address failure)
Read-one/write-all:
• Read: a single replica manager, which sets a read lock
• Write: all replica managers, each of which sets a write lock
• One-copy serializability is achieved because a read and a write on the same object at the same replica manager require conflicting locks
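A toy sketch of the two-phase exchange (the `RM` class and its methods are invented for illustration; a real protocol must also handle timeouts, crash recovery and coordinator failure):

```python
class RM:
    """A replica manager participating in two-phase commit."""
    def __init__(self, ready=True):
        self.ready = ready       # whether this RM can commit the transaction
        self.decision = None

    def can_commit(self):        # vote in response to canCommit?
        return self.ready

    def receive(self, decision): # record doCommit / doAbort
        self.decision = decision

def two_phase_commit(coordinator, others):
    """Phase 1: collect votes; phase 2: commit only if every vote is yes."""
    votes = [rm.can_commit() for rm in others]
    decision = "doCommit" if all(votes) else "doAbort"
    for rm in others:
        rm.receive(decision)
    coordinator.receive(decision)
    return decision

rms = [RM(), RM(ready=False), RM()]
assert two_phase_commit(rms[0], rms[1:]) == "doAbort"  # one refusal aborts all
```

A single No vote (or an unreachable participant, treated as No) forces doAbort everywhere, which is what keeps the replicas mutually consistent.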
Available copies replication
• Designed to allow for some replica managers being temporarily unavailable
• A client's read request on a logical object may be performed by any available replica manager
• A client's update request must be performed by all available replica managers in the group with copies of the object
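The read-any/write-all-available rule can be sketched as follows (the class and method names are illustrative, not from the slides):

```python
class AvailableCopies:
    """Available copies sketch: a read uses any available replica;
    a write must reach every replica that is currently available."""
    def __init__(self, replica_names):
        self.replicas = {name: {} for name in replica_names}  # name -> object values
        self.available = set(replica_names)                   # reachable RMs

    def read(self, obj):
        name = next(iter(self.available))     # any available replica will do
        return self.replicas[name].get(obj, 0)

    def write(self, obj, value):
        for name in self.available:           # all *available* replicas
            self.replicas[name][obj] = value

store = AvailableCopies(["X", "Y", "M"])
store.available.discard("M")   # M is temporarily unreachable
store.write("A", 3)            # performed at X and Y only
assert store.read("A") == 3
assert "A" not in store.replicas["M"]  # M missed the update; it must catch up on recovery
```

The sketch also shows why validation is needed later: replica M silently falls behind while unavailable, so transactions must be checked when it rejoins.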
Figure 15.11 Available copies
Transaction T performs getBalance(A) and deposit(B, 3); transaction U performs getBalance(B) and deposit(A, 3). Replica managers X and Y hold copies of A, while M, N and P hold copies of B. At X, transaction T has read A, and therefore transaction U is not allowed to update A with its deposit operation until transaction T has completed.
Figure 15.12 Network partition
A network partition separates the replica managers: withdraw(B, 4) is issued in one partition while deposit(B, 3) is issued in the other.
Replication schemes are designed with the assumption that partitions will eventually be repaired. Therefore, the replica managers within a single partition must ensure that any requests they execute during a partition will not make the set of replicas inconsistent when the partition is repaired.
• Optimistic approaches allow updates in all partitions; after the partition is repaired, any updates that break the one-copy serializability criterion are aborted
• Pessimistic approaches limit availability even when there are no partitions, but prevent any inconsistencies from occurring during partitions
Available copies with validation
The available copies algorithm is applied optimistically within each partition; when a partition is repaired, the transactions that were performed are validated, and any that would break one-copy serializability are aborted.
Quorum consensus methods
• In quorum consensus replication schemes, an update operation on a logical object may be completed successfully by a subgroup of its group of replica managers
• Version numbers or timestamps may be used to determine whether copies are up to date
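The defining constraints — a read quorum R and write quorum W must satisfy R + W > total votes (so reads overlap writes) and W + W > total votes (so writes overlap each other) — can be checked mechanically. A sketch, with invented helper names:

```python
def valid_quorum(weights, r, w):
    """Quorum consensus constraints: every read quorum overlaps every
    write quorum (r + w > total) and write quorums overlap (w + w > total)."""
    total = sum(weights)
    return r + w > total and w + w > total

# Gifford's second example: votes (2, 1, 1), R = 2, W = 3
assert valid_quorum([2, 1, 1], r=2, w=3)
# R = 1 would let a read miss the latest write:
assert not valid_quorum([2, 1, 1], r=1, w=3)

def latest(copies):
    """A read quorum returns the copy with the highest version number."""
    return max(copies, key=lambda c: c["version"])

quorum = [{"version": 7, "value": "old"}, {"version": 8, "value": "new"}]
assert latest(quorum)["value"] == "new"
```

The overlap guarantees that every read quorum contains at least one copy carrying the highest version number, which is why picking the maximum-version copy suffices.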
Gifford's quorum consensus examples (page 650)

                              Example 1   Example 2   Example 3
Latency        Replica 1           75          75          75
(milliseconds) Replica 2           65         100         750
               Replica 3           65         750         750
Voting         Replica 1            1           2           1
configuration  Replica 2            0           1           1
               Replica 3            0           1           1
Quorum         R                    1           2           1
sizes          W                    1           3           3
Derived performance of the file suite:
Read           Latency             65          75          75
               Blocking prob.    0.01      0.0002    0.000001
Write          Latency             75         100         750
               Blocking prob.    0.01      0.0101        0.03
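Assuming, as in Gifford's examples, that each replica is independently unavailable with probability 0.01, the blocking probabilities can be recomputed by enumerating which availability patterns fail to reach the quorum (a sketch, not from the slides):

```python
from itertools import product

def blocking_probability(weights, quorum, p_fail=0.01):
    """Probability that the available replicas' votes fall short of the quorum,
    with each replica independently unavailable with probability p_fail."""
    prob = 0.0
    for up in product([True, False], repeat=len(weights)):
        votes = sum(w for w, ok in zip(weights, up) if ok)
        if votes < quorum:                       # this pattern blocks the operation
            pattern_prob = 1.0
            for ok in up:
                pattern_prob *= (1 - p_fail) if ok else p_fail
            prob += pattern_prob
    return prob

# Example 2: votes (2, 1, 1), R = 2, W = 3
assert round(blocking_probability([2, 1, 1], 2), 4) == 0.0002
assert round(blocking_probability([2, 1, 1], 3), 4) == 0.0101
# Example 3: votes (1, 1, 1), R = 1, W = 3
assert round(blocking_probability([1, 1, 1], 1), 6) == 0.000001
assert round(blocking_probability([1, 1, 1], 3), 2) == 0.03
```

For instance, example 3's read (any one of three replicas) blocks only when all three are down, probability 0.01³; its write (all three) blocks whenever any replica is down, probability 1 − 0.99³ ≈ 0.03.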
Virtual partition algorithm
• Combines the quorum consensus approach with the available copies algorithm
• A virtual partition is an abstraction of a real partition and contains a set of replica managers
Figure 15.13 Two network partitions
A network partition divides replica managers V, X, Y and Z, with transaction T running on one side.
Figure 15.14 Virtual partition
The figure shows replica managers V, X, Y and Z, a virtual partition, and the underlying network partition.
Figure 15.15 Two overlapping virtual partitions
Virtual partitions V1 and V2 overlap, each containing a subset of the replica managers V, X, Y and Z.
Figure 15.16 Creating a virtual partition
Phase 1:
• The initiator sends a Join request to each potential member. The argument of Join is a proposed logical timestamp for the new virtual partition.
• When a replica manager receives a Join request, it compares the proposed logical timestamp with that of its current virtual partition:
  – If the proposed logical timestamp is greater, it agrees to join and replies Yes;
  – If it is less, it refuses to join and replies No.
Phase 2:
• If the initiator has received sufficient Yes replies to have read and write quora, it may complete the creation of the new virtual partition by sending a Confirmation message to the sites that agreed to join. The creation timestamp and list of actual members are sent as arguments.
• Replica managers receiving the Confirmation message join the new virtual partition and record its creation timestamp and list of actual members.
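The two phases above can be sketched as follows. This is a simplification: the quorum test here merely counts agreeing members, whereas the real algorithm checks read and write quora over votes, and the names are invented for illustration:

```python
class RMember:
    """A replica manager's record of its current virtual partition."""
    def __init__(self):
        self.partition_ts = 0
        self.members = []

    def on_join(self, proposed_ts):
        # Phase 1: agree (Yes) only if the proposed timestamp beats the current one
        return proposed_ts > self.partition_ts

    def on_confirm(self, ts, members):
        # Phase 2: record the new partition's creation timestamp and membership
        self.partition_ts, self.members = ts, members

def create_virtual_partition(reachable, proposed_ts, read_q, write_q):
    """The initiator forms a new virtual partition only if enough replica
    managers reply Yes to yield both read and write quora."""
    yes = [rm for rm in reachable if rm.on_join(proposed_ts)]
    if len(yes) >= max(read_q, write_q):   # enough members for both quora
        for rm in yes:
            rm.on_confirm(proposed_ts, yes)
        return yes
    return None                            # creation fails; no Confirmation sent

rms = [RMember() for _ in range(4)]
partition = create_virtual_partition(rms, proposed_ts=1, read_q=2, write_q=3)
assert partition is not None and len(partition) == 4
```

The timestamp comparison in phase 1 is what prevents a stale initiator from dragging members back into an older partition.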