Replication (14.4-14.6) by Ramya Balakumar

Highly available services
A few systems that provide highly available services:
- Gossip
- Bayou
- Coda

The gossip architecture
The gossip architecture is a framework for implementing highly available services:
- data is replicated close to the location of clients
- RMs periodically exchange 'gossip' messages containing updates
The gossip service provides two types of operation:
- queries: read-only operations
- updates: modify (but do not read) the state
The FE sends queries and updates to any chosen RM, one that is available and gives reasonable response times.
Two guarantees hold (even if RMs are temporarily unable to communicate):
- each client obtains a consistent service over time: the data it sees reflects the updates it has already observed, even if it uses different RMs
- relaxed consistency between replicas: all RMs eventually receive all updates, and they apply ordering guarantees suited to the needs of the application (generally causal ordering); a client may observe stale data

Query and update operations in a gossip service
(Figure: clients issue requests through FEs; an FE sends "Query, prev" to an RM and receives back "Val, new"; it sends "Update" and receives back "Update id"; RMs exchange gossip messages among themselves.)

Gossip processing of queries and updates
The five phases in performing a client request are listed below (update response and query response are alternatives, so a given request passes through five of them):
- request: the FE sends the request to an RM; for a query, the FE may need to contact another RM if the first is unavailable; for update operations, higher reliability can be obtained by blocking the client until the update has been given to f+1 RMs
- update response: the RM replies as soon as it has seen the update
- coordination: the RM waits until it can apply the request according to the ordering constraints
- execution: the RM executes the request
- query response: if the request is a query, the RM now replies (it will have waited until the ordering constraints were satisfied)
- agreement: RMs update one another by exchanging gossip messages, lazily, e.g. when several updates have been collected

A gossip replica manager, showing its main state components
(Figure: a replica manager holds the value, a value timestamp, a replica timestamp, an update log, an executed operation table and a timestamp table; FEs send updates of the form <Prev, OperationID, Update>; stable updates from the log are applied to the value; gossip messages are exchanged with other replica managers.)

Processing of query and update operations
Query operations contain q.prev:
- a query can be applied if q.prev ≤ valueTS (the value timestamp)
- failing this, the RM can wait for gossip messages or initiate them
- e.g. if valueTS = (2,5,5) and q.prev = (2,4,6), RM 0 has missed an update from RM 2 (6 is not ≤ 5); the FE must have used RM 2 last time
Once the query can be applied, the RM returns the (new) valueTS to the FE, which merges it with its own vector timestamp.
- e.g. in a gossip system with 3 RMs (numbered 0, 1 and 2), a value timestamp of (2,4,5) at RM 0 means that the value there reflects the first 2 updates accepted from FEs at RM 0, the first 4 at RM 1 and the first 5 at RM 2
- in the merge, the FE learns of updates it had not yet seen (e.g. its prev had 4 where the RM had 5 for RM 1)
(More details of what is stored for updates are on page 578 of the textbook.)
Total ordering can be implemented by using a primary RM as a sequencer.
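
A minimal sketch of the vector-timestamp test and merge described above, assuming plain Python lists as timestamps; the function names (le, merge) and the three-RM values are illustrative, not the textbook's pseudocode.

```python
# Sketch of the gossip query test and FE merge (illustrative only).

def le(ts_a, ts_b):
    """Vector timestamp ordering: ts_a <= ts_b iff every component is <=."""
    return all(a <= b for a, b in zip(ts_a, ts_b))

def merge(ts_a, ts_b):
    """Component-wise maximum, used when the FE merges the returned valueTS."""
    return [max(a, b) for a, b in zip(ts_a, ts_b)]

value_ts = [2, 5, 5]   # valueTS held by RM 0
q_prev   = [2, 4, 6]   # q.prev sent by the FE

if le(q_prev, value_ts):
    print("query can be applied now")
else:
    # RM 0 has missed an update from RM 2 (6 > 5): wait for or solicit gossip.
    print("query must wait for gossip")

fe_ts = merge(q_prev, value_ts)    # after the reply, the FE holds [2, 5, 6]
```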

Gossip update operations
Update operations are processed in causal order:
- an FE sends the update operation (u.op, u.prev, u.id) to RM i; the FE can send the request to several RMs, using the same id
- when RM i receives an update request, it checks whether it is new, by looking for the id in its executed operation table and in its log
- if it is new, the RM increments the ith element of its replica timestamp by 1, assigns a unique vector timestamp ts to the update and stores the update in its log
- the RM returns ts to the FE, which merges it with its vector timestamp
- the update is stable once u.prev ≤ valueTS, i.e. the valueTS reflects all the updates seen by the FE
- when it is stable, the RM applies the operation u.op to the value, updates valueTS and adds u.id to the executed operation table
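
The update path at RM i can be sketched in the same style; the class name, field names and single-pass stability check below are assumptions made for illustration, not the chapter's actual data structures.

```python
# Sketch of gossip update handling at replica manager i (illustrative only).

class GossipRM:
    def __init__(self, i, n_replicas):
        self.i = i
        self.value = {}
        self.value_ts = [0] * n_replicas    # updates reflected in the value
        self.replica_ts = [0] * n_replicas  # updates accepted into the log
        self.log = []                       # entries: (ts, (op, prev, uid))
        self.executed = set()               # ids of updates already applied

    def receive_update(self, u_op, u_prev, u_id):
        if u_id in self.executed or any(e[1][2] == u_id for e in self.log):
            return None                     # duplicate: the FE sent it to several RMs
        self.replica_ts[self.i] += 1        # one more update accepted at RM i
        ts = list(u_prev)
        ts[self.i] = self.replica_ts[self.i]  # unique vector timestamp for this update
        self.log.append((ts, (u_op, u_prev, u_id)))
        self.apply_stable()
        return ts                           # returned to the FE, which merges it

    def apply_stable(self):
        # An update is stable once u.prev <= valueTS, i.e. the value already
        # reflects every update the FE had seen; apply stable updates in order.
        for ts, (op, prev, uid) in sorted(self.log, key=lambda e: e[0]):
            if uid not in self.executed and all(p <= v for p, v in zip(prev, self.value_ts)):
                op(self.value)
                self.value_ts = [max(a, b) for a, b in zip(self.value_ts, ts)]
                self.executed.add(uid)
```

In this sketch the FE would supply u_op as a callable that mutates the value, for example lambda v: v.update(balance=100).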

Discussion of the gossip architecture
- the gossip architecture is designed to provide a highly available service: clients with access to a single RM can work even when other RMs are inaccessible
- but it is not suitable for data such as bank accounts, and it is inappropriate for updating replicas in near real time (e.g. a conference)
- scalability: as the number of RMs grows, so does the number of gossip messages; for R RMs, the number of messages per request is 2 + (R-1)/G (2 for the request and reply, the rest for gossip), where G is the number of updates per gossip message
- increasing G reduces the number of gossip messages but makes update latency worse
- for applications where queries are more frequent than updates, some replicas can be made read-only, updated only by gossip messages
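
A quick, hypothetical illustration of the message-count estimate (the values of R and G are arbitrary):

```python
# Messages per request in the gossip architecture: 2 for the request and reply,
# plus (R - 1) / G as the per-request share of gossip traffic.
def messages_per_request(R, G):
    return 2 + (R - 1) / G

print(messages_per_request(R=5, G=1))    # 6.0 - gossip after every update
print(messages_per_request(R=5, G=10))   # 2.4 - batching 10 updates per gossip
```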

The Bayou system
- Operational transformation: replica managers detect and resolve conflicts according to a domain-specific policy.
- The Bayou guarantee: eventually every replica manager receives the same set of updates and the databases of the replica managers become identical.

Committed and tentative updates in Bayou
(Figure: a log with a committed prefix c0 ... cN followed by tentative updates t0, t1, t2, ..., ti, ti+1, ...; tentative update ti becomes the next committed update and is inserted after the last committed update cN.)
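
A minimal sketch of the reordering the figure describes, assuming a simple list-based log; the class and method names are invented, and real Bayou additionally runs dependency checks and merge procedures when re-applying updates.

```python
# Sketch of Bayou's committed/tentative log (illustrative only).

class BayouLog:
    def __init__(self):
        self.committed = []   # c0 .. cN, in their final, agreed order
        self.tentative = []   # t0 .. ti .., order may still change

    def add_tentative(self, update):
        self.tentative.append(update)

    def commit(self, update):
        # Tentative update ti becomes the next committed update: it is removed
        # from wherever it sat in the tentative sequence and appended after cN.
        self.tentative.remove(update)
        self.committed.append(update)
        # A real replica would now roll back and re-apply the remaining
        # tentative updates on top of the new committed prefix (not shown).
```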

Bayou summary
- Replication is not transparent to the application.
- E.g. a user has booked a slot in a diary, but the booking later moves to another available slot; the user only discovers this when checking the diary again.
- There is no clear indication to the user of which data is committed and which is tentative.

The Coda file system
- Descendant of AFS (the Andrew File System)
- Vice processes: replica managers
- Venus processes: a hybrid of front ends and replica managers
- VSG: volume storage group (the group of servers holding replicas of a file volume)
- AVSG: available volume storage group (the members of the VSG that a client can currently reach)
- Operation is mostly similar to AFS

Replication strategy
- Allows file modification when the network is disconnected or partitioned.
- CVV: Coda version vector, a vector timestamp with one element for each server.
- Each element of the CVV is an estimate of the number of modifications performed on the version of the file held at the corresponding server.

Example
- File F; servers S1, S2, S3.
- Consider the modifications to F in a volume that is replicated at the 3 servers; the VSG for F is {S1, S2, S3}.
- F is modified at about the same time by two clients, C1 and C2.
- Because of a network fault, C1 can access only S1 and S2, so C1's AVSG is {S1, S2}; C2 can access only S3, so C2's AVSG is {S3}.
- Initially the CVV for F is [1,1,1] at all 3 servers.
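
Continuing this example as the textbook does: C1's modification raises the CVV to [2,2,1] at S1 and S2, while C2's raises it to [1,1,2] at S3; when the partition is repaired, neither vector dominates the other, which is how Coda detects the conflict. A sketch of that comparison (the function name is mine, not Coda's):

```python
# Sketch of Coda version vector (CVV) comparison (illustrative only).

def compare_cvv(v1, v2):
    """Return 'same', 'v1 dominates', 'v2 dominates', or 'conflict'."""
    ge = all(a >= b for a, b in zip(v1, v2))
    le = all(a <= b for a, b in zip(v1, v2))
    if ge and le:
        return "same"
    if ge:
        return "v1 dominates"   # v1 reflects every modification that v2 does
    if le:
        return "v2 dominates"
    return "conflict"           # concurrent modifications: needs manual/application repair

print(compare_cvv([2, 2, 1], [1, 1, 2]))   # conflict
print(compare_cvv([2, 2, 1], [1, 1, 1]))   # v1 dominates
```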

Update semantics: guarantees offered by Coda
After a successful open:
  (s' ≠ ∅ and (latest(F, s', 0) or (latest(F, s', T) and lostCallback(s', T) and inCache(F)))) or (s' = ∅ and inCache(F))
After a failed open:
  (s' ≠ ∅ and conflict(F, s')) or (s' = ∅ and ¬inCache(F))
After a successful close:
  (s' ≠ ∅ and updated(F, s')) or (s' = ∅)
After a failed close:
  s' ≠ ∅ and conflict(F, s')
Here s' is the client's AVSG for F; latest(F, s', T) means the client's copy of F was the latest version across s' at some time in the last T seconds; lostCallback(s', T) means a callback from a member of s' was lost in the last T seconds; inCache(F) means F is in the client's cache; and conflict(F, s') means the versions of F at some members of s' are in conflict.

Cache coherence
The Venus process at each client must detect the following events within T seconds of their occurrence:
- enlargement of an AVSG (a previously inaccessible server becoming accessible again)
- shrinking of an AVSG (a server becoming inaccessible)
- a lost callback event
Venus sends a probe message to all servers in the VSG of the file every T seconds.
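
A rough sketch of that probing loop; everything here (the function names, the callbacks, the default T) is an assumed stand-in rather than Coda's actual interface, and lost-callback detection is omitted.

```python
# Sketch of Venus probing the VSG every T seconds to maintain the AVSG (illustrative).
import time

def probe_loop(vsg, avsg, probe, on_enlarge, on_shrink, T=10):
    """probe(server) -> bool is supplied by the caller; T is in seconds."""
    while True:
        reachable = {s for s in vsg if probe(s)}
        if reachable - avsg:
            on_enlarge(reachable - avsg)   # previously inaccessible servers are back
        if avsg - reachable:
            on_shrink(avsg - reachable)    # some servers have become inaccessible
        avsg = reachable                   # this is the new AVSG
        time.sleep(T)
```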

The Coda file system and the Bayou approach
- Coda is similar in spirit to Bayou: Bayou is used for databases and Coda for file systems.
- Coda's conflict resolution is similar to Bayou's operational transformation approach.

Example of transactions on replicated data
(Figure: transactions T and U reach replica managers holding copies of accounts A and B through clients + front ends; T issues getBalance(A) followed by deposit(B,3).)

Available copies: read one / write all available
- T's getBalance(A) is performed by RM X, whereas T's deposit(B,3) is performed by RMs M, N and P.
- At X, T has read A and locked it, so U's deposit(A,3) is delayed until T finishes.
(Figure: two clients + front ends run transactions T (getBalance(A); deposit(B,3)) and U (getBalance(B); deposit(A,3)); account A is replicated at RMs X and Y, and account B at RMs M, N and P.)

Available copies replication: RM failure example
- RMs X and N fail before T and U have performed their deposit operations.
- Therefore T's deposit(B,3) is performed at RMs M and P (all the available copies of B), and U's deposit(A,3) is performed at RM Y (all the available copies of A).
(Figure: the same configuration as above, with X and N having failed.)

Available copies replication: local validation
Local validation is the additional concurrency control needed: before a transaction commits, it checks for failures and recoveries of the RMs it has contacted.
- T did getBalance at X (before X failed), then did its deposit at M and P, noticing that N had crashed.
- U did getBalance at N and its deposit at Y, observing that X had failed.
- Before committing, T checks that N is still unavailable and that X, M and P are still available; if so, T can commit.
- This implies that X failed after T validated and before U validated, i.e. the order of events is: N fails, T commits, X fails, U validates. (On the previous slide X failed before T's deposit; in that case T would have to abort.)
- U checks whether N is still available (it is not) and whether X is still unavailable; since N is unavailable, U must abort.
- After all the operations of a transaction have been carried out, the FE informs the coordinator of the failed RMs it knows about; the coordinator can attempt to communicate with any RMs noted to have failed, and in carrying out 2PC it will discover whether any RMs involved in the transaction have subsequently failed.
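
A compact sketch of the local validation check described above; the data structures and is_reachable() are invented for illustration.

```python
# Sketch of local validation in available copies replication (illustrative only).

def local_validation(contacted, observed_failed, is_reachable):
    """contacted: RMs this transaction read from or wrote at.
       observed_failed: RMs it found unavailable during execution.
       is_reachable(rm) -> bool is the check made just before commit."""
    for rm in contacted:
        if not is_reachable(rm):    # e.g. U contacted N, but N has since failed
            return False            # abort
    for rm in observed_failed:
        if is_reachable(rm):        # an RM it treated as failed has recovered
            return False            # abort
    return True                     # safe to proceed to commit (via 2PC)

# In the example: T contacted {X, M, P} and observed {N} failed; it validates
# while X is still up, so it commits. U contacted {N, Y} and observed {X}
# failed; by the time U validates, N is down, so U must abort.
```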

Network partition: division into subgroups
(Figure: a network partition divides the replica managers of B into two subgroups; one transaction issues deposit(B,3) on one side of the partition while the other issues withdraw(B,4) on the other.)

Two network partitions
(Figure: replica managers V, X, Y and Z with a network partition between them; transaction T can reach only the replica managers on its side of the partition.)

Virtual partition
(Figure: replica managers V, X, Y and Z; a virtual partition, an abstraction of the real network partition, groups the replica managers that can currently reach one another.)

Two overlapping virtual partitions
(Figure: two virtual partitions, V1 and V2, overlap, sharing at least one of the replica managers among V, X, Y and Z.)

Creating a virtual partition
Phase 1:
- The initiator sends a Join request to each potential member. The argument of Join is a proposed logical timestamp for the new virtual partition.
- When a replica manager receives a Join request, it compares the proposed logical timestamp with that of its current virtual partition:
  - if the proposed logical timestamp is greater, it agrees to join and replies Yes;
  - if it is less, it refuses to join and replies No.
Phase 2:
- If the initiator has received sufficient Yes replies to have read and write quora, it may complete the creation of the new virtual partition by sending a Confirmation message to the sites that agreed to join. The creation timestamp and list of actual members are sent as arguments.
- Replica managers receiving the Confirmation message join the new virtual partition and record its creation timestamp and list of actual members.
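
A sketch of the two phases under stated assumptions: message passing is abstracted into direct function calls, the names are mine, and a real implementation must also handle timeouts and concurrent initiators.

```python
# Sketch of the two-phase protocol for creating a virtual partition (illustrative).

def handle_join(current_vp_ts, proposed_ts):
    """Phase 1 at a potential member: agree only to a newer virtual partition."""
    return proposed_ts > current_vp_ts            # True means "Yes", False means "No"

def create_virtual_partition(members, proposed_ts, read_quorum, write_quorum,
                             send_join, send_confirmation):
    # Phase 1: propose the new virtual partition with a logical timestamp.
    agreed = [m for m in members if send_join(m, proposed_ts)]
    # Phase 2: proceed only if the Yes replies can form both read and write quora.
    if len(agreed) >= read_quorum and len(agreed) >= write_quorum:
        for m in agreed:
            send_confirmation(m, proposed_ts, agreed)  # creation timestamp + member list
        return agreed                                  # the new virtual partition
    return None                                        # not enough members; abandon
```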

Summary: gossip and replication in transactions
- The gossip architecture is designed for highly available services.
- It uses a lazy form of replication, in which RMs update one another from time to time by means of gossip messages.
- It allows clients to make updates to local replicas while partitioned; RMs exchange updates with one another when reconnected.
- For replication in transactions, primary-backup architectures can be used; other architectures allow FEs to use any RM.
- Available copies replication allows RMs to fail, but cannot deal with network partitions.