Replica Control for Peer-to-Peer Storage Systems

P2P
Peer-to-peer (P2P) has emerged as an important paradigm for sharing resources at the edges of the Internet. The most widely exploited resource is storage, as typified by P2P music file sharing:
–Napster
–Gnutella
Following the great success of P2P file sharing, a natural next step is to develop wide-area P2P storage systems that aggregate storage across the Internet.

Replica Control Protocol
Replication
–maintain multiple copies of some critical data to increase availability
–reduce read access times
Replica control protocol
–avoid inconsistent updates
–guarantee a consistent view of the replicated data

Resiliency Requirement
Need data replication
–Even if some nodes fail, the computation can progress
–Consistency requirement
–Failures may partition the network
–Rejoining partitions need to use consistency control algorithms

One-Copy Equivalence Consistency Criterion
The set of replicas must behave as if there were only a single copy. Conditions to ensure one-copy equivalence:
–no two write operations can proceed at the same time
–a read operation and a write operation cannot proceed at the same time
–a read operation always returns the value written by the last write operation

Replica Control Methods
Optimistic
–Proceed with computation on the available subgroup
–Optimistically reconcile for consistency when partitions rejoin later
Pessimistic
–Restrict computations under worst-case assumptions
–Approaches: primary site, voting

Optimistic Approach
Version vector for file f
–An N-element vector, where N is the number of nodes on which f is stored
–The i-th element represents the number of updates done by node i
A vector V dominates V' if
–every element of V >= the corresponding element of V'
Conflict if neither dominates
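The dominance test follows directly from this definition. A minimal sketch (the function names are illustrative, not taken from any cited system):

```python
# Sketch: classify the relation between two version vectors of equal length.
def dominates(v, v_prime):
    # True if every element of v is >= the corresponding element of v'.
    return all(a >= b for a, b in zip(v, v_prime))

def compare(v, v_prime):
    if dominates(v, v_prime) and dominates(v_prime, v):
        return "equal"
    if dominates(v, v_prime):
        return "dominates"      # v reflects the newer state
    if dominates(v_prime, v):
        return "dominated"      # v' reflects the newer state
    return "conflict"           # neither dominates: concurrent updates

# Nodes 0 and 2 updated concurrently -> a conflict that cannot be auto-resolved.
print(compare([2, 1, 0], [1, 1, 1]))   # conflict
print(compare([2, 1, 1], [1, 1, 1]))   # dominates
```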

Optimistic (cont'd)
Consistency resolution
–If V dominates V', the inconsistency can be resolved by copying the replica with V over the one with V'
–If V and V' conflict, the inconsistency cannot be resolved automatically
Version vectors can resolve only update conflicts; they cannot resolve read-write conflicts

Primary Site Approach
Data replicated on at least k+1 nodes (for k-resiliency)
One node acts as the primary site (PS)
–Any read request is served by the PS
–Any write request is copied to all other backup sites
–Any write request arriving at a backup site is forwarded to the PS
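A rough sketch of this read/write forwarding in a fail-free run (class and method names are invented for illustration):

```python
# Sketch of the primary-site (PS) pattern; failure handling is not modeled.
class Replica:
    def __init__(self, primary=None):
        self.primary = primary     # None => this node is the PS
        self.backups = []          # filled in only on the PS
        self.data = {}

    def read(self, key):
        if self.primary is not None:        # backups forward reads to the PS
            return self.primary.read(key)
        return self.data.get(key)

    def write(self, key, value):
        if self.primary is not None:        # backups forward writes to the PS
            return self.primary.write(key, value)
        self.data[key] = value              # PS applies the write locally...
        for b in self.backups:              # ...and copies it to every backup
            b.data[key] = value

ps = Replica()
ps.backups = [Replica(primary=ps) for _ in range(2)]   # k = 2 backups
ps.backups[0].write("x", 42)        # forwarded to the PS, then propagated
print(ps.read("x"))                 # 42
```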

PS Failure Handling
If a backup fails, there is no interruption in service
If the PS fails, there are two possibilities
–If the network is not segmented
  Choose another node in the set as the primary
  If checkpointing has been active, need to restart only from the previous checkpoint
–If segmented
  Only the partition with the PS can progress
  Other partitions stop updates on the data
Necessary to distinguish between site failures and network partitions

Witnesses
Witness – a small entity that maintains enough information to identify the replicas that contain the most recent version of the data
–this information could be a timestamp recording the time of the latest update
–the timestamp is often replaced by a version number, an integer incremented each time the data are updated

Voting Approach
V votes are distributed to n replicas such that
–Vw + Vr > V
–Vw + Vw > V
Obtain Vr or more votes to read
Obtain Vw or more votes to write
A quorum system is more general than voting
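The two constraints force every read quorum to intersect every write quorum, and any two write quorums to intersect. A small illustrative check (the vote assignment is an invented example):

```python
# Sketch: verify the weighted-voting constraints Vr + Vw > V and 2*Vw > V.
def quorum_sizes_ok(votes, v_r, v_w):
    V = sum(votes)                       # total number of votes
    return (v_r + v_w > V) and (2 * v_w > V)

votes = [1, 1, 1, 1, 1]                  # 5 replicas, one vote each (V = 5)
print(quorum_sizes_ok(votes, v_r=3, v_w=3))   # True: read and write quorums intersect
print(quorum_sizes_ok(votes, v_r=2, v_w=3))   # False: a read could miss the last write
```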

Quorum Systems
–Trees
–Grid-based (array-based)
–Torus
–Hierarchical
–Multi-column
–and so on…

Classification of P2P Storage Systems
Unstructured
–"Replication Strategies for Highly Available Peer-to-peer Storage"
–"Replication Strategies in Unstructured Peer-to-peer Networks"
Structured
–CFS
–PAST
–LAR
–Ivy
–Oasis
–Om
–Eliot
–Sigma (for the mutual exclusion primitive)
Structured systems are further divided into read-only and read/write (mutable) systems.

Ivy
Ivy stores a set of logs with the aid of distributed hash tables. It keeps, for each participant, a log storing all of that participant's updates, and maintains data consistency optimistically by performing conflict resolution among all logs (i.e., it maintains consistency in a best-effort manner). The logs must be kept indefinitely, and a participant must scan all the logs related to a file to look up the up-to-date file data. Thus, Ivy is suitable only for small groups of participants.

Eliot
Eliot relies on a reliable, fault-tolerant, immutable P2P storage substrate (Charles) to store data blocks, and uses an auxiliary metadata service (MS) to store mutable metadata. It supports NFS-like consistency semantics; however, the traffic between the MS and the client is high for such semantics. It also supports AFS open-close consistency semantics; however, these semantics may cause lost updates. The MS is provided by a conventional replicated database, which may not be suitable for dynamic P2P environments.

Oasis
Oasis is based on Gifford's weighted-voting quorum concept and allows dynamic quorum membership. It spreads versioned metadata along with data replicas over the P2P network. To complete an operation on a data object, a client must first find metadata related to the object and determine the total number of votes, the votes required for read/write operations, the replica list, and so on, to form a quorum accordingly. One drawback of Oasis is that if a node happens to use stale metadata, data consistency may be violated.

Om
Om is based on the concepts of automatic replica regeneration and replica membership reconfiguration. Consistency is maintained by two quorum systems: a read-one-write-all quorum system for accessing replicas, and a witness-modeled quorum system for reconfiguration. Om allows replica regeneration from a single replica. However, a write in Om is always first forwarded to the primary copy, which serializes all writes and uses a two-phase procedure to propagate each write to all secondary replicas. The drawbacks of Om are: (1) the primary replica may become a bottleneck; (2) the overhead incurred by the two-phase procedure may be too high; (3) the reconfiguration by the witness model has a probability of violating consistency.
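A compressed sketch of the read-one-write-all pattern with primary-serialized, two-phase write propagation described above (all class and method names are assumptions for illustration, and failure handling is omitted):

```python
# Sketch: read-one-write-all with writes serialized at a primary replica.
class Secondary:
    def __init__(self):
        self.value, self.pending = None, None
    def prepare(self, value):          # phase 1: stage the update
        self.pending = value
        return True
    def commit(self):                  # phase 2: make the update visible
        self.value, self.pending = self.pending, None
    def read(self):                    # read-one: any single replica suffices
        return self.value

class Primary:
    def __init__(self, secondaries):
        self.secondaries, self.value = secondaries, None
    def write(self, value):
        if not all(s.prepare(value) for s in self.secondaries):
            return False               # abort if any secondary cannot prepare
        self.value = value
        for s in self.secondaries:
            s.commit()
        return True

secs = [Secondary() for _ in range(3)]
Primary(secs).write("v1")
print(secs[0].read())                  # "v1": any single replica has the latest write
```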

Sigma
The Sigma protocol collects state from all replicas to achieve mutual exclusion. The basic idea is as follows. A node u wishing to become the winner of the mutual exclusion sends a timestamped request to each of the n (n = 3k+1) replicas and waits for replies. On receiving a request from u, a node v puts u's request into a local queue ordered by timestamp, takes as the winner the node whose request is at the front of the queue, and replies with the winner's ID to u.

Sigma
When the number of replies received by u exceeds m (m = 2k+1), u acts according to the following conditions: (1) if more than m replies take u as the winner, then u is the winner; (2) if more than m replies take w (w ≠ u) as the winner, then w is the winner and u just keeps waiting; (3) if no node is regarded as the winner by more than m replies, then u sends a YIELD message to cancel its request temporarily and later re-inserts it. In this manner, one node can eventually be elected as the winner even when the communication delay variance is large. A drawback of the Sigma protocol is that a node needs to send requests to all replicas and obtain favorable replies from a large portion (≥ 2/3) of the nodes to become the winner of the mutual exclusion, which incurs a large overhead. Moreover, the overhead becomes even larger under high contention.
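A very condensed, single-process sketch of the queue-and-reply pattern just described, treating replica calls as local function calls and ignoring failures and retransmission (all names are illustrative assumptions, not the protocol authors' code):

```python
# Condensed sketch: n = 3k+1 replicas each keep a timestamp-ordered queue and
# always reply with the node currently at the front of their queue.
import heapq
from collections import Counter

class SigmaReplica:
    def __init__(self):
        self.queue = []                       # entries are (timestamp, node_id)

    def on_request(self, ts, node_id):
        heapq.heappush(self.queue, (ts, node_id))
        return self.queue[0][1]               # front of the queue = current winner

    def on_yield(self, ts, node_id):
        self.queue.remove((ts, node_id))      # requester temporarily withdraws
        heapq.heapify(self.queue)

def try_acquire(node_id, ts, replicas, k):
    m = 2 * k + 1
    replies = Counter(r.on_request(ts, node_id) for r in replicas)
    winner, count = replies.most_common(1)[0]
    if count >= m:                            # at least m replies agree (simplified)
        return winner == node_id              # True: u wins; False: u waits for w
    for r in replicas:                        # no clear winner: yield and retry later
        r.on_yield(ts, node_id)
    return None

replicas = [SigmaReplica() for _ in range(4)]            # k = 1, so n = 4, m = 3
print(try_acquire("u", ts=1, replicas=replicas, k=1))    # True: all replicas name u
```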

MUREX comes to the rescue!