1 MUREX: A Mutable Replica Control Protocol for Structured Peer-to-Peer Storage Systems

2 P2P Systems
For sharing resources at the edge of the Internet.
Classification:
– Unstructured: Napster, Gnutella
– Structured: Chord, Pastry, Tapestry, CAN, Tornado

3 Replication
Data items are replicated for fault tolerance. Some DHTs provide replication utilities, but these are usually used to replicate routing state. The proposed protocol replicates data items in the application layer, so it can be built on top of any DHT.

4 Fault Model
– Fail-stop
– Byzantine
– Middle ground

5 Duplicating a Data Item
[Figure: a data item is hashed by hash functions 1..n to obtain n keys in the hash space 0 .. 2^128 − 1; replicas 1..n are stored at the peer nodes responsible for those keys.]

6 How to Keep Replicas Consistent
– Primary-copy mechanism: update propagation
– Quorum-based mechanism: every write quorum Qw intersects every read quorum Qr, and any two write quorums intersect
  – ROWA: read one, write all
  – Majority
  – Multi-Column Protocol
  – …

7 Problems
– State loss
– Replica regeneration
– Replica transfer

8 Peer Joins/Leaves
[Figure: the same n-replica placement as slide 5, shown as peers join and leave the hash space.]

9 Peer Joins/Leaves
[Figure: when a peer hosting a replica leaves, the replica's state is lost (state loss); when a peer joins, replicas must be handed over to it (replica transfer).]

10 Peer Joins/Leaves
[Figure: a new replica is regenerated to replace the one lost when a peer leaves (replica regeneration), in addition to state loss and replica transfer.]

11 Solutions
– Leased lock
– Replica pointer
– Auto replica regeneration

12 Implementation
The solutions can be integrated with any quorum-based replica control protocol on top of any DHT. We choose the multi-column protocol (MCP) and the Tornado DHT for the following reasons:
– MCP has small quorum sizes: √n in general, and constant quorum sizes in the best case if necessary.
– Tornado is a typical DHT system developed by our group.

13 Multi-Column Structure
A multi-column structure MC(m) = (C1, ..., Cm) is a list of pairwise disjoint sets of replicas satisfying |Ci| > 1 for 1 ≤ i ≤ m. For example, ({r1, r2}, {r3, r4, r5}, {r6, r7, r8, r9}) and ({r1, r2, r3, r4, r5}, {r6, r7}, {r8, r9}) are multi-column structures, where r1, ..., r9 are (keys of) replicas of a data item.

14 Write/Read Quorums
A write quorum under MC(m) is a set that contains all replicas of some column Ci, 1 ≤ i ≤ m (note that i = 1 is included), and one replica of each of the columns Ci+1, ..., Cm.
A read quorum under MC(m) is either
– Type-1: a set that contains one replica of each of the columns C1, ..., Cm, or
– Type-2: a set that contains all replicas of some column Ci, 1 < i ≤ m (note that i = 1 is excluded), and one replica of each of the columns Ci+1, ..., Cm.

15 Construction of Write Quorums
One primary cohort with supporting cohorts at the rear.
[Figure: a 4×4 grid of replicas numbered 0–15; example write quorums: {2, 6, 10, 11, 14} and {1, 2, 5, 9, 13, 15}.]

16 Randomized Alg. for Write Quorums
Function Get_Write_Quorum((C1, ..., Cm): Multi-Column): Set;
Var S: Set; i, j: Integer;
  S := Ci, where i := Random(1..m);   // i is an integer between 1 and m
  For j := i+1, ..., m: choose one arbitrary member of Cj and add it to S;
  Return S;
End Get_Write_Quorum

17 Randomized Alg. for Type-1 Read Quorums
Function Get_Read_Quorum1((C1, ..., Cm): Multi-Column): Set;
Var S: Set; i: Integer;
  S := {};
  For i := 1, ..., m: choose one arbitrary member of Ci and add it to S;
  Return S;
End Get_Read_Quorum1

18 Randomized Alg. for Type-2 Read Quorums
Function Get_Read_Quorum2((C1, ..., Cm): Multi-Column): Set;
Var S: Set; i, j: Integer;
  S := Ci, where i := Random(2..m);   // i is an integer between 2 and m
  For j := i+1, ..., m: choose one arbitrary member of Cj and add it to S;
  Return S;
End Get_Read_Quorum2
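The three procedures above translate almost directly into Python. The following is a minimal sketch assuming the multi-column structure is given as a list of sets of replica keys; the function names and representation are illustrative, not prescribed by MUREX.

```python
import random

def get_write_quorum(columns):
    """All replicas of a random column C_i plus one replica from each later column."""
    i = random.randrange(len(columns))            # index 0..m-1 (the slide uses 1..m)
    quorum = set(columns[i])
    for later in columns[i + 1:]:
        quorum.add(random.choice(sorted(later)))  # one arbitrary member of each later column
    return quorum

def get_read_quorum_type1(columns):
    """One replica from every column."""
    return {random.choice(sorted(c)) for c in columns}

def get_read_quorum_type2(columns):
    """All replicas of a random column C_i with i >= 2, plus one from each later column."""
    i = random.randrange(1, len(columns))         # excludes the first column
    quorum = set(columns[i])
    for later in columns[i + 1:]:
        quorum.add(random.choice(sorted(later)))
    return quorum

# Example with the MC(3) structure from slide 13.
mc = [{"r1", "r2"}, {"r3", "r4", "r5"}, {"r6", "r7", "r8", "r9"}]
print(get_write_quorum(mc), get_read_quorum_type1(mc), get_read_quorum_type2(mc))
```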

19 n Hash Functions
There are several ways to disseminate the n replicas. MUREX adopts the n-hash-function method: each data item has n replicas with hash keys k1, ..., kn, where k1 = HASH1(data item name), ..., kn = HASHn(data item name).
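As a concrete illustration, the n hash functions can be realized by salting a single cryptographic hash with the replica index. This is only an assumption for the sketch; MUREX merely requires n distinct hash functions over the 128-bit key space.

```python
import hashlib

HASH_SPACE_BITS = 128   # the slides use the key space 0 .. 2^128 - 1

def replica_keys(item_name, n):
    """Derive n replica keys k_1..k_n for one data item name."""
    keys = []
    for i in range(1, n + 1):
        digest = hashlib.sha1(f"{item_name}#{i}".encode()).digest()
        keys.append(int.from_bytes(digest, "big") % (1 << HASH_SPACE_BITS))
    return keys

print(replica_keys("reports/q1.dat", n=4))
```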

20 Operations
– publish(data, data item name): place replicas of the original data item at the nodes associated with k1, ..., kn, with version number 0.
– read(data item name): return the replica with the up-to-date version number (by collecting a read quorum of replicas and returning the one with the highest version number).
– write(data, data item name): update the data item (by writing to a write quorum of replicas with the highest version number plus one).

21 Messages
– LOCK
– OK
– WAIT
– MISS
– UNLOCK

22 Initialization
Initially, the data originator publishes the original data item by calling publish(data, data item name), which stores the data item with version number 0 at the n nodes associated with k1, ..., kn.

23 Read/Write
Afterwards, any participant can invoke the read (or write) operation on the data item by issuing LOCK requests, with the help of the DHT, to all members of a read (or write) quorum Q.

24 Asking About a Missed Replica
When a node receives a LOCK request but does not own the replica, it sends a MISS message. Note that MISS is sent just once for each replica.

25 Checking Lock Conflicts
On the other hand, if the node owns the replica, it checks whether there is a lock conflict, i.e., whether a read-locked replica has received a write-lock request, or a write-locked replica has received a write-lock or read-lock request.
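A minimal sketch of the conflict rule, assuming a replica's lock state is either unlocked, read-locked, or write-locked:

```python
def conflicts(current_lock, requested):
    """current_lock: None, "read", or "write"; requested: "read" or "write"."""
    if current_lock is None:
        return False                    # unlocked: grant the lock and reply OK
    if current_lock == "read":
        return requested == "write"     # concurrent reads are compatible
    return True                         # write-locked: any request conflicts (reply WAIT)

assert not conflicts(None, "write")
assert not conflicts("read", "read")
assert conflicts("read", "write") and conflicts("write", "read")
```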

26 OK vs. WAIT
If there is no lock conflict, the node locks the replica and replies with an OK message containing the replica's version number. Otherwise, it replies with a WAIT message.

27 Wait Period
After sending LOCK requests, a node enters a "wait period" of length W. During the wait period, if the node has gathered OK messages from all members of quorum Q, it can execute the desired operation. The node sends UNLOCK messages to release the replicas after the operation is finished.

28 Usage of the Version Number
A read operation in MUREX returns the replica with the largest version number among Q's members. A write operation writes the new replica to all members of Q, attaching a version number that is one more than the largest version number just encountered.
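A minimal sketch of the version-number rule, assuming each quorum member has replied with a (version number, value) pair (the DHT messaging itself is omitted):

```python
def read_from_quorum(replies):
    """replies: {replica key: (version, value)}. Return the value with the highest version."""
    return max(replies.values(), key=lambda pair: pair[0])[1]

def write_to_quorum(replies, new_value):
    """Return the (version, value) pair to be written to every member of the write quorum."""
    highest = max(version for version, _ in replies.values())
    return highest + 1, new_value

replies = {"r1": (3, b"old"), "r4": (5, b"newer"), "r7": (4, b"new")}
assert read_from_quorum(replies) == b"newer"
assert write_to_quorum(replies, b"latest") == (6, b"latest")
```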

29 Quorum Reselection
If a node u cannot gather OK messages from all members of Q by the end of the wait period W, it selects another quorum Q, sends LOCK requests, and enters another wait period. A node may enter wait periods repeatedly until enough OK messages are gathered or until any lock expires.

30 Quorum Reselection – Case 1
No WAIT message is received by u: this case occurs when there is no contention. Here the new quorum Q should be chosen so that |Q − R| is minimized, where R is the set of nodes that have already replied OK. Node u then sends LOCK messages only to the nodes it has not yet sent LOCK messages to.
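A minimal sketch of this selection rule, assuming a list of candidate quorums is available (how candidates are enumerated is not specified here):

```python
def reselect_quorum(candidate_quorums, already_ok):
    """Pick the quorum needing the fewest LOCKs beyond the replicas that already replied OK."""
    return min(candidate_quorums, key=lambda q: len(q - already_ok))

candidates = [{"r1", "r3", "r6"}, {"r2", "r4", "r6"}, {"r2", "r5", "r7"}]
already_ok = {"r2", "r6"}
q = reselect_quorum(candidates, already_ok)
print(q, "additional LOCKs needed:", q - already_ok)   # only r4 is missing from {r2, r4, r6}
```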

31 Quorum Reselection – Case 2
One or more WAIT messages are received by u: this case occurs when there is contention. Node u first sends UNLOCK messages to all members of Q. (Some of the UNLOCK messages need not go through the DHT, since u has learned the IPs of some nodes from their replies.) Then u selects an arbitrary quorum Q. After a random backoff time, node u sends LOCK messages to all members of Q. The random backoff is similar to that of Ethernet [Cho] and is used to avoid repeated conflicts among contending nodes.

32 Deadlock- and Livelock-Freedom
In MUREX, every lock is a leased lock with a lease period L. We also assume that the critical section of a read or write operation takes C time to complete. A leased lock automatically expires once the lease period L has elapsed. Thus, a node should release a lock rather than use it if H > L − C − D, where H is the time for which the lock has been held, measured from when the OK message was received, and D is the propagation delay of transmitting the lock.
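A minimal sketch of the lease check, with H, L, C, and D expressed in the same (illustrative) time unit:

```python
def lock_still_usable(H, L, C, D):
    """Return False when the lock should be released because H > L - C - D."""
    return H <= L - C - D

assert lock_still_usable(H=2.0, L=10.0, C=3.0, D=1.0)      # 2 <= 6: safe to enter the critical section
assert not lock_still_usable(H=7.0, L=10.0, C=3.0, D=1.0)  # 7 > 6: release the lock instead
```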

33 L, D, H and C
[Figure: timeline for nodes u and v showing the OK message, the propagation delay D, the holding time H, the critical-section time C, and the lease period L.]

34 The Leased Lock Makes Node Substitution Work
If a node u hosting replica r leaves, a node v is selected to replace u as r's host. If r is still locked when u leaves, the lock state of r is lost. When v later obtains a copy of r, it must not grant locks for r at the time E when it obtains the replica; instead, it may start granting locks at time E + L, where L is the lease period. In this way a replica is never locked by more than one node at a time, and the lock-state loss problem is solved.

35 Replica Pointer (1/2)
When a node v arrives to take over part of the load of node u, say the keys from k1 to k2 of the hash space, u should hand over to v the replicas with keys in [k1, k2]. To reduce the cost of transferring all the replicas, MUREX transfers replica pointers instead of the actual replicas. A replica pointer is a five-tuple (key, data item name, version number, lock state, storing location IP). It is produced when a replica is generated and can be used to locate where the actual replica is stored. When node v owns the replica pointer of replica r, it is regarded as r's host and can reply to lock requests for r.
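A minimal sketch of the replica-pointer record; the field names and types are illustrative, only the five components come from the slide:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReplicaPointer:
    key: int                    # hash key of the replica in the DHT key space
    item_name: str              # name of the data item the replica belongs to
    version: int                # version number of the replica
    lock_state: Optional[str]   # None (unlocked), "read", or "write"
    storing_ip: str             # IP address of the node storing the actual replica

ptr = ReplicaPointer(key=0x3F2A, item_name="reports/q1.dat",
                     version=6, lock_state=None, storing_ip="10.0.0.7")
```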

36 Replica Pointer (2/2)
Once node u sends out the replica pointer of replica r, it is no longer the host of r and may not reply to lock requests for r (even if it still stores the actual replica). A replica pointer is a lightweight mechanism for transferring replicas; it can be propagated from node to node. When a node w owning the replica pointer of r receives a lock request for r, it should check whether the node storing the actual replica of r is still alive. If so, w acts as r's host; otherwise, w regards itself as having no replica of r. Every transfer of a replica pointer between two nodes, say from u to v, is recorded locally by u so that an UNLOCK message can be forwarded to the last node holding the pointer.

37 Replica Auto-Regeneration (1/2)
When node v receives from node u a LOCK message for locking replica r, v sends a MISS message if it does not own replica r (MISS is sent only once per replica). Node v is taken to have no replica r if either of the following conditions holds:
1. v does not have the replica pointer of r, or
2. v has the replica pointer of r, which indicates that some node w stores r, but w is no longer alive.
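A minimal sketch of this test, reusing the replica-pointer sketch above; the liveness probe is an assumption, since MUREX leaves failure detection to the underlying DHT:

```python
def has_replica(pointer, is_alive):
    """pointer: a ReplicaPointer or None; is_alive: callable probing whether a node is up."""
    if pointer is None:
        return False                        # condition 1: no replica pointer at all
    return is_alive(pointer.storing_ip)     # condition 2: pointer exists but the storing node is dead

# e.g. has_replica(ptr, is_alive=lambda ip: ip in currently_reachable_nodes)
```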

38 Replica Auto-Regeneration (2/2)
After obtaining (for a read) or generating (for a write) the newest replica, node u sends it to node v. On receiving the newest replica, node v creates a replica pointer for it and may start replying to lock requests at time E + L, where E is the time the replica was received and L is the lease period. In this manner, replica regeneration is performed automatically with little overhead.

39 Analysis – Availability (1/2)
We assume that all data replicas have the same up-probability p, the probability that a single replica is up (i.e., accessible). Let R_AV(k) denote the availability of read quorums under MC(k), and W_AV(k) the availability of write quorums under MC(k).

40 Analysis – Availability (2/2)
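The availability expressions on this slide were presented as figures. As an illustrative substitute (not the authors' derivation), the following Monte Carlo sketch estimates read and write availability directly from the quorum definitions of slide 14, assuming each replica is up independently with probability p:

```python
import random

def write_quorum_exists(up):
    # Some column is entirely up and every later column has at least one replica up.
    return any(all(up[i]) and all(any(col) for col in up[i + 1:]) for i in range(len(up)))

def read_quorum_exists(up):
    type1 = all(any(col) for col in up)                    # one up replica in every column
    type2 = any(all(up[i]) and all(any(col) for col in up[i + 1:])
                for i in range(1, len(up)))                # like a write quorum, but i >= 2
    return type1 or type2

def estimate_availability(column_sizes, p, trials=100_000):
    r_ok = w_ok = 0
    for _ in range(trials):
        up = [[random.random() < p for _ in range(size)] for size in column_sizes]
        r_ok += read_quorum_exists(up)
        w_ok += write_quorum_exists(up)
    return r_ok / trials, w_ok / trials

# Example: the MC(3) structure of slide 13, column sizes (2, 3, 4), p = 0.9.
print(estimate_availability([2, 3, 4], p=0.9))
```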

41 Analysis – Quorum Size

42

43

44 When the up probability is high (for example, in a well-controlled environment), we can adopt a larger column size if write availability is the most significant concern. When the up probability is low (for example, in the Internet environment), we can adopt a smaller column size if read availability is the most significant concern.

45 Simulation

46 Related Work
As far as we know, four existing mutable storage systems have been proposed for P2P environments: Ivy [Mut], Eliot [Ste], Oasis [Rod], and Om [Yu]. In maintaining data consistency, they all encounter the problems caused by "node substitution", although this is not mentioned explicitly, and they solve these problems with logs, a replicated metadata service, dynamic quorum membership, and replica membership reconfiguration, respectively. A mechanism called informed backoff is proposed in [Lin] to intelligently collect replica states and achieve mutual exclusion (i.e., an exclusive lock) among replicas. That mechanism treats "node substitution" as a malicious fault and uses the term "random reset" to refer to the fault.

47 Ivy
Ivy [Mut] is based on a set of logs stored with the aid of distributed hash tables. It keeps a log of all updates for every participant and maintains data consistency optimistically by performing conflict resolution among the logs. The logs must be kept indefinitely, and participants must scan all logs to look up file data. Thus, Ivy is only suitable for a small group of participants.

48 Eliot
Eliot [Ste] relies on a reliable, fault-tolerant, immutable P2P storage substrate (Charles) to store data blocks, and uses an auxiliary metadata service (MS) to store mutable metadata. It supports NFS-like consistency semantics, but the traffic between the MS and the client is high under these semantics. It also supports AFS open-close consistency semantics, which may cause lost updates. The MS is provided by a conventional replicated database, which may not fit dynamic P2P environments.

49 Oasis
Oasis [Rod] is based on Gifford's weighted-voting quorum concept and allows dynamic quorum membership. It spreads versioned metadata along with data replicas over the P2P network. To complete an operation, a client must first find the related metadata to form a quorum; if the metadata cannot be found, the operation may fail.

50 Om
Om [Yu] is based on automatic replica regeneration and replica membership reconfiguration. Consistency is maintained by two quorum systems: a read-one-write-all quorum system for accessing replicas, and a witness-model quorum system for reconfiguration. Om allows replica regeneration from a single replica. However, a write in Om is always forwarded to the primary copy, which serializes all writes and uses a two-phase protocol to propagate each write to every secondary replica. The primary replica may become a bottleneck, and the overhead of the two-phase protocol may be too high. Moreover, reconfiguration by the witness model has some probability of violating consistency.

51 Sigma
The work in [Lin] uses an informed-backoff mechanism to design algorithms that achieve mutual exclusion among replicas. A node u wishing to become the winner of the mutual exclusion sends a request to each of the n (n = 3k+1) replicas and waits for responses. On receiving a request, a node puts it in a FIFO queue and replies with the ID of the node whose request is at the front of the queue. When the number of responses received by node u exceeds m (m = 2k+1), node u regards node v (possibly u itself) as the winner if more than m responses name v as the winner; otherwise, node u sends a release message to all replicas that name u as the winner, relinquishing the request. To avoid repeated conflicts in high-contention environments, node u starts over and resends requests only after a random backoff time. In this manner, a winner can be elected successfully even if a replica is reset when "node substitution" occurs. The work in [Lin] regards "node substitution" as a kind of malicious fault, while our protocol regards it as a kind of omission fault.

52 References
1. [Bha] R. Bhagwan, D. Moore, S. Savage, and G. Voelker, "Replication Strategies for Highly Available Peer-to-Peer Storage," Proc. WFDDC, 2002.
2. [Dab] F. Dabek, M. Kaashoek, D. Karger, R. Morris, and I. Stoica, "Wide-Area Cooperative Storage with CFS," Proc. SOSP, 2001.
3. [Gop] V. Gopalakrishnan, B. Silaghi, B. Bhattacharjee, and P. Keleher, "Adaptive Replication in Peer-to-Peer Systems," Proc. International Conference on Distributed Computing Systems, 2004.
4. [Kub] J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao, "OceanStore: An Architecture for Global-Scale Persistent Storage," Proc. ASPLOS, 2000.
5. [Lin] S. Lin, Q. Lian, M. Chen, and Z. Zhang, "A Practical Distributed Mutual Exclusion Protocol in Dynamic Peer-to-Peer Systems," Proc. 3rd International Workshop on Peer-to-Peer Systems (IPTPS'04), 2004.
6. [Mut] A. Muthitacharoen, R. Morris, T. Gil, and B. Chen, "Ivy: A Read/Write Peer-to-Peer File System," Proc. OSDI, 2002.
7. [Rod] M. Rodrig and A. LaMarca, "Decentralized Weighted Voting for P2P Data Management," Proc. 3rd ACM International Workshop on Data Engineering for Wireless and Mobile Access, pp. 85–92, 2003.
8. [Ste] C. Stein, M. Tucker, and M. Seltzer, "Building a Reliable Mutable File System on Peer-to-Peer Storage," Proc. WRP2PDS, 2002.
9. [Yu] H. Yu and A. Vahdat, "Consistent and Automatic Replica Regeneration," Proc. NSDI, 2004.
10. [Zho] B. Y. Zhao, A. D. Joseph, and J. Kubiatowicz, "Tapestry: A Fault Tolerant Wide Area Network Infrastructure," Proc. ACM SIGCOMM, 2001.

