Topics in Database Systems: Data Management in Peer-to-Peer Systems
Replication (P2p, Spring 05)


Types of Replication
Two types of replication:
- Index (metadata) replication: replicate index entries
- Data/document replication: replicate the actual data (e.g., music files)

Types of Replication: Caching vs Replication
- Caching: store data retrieved from a previous request (client-initiated)
- Replication: more proactive; a copy of a data item may be stored at a node even if the node has not requested it

Reasons for Replication
- Performance
  - load balancing
  - locality: place copies close to the requestor
  - geographic locality (more choices for the next step in a search), reducing the number of search hops
- Availability
  - in case of failures
  - in case of peer departures
Besides storage, replication has a cost: consistency maintenance. It makes reads faster at the expense of slower writes.

Issues
- Which items (data/metadata) to replicate: popularity (in traditional distributed systems, also the rate of reads/writes)
- Where to replicate

“Database-Flavored” Replication Control Protocols
Let's assume a data item x with copies x_1, x_2, ..., x_n.
x: logical data item; the x_i's: physical data items.
A replication control protocol is responsible for mapping each read/write on a logical data item (R(x)/W(x)) to a set of reads/writes on a (possibly proper) subset of the physical copies of x.

One-Copy Serializability
Correctness: a DBMS for a replicated database should behave like a DBMS managing a one-copy (i.e., non-replicated) database, insofar as users can tell.
One-copy serializable (1SR): the schedule of transactions on a replicated database is equivalent to a serial execution of those transactions on a one-copy database.

ROWA
Read One/Write All (ROWA): a replication control protocol that maps each read to only one copy of the item and each write to writes on all physical copies.
If even one of the copies is unavailable, an update transaction cannot terminate.

Write-All-Available
Write-all-available: a replication control protocol that maps each read to only one copy of the item and each write to writes on all available physical copies.

Quorum-Based Voting
A read quorum V_r and a write quorum V_w are required to read or write a data item. If a given data item has a total of V votes, the quorums must obey the following rules:
1. V_r + V_w > V
2. V_w > V/2
Rule 1 ensures that a data item is not read and written by two transactions concurrently (R/W conflict).
Rule 2 ensures that two write operations from two transactions cannot occur concurrently on the same data item (W/W conflict).
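A minimal Python sketch of the two quorum rules (our own illustration; the function name and the one-vote-per-copy example are assumptions, not from the slides):

```python
# Minimal sketch of quorum-based voting checks (illustrative only).

def valid_quorums(V: int, Vr: int, Vw: int) -> bool:
    """Check the two quorum rules for a data item with V total votes."""
    rule1 = Vr + Vw > V   # read and write quorums overlap: no stale reads
    rule2 = 2 * Vw > V    # two write quorums overlap: no conflicting writes
    return rule1 and rule2

# Example: 5 copies with one vote each.
V = 5
print(valid_quorums(V, Vr=3, Vw=3))  # True: 3+3 > 5 and 2*3 > 5
print(valid_quorums(V, Vr=2, Vw=3))  # False: 2+3 = 5 violates rule 1
```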

Quorum-Based Voting
In the case of network partitioning, determine which transactions are going to terminate based on the votes they can acquire. The rules ensure that two transactions initiated in two different partitions and accessing the same data item cannot both terminate at the same time.

Distributing Writes
- Immediate writes
- Deferred writes: access only one copy of the data item and delay the distribution of writes to other sites until the transaction has terminated and is ready to commit. The transaction maintains an intention list of deferred updates; after it terminates, it sends the appropriate portion of the intention list to each site that holds replicated copies.
  Trade-offs: aborts cost less, but commitment may be delayed, and conflict detection among copies is delayed.
- Primary copy: all writes are applied to the same (primary) copy of a data item.

Eager vs Lazy Replication
- Eager replication: keeps all replicas synchronized by updating all of them within a single transaction.
- Lazy replication: asynchronously propagates replica updates to other nodes after the replicating transaction commits.
In p2p, lazy replication is the norm.

Update Propagation
Who initiates the update:
- Push: by the server holding the item (copy) that changes
- Pull: by the client holding the copy
When:
- Periodic
- Immediate
- When an inconsistency is detected
- Threshold-based: on freshness (e.g., number of updates or actual time) or on value
- Time-to-live: items expire after that time
Stateless or stateful

Topics in Database Systems: Data Management in Peer-to-Peer Systems
Replication in Structured P2P (from the original Chord and CAN papers)

CHORD
Invariant to guarantee correctness of lookups: keep successor nodes up-to-date.
Method: each node maintains a successor list of its r nearest successors on the Chord ring.
Why? Availability.
How is the list kept consistent? Lazily, through periodic stabilization.
This is metadata replication (redundancy).

CHORD
Method: replicate the data associated with a key at the k nodes succeeding the key.
Why? Availability.
This is data replication.
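A small illustrative sketch (not Chord's actual code) of picking the k nodes that succeed a key on the identifier ring, i.e., the replica set described above; the node IDs and helper name are made up:

```python
# Illustrative sketch: the k nodes succeeding a key on the identifier ring.
import bisect

def k_successors(node_ids, key, k):
    ids = sorted(node_ids)
    start = bisect.bisect_left(ids, key) % len(ids)  # first node ID >= key
    return [ids[(start + j) % len(ids)] for j in range(k)]  # wrap the ring

nodes = [5, 20, 37, 61, 90, 120]
print(k_successors(nodes, key=50, k=3))  # [61, 90, 120]
```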

CAN: Multiple Realities
With r realities, each node is assigned r coordinate zones, one in every reality, and holds r independent neighbor sets. The hash table is replicated in each reality.
Availability: an item becomes unreachable only if its copies in all r realities fail.
Performance: better search; forward the query to the neighbor with coordinates closest to the destination across all realities.
This is metadata replication.

CAN: Overloading Coordinate Zones
Multiple nodes may share a zone, and the hash table may be replicated among them.
Higher availability.
Performance: more choices in the number of neighbors; a node can select neighbors closer in latency.
Cost: consistency maintenance.
This is metadata replication.

CAN: Multiple Hash Functions
Use k different hash functions to map a single key onto k points in the coordinate space.
Availability: the item is lost only if all k replicas are unavailable.
Performance: send the query to the node closest in the coordinate space, or send it to all k nodes in parallel (k parallel searches).
Costs: consistency maintenance; query traffic (if searches are parallel).
This is metadata replication.
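As a hedged sketch of the idea, one can derive k hash functions by salting a single base hash; everything here (function name, the use of SHA-1, the 2^16 grid side) is our assumption, not from the CAN paper:

```python
# Sketch: map one key to k points in a d-dimensional coordinate space by
# salting a base hash with the replica index (illustrative choices throughout).
import hashlib

def points_for_key(key: str, k: int, d: int = 2, side: int = 1 << 16):
    points = []
    for i in range(k):
        digest = hashlib.sha1(f"{i}:{key}".encode()).digest()
        # carve d coordinates out of the 20-byte digest
        coords = tuple(int.from_bytes(digest[4 * j:4 * j + 4], "big") % side
                       for j in range(d))
        points.append(coords)
    return points

print(points_for_key("song.mp3", k=3))  # 3 replica locations in the space
```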

CAN: Hot-Spot Replication
A node that finds itself overloaded by requests for a particular data key can replicate the key at each of its neighboring nodes. The neighbors can then, with a certain probability, choose to either satisfy a request or forward it.
Performance: load balancing.
This is metadata replication.

CAN: Caching
Each node maintains a cache of the data keys it recently accessed. Before forwarding a request, it first checks whether the requested key is in its cache; if so, it satisfies the request without forwarding it any further.
The number of cache entries per key grows in direct proportion to the key's popularity.
This is metadata replication.

Topics in Database Systems: Data Management in Peer-to-Peer Systems
Q. Lv et al., "Search and Replication in Unstructured Peer-to-Peer Networks", ICS '02

Search and Replication in Unstructured Peer-to-Peer Networks
The type of replication depends on the search strategy used:
(i) a number of blind-search variations of flooding
(ii) a number of (metadata) replication strategies
Evaluation method: study how they work for a number of different topologies and query distributions.

Methodology: Aspects of P2P
Performance of search depends on:
- Network topology: the graph formed by the p2p overlay network
- Query distribution: the distribution of query frequencies for individual files
- Replication: the number of nodes that hold a particular file
Assumption: fixed network topology and fixed query distribution.
The results still hold if one assumes that the time to complete a search is short compared to the time of change in network topology and in query distribution.

Network Topology (1): Power-Law Random Graph (PLRG)
A 9239-node random graph. Node degrees follow a power-law distribution: when ranked from the most connected to the least connected, the i-th ranked node has degree ω/i^α, where ω is a constant. Once the node degrees are chosen, the nodes are connected randomly.

Network Topology (2): Normal Random Graph (Random)
A 9836-node random graph.

Network Topology (3): Gnutella Graph (Gnutella)
A 4736-node graph obtained in October 2000. Node degrees roughly follow a two-segment power-law distribution.

Network Topology (4): Two-Dimensional Grid (Grid)
A two-dimensional 100x100 grid.

Network Topology [figure: the four topologies]

Query Distribution
Let q_i be the relative popularity of the i-th object (in terms of queries issued for it). Values are normalized: Σ_{i=1..m} q_i = 1.
(1) Uniform: all objects are equally popular, q_i = 1/m
(2) Zipf-like: q_i ∝ 1/i^α
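A tiny Python sketch of building the normalized Zipf-like distribution (m and alpha are illustrative values, not from the slides):

```python
# Build a normalized Zipf-like query distribution q_i ∝ 1/i^α.
m, alpha = 100, 1.2
weights = [1 / i ** alpha for i in range(1, m + 1)]
total = sum(weights)
q = [w / total for w in weights]   # now sum(q) == 1
print(q[0], q[m - 1])              # most vs least popular object
```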

Replication
Each object i is replicated on r_i nodes, and the total number of objects stored is R, that is, Σ_{i=1..m} r_i = R.
(1) Uniform: all objects are replicated at the same number of nodes, r_i = R/m
(2) Proportional: the replication of an object is proportional to its query probability, r_i ∝ q_i
(3) Square-root: the replication of object i is proportional to the square root of its query probability q_i, r_i ∝ √q_i
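A sketch of the three allocation rules as a function of a query distribution q (the helper and its parameters are ours; the tiny example distribution is illustrative):

```python
# The three replica-allocation strategies for R total copies of m objects.
import math

def allocate(q, R, strategy):
    m = len(q)
    if strategy == "uniform":          # r_i = R/m
        return [R / m] * m
    if strategy == "proportional":     # r_i = R * q_i
        return [R * qi for qi in q]
    if strategy == "square-root":      # r_i = lambda * sqrt(q_i)
        lam = R / sum(math.sqrt(qi) for qi in q)
        return [lam * math.sqrt(qi) for qi in q]
    raise ValueError(strategy)

q = [0.5, 0.3, 0.2]                    # tiny example distribution
print([round(r, 1) for r in allocate(q, R=100, strategy="square-root")])
```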

Query Distribution & Replication
When replication is uniform, the query distribution is irrelevant (since all objects are replicated by the same amount, search times are equivalent).
When the query distribution is uniform, all three replication distributions are equivalent.
Thus, there are three relevant combinations:
(1) Uniform/Uniform
(2) Zipf-like/Proportional
(3) Zipf-like/Square-root

Metrics
- Pr(success): probability of finding the queried object before the search terminates
- #hops: delay in finding an object, measured in number of hops

Metrics
- #msgs per node: overhead of an algorithm, measured as the average number of search messages each node in the p2p network has to process
- #nodes visited
- Percentage of message duplication
- Peak #msgs: the number of messages the busiest node has to process (to identify hot spots)

Simulation Methodology
For each experiment:
- First select the topology and the query/replication distributions.
- For each object i with replication r_i, generate numPlace different sets of random replica placements (each set contains r_i random nodes on which to place the replicas of object i).
- For each replica placement, randomly choose numQuery different nodes from which to initiate the query for object i.
Thus, we get numPlace x numQuery queries. In the paper, numPlace = 10 and numQuery = 100, giving 1000 different queries per object.
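A rough sketch of this experiment generator (the function name and structure are ours, not from the paper):

```python
# Generate (replica placement, query initiator) pairs for one object.
import random

def make_queries(nodes, r_i, num_place=10, num_query=100):
    """For one object with r_i replicas: (placement, initiator) pairs."""
    experiments = []
    for _ in range(num_place):
        placement = random.sample(nodes, r_i)   # r_i random replica sites
        for _ in range(num_query):
            initiator = random.choice(nodes)    # node issuing the query
            experiments.append((placement, initiator))
    return experiments                          # num_place * num_query queries

nodes = list(range(9000))
print(len(make_queries(nodes, r_i=11)))         # 1000 queries for this object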

Limitation of Flooding: Choice of TTL
- Too low: the node may not find the object, even if it exists.
- Too high: burdens the network unnecessarily.
Example: search for an object replicated at 0.125% of the nodes (~11 nodes out of a total of 9000).
Note that the right TTL depends on the topology. It also depends on replication (which is, however, unknown).

Limitation of Flooding: Choice of TTL
The message overhead of a given TTL also depends on the topology. [figure]

Limitation of Flooding
There are many duplicate messages (due to cycles), particularly in highly connected graphs: multiple copies of a query are sent to a node by multiple neighbors. Duplicate messages can be detected and not forwarded, BUT the number of duplicates can still be excessive, and it worsens as the TTL increases.

Limitation of Flooding [figure: behavior at different nodes]

Limitation of Flooding: Comparison of the Topologies
Power-law and Gnutella-style graphs are particularly bad with flooding: highly connected nodes mean more duplicate messages, because many nodes' neighbor sets overlap.
The random graph is best, because in a truly random graph the duplication ratio (the likelihood that the next node has already received the query) equals the fraction of nodes visited so far, as long as that fraction is small. The random graph also gives a better load distribution among nodes.

Two New Blind-Search Strategies
1. Expanding ring: not a fixed TTL (iterative deepening)
2. Random walks (more details below): reduce the number of duplicate messages

Expanding Ring (Iterative Deepening)
Note that since flooding queries nodes in parallel, the search may not stop even after the object is located.
Use successive floods with increasing TTL:
- A node starts a flood with a small TTL.
- If the search is not successful, the node increases the TTL and starts another flood.
- The process repeats until the object is found.
Works well when hot objects are replicated more widely than cold objects.

Expanding Ring (Details)
Need to define:
- A policy: at which depths the iterations occur (i.e., the successive TTLs)
- A time period W between successive iterations: after waiting for a period W without receiving a positive response (i.e., the requested object), the query initiator resends the query with a larger TTL.
Nodes maintain the IDs of queries for W + ε. A node that receives the same message as in the previous round does not process it; it just forwards it. A sketch of the initiator's loop follows.
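A minimal sketch of the initiator side only, assuming a hypothetical flood(query, ttl) primitive that floods up to ttl hops and returns any hits; the per-node deduplication described above is not modeled:

```python
# Expanding-ring (iterative deepening) search loop at the query initiator.
import time

def expanding_ring(query, flood, ttls=(1, 3, 5, 7), wait_w=2.0):
    for ttl in ttls:                  # the "policy": the successive TTLs
        hits = flood(query, ttl)
        if hits:
            return hits               # positive response within this ring
        time.sleep(wait_w)            # wait period W before retrying deeper
    return None                       # give up after the largest TTL
```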

Expanding Ring
Start with TTL = 1 and increase by a step of 2 each time.
For replication over 10%, the search stops at TTL 1 or 2.

Expanding Ring
Comparison of message overhead between flooding and expanding ring: even for objects replicated at 0.125% of the nodes, and even if flooding uses the best TTL for each topology, expanding ring still halves the per-node message overhead.

Expanding Ring
The improvement is more pronounced for the Random and Gnutella graphs than for the PLRG, partly because the very high degree nodes in the PLRG reduce the opportunity for incremental retries in the expanding ring.
It introduces a slight increase in the delay of finding an object: from 2-4 hops with flooding to 3-6 with expanding ring.

Random Walks
Forward the query to a randomly chosen neighbor at each step; each query message is called a walker.
k-walkers: the requesting node sends k query messages, and each message takes its own random walk. k walkers after T steps should reach roughly the same number of nodes as 1 walker after kT steps, so this cuts the delay by a factor of k. Experimentally, 16 to 64 walkers give good results.

Random Walks
When to terminate the walks:
- TTL-based
- Checking: the walker periodically checks with the original requestor before walking to the next node (a large TTL is still used, just to prevent loops)
Experiments show that checking once every 4th step strikes a good balance between the overhead of the checking messages and the benefits of checking. A sketch follows.
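A toy lockstep simulation of a k-walker search (our own sketch; as noted in the docstring, in a lockstep simulation the per-step success test already plays the role of the check-back with the requestor):

```python
# Toy lockstep simulation of a k-walker random walk (names are ours).
import random

def k_walker_search(graph, start, has_object, k=32, max_steps=1024):
    """graph: dict node -> neighbor list; has_object(node) -> bool.

    In this lockstep toy, testing for success every step stands in for the
    paper's periodic check-back with the requestor."""
    walkers = [start] * k                        # k walkers leave the requestor
    for step in range(1, max_steps + 1):
        for i, node in enumerate(walkers):
            if has_object(node):
                return node, step                # requestor stops all walkers
            walkers[i] = random.choice(graph[node])
    return None, max_steps                       # TTL-style cutoff

# toy 4-node ring where node 2 holds the object
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(k_walker_search(ring, start=0, has_object=lambda n: n == 2, k=2))
```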

Random Walks
Compared to flooding: the 32-walker random walk reduces message overhead by roughly two orders of magnitude for all queries across all network topologies, at the expense of a slight increase in the number of hops (from 2-6 to 4-15).
Compared to expanding ring: the 32-walker random walk outperforms expanding ring as well, particularly on the PLRG and Gnutella graphs.

Random Walks: Keeping State
- Each query has a unique ID, and its k walkers are tagged with this ID.
- For each ID, a node remembers the neighbors to which it has forwarded the query.
- When a new query with the same ID arrives, the node forwards it to a different (randomly chosen) neighbor.
This improves Random and Grid, reducing message overhead by up to 30% and the number of hops by up to 30%. The improvements for Gnutella and PLRG are small. A sketch follows.
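A sketch of the per-query state a node could keep (the class structure and names are our assumptions):

```python
# Per-query state keeping at a node: remember which neighbors already saw
# each query ID, and steer new walkers with the same ID elsewhere.
import random

class Node:
    def __init__(self, neighbors):
        self.neighbors = neighbors
        self.forwarded = {}            # query ID -> neighbors already used

    def next_hop(self, query_id):
        used = self.forwarded.setdefault(query_id, set())
        fresh = [n for n in self.neighbors if n not in used] or self.neighbors
        choice = random.choice(fresh)  # prefer a neighbor not tried for this ID
        used.add(choice)
        return choice

n = Node(neighbors=["a", "b", "c"])
print(n.next_hop("q1"), n.next_hop("q1"), n.next_hop("q1"))  # spreads walkers
```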

Principles of Search
- Adaptive termination is very important: use expanding ring or the checking method.
- Message duplication should be minimized: preferably, each query should visit a node just once.
- The granularity of coverage should be small: each additional step should not significantly increase the number of nodes visited.

Replication
How many copies of each object? Addressed theoretically in another paper; three types of replication:
- Uniform
- Proportional
- Square-root

Replication: Problem Definition
How many copies of each object should be kept so that the search overhead for the object is minimized, assuming that the total amount of storage for objects in the network is fixed?

Replication Theory
Assume m objects and n nodes. Each object i is replicated on r_i distinct nodes, and the total number of objects stored is R, that is, Σ_{i=1..m} r_i = R.
Assume that object i is requested with relative rate q_i, normalized so that Σ_{i=1..m} q_i = 1.
For convenience, assume 1 << r_i ≤ n.

Replication Theory
Assume that searches go on until a copy is found, and that searches consist of randomly probing sites until the desired object is found.
The probability Pr(k) that the object is found on the k-th probe is:
Pr(k) = Pr(not found in the previous k-1 probes) * Pr(found on the k-th probe) = (1 - r_i/n)^(k-1) * (r_i/n)

Replication Theory
We are interested in the average search size A over all objects (the average number of nodes probed per object query).
The average search size for object i is the inverse of the fraction of sites that hold replicas of the object: A_i = n/r_i.
Average search size over all objects: A = Σ_i q_i A_i = n Σ_i q_i/r_i.
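For completeness, A_i follows from the mean of the geometric distribution (a standard step the slides leave implicit):

```latex
% Mean of the geometric distribution with success probability p = r_i / n:
\[
A_i = \sum_{k=1}^{\infty} k \left(1 - \frac{r_i}{n}\right)^{k-1} \frac{r_i}{n}
    = \frac{n}{r_i},
\qquad
A = \sum_{i=1}^{m} q_i A_i = n \sum_{i=1}^{m} \frac{q_i}{r_i}.
\]
```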

Replication Theory
If there were no limit on r_i, we would replicate everything everywhere: the average search size would be A_i = n/r_i = 1, and search would become trivial.
Instead, the average number of replicas per site ρ = R/n is fixed. The question is how to allocate these R replicas among the m objects, i.e., how many replicas per object.

Uniform Replication
Create the same number of replicas for each object: r_i = R/m.
Average search size under uniform replication: A_i = n/r_i = m/ρ, so A_uniform = Σ_i q_i m/ρ = m/ρ, which is independent of the query distribution.
It seems to make sense to allocate more copies to objects that are frequently queried; this should reduce the search size for the more popular objects.

Proportional Replication
Create a number of replicas for each object proportional to its query rate: r_i = R q_i.
Average search size: A_i = n/r_i = n/(R q_i), so A_proportional = Σ_i q_i n/(R q_i) = m/ρ = A_uniform, which is again independent of the query distribution.
Why? Objects whose query rates are greater than average (> 1/m) do better with proportional replication, and the others do better with uniform; the weighted average balances out to be the same.
So what is the optimal way to allocate replicas so that A is minimized?

Square-Root Replication
Find the r_i that minimize A = Σ_i q_i A_i = n Σ_i q_i/r_i.
The minimum is attained at r_i = λ √q_i, where λ = R / Σ_i √q_i.
The average search size is then A_optimal = (1/ρ) (Σ_i √q_i)^2.
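The optimality of r_i ∝ √q_i follows from a standard Lagrange-multiplier argument, written out here for completeness (not spelled out in the slides):

```latex
% Minimize A = n * sum_i q_i / r_i subject to sum_i r_i = R:
\[
\frac{\partial}{\partial r_i}\left( n \sum_j \frac{q_j}{r_j}
      + \mu \Big( \sum_j r_j - R \Big) \right)
  = -\frac{n q_i}{r_i^2} + \mu = 0
  \quad\Longrightarrow\quad r_i \propto \sqrt{q_i}.
\]
\[
\text{With } r_i = \lambda \sqrt{q_i},\;
\lambda = \frac{R}{\sum_i \sqrt{q_i}}:
\qquad
A_{\mathrm{optimal}}
  = n \sum_i \frac{q_i}{\lambda \sqrt{q_i}}
  = \frac{n}{R} \Big( \sum_i \sqrt{q_i} \Big)^{2}
  = \frac{1}{\rho} \Big( \sum_i \sqrt{q_i} \Big)^{2}.
\]
```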

Replication (Summary)
Each object i is replicated on r_i nodes; the total number of objects stored is R, that is, Σ_{i=1..m} r_i = R.
(1) Uniform: all objects are replicated at the same number of nodes, r_i = R/m
(2) Proportional: replication proportional to the query probability, r_i ∝ q_i
(3) Square-root: replication proportional to the square root of the query probability, r_i ∝ √q_i

Other Metrics: Discussion
Utilization rate: the rate of requests that a replica of object i receives, U_i = R q_i / r_i.
- Under uniform replication, all objects have the same average search size, but replicas have utilization rates proportional to their query rates.
- Proportional replication achieves perfect load balancing, with all replicas having the same utilization rate, but average search sizes vary: more popular objects have smaller average search sizes than less popular ones.

Replication: Summary [figure]

Pareto Distribution [figure]

Achieving Square-Root Replication
How can we achieve square-root replication in practice?
- Assume that each query keeps track of its search size.
- Each time a query finishes, the object is copied to a number of sites proportional to the number of probes.
On average, object i will then be replicated c n/r_i times each time a query is issued (for some constant c). It can be argued that this converges to square-root replication. A sketch follows.
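A small sketch of probe-proportional copying (the function name and the constant c are illustrative assumptions):

```python
# After a successful query, copy the object to ~c * probes of the nodes
# seen during the search.
import random

def replicate_after_query(visited_nodes, probes, c=0.1):
    n_copies = max(1, round(c * probes))             # ~ c * search size
    n_copies = min(n_copies, len(visited_nodes))
    return random.sample(visited_nodes, n_copies)    # sites receiving a copy

visited = list(range(40))                            # nodes probed by a search
print(replicate_after_query(visited, probes=len(visited)))
```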

Achieving Square-Root Replication
What about replica deletion? The lifetime of replicas must be independent of object identity and query rate: FIFO or random deletion is fine, but LRU or LFU is not.

Replication - Conclusion
Square-root replication is needed to minimize the overall search traffic: an object should be replicated at a number of nodes proportional to the number of probes that the search required.

Replication - Implementation
Two strategies are easily implementable:
- Owner replication: when a search succeeds, the object is stored at the requestor node only (used in Gnutella).
- Path replication: when a search succeeds, the object is stored at all nodes along the path from the requestor node to the provider node (used in Freenet).

Replication - Implementation
If a p2p system uses k walkers, the number of nodes between the requestor and the provider is 1/k of the total nodes visited, so path replication should result in square-root replication.
Problem: it tends to place replicas at nodes that lie along the same topological path.

Replication - Implementation
Random replication: when a search succeeds, count the number of nodes on the path between the requestor and the provider, say p; then randomly pick p of the nodes that the k walkers visited and replicate the object there.
Harder to implement.

Replication: Evaluation
The three replication strategies are studied on the Random graph topology.
Simulation details:
- Place the m distinct objects randomly in the network.
- A query generator issues queries according to a Poisson process at 5 queries/sec.
- Queries follow a Zipf distribution over the m objects (with α = 1.2).
- For each query, the initiator is chosen randomly.
- Searches use a 32-walker random walk with state keeping and checking every 4 steps.
- Each site stores at most objAllow (40) objects; deletion is random.
- Warm-up period of 10,000 secs; snapshots every 2,000-query chunks.

Replication: Evaluation
For each replication strategy:
- What kind of replication ratio distribution does the strategy generate?
- What is the average number of messages per node in a system using the strategy?
- What is the distribution of the number of hops in a system using the strategy?

Replication: Evaluation
Both path and random replication generate replication ratios quite close to the square root of the query rates.

Replication: Evaluation
Path replication and random replication reduce the overall message traffic by a factor of 3 to 4.

Replication: Evaluation
Much of the traffic reduction comes from reducing the number of hops; path and random replication do better than owner replication. For example, the percentage of queries that finish within 4 hops: 71% for owner, 86% for path, and 91% for random replication.