1
Gossip-based Search Selection in Hybrid Peer-to-Peer Networks. M. Zaharia and S. Keshav, D. R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada. matei@matei.ca, keshav@uwaterloo.ca. IPTPS 2006, Feb 28th 2006
2
The Search Problem Decentralized system of nodes, each of which stores copies of documents Keyword-based search o Each document is identified by a set of keywords (e.g. song title) o Queries return lists of documents whose keyword sets are supersets of the query keywords (“AND queries”) Example o Song: “Here Comes the Sun” keywords: “Here”, “Comes”, “The”, “Sun” o Query: “Here” AND “Sun” o Responses: “Here Comes the Sun”, “The Sun is Here”
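The AND-query semantics above can be sketched in a few lines. This is only an illustration: the function names `keywords` and `search`, the lower-casing, and the whitespace tokenization are assumptions, not part of the presented system.

```python
def keywords(title):
    """Split a title into a set of lower-cased keywords (assumed tokenization)."""
    return set(title.lower().split())


def search(query, titles):
    """Return the titles whose keyword sets are supersets of the query keywords."""
    wanted = keywords(query)
    return [t for t in titles if wanted <= keywords(t)]


titles = ["Here Comes the Sun", "The Sun is Here", "You Are My Sun"]
# "Here" AND "Sun" matches the first two titles but not the third.
print(search("here sun", titles))
```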
3
Metrics Success rate o fraction of queries that return a result, conditional on a result being available Number of results found o no more than a desired maximum R_max Response time o for the first result, and for the R_max-th result Bandwidth cost o includes costs of index creation, query propagation, and fetching result(s)
4
Key Workload Characteristics Document popularities follow a Zipfian distribution o Some documents are more widely copied than others o Are also requested more often Some nodes have much faster connections and much longer connection durations than others
5
So… Retrieve popular documents with least work Offload work to better-connected and longer-lived peers How can we do that?
6
Hybrid P2P network [Loo, IPTPS 2004] [Figure labels: DHT, ultrapeers, peers, bootstrap nodes] Flood queries for popular documents Use DHT for rare documents Only publish rare documents to DHT index
7
How to know document popularity? PIERSearch uses o Observations of: result size history, keyword frequency, keyword pair frequency o Sampling of neighboring nodes These are all local Global knowledge is better
8
More on global knowledge Want histogram of document popularity o i.e. number of ultrapeers that index a document o we only care about popular documents, so can truncate the tail On getting a query, sum histogram values for all matching document titles and divide by number of ultrapeers If this exceeds threshold, then flood, else use DHT* (* modulo rare documents with common keywords; see paper)
9
Example Assume 100 ultrapeers and only two documents Suppose title ‘Here Comes the Sun’ has count 15 (15 ultrapeers index it) and ‘You Are My Sun’ has count 2 Query ‘Sun’ has sum (15+2)/100 = 0.17 Query ‘Are My’ has sum 2/100 = 0.02 If threshold is 0.05, then the first query is flooded and for the second we use the DHT
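The worked example can be checked with a short sketch. The names `popularity_score` and `choose_search` are hypothetical, and the whitespace-based keyword matching is an assumption; the counts, the number of ultrapeers, and the threshold are the ones from the slide.

```python
def popularity_score(query, histogram, num_ultrapeers):
    """Sum the counts of all titles matching the query, normalized by ultrapeers."""
    wanted = set(query.lower().split())
    total = sum(
        count
        for title, count in histogram.items()
        if wanted <= set(title.lower().split())
    )
    return total / num_ultrapeers


def choose_search(query, histogram, num_ultrapeers, threshold):
    """Flood when the normalized score exceeds the threshold, else use the DHT."""
    score = popularity_score(query, histogram, num_ultrapeers)
    return "flood" if score > threshold else "dht"


histogram = {"Here Comes the Sun": 15, "You Are My Sun": 2}
print(popularity_score("Sun", histogram, 100))     # (15 + 2) / 100
print(popularity_score("Are My", histogram, 100))  # 2 / 100
print(choose_search("Sun", histogram, 100, 0.05))
print(choose_search("Are My", histogram, 100, 0.05))
```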
10
How to compute the histogram? Central server o Centralizes load and introduces single point of failure Compute on induced tree o brittle to failures Gossip o pick random node and exchange partial histograms o can result in double counting
11
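Double counting problem A: a, b; B: a, c; C: a, d. After A and B gossip, both hold a:2 b:1 c:1. After C joins the exchange: a:3 b:1 c:1 d:1. After a further exchange: a:5 b:1 c:1 d:1, although only three nodes hold a.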
12
Avoiding double counting When an ultrapeer indexes a document title it hasn’t indexed already, it tosses a coin up to k times and counts the number of heads it sees before the first tail = CT Gossip CT values for all titles with other ultrapeers to compute maxCT o because max is an extremal value, no double counting (Flajolet-Martin) Count of the number of ultrapeers with the document is roughly 2^maxCT Example o 1000 nodes o Chances are good that one will see 10 consecutive heads o It gossips ‘10’
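A minimal sketch of the coin-tossing counter, under stated assumptions: the function names, the cap k = 32, and the seed are illustrative and not taken from the talk. It shows that delivering every CT value twice leaves the maximum unchanged, which is why gossiping a max avoids double counting. A single maximum is a noisy estimate, so the printed value will only be of the right order of magnitude.

```python
import random


def coin_tosses(k, rng):
    """Count heads seen before the first tail, tossing at most k times."""
    heads = 0
    while heads < k and rng.random() < 0.5:
        heads += 1
    return heads


def gossip_max(ct_a, ct_b):
    """Merging two CT values keeps the larger; repeating a merge changes nothing."""
    return max(ct_a, ct_b)


def estimate(max_ct):
    """Rough count of ultrapeers holding the title: 2 to the power maxCT."""
    return 2 ** max_ct


rng = random.Random(2006)  # fixed seed, chosen arbitrarily
cts = [coin_tosses(32, rng) for _ in range(1000)]  # one CT per ultrapeer

max_ct = 0
for ct in cts + cts:  # every value is delivered twice on purpose
    max_ct = gossip_max(max_ct, ct)

print(max_ct, estimate(max_ct))
```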
13
Approximate histograms Use coin-flipping trick for each document o Note that there can be up to 50% error Gossip partial histograms Concatenate histograms Truncate low-count documents
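One way to read "concatenate and truncate" is sketched below. It assumes a partial histogram is a mapping from title to CT value, that concatenation means taking the union of titles with the per-title maximum, and that truncation keeps a fixed number of the largest entries; `merge`, `truncate`, and `keep` are hypothetical names.

```python
def merge(local, remote):
    """Union of titles; for titles in both, keep the larger CT value."""
    merged = dict(local)
    for title, ct in remote.items():
        merged[title] = max(merged.get(title, 0), ct)
    return merged


def truncate(hist, keep):
    """Keep only the `keep` titles with the largest CT values."""
    top = sorted(hist.items(), key=lambda item: (-item[1], item[0]))[:keep]
    return dict(top)


a = {"a": 3, "b": 1}
b = {"a": 2, "c": 4, "d": 0}

merged = merge(a, b)
print(merged)
print(merge(merged, b) == merged)  # merging the same histogram again is a no-op
print(truncate(merged, 2))
```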
14
What about the threshold? If chosen too low, flood too often! If chosen too high, flood too rarely! Threshold is time dependent and load dependent No easy way to choose it
15
Adaptive thresholding Associate utility with the performance of a query Threshold should maximize utility For some queries, use both flooding and DHT and compare utilities This will tell us how to move the threshold in the future
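The slide does not spell out the update rule, so the following is only one possible sketch of the idea. It assumes that a probe query is answered by both flooding and the DHT, that each answer has a scalar utility, and that the threshold moves by a fixed step; `adjust_threshold`, `score`, and `step` are hypothetical names and values.

```python
def adjust_threshold(threshold, score, utility_flood, utility_dht, step=0.01):
    """Nudge the threshold after a probe query answered by both methods.

    score is the query's normalized popularity (histogram sum / ultrapeers).
    The fixed step size is an assumption made for this sketch.
    """
    would_flood = score > threshold
    flood_better = utility_flood > utility_dht
    if flood_better and not would_flood:
        # Flooding did better but the threshold chose the DHT: lower it.
        threshold = max(0.0, threshold - step)
    elif not flood_better and would_flood:
        # The DHT did at least as well but the threshold chose flooding: raise it.
        threshold = threshold + step
    return threshold


print(adjust_threshold(0.05, 0.03, utility_flood=1.0, utility_dht=0.5))
print(adjust_threshold(0.05, 0.17, utility_flood=0.2, utility_dht=0.9))
print(adjust_threshold(0.05, 0.17, utility_flood=0.9, utility_dht=0.2))
```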
16
Utility function
17
Adaptive thresholding
18
Evaluation Built an event-driven simulator for peer-to-peer search in generic peer-to-peer network architectures, in Java. Simulates each query, response and document download. Uses user lifetime and bandwidth distributions observed in real systems. Generates random exact queries based on fetch-at-most-once model (Zipfian with flattened head) o can also use traces of queries from real systems.
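A fetch-at-most-once query generator can be sketched as follows. This is not the simulator's code (which was written in Java); it assumes each peer draws documents by Zipf popularity and redraws when it hits a document it has already fetched. The names `zipf_weights` and `next_request` and the small document count are illustrative.

```python
import random


def zipf_weights(num_docs, alpha=1.0):
    """Unnormalized Zipf weights: the document of rank r gets weight 1 / r**alpha."""
    return [1.0 / rank ** alpha for rank in range(1, num_docs + 1)]


def next_request(fetched, weights, rng):
    """Draw a document by popularity, skipping ones this peer already fetched."""
    if len(fetched) >= len(weights):
        return None  # this peer has fetched every document
    while True:
        doc = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if doc not in fetched:
            fetched.add(doc)
            return doc


rng = random.Random(1)
weights = zipf_weights(50)
fetched = set()  # per-peer record of documents already requested
requests = [next_request(fetched, weights, rng) for _ in range(50)]
print(requests[:10])
```

Because a peer never requests the same document twice, the most popular documents receive fewer requests than a pure Zipf law would give them, which is one way to picture the flattened head.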
19
Parameters 3 peers join every 4 seconds Each enters with an average of 20 documents, randomly chosen from a dataset of 20,000 unique documents Peers emit queries on average once every 300 seconds, requesting at most 25 results Zipf parameter of 1.0. 1.7 million queries over a 22 hour period
20
Simulation stability Stable population achieved at 20,000 seconds Variance of all results under 5% and removed for clarity
21
Systems compared
22
Metrics
23
Performance (normalized)
24
Adaptive thresholding
25
Scaling (normalized)
26
Trace-based simulation Trace of 50 ultrapeers for 3 hours on Sunday October 12, 2003 ~230,000 distinct queries ~200,000 distinct keywords ~672,000 distinct documents
27
Conclusions Gossip is an effective way to compute global state Utility functions provide simple ‘knobs’ to control performance and balance competing objectives Adaptive algorithms (threshold selection and flooding) reduce the need for external management and “magic constants” Giving hybrid ultrapeers access to global state reduces overhead by a factor of about two
28
Questions?
29
User Characteristics User capabilities are heterogeneous: From: Saroiu, S., Gummadi, P.K., Gribble, S.D: A Measurement Study of Peer-to-Peer File Sharing Systems, MMCN ’02
30
User Characteristics Node lifetimes are heterogeneous: From: Saroiu, S., Gummadi, P.K., Gribble, S.D: A Measurement Study of Peer-to-Peer File Sharing Systems, MMCN ’02
31
Refinement What about a query like ‘love amour’? Unlikely to match a popular document, so it will be sent to the DHT But the keywords are common! So the DHT will have to do an expensive join If we know which keywords are common, we can flood instead
32
Simulator Speedup Fast I/O routines o Java creates temporary objects during string concatenation. Custom, large StringBuffer for string concatenation greatly improves performance. Batch database uploads o prepared statements turn out to be much less efficient than importing a table from a tab-separated text file. Avoid keyword search for exact queries Can simulate 20 hours with a population of 7000 users (~2,300,000 queries) in about 20 minutes