
Gossip-based Search Selection in Hybrid Peer-to-Peer Networks
M. Zaharia and S. Keshav
D. R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada
matei@matei.ca, keshav@uwaterloo.ca
IPTPS 2006, Feb 28th 2006

The Search Problem
- Decentralized system of nodes, each of which stores copies of documents
- Keyword-based search
  o Each document is identified by a set of keywords (e.g. song title)
  o Queries return lists of documents whose keyword sets are supersets of the query keywords (“AND queries”)
- Example
  o Song: “Here Comes the Sun”; keywords: “Here”, “Comes”, “The”, “Sun”
  o Query: “Here” AND “Sun”
  o Responses: “Here Comes the Sun”, “The Sun is Here”
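The AND-query rule above can be illustrated with a minimal sketch. It assumes that a document's keywords are simply the lowercased words of its title; the helper names (`keywords_of`, `matches`, `search`) are hypothetical and not taken from the talk.

```python
def keywords_of(title):
    # Assumption: keywords are the lowercased, whitespace-separated words of the title.
    return {word.lower() for word in title.split()}


def matches(query_keywords, doc_keywords):
    # AND query: the document's keyword set must be a superset of the query keywords.
    return set(query_keywords) <= set(doc_keywords)


def search(query, titles):
    # Return every title whose keyword set contains all query keywords.
    query_keywords = {word.lower() for word in query}
    return [t for t in titles if matches(query_keywords, keywords_of(t))]


titles = ["Here Comes the Sun", "The Sun is Here", "You are my Sun"]
results = search(["Here", "Sun"], titles)
print(results)
```

With the three example titles, the query “Here” AND “Sun” matches the first two and skips “You are my Sun”, which lacks “Here”.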

Metrics
- Success rate
  o fraction of queries that return a result, conditional on a result being available
- Number of results found
  o no more than a desired maximum R_max
- Response time
  o for the first result, and for the R_max-th result
- Bandwidth cost
  o includes the costs of index creation, query propagation, and fetching the result(s)

Key Workload Characteristics
- Document popularities follow a Zipfian distribution
  o Some documents are more widely copied than others
  o These are also requested more often
- Some nodes have much faster connections and much longer connection durations than others

So…
- Retrieve popular documents with the least work
- Offload work to better-connected and longer-lived peers
How can we do that?

Hybrid P2P network [Loo, IPTPS 2004]
[Diagram labels: DHT, Ultrapeers, Peers, Bootstrap Nodes]
- Flood queries for popular documents
- Use DHT for rare documents
- Only publish rare documents to DHT index

How to know document popularity?
- PIERSearch uses
  o Observations of
    - result size history
    - keyword frequency
    - keyword pair frequency
  o Sampling of neighboring nodes
- These are all local
- Global knowledge is better

More on global knowledge
- Want histogram of document popularity
  o i.e. number of ultrapeers that index a document
  o we only care about popular documents, so can truncate the tail
- On getting a query, sum histogram values for all matching document titles and divide by number of ultrapeers
- If this exceeds a threshold, then flood, else use DHT*
* modulo rare documents with common keywords, see paper

Example
- Assume 100 ultrapeers and only two documents
- Suppose title ‘Here comes the Sun’ has count 15 (15 ultrapeers index it) and ‘You are my Sun’ has count 2
- Query ‘Sun’ has sum (15+2)/100 = 0.17
- Query ‘Are My’ has sum 2/100 = 0.02
- If threshold is 0.05, then the first query is flooded and for the second, we use the DHT
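The arithmetic in this example can be checked with a short sketch. It assumes a histogram keyed by document title and whitespace-separated, case-insensitive keywords; the function names `popularity_score` and `choose_search` are hypothetical, and the refinement for rare documents with common keywords is not modeled.

```python
def popularity_score(query, histogram, num_ultrapeers):
    # Sum the counts of all titles whose keywords include every query keyword,
    # then normalize by the number of ultrapeers.
    query_keywords = {w.lower() for w in query.split()}
    total = sum(
        count
        for title, count in histogram.items()
        if query_keywords <= {w.lower() for w in title.split()}
    )
    return total / num_ultrapeers


def choose_search(query, histogram, num_ultrapeers, threshold):
    # Flood when the normalized popularity exceeds the threshold, else use the DHT.
    score = popularity_score(query, histogram, num_ultrapeers)
    return "flood" if score > threshold else "dht"


histogram = {"Here comes the Sun": 15, "You are my Sun": 2}
print(popularity_score("Sun", histogram, 100))     # (15 + 2) / 100
print(popularity_score("Are My", histogram, 100))  # 2 / 100
print(choose_search("Sun", histogram, 100, 0.05))
print(choose_search("Are My", histogram, 100, 0.05))
```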

How to compute the histogram?
- Central server
  o Centralizes load and introduces single point of failure
- Compute on induced tree
  o brittle to failures
- Gossip
  o pick random node and exchange partial histograms
  o can result in double counting

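Double counting problem
[Diagram, partly recoverable]
- Node A holds documents a, b; node B holds a, c
- After A and B gossip, both hold the histogram a:2 b:1 c:1
- Node C holds a, d
- Further exchanges give a:3 b:1 c:1 d:1 and then a:5 b:1 c:1 d:1, although only three nodes (A, B, C) hold document a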
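A small sketch shows how the count of document a can drift above its true value. It assumes a naive exchange rule in which both nodes replace their partial histograms with the element-wise sum; this rule and the order of exchanges are illustrative assumptions, not necessarily the exact sequence drawn on the slide.

```python
from collections import Counter


def naive_exchange(x, y):
    # Hypothetical naive gossip step: both nodes adopt the element-wise sum.
    merged = x + y
    return merged, Counter(merged)


a = Counter({"a": 1, "b": 1})  # node A holds documents a, b
b = Counter({"a": 1, "c": 1})  # node B holds documents a, c
c = Counter({"a": 1, "d": 1})  # node C holds documents a, d

a, b = naive_exchange(a, b)  # A and B now both report a:2
a, c = naive_exchange(a, c)  # A and C now both report a:3 (correct so far)
c, b = naive_exchange(c, b)  # B's a:2 is added again, so a reaches 5

true_holders_of_a = 3
print(c["a"], true_holders_of_a)
```

The last exchange re-adds contributions from A and B that C had already absorbed, which is the double counting the slide warns about.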

Avoiding double counting
- When an ultrapeer indexes a document title it hasn’t indexed already, it tosses a coin up to k times and counts the number of heads it sees before the first tail = CT
- Gossip CT values for all titles with other ultrapeers to compute maxCT
  o because max is an extremal value, no double counting
- (Flajolet-Martin) Count of the number of ultrapeers with the document is roughly 2^maxCT
- Example
  o 1000 nodes
  o Chances are good that one will see 10 consecutive heads
  o It gossips ‘10’
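The coin-tossing estimate can be sketched as follows for a single document title. The cap k = 32, the seed, and the function names are assumptions made for illustration; the estimate is only rough, as the slide notes.

```python
import random


def coin_tosses(k, rng):
    # Count heads seen before the first tail, tossing at most k times.
    heads = 0
    for _ in range(k):
        if rng.random() < 0.5:
            heads += 1
        else:
            break
    return heads


def gossip_max(mine, theirs):
    # Merging by max is idempotent, so hearing a value twice changes nothing.
    return max(mine, theirs)


def estimate_count(max_ct):
    # Flajolet-Martin style estimate: roughly 2^maxCT ultrapeers hold the title.
    return 2 ** max_ct


rng = random.Random(2006)  # arbitrary seed, for repeatability only
cts = [coin_tosses(32, rng) for _ in range(1000)]  # one CT per ultrapeer

max_ct = 0
for ct in cts + cts:  # deliver every value twice to mimic repeated gossip
    max_ct = gossip_max(max_ct, ct)

print(max_ct, estimate_count(max_ct))
```

Delivering each CT twice leaves maxCT unchanged, which is why the max-based scheme avoids the double counting seen with summation.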

Approximate histograms
- Use coin-flipping trick for each document
  o Note that there can be up to 50% error
- Gossip partial histograms
- Concatenate histograms
- Truncate low-count documents
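One way to picture these steps is the sketch below, which assumes each partial histogram maps a title to its largest CT seen so far. Merging by per-title max and keeping a fixed number of top entries (`max_entries`) are assumptions for illustration; the slide does not specify the truncation rule.

```python
def merge_histograms(mine, theirs):
    # Combine two partial histograms, keeping the larger CT for each title.
    merged = dict(mine)
    for title, ct in theirs.items():
        merged[title] = max(merged.get(title, 0), ct)
    return merged


def truncate(histogram, max_entries):
    # Keep only the max_entries titles with the highest CT values.
    top = sorted(histogram.items(), key=lambda item: item[1], reverse=True)
    return dict(top[:max_entries])


def estimated_counts(histogram):
    # Convert each CT into an approximate holder count of 2^CT.
    return {title: 2 ** ct for title, ct in histogram.items()}


mine = {"Here Comes the Sun": 4, "You are my Sun": 1}
theirs = {"Here Comes the Sun": 3, "The Sun is Here": 2}

merged = merge_histograms(mine, theirs)
kept = truncate(merged, 2)
print(estimated_counts(kept))
```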

What about the threshold?
- If chosen too low, flood too often!
- If chosen too high, flood too rarely!
- Threshold is time dependent and load dependent
- No easy way to choose it

Adaptive thresholding
- Associate utility with the performance of a query
- Threshold should maximize utility
- For some queries, use both flooding and DHT and compare utilities
- This will tell us how to move the threshold in the future
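The idea of probing with both methods can be sketched as a simple update rule. The utility values, the fixed step size, and the function `adjust_threshold` are all hypothetical; the utility function and the actual update rule used in the paper are not given in this transcript.

```python
def adjust_threshold(threshold, score, flood_utility, dht_utility, step=0.01):
    # score: normalized popularity of the probe query.
    # flood_utility / dht_utility: utilities measured by running both methods.
    if flood_utility > dht_utility and score <= threshold:
        # The current threshold would pick the DHT, but flooding did better:
        # lower the threshold so that similar queries get flooded.
        return max(0.0, threshold - step)
    if dht_utility > flood_utility and score > threshold:
        # The current threshold would flood, but the DHT did better:
        # raise the threshold so that similar queries go to the DHT.
        return threshold + step
    # The current choice already agrees with the better method.
    return threshold


threshold = 0.05
threshold = adjust_threshold(threshold, score=0.02, flood_utility=1.0, dht_utility=0.5)
print(threshold)
```

A fixed step is the simplest choice for a sketch; any rule that moves the threshold toward the method with higher measured utility would fit the description on the slide.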

Utility function

Adaptive thresholding

Evaluation
- Built an event-driven simulator, in Java, for peer-to-peer search in generic peer-to-peer network architectures.
- Simulates each query, response and document download.
- Uses user lifetime and bandwidth distributions observed in real systems.
- Generates random exact queries based on the fetch-at-most-once model (Zipfian with flattened head)
  o can also use traces of queries from real systems.
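The fetch-at-most-once model can be illustrated with the sketch below: each user draws documents by Zipfian popularity but never requests the same document twice. This is one plausible reading of the model, written in Python for brevity; it is not the simulator's Java code, and the function names and seed are assumptions. The default exponent of 1.0 matches the Zipf parameter listed under Parameters.

```python
import random


def zipf_weights(num_documents, exponent=1.0):
    # Unnormalized Zipf weights: the document of rank r has weight 1 / r^exponent.
    return [1.0 / (rank ** exponent) for rank in range(1, num_documents + 1)]


def next_request(weights, already_fetched, rng):
    # Draw a document index by popularity among those this user has not fetched.
    candidates = [i for i in range(len(weights)) if i not in already_fetched]
    if not candidates:
        return None  # this user has already fetched every document
    candidate_weights = [weights[i] for i in candidates]
    doc = rng.choices(candidates, weights=candidate_weights, k=1)[0]
    already_fetched.add(doc)
    return doc


rng = random.Random(1)  # arbitrary seed, for repeatability only
weights = zipf_weights(5)
fetched = set()
requests = [next_request(weights, fetched, rng) for _ in range(6)]
print(requests)
```

Because a user stops requesting a document once it has been fetched, the most popular documents receive fewer requests than a pure Zipf law would predict, which is the flattened head mentioned on the slide.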

Parameters
- 3 peers join every 4 seconds
- Each enters with an average of 20 documents, randomly chosen from a dataset of 20,000 unique documents
- Peers emit queries on average once every 300 seconds, requesting at most 25 results
- Zipf parameter of 1.0
- 1.7 million queries over a 22-hour period

Simulation stability
- Stable population achieved at 20,000 seconds
- Variance of all results is under 5% and is omitted from the plots for clarity

Systems compared

Metrics

Performance (normalized)

Adaptive thresholding

Scaling (normalized)

Trace-based simulation
- Trace of 50 ultrapeers for 3 hours on Sunday October 12, 2003
  o ~230,000 distinct queries
  o ~200,000 distinct keywords
  o ~672,000 distinct documents

Conclusions
- Gossip is an effective way to compute global state
- Utility functions provide simple ‘knobs’ to control performance and balance competing objectives
- Adaptive algorithms (threshold selection and flooding) reduce the need for external management and “magic constants”
- Giving hybrid ultrapeers access to global state reduces overhead by a factor of about two

Questions?

User Characteristics
- User capabilities are heterogeneous
From: Saroiu, S., Gummadi, P.K., Gribble, S.D.: A Measurement Study of Peer-to-Peer File Sharing Systems, MMCN ’02

User Characteristics
- Node lifetimes are heterogeneous
From: Saroiu, S., Gummadi, P.K., Gribble, S.D.: A Measurement Study of Peer-to-Peer File Sharing Systems, MMCN ’02

Refinement
- What about a query like ‘love amour’?
- Unlikely to match a popular document, so it will be sent to the DHT
- But the keywords are common!
- So, the DHT will have to do an expensive join
- If we know which keywords are common, we can flood instead

Simulator Speedup
- Fast I/O routines
  o Java creates temporary objects during string concatenation. A custom, large StringBuffer for string concatenation greatly improves performance.
- Batch database uploads
  o Prepared statements turn out to be much less efficient than importing a table from a tab-separated text file.
- Avoid keyword search for exact queries
- Can simulate 20 hours with a population of 7000 users (~2,300,000 queries) in about 20 minutes
