Gossip-based Search Selection in Hybrid Peer-to-Peer Networks M. Zaharia and S. Keshav D.R.Cheriton School of Computer Science University of Waterloo,

Gossip-based Search Selection in Hybrid Peer-to-Peer Networks
M. Zaharia and S. Keshav
D. R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada
IPTPS 2006, Feb 28th 2006

The Search Problem
• Decentralized system of nodes, each of which stores copies of documents
• Keyword-based search
  o Each document is identified by a set of keywords (e.g. song title)
  o Queries return lists of documents whose keyword sets are supersets of the query keywords (“AND queries”)
• Example
  o Song: “Here Comes the Sun” → keywords: “Here”, “Comes”, “The”, “Sun”
  o Query: “Here” AND “Sun”
  o Responses: “Here Comes the Sun”, “The Sun is Here”
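The AND-query semantics above can be illustrated with a minimal sketch. The function names `keywords` and `and_query` are hypothetical, and splitting titles on whitespace with case-insensitive comparison is an assumption, not something the slides specify.

```python
def keywords(title):
    """Split a title into a set of lower-cased keywords (assumed tokenization)."""
    return set(title.lower().split())


def and_query(query_terms, titles):
    """Return the titles whose keyword set is a superset of the query terms."""
    wanted = {term.lower() for term in query_terms}
    return [title for title in titles if wanted <= keywords(title)]


titles = ["Here Comes the Sun", "The Sun is Here", "You are my Sun"]
matches = and_query(["Here", "Sun"], titles)
print(matches)  # the two titles containing both "Here" and "Sun"
```

The subset test `wanted <= keywords(title)` is what makes this an AND query: every query keyword must appear in the document's keyword set.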

Metrics
• Success rate
  o fraction of queries that return a result, conditional on a result being available
• Number of results found
  o no more than a desired maximum R_max
• Response time
  o for the first result, and for the R_max-th result
• Bandwidth cost
  o includes the costs of index creation, query propagation, and fetching the result(s)

Key Workload Characteristics
• Document popularities follow a Zipfian distribution
  o Some documents are more widely copied than others
  o These documents are also requested more often
• Some nodes have much faster connections and much longer connection durations than others

So…
• Retrieve popular documents with least work
• Offload work to better-connected and longer-lived peers
How can we do that?

Hybrid P2P network [Loo, IPTPS 2004]
[Figure labels: DHT, Ultrapeers, Peers, Bootstrap Nodes]
• Flood queries for popular documents
• Use DHT for rare documents
• Only publish rare documents to DHT index

How to know document popularity?
• PIERSearch uses
  o Observations of
    - result size history
    - keyword frequency
    - keyword pair frequency
  o Sampling of neighboring nodes
• These are all local
• Global knowledge is better

More on global knowledge
• Want histogram of document popularity
  o i.e. number of ultrapeers that index a document
  o we only care about popular documents, so can truncate the tail
• On getting a query, sum histogram values for all matching document titles and divide by number of ultrapeers
• If this exceeds threshold, then flood, else use DHT *
* modulo rare documents with common keywords, see paper

Example
• Assume 100 ultrapeers and only two documents
• Suppose title ‘Here comes the Sun’ has count 15 (15 ultrapeers index it) and ‘You are my Sun’ has count 2
• Query ‘Sun’ has sum (15+2)/100 = 0.17
• Query ‘Are My’ has sum 2/100 = 0.02
• If threshold is 0.05, then the first query is flooded and for the second we use the DHT
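The arithmetic in this example can be checked with a short sketch of the flood-or-DHT decision. The names `popularity_score` and `choose_search` are hypothetical, and matching titles by whitespace-split, case-insensitive keywords is an assumption; the slides do not describe the matching code.

```python
def popularity_score(query_terms, histogram, num_ultrapeers):
    """Sum the counts of all titles matching the query, divided by the ultrapeer count."""
    wanted = {term.lower() for term in query_terms}
    total = sum(
        count
        for title, count in histogram.items()
        if wanted <= set(title.lower().split())
    )
    return total / num_ultrapeers


def choose_search(query_terms, histogram, num_ultrapeers, threshold):
    """Flood when the score exceeds the threshold, otherwise use the DHT."""
    score = popularity_score(query_terms, histogram, num_ultrapeers)
    return "flood" if score > threshold else "dht"


# Counts taken from the example: number of ultrapeers indexing each title.
histogram = {"Here comes the Sun": 15, "You are my Sun": 2}

print(choose_search(["Sun"], histogram, 100, 0.05))        # score 17/100
print(choose_search(["Are", "My"], histogram, 100, 0.05))  # score 2/100
```

With a threshold of 0.05, the first query scores 17/100 and is flooded, while the second scores 2/100 and goes to the DHT.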

How to compute the histogram?
• Central server
  o Centralizes load and introduces single point of failure
• Compute on induced tree
  o brittle to failures
• Gossip
  o pick random node and exchange partial histograms
  o can result in double counting

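Double counting problem
[Figure: nodes A (holding a, b), B (holding a, c) and C (holding a, d) exchange partial histograms. The histograms shown are a:2 b:1 c:1 (twice), then a:3 b:1 c:1 d:1, then a:5 b:1 c:1 d:1, so document a ends up counted five times although only three nodes hold it.]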

Avoiding double counting
• When an ultrapeer indexes a document title it hasn’t indexed already, it tosses a coin up to k times and counts the number of heads it sees before the first tail; call this count CT
• Gossip CT values for all titles with other ultrapeers to compute maxCT
  o because max is an extremal value, no double counting
• (Flajolet-Martin) Count of the number of ultrapeers with the document is roughly 2^maxCT
• Example
  o 1000 nodes
  o Chances are good that one will see 10 consecutive heads
  o It gossips ‘10’
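A minimal sketch of the coin-tossing trick follows, assuming a fair coin and a hypothetical cap of k = 32 tosses; the helper names are illustrative and not taken from the paper. It also shows why gossiping the same value twice is harmless: taking a maximum is idempotent.

```python
import random


def coin_toss_count(k, rng):
    """Count heads seen before the first tail, tossing at most k times."""
    heads = 0
    while heads < k and rng.random() < 0.5:
        heads += 1
    return heads


def estimate_holders(ct_values):
    """Flajolet-Martin style estimate: roughly 2 ** max(CT)."""
    return 2 ** max(ct_values)


rng = random.Random(42)  # fixed seed (an assumption) so runs are repeatable
# One CT value per ultrapeer that indexes the document; 1000 as in the example.
cts = [coin_toss_count(32, rng) for _ in range(1000)]

# Hearing every value a second time cannot change the maximum,
# which is why this scheme avoids double counting.
assert max(cts + cts) == max(cts)

print(max(cts), estimate_holders(cts))
```

The estimate is always a power of two, so it is coarse; it is meant to separate popular from rare documents, not to give exact counts.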

Approximate histograms
• Use coin-flipping trick for each document
  o Note that there can be up to 50% error
• Gossip partial histograms
• Concatenate histograms
• Truncate low-count documents
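One way to read the gossip, concatenate and truncate steps is sketched below. It assumes each partial histogram maps a document title to the largest CT value heard so far, and that merging keeps the per-title maximum; the function names and the `keep` parameter are hypothetical.

```python
def merge(local, remote):
    """Combine two partial histograms, keeping the larger CT for each title."""
    merged = dict(local)
    for title, ct in remote.items():
        merged[title] = max(merged.get(title, 0), ct)
    return merged


def truncate(histogram, keep):
    """Keep only the `keep` titles with the highest CT values."""
    ranked = sorted(histogram.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:keep])


def estimated_counts(histogram):
    """Turn CT values into approximate holder counts (2 ** CT)."""
    return {title: 2 ** ct for title, ct in histogram.items()}


mine = {"a": 3, "b": 1}
theirs = {"a": 2, "c": 4}
combined = merge(mine, theirs)
print(truncate(combined, keep=2))
print(estimated_counts(combined))
```

Because `merge` is idempotent and order-independent, an ultrapeer can gossip with the same neighbour repeatedly without inflating any count.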

What about the threshold?
• If chosen too low, flood too often!
• If chosen too high, flood too rarely!
• Threshold is time dependent and load dependent
• No easy way to choose it

Adaptive thresholding
• Associate utility with the performance of a query
• Threshold should maximize utility
• For some queries, use both flooding and DHT and compare utilities
• This will tell us how to move the threshold in the future
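The slides do not give the update rule, so the following is only an illustrative sketch of how comparing utilities on a probe query could move the threshold. The function `adjust_threshold`, the `step` size, and the rule of nudging the threshold toward the score of a misrouted query are all assumptions, not the authors' algorithm.

```python
def adjust_threshold(threshold, score, utility_flood, utility_dht, step=0.1):
    """Nudge the threshold toward the score of a probe query that was misrouted.

    `score` is the query's popularity estimate; the two utilities come from
    running the same query with both flooding and the DHT.
    """
    flood_was_better = utility_flood > utility_dht
    would_flood = score > threshold
    if flood_was_better == would_flood:
        return threshold  # the current threshold already picks the better method
    # Move a fraction of the way toward the score: upward if we flooded
    # needlessly, downward if we should have flooded.
    return threshold + step * (score - threshold)


t = 0.05
t = adjust_threshold(t, score=0.17, utility_flood=1.0, utility_dht=2.0)
print(t)  # threshold rises because the DHT beat flooding for this query
```

A small `step` keeps one unusual query from swinging the threshold, at the cost of slower adaptation when load changes.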

Utility function

Adaptive thresholding

Evaluation
• Built an event-driven simulator, in Java, for peer-to-peer search in generic peer-to-peer network architectures
• Simulates each query, response and document download
• Uses user lifetime and bandwidth distributions observed in real systems
• Generates random exact queries based on the fetch-at-most-once model (Zipfian with flattened head)
  o can also use traces of queries from real systems

Parameters
• 3 peers join every 4 seconds
• Each enters with an average of 20 documents, randomly chosen from a dataset of 20,000 unique documents
• Peers emit queries on average once every 300 seconds, requesting at most 25 results
• Zipf parameter of 1.0
• 1.7 million queries over a 22 hour period
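As a rough illustration of the query workload, the sketch below draws document requests from a Zipf distribution with parameter 1.0 over 20,000 documents and enforces fetch-at-most-once per peer by rejecting repeats. The rejection approach and all names are assumptions; the actual simulator is written in Java and its internals are not shown in the slides.

```python
import random
from itertools import accumulate

NUM_DOCS = 20_000  # dataset size from the parameters above
ZIPF_S = 1.0       # Zipf parameter from the parameters above

# Weight of the document at popularity rank r is 1 / r ** s.
WEIGHTS = [1.0 / rank ** ZIPF_S for rank in range(1, NUM_DOCS + 1)]
CUM_WEIGHTS = list(accumulate(WEIGHTS))


def next_request(already_fetched, rng):
    """Draw a document index by popularity, skipping ones this peer already has."""
    while True:
        doc = rng.choices(range(NUM_DOCS), cum_weights=CUM_WEIGHTS)[0]
        if doc not in already_fetched:
            already_fetched.add(doc)
            return doc


rng = random.Random(1)  # fixed seed (an assumption) so runs are repeatable
fetched = set()
requests = [next_request(fetched, rng) for _ in range(50)]
print(len(set(requests)))  # every request from this peer is for a new document
```

Since a peer never re-requests a document it already holds, the most popular documents receive fewer repeat requests than a pure Zipf draw would give them.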

Simulation stability
• Stable population achieved at 20,000 seconds
• Variance of all results is under 5%, so it is omitted from the plots for clarity

Systems compared

Metrics

Performance (normalized)

Adaptive thresholding

Scaling (normalized)

Trace-based simulation
• Trace of 50 ultrapeers for 3 hours on Sunday October 12, 2003
  o ~230,000 distinct queries
  o ~200,000 distinct keywords
  o ~672,000 distinct documents

Conclusions
• Gossip is an effective way to compute global state
• Utility functions provide simple ‘knobs’ to control performance and balance competing objectives
• Adaptive algorithms (threshold selection and flooding) reduce the need for external management and “magic constants”
• Giving hybrid ultrapeers access to global state reduces overhead by a factor of about two

Questions?

User Characteristics
• User capabilities are heterogeneous
From: Saroiu, S., Gummadi, P.K., Gribble, S.D.: A Measurement Study of Peer-to-Peer File Sharing Systems, MMCN ’02

User Characteristics
• Node lifetimes are heterogeneous
From: Saroiu, S., Gummadi, P.K., Gribble, S.D.: A Measurement Study of Peer-to-Peer File Sharing Systems, MMCN ’02

Refinement
• What about a query like ‘love amour’?
• Unlikely to match a popular document, so it will be sent to the DHT
• But the keywords are common!
• So the DHT will have to do an expensive join
• If we know which keywords are common, we can flood instead

Simulator Speedup
• Fast I/O routines
  o Java creates temporary objects during string concatenation. A custom, large StringBuffer for string concatenation greatly improves performance.
• Batch database uploads
  o Prepared statements turn out to be much less efficient than importing a table from a tab-separated text file.
• Avoid keyword search for exact queries
• Can simulate 20 hours with a population of 7000 users (~2,300,000 queries) in about 20 minutes