Full-Text Search in P2P Networks
Christof Leng, Databases and Distributed Systems Group, TU Darmstadt
www.tu-darmstadt.de, www.dfg.de, www.quap2p.de

DFG RESEARCH GROUP QUAP2P TECHNISCHE UNIVERSITÄT DARMSTADT

Slide 2: Content
- Short intro to full-text search
- Full-text search on DHTs
- Performance comparison
- Conclusion / Outlook

Slide 3: What is full-text search?
- Searching for documents that contain all words in a specified list
  - Search for “QuaP2P” AND “Darmstadt” AND “Research”
- Very common operation
  - Google
  - Filesharing
  - Wikis
  - Source code
  - Document / knowledge management
  - …
- Can be extended to phrase search
  - Search for “TU Darmstadt” AND “Christof Leng”

Slide 4: Inverted Index
- Full-text search is normally implemented with inverted indexes
- The query result is the intersection of the entries for all searched words
- Stemming can reduce the number of word entries

Example documents (labels follow the index below):
  doc1: “Similarity searches accelerate P2P downloads by 30-70 percent.”
  doc2: “New P2P system could provide speed increase.”
  doc3: “I fail to see how this will make downloads faster.”

Inverted index:
  Word        Documents
  30          doc1
  70          doc1
  accelerate  doc1
  by          doc1
  could       doc2
  downloads   doc1, doc3
  fail        doc3
  faster      doc3
  how         doc3
  i           doc3
  increase    doc2
  make        doc3
  new         doc2
  p2p         doc1, doc2
  percent     doc1
  provide     doc2
  searches    doc1
  see         doc3
  similarity  doc1
  speed       doc2
  system      doc2
  this        doc3
  to          doc3
  will        doc3
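The index on this slide can be rebuilt in a few lines. The following is a minimal sketch, not the presenter's implementation: the tokenizer (lowercase, split on non-alphanumeric characters) and the function names are assumptions, and the number in doc1 is taken from the index entries “30” and “70”.

```python
import re
from collections import defaultdict

# Example documents from the slide; labels follow the index table.
DOCS = {
    "doc1": "Similarity searches accelerate P2P downloads by 30-70 percent.",
    "doc2": "New P2P system could provide speed increase.",
    "doc3": "I fail to see how this will make downloads faster.",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, words):
    """Return the documents containing all query words (set intersection)."""
    postings = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*postings) if postings else set()

index = build_index(DOCS)
print(sorted(search(index, ["p2p", "downloads"])))  # ['doc1']
print(sorted(search(index, ["downloads"])))         # ['doc1', 'doc3']
```

No stemming is applied here, so “searches” and “search” would be separate entries.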

Slide 5: Overlay Types and Full-Text Search
Peer-to-peer overlay types and where the inverted index lives:
- Centralized: inverted index on central server
- Pure / Hierarchical: inverted index on each (super-)node
- Structured / DHT: distributed inverted index (the challenge)

Slide 6: Naïve Approach
- Map the inverted index to the DHT
- Key lookup for every word
- Intersect result lists at the client

Pro:
- Simple
- Short latency

Con:
- Result lists may be extremely large!
- Result list sizes may vary extremely!

Figure: search for “QuaP2P” AND “Darmstadt” AND “Research”, with one lookup each to the nodes responsible for QuaP2P, Darmstadt and Research.
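To make the cost of the naïve approach concrete, here is a small sketch under stated assumptions: `ToyDHT` is a hypothetical in-process stand-in for a real DHT (hashing a word to one of a fixed number of nodes), and traffic is counted simply as the number of document ids sent to the client.

```python
import hashlib

class ToyDHT:
    """Hypothetical stand-in for a DHT: each word lives on the node its hash maps to."""

    def __init__(self, num_nodes):
        self.nodes = [dict() for _ in range(num_nodes)]

    def node_for(self, key):
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def put(self, word, doc_id):
        self.nodes[self.node_for(word)].setdefault(word, set()).add(doc_id)

    def get(self, word):
        return self.nodes[self.node_for(word)].get(word, set())

def naive_search(dht, words):
    """One key lookup per word, intersection at the client.

    Returns the hits and the number of document ids transferred to the client.
    """
    lists = [dht.get(w) for w in words]
    transferred = sum(len(result) for result in lists)
    hits = set.intersection(*lists) if lists else set()
    return hits, transferred

dht = ToyDHT(num_nodes=8)
for doc_id in range(1000):
    dht.put("research", doc_id)        # very common word
for doc_id in range(0, 1000, 100):
    dht.put("darmstadt", doc_id)       # less common word
dht.put("quap2p", 500)                 # rare word

hits, transferred = naive_search(dht, ["quap2p", "darmstadt", "research"])
print(hits, transferred)  # {500} 1011
```

A single matching document costs 1011 transferred ids here, almost all of them from the most common word.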

Slide 7: Zipf Distributions in Natural Text
- Some words are extremely common
- Most words are extremely uncommon
- The largest word frequency is proportional to the number of distinct words
- Avoid transferring result lists before intersection!

Figure: word occurrences plotted against word rank.
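The shape of such a distribution can be illustrated with an idealized model. This sketch assumes a Zipf law with exponent 1 in which the rarest word occurs once, so the word at rank r occurs roughly V / r times for V distinct words; real text only approximates this.

```python
def zipf_counts(num_words):
    """Idealized Zipf law (exponent 1): the word at rank r occurs num_words // r times."""
    return [num_words // rank for rank in range(1, num_words + 1)]

counts = zipf_counts(10_000)

print(counts[0])                          # 10000: occurrences of the most common word
print(sum(1 for c in counts if c == 1))   # 5000: words that occur exactly once

# Share of all word occurrences that belongs to the ten most common words.
top10_share = sum(counts[:10]) / sum(counts)
print(f"{top10_share:.0%}")
```

In this model the top word's count equals the vocabulary size, and half of all distinct words occur only once.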

Slide 8: Intersecting on the Way
- Query the least common word first
- Forward the result list to the node for the next word
- Intersect on the way

Pro:
- Reduces traffic

Con:
- High latency
- Knowledge about word frequencies required
- Example: a search for “the” and “who” (7.2 and 2.4 billion hits on Google, respectively)

Figure: search for “QuaP2P” AND “Darmstadt” AND “Research”, forwarded along the nodes responsible for QuaP2P, Darmstadt and Research.
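The forwarding scheme can be sketched as follows. This is an illustration only: the per-word lists are held in one local dictionary instead of on separate DHT nodes, list sizes are assumed to be known up front (the "knowledge about word frequencies" the slide lists as a drawback), and traffic is counted in document ids.

```python
def chained_search(index, words):
    """Visit words from rarest to most common, forwarding only the surviving ids.

    Returns (hits, ids transferred, forwarding hops).
    """
    if not words:
        return set(), 0, 0
    # Assumption: the list size of every word is known before the query starts.
    order = sorted(words, key=lambda w: len(index.get(w, set())))
    current = set(index.get(order[0], set()))
    transferred = 0
    hops = 0
    for word in order[1:]:
        transferred += len(current)        # list forwarded to the next word's node
        hops += 1
        current &= index.get(word, set())  # intersect on the way
    transferred += len(current)            # final answer returned to the client
    return current, transferred, hops

index = {
    "research": set(range(1000)),          # very common word
    "darmstadt": set(range(0, 1000, 100)), # less common word
    "quap2p": {500},                       # rare word
}

print(chained_search(index, ["research", "quap2p", "darmstadt"]))  # ({500}, 3, 2)
```

Only three ids cross the network, but the lookups now run one after another, which is where the higher latency comes from.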

Slide 9: Using Bloom Filters
- Bloom filters reduce the size of the forwarded result lists
- Forward Bloom filters and return the result list recursively

Pro:
- Reduces traffic even more (by up to a factor of 50)

Con:
- Even higher latency
- Getting complicated

Figure: search for “QuaP2P” AND “Darmstadt” AND “Research”, with Bloom filters forwarded along the nodes responsible for QuaP2P, Darmstadt and Research.
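A two-node version of the Bloom filter exchange might look like the sketch below. It is an assumption-laden illustration rather than the protocol of any cited system: the filter size, the number of hash functions, and the SHA-256-based hashing are arbitrary choices, and both "nodes" run in one process.

```python
import hashlib

class BloomFilter:
    """Compact set summary: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def bloom_intersect(list_a, list_b, num_bits=1024, num_hashes=4):
    """Intersect two result lists while sending a filter instead of list A."""
    # Node A summarises its list in a Bloom filter and sends only the filter to B.
    bloom = BloomFilter(num_bits, num_hashes)
    for doc_id in list_a:
        bloom.add(doc_id)
    # Node B returns the ids that may be in A's list (false positives possible).
    candidates = {doc_id for doc_id in list_b if doc_id in bloom}
    # Node A removes the false positives using its real list.
    return candidates & set(list_a), len(candidates)

list_a = set(range(0, 1000, 100))   # 10 ids
list_b = set(range(1000))           # 1000 ids
hits, num_candidates = bloom_intersect(list_a, list_b)
print(sorted(hits) == sorted(list_a), num_candidates)
```

The extra round trip back to node A is what removes false positives, and it is also the source of the additional latency.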

Slide 10: Zipf Distributions in Query Terms, too
- Query popularity obeys Zipf's law (déjà vu!)
- This puts a high load on the nodes that hold the most popular keys
- Even worse, this load scales linearly with network size and user activity
- The responsible nodes are assigned randomly (one could be a modem user)
- Hotspots will occur
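The hotspot effect can be estimated without a full simulation. The sketch below assumes key popularity proportional to 1/rank and a hash-based assignment of keys to nodes; the key names (`term1`, `term2`, ...) and the sizes are hypothetical.

```python
import hashlib

def node_for(key, num_nodes):
    """Assign a key to a node by hashing, as a DHT would (simplified)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

def expected_load(num_keys, num_nodes):
    """Share of all lookups each node receives when key popularity follows 1/rank."""
    weights = [1.0 / rank for rank in range(1, num_keys + 1)]
    total = sum(weights)
    load = [0.0] * num_nodes
    for rank, weight in enumerate(weights, start=1):
        load[node_for(f"term{rank}", num_nodes)] += weight / total
    return load

load = expected_load(num_keys=10_000, num_nodes=100)
print(f"fair share: {1 / len(load):.3f}, busiest node: {max(load):.3f}")
```

With these numbers the node holding the most popular key alone receives roughly a tenth of all lookups, about ten times its fair share of 1%.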

Slide 11: Caching and Precomputation
- Caching
  - Keep lists received for intersection
  - Keep answers to popular queries
  - Traffic reduction: 38%
  - But: how to ensure coherence?
- Precomputation
  - Inverted index for pairs or tuples of words
  - Only feasible for the most popular words (but most effective there anyway)
  - Traffic reduction: 50%
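The precomputation idea can be sketched for word pairs. The function names and the example index are hypothetical, and the choice of which words count as "popular" is left to the caller.

```python
from itertools import combinations

def build_pair_index(index, popular_words):
    """Precompute the intersection for every pair of the given popular words."""
    return {
        frozenset(pair): index[pair[0]] & index[pair[1]]
        for pair in combinations(sorted(popular_words), 2)
    }

def search_two_words(index, pair_index, a, b):
    """Answer a two-word query from the pair index when possible."""
    key = frozenset((a, b))
    if key in pair_index:
        return pair_index[key], "precomputed"
    return index.get(a, set()) & index.get(b, set()), "intersected"

index = {
    "the": set(range(1000)),
    "who": set(range(0, 1000, 2)),
    "quap2p": {500},
}
pair_index = build_pair_index(index, ["the", "who"])

hits, source = search_two_words(index, pair_index, "who", "the")
print(len(hits), source)   # 500 precomputed
hits, source = search_two_words(index, pair_index, "quap2p", "the")
print(len(hits), source)   # 1 intersected
```

The pair index grows quadratically with the number of words it covers, which is why the slide restricts it to the most popular words.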

Slide 12: Further Optimizations
- Compression of result lists
- Adaptive set intersection
- Gap compression
- Clustering of keys
- Incremental results
  - Do not return all results at once
  - Should be used in conjunction with a ranking algorithm
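Of the optimizations listed, gap compression is the easiest to show in code. This is a generic sketch, assuming document ids are non-negative integers: the sorted list is stored as differences between neighbours, each encoded in a variable number of bytes (7 payload bits per byte). The byte format is an illustrative choice, not one prescribed by the slides.

```python
def encode_gaps(doc_ids):
    """Store a sorted id list as gaps, each gap as a variable-length byte sequence."""
    out = bytearray()
    previous = 0
    for doc_id in sorted(doc_ids):
        gap = doc_id - previous
        previous = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation flag set
            gap >>= 7
        out.append(gap)                       # last byte, continuation flag clear
    return bytes(out)

def decode_gaps(data):
    """Inverse of encode_gaps: rebuild the sorted id list."""
    doc_ids, current, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            doc_ids.append(current)
            gap, shift = 0, 0
    return doc_ids

ids = [1000, 1003, 1010, 1500]
encoded = encode_gaps(ids)
print(len(encoded))                 # 6 bytes instead of 16 for four 32-bit ids
print(decode_gaps(encoded) == ids)  # True
```

Small gaps fit in one byte, so dense result lists for common words compress best.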

Slide 13: Comparison of Different Approaches
- Yang et al. compared
  - DHT with Bloom filters
  - Supernode with exhaustive flooding
  - Unstructured random walk w/o replication
- Network size: 1000
- Random data set from the WWW
- All approaches have strengths

Figure: image from Yang et al., Performance of Full Text Search in Structured and Unstructured Peer-to-Peer Systems.

Slide 14: Feasibility of P2P Web Search Engine
- Li et al. calculated the bandwidth usage of a P2P-based web search engine
  - 3 billion documents (10 KB each)
  - 60,000 peers
- Basic DHT was 100x worse than basic Gnutella
- DHT optimizations (e.g. Bloom filters) made it competitive
- No index creation or maintenance cost included (60 TB)
- No replica maintenance cost included

Slide 15: Conclusion
- Distributed inverted indexes are challenging
- Implementation requires a lot of tricks
- Performance is not outstanding
- No comparison to state-of-the-art unstructured systems is available
- Maybe even more tricks from information retrieval research will help
- Modeling the correct workload is really important for system design

Slide 16: Outlook
- Examine robustness of full-text search under Zipf query workloads
- Implement DHT full-text search in simulator
- Compare state-of-the-art unstructured and structured full-text search overlays
- Improve consistency and coherence in DHT full-text search systems
- Implement wiki and source code management with full-text search for Scenario B
- Phrase search is even more challenging…

Slide 17: Recommended Reading

Performance Comparison
- Li et al. On the Feasibility of Peer-to-Peer Web Indexing and Search. IPTPS.
- Yang et al. Performance of Full Text Search in Structured and Unstructured Peer-to-Peer Systems. INFOCOM.

DHT Full-Text Search
- P. Reynolds and A. Vahdat. Efficient Peer-to-Peer Keyword Searching. IMC.
- O. Gnawali. A Keyword Set Search System for Peer-to-Peer Networks. MSc thesis, MIT.

Workload Modeling
- Breslau et al. Web Caching and Zipf-like Distributions: Evidence and Implications. INFOCOM.
- Gummadi et al. Measurement, Modeling and Analysis of a Peer-to-Peer File-Sharing Workload. SOSP 2003.