Published by Grant Jefferson. Modified over 9 years ago.
1
Full-Text Search in P2P Networks
Christof Leng
Databases and Distributed Systems Group, TU Darmstadt
www.tu-darmstadt.de  www.dfg.de  www.quap2p.de
2
DFG RESEARCH GROUP QUAP2P TECHNISCHE UNIVERSITÄT DARMSTADT
Content
  Short intro to full-text search
  Full-text search on DHTs
  Performance comparison
  Conclusion / Outlook
3
What is full-text search?
  Searching for documents containing all of a list of specified words
    Search for “QuaP2P” “Darmstadt” “Research”
  Very common operation
    Google
    Filesharing
    Wikis
    Source Code
    Document / Knowledge Management
    …
  Can be extended to phrase search
    Search for “TU Darmstadt” “Christof Leng”
4
Inverted Index
  Full-text search is normally solved with inverted indexes
  Query result is the intersection of all searched word entries
  Stemming can reduce the number of word entries

  Example documents:
    doc1  “Similarity searches accelerate P2P downloads by 30-70 percent.”
    doc2  “New P2P system could provide speed increase.”
    doc3  “I fail to see how this will make downloads faster.”

  Resulting inverted index:
    Word        Documents
    30          doc1
    70          doc1
    accelerate  doc1
    by          doc1
    could       doc2
    downloads   doc1, doc3
    fail        doc3
    faster      doc3
    how         doc3
    i           doc3
    increase    doc2
    make        doc3
    new         doc2
    p2p         doc1, doc2
    percent     doc1
    provide     doc2
    searches    doc1
    see         doc3
    similarity  doc1
    speed       doc2
    system      doc2
    this        doc3
    to          doc3
    will        doc3
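The inverted index on this slide can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the helper names (tokenize, build_index, search) are made up here, tokens are lowercased alphanumeric runs, and no stemming is applied.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, words):
    """Return the ids of documents that contain all query words."""
    postings = [index.get(w.lower(), set()) for w in words]
    if not postings:
        return set()
    return set.intersection(*postings)

docs = {
    "doc1": "Similarity searches accelerate P2P downloads by 30-70 percent.",
    "doc2": "New P2P system could provide speed increase.",
    "doc3": "I fail to see how this will make downloads faster.",
}
index = build_index(docs)
hits = search(index, ["P2P", "downloads"])  # only doc1 contains both words
```

A query for a word that is not indexed yields an empty set, so one missing word empties the whole intersection.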
5
Overlay Types and Full-Text Search
  Peer-to-Peer
    Centralized: inverted index on central server
    Pure / Hierarchical: inverted index on each (super-)node
    Structured / DHT: distributed inverted index (challenge)
6
Naïve Approach
  Map inverted index to DHT
  Key lookup for every word
  Intersect result lists at client
  Pro:
    Simple
    Short latency
  Con:
    Result lists may be extremely large!
    Result list sizes may vary extremely!
  Search for “QuaP2P” “Darmstadt” “Research” [diagram: QuaP2P, Darmstadt, Research]
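A toy model of the naïve approach, assuming a hypothetical ToyDHT class that simply hashes each word to one of a fixed number of in-memory nodes (no routing, churn, or replication). It counts how many result-list entries travel to the client; the list sizes are invented for illustration.

```python
import hashlib

class ToyDHT:
    """Hypothetical stand-in for a DHT: hashes each key to one of n nodes."""

    def __init__(self, num_nodes):
        self.nodes = [dict() for _ in range(num_nodes)]

    def node_for(self, key):
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, word, doc_id):
        self.nodes[self.node_for(word)].setdefault(word, set()).add(doc_id)

    def get(self, word):
        return set(self.nodes[self.node_for(word)].get(word, set()))

def naive_search(dht, words):
    """One key lookup per word; every full result list travels to the client."""
    lists = [dht.get(w) for w in words]
    transferred = sum(len(entries) for entries in lists)
    result = set.intersection(*lists) if lists else set()
    return result, transferred

dht = ToyDHT(num_nodes=8)
for doc_id in (3, 7):
    dht.put("quap2p", doc_id)
for doc_id in range(100):
    dht.put("darmstadt", doc_id)
for doc_id in range(200):
    dht.put("research", doc_id)

result, transferred = naive_search(dht, ["quap2p", "darmstadt", "research"])
```

In this made-up example the client receives 302 entries to produce 2 results, which is the "Con" from the slide in miniature.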
7
Zipf Distributions in Natural Text
  Some words are extremely common
  Most words are extremely uncommon
  Largest word frequency is proportional to number of distinct words
  Avoid transferring result lists before intersection!
  [Figure: Rank, Word, Occurrences. Image from http://en.wikipedia.org/wiki/Zipf's_law]
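To get a feel for the skew, the sketch below generates idealized Zipf occurrence counts. It assumes the textbook form (occurrences proportional to 1/rank, exponent 1) and scales the counts so that the rarest word occurs once; real text only approximates this, and the vocabulary size is arbitrary.

```python
def zipf_frequencies(num_words, exponent=1.0):
    """Ideal Zipf occurrence counts, scaled so the rarest word occurs once."""
    scale = num_words ** exponent
    return [scale / rank ** exponent for rank in range(1, num_words + 1)]

freqs = zipf_frequencies(10_000)

# Share of all word occurrences covered by the 100 most common words.
top_share = sum(freqs[:100]) / sum(freqs)
```

Under these assumptions the most common word occurs as often as there are distinct words, and the top 1% of the vocabulary accounts for roughly half of all occurrences. The result lists for those words are the ones that are too large to ship around.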
8
Intersecting on the way
  Query least common word first
  Forward result list to next word
  Intersect on the way
  Pro:
    Reduces traffic
  Con:
    High latency
    Knowledge about word frequencies required
    Search for “the” and “who” (7.2 and 2.4 billion hits on Google, respectively)
  Search for “QuaP2P” “Darmstadt” “Research” [diagram: QuaP2P, Darmstadt, Research]
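A minimal sketch of intersecting on the way. The postings dictionary stands in for the lists held by the responsible nodes, the function name forwarded_search is made up, and list sizes are assumed to be known in advance (which is exactly the knowledge the slide lists as a drawback).

```python
def forwarded_search(postings, words):
    """Visit words from rarest to most common, forwarding the shrinking result."""
    if not words:
        return set(), 0
    # Assumption: result list sizes are known up front.
    order = sorted(words, key=lambda w: len(postings.get(w, set())))
    current = set(postings.get(order[0], set()))
    transferred = 0
    for word in order[1:]:
        transferred += len(current)            # list forwarded to the next node
        current &= postings.get(word, set())   # intersected at that node
    transferred += len(current)                # final result returned to client
    return current, transferred

postings = {
    "quap2p": {3, 7},
    "darmstadt": set(range(100)),
    "research": set(range(200)),
}
result, transferred = forwarded_search(postings, ["research", "quap2p", "darmstadt"])
```

With these invented lists only 6 entries are transferred, but the three nodes are visited one after another, which is where the higher latency comes from.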
9
Using Bloom Filters
  Bloom Filters reduce result list size
  Forward Bloom Filters and return result list recursively
  Pro:
    Reduces traffic even more (up to factor 50x)
  Con:
    Even higher latency
    Getting complicated
  Search for “QuaP2P” “Darmstadt” “Research” [diagram: QuaP2P, Darmstadt, Research]
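The two-word case can be sketched as follows. This is a simplified illustration, not the exact protocol from the cited papers: the BloomFilter class, its parameters (1024 bits, 4 hashes), and the SHA-256 based hashing are all assumptions chosen for readability.

```python
import hashlib

class BloomFilter:
    """Small Bloom filter; a Python int serves as the bit array."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def bloom_intersect(list_a, list_b, num_bits=1024, num_hashes=4):
    """Intersect two result lists while shipping only a filter and candidates."""
    # Node A: summarize its list as a Bloom filter and send it to node B.
    bloom = BloomFilter(num_bits, num_hashes)
    for doc_id in list_a:
        bloom.add(doc_id)
    # Node B: keep only entries that may be in A's list (false positives possible).
    candidates = {doc_id for doc_id in list_b if doc_id in bloom}
    # Node A: remove false positives by checking against its real list.
    return candidates & set(list_a), len(candidates)

list_a = set(range(0, 50))
list_b = set(range(40, 2000))
result, num_candidates = bloom_intersect(list_a, list_b)
```

A Bloom filter never produces false negatives, so the final result is exact; the filter size only affects how many false-positive candidates node B sends back.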
10
Zipf Distributions in Query Terms, too
  Query popularity obeys Zipf’s Law (déjà vu!)
  This puts high load on nodes with the most popular keys
  Even worse, this load scales linearly with the network size and user activity
  The responsible nodes are randomly assigned (could be a modem user)
  Hotspots will occur
  [Figure: image from http://www.useit.com/alertbox/traffic_logs.html]
11
Caching and Precomputation
  Caching
    Keep lists received for intersection
    Keep answers to popular queries
    Traffic reduction: 38%
    But: How to ensure coherence?
  Precomputation
    Inverted index for pairs or tuples of words
    Only feasible for the most popular words (but most effective there anyway)
    Traffic reduction: 50%
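Precomputation for word pairs might look like the sketch below. The function names and the tiny postings dictionary are hypothetical; a real system would also have to pick the popular words and keep the pair index up to date.

```python
from itertools import combinations

def build_pair_index(postings, popular_words):
    """Precompute the intersection for every pair of the given popular words."""
    pair_index = {}
    for a, b in combinations(sorted(popular_words), 2):
        pair_index[(a, b)] = postings.get(a, set()) & postings.get(b, set())
    return pair_index

def lookup_pair(pair_index, postings, w1, w2):
    """Answer a two-word query; the flag tells whether the pair index was hit."""
    key = tuple(sorted((w1, w2)))
    if key in pair_index:
        return pair_index[key], True
    return postings.get(w1, set()) & postings.get(w2, set()), False

postings = {"the": {1, 2, 3, 4}, "who": {2, 4, 6}, "quap2p": {4}}
pair_index = build_pair_index(postings, ["the", "who"])

hit = lookup_pair(pair_index, postings, "who", "the")      # served from pair index
miss = lookup_pair(pair_index, postings, "the", "quap2p")  # falls back to intersection
```

The number of pairs grows quadratically with the number of indexed words, which is why the slide restricts this to the most popular words.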
12
Further Optimizations
  Compression of result lists
    Adaptive Set Intersection
    Gap Compression
    Clustering of keys
  Incremental Results
    Do not return all results at once
    Should be used in conjunction with ranking algorithm
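Gap compression can be illustrated with a short sketch: store the differences between sorted document ids instead of the ids themselves, and encode each gap with a variable number of bytes. The variable-byte scheme and the function names are assumptions for this example; document ids are assumed to be non-negative integers.

```python
def encode_gaps(doc_ids):
    """Variable-byte encode the gaps between sorted document ids."""
    out = bytearray()
    previous = 0
    for doc_id in sorted(doc_ids):
        gap = doc_id - previous
        previous = doc_id
        while gap >= 128:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation flag set
            gap >>= 7
        out.append(gap)                      # last byte has the flag cleared
    return bytes(out)

def decode_gaps(data):
    """Invert encode_gaps and return the sorted document ids."""
    doc_ids, current, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            doc_ids.append(current)
            gap, shift = 0, 0
    return doc_ids

ids = [1000, 1003, 1010, 5000]
encoded = encode_gaps(ids)
```

Here four ids fit into 6 bytes. Dense lists (common words) have small gaps and compress best, which fits the Zipf-shaped list sizes discussed earlier.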
13
Comparison of different approaches
  Yang et al. compared
    DHT with Bloom Filters
    Supernode with exhaustive flooding
    Unstructured Random Walk w/o replication
  Network size 1000
  Random data set from WWW
  All approaches have strengths
  [Figure: image from Yang et al., Performance of Full Text Search in Structured and Unstructured Peer-to-Peer Systems]
14
Feasibility of P2P Web Search Engine
  Li et al. calculated the bandwidth usage of a P2P-based web search engine
    3 billion documents (10 KB each)
    60,000 peers
  Basic DHT was 100x worse than basic Gnutella
  DHT optimizations (e.g. Bloom Filters) made it competitive
  No index creation or maintenance cost included (60 TB)
  No replica maintenance cost included
15
Conclusion
  Distributed Inverted Indexes are challenging
  Implementation requires a lot of tricks
  Performance is not outstanding
  No comparison to state-of-the-art unstructured systems available
  Maybe even more tricks from information retrieval research will help
  Modeling the correct workload is really important for system design
16
Outlook
  Examine robustness of full-text search under Zipf query workloads
  Implement DHT full-text search in simulator
  Compare state-of-the-art unstructured and structured full-text search overlays
  Improve consistency and coherence in DHT full-text search systems
  Implement wiki and source code management with full-text search for Scenario B
  Phrase search is even more challenging…
17
Recommended Reading
  Performance Comparison
    Li et al. On the Feasibility of Peer-to-Peer Web Indexing and Search. IPTPS 2003.
    Yang et al. Performance of Full Text Search in Structured and Unstructured Peer-to-Peer Systems. INFOCOM 2006.
  DHT Full-Text Search
    P. Reynolds and A. Vahdat. Efficient Peer-to-Peer Keyword Searching. IMC 2003.
    O. Gnawali. A Keyword Set Search System for Peer-to-Peer Networks. Msc. Thesis, MIT, 2002.
  Workload Modeling
    Breslau et al. Web Caching and Zipf-like Distributions: Evidence and Implications. INFOCOM 1999.
    Gummadi et al. Measurement, Modeling and Analysis of a Peer-to-Peer File-Sharing Workload. SOSP 2003.