Full-Text Search in P2P Networks
Christof Leng, Databases and Distributed Systems Group, TU Darmstadt
www.tu-darmstadt.de, www.dfg.de, www.quap2p.de


1 Full-Text Search in P2P Networks
Christof Leng, Databases and Distributed Systems Group, TU Darmstadt
www.tu-darmstadt.de, www.dfg.de, www.quap2p.de

2 Content
DFG RESEARCH GROUP QUAP2P, TECHNISCHE UNIVERSITÄT DARMSTADT
- Short intro to full-text search
- Full-text search on DHTs
- Performance comparison
- Conclusion / Outlook

3 What is full-text search?
- Searching for documents that contain all of a list of specified words
  - Search for “QuaP2P” AND “Darmstadt” AND “Research”
- Very common operation
  - Google
  - Filesharing
  - Wikis
  - Source code
  - Document / knowledge management
  - …
- Can be extended to phrase search
  - Search for “TU Darmstadt” AND “Christof Leng”

4 Inverted Index
- Full-text search is normally solved with inverted indexes
- The query result is the intersection of the entries of all searched words
- Stemming can reduce the number of word entries
Example documents:
  doc1: “Similarity searches accelerate P2P downloads by 30-70 percent.”
  doc2: “New P2P system could provide speed increase.”
  doc3: “I fail to see how this will make downloads faster.”
Inverted index:
  Word         Documents
  30           doc1
  70           doc1
  accelerate   doc1
  by           doc1
  could        doc2
  downloads    doc1, doc3
  fail         doc3
  faster       doc3
  how          doc3
  i            doc3
  increase     doc2
  make         doc3
  new          doc2
  p2p          doc1, doc2
  percent      doc1
  provide      doc2
  searches     doc1
  see          doc3
  similarity   doc1
  speed        doc2
  system       doc2
  this         doc3
  to           doc3
  will         doc3
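The index on this slide can be reproduced with a few lines of Python. This is only a minimal sketch: the tokenizer (lowercasing, splitting on non-alphanumeric characters) and the function names `tokenize`, `build_index`, and `search` are assumptions made for illustration, and no stemming is applied.

```python
import re
from collections import defaultdict

# The three example documents from the slide.
DOCS = {
    "doc1": "Similarity searches accelerate P2P downloads by 30-70 percent.",
    "doc2": "New P2P system could provide speed increase.",
    "doc3": "I fail to see how this will make downloads faster.",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, *words):
    """Return the ids of the documents that contain all of the given words."""
    postings = [index.get(word.lower(), set()) for word in words]
    return set.intersection(*postings) if postings else set()

index = build_index(DOCS)
print(sorted(search(index, "p2p")))               # ['doc1', 'doc2']
print(sorted(search(index, "p2p", "downloads")))  # ['doc1']
```

A multi-word query is answered by intersecting the per-word entries, which is the operation the following slides try to distribute.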

5 Overlay Types and Full-Text Search
  Peer-to-peer overlay type   Full-text index
  Centralized                 Inverted index on central server
  Pure / Hierarchical         Inverted index on each (super-)node
  Structured / DHT            Distributed inverted index (the challenge)

6 Naïve Approach
- Map inverted index to DHT
- Key lookup for every word
- Intersect result lists at client
Pro:
- Simple
- Low latency
Con:
- Result lists may be extremely large!
- Result list sizes may vary extremely!
Example: search for “QuaP2P” AND “Darmstadt” AND “Research”
[Figure: lookups for the keys QuaP2P, Darmstadt, and Research]
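The naïve approach can be sketched as follows. `ToyDHT` is a hypothetical stand-in for a real DHT (it simply hashes each word to one of a fixed number of in-memory nodes), and the list sizes are made-up numbers chosen to show how much data reaches the client.

```python
import hashlib

class ToyDHT:
    """Toy stand-in for a DHT: each word lives on the node chosen by hashing it."""

    def __init__(self, num_nodes):
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, word):
        digest = hashlib.sha1(word.encode("utf-8")).digest()
        return self.nodes[int.from_bytes(digest, "big") % len(self.nodes)]

    def publish(self, word, doc_id):
        self._node_for(word).setdefault(word, set()).add(doc_id)

    def lookup(self, word):
        # Returns the complete result list for the word.
        return set(self._node_for(word).get(word, set()))

def naive_search(dht, words):
    """Fetch every result list in full, then intersect at the client."""
    lists = [dht.lookup(word) for word in words]
    transferred = sum(len(result_list) for result_list in lists)
    hits = set.intersection(*lists) if lists else set()
    return hits, transferred

# Hypothetical data: a rare, a medium, and a common word.
dht = ToyDHT(num_nodes=16)
for word, count in (("quap2p", 5), ("darmstadt", 100), ("research", 1000)):
    for doc_id in range(count):
        dht.publish(word, doc_id)

hits, transferred = naive_search(dht, ["quap2p", "darmstadt", "research"])
print(len(hits), "hits,", transferred, "list entries transferred")
# 5 hits, 1105 list entries transferred
```

In this made-up example the client receives 1105 entries to find 5 hits, which illustrates the "Con" items above.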

7 Zipf Distributions in Natural Text
- Some words are extremely common
- Most words are extremely uncommon
- The frequency of the most common word is proportional to the number of distinct words
- Avoid transferring result lists before intersection!
[Figure: words and their occurrences by rank; image from http://en.wikipedia.org/wiki/Zipf's_law]
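A small calculation shows where the proportionality on this slide can come from. It assumes the idealized form of Zipf's law, f(r) = C / r, and additionally assumes that the rarest word occurs exactly once; both are simplifications, and `zipf_occurrences` is a name chosen for this sketch.

```python
def zipf_occurrences(num_distinct_words):
    """Idealized Zipf law f(r) = C / r, with C chosen so the rarest word occurs once."""
    c = num_distinct_words  # f(V) = C / V = 1  implies  C = V
    return [c / rank for rank in range(1, num_distinct_words + 1)]

for vocab in (1_000, 10_000, 100_000):
    occ = zipf_occurrences(vocab)
    rare = sum(1 for f in occ if f < 2)  # words with fewer than two occurrences
    print(vocab, occ[0], rare / vocab)
```

Under these assumptions the most common word occurs as often as there are distinct words, while half of the vocabulary occurs fewer than twice, so result list sizes differ by orders of magnitude.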

8 Intersecting on the way
- Query least common word first
- Forward result list to next word
- Intersect on the way
Pro:
- Reduces traffic
Con:
- High latency
- Knowledge about word frequencies required
- Queries with only common words stay expensive, e.g. “the” and “who” (7.2 and 2.4 billion hits on Google, respectively)
Example: search for “QuaP2P” AND “Darmstadt” AND “Research”
[Figure: query forwarded along the keys QuaP2P, Darmstadt, and Research]
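The forwarding scheme can be sketched in a few lines. The sketch assumes that the list sizes are known in advance (the slide lists this as a drawback) and models the DHT nodes as entries of a plain dictionary; `chained_search` and the data are hypothetical.

```python
def chained_search(index, words):
    """Visit the words from rarest to most common, intersecting at each hop.

    Returns the hits and the number of list entries sent between nodes,
    including the final answer sent back to the client.
    """
    if not words:
        return set(), 0
    # Assumption: the sizes of the result lists are known up front.
    order = sorted(words, key=lambda word: len(index.get(word, ())))
    candidates = set(index.get(order[0], ()))
    transferred = 0
    for word in order[1:]:
        transferred += len(candidates)        # forward candidates to the next node
        candidates &= index.get(word, set())  # that node intersects locally
    transferred += len(candidates)            # final result goes to the client
    return candidates, transferred

# Hypothetical data: a rare, a medium, and a common word.
index = {
    "quap2p": set(range(5)),
    "darmstadt": set(range(100)),
    "research": set(range(1000)),
}
hits, transferred = chained_search(index, ["research", "quap2p", "darmstadt"])
print(len(hits), "hits,", transferred, "list entries transferred")
# 5 hits, 15 list entries transferred
```

Fetching the same three lists in full would move 1105 entries, but the hops here run one after another, which is the latency cost named above.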

9 Using Bloom Filters
- Bloom filters reduce result list size
- Forward Bloom filters and return result list recursively
Pro:
- Reduces traffic even more (by up to a factor of 50)
Con:
- Even higher latency
- Getting complicated
Example: search for “QuaP2P” AND “Darmstadt” AND “Research”
[Figure: Bloom filters forwarded along the keys QuaP2P, Darmstadt, and Research]
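For two words, the Bloom filter exchange might look like the sketch below. It is not the exact protocol of any of the cited systems: the filter size, the number of hash functions, the SHA-256 based hashing, and the names `BloomFilter` and `bloom_intersect` are all assumptions for illustration.

```python
import hashlib

class BloomFilter:
    """Compact set summary: no false negatives, some false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def bloom_intersect(list_a, list_b, num_bits=1024, num_hashes=4):
    """Intersect two result lists while shipping a filter instead of list_a."""
    # Node A: summarize its result list in a Bloom filter and send it to node B.
    bloom = BloomFilter(num_bits, num_hashes)
    for doc in list_a:
        bloom.add(doc)
    # Node B: return only the entries that might be in A's list.
    candidates = {doc for doc in list_b if doc in bloom}
    # Node A: drop the false positives by checking against its real list.
    return candidates & set(list_a), len(candidates)

list_a = set(range(0, 100))    # hypothetical list of the rarer word
list_b = set(range(50, 5000))  # hypothetical list of the more common word
hits, sent_back = bloom_intersect(list_a, list_b)
print(len(hits), "hits;", sent_back, "of", len(list_b), "entries sent back by node B")
```

Node A ships a fixed-size filter (1024 bits here) rather than its list, and node B replies with a candidate list that is usually far shorter than its full list. The extra round trip is the added latency mentioned under "Con".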

10 Zipf Distributions in Query Terms, too
- Query popularity obeys Zipf's law (déjà vu!)
- This puts high load on the nodes with the most popular keys
- Even worse, this load scales linearly with the network size and user activity
- The responsible nodes are randomly assigned (could be a modem user)
- Hotspots will occur
[Figure: image from http://www.useit.com/alertbox/traffic_logs.html]
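The linear growth of the hotspot load can be made concrete with a short calculation. It assumes that key popularity follows the idealized Zipf form f(r) proportional to 1/r over a fixed set of keys; the function names and all numbers are hypothetical.

```python
def top_key_share(num_keys):
    """Share of all lookups that hit the most popular key if popularity ~ 1/rank."""
    harmonic = sum(1.0 / rank for rank in range(1, num_keys + 1))
    return 1.0 / harmonic

def hotspot_load(num_keys, num_peers, lookups_per_peer_per_second):
    """Lookups per second arriving at the single node that holds the top key."""
    total = num_peers * lookups_per_peer_per_second
    return top_key_share(num_keys) * total

for peers in (1_000, 10_000, 100_000):
    print(peers, round(hotspot_load(100_000, peers, 0.1), 1))
```

Because the share of the top key does not depend on the number of peers, ten times as many peers (or ten times as active users) put ten times the load on the same randomly chosen node.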

11 Caching and Precomputation
- Caching
  - Keep lists received for intersection
  - Keep answers to popular queries
  - Traffic reduction: 38%
  - But: how to ensure coherence?
- Precomputation
  - Inverted index for pairs or tuples of words
  - Only feasible for the most popular words (but most effective there anyway)
  - Traffic reduction: 50%
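Precomputation for word pairs can be sketched as an additional index keyed by unordered pairs. The helper names `build_pair_index` and `search_two_words`, the choice of which words count as popular, and the sample data are assumptions for illustration.

```python
from itertools import combinations

def build_pair_index(index, popular_words):
    """Precompute the intersection for every pair of the given popular words."""
    return {
        frozenset(pair): index[pair[0]] & index[pair[1]]
        for pair in combinations(sorted(popular_words), 2)
    }

def search_two_words(index, pair_index, w1, w2):
    """Answer from the pair index if possible, otherwise intersect on demand."""
    key = frozenset((w1, w2))
    if key in pair_index:
        return pair_index[key], "precomputed"
    return index.get(w1, set()) & index.get(w2, set()), "intersected"

# Hypothetical single-word index; "p2p" and "search" are treated as popular.
index = {"p2p": {1, 2, 3, 4}, "search": {2, 3, 5}, "darmstadt": {3}}
pair_index = build_pair_index(index, ["p2p", "search"])

print(search_two_words(index, pair_index, "search", "p2p"))
print(search_two_words(index, pair_index, "p2p", "darmstadt"))
```

The number of pairs grows quadratically with the number of words covered, which is one reason to restrict the pair index to the most popular words. Every precomputed entry also has to be updated when documents change, the same coherence question the slide raises for caching.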

12 Further Optimizations
- Compression of result lists
- Adaptive set intersection
- Gap compression
- Clustering of keys
- Incremental results
  - Do not return all results at once
  - Should be used in conjunction with a ranking algorithm
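Gap compression is only named on the slide. One common way to realize it is sketched here: sort the document ids, store the differences between neighbours, and write each difference with a variable-length byte code. The byte code and the assumption of non-negative integer ids are choices made for this sketch, not details taken from the talk.

```python
def gap_encode(doc_ids):
    """Sort the ids, take the gap to the previous id, and varint-encode each gap."""
    out = bytearray()
    previous = 0
    for doc_id in sorted(doc_ids):
        gap = doc_id - previous
        previous = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            gap >>= 7
        out.append(gap)                      # last byte, continuation bit clear
    return bytes(out)

def gap_decode(data):
    """Inverse of gap_encode: rebuild the sorted list of ids."""
    doc_ids, current, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            doc_ids.append(current)
            gap, shift = 0, 0
    return doc_ids

ids = [1_000_000, 1_000_003, 1_000_010, 1_000_500]
encoded = gap_encode(ids)
print(len(encoded), "bytes instead of", 4 * len(ids), "with fixed 32-bit ids")
# 7 bytes instead of 16 with fixed 32-bit ids
```

The denser a result list is, the smaller its gaps, so the long lists of common words compress best.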

13 Comparison of different approaches
- Yang et al. compared
  - DHT with Bloom filters
  - Supernode with exhaustive flooding
  - Unstructured random walk w/o replication
- Network size: 1000
- Random data set from the WWW
- All approaches have strengths
[Figure: image from Yang et al., Performance of Full Text Search in Structured and Unstructured Peer-to-Peer Systems]

14 Feasibility of P2P Web Search Engine
- Li et al. calculated the bandwidth usage of a P2P-based web search engine
  - 3 billion documents (10 KB each)
  - 60,000 peers
- Basic DHT was 100x worse than basic Gnutella
- DHT optimizations (e.g. Bloom filters) made it competitive
- No index creation or maintenance cost included (60 TB)
- No replica maintenance cost included

15 Conclusion
- Distributed inverted indexes are challenging
- Implementation requires a lot of tricks
- Performance is not outstanding
- No comparison to state-of-the-art unstructured systems available
- Maybe even more tricks from information retrieval research will help
- Modeling the correct workload is really important for system design

16 Outlook
- Examine robustness of full-text search under Zipf query workloads
- Implement DHT full-text search in simulator
- Compare state-of-the-art unstructured and structured full-text search overlays
- Improve consistency and coherence in DHT full-text search systems
- Implement wiki and source code management with full-text search for Scenario B
- Phrase search is even more challenging…

17 Recommended Reading
Performance Comparison
- Li et al. On the Feasibility of Peer-to-Peer Web Indexing and Search. IPTPS 2003.
- Yang et al. Performance of Full Text Search in Structured and Unstructured Peer-to-Peer Systems. INFOCOM 2006.
DHT Full-Text Search
- P. Reynolds and A. Vahdat. Efficient Peer-to-Peer Keyword Searching. Middleware 2003.
- O. Gnawali. A Keyword Set Search System for Peer-to-Peer Networks. MSc thesis, MIT, 2002.
Workload Modeling
- Breslau et al. Web Caching and Zipf-like Distributions: Evidence and Implications. INFOCOM 1999.
- Gummadi et al. Measurement, Modeling and Analysis of a Peer-to-Peer File-Sharing Workload. SOSP 2003.

