1 Efficient P2P Searches Using Result-Caching From U. of Maryland. Presented by Lintao Liu 2/24/03

2 Motivation & Query Model Avoid duplicating work and data movement by caching previous query results. Observation: a_i && a_j && a_k = (a_i && a_j) && a_k, so we can keep the result for (a_i && a_j) as a materialized view. Query model: (a_i && a_j && a_k) || (b_i && b_j && b_k) || …
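
As a small illustration (mine, not the paper's code) of why this observation matters: a cached view for (a && b) can answer part of a later query (a && b && c). The index contents below are made up.

    # Hypothetical inverted index: attribute -> set of matching docIDs.
    index = {
        "a": {1, 2, 3, 5},
        "b": {2, 3, 5, 8},
        "c": {3, 5, 9},
    }
    view_cache = {}   # canonical attribute tuple -> materialized result

    def query(attrs):
        key = tuple(sorted(attrs))        # canonical order (see slide 5)
        if key in view_cache:             # exact match on a cached view
            return view_cache[key]
        # Reuse the longest cached prefix, then intersect the rest.
        for cut in range(len(key) - 1, 1, -1):
            if key[:cut] in view_cache:
                result, rest = view_cache[key[:cut]], key[cut:]
                break
        else:
            result, rest = index[key[0]], key[1:]
        for a in rest:
            result = result & index[a]
        view_cache[key] = result          # materialize for future queries
        return result

    print(query(["a", "b"]))              # computes and caches {2, 3, 5}
    print(query(["c", "b", "a"]))         # reuses the ("a", "b") view: {3, 5}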

3 View and View Tree A view is the cached result of a previous query. Where to store the views: use the underlying P2P system's lookup mechanism. For example, in Chord, the view “a && b” is stored at the successor of Hash(“a&&b”). But this alone can't be used to answer queries from views efficiently. Why? For “a_1 && a_2 && … && a_k”, there are 2^k possible views, and you don't know which ones exist.
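
A sketch of that placement rule, assuming a Chord-like ring of 2^16 identifiers (the hash function and ring size are illustrative):

    import hashlib

    # A view is named by its canonically ordered attribute string and
    # stored at the node succeeding hash(name) on the ring.
    def view_key(attrs, ring_bits=16):
        name = "&&".join(sorted(attrs))           # "b && a" -> "a&&b"
        digest = hashlib.sha1(name.encode()).digest()
        return int.from_bytes(digest, "big") % (2 ** ring_bits)

    print(view_key(["b", "a"]) == view_key(["a", "b"]))   # True: same node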

4 View and View Tree (Cont.) Another possible approach: a centralized, consistent list of views. Easy to locate, but it has problems: frequent updates and large storage requirements. Proposed solution: the View Tree, implemented as a trie (a tree for storing strings in which there is one node for every common prefix, with the strings stored in extra leaf nodes). Scalable and stateless.

5 View Tree Single-attribute views form the level-1 nodes. A canonical order on the attributes is defined and used to uniquely identify equivalent conjunctive queries: “a && b” and “b && a” are both recorded as “a && b” (alphabetical order).
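
A minimal trie sketch (an illustration, not the paper's exact structure) showing how canonical ordering makes equivalent views share a path:

    class TrieNode:
        def __init__(self):
            self.children = {}     # attribute -> child TrieNode
            self.has_view = False  # a cached result ends at this node

    def insert_view(root, attrs):
        node = root
        for a in sorted(attrs):    # canonical (alphabetical) order
            node = node.children.setdefault(a, TrieNode())
        node.has_view = True

    root = TrieNode()
    insert_view(root, ["b", "a"])            # stored along a -> b
    insert_view(root, ["a", "b", "g"])       # shares the a -> b prefix
    print(root.children["a"].children["b"].has_view)   # True: "a && b" cached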

6 Answering Queries Finding the smallest set of views that evaluates a query is NP-hard. Instead, the following method is used: Exact match, if such a match exists. Forward progress: for each node accessed, at least one attribute must be located which does not occur in the views located so far.

7 Example of answering queries Query: “cbagekhilo” Step 1: match the prefix “cbag”. Step 2: “cbage” is not found, but “cbagh” exists and is useful for the query (forward progress: it adds “h”). …

8 Algorithm: Search(n, q) (shown as a figure in the original slides)
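
Since only the slide title survives here, below is a rough greedy sketch of the strategy from slide 6 (exact match first, otherwise forward progress). It is an illustration under assumed data structures, not the paper's actual Search(n, q), which walks the view tree itself:

    # Views are sets of attributes; each pick must cover a new attribute.
    def plan(query_attrs, available_views):
        q = frozenset(query_attrs)
        if q in available_views:                      # exact match
            return [q]
        chosen, covered = [], set()
        while covered != q:
            # Forward progress: a usable view lies within the query and
            # contributes at least one attribute not covered so far.
            useful = [v for v in available_views
                      if v <= q and not v <= covered]
            if not useful:
                # Fall back to single-attribute index lists.
                chosen += [frozenset([a]) for a in q - covered]
                break
            chosen.append(max(useful, key=lambda v: len(v - covered)))
            covered |= chosen[-1]
        return chosen

    views = {frozenset("cbag"), frozenset("cbagh"), frozenset("ekil")}
    print(plan("cbagekhilo", views))   # cbagh, ekil, then {"o"} from the index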

9 Creating a balanced View Tree For a query “a && b && c && … && x”, there exist many equivalent views, each corresponding to a position in the View Tree, and any of them can represent the result of the search query. Which one to choose, and how to keep the tree balanced? Deterministically pick a permutation P that is distributed uniformly at random among all possible permutations.
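
One way to make the pick deterministic yet uniform (an assumed mechanism, not necessarily the paper's) is to seed a pseudo-random shuffle with a hash of the attribute set, so every peer derives the same permutation without coordination:

    import hashlib
    import random

    def placement_order(attrs):
        attrs = sorted(attrs)
        seed = hashlib.sha1("&&".join(attrs).encode()).hexdigest()
        rng = random.Random(seed)      # same seed -> same shuffle everywhere
        perm = list(attrs)
        rng.shuffle(perm)
        return perm

    print(placement_order(["c", "a", "b"]))
    print(placement_order(["b", "c", "a"]))   # same permutation both times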

10 Maintaining the View Tree The owner of a new node needs to update all attribute indexes corresponding to the attributes of the new node. (Q: Across the whole network? Do all related views need to be updated? Isn't that too expensive?) Heartbeats are used to check the presence of child nodes and the parent node in the view tree. Insertion of a new view is less expensive: only some child pointers need to be reassigned.

11 Example of a new view join

12 Preliminary Results Data source and methodology: Documents: the TREC-Web data set, HTML pages with keyword meta-tags, 64K different pages for each experiment. Queries: generated using the statistical characteristics of search.com queries.

13 Preliminary results (graphs in the original slides): caching benefit and query-locality benefit.

14 On the Feasibility of P2P Web Indexing and Search From UC Berkeley & MIT

15 Motivation: Is P2P web search likely to work? Two keyword-search techniques: flooding (Gnutella), which is not discussed in this paper, and intersection of index lists. This paper presents a feasibility analysis based on resource constraints and workload.

16 Introduction Why be interested in P2P searching: It is a good stress test for P2P architectures. It is more resistant to censorship and manipulated ranking, and more robust against single-node failures. There are 550 billion documents on the web; Google indexes more than 2 billion of them. Gnutella and KaZaA use flooding search over roughly 500 million files, typically music files, searching file meta-data such as titles and artists. DHT-based keyword searching has shown good full-text search performance with about 100,000 files (Duke Univ.).

17 Fundamental Constraints of the real world Assume the following parameters: Web documents: 3 billion. Words per file: 1000. An inverted index would therefore have 3×10^9 × 1000 = 3×10^12 docID entries. docID: 20 bytes (a hash of the file content). Inverted index size: 3×10^12 × 20 = 6×10^13 bytes. Queries per second: 1000 (Google's rate).

18 Fundamental constraints (cont.) Storage constraints: at 1 GB per PC, at least 60,000 PCs are needed to hold the index. Communication constraints: assume web search may consume 10% of backbone capacity (chosen by comparison with DNS traffic). In 1999, the US Internet backbone carried about 100 Gbit/s, so search gets 10 Gbit/s; at 1000 queries/sec, about 10 Mbit can be spent on each query.
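
The budget arithmetic, worked through with the slides' stated values:

    index_bytes = 3e9 * 1000 * 20        # docs x words/doc x bytes/docID = 6e13
    pcs_needed = index_bytes / 1e9       # 1 GB per PC -> 60,000 PCs
    per_query_bits = (100e9 * 0.10) / 1000   # 10% of backbone / 1000 qps
    print(pcs_needed, per_query_bits)    # 60000.0 PCs, 1e7 bits (~10 Mbit)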

19 Basic Cost Analysis Assumptions: a DHT-based P2P system and two-term queries (each search has 2 keywords). On a trace from MIT (1.7 million Web pages, 81,000 queries), about 300,000 bytes are moved per query. Scaled to the Internet (3 billion pages), this would require about 530 MB per query. (Q: Is that true?)
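
The scaling step, using the numbers stated above:

    measured_bytes = 300_000             # per query on the 1.7M-page corpus
    scale = 3e9 / 1.7e6                  # ~1765x more pages on the web
    print(measured_bytes * scale)        # ~5.3e8 bytes, i.e. ~530 MB/query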

20 Optimizations All the optimizations were evaluated on the 81,000 queries from mit.edu. Caching and precomputation: caching received posting lists reduces communication cost by 38%; computing and storing the intersections of the 3% most popular term pairs in advance (queries follow a Zipf distribution) reduces communication cost by 50%.

21 Compression Bloom filters: two-round Bloom intersection (one node sends the Bloom filter of its posting list to another node, which returns the candidate result): compression ratio 13. 4-round Bloom intersection: compression ratio 40. Compressed Bloom filters: 30% improvement. Gap compression: store the differences between consecutive sorted docIDs; effective when the gaps are small, since small gaps need fewer bits per docID.
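
A minimal sketch of the two-round Bloom intersection (the 256-bit filter size and 3 hash functions are made-up parameters, not the paper's):

    import hashlib

    def bloom_add(bits, k, item):
        for i in range(k):
            h = int(hashlib.sha1(f"{i}:{item}".encode()).hexdigest(), 16)
            bits[h % len(bits)] = 1

    def bloom_contains(bits, k, item):
        return all(
            bits[int(hashlib.sha1(f"{i}:{item}".encode()).hexdigest(), 16)
                 % len(bits)]
            for i in range(k))

    a_postings = {10, 42, 77, 1024}       # node A's posting list
    b_postings = {42, 77, 99, 2048}       # node B's posting list

    bits = [0] * 256
    for doc in a_postings:
        bloom_add(bits, 3, doc)

    # Round 1: A sends the 256-bit filter instead of its full list.
    # Round 2: B replies only with its docIDs that match the filter.
    candidates = {d for d in b_postings if bloom_contains(bits, 3, d)}
    # A then removes any false positives against its real list.
    print(candidates & a_postings)        # {42, 77}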

22 Compression (cont.) Adaptive set intersection: exploit the structure in the posting lists to avoid transferring entire lists. Example: intersecting {1, 3, 4, 7} && {8, 10, 20, 30} requires only one element exchange, since 7 < 8 proves the lists cannot overlap. Clustering: similar documents are grouped together based on their term occurrences and assigned adjacent docIDs, which improves the compression ratio of adaptive set intersection with gap compression to 75.
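
The paper's adaptive intersection runs as a protocol between nodes; this local sketch (an illustration, not the paper's code) shows the same skipping idea with the slide's example:

    import bisect

    # Binary-search ahead instead of scanning linearly, so runs of one
    # list that fall entirely between elements of the other cost only a
    # few probes rather than element-by-element work.
    def adaptive_intersect(a, b):
        out, lo = [], 0
        for x in a:
            lo = bisect.bisect_left(b, x, lo)   # gallop past elements < x
            if lo == len(b):
                break
            if b[lo] == x:
                out.append(x)
                lo += 1
        return out

    print(adaptive_intersect([1, 3, 4, 7], [8, 10, 20, 30]))   # []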

23 Optimization Techniques and Improvements (summary table in the original slides) Even with every optimization applied, the per-query communication cost is still one order of magnitude higher than the budget.

24 Two Compromises and Conclusion Compromising result quality; compromising the P2P structure. Conclusion: naïve implementations of P2P Web search are not feasible. The most effective optimizations bring the problem to within an order of magnitude of feasibility. Two possible compromises are proposed; combined, they would bring us within the feasibility range for P2P Web search. (Q: Sure?)

