G.Skobeltsyn | Query-Driven Indexing for Scalable P2P Text Retrieval, Infoscale’07, June 6-8, 2007.

1 Query-Driven Indexing for Scalable P2P Text Retrieval
Infoscale’07, June 6-8, 2007, Suzhou, China
Gleb Skobeltsyn, EPFL, Switzerland
June 6, 2007
Joint work with: Toan Luu, Ivana Podnar Žarko, Martin Rajman, Karl Aberer
Alvis

2 Goal
Our goal is to achieve scalable full-text retrieval with structured P2P networks (DHTs).
Each peer:
– Provides resources (bandwidth, storage)
– Searches the whole network
– Publishes its own documents

3 Naïve (single-term) approach
... is to distribute the global inverted index in a DHT.
[Figure: DHT peers holding the posting lists h(“epfl”) - {d1, d2}, h(“gleb”) - {d2, d3} and h(t’) - {d4, d5}. The query “epfl & gleb” yields {d1, d2} ∩ {d2, d3} = {d2}.]
This slide was borrowed from the presentation by B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, I. Stoica: Enhancing P2P File-Sharing with an Internet-Scale Query Processor
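A minimal Python sketch of the naïve single-term scheme, for illustration only. The peer count, the SHA-1 based `responsible_peer` mapping and all function names are assumptions made for this sketch and do not come from the slides; a real DHT would route messages between machines instead of indexing into a local list.

```python
import hashlib

NUM_PEERS = 4  # assumed network size, chosen only for this sketch

# Each "peer" is modelled as a local dict: term -> set of document ids.
peers = [dict() for _ in range(NUM_PEERS)]


def responsible_peer(term):
    """Map a term to the peer that stores its posting list (stand-in for h(term))."""
    digest = hashlib.sha1(term.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PEERS


def publish(doc_id, terms):
    """Add doc_id to the posting list of every distinct term of the document."""
    for term in set(terms):
        peers[responsible_peer(term)].setdefault(term, set()).add(doc_id)


def lookup(term):
    """Fetch the full posting list of a term from its responsible peer."""
    return peers[responsible_peer(term)].get(term, set())


def and_query(terms):
    """Answer a conjunctive query by intersecting complete posting lists."""
    lists = sorted((lookup(t) for t in terms), key=len)
    if not lists:
        return set()
    result = set(lists[0])  # copy so the stored list is not modified
    for posting_list in lists[1:]:
        result &= posting_list
    return result


publish("d1", ["epfl"])
publish("d2", ["epfl", "gleb"])
publish("d3", ["gleb"])
answer = and_query(["epfl", "gleb"])
```

Here `answer` contains only `d2`, as in the figure. The cost this sketch hides is that at least one complete posting list has to travel between peers for every multi-term query, which is what the following slides try to avoid.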

4 Indexing with Highly Discriminative Keys
[1] Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys. I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer, in ICDE’07, Istanbul, Turkey

5 Indexing with HDKs: main properties
Distributed index contains {key, PL} pairs:
– Each key corresponds to a term or a set of terms.
– Each key is assigned to a posting list.
– Each posting list stores at most DF_max top-ranked document references.
Data-driven key generation:
– Each time a new document is indexed, the posting list for some key k can reach the maximum size DF_max.
– This triggers the generation of new keys (k + other frequent keys).
Proximity filter: a document qualifies for a key t1&t2 if t1 is close to t2 (specified by a window size w).
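The two local rules on this slide (truncation at DF_max and the proximity filter) can be sketched as follows. This is an illustration under stated assumptions: the slide does not define "close" precisely, so the sketch assumes that two terms are close when some pair of their occurrences fits inside a window of w consecutive tokens; the function names and the (score, doc_id) representation are also hypothetical.

```python
import heapq


def truncate(postings, df_max):
    """Keep the df_max best-scored entries of a posting list.

    postings is a list of (score, doc_id) pairs (assumed representation).
    """
    return heapq.nlargest(df_max, postings)


def qualifies(doc_tokens, t1, t2, w):
    """Proximity filter: True if t1 and t2 co-occur within w consecutive tokens.

    Assumption: "close" means the positions differ by less than w.
    """
    pos1 = [i for i, tok in enumerate(doc_tokens) if tok == t1]
    pos2 = [i for i, tok in enumerate(doc_tokens) if tok == t2]
    return any(abs(i - j) < w for i in pos1 for j in pos2)


tokens = "gleb works at epfl in lausanne".split()
near = qualifies(tokens, "gleb", "epfl", 4)  # positions 0 and 3 fit a window of 4
far = qualifies(tokens, "gleb", "epfl", 3)   # but not a window of 3
top2 = truncate([(0.2, "d1"), (0.9, "d2"), (0.5, "d3")], 2)
```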

6 HDK – exhaustive data-driven indexing
Pros:
– ICDE’07 paper proves that the number of keys grows linearly
– Elegant key generation mechanism
– Low bandwidth during query processing (PLs of limited size)
Cons:
– In practice the number of keys is LARGE: 68M for 0.6M docs
– High bandwidth consumption at indexing
Problem:
– Too many keys are superfluous (almost never used)

7 Query-Driven Indexing
Let's index only what is queried!

8 Contents
Introduction
HDK approach for indexing
Query-driven approach for indexing/retrieval
– Indexing structure
– Example
– ONM
– Scalability
– Evaluation
Conclusion

9 Query-Driven Index (QDI)
Query-Driven Indexing strategy solves the “Too-Many-Keys” problem:
– Avoids maintenance of superfluous keys
– Generates only those keys that are requested by users
– Utilizes the query log to discover such keys
Problems:
– Indexing of a new key requires a bandwidth-efficient mechanism to obtain the top-k posting list associated with the key
  => Opportunistic Notification Mechanism (smart broadcast)
– Incomplete index causes degradation of query result quality
  => Show that the degradation is low

10 Which keys to index?
Each single term found in the document collection has to be indexed.
– We call all single-term keys a basic single-term index.
– The posting lists are truncated at DF_max.
A key k is non-superfluous and can be activated iff:
– k is popular: QF(k) ≥ QF_min, where QF(k) is the popularity of the key k derived from the available query log and QF_min is a parameter of our model (popularity filter).
– k contains from 2 to s_max terms: 2 ≤ |k| ≤ s_max, where s_max is a parameter of our model (size filter).
– all immediate sub-keys of k (of size |k|-1) are indexed and their associated posting lists are truncated (redundancy filter).
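The three filters can be combined into a single predicate. The sketch below is one possible reading of the slide, not the authors' implementation: the name `can_activate`, the `qf` counter and the `truncated` map (indexed key -> whether its posting list was cut at DF_max) are hypothetical data structures introduced for illustration.

```python
from itertools import combinations


def can_activate(key, qf, truncated, qf_min, s_max):
    """Return True if a multi-term key passes all three filters.

    qf:        dict mapping frozenset key -> query frequency from the log
    truncated: dict mapping every indexed key -> True if its posting list
               was truncated at DF_max (assumed bookkeeping)
    """
    key = frozenset(key)
    # Size filter: 2 <= |k| <= s_max
    if not 2 <= len(key) <= s_max:
        return False
    # Popularity filter: QF(k) >= QF_min
    if qf.get(key, 0) < qf_min:
        return False
    # Redundancy filter: every sub-key of size |k|-1 is indexed and truncated
    for sub in combinations(key, len(key) - 1):
        if not truncated.get(frozenset(sub), False):
            return False
    return True


# Single-term index where the posting list of b is shorter than DF_max.
truncated = {frozenset("a"): True, frozenset("b"): False, frozenset("c"): True}
qf = {frozenset("ac"): 5, frozenset("ab"): 5, frozenset("abc"): 9}

ac_ok = can_activate("ac", qf, truncated, qf_min=3, s_max=3)
ab_ok = can_activate("ab", qf, truncated, qf_min=3, s_max=3)
abc_ok = can_activate("abc", qf, truncated, qf_min=3, s_max=3)
```

With these inputs only `ac` can be activated: `ab` fails because the list of `b` is complete, and `abc` fails because `ab` and `bc` are not indexed, however popular the query is.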

11 QDI: Retrieval
[Figure: lattice of keys for the query abc (abc; ab, bc, ac; a, b, c). A peer sends the probe ?abc; annotations on the figure: nothing, nothing, +1, DF_max, popular.]
Single-term index is generated.
Process abc:
1) Probe P_abc
2) Probe P_ab, P_bc and P_ac
3) Probe P_a, P_b and P_c
4) Obtain top-DF_max results for a, b and c (ranked w.r.t. a, b and c respectively)
5) Contact peers in the list, re-rank the obtained results w.r.t. abc
6) Output top-10
Increment the QF for ab, bc and ac.
Activate (index) ac.

12 QDI: Retrieval 2
[Figure: the same key lattice with the nodes abc, ab and bc grayed out.]
Assume the frequency of b is below DF_max.
Note how the redundancy filter would simplify the lattice in such a case (grayed nodes cannot be activated).

13 QDI: Retrieval 3
[Figure: the key lattice for abc with ac indexed. A peer sends the probe ?abc; annotations on the figure: nothing, nothing, +1.]
Single-term index is generated and ac is indexed.
Process abc:
1) Probe P_abc
2) Probe P_ab, P_bc and P_ac: obtain the result for ac
3) Probe P_b and obtain the result for b
4) Contact all peers in the list to re-rank the obtained results w.r.t. abc
5) Output top-10
Increment the QF for ab, bc and ac.
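The probing order in the retrieval examples can be read as a top-down walk over the key lattice that skips every key already covered by a larger key that was found. The sketch below follows that reading; it is an assumption-laden illustration, with a local dict standing in for the DHT and with hypothetical names (`probe_lattice`, `candidates`). Query-frequency updates and the final re-ranking step are left out.

```python
from itertools import combinations


def probe_lattice(query_terms, index):
    """Walk the key lattice from the full query down to single terms.

    index maps frozenset keys to (truncated) posting lists. A key is not
    probed if it is a subset of a key that has already been found.
    Returns a dict of the keys used and their posting lists.
    """
    terms = sorted(set(query_terms))
    found = {}
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            if any(key <= hit for hit in found):
                continue  # covered by a larger indexed key
            if key in index:
                found[key] = index[key]
    return found


def candidates(found):
    """Union of the retrieved posting lists, to be re-ranked w.r.t. the query."""
    docs = set()
    for posting_list in found.values():
        docs.update(posting_list)
    return docs


# Index state of "Retrieval 3": single terms plus the activated key ac.
index = {
    frozenset("a"): ["d1", "d2"],
    frozenset("b"): ["d2", "d3"],
    frozenset("c"): ["d4"],
    frozenset("ac"): ["d1", "d5"],
}
used = probe_lattice("abc", index)
docs = candidates(used)

# Index state of the first retrieval example: single terms only.
basic = {k: v for k, v in index.items() if len(k) == 1}
used_basic = probe_lattice("abc", basic)
```

For the "Retrieval 3" state the keys used are `ac` and `b`, matching steps 2 and 3; with only the basic single-term index, all of `a`, `b` and `c` are used.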

14 Opportunistic Notification Mechanism
ONM is used to activate a new multi-term key.
ONM is a “smart” broadcast with the following features:
– It is based on the shower multicast [2]: each peer within a specified range is contacted only once
– Notifications are small and low-priority => piggybacking
– Broadcast is split into several multicast sessions, each time pruning low-score documents
– It uses the high-performance DHT layer [3]
[2] A. Datta, M. Hauswirth, R. Schmidt, R. John, K. Aberer: Range Queries in Trie-Structured Overlays, in P2P’05
[3] F. Klemm, J.-Y. Le Boudec, D. Kostic, K. Aberer: Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing, in IPTPS’07

15 Scalability
The retrieval traffic is bounded by a constant due to truncated posting lists (depends on DF_max and the query size).
The indexing traffic depends on the number of keys to be activated:
– The number of keys in the HDK approach (UPPER BOUND) is proven to grow linearly with the number of peers, if each peer provides a limited number of documents
– The number of keys does not depend on the document collection size but only on the size of the query log
– We can use the QF_min parameter to adjust the tradeoff: indexing traffic vs. retrieval quality

16 Contents
Introduction
HDK approach for indexing
Query-driven approach for indexing/retrieval
– Indexing structure
– Example
– ONM
– Scalability
– Evaluation
Conclusion

17 Overlap experiment
Use the Wikipedia query log (9M queries, 9-10.2004) to build the index.
Randomly choose 3K test queries.
Answer each test query with Google and compare the answer to the union of the top-DF_max Google results for each of its combinations that are indexed according to the logs.
This mimics our P2P IR system if Google’s ranking is used.
Example: [Figure: results for the original query next to results for its non-superfluous (indexed) combinations; two of the five top results are marked X.] overlap@5 = 3/5 = 60%
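The overlap measure can be written down in a few lines. The sketch below assumes overlap@k is the fraction of the reference top-k results that also appear in the union of results retrieved for the indexed term combinations; the function name and the toy result identifiers are made up for illustration.

```python
def overlap_at_k(reference_ranking, candidate_docs, k):
    """Fraction of the reference top-k that is present in candidate_docs."""
    top_k = reference_ranking[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc in top_k if doc in candidate_docs)
    return hits / len(top_k)


# Reference answer of the search engine for the original query (toy data).
reference = ["r1", "r2", "r3", "r4", "r5"]

# Top-DF_max results per indexed combination of the query terms (toy data).
per_combination = {
    "t1 t2": ["r1", "r3", "x1"],
    "t3": ["r4", "x2"],
}
union = set()
for results in per_combination.values():
    union.update(results)

score = overlap_at_k(reference, union, 5)  # 3 of the 5 reference results found
```

With this toy data three of the five reference results are found, so `score` is 0.6, the same ratio as in the slide's example.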

18 Overlap example
Cut-n-paste from the simulation log:
>id=481, q=“what did babe ruth do in the 1920”
   “1920 babe ruth”, qf=0 ----> Ov@100= 100%
   “1920 babe”, qf=0 ---------> Ov@100= 9%
+++“1920 ruth”, qf=1 ---------> Ov@100= 33%
+++“babe ruth”, qf=495 -------> Ov@100= 69%
---“1920”, qf=716 ------------> Ov@100= 1%
---“babe”, qf=3196 -----------> Ov@100= 2%
---“ruth”, qf=1653 -----------> Ov@100= 7%
Size: 192, Keys used: 2, Overlap@100: 94%

19 Overlap with Google

20 Overlap with Yahoo

21 Overlap with Google (no/partial/full overlap)

22 P2P Index Simulations
Number of keys depends only on the query log size and QF_min! It does not depend on the collection size!
Number of keys is much smaller than for the HDK approach: 68M keys for 650K docs

23 Real query logs?
Wikipedia queries are unrealistic (too skewed), as users know what they want. Real web queries might perform worse?
Large-scale experiments with real web queries and the TREC collection are reported in [4].
[4] Web Text Retrieval with a P2P Query-Driven Index. G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer. To appear in SIGIR’07

24 Conclusions
We presented the query-driven indexing strategy for scalable web text retrieval with structured P2P networks:
– Stores posting lists in a DHT for terms and term combinations
– Stores at most DF_max top document references in a posting list
– Efficiently collects the query statistics in a distributed fashion
– Based on these statistics, activates (indexes) only popular keys
– Computes the result of a multi-term query based only on the index entries available at the moment: no costly intersections
We also showed that:
– With real query logs our approach achieves good retrieval quality
– The QF_min parameter adjusts the traffic/quality tradeoff

25 Last slide
Thank you for your attention! Questions?

