1
G.Skobeltsyn | Query-Driven Indexing for P2P Text Retrieval
Query-Driven Indexing for P2P Text Retrieval
The Future of Web Search, 19.07.2007, Bertinoro, Italy
Gleb Skobeltsyn, EPFL, Switzerland, June 19, 2007
Joint work with: Toan Luu, Ivana Podnar Žarko, Martin Rajman, Karl Aberer
Alvis
2
Goal
Our goal is to achieve scalable full-text retrieval with structured P2P networks (DHTs)
Each peer:
–Provides resources (bandwidth, storage)
–Searches the whole network
–Publishes its own documents
3
Naïve (single-term) approach
... is to distribute the global inverted index in a DHT using term partitioning:
Query: “epfl & gleb”
h(“epfl”) - {d1, d2}
h(“gleb”) - {d2, d3}
h(t’) - {d4, d5}
[Figure: DHT peers holding the posting lists; labels: K, I, {d1, d2}, {d2}]
This slide was borrowed from the presentation by B. T. Loo, J. M. Hellerstein, R. Huebsch, S. Shenker, I. Stoica: Enhancing P2P File-Sharing with an Internet-Scale Query Processor
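The term-partitioned index on this slide can be illustrated with a minimal sketch. It assumes a toy network of `NUM_PEERS` peers and uses SHA-1 modulo the peer count as a stand-in for the DHT hash h(term); the function names and the sample documents are hypothetical and not taken from the talk.

```python
import hashlib

NUM_PEERS = 8  # hypothetical network size


def peer_for(term: str) -> int:
    """Map a term to a peer ID (a stand-in for the DHT's h(term))."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PEERS


def build_index(docs):
    """Build a term-partitioned inverted index: peer -> term -> set of doc IDs."""
    peers = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            peers.setdefault(peer_for(term), {}).setdefault(term, set()).add(doc_id)
    return peers


def query_and(peers, terms):
    """Fetch each term's full posting list from its peer and intersect them."""
    postings = [peers.get(peer_for(t), {}).get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


# Sample documents chosen to reproduce the posting lists on the slide.
docs = {"d1": "epfl lausanne", "d2": "epfl gleb", "d3": "gleb skobeltsyn"}
index = build_index(docs)
result = query_and(index, ["epfl", "gleb"])
print(result)
```

In this scheme every query term costs one full posting-list transfer, which is the bandwidth problem the multi-term approaches in the following slides try to avoid.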
4
Single-term vs. multi-term P2P indexing
How to choose keys to keep a satisfactory retrieval quality?
The vocabulary size could grow exponentially!
5
Multi-term indexing: framework
Each peer is responsible for a set of keys assigned by the underlying DHT using the standard hashing mechanism
Each key corresponds to a term or a set of terms
Each key is associated with a truncated posting list (TPL) that stores at most DF max top-ranked document references
Distributed index contains {key,TPL} pairs
The indexing load is handled by an optimized DHT layer: F. Klemm, J.-Y. Le Boudec, D. Kostic, K. Aberer: Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing, in IPTPS'07
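A truncated posting list can be sketched as a bounded min-heap that keeps only the DF max best-scored references. This is an illustrative sketch only: the class name, the `truncated` flag and the tiny `DF_MAX` value are assumptions, and the scores are arbitrary numbers rather than real ranking scores.

```python
import heapq

DF_MAX = 3  # hypothetical; the evaluation slides use 100 and 500


class TruncatedPostingList:
    """Keeps at most df_max top-ranked (score, doc_id) entries."""

    def __init__(self, df_max=DF_MAX):
        self.df_max = df_max
        self._heap = []          # min-heap, so the weakest entry sits on top
        self.truncated = False   # True once some document has been dropped

    def add(self, doc_id, score):
        if len(self._heap) < self.df_max:
            heapq.heappush(self._heap, (score, doc_id))
        else:
            # The list is full: push the new entry and evict the lowest score.
            self.truncated = True
            heapq.heappushpop(self._heap, (score, doc_id))

    def top(self):
        """Document IDs ordered from best to worst score."""
        return [doc_id for _, doc_id in sorted(self._heap, reverse=True)]


tpl = TruncatedPostingList()
for doc_id, score in [("d1", 0.2), ("d2", 0.9), ("d3", 0.5), ("d4", 0.7)]:
    tpl.add(doc_id, score)
print(tpl.top(), tpl.truncated)
```

The `truncated` flag records that the key has more matching documents than the list stores, which is the property the later redundancy filter relies on.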
6
Single-term vs. multi-term P2P indexing
How to choose keys to keep a satisfactory retrieval quality?
The vocabulary size could grow exponentially!
7
Multi-term indexing techniques
Indexing with Highly Discriminative Keys (HDKs), based on:
–Scalable Peer-to-Peer Web Retrieval with Highly Discriminative Keys. I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer, in ICDE’07
–Beyond term indexing: A P2P framework for Web information retrieval. I. Podnar, M. Rajman, T. Luu, F. Klemm, K. Aberer, Informatica, vol. 30, no. 2, 2006.
Query-Driven Indexing (QDI), based on:
–Web Text Retrieval with a P2P Query-Driven Index. G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer, in SIGIR’07
–Query-Driven Indexing for Scalable Peer-to-Peer Text Retrieval. G. Skobeltsyn, T. Luu, I. Podnar Žarko, M. Rajman, K. Aberer, in Infoscale’07
8
Indexing with HDK
Data-driven key generation:
Each time a new document is indexed, the posting list for some key k can reach the maximum size DF max
−This triggers the generation of new keys (k + other frequent keys)
Use a number of filters to reduce the number of keys, e.g.:
−Proximity Filter: a document qualifies for a key t1&t2 if t1 is close to t2 (specified by a window size w).
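The proximity filter can be sketched as a check on token positions. The slide does not define "close" precisely, so this sketch assumes it means that some occurrence of t1 lies at most w token positions away from some occurrence of t2; the function name and the sample sentence are hypothetical.

```python
def qualifies(tokens, t1, t2, w):
    """True if some occurrence of t1 is within w positions of some occurrence of t2.

    Assumption: "close" is measured as the absolute difference of token positions.
    """
    pos1 = [i for i, tok in enumerate(tokens) if tok == t1]
    pos2 = [i for i, tok in enumerate(tokens) if tok == t2]
    return any(abs(i - j) <= w for i in pos1 for j in pos2)


doc = "query driven indexing for peer to peer text retrieval".split()
near = qualifies(doc, "query", "indexing", w=3)   # positions 0 and 2
far = qualifies(doc, "query", "retrieval", w=3)   # positions 0 and 8
print(near, far)
```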
9
Indexing with HDK
Pros:
–ICDE’07 paper proves that the number of keys grows linearly
–Elegant key generation mechanism
–Low bandwidth during query processing (posting lists of limited size)
Cons:
–In practice the number of keys is LARGE: 68M for 0.6M docs
–High bandwidth consumption at indexing
Problem:
–Too many keys are superfluous (almost never used)
10
Query-Driven Indexing
Let's index only what is queried!
11
Contents
Introduction
Single-term vs. multi-term indexing
HDK approach for indexing
Query-driven approach for indexing/retrieval
–Indexing structure
–Example
–Scalability
–Evaluation
Conclusion
12
Query-Driven Index (QDI)
Query-Driven Indexing strategy solves the “Too-Many-Keys” problem:
–Avoids maintenance of superfluous keys
–Generates only such keys that are requested by users
–Utilizes query-log to discover such keys
Problems:
–Indexing of a new key requires a bandwidth-efficient mechanism to obtain the top-k posting list associated with the key
  Smart Broadcast (ONM) or
  Conventional intersection like TA, but less frequent
–Incomplete index causes degradation of query results quality
  Show that the degradation is low
13
Which keys to index?
Each single term found in the document collection has to be indexed.
–We call all single-term keys a basic single-term index.
–The posting lists are truncated at DF max.
A key k is non-superfluous and can be activated iff:
–k is popular: QF(k) ≥ QF min, where QF(k) is the popularity of the key k derived from the available query log and QF min is a parameter of our model (popularity filter).
–k contains from 2 to s max terms: 2 ≤ |k| ≤ s max, where s max is a parameter of our model (size filter).
–all immediate sub-keys of k (of size |k|−1) are indexed and their associated posting lists are truncated (redundancy filter).
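The three filters can be combined into one predicate, sketched below. The representation is an assumption made for illustration: keys are frozensets of terms, `qf` maps a key to its query frequency, and `indexed` maps each indexed key to a flag telling whether its posting list is truncated. None of these names come from the talk.

```python
from itertools import combinations


def can_activate(key, qf, indexed, qf_min, s_max):
    """Apply the size, popularity and redundancy filters to a candidate key.

    key:     frozenset of terms
    qf:      dict key -> query frequency from the query log
    indexed: dict key -> True if the key is indexed AND its list is truncated
    """
    # Size filter: 2 <= |k| <= s_max
    if not (2 <= len(key) <= s_max):
        return False
    # Popularity filter: QF(k) >= QF_min
    if qf.get(key, 0) < qf_min:
        return False
    # Redundancy filter: every sub-key of size |k|-1 is indexed and truncated
    for sub in combinations(sorted(key), len(key) - 1):
        if not indexed.get(frozenset(sub), False):
            return False
    return True


k = lambda *terms: frozenset(terms)  # small helper for readable keys
# Term b is rare: its posting list is indexed but not truncated.
indexed = {k("a"): True, k("b"): False, k("c"): True}
qf = {k("a", "b"): 5, k("a", "c"): 5, k("a", "b", "c"): 5}

ac_ok = can_activate(k("a", "c"), qf, indexed, qf_min=3, s_max=3)
ab_ok = can_activate(k("a", "b"), qf, indexed, qf_min=3, s_max=3)
abc_ok = can_activate(k("a", "b", "c"), qf, indexed, qf_min=3, s_max=3)
print(ac_ok, ab_ok, abc_ok)
```

With a non-truncated list for b, any key containing b is redundant: the complete list for b already bounds the answer, so only ac can be activated in this example.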
14
QDI: Retrieval
Single-term index is generated
Process abc:
1) Probe P abc
2) Probe P ab, P bc and P ac
3) Probe P a, P b and P c
4) Obtain top-DF max results for a, b and c (ranked w.r.t. a, b and c respectively)
5) Contact peers in the list, re-rank the obtained results w.r.t. abc
6) Output top-10
Increment the QF for ab, bc and ac
Activate (index) ac
[Figure: key lattice over a, b, c (abc; ab, bc, ac); peers probed with ?abc answer "nothing"; labels: +1, DF max, popular]
15
QDI: Retrieval 2
Assume the frequency of b is below DF max
Note how the redundancy filter would simplify the lattice in such a case (grayed nodes cannot be activated)
[Figure: key lattice over a, b, c with grayed nodes; label: DF max]
16
QDI: Retrieval 3
Single-term index is generated and ac is indexed
Process abc:
1) Probe P abc
2) Probe P ab, P bc and P ac; obtain the result for ac
3) Probe P b and obtain the result for b
4) Contact all peers in the list to re-rank the obtained results w.r.t. abc
5) Output top-10
Increment the QF for ab, bc and ac
[Figure: key lattice over a, b, c (abc; ab, bc, ac); peers probed with ?abc answer "nothing"; label: +1]
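The probing order in the retrieval examples can be sketched as a top-down walk over the query's term lattice. The rule that a key is skipped once all of its terms are covered by a larger key already found is an assumption inferred from the third example (after ac is found, only b is probed); the function name and the dictionary-based index are hypothetical.

```python
from itertools import combinations


def probe_lattice(query_terms, index):
    """Return the indexed keys used to answer a query, probing largest keys first.

    index: dict mapping frozenset-of-terms -> posting list.
    Assumption: a key is not probed if a larger key found earlier contains it.
    """
    terms = sorted(set(query_terms))
    found = []
    for size in range(len(terms), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            if any(key <= bigger for bigger in found):
                continue  # already covered by a larger key
            if key in index:
                found.append(key)
    return found


k = lambda *terms: frozenset(terms)  # small helper for readable keys

# First example: only the single-term index exists.
basic = {k("a"): ["d1"], k("b"): ["d2"], k("c"): ["d3"]}
first = probe_lattice(["a", "b", "c"], basic)

# Third example: ac has been activated in the meantime.
with_ac = dict(basic)
with_ac[k("a", "c")] = ["d4"]
third = probe_lattice(["a", "b", "c"], with_ac)
print(first, third)
```

The posting lists retrieved for the returned keys would then be merged and re-ranked with respect to the full query, as in steps 4 and 5 of the slide; that part is not modeled here.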
17
Indexing on-demand
… used to activate a new multi-term key
ONM is a “smart” broadcast with the following features:
–It is based on the shower multicast [2]: each peer within a specified range is contacted only once
–Notifications are small and low-priority => piggybacking
–Broadcast is split into several multicast sessions, each time pruning low-score documents
–It uses the high-performance DHT layer [3]
[2] A. Datta, M. Hauswirth, R. Schmidt, R. John, K. Aberer: Range Queries in Tree-Structured Overlays, in P2P’05
[3] F. Klemm, J.-Y. Le Boudec, D. Kostic, K. Aberer: Improving the Throughput of Distributed Hash Tables Using Congestion-Aware Routing, in IPTPS'07
18
Scalability
The retrieval traffic is bounded by a constant due to truncated posting lists (depends on DF max and the query size)
The indexing traffic depends on the number of keys to be activated.
–The number of keys in the HDK approach (UPPER BOUND) is proven to grow linearly with the number of peers, if each peer provides a limited number of documents
–The number of keys in QDI does not depend on the document collection size but only on the size of the query log
–We can use the QF min parameter to adjust the tradeoff: indexing traffic vs. retrieval quality
19
Contents
Introduction
Single-term vs. multi-term indexing
HDK approach for indexing
Query-driven approach for indexing/retrieval
–Indexing structure
–Example
–Scalability
–Evaluation
Conclusion
20
AOL logs
17M queries from March, April, May 2006 (92 days)
650K anonymous user sessions
Extracted all unique queries from each user session:
…
2006-05-31 23:50:30 wearthbow.com native.cheyenne origin.
2006-05-31 23:50:30 l6 screensaver
2006-05-31 23:50:30 horses for sale in tn ky
2006-05-31 23:50:30 bank of america.com
2006-05-31 23:50:30 ask
2006-05-31 23:50:29 del rosa lanes
2006-05-31 23:50:28 www.spirit airlines.com
2006-05-31 23:50:28 find holy women of the bible
2006-05-31 23:50:27 trains
2006-05-31 23:50:27 todaysmiricles
2006-05-31 23:50:27 constition
2006-05-31 23:50:26 german grocceries in las vegas nv
2006-05-31 23:50:25 porn
2006-05-31 23:50:25 northwest indiana
2006-05-31 23:50:24 united.eprize.net
2006-05-31 23:50:24 jessica laguna
…
<- 0.7Gb
21
Distribution of combinations in the AOL logs
22
TREC Experiment
WT10G collection (~1.69 M docs)
100 TREC queries (from TREC Web Track 9 & 10)
Query statistics generated from 17M AOL queries
Using Okapi-BM25 weighting scheme to compute ranking score
QF min = 1, 3, 5, ∞
DF max = 100, 500
s max = 3
TREC: Precision at Top Ranked Pages (table)
Column groups: DF max = 100 (QF min = ∞, 5, 3, 1); DF max = 500 (QF min = ∞, 5, 3, 1); ST-BM25
P@1 0.408 0.449   0.429 0.439
P@2 0.388 0.439 0.434   0.418 0.429
P@3 0.347 0.412   0.408 0.391 0.395
P@4 0.324 0.370 0.372 0.370 0.367 0.362   0.360
P@5 0.306 0.345 0.347 0.341 0.345 0.343   0.337
P@10 0.266 0.299 0.295 0.294 0.307 0.302 0.303 0.302 0.298
P@15 0.237 0.267   0.276 0.279 0.280 0.278
P@20 0.212 0.243   0.246 0.254 0.259   0.257
P@30 0.174 0.206 0.209 0.212 0.214 0.221   0.224 0.226
P@50 0.139 0.169 0.171 0.174 0.175 0.181   0.183 0.186
P@100 0.097 0.126 0.127 0.130 0.128 0.135   0.136 0.140
Precision is similar to centralized indexing
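For readers unfamiliar with the P@k rows, precision at k is the fraction of the top k returned documents that are relevant. The sketch below uses the standard definition with made-up document IDs and relevance judgments; it does not reproduce any number from the table.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


# Hypothetical ranking and relevance judgments, for illustration only.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}
p_at_2 = precision_at_k(ranked, relevant, 2)  # one relevant document in the top 2
p_at_5 = precision_at_k(ranked, relevant, 5)  # two relevant documents in the top 5
print(p_at_2, p_at_5)
```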
23
Overlap experiment
Use the query-log to build the index (days 1..91)
Choose randomly 2K test queries from day 92
Answer each test query with Google and compare to the union of top-DF max Google results for each of its combinations that are indexed according to the logs.
Mimics our P2PIR system if Google’s ranking is used.
Example: overlap@5 = 3/5 = 60%
[Figure: top-5 results of the original query compared with the results of its non-superfluous (indexed) combinations; two results marked X]
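The overlap measure can be sketched as follows, assuming it is the share of the reference top-k results that also appear in the union of the truncated result lists of the indexed combinations. The function name, the `df_max` parameter and the result IDs are hypothetical; the example is built to reproduce the 3/5 case on the slide.

```python
def overlap_at_k(reference, candidate_lists, k, df_max):
    """Share of the reference top-k found in the union of truncated candidate lists.

    reference:       ranked results for the full query
    candidate_lists: ranked results for each indexed term combination
    df_max:          each candidate list is cut to its top df_max entries
    """
    top = reference[:k]
    if not top:
        return 0.0
    pool = set().union(*(lst[:df_max] for lst in candidate_lists))
    return sum(1 for doc in top if doc in pool) / len(top)


# Hypothetical result IDs: r2 and r4 are missing from the indexed combinations.
reference = ["r1", "r2", "r3", "r4", "r5"]
candidates = [["r1", "r3", "x1"], ["r5", "x2"]]
score = overlap_at_k(reference, candidates, k=5, df_max=100)
print(score)
```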
24
Overlap example
Cut-n-paste from the simulation log:
>id=481, q=“what did babe ruth do in the 1920”
“1920 babe ruth”, qf=0 ----> Ov@100= 100%
“1920 babe”, qf=0 ---------> Ov@100= 9%
+++“1920 ruth”, qf=1 ---------> Ov@100= 33%
+++“babe ruth”, qf=495 -------> Ov@100= 69%
---“1920”, qf=716 ------------> Ov@100= 1%
---“babe”, qf=3196 -----------> Ov@100= 2%
---“ruth”, qf=1653 -----------> Ov@100= 7%
Size: 192, Keys used: 2, Overlap@100: 94%
25
Google experiment: impact of s max, DF max
[Figure: impact of s max for all possible combinations (QF min = 0)]
[Figure: impact of DF max with QF min = 1, s max = 3]
26
Google experiment: impact of QF min
[Figure: impact of QF min (DF max = 600)]
[Figure: number of keys for different QF min]
The number of keys does not depend on the document collection size
HDK approach would require ~65M keys for 650K documents
>30% of badly performing queries are misspellings => real quality is higher
27
Google experiment: impact of the log size
[Figure: impact of the log size (QF min = 1, DF max = 600)]
28
Conclusions
We presented the query-driven indexing strategy for scalable web text retrieval with structured P2P networks:
–Stores posting lists in a DHT for terms and term combinations
–Stores at most DF max top document references in a posting list
–Efficiently collects the query statistics in a distributed fashion
–Based on these statistics, activates (indexes) only popular keys
–Computes the result of a multi-term query based only on the index entries available at the moment (no costly intersections)
We also showed that:
–With real query-logs our approach achieves good retrieval quality
–The QF min parameter adjusts the traffic/quality tradeoff
29
Last slide
Thank you for your attention! Questions?
AlvisP2P - to appear in July at http://globalcomputing.epfl.ch/alvis/