Web Text Retrieval with a P2P Query-Driven Index
Gleb Skobeltsyn
EPFL, Lausanne, Switzerland
July 26, 2007
Joint work with: Toan Luu, Ivana Podnar Žarko, Martin Rajman, Karl Aberer
Alvis

Goal
Our goal is to achieve scalable full-text retrieval with structured P2P networks.

Distributed P2P IR architecture
Each peer provides a local document collection for search.
Each peer is responsible for a fraction of the global index.

P2P IR basics
P2P network (Distributed Hash Table) with N peers
Each peer maintains connections with log N neighbors
The posting list associated with a given indexing key is stored at the peer responsible for that key
This peer can be located in log N overlay hops
Storing a posting list: k = hash(indexing_key), then put(k, posting_list)
Retrieving a posting list: k = hash(indexing_key), then get(k) -> posting_list
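
The put/get interface above can be illustrated with a minimal single-process sketch. It is not the AlvisP2P implementation: the class name `ToyDHT`, the use of SHA-1, and the modulo placement of keys on peers are assumptions made for illustration, and the log N overlay routing is not modeled at all.

```python
import hashlib


class ToyDHT:
    """Single-process stand-in for a DHT: N peers, each holding part of the index."""

    def __init__(self, num_peers):
        # One dict per peer; a real DHT would place these on different machines.
        self.peers = [{} for _ in range(num_peers)]

    @staticmethod
    def hash_key(indexing_key):
        # k = hash(indexing_key); SHA-1 is an arbitrary choice for this sketch.
        return int(hashlib.sha1(indexing_key.encode("utf-8")).hexdigest(), 16)

    def responsible_peer(self, k):
        # Assumption: modulo placement instead of real overlay routing.
        return k % len(self.peers)

    def put(self, indexing_key, posting_list):
        k = self.hash_key(indexing_key)
        self.peers[self.responsible_peer(k)][k] = list(posting_list)

    def get(self, indexing_key):
        k = self.hash_key(indexing_key)
        return self.peers[self.responsible_peer(k)].get(k, [])


dht = ToyDHT(num_peers=8)
dht.put("epfl", ["d1", "d2"])
print(dht.get("epfl"))   # ['d1', 'd2']
print(dht.get("gleb"))   # []
```

Because the key is hashed the same way on put and get, any peer can find the responsible peer without a central directory.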

(Naïve) Single-term indexing approach
A single-term partitioning strategy leads to unscalable bandwidth consumption at retrieval (frequent intersections of large posting lists).
Query: “epfl & gleb”
– h(“epfl”) -> {d1, d2}
– h(“gleb”) -> {d2, d3}
– h(t’) -> {d4, d5}
– {d1, d2} ∩ {d2, d3} = {d2}
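
To see where the bandwidth goes, here is a rough sketch of a distributed intersection that counts how many document references travel between peers. The cost model (ship the running result to the next peer, starting from the shortest list) and the function name `intersect_with_cost` are assumptions for illustration, not something the slides specify.

```python
def intersect_with_cost(posting_lists):
    """Intersect posting lists held by different peers.

    Returns (result, transferred), where transferred counts the document
    references shipped between peers in this simplified model.
    """
    # Start from the shortest list, a common heuristic to limit traffic.
    ordered = sorted(posting_lists, key=len)
    current = set(ordered[0])
    transferred = 0
    for other in ordered[1:]:
        transferred += len(current)   # ship the running result to the next peer
        current &= set(other)
    return sorted(current), transferred


index = {"epfl": ["d1", "d2"], "gleb": ["d2", "d3"]}
result, cost = intersect_with_cost([index["epfl"], index["gleb"]])
print(result, cost)   # ['d2'] 2
```

With web-scale posting lists the transferred count grows with the list length, which is the scalability problem the slide points at.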

Single-term vs. multi-term P2P indexing

Multi-term indexing: framework
Each peer is responsible for a set of indexing keys
Each indexing key = {term_1, term_2, ..., term_k}, k > 0
Keys are assigned to peers by the underlying DHT using the standard hashing mechanism
Each key is associated with a truncated posting list (TPL) that stores at most DF_max top-ranked document references
The distributed index contains {key, TPL} pairs
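
A small sketch of the {key, TPL} pair may help. It assumes a key is modeled as a frozenset of terms and that a TPL keeps the highest-scored entries in a min-heap; the class and method names are hypothetical and the scores are made up.

```python
import heapq


class TruncatedPostingList:
    """Keeps at most df_max (score, doc_id) entries, the highest-scored ones."""

    def __init__(self, df_max):
        self.df_max = df_max
        self.seen = 0            # number of postings offered so far
        self._heap = []          # min-heap: the lowest-scored entry is on top

    def add(self, doc_id, score):
        self.seen += 1
        entry = (score, doc_id)
        if len(self._heap) < self.df_max:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            heapq.heapreplace(self._heap, entry)   # evict the current minimum

    def is_truncated(self):
        # True once more postings were offered than the list can hold.
        return self.seen > self.df_max

    def top(self):
        return [doc for _, doc in sorted(self._heap, reverse=True)]


index = {}                                  # key (frozenset of terms) -> TPL
key = frozenset({"babe", "ruth"})
tpl = index.setdefault(key, TruncatedPostingList(df_max=2))
for doc_id, score in [("d1", 0.3), ("d2", 0.9), ("d3", 0.5)]:
    tpl.add(doc_id, score)
print(tpl.top())            # ['d2', 'd3']
print(tpl.is_truncated())   # True
```

Bounding every list at DF_max bounds the retrieval traffic per key, at the price of needing more keys to keep quality.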

Single-term vs. multi-term P2P indexing
How to select keys to keep a satisfactory retrieval quality?
The vocabulary size could grow exponentially!

Indexing with HDK (Podnar et al., ICDE’07)
Document-driven key generation:
– Each time a new document is indexed, the posting list of some indexing key k can reach the maximum size DF_max
– This triggers the generation of new keys (k + additional frequent keys)
Use a number of filters to reduce the number of keys (proximity, redundancy and size filters)

Indexing with HDK
Pros:
– The ICDE’07 paper proves that the approach is scalable
– Elegant key generation mechanism
– Low bandwidth during retrieval (posting lists of limited size)
Cons:
– In practice the number of keys is still LARGE: 113 keys/doc (68M for 0.6M docs)
– High bandwidth consumption at indexing
Problem:
– Too many keys are superfluous (almost never used)
Let’s index what is queried!

Contents
Introduction
Single-term vs. multi-term indexing
HDK approach for indexing
Query-driven approach for indexing/retrieval
– Indexing structure
– Example
– Scalability
– Evaluation
Conclusion

QDI: Query-Driven Index
The Query-Driven Indexing strategy solves the “Too-Many-Keys” problem:
– Avoids maintenance of superfluous keys
– Generates on-the-fly only those keys that are requested by users
– Utilizes the query log to discover such keys (monitors query frequency)
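
Monitoring query frequency can be sketched as a counter over term combinations. The slides do not define the popularity measure precisely, so the sketch assumes that QF of a key is the number of logged queries containing all of its terms, with whitespace tokenization; `query_frequencies` and the toy log are illustrative only.

```python
from collections import Counter
from itertools import combinations


def query_frequencies(queries, s_max=3):
    """Count, for every term combination of size 2..s_max, the queries containing it."""
    qf = Counter()
    for query in queries:
        terms = sorted(set(query.lower().split()))
        for size in range(2, min(s_max, len(terms)) + 1):
            for combo in combinations(terms, size):
                qf[frozenset(combo)] += 1
    return qf


log = ["babe ruth", "babe ruth 1920", "ruth 1920", "trains"]
qf = query_frequencies(log)
print(qf[frozenset({"babe", "ruth"})])          # 2
print(qf[frozenset({"1920", "ruth"})])          # 2
print(qf[frozenset({"1920", "babe", "ruth"})])  # 1
```

Only combinations that actually occur in queries get a counter, so the candidate key set is bounded by the query log rather than by the document collection.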

QDI: Challenges
Indexing a new key requires a bandwidth-efficient mechanism to obtain the top-k posting list associated with the key
– Conventional intersection, like the threshold algorithm, but performed less often
An incomplete index causes degradation of the quality of query results
– Show that the degradation is low

QDI: Retrieval
Only the single-term index has been generated
Process abc:
1) Probe P_abc
2) Probe P_ab, P_bc and P_ac
3) Probe P_a, P_b and P_c
4) Obtain the top-DF_max results for a, b and c (ranked w.r.t. a, b and c respectively)
5) Contact the peers in the list and re-rank the obtained results w.r.t. abc
6) Output the top-10
Increment the QF for ab, bc and ac
Activate (index) ac
Figure: query lattice with nodes abc, ab, bc and ac; labels: peer, “?abc”, “nothing”, “+1”, DF_max, popular

QDI: Retrieval 2
Assume the frequency of b is below DF_max.
Note how the redundancy filter simplifies the lattice in such a case (grayed nodes do not have to be activated).
Figure: query lattice with nodes abc, ab, bc and ac, with DF_max marked

QDI: Retrieval 3
The single-term index has been generated and ac is indexed
Process abc:
1) Probe P_abc
2) Probe P_ab, P_bc and P_ac: obtain the result for ac
3) Probe P_b and obtain the result for b
4) Contact all peers in the list to re-rank the obtained results w.r.t. abc
5) Output the top-10
Increment the QF for ab, bc and ac
Figure: query lattice with nodes abc, ab, bc and ac; labels: peer, “?abc”, “nothing”, “+1”
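
The probing order on the retrieval slides can be sketched as a top-down walk over the query lattice. This is one reading of the slides, not the authors' code: it assumes that once a key is found, its sub-keys no longer need to be probed (which matches probing only P_b after ac is found). The function name `plan_probes` is hypothetical, and re-ranking and QF updates are not modeled.

```python
from itertools import combinations


def plan_probes(query_terms, indexed_keys, s_max=3):
    """Walk the query lattice top-down; return (probed, used) key lists."""
    terms = sorted(set(query_terms))
    probed, used = [], []
    for size in range(min(s_max, len(terms)), 0, -1):
        for combo in combinations(terms, size):
            key = frozenset(combo)
            # Skip keys whose terms are already covered by a larger key in use.
            if any(key < bigger for bigger in used):
                continue
            probed.append(key)
            if key in indexed_keys:
                used.append(key)
    return probed, used


def label(key):
    return "".join(sorted(key))


# Single-term index plus the activated key ac.
indexed = {frozenset(k) for k in ("a", "b", "c", "ac")}
probed, used = plan_probes("abc", indexed)
print([label(k) for k in probed])   # ['abc', 'ab', 'ac', 'bc', 'b']
print([label(k) for k in used])     # ['ac', 'b']
```

With only the single-term index in place, the same function probes all seven lattice nodes and uses a, b and c, as in the first retrieval example.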

QDI: Summary
Each single term found in the document collection has to be indexed.
– We call all single-term keys a basic single-term index.
– The posting lists are truncated at DF_max.
A multi-term key k is activated (indexed) iff:
– k is popular: QF(k) ≥ QF_min, where QF(k) is the popularity of the key k derived from the available query log and QF_min is a parameter of our model (popularity filter).
– k contains from 2 to s_max terms: 2 ≤ |k| ≤ s_max, where s_max is a parameter of our model (size filter).
– all immediate sub-keys of k (of size |k| - 1) are indexed and their associated posting lists are truncated (redundancy filter).
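
The three activation conditions translate almost directly into a predicate. In this sketch a key is a frozenset of terms, and "truncated" is approximated as "the posting list holds at least DF_max entries"; both choices, and the name `should_activate`, are assumptions for illustration.

```python
from itertools import combinations


def should_activate(key, qf, index, qf_min, s_max, df_max):
    """key: frozenset of terms; qf: key -> query frequency;
    index: key -> posting list (list of doc ids) for keys already indexed."""
    if not 2 <= len(key) <= s_max:                 # size filter
        return False
    if qf.get(key, 0) < qf_min:                    # popularity filter
        return False
    for sub in combinations(key, len(key) - 1):    # redundancy filter
        posting_list = index.get(frozenset(sub))
        # Approximation: a list with df_max entries counts as truncated.
        if posting_list is None or len(posting_list) < df_max:
            return False
    return True


index = {
    frozenset({"a"}): ["d1", "d2"],
    frozenset({"b"}): ["d5"],          # frequency of b is below DF_max
    frozenset({"c"}): ["d3", "d4"],
}
qf = {frozenset({"a", "c"}): 3, frozenset({"a", "b"}): 3}
print(should_activate(frozenset({"a", "c"}), qf, index, qf_min=2, s_max=3, df_max=2))  # True
print(should_activate(frozenset({"a", "b"}), qf, index, qf_min=2, s_max=3, df_max=2))  # False
```

The second call fails the redundancy filter: the posting list of b is complete, so intersecting with it is cheap and a dedicated key ab would add nothing.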

Scalability (see Skobeltsyn et al., Infoscale’07)
The retrieval traffic is still scalable (it depends on DF_max and the query size)
The indexing traffic now depends on the number of activated keys
– The number of keys does not depend on the document collection size but only on the size of the query log
– We can use the QF_min and DF_max parameters to adjust the tradeoff: indexing traffic vs. retrieval quality

Contents
Introduction
Single-term vs. multi-term indexing
HDK approach for indexing
Query-driven approach for indexing/retrieval
– Indexing structure
– Example
– Scalability
– Evaluation
Conclusion

AOL logs
17M queries from March, April and May 2006 (92 days)
650K anonymous user sessions
Extracted all unique queries from each user session (0.7Gb):
…
2006-05-31 23:50:30 wearthbow.com native.cheyenne origin.
2006-05-31 23:50:30 l6 screensaver
2006-05-31 23:50:30 horses for sale in tn ky
2006-05-31 23:50:30 bank of america.com
2006-05-31 23:50:30 ask
2006-05-31 23:50:29 del rosa lanes
2006-05-31 23:50:28 www.spirit airlines.com
2006-05-31 23:50:28 find holy women of the bible
2006-05-31 23:50:27 trains
2006-05-31 23:50:27 todaysmiricles
2006-05-31 23:50:27 constition
2006-05-31 23:50:26 german grocceries in las vegas nv
2006-05-31 23:50:25 porn
2006-05-31 23:50:25 northwest indiana
2006-05-31 23:50:24 united.eprize.net
2006-05-31 23:50:24 jessica laguna
…

Overlap experiment
Use the query log to build the index (days 1..91)
Randomly choose 2K test queries from day 92
Answer each test query with Google and compare the result to the union of the top-DF_max Google results for each of its combinations that are indexed according to the logs
Mimics our P2P IR system if Google’s ranking is used
Example: original query vs. its non-superfluous (indexed) combinations, overlap@5 = 3/5 = 60%
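
The overlap measure can be sketched as follows, assuming that overlap@k is the fraction of the reference top-k (the answer to the full query) that also appears in the union of the results obtained for the indexed combinations. The document ids are invented and `overlap_at_k` is a hypothetical helper, not the simulation code.

```python
def overlap_at_k(reference, candidates, k):
    """Fraction of the reference top-k that also appears in the candidate set."""
    top = reference[:k]
    if not top:
        return 0.0
    pool = set(candidates)
    return sum(1 for doc in top if doc in pool) / len(top)


reference = ["d1", "d2", "d3", "d4", "d5"]               # answer to the full query
per_key_results = [["d1", "d9"], ["d3", "d4", "d8"]]     # one list per indexed combination
union = [doc for results in per_key_results for doc in results]
print(overlap_at_k(reference, union, 5))   # 0.6
```

Three of the five reference documents are found in the union, giving the 3/5 = 60% of the slide's example.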

Overlap example
Cut-n-paste from the simulation log:
>id=481, q=“what did babe ruth do in the 1920”
“1920 babe ruth”, qf=0 ----> Ov@100= 100%
“1920 babe”, qf=0 ---------> Ov@100= 9%
+++“1920 ruth”, qf=1 ---------> Ov@100= 33%
+++“babe ruth”, qf=495 -------> Ov@100= 69%
---“1920”, qf=716 ------------> Ov@100= 1%
---“babe”, qf=3196 -----------> Ov@100= 2%
---“ruth”, qf=1653 -----------> Ov@100= 7%
Size: 192, Keys used: 2, Overlap@100: 94%

Overlap experiment: impact of QF_min, DF_max
Figure: Top-20 overlap, impact of DF_max with QF_min = 1
Figure: Top-20 overlap, impact of QF_min with DF_max = 600

Overlap experiment: impact of the log size
Figure: Top-20 overlap, impact of the log size (QF_min = 1, DF_max = 600)

TREC Experiment
WT10G collection (~1.69M docs)
100 TREC queries (from TREC Web Track 9 & 10)
Query statistics generated from 17M AOL queries
Okapi BM25 weighting scheme used to compute the ranking score
QF_min = 1, 3, 5, ∞
DF_max = 100, 500
s_max = 3
TREC: Precision at Top Ranked Pages (table)
Header cells: DF_max = 100, DF_max = 500, ST-BM25; QF_min = ∞, 5, 3, 1 (once per DF_max). A “…” marks a gap where one or more cells are missing.
P@1    0.408 0.449 … 0.429 0.439
P@2    0.388 0.439 0.434 … 0.418 0.429
P@3    0.347 0.412 … 0.408 0.391 0.395
P@4    0.324 0.370 0.372 0.370 0.367 0.362 … 0.360
P@5    0.306 0.345 0.347 0.341 0.345 0.343 … 0.337
P@10   0.266 0.299 0.295 0.294 0.307 0.302 0.303 0.302 0.298
P@15   0.237 0.267 … 0.276 0.279 0.280 0.278
P@20   0.212 0.243 … 0.246 0.254 0.259 … 0.257
P@30   0.174 0.206 0.209 0.212 0.214 0.221 … 0.224 0.226
P@50   0.139 0.169 0.171 0.174 0.175 0.181 … 0.183 0.186
P@100  0.097 0.126 0.127 0.130 0.128 0.135 … 0.136 0.140
Precision is similar to centralized indexing
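
For reference, the P@k measure reported in the table can be computed as below. This is the standard definition (relevant documents among the top k, divided by k); the ranking and relevance judgments in the example are made up and do not come from the TREC runs.

```python
def precision_at_k(ranked_doc_ids, relevant_doc_ids, k):
    """Share of the top-k ranked documents that are relevant (denominator is k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_doc_ids)
    hits = sum(1 for doc in ranked_doc_ids[:k] if doc in relevant)
    return hits / k


ranking = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4", "d5"}
print(precision_at_k(ranking, relevant, 5))   # 0.4
print(precision_at_k(ranking, relevant, 2))   # 0.5
```

Averaging this value over all test queries gives one cell of a precision table.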

Conclusions
We presented the query-driven indexing strategy for scalable web text retrieval with structured P2P networks:
– Keeps only popular and non-redundant multi-term keys in the index
– Associates keys with truncated posting lists
And we showed that:
– With real query logs our approach achieves good retrieval quality, comparable to a centralized engine
– The QF_min parameter allows adjusting the traffic/quality tradeoff

Thank you for your attention!
Questions?
AlvisP2P web site: http://globalcomputing.epfl.ch/alvis/
