INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

1 INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Lecture # 17: Processing a phrase query; Proximity queries

2 ACKNOWLEDGEMENTS
The material in this lecture has been taken from the following sources:
“Introduction to Information Retrieval” by Prabhakar Raghavan, Christopher D. Manning, and Hinrich Schütze
“Managing Gigabytes” by Ian H. Witten, Alistair Moffat, and Timothy C. Bell
“Modern Information Retrieval” by Ricardo Baeza-Yates et al.
“Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, and Marco Brambilla

3 Outline
Processing a phrase query
Proximity queries
Combination schemes

4 Processing a phrase query
Extract inverted index entries for each distinct term: to, be, or, not.
Merge their doc:position lists to enumerate all positions with “to be or not to be”.
to: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
Same general method for proximity searches.
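The merge described on this slide can be sketched in Python. This is a minimal illustration, not the lecture's own code: the function name `phrase_matches` and the dictionary layout are assumptions, only the postings actually shown on the slide are used (the elided "..." parts are left out), and set lookups stand in for a true two-pointer merge to keep the sketch short.

```python
def phrase_matches(index, terms):
    """Return {doc_id: [start positions]} where `terms` occur consecutively.

    `index` maps term -> {doc_id: sorted list of positions} (assumed layout).
    """
    # Step 1: keep only documents that contain every term of the phrase.
    common = set(index[terms[0]])
    for term in terms[1:]:
        common &= set(index[term])

    matches = {}
    for doc in sorted(common):
        # Step 2: the i-th term must appear at position start + i.
        pos_sets = [set(index[term][doc]) for term in terms]
        starts = [p for p in index[terms[0]][doc]
                  if all(p + i in pos_sets[i] for i in range(1, len(terms)))]
        if starts:
            matches[doc] = starts
    return matches


# Postings copied from the slide (only the portion that is shown).
index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}

print(phrase_matches(index, ["to", "be"]))  # {4: [16, 190, 429, 433]}
```

Only document 4 contains both terms, and "be" directly follows "to" at four of the five "to" positions there.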

5 Proximity queries
LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
Here, /k means “within k words of”.
Clearly, positional indexes can be used for such queries; biword indexes cannot.
Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k?
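One possible answer to the exercise, sketched in Python under stated assumptions: `within_k` and `proximity_query` are hypothetical names, /k is read as "within k words on either side", and the index layout (term -> {doc: sorted positions}) is assumed. The position lists of each document are walked once with a sliding window, so the sketch works for any k.

```python
from collections import deque


def within_k(pos1, pos2, k):
    """Return all pairs (p1, p2) with |p1 - p2| <= k from two sorted lists."""
    pairs = []
    window = deque()  # positions from pos2 that lie in [p1 - k, p1 + k]
    j = 0
    for p1 in pos1:
        # Admit positions from pos2 that are not too far to the right.
        while j < len(pos2) and pos2[j] <= p1 + k:
            window.append(pos2[j])
            j += 1
        # Discard positions that are now too far to the left.
        while window and window[0] < p1 - k:
            window.popleft()
        pairs.extend((p1, p2) for p2 in window)
    return pairs


def proximity_query(index, t1, t2, k):
    """Linear merge over sorted doc IDs, checking positions in shared docs."""
    docs1, docs2 = sorted(index[t1]), sorted(index[t2])
    result, i, j = {}, 0, 0
    while i < len(docs1) and j < len(docs2):
        if docs1[i] == docs2[j]:
            pairs = within_k(index[t1][docs1[i]], index[t2][docs2[j]], k)
            if pairs:
                result[docs1[i]] = pairs
            i += 1
            j += 1
        elif docs1[i] < docs2[j]:
            i += 1
        else:
            j += 1
    return result


# Illustrative data: the "to" and "be" postings shown in the phrase-query slide.
index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}

print(proximity_query(index, "to", "be", 3))
# {4: [(16, 17), (190, 191), (429, 430), (433, 430), (433, 434)]}
```

Because both position lists are sorted, a position dropped from the left of the window can never match a later `p1`, which is what keeps the pass linear apart from the output size.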

6 Positional index size
Can compress position values/offsets as we did with docs in the last lecture.
Nevertheless, this expands postings storage substantially.
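As a rough illustration of compressing position offsets, the sketch below stores gaps between successive positions instead of the positions themselves. The variable-byte code is only one possible choice and is an assumption here; the slide does not say which scheme the previous lecture used. All function names are hypothetical.

```python
def to_gaps(positions):
    """Replace each sorted position by its distance from the previous one."""
    gaps, prev = [], 0
    for p in positions:
        gaps.append(p - prev)
        prev = p
    return gaps


def from_gaps(gaps):
    """Invert to_gaps by taking running sums."""
    positions, total = [], 0
    for g in gaps:
        total += g
        positions.append(total)
    return positions


def vb_encode(n):
    """Variable-byte encode a non-negative integer (assumed scheme).

    Each byte carries 7 payload bits; the high bit marks the final byte.
    """
    out = [n % 128]
    n //= 128
    while n > 0:
        out.insert(0, n % 128)
        n //= 128
    out[-1] += 128  # set the terminating bit on the last byte
    return bytes(out)


positions = [1, 17, 74, 222, 551]        # "to" in document 2, from the lecture
gaps = to_gaps(positions)
print(gaps)                              # [1, 16, 57, 148, 329]
print(from_gaps(gaps) == positions)      # True
encoded = b"".join(vb_encode(g) for g in gaps)
```

Gaps tend to be small for frequent terms, which is what lets a variable-length code spend fewer bytes per entry than fixed-width positions would.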

7 Positional index size
Need an entry for each occurrence, not just once per document.
Index size depends on average document size. Why?
Average web page has <1000 terms.
SEC filings, books, even some epic poems … easily 100,000 terms.
Consider a term with frequency 0.1%:

Document size   Postings   Positional postings
1000            1          1
100,000         1          100
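The arithmetic behind this slide can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the function name and the "occurrences per 1000 terms" parameter are assumptions, with 1 per 1000 standing for the 0.1% frequency mentioned above.

```python
def postings_per_document(doc_size, occurrences_per_1000_terms=1):
    """Estimate the index entries one term contributes for one document.

    Returns (non_positional, positional) for a document that contains the term.
    """
    expected_occurrences = doc_size * occurrences_per_1000_terms // 1000
    non_positional = 1                        # one entry per document
    positional = max(1, expected_occurrences) # one entry per occurrence
    return non_positional, positional


for size in (1000, 100_000):
    print(size, postings_per_document(size))
# 1000 (1, 1)
# 100000 (1, 100)
```

A non-positional index pays one entry however long the document is, while a positional index pays once per occurrence, so its size grows with average document length.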

8 Rules of thumb
A positional index is a factor of 2-4 larger than a non-positional index.
Positional index size is 35-50% of the volume of the original text.
Caveat: all of this holds for “English-like” languages.

9 Combination schemes
A positional index expands postings storage substantially. (Why?)
Biword index and positional index approaches can be profitably combined.
For particular phrases (“Michael Jackson”, “Britney Spears”) it is inefficient to keep on merging positional postings lists.
Even more so for phrases like “The Who”.
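A combined scheme might look like the following Python sketch. It is an illustration under assumptions, not the scheme from the cited paper: the set of precomputed phrases (`hot_phrases`), the function names, and the sample documents are all made up. Selected two-word phrases get a direct biword entry; every other phrase falls back to merging positional postings.

```python
def build_indexes(docs, hot_phrases):
    """Build a positional index plus a biword index for selected phrases."""
    positional = {}  # term -> {doc_id: [positions]}
    biword = {}      # (term1, term2) -> set of doc_ids
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for pos, tok in enumerate(tokens, start=1):
            positional.setdefault(tok, {}).setdefault(doc_id, []).append(pos)
        for pair in zip(tokens, tokens[1:]):
            if pair in hot_phrases:
                biword.setdefault(pair, set()).add(doc_id)
    return positional, biword


def two_word_phrase(positional, biword, t1, t2):
    """Answer a two-word phrase query, preferring the biword index."""
    if (t1, t2) in biword:
        return sorted(biword[(t1, t2)])       # direct lookup, no merging
    docs1, docs2 = positional.get(t1, {}), positional.get(t2, {})
    hits = []
    for doc in sorted(set(docs1) & set(docs2)):
        second = set(docs2[doc])
        if any(p + 1 in second for p in docs1[doc]):
            hits.append(doc)
    return hits


# Hypothetical toy collection.
docs = {
    1: "the who played live",
    2: "who is the best",
    3: "michael jackson and the who",
}
hot = {("the", "who"), ("michael", "jackson")}
positional, biword = build_indexes(docs, hot)

print(two_word_phrase(positional, biword, "the", "who"))      # [1, 3]
print(two_word_phrase(positional, biword, "played", "live"))  # [1]
```

The benefit is largest for phrases such as “The Who”, where both words are very common and the positional lists to merge would be long.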

10 Wild Card Queries
Examples:
Stan* → Standard, Stanford
S*T → Start
*ion → Option
Pa*an → Pakistan
Pa*t*an, etc.

11 Resources for today’s lecture
MG 3.6, 4.3; MIR 7.2
Porter’s stemmer: http//
H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems.
