INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID Lecture # 17 Processing a phrase query Proximity queries
ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources:
“Introduction to Information Retrieval” by Prabhakar Raghavan, Christopher D. Manning, and Hinrich Schütze
“Managing Gigabytes” by Ian H. Witten, Alistair Moffat, and Timothy C. Bell
“Modern Information Retrieval” by Ricardo Baeza-Yates
“Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, and Marco Brambilla
Outline Processing a phrase query Proximity queries Combination schemes
Processing a phrase query
Extract the inverted index entries for each distinct term: to, be, or, not.
Merge their doc:position lists to enumerate all positions with “to be or not to be”.
to: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
The same general method works for proximity searches.
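The phrase lookup on this slide can be sketched in Python. This is a minimal illustration, not the lecture's own code: the function name `phrase_positions` and the dictionary layout (term, then document ID, then sorted positions) are assumptions, and for brevity it uses set lookups rather than a pointer-walking linear merge. Only the postings shown on the slide are used; the elided "..." parts are left out.

```python
def phrase_positions(index, phrase):
    """Return {doc_id: [start positions]} where the terms of `phrase`
    occur at consecutive positions.

    `index` maps term -> {doc_id: sorted list of positions} (assumed layout).
    """
    postings = [index[term] for term in phrase]

    # Only documents that contain every term can contain the phrase.
    common_docs = set(postings[0])
    for term_postings in postings[1:]:
        common_docs &= set(term_postings)

    hits = {}
    for doc in sorted(common_docs):
        # Positions of the 2nd, 3rd, ... phrase terms in this document.
        later = [set(term_postings[doc]) for term_postings in postings[1:]]
        # Start s matches if the (i+2)-th term sits at position s + i + 1.
        starts = [s for s in postings[0][doc]
                  if all(s + i + 1 in positions
                         for i, positions in enumerate(later))]
        if starts:
            hits[doc] = starts
    return hits


# Postings copied from the slide.
index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}

print(phrase_positions(index, ["to", "be"]))  # {4: [16, 190, 429, 433]}
```

With the slide's data, only document 4 contains both terms, and "to be" starts there at positions 16, 190, 429 and 433. The full phrase "to be or not to be" would work the same way once postings for "or" and "not" are available.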
Proximity queries
LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
Here, /k means “within k words of”.
Clearly, positional indexes can be used for such queries; biword indexes cannot.
Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k?
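One possible answer to the exercise, sketched in Python under stated assumptions: postings are stored as {doc_id: sorted positions}, and the function name `proximity_merge` is hypothetical. The document lists are walked with two pointers as in the ordinary linear merge; inside a shared document, a sliding window holds the positions of the second term that lie within k of the current position of the first term, so the same code works for any k.

```python
from collections import deque


def proximity_merge(postings_a, postings_b, k):
    """Return (doc, pos_a, pos_b) triples with |pos_a - pos_b| <= k.

    Each postings argument maps doc_id -> sorted list of positions
    (an assumed layout, chosen for illustration).
    """
    matches = []
    docs_a, docs_b = sorted(postings_a), sorted(postings_b)
    i = j = 0
    while i < len(docs_a) and j < len(docs_b):
        if docs_a[i] < docs_b[j]:
            i += 1
        elif docs_a[i] > docs_b[j]:
            j += 1
        else:
            doc = docs_a[i]
            positions_b = postings_b[doc]
            window = deque()  # positions of b near the current position of a
            nxt = 0           # next unread index into positions_b
            for pos_a in postings_a[doc]:
                # Pull in every position of b up to pos_a + k.
                while nxt < len(positions_b) and positions_b[nxt] <= pos_a + k:
                    window.append(positions_b[nxt])
                    nxt += 1
                # Drop positions that have fallen more than k behind pos_a.
                while window and window[0] < pos_a - k:
                    window.popleft()
                matches.extend((doc, pos_a, pos_b) for pos_b in window)
            i += 1
            j += 1
    return matches


# Postings for "to" and "be" as given in this lecture's phrase-query example.
to_postings = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
be_postings = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}

print(proximity_merge(to_postings, be_postings, 1))
print(proximity_merge(to_postings, be_postings, 5))
```

Because both position lists are sorted, each position of the second term enters and leaves the window at most once per document, so the extra work beyond the plain merge is proportional to the number of matches reported.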
Positional index size
Position values/offsets can be compressed, as we did with docs in the last lecture.
Nevertheless, a positional index expands postings storage substantially.
Positional index size
Need an entry for each occurrence, not just once per document.
Index size depends on average document size. (Why?)
The average web page has <1000 terms.
SEC filings, books, even some epic poems … easily 100,000 terms.
Consider a term with frequency 0.1%:
Document size    Postings    Positional postings
1000             1           1
100,000          1           100
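The arithmetic behind the 0.1% example can be checked with a few lines of Python. This is a rough sketch: the function name `expected_entries` is hypothetical, and it assumes the term occurs in the document at exactly its average rate.

```python
from fractions import Fraction


def expected_entries(doc_size, term_frequency):
    """Return (postings, positional_postings) for one term in one document.

    Assumes the term appears at exactly its average rate, so the number of
    occurrences is doc_size * term_frequency.
    """
    occurrences = int(doc_size * term_frequency)
    postings = 1 if occurrences >= 1 else 0  # one docID entry per document
    positional_postings = occurrences        # one entry per occurrence
    return postings, positional_postings


frequency = Fraction(1, 1000)  # 0.1%, kept exact to avoid float rounding
for size in (1000, 100_000):
    print(size, expected_entries(size, frequency))
```

A non-positional index stores the document once no matter how long it is, while the positional index grows with the number of occurrences, which is why average document size drives positional index size.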
Rules of thumb
A positional index is a factor of 2-4 larger than a non-positional index.
A positional index is 35-50% of the volume of the original text.
Caveat: all of this holds for “English-like” languages.
Combination schemes
A positional index expands postings storage substantially. (Why?)
The biword index and positional index approaches can be profitably combined.
For particular phrases (“Michael Jackson”, “Britney Spears”) it is inefficient to keep on merging positional postings lists.
Even more so for phrases like “The Who”.
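A combination scheme might look roughly like the following Python sketch. Everything here is illustrative: the function names, the index layouts, and the toy postings are invented for the example and do not come from the lecture. The idea shown is only that selected two-word phrases get a precomputed biword entry, and every other phrase falls back to the positional index.

```python
def positional_phrase_docs(positional_index, phrase):
    """Docs where the terms of `phrase` occur at consecutive positions."""
    postings = [positional_index.get(term, {}) for term in phrase]
    docs = set(postings[0])
    for term_postings in postings[1:]:
        docs &= set(term_postings)
    result = []
    for doc in sorted(docs):
        # Start s matches if term i of the phrase occurs at position s + i.
        if any(all(s + i in postings[i][doc] for i in range(1, len(phrase)))
               for s in postings[0][doc]):
            result.append(doc)
    return result


def phrase_docs(phrase, biword_index, positional_index):
    """Answer from the biword index when possible, else merge positions."""
    key = tuple(phrase)
    if key in biword_index:
        return sorted(biword_index[key])  # direct lookup, no merging
    return positional_phrase_docs(positional_index, phrase)


# Toy data (hypothetical): one precomputed biword plus a small positional index.
biword_index = {("michael", "jackson"): {1, 3}}
positional_index = {
    "michael": {1: [5], 3: [2, 40], 8: [7]},
    "jackson": {1: [6], 3: [41], 8: [20]},
    "fast": {3: [1], 8: [3]},
    "phrase": {3: [9], 8: [4]},
}

print(phrase_docs(["michael", "jackson"], biword_index, positional_index))
print(phrase_docs(["fast", "phrase"], biword_index, positional_index))
```

Which phrases deserve a biword entry is a design decision; the slide suggests frequently queried names and phrases made of very common words as candidates.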
Wild Card Queries
Examples:
Stan*: Standard, Stanford
S*T: Start
*ion: Option
Pa*an: Pakistan
Pa*t*an
etc…
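The meaning of these patterns can be illustrated with Python's standard `fnmatch` module, where `*` stands for any run of characters. This sketch only shows which terms each pattern matches; it scans a small vocabulary linearly and is not meant as an index-based wildcard technique. The helper name `wildcard_matches` and the five-word vocabulary are assumptions built from the slide's examples.

```python
from fnmatch import fnmatchcase


def wildcard_matches(pattern, vocabulary):
    """Return the vocabulary terms matching `pattern`, ignoring case."""
    pattern = pattern.lower()
    return sorted(term for term in vocabulary
                  if fnmatchcase(term.lower(), pattern))


# Vocabulary made up of the example words on the slide.
vocabulary = ["Standard", "Stanford", "Start", "Option", "Pakistan"]

for pattern in ["Stan*", "S*T", "*ion", "Pa*an", "Pa*t*an"]:
    print(pattern, "->", wildcard_matches(pattern, vocabulary))
```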
Resources for today’s lecture
MG 3.6, 4.3; MIR 7.2
Porter’s stemmer: http://www.sims.berkeley.edu/~hearst/irbook/porter.html
H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems. http://www.seg.rmit.edu.au/research/research.php?author=4