INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID


Lecture # 17
- Processing a phrase query
- Proximity queries

ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources:
- “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D. Manning, and Hinrich Schütze
- “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
- “Modern information retrieval” by Ricardo Baeza-Yates
- “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla

Outline
- Processing a phrase query
- Proximity queries
- Combination schemes

Processing a phrase query
- Extract inverted index entries for each distinct term: to, be, or, not.
- Merge their doc:position lists to enumerate all positions with “to be or not to be”.
    to: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
    be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
- Same general method for proximity searches.
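
The merge described on this slide can be sketched in Python. This is a minimal illustration, not the lecture's own code: the names `index` and `phrase_matches` are hypothetical, the postings hold only the entries printed on the slide (the "..." parts are left out), and set intersection stands in for the two-pointer linear merge to keep the example short.

```python
from typing import Dict, List

# term -> {doc_id: sorted positions}; values copied from the slide.
# The elided "..." entries are NOT filled in.
PositionalIndex = Dict[str, Dict[int, List[int]]]

index: PositionalIndex = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}


def phrase_matches(index: PositionalIndex, phrase: List[str]) -> Dict[int, List[int]]:
    """Return {doc_id: [start positions]} where the phrase words occur consecutively."""
    if not phrase or any(term not in index for term in phrase):
        return {}

    # Step 1: keep only documents that contain every term of the phrase.
    docs = set(index[phrase[0]])
    for term in phrase[1:]:
        docs &= set(index[term])

    # Step 2: a start position p survives if term i occurs at position p + i.
    result: Dict[int, List[int]] = {}
    for doc in sorted(docs):
        starts = set(index[phrase[0]][doc])
        for offset, term in enumerate(phrase[1:], start=1):
            starts &= {p - offset for p in index[term][doc]}
        if starts:
            result[doc] = sorted(starts)
    return result


print(phrase_matches(index, ["to", "be"]))
```

With the slide's data, only document 4 contains both terms, and "to be" starts there at positions 16, 190, 429 and 433. The same function would handle the full six-word phrase once postings for "or" and "not" are added; the slide does not list them, so they are not invented here.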

Proximity queries
- Example: LIMIT! /3 STATUTE /3 FEDERAL /2 TORT
- Here, /k means “within k words of”.
- Clearly, positional indexes can be used for such queries; biword indexes cannot.
- Exercise: Adapt the linear merge of postings to handle proximity queries. Can you make it work for any value of k?
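
One possible answer to the exercise, sketched in Python for two terms. The function names `within_k` and `proximity_intersect` are hypothetical, and the postings reuse the "to" and "be" entries from the phrase-query slide. The outer loop is the usual linear merge over sorted document IDs; the inner routine slides a window over the second position list, so it works for any k >= 0.

```python
from typing import Dict, List, Tuple

Postings = Dict[int, List[int]]  # doc_id -> sorted positions


def within_k(pos1: List[int], pos2: List[int], k: int) -> List[Tuple[int, int]]:
    """Return all (p1, p2) pairs with |p1 - p2| <= k; both lists sorted ascending."""
    matches = []
    lo = 0  # first index in pos2 that can still be within k of the current p1
    for p1 in pos1:
        # Positions left of p1 - k are too far away for this and all later p1.
        while lo < len(pos2) and pos2[lo] < p1 - k:
            lo += 1
        j = lo
        while j < len(pos2) and pos2[j] <= p1 + k:
            matches.append((p1, pos2[j]))
            j += 1
    return matches


def proximity_intersect(post1: Postings, post2: Postings, k: int) -> Dict[int, List[Tuple[int, int]]]:
    """Linear merge over doc IDs, keeping docs where the terms occur within k words."""
    docs1, docs2 = sorted(post1), sorted(post2)
    i = j = 0
    answer = {}
    while i < len(docs1) and j < len(docs2):
        if docs1[i] == docs2[j]:
            pairs = within_k(post1[docs1[i]], post2[docs2[j]], k)
            if pairs:
                answer[docs1[i]] = pairs
            i += 1
            j += 1
        elif docs1[i] < docs2[j]:
            i += 1
        else:
            j += 1
    return answer


# Postings copied from the phrase-query slide (elided entries omitted).
to_postings = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
be_postings = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}

print(proximity_intersect(to_postings, be_postings, 3))
```

Because the window test is symmetric, the sketch accepts either word order; a phrase query is the special case where only p2 = p1 + 1 is allowed.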

Positional index size
- Can compress position values/offsets as we did with docs in the last lecture.
- Nevertheless, this expands postings storage substantially.
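
As an illustration of compressing position values, the sketch below stores gaps between successive positions and then applies variable-byte coding. This is an assumption about what "as we did with docs" refers to: the transcript does not say which scheme the previous lecture used, and gap plus variable-byte coding is simply a common choice for docID lists. All function names are hypothetical.

```python
from itertools import accumulate
from typing import List


def to_gaps(positions: List[int]) -> List[int]:
    """Replace each sorted position by its distance from the previous one."""
    return [p - prev for prev, p in zip([0] + positions[:-1], positions)]


def from_gaps(gaps: List[int]) -> List[int]:
    """Invert to_gaps with a running sum."""
    return list(accumulate(gaps))


def vb_encode(numbers: List[int]) -> bytes:
    """Variable-byte code: 7 payload bits per byte, high bit set on the last byte."""
    out = bytearray()
    for n in numbers:
        chunk = [n % 128]
        n //= 128
        while n > 0:
            chunk.insert(0, n % 128)
            n //= 128
        chunk[-1] += 128  # mark the final byte of this number
        out.extend(chunk)
    return bytes(out)


def vb_decode(data: bytes) -> List[int]:
    """Decode a byte string produced by vb_encode."""
    numbers, n = [], 0
    for b in data:
        if b < 128:
            n = 128 * n + b
        else:
            numbers.append(128 * n + (b - 128))
            n = 0
    return numbers


positions = [8, 16, 190, 429, 433]  # "to" in document 4, from the earlier slide
gaps = to_gaps(positions)
encoded = vb_encode(gaps)
assert from_gaps(vb_decode(encoded)) == positions  # round trip is lossless
print(gaps, len(encoded), "bytes")
```

Small gaps fit in one byte each, which is where the saving comes from; the number of entries still grows with every occurrence, which is why storage expands even after compression.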

Positional index size
- Need an entry for each occurrence, not just once per document.
- Index size depends on average document size. (Why?)
    - Average web page has <1000 terms.
    - SEC filings, books, even some epic poems … easily 100,000 terms.
- Consider a term with frequency 0.1%:

    Document size    Postings    Positional postings
    1000             1           1
    100,000          1           100
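
The arithmetic behind the table can be checked with a few lines of Python. This is a back-of-the-envelope sketch; `expected_entries` is a hypothetical helper, and it assumes the term occurs at exactly the stated rate in a single document.

```python
from typing import Tuple


def expected_entries(doc_size: int, term_frequency: float) -> Tuple[int, int]:
    """Return (non-positional postings, positional postings) for one document."""
    occurrences = round(doc_size * term_frequency)
    # A non-positional index records the document once, however often the term occurs.
    postings = 1 if occurrences > 0 else 0
    # A positional index records every occurrence.
    return postings, occurrences


for size in (1000, 100_000):
    postings, positional = expected_entries(size, 0.001)  # 0.1% frequency
    print(f"{size:>7} terms: {postings} posting, {positional} positional postings")
```

The non-positional count stays at one per document, while the positional count scales with document length, so long documents inflate a positional index far more than short ones.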

Rules of thumb
- A positional index is a factor of 2-4 larger than a non-positional index.
- Positional index size is 35-50% of the volume of the original text.
- Caveat: all of this holds for “English-like” languages.

Combination schemes
- A positional index expands postings storage substantially. (Why?)
- Biword index and positional index approaches can be profitably combined.
    - For particular phrases (“Michael Jackson”, “Britney Spears”) it is inefficient to keep on merging positional postings lists.
    - Even more so for phrases like “The Who”.
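
A combined scheme might look like the sketch below: selected frequent phrases get their own biword entry, and every other two-word phrase falls back to the positional index. Everything here is illustrative. The documents, the `hot_biwords` set, and the function names are hypothetical, and how to choose which phrases deserve a biword entry is not covered.

```python
from typing import Dict, List, Set, Tuple

Biword = Tuple[str, str]


def build_indexes(docs: Dict[int, str], hot_biwords: Set[Biword]):
    """Build a positional index for all terms and a biword index for chosen phrases."""
    positional: Dict[str, Dict[int, List[int]]] = {}
    biword: Dict[Biword, Set[int]] = {bw: set() for bw in hot_biwords}
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for pos, tok in enumerate(tokens):
            positional.setdefault(tok, {}).setdefault(doc_id, []).append(pos)
        for pair in zip(tokens, tokens[1:]):
            if pair in biword:
                biword[pair].add(doc_id)
    return positional, biword


def two_word_phrase(positional, biword, first: str, second: str) -> List[int]:
    """Answer a two-word phrase query, preferring the biword index when available."""
    if (first, second) in biword:
        # Direct lookup: no positional merge needed.
        return sorted(biword[(first, second)])
    # Fallback: check adjacency in the positional postings.
    docs_a, docs_b = positional.get(first, {}), positional.get(second, {})
    hits = []
    for doc_id in sorted(docs_a.keys() & docs_b.keys()):
        second_positions = set(docs_b[doc_id])
        if any(p + 1 in second_positions for p in docs_a[doc_id]):
            hits.append(doc_id)
    return hits


# Hypothetical toy collection.
docs = {
    1: "michael jackson sang with the who",
    2: "jackson pollock met michael",
    3: "the who played live",
}
positional, biword = build_indexes(docs, {("michael", "jackson"), ("the", "who")})

print(two_word_phrase(positional, biword, "michael", "jackson"))  # biword lookup
print(two_word_phrase(positional, biword, "played", "live"))      # positional fallback
```

The trade-off is extra storage for the chosen biwords in exchange for skipping the positional merge on queries that would otherwise scan long postings lists, as with very common words like those in “The Who”.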

Wild Card Queries
Example:
- Stan* → Standard, Stanford
- S*T → Start
- *ion → Option
- Pa*an → Pakistan
- Pa*t*an etc…
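
To make the matching semantics of these patterns concrete, the sketch below scans a small vocabulary with Python's standard `fnmatch` module. It is a brute-force illustration only: the slide does not say how wildcard queries are evaluated against an index, and the vocabulary and the name `wildcard_lookup` are hypothetical. Note that `fnmatchcase` also treats `?` and `[...]` as special characters, which goes beyond the `*` used on the slide.

```python
from fnmatch import fnmatchcase
from typing import List


def wildcard_lookup(pattern: str, vocabulary: List[str]) -> List[str]:
    """Return vocabulary terms matching the pattern, ignoring case."""
    pat = pattern.lower()
    return [term for term in vocabulary if fnmatchcase(term.lower(), pat)]


# Hypothetical vocabulary built from the slide's example words.
vocabulary = ["Option", "Pakistan", "Standard", "Stanford", "Start"]

for query in ["Stan*", "S*T", "*ion", "Pa*an", "Pa*t*an"]:
    print(query, "->", wildcard_lookup(query, vocabulary))
```

A linear scan like this costs time proportional to the vocabulary size for every query, so it only serves to show which terms each pattern is meant to select.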

Resources for today’s lecture
- MG 3.6, 4.3; MIR 7.2
- Porter’s stemmer: http://www.sims.berkeley.edu/~hearst/irbook/porter.html
- H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems. http://www.seg.rmit.edu.au/research/research.php?author=4