INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

Slides:



Advertisements
Similar presentations
Information Retrieval and Text Mining Lecture 2. Recap of the previous lecture Basic inverted indexes: Structure: Dictionary and Postings Key step in.
Advertisements

Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Lecture 2: Boolean Retrieval Model.
Introduction to Information Retrieval Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
Inverted Index Construction Tokenizer Token stream. Friends RomansCountrymen Linguistic modules Modified tokens. friend romancountryman Indexer Inverted.
Preparing the Text for Indexing 1. Collecting the Documents Tokenizer Token stream. Friends RomansCountrymen Linguistic modules Modified tokens. friend.
The term vocabulary and postings lists
CS276A Information Retrieval
CS276 Information Retrieval and Web Search Lecture 2: The term vocabulary and postings lists.
PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
Introducing Information Retrieval and Web Search
CS347 Lecture 2 April 9, 2001 ©Prabhakar Raghavan.
PrasadL4DictonaryAndQP1 Dictionary and Postings; Query Processing Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning.
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 1 Boolean retrieval.
Query processing: optimizations Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 2.3.
Recap Preprocessing to form the term vocabulary Documents Tokenization token and term Normalization Case-folding Lemmatization Stemming Thesauri Stop words.
Information Retrieval and Web Search Lecture 2: Dictionary and Postings.
Information Retrieval Lecture 2: Dictionary and Postings.
Introduction to Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar.
Homework #1 J. H. Wang Oct. 24, 2011.
Query processing: optimizations Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 2.3.
Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
CS276 Information Retrieval and Web Search Lecture 2: Dictionary and Postings.
Information Retrieval and Web Search Lecture 2: The term vocabulary and postings lists Minqi Zhou 1.
Information Retrieval
Text Indexing and Search
CS122B: Projects in Databases and Web Applications Winter 2017
COIS 442 Foundations on IR Information Retrieval and Web Search
Query processing: phrase queries and positional indexes
Slides from Book: Christopher D
Modified from Stanford CS276 slides
Query processing: phrase queries and positional indexes
Text Processing.
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
CS276: Information Retrieval and Web Search
CS276: Information Retrieval and Web Search
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Recap of the previous lecture
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Information Retrieval Systems
Information Retrieval
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Lectures 4: Skip Pointers, Phrase Queries, Positional Indexing
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Introducing Information Retrieval and Web Search
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Lecture 2: The term vocabulary and postings lists
Lecture 2: The term vocabulary and postings lists
Query processing: phrase queries and positional indexes
Introducing Information Retrieval and Web Search
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Introducing Information Retrieval and Web Search
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Presentation transcript:

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID Lecture # 16 Phrase queries

ACKNOWLEDGEMENTS The presentation of this lecture has been taken from the underline sources “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D. Manning, and Hinrich Schütze “Managing gigabytes” by Ian H. Witten, ‎Alistair Moffat, ‎Timothy C. Bell “Modern information retrieval” by Baeza-Yates Ricardo, ‎  “Web Information Retrieval” by Stefano Ceri, ‎Alessandro Bozzon, ‎Marco Brambilla

Outline Types of Queries Phrase queries Biword indexes Extended biwords Positional indexes

00:01:58  00:02:22

00:04:20  00:50:00

00:06:50  00:07:20

Types of Queries Phrase Queries The crops in pakistan 2. Proximity Queries LIMIT! /3 STATUTE /3 FEDERAL /2 TORT 3. Wild Card Queries Results* 00:01:58  00:02:22 (Phrase ….) 00:04:20  00:05:00 (Phrase ….) 00:05:13  00:05:55 (proximity ….) 00:06:50  00:07:20 (Wild)

Phrase queries Want to answer queries such as “Stanford university” – as a phrase Thus the sentence “I went to university at Stanford” is not a match. No longer suffices to store only <term : docs> entries 00:14:10  00:14:35 00:15:40  00:16:00 00:16:40  00:18:00

A first attempt: Biword indexes Index every consecutive pair of terms in the text as a phrase For example the text “Friends, Romans, Countrymen” would generate the biwords friends romans romans countrymen Each of these biwords is now a dictionary term Two-word phrase query-processing is now immediate. 00:19:08  00:19:20 00:19:54  00:20:50

Can have false positives! Longer phrase queries Longer phrases are processed as we did with wild-cards: “stanford university palo alto” can be broken into the Boolean query on biwords: “stanford university” AND “university palo” AND “palo alto” Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase. 00:21:50  00:23:15 00:25:40  00:26:00 Can have false positives!

Extended biwords Parse the indexed text and perform part-of-speech-tagging (POST). Bucket the terms into (say) Nouns (N) and articles/prepositions (X). Now deem any string of terms of the form NX*N to be an extended biword. Each such extended biword is now made a term in the dictionary. Example: catcher in the rye Capital of Pakistan 00:27:40  00:28:25 (parse the indexed & bucket the terms) 00:29:10  00:31:00 (now deem) N X X N

Query processing Given a query, parse it into N’s and X’s Issues Segment query into enhanced biwords Look up index Issues Parsing longer queries into conjunctions E.g., the query tangerine trees and marmalade skies is parsed into tangerine trees AND trees and marmalade AND marmalade skies 00:36:16  00:36:50

Other issues False positives, as noted before Index blowup due to bigger dictionary 00:37:00  00:38:00

Positional indexes Store, for each term, entries of the form: <number of docs containing term; doc1: position1, position2 … ; doc2: position1, position2 … ; etc.> 00:43:00  00:43:35 00:45:50  00:46:00

Positional index example <be: 993427; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …> Which of docs 1,2,4,5 could contain “to be or not to be”? 00:47:10  00:49:40 00:50:10  00:51:55 Can compress position values/offsets Nevertheless, this expands postings storage substantially

Resources for today’s lecture MG 3.6, 4.3; MIR 7.2 Porter’s stemmer: http//www.sims.berkeley.edu/~hearst/irbook/porter.html H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems. http://www.seg.rmit.edu.au/research/research.php?author=4