Presentation is loading. Please wait.

Presentation is loading. Please wait.

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

Similar presentations


Presentation on theme: "INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID"— Presentation transcript:

1 INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Lecture # 16 Phrase queries

2 ACKNOWLEDGEMENTS The presentation of this lecture has been taken from the underline sources “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D. Manning, and Hinrich Schütze “Managing gigabytes” by Ian H. Witten, ‎Alistair Moffat, ‎Timothy C. Bell “Modern information retrieval” by Baeza-Yates Ricardo, ‎  “Web Information Retrieval” by Stefano Ceri, ‎Alessandro Bozzon, ‎Marco Brambilla

3 Outline Types of Queries Phrase queries Biword indexes
Extended biwords Positional indexes

4 00:01:58  00:02:22

5 00:04:20  00:50:00

6 00:06:50  00:07:20

7 Types of Queries Phrase Queries The crops in pakistan
2. Proximity Queries LIMIT! /3 STATUTE /3 FEDERAL /2 TORT 3. Wild Card Queries Results* 00:01:58  00:02:22 (Phrase ….) 00:04:20  00:05:00 (Phrase ….) 00:05:13  00:05:55 (proximity ….) 00:06:50  00:07:20 (Wild)

8 Phrase queries Want to answer queries such as “Stanford university” – as a phrase Thus the sentence “I went to university at Stanford” is not a match. No longer suffices to store only <term : docs> entries 00:14:10  00:14:35 00:15:40  00:16:00 00:16:40  00:18:00

9 A first attempt: Biword indexes
Index every consecutive pair of terms in the text as a phrase For example the text “Friends, Romans, Countrymen” would generate the biwords friends romans romans countrymen Each of these biwords is now a dictionary term Two-word phrase query-processing is now immediate. 00:19:08  00:19:20 00:19:54  00:20:50

10 Can have false positives!
Longer phrase queries Longer phrases are processed as we did with wild-cards: “stanford university palo alto” can be broken into the Boolean query on biwords: “stanford university” AND “university palo” AND “palo alto” Without the docs, we cannot verify that the docs matching the above Boolean query do contain the phrase. 00:21:50  00:23:15 00:25:40  00:26:00 Can have false positives!

11 Extended biwords Parse the indexed text and perform part-of-speech-tagging (POST). Bucket the terms into (say) Nouns (N) and articles/prepositions (X). Now deem any string of terms of the form NX*N to be an extended biword. Each such extended biword is now made a term in the dictionary. Example: catcher in the rye Capital of Pakistan 00:27:40  00:28:25 (parse the indexed & bucket the terms) 00:29:10  00:31:00 (now deem) N X X N

12 Query processing Given a query, parse it into N’s and X’s Issues
Segment query into enhanced biwords Look up index Issues Parsing longer queries into conjunctions E.g., the query tangerine trees and marmalade skies is parsed into tangerine trees AND trees and marmalade AND marmalade skies 00:36:16  00:36:50

13 Other issues False positives, as noted before
Index blowup due to bigger dictionary 00:37:00  00:38:00

14 Positional indexes Store, for each term, entries of the form:
<number of docs containing term; doc1: position1, position2 … ; doc2: position1, position2 … ; etc.> 00:43:00  00:43:35 00:45:50  00:46:00

15 Positional index example
<be: ; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …> Which of docs 1,2,4,5 could contain “to be or not to be”? 00:47:10  00:49:40 00:50:10  00:51:55 Can compress position values/offsets Nevertheless, this expands postings storage substantially

16 Resources for today’s lecture
MG 3.6, 4.3; MIR 7.2 Porter’s stemmer: http// H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems.


Download ppt "INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID"

Similar presentations


Ads by Google