Lectures 4: Skip Pointers, Phrase Queries, Positional Indexing

Slides:



Advertisements
Similar presentations
Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Lecture 2: Boolean Retrieval Model.
Advertisements

Introduction to Information Retrieval Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
PrasadL07IndexCompression1 Index Compression Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford) and Christopher Manning.
The term vocabulary and postings lists
CS276A Information Retrieval
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 2: The term vocabulary and postings.
Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides ©Addison Wesley, 2008.
CpSc 881: Information Retrieval
Web Search – Summer Term 2006 II. Information Retrieval (Basics) (c) Wolfgang Hürst, Albert-Ludwigs-University.
PrasadL3InvertedIndex1 Inverted Index Construction Adapted from Lectures by Prabhakar Raghavan (Yahoo and Stanford) and Christopher Manning (Stanford)
Introducing Information Retrieval and Web Search
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 1 Boolean retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 2: The term vocabulary and postings lists Related to Chapter 2:
LIS618 lecture 2 the Boolean model Thomas Krichel
Query processing: optimizations Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 2.3.
Parallel and Distributed IR. 2 Papers on Parallel and Distributed IR Introduction Paper A: Inverted file partitioning schemes in Multiple Disk Systems.
Recap Preprocessing to form the term vocabulary Documents Tokenization token and term Normalization Case-folding Lemmatization Stemming Thesauri Stop words.
| 1 Gertjan van Noord2014 Zoekmachines Lecture 2: vocabulary, posting lists.
Information Retrieval and Web Search Lecture 2: Dictionary and Postings.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan.
1 Information Retrieval LECTURE 1 : Introduction.
Fast Phrase Querying With Combined Indexes HUGH E. WILLIAMS, JUSTIN ZOBEL, and DIRK BAHLE RMIT University 2004 Burak Görener Doğuş University.
Introduction to Information Retrieval Boolean Retrieval.
Information Retrieval and Web Search Boolean retrieval Instructor: Rada Mihalcea (Note: some of the slides in this set have been adapted from a course.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 4: Skip Pointers, Dictionaries and tolerant retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
Query processing: optimizations Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 2.3.
Introduction to Information Retrieval Introducing Information Retrieval and Web Search.
Introduction to Information Retrieval Introduction to Information Retrieval Lectures 4-6: Skip Pointers, Dictionaries and tolerant retrieval.
Introduction to COMP9319: Web Data Compression and Search Search, index construction and compression Slides modified from Hinrich Schütze and Christina.
CS315 Introduction to Information Retrieval Boolean Search 1.
COMP9319: Web Data Compression and Search
Query processing: optimizations
Large Scale Search: Inverted Index, etc.
Information Retrieval and Web Search
Text Indexing and Search
Overview Recap Documents Terms Skip pointers Phrase queries
CS122B: Projects in Databases and Web Applications Winter 2017
Indexing UCSB 293S, 2017 Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides ©Addison Wesley,
Lecture 1: Introduction and the Boolean Model Information Retrieval
Indexing Goals: Store large files Support multiple search keys
Query processing: phrase queries and positional indexes
Slides from Book: Christopher D
Modified from Stanford CS276 slides Lecture 4: Index Construction
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Query processing: phrase queries and positional indexes
MG4J – Managing GigaBytes for Java Introduction
Boolean Retrieval Term Vocabulary and Posting Lists Web Search Basics
Boolean Retrieval.
Basic Information Retrieval
CS276: Information Retrieval and Web Search
Lecture 7: Index Construction
CS122B: Projects in Databases and Web Applications Winter 2017
Boolean Retrieval.
Information Retrieval and Web Search Lecture 1: Boolean retrieval
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Information Retrieval
Boolean Retrieval.
Introducing Information Retrieval and Web Search
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Lecture 2: The term vocabulary and postings lists
Lecture 2: The term vocabulary and postings lists
CS276 Information Retrieval and Web Search
Query processing: phrase queries and positional indexes
Introducing Information Retrieval and Web Search
Information Retrieval and Web Design
CSE 326: Data Structures Lecture #14
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Lecture-Hashing.
Introducing Information Retrieval and Web Search
Presentation transcript:

Lectures 4: Skip Pointers, Phrase Queries, Positional Indexing

Faster postings merges: Skip pointers/Skip lists

Sec. 2.3 Recall basic merge Walk through the two postings simultaneously, in time linear in the total number of postings entries 2 4 8 41 48 64 128 Brutus 2 8 1 2 3 8 11 17 21 31 Caesar If the list lengths are m and n, the merge takes O(m+n) operations. Can we do better? Yes (if the index isn’t changing too fast).

Augment postings with skip pointers (at indexing time) Sec. 2.3 Augment postings with skip pointers (at indexing time) 41 128 128 2 4 8 41 48 64 11 31 1 2 3 8 11 17 21 31 Why? To skip postings that will not figure in the search results. How? Where do we place skip pointers?

Query processing with skip pointers Sec. 2.3 Query processing with skip pointers 41 128 2 4 8 41 48 64 128 11 31 But the skip successor of 11 on the lower list is 31, so we can skip ahead past the intervening postings. 1 2 3 8 11 17 21 31 Suppose we’ve stepped through the lists until we process 8 on each list. We match it and advance. We then have 41 and 11 on the lower. 11 is smaller.

Where do we place skips? Tradeoff: Sec. 2.3 Where do we place skips? Tradeoff: More skips  shorter skip spans  more likely to skip. But lots of comparisons to skip pointers. Fewer skips  few pointer comparison, but then long skip spans  few successful skips

Sec. 2.3 Placing skips Simple heuristic: for postings of length L, use L evenly-spaced skip pointers [Moffat and Zobel 1996] Easy if the index is relatively static; harder if L keeps changing because of updates. This definitely used to help; with modern hardware it may not unless you’re memory-based [Bahle et al. 2002] The I/O cost of loading a bigger postings list can outweigh the gains from quicker in memory merging!

Phrase queries We want to answer a query such as [stanford university] – as a phrase. Thus The inventor Stanford Ovshinsky never went to university should not be a match. The concept of phrase query has proven easily understood by users. About 10% of web queries are phrase queries. Consequence for inverted index: it no longer suffices to store docIDs in postings lists for terms. Two ways of extending the inverted index: biword index positional index 8

Biword indexes Index every consecutive pair of terms in the text as a phrase. For example, Friends, Romans, Countrymen would generate two biwords: “friends romans” and “romans countrymen” Each of these biwords is now a vocabulary term. Two-word phrases can now easily be answered. 9

Longer phrase queries A long phrase like “stanford university palo alto” can be represented as the Boolean query “STANFORD UNIVERSITY” AND “UNIVERSITY PALO” AND “PALO ALTO” Does this always guarantee the correct match? -- We need to do post-filtering of hits to identify subset that actually contains the 4-word phrase. What about phrases like, “abolition of slavery”? 10

Extended biwords Parse each document and perform part-of-speech tagging Bucket the terms into (say) nouns (N) and articles/prepositions (X) Now deem any string of terms of the form NX*N to be an extended biword Examples: catcher in the rye N X X N king of Denmark N X N Include extended biwords in the term vocabulary Queries are processed accordingly 11

Issues with biword indexes Why are biword indexes rarely used? False positives, as noted above Index blowup due to very large term vocabulary What can be an alternative? 12

Positional indexes Positional indexes are a more efficient alternative to biword indexes. Postings lists in a nonpositional index: each posting is just a docID Postings lists in a positional index: each posting is a docID and a list of positions 13

Positional indexes: Example Query: “to1 be2 or3 not4 to5 be6” TO, 993427: ‹ 1: ‹7, 18, 33, 72, 86, 231›; 2: ‹1, 17, 74, 222, 255›; 4: ‹8, 16, 190, 429, 433›; 5: ‹363, 367›; 7: ‹13, 23, 191›; . . . › BE, 178239: ‹ 1: ‹17, 25›; 4: ‹17, 191, 291, 430, 434›; 5: ‹14, 19, 101›; . . . › Document 4 is a match! 14

Proximity search We just saw how to use a positional index for phrase searches. Can we also use it for proximity search? For example: employment /4 place Find all documents that contain EMPLOYMENT and PLACE within 4 words of each other. Employment agencies that place healthcare workers are seeing growth is a hit. Employment agencies that have learned to adapt now place healthcare workers is not a hit. 15

Proximity search Use the positional index Simplest algorithm: look at cross-product of positions of (i) EMPLOYMENT in document and (ii) PLACE in document Very inefficient for frequent words, especially stop words Note that we want to return the actual matching positions, not just a list of documents. 16

Combination scheme Biword indexes and positional indexes can be profitably combined. Many biwords are extremely frequent: Michael Jackson etc For these biwords, increased speed compared to positional postings intersection is substantial. Combination scheme: Include frequent biwords as vocabulary terms in the index. Do all other phrases by positional intersection. Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme – Next Word Index. Faster than a positional index, at a cost of 26% more space for index. 17