Information Retrieval On the use of the Inverted Lists.



The index we just built
Various issues come into play: How do we process a query? What kinds of queries can we process?
Stopword list: terms that are so common that they are ignored for indexing (e.g., the, a, an, of, to, …); the list is language-specific.
We keep everything!

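Query processing
Consider the query: Brutus AND Caesar. Scan the postings lists of Brutus and Caesar in parallel and keep the docIDs that occur in both.
[Figure: the postings lists of Brutus and Caesar; the answer shown is docIDs 2 and 8.]
If the list lengths are m and n, this takes O(m+n) time.
Crucial: postings must be sorted by docID (a further reason for doing this; recall gap-coding).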
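The linear-time merge described above can be sketched as follows. This is a minimal illustration, not the lecturer's code; the two postings lists are assumed sample data chosen so that the answer matches the docIDs 2 and 8 shown on the slide.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(m + n) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID present in both lists: part of the answer.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docID
        else:
            j += 1
    return answer


# Assumed sample postings (not taken from the slide).
brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # prints [2, 8]
```

Each comparison advances at least one cursor, which is why the sorted order by docID is essential for the O(m+n) bound.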

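Use skip pointers
Skip pointers let the intersection jump over postings that will not figure in the search results.
[Figure: example of postings lists with skip pointers.]
Lucene stores a skip pointer for one posting out of every 16.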
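A hedged sketch of intersection with skip pointers follows. It assumes a simple layout in which every `skip`-th posting carries a pointer to the posting `skip` positions ahead; the default of 16 mirrors the "one out of 16" remark, but the layout itself is an assumption made for illustration and is not Lucene's actual on-disk format.

```python
def intersect_with_skips(p1, p2, skip=16):
    """Intersect two sorted postings lists, using skip pointers when possible.

    Assumed layout: the posting at index k has a skip pointer to index
    k + skip whenever k is a multiple of `skip` and the target exists.
    """
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # Take the skip only if its target does not overshoot p2[j].
            if i % skip == 0 and i + skip < len(p1) and p1[i + skip] <= p2[j]:
                i += skip
            else:
                i += 1
        else:
            if j % skip == 0 and j + skip < len(p2) and p2[j + skip] <= p1[i]:
                j += skip
            else:
                j += 1
    return answer


# Assumed sample data: a long list intersected with a sparse one.
long_list = list(range(0, 1000, 3))
sparse_list = list(range(0, 1000, 50))
print(intersect_with_skips(long_list, sparse_list))
```

A skip is safe because all postings jumped over are smaller than the skip target, which in turn is not larger than the docID being searched for.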

Query optimization
What is the best order for query processing? Consider a query that is an AND of t terms: for each of the t terms, get its postings, then AND them together.
Query: Brutus AND Calpurnia AND Caesar
[Figure: the postings lists of Brutus, Calpurnia and Caesar.]

Query optimization example
Process the terms in order of increasing frequency: start with the smallest set, then keep cutting it further.
[Figure: the postings lists of Brutus, Calpurnia and Caesar.]
This is one reason for keeping the frequency in the dictionary.
Execute the query as (Caesar AND Brutus) AND Calpurnia.
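The ordering heuristic can be sketched as below. The toy index is assumed sample data, picked so that Caesar has the shortest list and the evaluation order matches (Caesar AND Brutus) AND Calpurnia; here the document frequency is simply the length of the postings list, whereas a real dictionary would store it explicitly.

```python
def intersect(p1, p2):
    """Linear-time intersection of two postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def process_and_query(terms, index):
    """AND together the postings of `terms`, rarest term first."""
    if not terms:
        return []
    # Sort by document frequency (here: postings-list length).
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    result = index.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:
            break  # an empty intermediate result cannot grow again
        result = intersect(result, index.get(term, []))
    return result


# Assumed sample index (not taken from the slide).
index = {
    "brutus": [2, 4, 8, 16, 32, 64, 128],
    "calpurnia": [1, 2, 3, 5, 8, 13, 16, 21, 34],
    "caesar": [13, 16],
}
print(process_and_query(["brutus", "calpurnia", "caesar"], index))  # prints [16]
```

Starting from the smallest list keeps every intermediate result no larger than that list, so later intersections have less to scan.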

Information Retrieval Sophisticated queries

Positional index
Expand the posting lists so that each docID is followed by the positions at which the term occurs in that document:
to: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191; ...
be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101; ...
Larger space occupancy: about 4 times more.
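As an illustration of what the positions buy us, here is a small sketch of a two-word phrase query over the postings shown above. It reads "to" and "be" as the two indexed terms; the dictionary-of-dictionaries layout and the function name `phrase_query` are assumptions made for this example.

```python
# Positional postings from the slide: term -> {docID: [positions]}.
positional_index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}


def phrase_query(first, second, index):
    """Return {docID: [positions where `first` is immediately followed by `second`]}."""
    hits = {}
    postings1 = index.get(first, {})
    postings2 = index.get(second, {})
    # Only documents containing both terms can contain the phrase.
    for doc_id in sorted(postings1.keys() & postings2.keys()):
        second_positions = set(postings2[doc_id])
        starts = [p for p in postings1[doc_id] if p + 1 in second_positions]
        if starts:
            hits[doc_id] = starts
    return hits


print(phrase_query("to", "be", positional_index))  # prints {4: [16, 190, 429, 433]}
```

Only document 4 contains both terms, and four of its occurrences of "to" are directly followed by "be".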

Wild-card queries: *
mon*: find all docs containing words prefixed by “mon”. Easy with a trie (or B-tree) on the dictionary: retrieve all words w in the range mon ≤ w < moo.
*mon: find words ending with “mon”: harder! Maintain another trie for the reversed terms, then retrieve all words w in the range nom ≤ w < non.
What about compressed full-text indexes?
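The two range lookups can be sketched as follows. For brevity this uses binary search over a sorted list of terms instead of a trie or B-tree, which is a substitution made only for the example; the vocabulary is assumed sample data.

```python
import bisect


def prefix_range(sorted_terms, prefix):
    """Return all terms w with prefix <= w < next(prefix), e.g. mon <= w < moo.

    Assumes a non-empty prefix.
    """
    lo = bisect.bisect_left(sorted_terms, prefix)
    # Upper bound: the prefix with its last character incremented.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    hi = bisect.bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]


def suffix_search(sorted_reversed_terms, suffix):
    """Answer *suffix by a prefix search on the reversed terms."""
    matches = prefix_range(sorted_reversed_terms, suffix[::-1])
    return sorted(term[::-1] for term in matches)


# Assumed sample vocabulary.
vocabulary = ["lemon", "monday", "money", "month", "moon",
              "nonsense", "salmon", "sermon"]
terms = sorted(vocabulary)
reversed_terms = sorted(t[::-1] for t in vocabulary)

print(prefix_range(terms, "mon"))            # prints ['monday', 'money', 'month']
print(suffix_search(reversed_terms, "mon"))  # prints ['lemon', 'salmon', 'sermon']
```

The second structure doubles the dictionary space, which is the price of supporting leading wildcards this way.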

Wildcard query: Permuterm index
May we design an index that efficiently answers queries of the form X*Y on the dictionary?
The term hello is indexed as: hello$, ello$h, llo$he, lo$hel, o$hell, $hello
How do we find X, X*, *X, *X*, X*Y, …?
X: lookup on X$
X*: lookup on $X*
*X: lookup on X$*
*X*: lookup on X*
X*Y: lookup on Y$X*
What about X*Y*Z?
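A minimal sketch of the permuterm lookups listed above is given below. It assumes the query contains at most one `*`, or has the form `*X*`; the rotations are kept in a sorted list and searched by prefix with binary search, which stands in for whatever dictionary structure is actually used. The function names and the vocabulary are hypothetical.

```python
import bisect


def rotations(term):
    """All rotations of term + '$', e.g. hello -> hello$, ello$h, ..., $hello."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]


def build_permuterm(terms):
    """Return parallel lists (sorted rotations, originating terms)."""
    pairs = sorted((rot, t) for t in terms for rot in rotations(t))
    return [rot for rot, _ in pairs], [t for _, t in pairs]


def lookup(keys, owners, query):
    """Answer X, X*, *X, *X* and X*Y by a single prefix search."""
    if "*" not in query:
        key, exact = query + "$", True          # X   -> X$
    elif query.startswith("*") and query.endswith("*") and len(query) > 2:
        key, exact = query[1:-1], False         # *X* -> X*
    else:
        x, y = query.split("*")                 # assumes a single '*'
        key, exact = y + "$" + x, False         # X*Y -> Y$X*
    found = set()
    i = bisect.bisect_left(keys, key)
    while i < len(keys) and keys[i].startswith(key):
        if not exact or keys[i] == key:
            found.add(owners[i])
        i += 1
    return sorted(found)


# Assumed sample vocabulary.
keys, owners = build_permuterm(["hello", "help", "hallo", "shell", "moon"])
print(lookup(keys, owners, "h*o"))    # prints ['hallo', 'hello']
print(lookup(keys, owners, "*ell*"))  # prints ['hello', 'shell']
```

The cases X* and *X fall out of the X*Y rule with an empty Y or an empty X, giving `$X` and `X$` as search prefixes.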

Information Retrieval Inverted-List caching

What about caching?
Two opposite approaches:
I. Cache the query results (exploits query locality)
II. Cache pages of posting lists (exploits term locality)

Which caching?
[Figure: the same two approaches (query-result cache vs. posting-list page cache) compared on a 10Mb/s disk and on a 50Mb/s disk.]
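To make approach I concrete, here is a small sketch of a query-result cache. The slides do not name an eviction policy; least-recently-used (LRU) is assumed here purely for illustration, and `QueryResultCache` and `evaluate` are hypothetical names.

```python
from collections import OrderedDict


class QueryResultCache:
    """LRU cache mapping a query string to its result list (assumed policy)."""

    def __init__(self, capacity, evaluate):
        self.capacity = capacity
        self.evaluate = evaluate      # function: query -> results (the slow path)
        self.entries = OrderedDict()  # least recently used entry comes first
        self.hits = 0
        self.misses = 0

    def get(self, query):
        if query in self.entries:
            self.entries.move_to_end(query)  # mark as most recently used
            self.hits += 1
            return self.entries[query]
        self.misses += 1
        results = self.evaluate(query)
        self.entries[query] = results
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return results


# Usage with a stand-in evaluator that pretends to hit the index.
cache = QueryResultCache(capacity=2, evaluate=lambda q: sorted(q.split()))
for q in ["brutus caesar", "calpurnia", "brutus caesar"]:
    cache.get(q)
print(cache.hits, cache.misses)  # prints 1 2
```

Approach II could reuse the same structure with keys such as (term, page number) and posting-list pages as values, again as an assumption about the layout.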

Architectural features
What matters is the ratio between disk_rate and decompression_speed.
The bottleneck is the disk, so fast decompression methods may be useless if the disk is slow!

Information Retrieval Dynamic Indexing

What about dynamic indexing?
Docs come in over time:
  postings updates for terms already in the dictionary
  new terms added to the dictionary
Docs get deleted
Docs get changed

The simplest approach
Maintain a “big” main index.
New docs go into a “small” auxiliary index.
Search across both, and then merge the results.
Deletions:
  keep an invalidation bit-vector for deleted docs
  filter the docs output on a search result by this invalidation bit-vector
Periodically, re-index into one main index.
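The scheme above can be sketched in a few lines. This is a toy model under stated assumptions: documents are whitespace-tokenized strings, docIDs are never reused, and a Python set stands in for the invalidation bit-vector. `DynamicIndex` and `build_index` are hypothetical names.

```python
from collections import defaultdict


def build_index(docs):
    """Build term -> sorted list of docIDs from {docID: text}."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return index


class DynamicIndex:
    def __init__(self, docs):
        self.docs = dict(docs)
        self.main = build_index(self.docs)   # "big" main index
        self.aux_docs = {}
        self.aux = defaultdict(list)         # "small" auxiliary index
        self.invalid = set()                 # stands in for the bit-vector

    def add(self, doc_id, text):
        self.aux_docs[doc_id] = text
        self.aux = build_index(self.aux_docs)  # cheap because it is small

    def delete(self, doc_id):
        self.invalid.add(doc_id)

    def search(self, term):
        # Search both indexes, merge, then filter out deleted docs.
        merged = sorted(set(self.main.get(term, [])) | set(self.aux.get(term, [])))
        return [d for d in merged if d not in self.invalid]

    def reindex(self):
        """Periodically rebuild one main index from all live docs."""
        self.docs.update(self.aux_docs)
        for doc_id in self.invalid:
            self.docs.pop(doc_id, None)
        self.main = build_index(self.docs)
        self.aux_docs, self.aux, self.invalid = {}, defaultdict(list), set()


idx = DynamicIndex({1: "brutus caesar", 2: "caesar"})
idx.add(3, "caesar calpurnia")
idx.delete(2)
print(idx.search("caesar"))  # prints [1, 3]
```

Until `reindex` runs, deleted documents still occupy space in the postings; they are only hidden at query time.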

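A cascade of indexes
The merging is advantageous when the “small” and “big” indices have almost the same size. How to ensure this?
Keep a cascade of indexes whose sizes (#docs per index) are c, 2c, 4c, 8c, …, 2^i c, as in Lucene.
[Figure: every column is a collection of docs on which we build one index. When c new docs arrive and indexes of sizes c, 2c, 4c, 8c already exist, the merges cascade (+2c, +4c, +8c, +16c) and end in a single index of 16c docs.]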
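The cascade behaves like a binary counter, which the following sketch tries to make explicit. It assumes that documents arrive in batches of c docs, each already turned into a small index (term -> sorted docIDs); `CascadeIndex` and `merge` are hypothetical names, and this is not Lucene's actual merge policy.

```python
def merge(index_a, index_b):
    """Merge two indexes of the form term -> sorted list of docIDs."""
    merged = {}
    for term in index_a.keys() | index_b.keys():
        merged[term] = sorted(set(index_a.get(term, [])) | set(index_b.get(term, [])))
    return merged


class CascadeIndex:
    def __init__(self):
        # levels[i] is None or an index covering 2**i batches of c docs.
        self.levels = []

    def add_batch(self, batch_index):
        """Insert the index of c new docs, merging equal-sized indexes upward."""
        carry = batch_index
        i = 0
        while i < len(self.levels) and self.levels[i] is not None:
            carry = merge(self.levels[i], carry)  # sizes match: 2**i + 2**i
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(None)
        self.levels[i] = carry

    def search(self, term):
        """Query every level and merge the partial results."""
        docs = set()
        for level in self.levels:
            if level is not None:
                docs.update(level.get(term, []))
        return sorted(docs)


cascade = CascadeIndex()
for batch_no in range(5):
    cascade.add_batch({"caesar": [batch_no]})
print([level is not None for level in cascade.levels])  # prints [True, False, True]
```

After five batches the occupied levels spell 5 in binary (101): one index of c docs and one of 4c docs, so a query touches only a logarithmic number of indexes.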