Information Retrieval On the use of the Inverted Lists.

The index we just built
Various issues come into play: How do we process a query? What kinds of queries can we process?
Stopword list: terms that are so common that they are ignored for indexing (e.g., the, a, an, of, to, …). The list is language-specific.
We keep everything (no stopwords are dropped)!

Query processing
Consider the query: Brutus AND Caesar
Brutus: 2, 4, 8, 16, 32, 64, 128
Caesar: 1, 2, 3, 5, 8, 13, 21, 34
Result: 2, 8
Walk through the two postings lists in parallel. If the list lengths are m and n, this takes O(m+n) time.
Crucial: postings must be sorted by docID (a further reason for sorting them; recall gap-coding).
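The linear-time merge described on this slide can be sketched as follows. This is a minimal illustration, assuming each postings list is a Python list of strictly increasing docIDs; the function name `intersect` is chosen here and does not come from the slides.

```python
def intersect(p1, p2):
    """Intersect two docID-sorted postings lists in O(m + n) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same docID in both lists: it belongs to the result.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller docID
        else:
            j += 1
    return answer


brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

Each step advances at least one cursor, which is where the O(m+n) bound comes from.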

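Use skip pointers
Skip pointers let the merge skip postings that will not figure in the search results (in the example, take 16 vs 8).
List 1: 2, 4, 8, 16, 32, 64, 128 (skip pointers to 16 and 128)
List 2: 1, 2, 3, 5, 8, 17, 21, 31 (skip pointers to 8 and 31)
Lucene stores one skip pointer out of every 16 postings.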
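A possible sketch of intersection with skip pointers is given below. It assumes strictly increasing docIDs and places skips every ⌊√n⌋ postings, which is a common heuristic and an assumption here (the slide only says that Lucene uses one skip every 16 postings). The names `build_skips` and `intersect_with_skips` are illustrative.

```python
import math


def build_skips(postings):
    """Map position -> skip target position, one skip every ~sqrt(n) postings.

    The sqrt(n) spacing is an assumed heuristic, not taken from the slides.
    """
    n = len(postings)
    step = max(1, int(math.sqrt(n)))
    return {i: i + step for i in range(0, n - step, step)}


def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists, following skips when it is safe."""
    s1, s2 = build_skips(p1), build_skips(p2)
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            if i in s1 and p1[s1[i]] <= p2[j]:
                # Everything before the skip target is smaller than p2[j].
                while i in s1 and p1[s1[i]] <= p2[j]:
                    i = s1[i]
            else:
                i += 1
        else:
            if j in s2 and p2[s2[j]] <= p1[i]:
                while j in s2 and p2[s2[j]] <= p1[i]:
                    j = s2[j]
            else:
                j += 1
    return answer


list1 = [2, 4, 8, 16, 32, 64, 128]
list2 = [1, 2, 3, 5, 8, 17, 21, 31]
print(intersect_with_skips(list1, list2))  # [2, 8]
```

A skip is followed only when its target is not larger than the current docID on the other list, so no common docID can be jumped over.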

Query optimization
What is the best order for query processing? Consider a query that is an AND of t terms. For each of the t terms, get its postings, then AND them together.
Brutus: 2, 4, 8, 16, 32, 64, 128
Calpurnia: 1, 2, 3, 5, 8, 13, 21, 34
Caesar: 13, 16
Query: Brutus AND Calpurnia AND Caesar

Query optimization example
Process the terms in order of increasing frequency: start with the smallest set, then keep cutting it further.
Brutus: 2, 4, 8, 16, 32, 64, 128
Calpurnia: 1, 2, 3, 5, 8, 13, 21, 34
Caesar: 13, 16
This is one reason for keeping the frequency in the dictionary.
Execute the query as (Caesar AND Brutus) AND Calpurnia.
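The ordering rule can be sketched as below, assuming the index is a dict from term to docID-sorted list and that the list length plays the role of the frequency stored in the dictionary. The names `intersect` and `conjunctive_query` are illustrative.

```python
def intersect(p1, p2):
    """Linear-time merge of two docID-sorted postings lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def conjunctive_query(index, terms):
    """AND together the postings of `terms`, shortest list first.

    Returns (processing order, resulting docIDs).
    """
    if not terms:
        return [], []
    order = sorted(terms, key=lambda t: len(index[t]))
    result = index[order[0]]
    for term in order[1:]:
        if not result:
            break  # the intermediate result can only shrink
        result = intersect(result, index[term])
    return order, result


index = {
    "Brutus": [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
    "Caesar": [13, 16],
}
order, docs = conjunctive_query(index, ["Brutus", "Calpurnia", "Caesar"])
print(order)  # ['Caesar', 'Brutus', 'Calpurnia']
print(docs)   # []
```

With these lists, Caesar AND Brutus already shrinks to the single docID 16, so the last intersection touches very little data.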

Information Retrieval Sophisticated queries

Positional index
Expand the posting lists with the positions of each occurrence (docID: positions):
to: 2: 1,17,74,222,551; 4: 8,16,190,429,433; 7: 13,23,191; ...
be: 1: 17,19; 4: 17,191,291,430,434; 5: 14,19,101; ...
Larger space occupancy: about 4 times more.
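One use of a positional index is phrase matching. The sketch below looks for the two-word phrase "to be" using only the postings shown on the slide (the elided entries are left out). The dict layout and the function name `phrase_match` are assumptions made for illustration.

```python
def phrase_match(index, first, second):
    """Return {docID: start positions} where `second` directly follows `first`."""
    hits = {}
    p1, p2 = index[first], index[second]
    # Only documents that contain both terms can contain the phrase.
    for doc in sorted(p1.keys() & p2.keys()):
        second_positions = set(p2[doc])
        starts = [pos for pos in p1[doc] if pos + 1 in second_positions]
        if starts:
            hits[doc] = starts
    return hits


# term -> {docID: sorted positions}; only the postings listed on the slide.
index = {
    "to": {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]},
    "be": {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}
print(phrase_match(index, "to", "be"))  # {4: [16, 190, 429, 433]}
```

Only document 4 contains both terms, and there "be" follows "to" at four positions.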

Wild-card queries: *
mon*: find all docs containing words prefixed by “mon”. Easy with a trie (or B-tree) on the dictionary: retrieve all words in the range mon ≤ w < moo.
*mon: find words ending with “mon”: harder! Maintain another trie for the reversed terms, then retrieve all words in the range nom ≤ w < non.
What about compressed full-text indexes?
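The two range lookups can be sketched with a sorted list and binary search standing in for the trie or B-tree (that substitution is an assumption made to keep the example short). The function names and the sample vocabulary are illustrative.

```python
import bisect


def prefix_range(sorted_terms, prefix):
    """Return all terms w with prefix <= w < next(prefix), e.g. mon <= w < moo.

    Assumes a non-empty prefix whose last character has a successor.
    """
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect.bisect_left(sorted_terms, prefix)
    hi = bisect.bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]


def suffix_range(sorted_reversed_terms, suffix):
    """Answer *suffix on the dictionary of reversed terms, e.g. nom <= w < non."""
    matches = prefix_range(sorted_reversed_terms, suffix[::-1])
    return [w[::-1] for w in matches]


vocabulary = ["demon", "lemon", "money", "monk", "month", "moon", "salmon"]
terms = sorted(vocabulary)
reversed_terms = sorted(w[::-1] for w in vocabulary)

print(prefix_range(terms, "mon"))           # ['money', 'monk', 'month']
print(suffix_range(reversed_terms, "mon"))  # ['demon', 'lemon', 'salmon']
```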

Wildcard query: Permuterm index
Can we design an index that efficiently answers queries of the form X*Y on the dictionary?
The term hello is indexed as: hello$, ello$h, llo$he, lo$hel, o$hell, $hello
How do we find X, X*, *X, *X*, X*Y, …?
X: lookup on X$
X*: lookup on $X*
*X: lookup on X$*
*X*: lookup on X*
X*Y: lookup on Y$X*
What about X*Y*Z?
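A small sketch of a permuterm index follows. It stores every rotation of term$ in a sorted list and answers each query form from the slide with one prefix search. The sorted list with binary search stands in for a real dictionary structure, and the names `rotations`, `build_permuterm` and `lookup` are illustrative. Queries of the form X*Y*Z are not handled here.

```python
import bisect


def rotations(term):
    """All rotations of term + '$', e.g. hello$, ello$h, ..., $hello."""
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]


def build_permuterm(terms):
    """Sorted list of (rotation, original term) pairs."""
    return sorted((rot, term) for term in terms for rot in rotations(term))


def lookup(permuterm, query):
    """Answer X, X*, *X, *X* and X*Y by rotating the query to a prefix."""
    if "*" not in query:
        key, exact = query + "$", True            # X    -> X$
    elif query.count("*") == 1:
        head, tail = query.split("*")
        key, exact = tail + "$" + head, False     # X*Y  -> Y$X*
    elif len(query) > 2 and query[0] == "*" and query[-1] == "*" \
            and query.count("*") == 2:
        key, exact = query[1:-1], False           # *X*  -> X*
    else:
        raise ValueError("unsupported wildcard pattern: " + query)
    found = set()
    start = bisect.bisect_left(permuterm, (key,))
    for rot, term in permuterm[start:]:
        if not rot.startswith(key):
            break  # rotations sharing the prefix are contiguous
        if not exact or rot == key:
            found.add(term)
    return sorted(found)


permuterm = build_permuterm(["hello", "help", "hollow", "yellow"])
print(rotations("hello"))
print(lookup(permuterm, "hel*"))   # ['hello', 'help']
print(lookup(permuterm, "*llo*"))  # ['hello', 'hollow', 'yellow']
print(lookup(permuterm, "h*w"))    # ['hollow']
```

X* and *X are the special cases of X*Y with an empty Y or an empty X, which is why a single branch covers all three.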

Information Retrieval Inverted-List caching

What about caching?
Two opposite approaches:
I. Cache the query results (exploits query locality)
II. Cache pages of posting lists (exploits term locality)

Which caching?
[Figure: the two approaches (I. query results, II. pages of posting lists) on a 10Mb/s disk and on a 50Mb/s disk]
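Approach I can be illustrated with a small query-result cache. The slides do not name an eviction policy, so LRU is an assumption here, and `QueryResultCache` and `slow_search` are hypothetical names; `slow_search` stands in for the real query processor.

```python
from collections import OrderedDict


class QueryResultCache:
    """Cache from query string to result list, with LRU eviction (assumed policy)."""

    def __init__(self, capacity, evaluate):
        self.capacity = capacity
        self.evaluate = evaluate        # callable: query -> list of docIDs
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, query):
        if query in self.entries:
            self.entries.move_to_end(query)   # mark as most recently used
            self.hits += 1
            return self.entries[query]
        self.misses += 1
        result = self.evaluate(query)
        self.entries[query] = result
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return result


calls = []


def slow_search(query):
    """Hypothetical stand-in for evaluating a query on the inverted index."""
    calls.append(query)
    return [2, 8] if query == "Brutus AND Caesar" else []


cache = QueryResultCache(capacity=2, evaluate=slow_search)
cache.get("Brutus AND Caesar")
cache.get("Brutus AND Caesar")   # served from the cache
print(cache.hits, cache.misses, calls)
```

A repeated query never reaches the index, which is the query locality that approach I exploits. Approach II would instead cache pages of posting lists, keyed by term.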

Architectural features
What matters is the ratio between disk_rate and decompression_speed. The bottleneck is the disk, so fast decompression methods may be useless if the disk is slow!

Information Retrieval Dynamic Indexing

What about dynamic indexing?
Docs come in over time: postings must be updated for terms already in the dictionary, and new terms must be added to the dictionary.
Docs get deleted.
Docs get changed.

The simplest approach
Maintain a “big” main index. New docs go into a “small” auxiliary index. Search across both, and then merge the results.
Deletions: keep an invalidation bit-vector for deleted docs, and filter the docs output on a search result by this invalidation bit-vector.
Periodically, re-index into one main index.
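The scheme can be sketched as below. It assumes that docIDs are assigned in increasing order and never reused, and it uses a Python integer as the invalidation bit-vector. The class `DynamicIndex` and its method names are illustrative.

```python
class DynamicIndex:
    """Main index + auxiliary index + invalidation bit-vector (a sketch)."""

    def __init__(self):
        self.main = {}     # term -> sorted docIDs ("big" index)
        self.aux = {}      # term -> sorted docIDs ("small" index)
        self.invalid = 0   # bit d is set when doc d has been deleted

    def add(self, doc_id, terms):
        """New docs go into the auxiliary index (docIDs assumed increasing)."""
        for term in set(terms):
            self.aux.setdefault(term, []).append(doc_id)

    def delete(self, doc_id):
        self.invalid |= 1 << doc_id

    def search(self, term):
        """Search both indexes, merge, then filter by the bit-vector."""
        # New docIDs are larger than old ones, so concatenation stays sorted.
        merged = self.main.get(term, []) + self.aux.get(term, [])
        return [d for d in merged if ((self.invalid >> d) & 1) == 0]

    def reindex(self):
        """Periodically fold everything into one main index."""
        for term in set(self.main) | set(self.aux):
            docs = self.search(term)
            if docs:
                self.main[term] = docs
            else:
                self.main.pop(term, None)
        self.aux = {}
        self.invalid = 0   # deleted docs are now physically gone


idx = DynamicIndex()
idx.add(1, ["brutus", "caesar"])
idx.add(2, ["caesar"])
idx.reindex()
idx.add(3, ["caesar", "calpurnia"])
idx.delete(2)
print(idx.search("caesar"))  # [1, 3]
```

A changed doc can be handled as a deletion followed by an insertion under a new docID.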

A cascade of indexes
Merging is advantageous when the “small” and “big” indexes have almost the same size. How to ensure this?
[Figure, Lucene: #docs per index is c, 2c, 4c, 8c, ..., 2^i c. Every column is a collection of docs on which we build one index. When c new docs arrive on top of indexes of c, 2c, 4c and 8c docs, the merges cascade and produce indexes of 2c, 4c, 8c and finally 16c docs.]
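The cascade behaves like a binary counter: a new batch of c docs is merged with the index of the same size, and the result is carried to the next level while that level is occupied. The sketch below represents each index as a sorted list of docIDs. It illustrates the idea only and is not a description of Lucene's actual implementation. The name `add_batch` is illustrative.

```python
def add_batch(levels, batch):
    """Insert a batch of c docs; levels[i] is None or an index of 2**i * c docs."""
    carry = sorted(batch)
    i = 0
    # Merge with equal-sized indexes while the current level is occupied.
    while i < len(levels) and levels[i] is not None:
        carry = sorted(levels[i] + carry)
        levels[i] = None
        i += 1
    if i == len(levels):
        levels.append(carry)
    else:
        levels[i] = carry


c = 2
levels = []
history = []
for b in range(5):
    add_batch(levels, list(range(b * c, (b + 1) * c)))
    history.append([len(ix) if ix is not None else 0 for ix in levels])

print(history)  # [[2], [0, 4], [2, 4], [0, 0, 8], [2, 0, 8]]
```

Every merge joins two indexes of equal size, which is the condition the slide asks for.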
