Information Retrieval On the use of the Inverted Lists
The index we just built
Various issues come into play:
  How do we process a query?
  What kinds of queries can we process?
Stopword list: terms that are so common that they are ignored for indexing, e.g., the, a, an, of, to, … (language-specific).
We keep everything!
Query processing
Consider the query: Brutus AND Caesar
[Figure: the postings lists of Brutus and Caesar, with the docIDs 2 and 8.]
If the list lengths are m and n, intersecting them takes O(m+n) time.
Crucial: postings are sorted by docID (a further reason for sorting them: recall gap-coding).
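The O(m+n) bound comes from walking both sorted lists in parallel. A minimal sketch, assuming postings are plain Python lists of docIDs; the two lists below are made up for illustration and are not taken from the slide:

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(m + n) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # docID present in both lists: part of the AND result
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1[i] cannot appear later in p2, which is sorted
        else:
            j += 1
    return answer


# Hypothetical postings lists, sorted by docID.
brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))  # [2, 8]
```

Each step advances at least one of the two cursors, so the loop runs at most m+n times; this relies on both lists being sorted by docID.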
Use skip pointers
Skip pointers let us skip postings that will not figure in the search results.
Lucene stores a skip pointer for one posting out of every 16.
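A rough sketch of intersection with skip pointers, assuming a pointer sits on every `skip`-th posting and jumps `skip` entries ahead. The names `advance` and `intersect_with_skips` are hypothetical, and the in-memory list layout is a simplification, not how Lucene stores its postings:

```python
def advance(postings, i, target, skip):
    """Move cursor i forward while postings[i] < target.

    A skip pointer is taken only when its landing docID does not
    exceed the target, so no candidate match is ever jumped over.
    """
    if i % skip == 0 and i + skip < len(postings) and postings[i + skip] <= target:
        return i + skip
    return i + 1


def intersect_with_skips(p1, p2, skip=16):
    """Intersect two sorted postings lists, following skip pointers when safe."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i = advance(p1, i, p2[j], skip)
        else:
            j = advance(p2, j, p1[i], skip)
    return answer


# Hypothetical lists: multiples of 3 and of 7 below 1000.
a = list(range(0, 1000, 3))
b = list(range(0, 1000, 7))
print(intersect_with_skips(a, b)[:4])  # [0, 21, 42, 63]
```

The skip distance is a trade-off: a small distance gives many short jumps and more pointer comparisons, a large one gives few long jumps that rarely succeed.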
Query optimization
What is the best order for query processing?
Consider a query that is an AND of t terms: for each of the t terms, get its postings, then AND them together.
Query: Brutus AND Calpurnia AND Caesar
Query optimization example
Process in order of increasing frequency: start with the smallest set, then keep cutting further.
This is one reason for keeping the frequency in the dictionary.
Execute the query as (Caesar AND Brutus) AND Calpurnia.
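The ordering heuristic can be sketched as follows. The postings lists are invented for illustration, and the list length stands in for the frequency that a real dictionary would store alongside each term; `processing_order` and `and_query` are hypothetical names:

```python
def intersect(p1, p2):
    """Linear-time intersection of two postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def processing_order(terms, index):
    """Order the terms by increasing frequency (here: postings list length)."""
    return sorted(terms, key=lambda term: len(index[term]))


def and_query(terms, index):
    """AND all terms together, starting from the rarest one."""
    order = processing_order(terms, index)
    result = index[order[0]]
    for term in order[1:]:
        if not result:
            break  # the intermediate result is already empty
        result = intersect(result, index[term])
    return result


# Hypothetical index: term -> postings list sorted by docID.
index = {
    "Brutus": [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
    "Caesar": [8, 16],
}
query = ["Brutus", "Calpurnia", "Caesar"]
print(processing_order(query, index))  # ['Caesar', 'Brutus', 'Calpurnia']
print(and_query(query, index))         # [8]
```

Starting from the shortest list keeps every intermediate result no larger than that list, so the later intersections work on small inputs.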
Information Retrieval Sophisticated queries
Positional index
Expand the posting lists with the positions of each occurrence:
  to: 2:1,17,74,222,551; 4:8,16,190,429,433; 7:13,23,191;...
  be: 1:17,19; 4:17,191,291,430,434; 5:14,19,101;...
Larger space occupancy, about 4 times more.
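One typical use of the positions is matching adjacent terms, as in the phrase "to be". The sketch below uses only the portion of the two lists shown above (the elided entries are left out), stored as hypothetical dicts from docID to sorted positions; `phrase_matches` is an illustrative name:

```python
# Positional postings as shown on the slide: docID -> sorted positions.
to_postings = {2: [1, 17, 74, 222, 551], 4: [8, 16, 190, 429, 433], 7: [13, 23, 191]}
be_postings = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}


def phrase_matches(first, second):
    """Return {docID: positions of `first`} where `second` follows immediately."""
    matches = {}
    # Only docs containing both terms can contain the phrase.
    for doc_id in sorted(first.keys() & second.keys()):
        second_positions = set(second[doc_id])
        hits = [p for p in first[doc_id] if p + 1 in second_positions]
        if hits:
            matches[doc_id] = hits
    return matches


print(phrase_matches(to_postings, be_postings))  # {4: [16, 190, 429, 433]}
```

Only doc 4 contains both terms, and "be" follows "to" at positions 17, 191, 430 and 434.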
Wild-card queries: *
mon*: find all docs containing words prefixed by “mon”. Easy with a trie (or B-tree) on the dictionary: retrieve all words in the range mon ≤ w < moo.
*mon: find words ending with “mon”: harder! Maintain another trie for the reversed terms, then retrieve all words in the range nom ≤ w < non.
What about compressed full-text indexes?
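The two range lookups can be sketched with a sorted array and binary search standing in for the trie or B-tree. The vocabulary is made up, the helper names are hypothetical, and the prefix is assumed to be non-empty:

```python
from bisect import bisect_left


def prefix_range(sorted_terms, prefix):
    """Return all terms w with prefix <= w < next(prefix), e.g. mon <= w < moo."""
    # Upper bound: bump the last character of the prefix ("mon" -> "moo").
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]


def suffix_matches(sorted_reversed_terms, suffix):
    """Answer *suffix by a prefix lookup on the reversed terms."""
    hits = prefix_range(sorted_reversed_terms, suffix[::-1])
    return sorted(term[::-1] for term in hits)


# Hypothetical dictionary of terms.
vocabulary = ["demon", "moat", "monday", "money", "month", "moon", "salmon", "sermon"]
forward = sorted(vocabulary)
backward = sorted(term[::-1] for term in vocabulary)

print(prefix_range(forward, "mon"))     # ['monday', 'money', 'month']
print(suffix_matches(backward, "mon"))  # ['demon', 'salmon', 'sermon']
```

The second structure doubles the dictionary space, which is the price paid for answering suffix queries with the same range lookup.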
Wildcard query: Permuterm index
Can we design an index that efficiently answers queries of the form X*Y on the dictionary?
The term hello is indexed as: hello$, ello$h, llo$he, lo$hel, o$hell, $hello
How do we find X, X*, *X, *X*, X*Y, …?
  X: lookup on X$
  X*: lookup on $X*
  *X: lookup on X$*
  *X*: lookup on X*
  X*Y: lookup on Y$X*
What about X*Y*Z?
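A small sketch of the permuterm idea, assuming the rotations are kept in a sorted list of (rotation, term) pairs searched by binary search; a real index would more likely use a trie or B-tree. The function names and the vocabulary are hypothetical, and X*Y*Z is deliberately not handled:

```python
from bisect import bisect_left


def rotations(term):
    """All rotations of term + '$', e.g. hello -> hello$, ello$h, ..., $hello."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]


def build_permuterm(terms):
    """Sorted list of (rotation, original term) pairs."""
    return sorted((rot, term) for term in terms for rot in rotations(term))


def permuterm_key(query):
    """Map a query to (lookup key, exact?) following the rules on the slide."""
    if "*" not in query:
        return query + "$", True                    # X    -> X$
    if query.count("*") == 2 and query[0] == "*" and query[-1] == "*":
        return query[1:-1], False                   # *X*  -> X*
    if query.count("*") != 1:
        raise ValueError("only X, X*, *X, *X* and X*Y are handled here")
    head, tail = query.split("*")
    return tail + "$" + head, False                 # X*Y  -> Y$X*


def lookup(index, query):
    """Return the sorted terms matching the query."""
    key, exact = permuterm_key(query)
    matches = set()
    i = bisect_left(index, (key,))  # first rotation >= key
    while i < len(index) and index[i][0].startswith(key):
        if not exact or index[i][0] == key:
            matches.add(index[i][1])
        i += 1
    return sorted(matches)


# Hypothetical dictionary.
index = build_permuterm(["hello", "help", "hell", "shell", "moon", "month"])
print(rotations("hello"))     # ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
print(lookup(index, "hel*"))  # ['hell', 'hello', 'help']
print(lookup(index, "h*o"))   # ['hello']
```

The exact-match case needs the equality check because a rotation of a longer term, such as `hello$x` for a term `xhello`, would also start with `hello$`.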
Information Retrieval Inverted-List caching
What about caching?
Two opposite approaches:
  I. Cache the query results (exploits query locality)
  II. Cache pages of posting lists (exploits term locality)
Which caching?
[Figure: panels labelled "10Mb/s disk" and "50Mb/s disk".]
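The two caching levels can be combined in one searcher, sketched below. Everything here is an assumption made for illustration: the class names, the LRU eviction policy, caching whole posting lists rather than pages, and a dict playing the role of the disk:

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used


class CachedSearcher:
    def __init__(self, disk_index, result_capacity=2, postings_capacity=2):
        self.disk_index = disk_index          # stands in for the on-disk index
        self.results = LRUCache(result_capacity)     # approach I
        self.postings = LRUCache(postings_capacity)  # approach II
        self.disk_reads = 0

    def _postings(self, term):
        cached = self.postings.get(term)
        if cached is None:
            self.disk_reads += 1
            cached = self.disk_index.get(term, [])
            self.postings.put(term, cached)
        return cached

    def search(self, terms):
        key = tuple(sorted(terms))  # same term set, same cached result
        cached = self.results.get(key)
        if cached is not None:
            return cached
        lists = sorted((self._postings(t) for t in terms), key=len)
        answer = sorted(set(lists[0]).intersection(*lists[1:]))
        self.results.put(key, answer)
        return answer


disk = {"brutus": [2, 4, 8], "caesar": [2, 8, 9], "calpurnia": [8]}
searcher = CachedSearcher(disk)
print(searcher.search(["brutus", "caesar"]), searcher.disk_reads)     # [2, 8] 2
print(searcher.search(["caesar", "brutus"]), searcher.disk_reads)     # [2, 8] 2
print(searcher.search(["caesar", "calpurnia"]), searcher.disk_reads)  # [8] 3
```

The repeated query is served from the result cache; the third query is new, but it still saves one disk read because the list for "caesar" is in the postings cache.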
Architectural features
What matters is the ratio between disk_rate and decompression_speed.
The bottleneck is the disk, so fast decompression methods may be useless if the disk is slow!
Information Retrieval Dynamic Indexing
What about dynamic indexing?
Docs come in over time:
  postings updates for terms already in the dictionary
  new terms added to the dictionary
Docs get deleted
Docs get changed
The simplest approach
Maintain a “big” main index
New docs go into a “small” auxiliary index
Search across both, and then merge the results
Deletions:
  Invalidation bit-vector for deleted docs
  Filter the docs output in a search result by this invalidation bit-vector
Periodically, re-index into one main index
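The scheme above can be sketched in a few lines. The class and method names are hypothetical, documents are plain strings split on whitespace, and a Python integer used as a bitmask stands in for the invalidation bit-vector:

```python
from collections import defaultdict


def build_index(docs):
    """Build term -> sorted docID list from {docID: text}."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


class MainPlusAuxIndex:
    def __init__(self, docs):
        self.docs = dict(docs)
        self.main = build_index(self.docs)  # "big" main index
        self.aux_docs = {}
        self.aux = {}                       # "small" auxiliary index
        self.invalid = 0                    # bit d is set when doc d is deleted

    def add(self, doc_id, text):
        self.aux_docs[doc_id] = text
        self.aux = build_index(self.aux_docs)  # small, so cheap to rebuild

    def delete(self, doc_id):
        self.invalid |= 1 << doc_id

    def _is_live(self, doc_id):
        return not (self.invalid >> doc_id) & 1

    def search(self, term):
        term = term.lower()
        hits = set(self.main.get(term, [])) | set(self.aux.get(term, []))
        return sorted(d for d in hits if self._is_live(d))

    def reindex(self):
        """Periodically fold everything into one main index."""
        merged = {**self.docs, **self.aux_docs}
        self.docs = {d: t for d, t in merged.items() if self._is_live(d)}
        self.main = build_index(self.docs)
        self.aux_docs, self.aux, self.invalid = {}, {}, 0


idx = MainPlusAuxIndex({1: "Brutus killed Caesar", 2: "Caesar was ambitious"})
idx.add(3, "Brutus is honourable")
print(idx.search("brutus"))  # [1, 3]
idx.delete(1)
print(idx.search("brutus"))  # [3]
```

A changed document can be treated as a deletion followed by an insertion under a new docID.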
A cascade of indexes
The merging is advantageous when the “small” and “big” indices have almost the same size. How to ensure this?
[Figure: #docs per index in the cascade: c, 2c, 4c, 8c, …, 2^i c. Every column is a collection of docs on which we build one index. As c new docs arrive, indexes are merged into larger ones, up to 16c in the example. Label: Lucene.]
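One way to keep the merged indexes equal in size is to let the cascade behave like a binary counter: level i holds either nothing or one index over 2^i c docs, and a merge carries into the next level. The sketch below is an illustration of that idea, not Lucene's actual merge policy; `IndexCascade` is a hypothetical name and each "index" is reduced to the list of docs it covers:

```python
class IndexCascade:
    def __init__(self, c):
        self.c = c
        self.buffer = []  # fewer than c docs, not yet indexed
        self.levels = []  # levels[i] is None or a list of 2**i * c docs

    def add(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) < self.c:
            return
        # c new docs have arrived: build an index of size c and cascade.
        carry, self.buffer = self.buffer, []
        i = 0
        while True:
            if i == len(self.levels):
                self.levels.append(None)
            if self.levels[i] is None:
                self.levels[i] = carry
                return
            # Both indexes cover 2**i * c docs: merge them and carry upward.
            carry = self.levels[i] + carry
            self.levels[i] = None
            i += 1

    def sizes(self):
        """#docs per index, from the smallest level to the largest."""
        return [len(level) if level else 0 for level in self.levels]


cascade = IndexCascade(c=2)
for doc_id in range(14):
    cascade.add(doc_id)
print(cascade.sizes())  # [2, 4, 8]
cascade.add(14)
cascade.add(15)
print(cascade.sizes())  # [0, 0, 0, 16]
```

With this layout a query has to consult at most one index per level, and each document takes part in a number of merges that is logarithmic in the collection size.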