Type Less, Find More: Fast Autocompletion Search with a Succinct Index
SIGIR 2006, Seattle, USA, August 2006. Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany. Joint work with Ingmar Weber.
It's useful — Basic Autocompletion
- saves typing: no more information than necessary (e.g., salton)
- find out about formulations used: autocomplete, autocompose
- error correction: autocomplit, autocompleet
It's more useful
- Complete to phrases: phrase "voronoi diagram" → add word voronoi_diagram to the index
- Complete to subwords: compound word "eigenproblem" → add word problem to the index
- Complete to category names: author Börkur Sigurbjörnsson → add sigurbjörnson:börkur:author and börkur:sigurbjörnson:author
- Faceted search: add ct:conference:sigir, ct:author:Börkur_Sigurbjörnson, ct:year:2005 (Workshop on Faceted Search on Thursday)
- all via the same mechanism
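All of these extensions reduce to adding artificial words to the index. A minimal sketch of that idea, assuming a toy document; the `augment` helper, the `_`-joined phrase words, and the `eigen` rule are purely illustrative, not the authors' actual word formats:

```python
# Sketch: phrase, subword, and facet completion all become ordinary
# index words (names and formats here are illustrative assumptions).
def augment(doc_words, facets):
    words = set(doc_words)
    # phrase completion: each adjacent pair becomes one artificial word
    for a, b in zip(doc_words, doc_words[1:]):
        words.add(a + "_" + b)
    # subword completion: also index the tail of known compounds
    for w in doc_words:
        if w.startswith("eigen"):
            words.add(w[len("eigen"):])
    # faceted search: one ct:<facet>:<value> word per facet attribute
    for facet, value in facets.items():
        words.add("ct:" + facet + ":" + value)
    return words

words = augment(["voronoi", "diagram", "eigenproblem"],
                {"conference": "sigir", "year": "2005"})
```

Once these words are in the index, one prefix-completion mechanism answers all of the query types above.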
Related Engines
Basic Problem Definition
Query:
- a set D of documents (= hits for the first part of the query)
- a range W of words (= potential completions of the last word)
Answer:
- all documents D' from D containing a word from W
- all words W' from W contained in a document from D
Extensions (see paper):
- ranking (best hits from D' and best completions from W')
- positional information (proximity queries)
First try: inverted index (INV)
Processing 1-word queries with INV
For example, sigir*:
- D = all documents; W = all words matching sigir*
- Iterate over all words from W:
    sigir      → Doc. 18, Doc. 53, Doc. 591, ...
    sigir03    → Doc. 3, Doc. 66, Doc. 765, ...
    sigir04    → Doc. 25, Doc. 98, Doc. 221, ...
    sigirlist  → Doc. 67, Doc. 189, Doc. 221, ...
    sigirforum → Doc. 16, Doc. 110, Doc. 141, ...
- Merge the document lists (expensive!):
    D' = Doc. 3, Doc. 16, Doc. 18, Doc. 25, ...
- Output all words from the range as completions (trivial for 1-word queries):
    W' = sigir, sigir03, sigir04, sigirlist, ...
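The steps above can be sketched as follows, using a toy inverted index built from the slide's posting lists. The function name `prefix_query` and the data layout are illustrative assumptions, not the authors' code:

```python
import bisect
from heapq import merge

# Toy inverted index: sorted vocabulary + one sorted posting list per word.
vocab = ["sigir", "sigir03", "sigir04", "sigirforum", "sigirlist"]
postings = {
    "sigir":      [18, 53, 591],
    "sigir03":    [3, 66, 765],
    "sigir04":    [25, 98, 221],
    "sigirforum": [16, 110, 141],
    "sigirlist":  [67, 189, 221],
}

def prefix_query(prefix):
    """Return (hits D', completions W') for a one-word prefix query."""
    # The word range W = all words starting with the prefix (vocab is sorted).
    lo = bisect.bisect_left(vocab, prefix)
    hi = bisect.bisect_right(vocab, prefix + "\uffff")
    completions = vocab[lo:hi]
    # Merge the sorted posting lists, dropping duplicate doc ids.
    hits = []
    for d in merge(*(postings[w] for w in completions)):
        if not hits or hits[-1] != d:
            hits.append(d)
    return hits, completions

hits, completions = prefix_query("sigir")
```

The merge over many lists is exactly the expensive step the slide points out.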
Processing multi-word queries with INV
For example, sigir* sal*:
- D = Doc. 3, Doc. 16, Doc. 18, Doc. 25, ... (hits for sigir*); W = all words matching sal*
- Iterate over all words from W:
    salary     → Doc. 8, Doc. 23, Doc. 291, ...
    salesman   → Doc. 24, Doc. 36, Doc. 165, ...
    salton     → Doc. 3, Doc. 18, Doc. 66, ...
    salutation → Doc. 56, Doc. 129, Doc. 251, ...
    salvador   → Doc. 18, Doc. 21, Doc. 25, ...
- Intersect each list with D, then merge:
    D' = Doc. 3, Doc. 18, Doc. 25, ...
- Output all words with a non-empty intersection:
    W' = salton, salvador
Most intersections are empty, but INV has to compute them all!
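The refinement step for a later query word can be sketched the same way, again with the slide's toy posting lists; the `refine` helper is an illustrative name, not the authors' API:

```python
# D = hits for sigir* (from the previous step of the query).
D = [3, 16, 18, 25]
sal_postings = {
    "salary":     [8, 23, 291],
    "salesman":   [24, 36, 165],
    "salton":     [3, 18, 66],
    "salutation": [56, 129, 251],
    "salvador":   [18, 21, 25],
}

def refine(D, postings_in_range):
    """Intersect each candidate completion's list with D; merge non-empty results."""
    D_set = set(D)
    hits, completions = set(), []
    for word, docs in sorted(postings_in_range.items()):
        common = [d for d in docs if d in D_set]
        if common:                 # most intersections are empty: wasted work
            completions.append(word)
            hits.update(common)
    return sorted(hits), completions

hits, completions = refine(D, sal_postings)
```

Note that the loop body runs once per word in the range, empty intersection or not, which is the inefficiency HYB is designed to remove.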
INV — Problems
Asymptotic time complexity is bad (for our problem):
- many intersections (one per potential completion)
- has to merge/sort the non-empty intersections
Still, INV is hard to beat in practice:
- highly compressible: half the space on disk means half the time to read it
- very good locality of access: the ratio of random to sequential access time is 50,000 for disk, and still 100 for main memory
- simple code: instruction cache, branch prediction, etc.
A Hybrid Index (HYB)
Basic idea: have lists for ranges of words, e.g. salary – salvador → Doc. 3, Doc. 16, Doc. 18, Doc. 25, ...
Problem: not enough to show completions.
Solution: store the word(s) along with each doc id, e.g. salary, salvador, salton, salary, salton, salvador.
But this looks very wasteful!
HYB — Details
HYB has a block for each word range. Conceptually, a block stores a doc id list with the matching word for each posting, e.g. doc ids 1 3 5 6 7 8 9 11 12 13 15 with words from {A, B, C, D}.
Replace doc ids by gaps (+1, +2, +0, ...) and words by their frequency ranks within the block (1st, 2nd, 3rd, 4th).
Encode both gaps and ranks with a universal code, so that a value x takes about log2 x bits.
[Figure: bit layout of an actual HYB block]
How well does it compress? Which block size?
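The gap-plus-rank transformation can be sketched as follows. This is a toy encoder under the assumptions above (it stops before the actual universal bit-coding, and `encode_block` is an illustrative name):

```python
from collections import Counter

# One HYB block = (doc id, word) pairs for a word range, stored as
# doc-id gaps plus per-block frequency ranks of the words.
pairs = [(1, "A"), (3, "D"), (3, "A"), (5, "C"), (6, "A"), (7, "B")]

def encode_block(pairs):
    pairs = sorted(pairs)
    # Rank words by frequency within the block (rank 0 = most frequent),
    # so frequent words get the shortest codes under a universal encoder.
    freq = Counter(w for _, w in pairs)
    rank = {w: r for r, (w, _) in enumerate(freq.most_common())}
    gaps, ranks, prev = [], [], 0
    for doc, w in pairs:
        gaps.append(doc - prev)   # small gaps => short codes
        ranks.append(rank[w])
        prev = doc
    return gaps, ranks, rank

gaps, ranks, rank = encode_block(pairs)
```

Both output sequences are dominated by small numbers, which is what makes the block compress well.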
INV vs. HYB — Space Consumption
Theorem: the empirical entropy of INV is Σ ni · (1/ln 2 + log2(n/ni)).
Theorem: the empirical entropy of HYB with block size ε·n is Σ ni · ((1+ε)/ln 2 + log2(n/ni)).
(ni = number of documents containing the i-th word; n = number of documents)

            MEDICINE          WIKIPEDIA          TREC .GOV
            44,015 docs       2,866,503 docs     25,204,013 docs
            263,817 words     6,700,119 words    25,263,176 words
            with positions                       no positions
raw size    452 MB            7.4 GB             426 GB
INV         13 MB             0.48 GB            4.6 GB
HYB         14 MB             0.51 GB            4.9 GB

Nice match of theory and practice.
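The two bounds are easy to evaluate numerically; a small sketch using the formulas exactly as stated (the function names are illustrative):

```python
import math

# Empirical entropy bounds from the two theorems:
# ni = number of documents containing word i, n = total number of documents,
# eps = HYB block size as a fraction of n.
def h_inv(n, ns):
    return sum(ni * (1 / math.log(2) + math.log2(n / ni)) for ni in ns)

def h_hyb(n, ns, eps):
    return sum(ni * ((1 + eps) / math.log(2) + math.log2(n / ni)) for ni in ns)
```

Subtracting term by term gives H(HYB) - H(INV) = (ε/ln 2) · Σ ni, i.e. HYB pays only about ε/ln 2 extra bits per word-in-document pair, which explains why the measured index sizes are so close.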
INV vs. HYB — Query Time
Theoretical analysis: see paper.
Experiment: type ordinary queries from left to right: sig, sigi, sigir, sigir sal, sigir salt, sigir salto, sigir salton.

MEDICINE (44,015 docs, 263,817 words), 5,732 real queries, with proximity:
  INV avg 0.03 s, max 0.38 s; HYB avg 0.003 s, max 0.06 s
WIKIPEDIA (2,866,503 docs, 6,700,119 words), 100 random queries, with proximity:
  INV avg 0.17 s, max 2.27 s; HYB avg 0.05 s, max 0.49 s
TREC .GOV (25,204,013 docs, 25,263,176 words), 50 TREC queries, no proximity:
  INV avg 0.58 s, max 16.83 s; HYB avg 0.11 s, max 0.86 s

HYB better by an order of magnitude.
System Design — High-Level View
Compute Server (C++) ↔ Web Server (PHP) ↔ User Client (JavaScript)
Debugging such an application is hell!
Summary of Results
Properties of HYB:
- highly compressible (just like INV)
- fast prefix-completion queries (perfect locality of access)
- fast indexing (no full inversion necessary)
Autocompletion and more:
- phrase and subword completion, semantic completion, XML support, ...
- faceted search (Workshop Talk on Thursday)
- efficient DB joins: author[sigir sigmod] (NEW)
All with one and the same (efficient) mechanism.
INV vs. HYB — Space Consumption
Definition: empirical entropy H = optimal number of bits.
Theorem: H(INV) = Σ ni · (1/ln 2 + log2(n/ni)).
Theorem: H(HYB) with block size ε·n = Σ ni · ((1+ε)/ln 2 + log2(n/ni)).
(ni = number of documents containing the i-th word; n = number of documents)

            MED BOOKS         WIKIPEDIA          TREC .GOV
            44,015 docs       2,866,503 docs     25,204,013 docs
            263,817 words     6,700,119 words    25,263,176 words
raw size    452 MB            7.4 GB             426 GB
INV         13 MB             0.48 GB            4.6 GB
HYB         14 MB             0.51 GB            4.9 GB

Perfect match of theory and practice.
HYB vs. INV — Query Time

            MED BOOKS         WIKIPEDIA          TREC .GOV
            44,015 docs       2,866,503 docs     25,204,013 docs
            263,817 words     6,700,119 words    25,263,176 words
INV  avg    0.03 s            0.17 s             0.58 s
     max    0.38 s            2.27 s             16.83 s
HYB  avg    0.003 s           0.05 s             0.11 s
     max    0.06 s            0.49 s             0.86 s
Processing a 1-word Query with INV
Processing a 1-word query, e.g., sigir*:
- Iterate over all words matching sigir*:
    sigir      → Doc. 18, Doc. 53, Doc. 591, ...
    sigir03    → Doc. 3, Doc. 66, Doc. 765, ...
    sigir04    → Doc. 25, Doc. 98, Doc. 221, ...
    sigir05    → Doc. 57, Doc. 99, Doc. 110, ...
    sigirlist  → Doc. 67, Doc. 189, Doc. 221, ...
    sigirforum → Doc. 16, Doc. 110, Doc. 141, ...
- Merge the document lists:
    Hits: Doc. 3, Doc. 16, Doc. 18, ...
    Completions: sigir, sigir03, sigir04, sigir05, ...
Processing sigir* sal with INV
- Iterate over all words matching sigir*:
    sigir      → Doc. 18, Doc. 53, Doc. 591, ...
    sigir03    → Doc. 3, Doc. 66, Doc. 765, ...
    sigir04    → Doc. 25, Doc. 98, Doc. 221, ...
    sigirlist  → Doc. 67, Doc. 189, Doc. 221, ...
    sigirforum → Doc. 16, Doc. 110, Doc. 141, ...
- Merge the document lists (expensive!):
    Hits D' = Doc. 3, Doc. 16, Doc. 18, ...
- Output all words from the range as completions (trivial for 1-word queries):
    Completions W' = sigir, sigir03, sigir04, ...
Using an Inverted Index (INV)
W = salary – salzberg; D = Doc. 57, Doc. 87, Doc. 110, ...
    salary     → Doc. 18, Doc. 53, Doc. 591, ...
    salesman   → Doc. 3, Doc. 66, Doc. 765, ...
    salient    → Doc. 25, Doc. 98, Doc. 221, ...
    salton     → Doc. 57, Doc. 99, Doc. 110, ...
    salutation → Doc. 67, Doc. 189, Doc. 221, ...
    salvador   → Doc. 16, Doc. 110, Doc. 141, ...
    salvucci   → Doc. 18, Doc. 25, Doc. 765, ...
    salzberg   → Doc. 53, Doc. 121, Doc. 187, ...
D' = Doc. 57, Doc. 110, ...; W' = salton, salvador
Problem 1: one intersection per potential completion.
Problem 2: merging of the non-empty intersections.
HYB — Details
HYB has a block for each word range. One block of HYB:
    document ids:       1  3  5  6  7  8  9  11  12  13  15
    words:              D  A  C  B  ...
    gaps:               +1 +2 +0 ...
    ranks by frequency: 3rd 1st 2nd 4th ...
Universal encoding: small gaps/ranks => short codes.
INV vs. HYB — Query Time

MED BOOKS (44,015 docs, 263,817 words): INV avg 0.03 s, max 0.38 s; HYB avg 0.003 s, max 0.06 s
WIKIPEDIA (2,866,503 docs, 6,700,119 words): INV avg 0.17 s, max 2.27 s; HYB avg 0.05 s, max 0.49 s
TREC .GOV (25,204,013 docs, 25,263,176 words): INV avg 0.58 s, max 16.83 s; HYB avg 0.11 s, max 0.86 s

avg = average time per keystroke; max = maximum time per keystroke (outliers removed)
Start with DEMO: autocomp — sig, sigir, sigir sal, sal
Related Search Engine Features
Complete from a precompiled list of queries: Google Suggest, AllTheWeb Livesearch, ...
Desktop search engines: Apple Spotlight, Copernic Desktop Search
So apparently we do something which is not so easy. How do we do it efficiently, and such that it scales to large and very large collections?