Type Less, Find More: Fast Autocompletion Search with a Succinct Index Holger Bast Max-Planck-Institut für Informatik Saarbrücken, Germany joint work with.


1 Type Less, Find More: Fast Autocompletion Search with a Succinct Index
Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany
joint work with Ingmar Weber
at Google in Mountain View, USA, August 14


3 It's useful …
Basic Autocompletion
– saves typing
– no more information than necessary
– find out about formulations used, e.g. googlism, googlearchy
– error correction, e.g. googel

4 It's more useful …
Complete to phrases
– phrase mountain view → add word mountain_view to index
Complete to subwords
– compound word eigenproblem → add word problem to index
Complete to category names
– author Edleno Moura → add moura:edleno::author and edleno::moura:author
Faceted search
– add ct:conference:sigir
– add ct:author:edleno_moura
– add ct:year:2005
all via the same mechanism
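The extra index terms listed above could be generated at indexing time roughly as follows. This is a minimal sketch that only reproduces the string formats shown on the slide; the helper names (`phrase_term`, `author_terms`, `facet_term`) are hypothetical and the meaning of the `:` and `::` separators is not specified here.

```python
def phrase_term(phrase):
    """Join the words of a phrase with underscores, e.g. for phrase completion."""
    return "_".join(phrase.lower().split())


def author_terms(first, last):
    """Two artificial words so an author is found by typing either name part.

    The separator layout is copied from the slide, not derived from a spec.
    """
    first, last = first.lower(), last.lower()
    return [f"{last}:{first}::author", f"{first}::{last}:author"]


def facet_term(category, value):
    """Artificial word for faceted search, in the ct:<category>:<value> shape."""
    return f"ct:{category}:{phrase_term(value)}"


extra_terms = (
    [phrase_term("mountain view")]
    + author_terms("Edleno", "Moura")
    + [
        facet_term("conference", "SIGIR"),
        facet_term("author", "Edleno Moura"),
        facet_term("year", "2005"),
    ]
)
print(extra_terms)
```

All of these are ordinary words as far as the index is concerned, which is why a single prefix-completion mechanism can serve every case.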

5 Related Engines


7 Basic Problem Definition
Query
– a set D of documents (= hits for the first part of the query)
– a range W of words (= potential completions of last word)
Answer
– all documents D' from D containing a word from W
– all words W' from W contained in a document from D
Extensions (see paper at SIGIR'06)
– ranking (best hits from D' and best completions from W')
– positional information (proximity queries)
First try: inverted index (INV)
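To make the definition concrete, here is a brute-force reference sketch. It is not an index, only a direct restatement of the query and answer above; the function name `autocomplete`, the toy corpus, and the representation of the range W as a half-open interval `[lo, hi)` are assumptions for illustration.

```python
def autocomplete(docs, D, lo, hi):
    """Brute-force reference for the problem definition.

    docs:   dict mapping doc id -> set of words (hypothetical toy corpus)
    D:      set of doc ids (hits for the first part of the query)
    lo, hi: the word range W, i.e. all words w with lo <= w < hi
    Returns (D', W') as sorted lists.
    """
    d_prime, w_prime = set(), set()
    for doc_id in D:
        matches = {w for w in docs[doc_id] if lo <= w < hi}
        if matches:
            d_prime.add(doc_id)
            w_prime |= matches
    return sorted(d_prime), sorted(w_prime)


docs = {
    3: {"google", "mountain"},
    16: {"googlism"},
    18: {"google", "mountain", "moura"},
    25: {"googles", "moura"},
    66: {"mountain"},
}
# Query "goog* mou*": D holds the hits for goog*, W is the range [mou, mov).
result = autocomplete(docs, {3, 16, 18, 25}, "mou", "mov")
print(result)
```

Any real index for this problem should return the same D' and W' as this reference, only faster.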

8 Processing 1-word queries with INV
For example, goog*
D: all documents
W: all words matching goog*
Iterate over all words from W
  google       Doc. 18, Doc. 53, Doc. 591, ...
  googlearchy  Doc. 3, Doc. 66, Doc. 765, ...
  googles      Doc. 25, Doc. 98, Doc. 221, ...
  googling     Doc. 67, Doc. 189, Doc. 221, ...
  googlism     Doc. 16, Doc. 110, Doc. 141, ...
Merge the document lists (expensive!)
  D': Doc. 3, Doc. 16, Doc. 18, Doc. 25, …
Output all words from the range as completions (trivial for 1-word queries)
  W': google, googlearchy, googles, …
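The 1-word case can be sketched with a sorted vocabulary and one sorted doc-id list per word. This is an illustrative sketch, not the system's code: the names `prefix_range` and `inv_one_word` are hypothetical, the words goofy and gopher are added only to show the range boundaries, and the upper bound `prefix + "\uffff"` assumes words contain only characters below U+FFFF.

```python
import bisect
import heapq


def prefix_range(vocab, prefix):
    """Return (start, end) so that vocab[start:end] are the words with this prefix."""
    start = bisect.bisect_left(vocab, prefix)
    # Assumption: no word contains a character >= U+FFFF.
    end = bisect.bisect_left(vocab, prefix + "\uffff")
    return start, end


def inv_one_word(index, vocab, prefix):
    """Merge the doc lists of all words in the prefix range; return (D', W')."""
    start, end = prefix_range(vocab, prefix)
    words = vocab[start:end]
    d_prime = []
    # heapq.merge performs a k-way merge of the already sorted lists.
    for doc_id in heapq.merge(*(index[w] for w in words)):
        if not d_prime or d_prime[-1] != doc_id:  # drop duplicates
            d_prime.append(doc_id)
    return d_prime, words


index = {
    "goofy": [7],
    "google": [18, 53, 591],
    "googlearchy": [3, 66, 765],
    "googles": [25, 98, 221],
    "googling": [67, 189, 221],
    "googlism": [16, 110, 141],
    "gopher": [9],
}
vocab = sorted(index)
d_prime, w_prime = inv_one_word(index, vocab, "goog")
print(d_prime[:4], w_prime)
```

Finding W' is just a binary search in the vocabulary, while D' requires touching every posting of every matching word, which is where the cost lies.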

9 Processing multi-word queries with INV
For example, goog* mou*
D: Doc. 3, Doc. 16, Doc. 18, Doc. 25, … (hits for goog*)
W: all words matching mou*
Iterate over all words from W
  mould     Doc. 8, Doc. 23, Doc. 291, ...
  mount     Doc. 24, Doc. 36, Doc. 165, ...
  mountain  Doc. 3, Doc. 18, Doc. 66, ...
  mounting  Doc. 56, Doc. 129, Doc. 251, ...
  moura     Doc. 18, Doc. 21, Doc. 25, ...
Intersect each list with D, then merge
  D': Doc. 3, Doc. 18, Doc. 25, …
Output all words with non-empty intersection
  W': mountain, moura
Most intersections are empty, but INV has to compute them all!
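A small sketch of this procedure, using the lists from the slide (truncated to the doc ids shown). The function names are hypothetical, and the intersection counter is added only to make the wasted work visible.

```python
def intersect_sorted(a, b):
    """Intersect two sorted doc-id lists with a linear two-pointer scan."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def inv_multi_word(index, candidates, D):
    """Intersect every candidate's list with D; return (D', W', #intersections)."""
    hits, w_prime, intersections = set(), [], 0
    for word in candidates:
        common = intersect_sorted(index[word], D)
        intersections += 1  # computed even when the result turns out empty
        if common:
            w_prime.append(word)
            hits.update(common)
    return sorted(hits), w_prime, intersections


index = {
    "mould": [8, 23, 291],
    "mount": [24, 36, 165],
    "mountain": [3, 18, 66],
    "mounting": [56, 129, 251],
    "moura": [18, 21, 25],
}
D = [3, 16, 18, 25]  # hits for goog*
d_prime, w_prime, n_intersections = inv_multi_word(index, sorted(index), D)
print(d_prime, w_prime, n_intersections)
```

In this toy run five intersections are computed but only two are non-empty, which mirrors the complaint on the slide.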

10 INV: Problems
Asymptotic time complexity is bad (for our problem)
– many intersections (one per potential completion)
– has to merge/sort (the non-empty intersections)
Still hard to beat INV in practice
– highly compressible
  half the space on disk means half the time to read it
– INV has very good locality of access
  the ratio random access time / sequential access time is 50,000 for disk, and still 100 for main memory
– simple code
  instruction cache, branch prediction, etc.

11 A Hybrid Index (HYB)
Basic Idea: have lists for ranges of words
  mould – moura: Doc. 3, Doc. 16, Doc. 18, Doc. 25, ...
Problem: not enough to show completions
Solution: store the word(s) along with each doc id
  mould – moura: Doc. 3, Doc. 16, Doc. 18, Doc. 25, ...
  words stored with the doc ids: mould moura mount mould mountain mounting moura
But this looks very wasteful
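With the words stored next to the doc ids, a query against one block becomes a single sequential scan. The sketch below assumes a block is a list of (doc id, word) pairs sorted by doc id; the function name `hyb_query` and the block contents are hypothetical, chosen to be consistent with the toy lists used for INV.

```python
def hyb_query(block, D, lo, hi):
    """Scan one HYB block once and return (D', W').

    block:  list of (doc_id, word) pairs sorted by doc_id (hypothetical layout)
    D:      sorted list of doc ids (hits for the first part of the query)
    lo, hi: word range W as a half-open interval [lo, hi)
    """
    d_prime, w_prime = [], set()
    i = 0
    for doc_id, word in block:
        while i < len(D) and D[i] < doc_id:  # advance D in lockstep with the block
            i += 1
        if i == len(D):
            break  # no more candidate documents
        if D[i] == doc_id and lo <= word < hi:
            if not d_prime or d_prime[-1] != doc_id:
                d_prime.append(doc_id)
            w_prime.add(word)
    return d_prime, sorted(w_prime)


block = [
    (3, "mountain"), (8, "mould"), (18, "mountain"), (18, "moura"),
    (21, "moura"), (23, "mould"), (24, "mount"), (25, "moura"),
    (36, "mount"), (56, "mounting"), (66, "mountain"),
]
d_prime, w_prime = hyb_query(block, [3, 16, 18, 25], "mou", "mov")
print(d_prime, w_prime)
```

Compared with INV, there is one pass over one contiguous list instead of one intersection per candidate word, and the output is already sorted by doc id.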

12 HYB: Details
HYB has a block for each word range, conceptually:
  doc ids: 1 3 3 5 5 6 7 8 8 9 11 11 11 12 13 15
  words:   D A C A B A C A D A A B C A C A
Replace doc ids by gaps and words by frequency ranks:
  gaps:    +1 +2 +0 +2 +0 +1 +1 +1 +0 +1 +2 +0 +0 +1 +1 +2
  ranks:   3rd 1st 2nd 1st 4th 1st 2nd 1st 3rd 1st 1st 4th 2nd 1st 2nd 1st
Encode both gaps and ranks such that x → log2 x bits:
  +0 → 0, +1 → 10, +2 → 110
  1st (A) → 0, 2nd (C) → 10, 3rd (D) → 111, 4th (B) → 110
An actual block of HYB:
  10 110 0 110 0 10 10 10 0 10 110 0 0 10 10 110
  111 0 10 0 110 0 10 0 111 0 0 110 10 0 10 0
How well does it compress? Which block size?
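The example block can be re-derived in a few lines. This sketch takes the two code tables verbatim from the slide rather than computing an optimal code; the doc ids are the ones implied by the listed gaps, and the function name `encode_block` is hypothetical.

```python
from collections import Counter

# Code tables copied from the slide (not derived here).
GAP_CODE = {0: "0", 1: "10", 2: "110"}
RANK_CODE = {1: "0", 2: "10", 3: "111", 4: "110"}


def encode_block(doc_ids, words):
    """Turn doc ids into gaps and words into frequency ranks, then encode both."""
    gaps = [cur - prev for prev, cur in zip([0] + doc_ids, doc_ids)]
    counts = Counter(words)
    # Most frequent word gets rank 1; ties keep first-occurrence order (stable sort).
    by_freq = sorted(counts, key=lambda w: -counts[w])
    rank = {w: r for r, w in enumerate(by_freq, start=1)}
    ranks = [rank[w] for w in words]
    gap_bits = " ".join(GAP_CODE[g] for g in gaps)
    rank_bits = " ".join(RANK_CODE[r] for r in ranks)
    return gaps, ranks, gap_bits, rank_bits


doc_ids = [1, 3, 3, 5, 5, 6, 7, 8, 8, 9, 11, 11, 11, 12, 13, 15]
words = list("DACABACADAABCACA")
gaps, ranks, gap_bits, rank_bits = encode_block(doc_ids, words)
print(gap_bits)
print(rank_bits)
```

Small gaps and frequent words get the shortest codes, which is why storing a word with every doc id costs far less than it first appears.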

13 INV vs. HYB: Space Consumption
Theorem: The empirical entropy of INV is $\sum_i n_i \cdot \left( 1/\ln 2 + \log_2(n/n_i) \right)$
Theorem: The empirical entropy of HYB with block size $\varepsilon \cdot n$ is $\sum_i n_i \cdot \left( (1+\varepsilon)/\ln 2 + \log_2(n/n_i) \right)$
$n_i$ = number of documents containing the i-th word, $n$ = number of documents

Collection  docs        words       positions       raw size  INV      HYB
MEDICINE    44,015      263,817     with positions  452 MB    13 MB    14 MB
WIKIPEDIA   2,866,503   6,700,119   with positions  7.4 GB    0.48 GB  0.51 GB
TREC.GOV    25,204,013  25,263,176  no positions    426 GB    4.6 GB   4.9 GB

Nice match of theory and practice
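The two entropy formulas differ only in the constant term, so the overhead of HYB is ε/ln 2 bits per posting. The sketch below evaluates both sums directly; the function names and the document frequencies are made up for illustration and are not taken from the collections in the table.

```python
import math


def inv_entropy_bits(doc_freqs, n):
    """Sum over words of n_i * (1/ln 2 + log2(n / n_i))."""
    return sum(ni * (1 / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)


def hyb_entropy_bits(doc_freqs, n, eps):
    """Same sum with (1 + eps)/ln 2 in place of 1/ln 2 (block size eps * n)."""
    return sum(ni * ((1 + eps) / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)


# Hypothetical document frequencies n_i for a collection of n = 1000 documents.
doc_freqs = [500, 120, 30, 5]
n, eps = 1000, 0.1
inv_bits = inv_entropy_bits(doc_freqs, n)
hyb_bits = hyb_entropy_bits(doc_freqs, n, eps)
overhead = hyb_bits - inv_bits
print(inv_bits, hyb_bits, overhead)
```

For small ε the extra cost is a small fraction of the total, in line with the few-percent gap between the INV and HYB columns of the table.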

14 INV vs. HYB: Query Time
Experiment: type ordinary queries from left to right (go, goo, goog, googl, google, google mo, google mou, ...)

Collection  docs        words       queries                             INV avg / max            HYB avg / max
MEDICINE    44,015      263,817     5,732 real queries, with proximity  0.03 secs / 0.38 secs    0.003 secs / 0.06 secs
WIKIPEDIA   2,866,503   6,700,119   100 random queries, with proximity  0.17 secs / 2.27 secs    0.05 secs / 0.49 secs
TREC.GOV    25,204,013  25,263,176  50 TREC queries, no proximity       0.58 secs / 16.83 secs   0.11 secs / 0.86 secs

HYB better by an order of magnitude
Theoretical analysis → see paper at SIGIR'06
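The left-to-right typing experiment can be simulated by expanding each full query into the sequence of partial queries a user would send. This is a sketch under an assumption: the minimum prefix length of 2 is inferred from the example starting at "go" and "google mo", and the function name `typed_prefixes` is hypothetical.

```python
def typed_prefixes(query, min_len=2):
    """Expand a query into the partial queries seen while typing it left to right.

    min_len is an assumed minimum number of typed characters before a word
    triggers a completion query; shorter words produce no step of their own.
    """
    words = query.split()
    steps = []
    for k, word in enumerate(words):
        head = " ".join(words[:k])  # words already typed in full
        for end in range(min_len, len(word) + 1):
            steps.append((head + " " + word[:end]).strip())
    return steps


steps = typed_prefixes("google mou")
print(steps)
```

Each step is an autocompletion query whose first part (the fully typed words) determines D and whose last, partial word determines the range W.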

15 System Design: High Level View
Compute Server (C++)
Web Server (PHP)
User Client (JavaScript)
Debugging such an application is hell!

16 Summary of Results
Properties of HYB
– highly compressible (just like INV)
– fast prefix-completion queries (perfect locality of access)
– fast indexing (no full inversion necessary)
Autocompletion and more
– phrase and subword completion, semantic completion, XML support, …
– faceted search (Workshop Talk on Thursday)
– efficient DB joins: author[sigir sigmod] (NEW)
all with one and the same (efficient) mechanism
