The CompleteSearch Engine: Interactive, Efficient, and Towards IR & DB Integration
Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany
Joint work with Alexandru Chitea, Deb Majumdar, Christian Mortensen, Fabian Suchanek, Markus Tetzlaff, Thomas Warken, Ingmar Weber, …
Talk at the University of Trier, February 13, 2007
IR versus DB (simplified view)
IR system (search engine): scales very well, but special-purpose
–single data structure and query algorithm, optimized for ranked retrieval on textual data
–highly compressible and high locality of access
–ranking is an integral part
–can't do even simple selects, joins, etc.
DB system (relational): general-purpose, but slow on large data
–variety of indices and query algorithms, to suit all sorts of complex queries on structured data
–space overhead and limited locality of access
–no integrated ranked retrieval
–can do complex selects, joins, … (SQL)
Our work (in a nutshell)
The CompleteSearch engine: fairly general-purpose and scales very well
–novel data structure and query algorithm for context-sensitive prefix search and completion
–highly compressible and high locality of access
–IR-style ranked retrieval
–DB-style selects and joins
–a natural blend of the two
–subsecond query times for up to a terabyte on a single machine
–no transactions, recovery, etc.; for low dynamics (few insertions/deletions)
–other open issues at the end of the talk …
Context-Sensitive Autocompletion
Complete to words that would lead to a hit
–saves typing, avoids overspecification of the query, shows which formulations are actually used, helps with error correction, etc.
Complete to phrases
–for the phrase uni trier, add the word uni_trier to the index
Complete to subwords
–for the compound word eigenproblem, add the word problem to the index
Complete to arbitrary substrings
–there are standard techniques, but usually not worth it (in text search)
(A sketch of this index augmentation follows below.)
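The phrase and subword completions above only require adding artificial words at index-build time. Below is a minimal sketch of that idea; the Index class, addPosting, and the phrase/decompounding tables are hypothetical illustrations, not the actual CompleteSearch code.

```cpp
// Sketch: augmenting the index with phrase and subword completions.
// All names here (Index, addPosting, knownPhrases, compoundParts) are
// illustrative, not taken from the CompleteSearch source.
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Index {
  // word -> list of (docId, position) postings
  std::map<std::string, std::vector<std::pair<int, int>>> postings;
  void addPosting(const std::string& word, int docId, int pos) {
    postings[word].push_back({docId, pos});
  }
};

void indexToken(Index& index, const std::string& word,
                const std::string& nextWord, int docId, int pos) {
  index.addPosting(word, docId, pos);

  // Phrase completion: for "uni trier" also add the artificial word "uni_trier".
  static const std::map<std::string, std::string> knownPhrases = {
      {"uni", "trier"}};  // assumed phrase list
  auto it = knownPhrases.find(word);
  if (it != knownPhrases.end() && it->second == nextWord)
    index.addPosting(word + "_" + nextWord, docId, pos);

  // Subword completion: for the compound "eigenproblem" also add "problem".
  static const std::map<std::string, std::string> compoundParts = {
      {"eigenproblem", "problem"}};  // assumed decompounding table
  auto jt = compoundParts.find(word);
  if (jt != compoundParts.end())
    index.addPosting(jt->second, docId, pos);
}
```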
Semantic Completion
Complete to instances of categories
–for the author Henning Fernau, add henning:fernau::author and fernau::henning:author
Complete to names of categories
–for the author Henning Fernau, add author:henning_fernau
Refine the search result by category (faceted search)
–add ct:conference:stacs
–add ct:author:henning_fernau
–add ct:year:2005
–proactively launch a query with ct: appended
DB-style joins
Find authors who have published at both SIGIR and SIGMOD
–must collect information from several documents
–no way to do this with standard keyword search
–with our context-sensitive prefix completion, we can launch
  conference:sigir author:*   and   conference:sigmod author:*
–and intersect the lists of completions (not documents); see the sketch below
In this way, any kind of join can be realized
–note that adding conference:stacs author:henning_fernau year:2005 etc. effectively creates a table with the schema (conference, author, year, publication):
  Henning Fernau     STACS   2005   paper #23876
  Jianer Chen        STACS   2005   paper #23876
  Henning Fernau     ICALP   2001   paper #31457
  Rolf Niedermeier   ICALP   2001   paper #31457
  …                  …       …      …
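A minimal sketch of the join step, assuming the engine has already answered the two queries conference:sigir author:* and conference:sigmod author:* and returned sorted completion lists; the function and parameter names below are illustrative.

```cpp
// Sketch: realizing the join "authors who published at both SIGIR and SIGMOD"
// by intersecting two completion lists (not document lists). The lists would
// come from the two queries
//   conference:sigir author:*   and   conference:sigmod author:*
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

// completionsA, completionsB: sorted completion lists of "author:*"
// returned for the two conference queries.
std::vector<std::string> joinByCompletions(
    const std::vector<std::string>& completionsA,
    const std::vector<std::string>& completionsB) {
  std::vector<std::string> both;
  std::set_intersection(completionsA.begin(), completionsA.end(),
                        completionsB.begin(), completionsB.end(),
                        std::back_inserter(both));
  return both;  // authors appearing in both completion lists
}
```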
Incorporating Ontologies (ongoing work)
Consider an entity like John Lennon, who we know was a
–singer, songwriter, person, human being, organism, guitarist, pacifist, vegetarian, entertainer, musician, …
We cannot add all these annotations to every occurrence of John Lennon
–the index size would explode
–better to keep the annotations separately
But we can
–add entity:john_lennon for every occurrence
–in a special document about him, add entity:john_lennon along with class:songwriter, class:musician, class:person, …
And then intersect the completions of, for example,
–beatles entity:   and   class:musician entity:
Related Engines: suggests whole queries from a precompiled list.
Related Engines: similar to Google Suggest, but proactively snaps to one query and shows its result.
Context-Sensitive Prefix Search
[figure: a collection of documents D1, D2, D3, D4, D9, D13, D17, D27, …, each shown with the word ids it contains]
Data is given as
–documents containing words
–documents have ids (D1, D2, …)
–words have ids (A, B, C, …)
Query
–given a sorted list of doc ids (e.g., D13, D17, D88, …)
–and a range of word ids (e.g., C–H)
Answer
–all matching word-in-doc pairs, e.g., (D13, E), (D88, E), (D88, G), …
–with scores, e.g., 0.5, 0.2, 0.7, …
Solution via an Inverted Index (INV)
For example, db*
–given the sorted list of all document ids
–given the range of word ids matching db*
Iterate over all words from W:
  Word 781 (dbms)    Doc. 16, Doc. 53, Doc. 591, …
  Word 782 (db2)     Doc. 3, Doc. 66, Doc. 765, …
  Word 783 (dbase)   Doc. 25, Doc. 98, Doc. 221, …
  Word 784 (dbis)    Doc. 67, Doc. 189, Doc. 221, …
  Word 785 (dblp)    Doc. 16, Doc. 110, Doc. 141, …
Have to merge the lists:
  Doc. 3,    Doc. 16,   Doc. 16,   Doc. 25,   …
  Word 782,  Word 781,  Word 785,  Word 783,  …
query time = output size ∙ log(size of W)
Solution via an Inverted Index (INV)
For example, db* uni*
–given the doc id list: Doc. 3, Doc. 16, Doc. 18, Doc. 25, … (hits for db*)
–given the range of word ids matching uni*
Iterate over all words from W:
  Word 578 (uniform)     Doc. 8, Doc. 23, Doc. 291, …
  Word 579 (unit)        Doc. 24, Doc. 36, Doc. 165, …
  Word 580 (uni trier)   Doc. 3, Doc. 18, Doc. 66, …
  Word 581 (unique)      Doc. 56, Doc. 129, Doc. 251, …
  Word 582 (university)  Doc. 18, Doc. 21, Doc. 25, …
Intersect each list with D, then merge (see the sketch below):
  Doc. 3,    Doc. 18,   Doc. 18,   Doc. 25,   …
  Word 580,  Word 580,  Word 582,  Word 582,  …
query time = size of D ∙ size of W + merging
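A sketch of the INV baseline just described, assuming each word of the prefix range comes with its sorted inverted list; the intersect-then-merge structure is exactly where the size-of-D ∙ size-of-W plus merging cost comes from. Names are illustrative.

```cpp
// Sketch of the INV baseline: for each word in the prefix range W (e.g. all
// words matching uni*), intersect its inverted list with the given doc list D,
// then merge the surviving (doc, word) pairs by doc id.
#include <algorithm>
#include <iterator>
#include <utility>
#include <vector>

struct Match { int docId; int wordId; };

std::vector<Match> invPrefixQuery(
    const std::vector<int>& D,  // sorted doc ids of the hits so far
    const std::vector<std::pair<int, std::vector<int>>>& W) {
  // W: one (wordId, sorted inverted list) entry per word in the prefix range
  std::vector<Match> result;
  for (const auto& [wordId, docList] : W) {
    std::vector<int> common;
    std::set_intersection(docList.begin(), docList.end(),
                          D.begin(), D.end(), std::back_inserter(common));
    for (int d : common) result.push_back({d, wordId});
  }
  // Each per-word list is sorted by doc id, but the union still has to be
  // merged/sorted -- the overhead that the HYB index avoids.
  std::sort(result.begin(), result.end(),
            [](const Match& a, const Match& b) { return a.docId < b.docId; });
  return result;
}
```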
The Inverted Index (INV) — Problems
Asymptotic time complexity is bad (for our problem)
–with INV we either have to merge/sort a lot
–or intersect the same list over and over again
Still a tough baseline to beat in practice
–highly compressible: half the space on disk means half the time to read it
–INV has very good locality of access: the ratio of random access time to sequential access time is 50,000 for disk, and still up to 100 for main memory
–simple code: instruction cache, branch prediction, etc.
A Tree-Based Index (AutoTree)
Output-sensitive behaviour
–query time = size of the result list
–anytime algorithm: produces a result element in every step
Beats the inverted index by a factor of 5
–but only in main memory
–heavy use of bit-rank data structures (to compute the number of set bits before a given position in constant time); a simplified rank sketch follows below
SPIRE 2006
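For illustration only, here is a simplified two-level rank structure (one precomputed count per machine word plus a popcount); it shows the constant-time "number of set bits before a position" operation in spirit, but it is not the space-efficient layout used in AutoTree.

```cpp
// Sketch of a rank data structure: rank(i) = number of 1-bits in positions
// [0, i). Simplified: one precomputed prefix count per 64-bit word plus
// std::popcount (C++20); the real structures use far less extra space.
#include <bit>
#include <cstdint>
#include <vector>

struct RankBitvector {
  std::vector<uint64_t> bits;
  std::vector<uint64_t> prefixOnes;  // number of 1-bits before each 64-bit word

  explicit RankBitvector(const std::vector<uint64_t>& b) : bits(b) {
    prefixOnes.resize(bits.size() + 1, 0);
    for (size_t w = 0; w < bits.size(); ++w)
      prefixOnes[w + 1] = prefixOnes[w] + std::popcount(bits[w]);
  }

  uint64_t rank(uint64_t i) const {  // O(1)
    uint64_t word = i / 64, offset = i % 64;
    // count the 1-bits in positions [0, offset) of the current word
    uint64_t inWord =
        offset ? std::popcount(bits[word] << (64 - offset)) : 0;
    return prefixOnes[word] + inWord;
  }
};
```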
A Hybrid Index (HYB)
HYB has a block for each word range. Conceptually, a block lists the (doc id, word id) pairs of all occurrences of words from that range, sorted by doc id:
  doc ids:   1   3   3   5   5   6   7   8   8   9   11  11  11  12  13  15
  word ids:  D   A   C   A   B   A   C   A   D   A   A   B   C   A   C   A
Replace doc ids by gaps and words by frequency ranks:
  gaps:   +1  +2  +0  +2  +0  +1  +1  +1  +0  +1  +2  +0  +0  +1  +1  +2
  ranks:  3rd 1st 2nd 1st 4th 1st 2nd 1st 3rd 1st 1st 4th 2nd 1st 2nd 1st
Encode both gaps and ranks with a prefix-free code, so that a symbol occurring with relative frequency x gets about log₂(1/x) bits:
  gaps:   +0 → 0,  +1 → 10,  +2 → 110
  ranks:  1st (A) → 0,  2nd (C) → 10,  3rd (D) → 111,  4th (B) → 110
An actual block of HYB is then the encoded gaps followed by the encoded ranks:
  10 110 0 110 0 10 10 10 0 10 110 0 0 10 10 110 111 0 10 0 110 0 10 0 111 0 0 110 10 0 10 0
(A sketch of this block transformation follows below.)
SIGIR 2006
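A sketch of the block transformation, assuming a simple prefix-free code (0 → "0", 1 → "10", 2 → "110", …) in place of the frequency-tuned codes of the real HYB index; all names are illustrative.

```cpp
// Sketch of the HYB block transformation: doc ids -> gaps, word ids ->
// frequency ranks, both written with a simple prefix-free code. The real
// HYB codes are chosen so that a symbol with relative frequency x costs
// roughly log2(1/x) bits.
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

std::string prefixCode(int v) {          // 0 -> "0", 1 -> "10", 2 -> "110", ...
  return std::string(v, '1') + "0";
}

std::string encodeBlock(const std::vector<std::pair<int, int>>& postings) {
  // postings: the (docId, wordId) pairs of one block, sorted by doc id.
  // 1. Frequency ranks of the word ids (most frequent word gets rank 0).
  std::map<int, int> freq;
  for (const auto& p : postings) ++freq[p.second];
  std::vector<std::pair<int, int>> byFreq(freq.begin(), freq.end());
  std::sort(byFreq.begin(), byFreq.end(),
            [](const auto& a, const auto& b) { return a.second > b.second; });
  std::map<int, int> rank;
  for (size_t r = 0; r < byFreq.size(); ++r) rank[byFreq[r].first] = int(r);

  // 2. Doc ids as gaps, word ids as ranks; gap codes first, then rank codes,
  //    as in the example block above.
  std::string gapCodes, rankCodes;
  int prevDoc = 0;
  for (const auto& [docId, wordId] : postings) {
    gapCodes += prefixCode(docId - prevDoc);
    rankCodes += prefixCode(rank[wordId]);
    prevDoc = docId;
  }
  return gapCodes + rankCodes;
}
```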
INV vs. HYB — Space Consumption
Theorem: the empirical entropy of INV is Σᵢ nᵢ ∙ ( 1/ln 2 + log₂(n/nᵢ) )
Theorem: the empirical entropy of HYB with block size ε∙n is Σᵢ nᵢ ∙ ( (1+ε)/ln 2 + log₂(n/nᵢ) )
(nᵢ = number of documents containing the i-th word, n = total number of documents)

            HOMEOPATHY        WIKIPEDIA          TREC.GOV
            44,015 docs       2,866,503 docs     25,204,013 docs
            263,817 words     6,700,119 words    25,263,176 words
            with positions    with positions     no positions
  raw size  452 MB            7.4 GB             426 GB
  INV       13 MB             0.48 GB            4.6 GB
  HYB       14 MB             0.51 GB            4.9 GB

Nice match of theory and practice.
INV vs. HYB — Query Time
Experiment: type ordinary queries from left to right, e.g. db, dbl, dblp, dblp un, dblp uni, dblp univ, dblp unive, …

            HOMEOPATHY            WIKIPEDIA              TREC.GOV
            44,015 docs           2,866,503 docs         25,204,013 docs
            263,817 words         6,700,119 words        25,263,176 words
            5,732 real queries    100 random queries     50 TREC queries
            with proximity        with proximity         no proximity
  INV       avg: 0.03 secs        avg: 0.17 secs         avg: 0.58 secs
            max: 0.38 secs        max: 2.27 secs         max: 16.83 secs
  HYB       avg: 0.003 secs       avg: 0.05 secs         avg: 0.11 secs
            max: 0.06 secs        max: 0.49 secs         max: 0.86 secs

HYB beats INV by an order of magnitude.
Engineering
Careful implementation in C++
–Experiment: sum over an array of 10 million 4-byte integers (on a Linux PC with approx. 2 GB/sec memory bandwidth); a sketch of this micro-benchmark follows below:
    C++: 1800 MB/sec    Java: 300 MB/sec    MySQL: 16 MB/sec    Perl: 2 MB/sec
With HYB, every query is essentially one block scan
–perfect locality of access, no sorting or merging, etc.
–balanced ratio of read, decompression, processing, etc.:
    read: 21%    decompress: 18%    intersect: 11%    rank: 15%    history: 35%
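A sketch of the micro-benchmark behind the table: sum an array of 10 million 4-byte integers and report the effective scan bandwidth; the absolute numbers obviously depend on the machine.

```cpp
// Sketch of the bandwidth experiment: sum 10 million 4-byte integers and
// report MB/sec. The figures in the table (1800 MB/sec for C++, etc.) were
// measured on a machine with roughly 2 GB/sec memory bandwidth.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  const std::size_t n = 10'000'000;
  std::vector<std::int32_t> a(n, 1);

  auto start = std::chrono::steady_clock::now();
  std::int64_t sum = 0;
  for (std::size_t i = 0; i < n; ++i) sum += a[i];
  auto stop = std::chrono::steady_clock::now();

  double secs = std::chrono::duration<double>(stop - start).count();
  double mbPerSec = double(n * sizeof(std::int32_t)) / (1024.0 * 1024.0) / secs;
  // Printing the sum keeps the compiler from optimizing the loop away.
  std::printf("sum = %lld, scan speed = %.0f MB/sec\n",
              static_cast<long long>(sum), mbPerSec);
  return 0;
}
```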
System Design — High Level View
  User Client (JavaScript)  ↔  Web Server (PHP)  ↔  Compute Server (C++)
Debugging such an application is hell!
Conclusions
Summary
–central mechanism for context-sensitive range search
–very efficient in space and time, scales very well
–combines IR-style ranked retrieval with DB-style selects and joins
–support for interactive / semantic / faceted / ontology search
On our TODO list
–achieve both output-sensitivity and locality of access
–integrate top-k query processing
–find out which SQL queries can be supported efficiently
–deal with high dynamics (many insertions/deletions)
Thank you!
Basic Problem Definition
Definition: context-sensitive prefix search and completion
Given a query consisting of
–a sorted list D of doc ids, e.g. Doc15, Doc183, Doc185, Doc17351, …
–a range W of word ids, e.g. Word1893 – Word7329
Compute as a result
–all pairs (w, d) with w ∈ W and d ∈ D, sorted by doc id, e.g. (Word7014, Doc15), (Word5112, Doc15), (Word2011, Doc17351), …
Refinements
–positions, e.g. Pos12, Pos73, Pos44, …
–scores, e.g. 0.7, 0.3, 0.5, …
(A naive reference implementation of this query follows below.)
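A naive reference formulation of the query semantics, for clarity only; this is not the efficient HYB algorithm, and the Posting type and parameter names are illustrative.

```cpp
// Naive reference semantics of the basic query: given a sorted doc-id list D
// and a word-id range W, report every (word, doc) occurrence with d in D and
// w in W, keeping scores, sorted by doc id.
#include <algorithm>
#include <vector>

struct Posting { int docId; int wordId; float score; };

std::vector<Posting> basicQuery(
    const std::vector<Posting>& occurrences,  // all occurrences, sorted by docId
    const std::vector<int>& D,                // sorted doc ids
    int wordLo, int wordHi) {                 // word-id range W
  std::vector<Posting> result;
  for (const Posting& p : occurrences) {
    bool inD = std::binary_search(D.begin(), D.end(), p.docId);
    bool inW = (p.wordId >= wordLo && p.wordId <= wordHi);
    if (inD && inW) result.push_back(p);
  }
  return result;  // already sorted by doc id, since the occurrence list is
}
```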
Basic Problem Definition
For example, dblp uni
–set D = document ids from the result for dblp
–range W = word ids of all words starting with uni
→ a multi-dimensional query is processed as a sequence of 1½-dimensional queries
For example, intersect the completions of the results for conf:sigir author: and conf:sigmod author:
  [figure: two result lists, docs D11, D25, D57, D91 with completions W25, W23, W24, and docs D23, D54, D56, D58, D69 with completions W23, W27]
→ efficient, because the completions come from a small range
Conclusions Context-sensitive prefix search and completion –is a fundamental operation supports autocompletion search, semantic search, faceted search, DB-style selects and joins, ontology search, … –efficient support via HYB index very good compression properties perfect locality of access Some open issues –integrate top-k query processing –what else can we do with it? –very short prefixes