Efficient Search in Very Large Text Collections, Databases, and Ontologies Holger Bast Max-Planck-Institut für Informatik Saarbrücken, Germany DFG Priority.

Presentation transcript:

Efficient Search in Very Large Text Collections, Databases, and Ontologies Holger Bast Max-Planck-Institut für Informatik Saarbrücken, Germany DFG Priority Programme “Algorithm Engineering” Kickoff Meeting in Karlsruhe, December 2 – 3, 2007

General theme of this project
Search engines
–large variety of challenging algorithmic problems with high practical relevance
–algorithm engineering is absolutely essential
Focus on scalability
–terabytes of data, hundreds of millions of documents
–query times in a fraction of a second
Focus on advanced queries
–beyond Google-style keyword search
–but still as efficient in time and space
Fancy Searches, yet Fast
–efficiency is often a secondary issue in DB, AI, CL, or ML research

Problems encountered in this project
Indexing: fast queries, succinct index, fast construction
–Index structures for advanced queries (beyond keyword search)
–How to build them fast
Learning from text: scalable, yet effective
–large-scale spelling correction (algorythm → algorithm)
–large-scale synonymy detection (web ≈ internet)
–large-scale entity annotation (Einstein → the physicist? the physical unit? the musicologist?)
“Basic Toolbox” (for search)
–fast intersection of (sorted) sequences
–efficient (de)compression
–possible synergies with Peter Sanders’ project
I will give a few glimpses in the following

Prefix Completion
Fundamental search problem
–definition on next slide
–many notoriously difficult search problems can be reduced to it
–for example, faceted search: for, say, an article by Peter Sanders that appeared in WEA 2007, add
 author:Peter Sanders   Doc. 17
 venue:WEA   Doc. 17
 year:2007   Doc. 17

Prefix Completion: Problem Definition
Data is given as
–documents containing words
–documents have ids (D1, D2, …)
–words have ids (A, B, C, …)
Query
–given a sorted list of doc ids
–and a range of word ids
Answer
–all matching word-in-doc pairs
–with scores
–and positions
[Figure: the example collection, one box per document with its id and words: D98: E B A S; D78: K L S; D53: J D E A; D2: B F A; D4: K L K A B; D9: E E R; D27: K L D F; D92: P U D E M; D43: D Q; D32: I L S D H; D1: A O E W H; D88: P A E G Q; D3: Q D A; D17: B W U K A; D74: J W Q; D13: A O E W H. Query: doc ids D13, D17, D88, … and word range C D E F G. Answer: (D13, E), (D88, E), (D88, G), …]


Prefix Completion via the Inverted Index
For example, algor* eng*
–given the documents: D13, D17, D88, … (ids of hits for algor*)
–and the word range: C D E F G (ids for eng*)
Iterate over all words from the given range
 C (engage): D8, D23, D291, ...
 D (engel): D24, D36, D165, ...
 E (engine): D13, D24, D88, ...
 F (engines): D56, D129, D251, ...
 G (engineering): D3, D15, D88, ...
Intersect each list with the given one and merge the results
 (D13, E), (D88, E), (D88, G), …
Running time: |D| ∙ |W| + log |W| ∙ merge volume
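
The procedure on this slide can be sketched in Python as follows. This is a minimal illustration, not the code of the actual engine: the names `intersect_sorted` and `prefix_complete_inv`, the dict-based index, and the use of full words instead of word ids are assumptions made for the example. The lists are the ones shown on the slide, truncated at the "...".

```python
import heapq

def intersect_sorted(a, b):
    """Two-pointer intersection of two sorted lists of doc ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def prefix_complete_inv(index, doc_ids, word_range):
    """Intersect the list of every word in the range with doc_ids, then merge by doc id."""
    per_word = []
    for word in word_range:
        common = intersect_sorted(doc_ids, index.get(word, []))
        per_word.append([(doc, word) for doc in common])
    # Each per-word list is sorted by doc id, so a multiway merge suffices.
    return list(heapq.merge(*per_word))

# Inverted lists from the slide (only the ids that are shown there).
index = {
    "engage": [8, 23, 291],
    "engel": [24, 36, 165],
    "engine": [13, 24, 88],
    "engines": [56, 129, 251],
    "engineering": [3, 15, 88],
}
words = ["engage", "engel", "engine", "engines", "engineering"]
hits = prefix_complete_inv(index, [13, 17, 88], words)
print(hits)  # [(13, 'engine'), (88, 'engine'), (88, 'engineering')]
```

Every word in the range costs one pass over the given doc list, which is where the |D| ∙ |W| term comes from.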

Prefix Completion: Status Quo & Problems
The inverted index
–highly compressible
–perfect locality of access (T operations → T / block size IOs)
–but quadratic worst-case complexity
AutoTree [Bast, Weber, Mortensen, SPIRE’06]
–output-sensitive (query time linear in size of output)
–but poor locality of access (heavy use of bit rank operations)
–99% correlation with actual running times
The half-inverted index [Bast, Weber, SIGIR’06]
–highly compressible + perfect locality of access
–query time linear in the number of docs, with small constant
–perfect prediction of time & space consumption
Major open problem: output-sensitive and IO-efficient
Note: time for 100 disk seeks = time for reading 200 MB of compressed data

Error-Tolerant Search
With prefix search available, reduces to the following
–Problem: Given a set of distinct words (lexicon), find all clusters of words that are spelling variants of each other
–Example words: algorithm, algorytm, alogrithm, logaythm, logarithm, mahcine, machine, maschine
Challenges
–find appropriate measure of distance between words
–algorithm that scales in theory as well as in practice
Master thesis of Marjan Celikik (talk on Wednesday)
Possible synergies with Ernst Mayr’s project
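
To make the clustering problem concrete, here is a naive baseline in Python. It is only a sketch under stated assumptions: the slide leaves the distance measure open, so plain Levenshtein distance with a threshold of 2 and single-linkage clustering are choices made here for illustration, and the names `edit_distance` and `cluster_variants` are hypothetical. The all-pairs loop is quadratic in the lexicon size, which is exactly what does not scale.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def cluster_variants(lexicon, max_dist=2):
    """Single-linkage clusters: words are linked if their distance is <= max_dist."""
    words = list(lexicon)
    parent = {w: w for w in words}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, u in enumerate(words):
        for v in words[i + 1:]:  # all pairs: quadratic in the lexicon size
            if edit_distance(u, v) <= max_dist:
                parent[find(u)] = find(v)

    clusters = {}
    for w in words:
        clusters.setdefault(find(w), set()).add(w)
    return list(clusters.values())

lexicon = ["algorithm", "algorytm", "alogrithm", "logaythm",
           "logarithm", "mahcine", "machine", "maschine"]
for cluster in cluster_variants(lexicon):
    print(sorted(cluster))
```

Which words end up together depends heavily on the distance measure and the threshold, which is the first of the two challenges named on the slide.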

Semantic Search: Problems
Problem 1: how to index
–previous engines built on top of DBMS (Data Base Management System), e.g., Oracle
–DBMSs are hard to control (opposite of algorithm engineering)
–ongoing work: reduction to prefix search and join
Problem 2: integrate an ontology
–relate words / phrases in text to entities from ontology
–no time for deep parsing, reasoning etc.
–learn from neighboring words
–numerous algorithmic and engineering problems to make it scale to something like Wikipedia (> 10,000,000,000 words)

Semantic Search: Entity Recognition
Recognize entities by looking at neighboring words
Text: Quantum inequalities Einstein's theory of General Relativity amounts to a description …
–Albert Einstein, the physicist
–is a: physicist, mathematician, vegetarian, person, entity, …
–born in: 1879
Text: Violin Sonata No. 5 …, according to Einstein's Mozart: His Character, His Work.
–Alfred Einstein, the musicologist
–is a: musicologist, scholar, intellectual, person, entity, …
–born in: 1880

Software
Enhance our prototype
–improve source code, documentation, …
–integrate our results into the system
Make available to others
–public demonstrators
–as a platform for experimentation
–as a fancy search engine construction toolkit
Thank you!

General theme of this project
Project title
–Efficient Search in Very Large Text Collections, Databases, and Ontologies
In short
–Fancy searches, yet fast
–advanced search, yet highly scalable
–quality is an issue
–but must not sacrifice performance (as often happens in AI, CL, ML)
General
–“Search engines are a fascinating, multi-faceted field of research giving rise to a multitude of challenging algorithmic problems with a strong algorithm engineering component and of high practical relevance.”

Overview [just for myself not for the talk] An Index for prefix search –inverted index + our + open problem + top-k Building such an index –INV = sorting, HYB = semi-sorting Error-tolerant search –reduce to spelling variants clustering, define problem Semantic Search –point out entity annotation problem

Prefix Search
Show demo
–first explain prefix search
–then how to use it for faceted search
–use DBLP + show dblp.mpi-inf.mpg.de
Explain inverted index
–show for example prefix query
–point out IO-efficiency
–point out compressibility
–but quadratic worst-case complexity

Problems encountered in this project
Indexing: fast queries, succinct index, fast construction
–Index structures for advanced queries (beyond keyword search)
–How to build them fast
Learning from text: scalable, yet effective
–large-scale spelling correction (algorythm → algorithm)
–large-scale synonymy detection (web ≈ internet)
–large-scale entity annotation (Einstein → the physicist? the physical unit? the musicologist?)
Fundamental problems
–fast intersection of (sorted) sequences
–efficient (de)compression
I will explain each of these in detail in the following
just kidding
I will give you a glimpse of some of these in the following
–Example: prefix search
–Demo + problem definition
–Demo

Overview
Part 1
–Definition of our prefix search problem
–Applications
–Demos of our search engine
Part 2
–Problem definition again
–One way to solve it
–Another way to solve it
–Your way to solve it

Part 1 Definition, Applications, Demos

Problem Definition: Formal
Context-Sensitive Prefix Search
Preprocess
–a given collection of text documents such that queries of the following kind can be processed efficiently
Given
–an arbitrary set of documents D
–and a range of words W
Compute
–all word-in-document pairs (w, d) such that w ∈ W and d ∈ D
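
As a reference point, the definition can be written down directly as a brute-force scan in Python. This is only a specification-style sketch, not an efficient solution: the function name `prefix_search_bruteforce` and the dict representation of the collection are assumptions, and the three documents are taken from the visual example of this talk.

```python
def prefix_search_bruteforce(collection, doc_ids, word_range):
    """Return all (w, d) with d in doc_ids, w in d, and lo <= w <= hi."""
    lo, hi = word_range
    wanted = set(doc_ids)
    return sorted(
        (w, d)
        for d, words in collection.items()
        if d in wanted
        for w in set(words)  # report each word once per document
        if lo <= w <= hi
    )

# Three documents from the visual example, keyed by doc id.
collection = {
    13: ["A", "O", "E", "W", "H"],
    17: ["B", "W", "U", "K", "A"],
    88: ["P", "A", "E", "G", "Q"],
}
pairs = prefix_search_bruteforce(collection, [13, 17, 88], ("C", "G"))
print(pairs)  # [('E', 13), ('E', 88), ('G', 88)]
```

The scan touches every word of every document in D, so it serves as a correctness check for the index-based solutions rather than as a competitor.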

Problem Definition: Visual
Data is given as
–documents containing words
–documents have ids (D1, D2, …)
–words have ids (A, B, C, …)
Query
–given a sorted list of doc ids
–and a range of word ids
Answer
–all matching word-in-doc pairs
–with scores
–and positions
[Figure: the example collection, one box per document with its id and words: D98: E B A S; D78: K L S; D53: J D E A; D2: B F A; D4: K L K A B; D9: E E R; D27: K L D F; D92: P U D E M; D43: D Q; D32: I L S D H; D1: A O E W H; D88: P A E G Q; D3: Q D A; D17: B W U K A; D74: J W Q; D13: A O E W H. Query: doc ids D13, D17, D88, … and word range C D E F G. Answer: (D13, E), (D88, E), (D88, G), …]


Application 1: Autocompletion
After each keystroke
–display completions of the last query word that lead to the best hits, together with the best such hits
–e.g., for the query google amp display amphitheatre and the corresponding hits

Application 2: Error Correction
As before, but also …
–… display spelling variants of completions that would lead to a hit
–e.g., for the query probabilistic algorithm also consider a document containing probalistic aigorithm
Implementation
–if, say, aigorithm occurs as a misspelling of algorithm, then for every occurrence of aigorithm in the index
 aigorithm   Doc. 17
 also add
 algorithm::aigorithm   Doc. 17
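
The index augmentation described above can be sketched as follows. The postings format (a list of (word, doc) pairs), the helper names `add_correction_entries` and `prefix_hits`, and the second posting in doc 23 are assumptions for illustration; only the `algorithm::aigorithm` entry format comes from the slide.

```python
def add_correction_entries(postings, corrections):
    """For every posting of a known misspelling, also add 'correct::misspelled'."""
    extra = [(f"{corrections[w]}::{w}", doc) for w, doc in postings if w in corrections]
    return sorted(postings + extra)

def prefix_hits(postings, prefix):
    """All postings whose word starts with the given prefix (naive scan)."""
    return [(w, d) for w, d in postings if w.startswith(prefix)]

postings = [("aigorithm", 17), ("algorithm", 23)]  # doc 23 is a made-up second document
corrections = {"aigorithm": "algorithm"}           # misspelling -> correct form

augmented = add_correction_entries(postings, corrections)
print(prefix_hits(augmented, "algorithm"))
# [('algorithm', 23), ('algorithm::aigorithm', 17)]
```

Because the added entry starts with the correct spelling, an ordinary prefix query for algorithm also reaches the document that only contains the misspelling.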

Application 3: Query Expansion
As before, but also …
–… display words related to completions that would lead to a hit
–e.g., for the query russia metal also consider documents containing russia aluminium
Implementation
–for, say, every occurrence of aluminium in the index
 aluminium   Doc. 17
 also add (once for every occurrence)
 s:67:aluminium   Doc. 17
 and (only once for the whole collection)
 s:aluminium:67   Doc. 00

Application 4: Faceted Search
As before, but also …
–… along with the completions and hits, display a breakdown of the result set by various categories
–e.g., for the query algorithm show (prominent) authors of articles containing these words
Implementation
–for, say, an article by Thomas Hofmann that appeared in NIPS 2004, add
 author:Thomas_Hofmann   Doc. 17
 venue:NIPS   Doc. 17
 year:2004   Doc. 17
–also add
 thomas:author:Thomas_Hofmann   Doc. 17
 hofmann:author:Thomas_Hofmann   Doc. 17
 etc.
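
A small Python sketch of how such artificial words could be generated and then used for the category breakdown. The function names `facet_entries` and `facet_breakdown` and the list-of-pairs postings format are assumptions; the entry formats follow the slide.

```python
from collections import Counter

def facet_entries(doc_id, author, venue, year):
    """Artificial index words for one article, in the formats shown on the slide."""
    tag = author.replace(" ", "_")
    entries = [
        (f"author:{tag}", doc_id),
        (f"venue:{venue}", doc_id),
        (f"year:{year}", doc_id),
    ]
    # One extra entry per name part, so that typing "hof" can complete to the author.
    entries += [(f"{part.lower()}:author:{tag}", doc_id) for part in author.split()]
    return entries

def facet_breakdown(postings, hit_docs, facet):
    """Count the completions of 'facet:' within the hit documents."""
    hits = set(hit_docs)
    prefix = facet + ":"
    return Counter(w[len(prefix):] for w, d in postings
                   if d in hits and w.startswith(prefix))

postings = facet_entries(17, "Thomas Hofmann", "NIPS", 2004)
print(facet_breakdown(postings, [17], "author"))  # Counter({'Thomas_Hofmann': 1})
```

The breakdown is itself a prefix query: the hit set plays the role of D and the prefix author: defines the word range W.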

Application 5: Semantic Search
As before, but also …
–… display “semantic” completions
–e.g., for the query beatles musician display instances of the class musician that occur together with the word beatles
Implementation
–cannot simply duplicate index entries of an entity for each category it belongs to, e.g. John Lennon is a singer, songwriter, person, human being, organism, guitarist, pacifist, vegetarian, entertainer, musician, …
–tricky combination of completions and joins → SIGIR’07
and still more applications …

Part 2 Solutions and Open Problem

Solution 1: Inverted Index
For example, probab* alg*
–given the documents: D13, D17, D88, … (ids of hits for probab*)
–and the word range: C D E F G (ids for alg*)
Iterate over all words from the given range
 C (algae): D8, D23, D291, ...
 D (algarve): D24, D36, D165, ...
 E (algebra): D13, D24, D88, ...
 F (algol): D56, D129, D251, ...
 G (algorithm): D3, D15, D88, ...
Intersect each list with the given one and merge the results
 (D13, E), (D88, E), (D88, G), …
Running time: |D| ∙ |W| + log |W| ∙ merge volume

A General Idea
Precompute inverted lists for ranges of words
 list for A-D: D A C A B A C A D A A B C A C A
Note
–each prefix corresponds to a word range
–ideally precompute list for each possible prefix
–too much space
–but lots of redundancy

Solution 2: AutoTree (SPIRE’06 / JIR’07)
Trick 1: Relative bit vectors
–the i-th bit of the root node corresponds to the i-th doc
–the i-th bit of any other node corresponds to the i-th set bit of its parent node
[Figure: a path of tree nodes with word ranges aachen-zyskowski, maakeb-zyskowski, maakeb-stream, each with its bit vector; two bits are marked as corresponding to doc 5 and doc 10.]
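
The two rules for relative bit vectors can be illustrated with a short Python sketch. The bit vectors below are made up (the ones in the original figure did not survive), and the function name `docs_of` is hypothetical. A real implementation would use rank/select operations on packed bit vectors instead of Python lists.

```python
def docs_of(path):
    """Doc ids (1-based) of the set bits of the last node on a root-to-node path.

    path[0] is the root's bit vector, path[k] is a node at depth k whose i-th bit
    refers to the i-th set bit of path[k - 1].
    """
    ids = list(range(1, len(path[0]) + 1))  # the i-th root bit is the i-th doc
    for bits in path:
        if len(bits) != len(ids):
            raise ValueError("a node needs one bit per set bit of its parent")
        # Keep only the docs whose bit is set; they index the bits of the child.
        ids = [doc for doc, bit in zip(ids, bits) if bit]
    return ids

root = [1, 0, 1, 1, 0, 1]   # docs 1, 3, 4, 6 contain a word from the root's range
child = [0, 1, 1, 0]        # one bit per set bit of the root
grandchild = [1, 0]         # one bit per set bit of the child

print(docs_of([root]))                     # [1, 3, 4, 6]
print(docs_of([root, child]))              # [3, 4]
print(docs_of([root, child, grandchild]))  # [3]
```

Since bit vectors get shorter with depth, the space is spent only on documents that actually contain words from the node's range.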

Solution 2: AutoTree (SPIRE’06 / JIR’07)
Trick 2: Push up the words
–For each node, by each set bit, store the leftmost word of that doc that is not already stored by a parent node
[Figure: the tree with the words stored at its nodes (aachen, advance, algol, algorithm, art, manner, manning, maximal, maximum, maple, mazza, middle, …).]
Example query
–D = 5, 7, 10; W = max*
–D = 5, 10 (→ 2, 5); report: maximum
–D = 5; report: Ø → STOP

Solution 2: AutoTree (SPIRE’06 / JIR’07)
Trick 3: divide into blocks
–and build a tree over each block as shown before
Theorem:
–query processing time O(|D| + |output|)
–uses no more space than an inverted index
AutoTree Summary:
+ output-sensitive
–not IO-efficient (heavy use of bit-rank operations)
–compression not optimal
99% correlation with actual running times

Parenthesis
Despite its quadratic worst-case complexity, the inverted index is hard to beat in practice
–very simple code
–lists are highly compressible
–perfect locality of access
Number of operations is a deceptive measure
–100 disk seeks take about half a second
–in that time can read 200 MB of contiguous data (if stored compressed)
–main memory: 100 non-local accesses → 10 KB data block

Solution 3: HYB (SIGIR’06 / IR’07)
Flat division of word range into blocks
 list for A-D: D A C A B A C A D A A B C A C A
 list for E-J: E F G J H I I E F G H J I
 list for K-N: L N M N N K L M N M K L M K L

Solution 3: HYB (SIGIR’06 / IR’07)
Flat division of word range into blocks
Replace doc ids by gaps and words by frequency ranks:
 words: D A C A B A C A D A A B C A C A
 ranks: 3rd 1st 2nd 1st 4th 1st 2nd 1st 3rd 1st 1st 4th 2nd 1st 2nd 1st
Encode both gaps and ranks such that $x \rightarrow \log_2 x$ bits
 gaps: +0 → 0, +1 → …
 ranks: 1st (A) → 0, 2nd (C) → 10, 3rd (D) → …, 4th (B) → …
[Figure: an actual block of HYB; the doc-id gaps and the remaining codes were lost in extraction.]
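
The transformation of one block into gaps and frequency ranks can be sketched in Python as follows. The doc ids are invented for the example, the name `encode_block` is hypothetical, and breaking frequency ties by first occurrence is an assumption that happens to reproduce the ranks of the slide (A 1st, C 2nd, D 3rd, B 4th). The bit-level codes themselves are not modelled.

```python
from collections import Counter

def encode_block(postings):
    """Turn (doc_id, word) postings, sorted by doc id, into gaps and frequency ranks."""
    freq = Counter(word for _, word in postings)
    # sorted() is stable and Counter keeps first-occurrence order,
    # so words with equal frequency are ranked by first occurrence.
    by_freq = sorted(freq, key=lambda w: -freq[w])
    rank = {w: r for r, w in enumerate(by_freq, 1)}
    gaps, ranks, prev = [], [], 0
    for doc, word in postings:
        gaps.append(doc - prev)  # first gap is taken relative to 0 (an assumption)
        ranks.append(rank[word])
        prev = doc
    return gaps, ranks, rank

words = "DACABACADAABCACA"                                   # word sequence from the slide
docs = [1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 6, 6, 7, 8, 8]      # made-up doc ids
gaps, ranks, rank = encode_block(list(zip(docs, words)))
print(rank)   # {'A': 1, 'C': 2, 'D': 3, 'B': 4}
print(ranks)  # [3, 1, 2, 1, 4, 1, 2, 1, 3, 1, 1, 4, 2, 1, 2, 1]
print(gaps)   # [1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
```

Both sequences consist mostly of small numbers, which is what makes a code with about log₂ x bits for value x effective.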

Solution 3: HYB (SIGIR’06 / IR’07)
Flat division of word range into blocks
Theorem:
–Let n = number of documents, m = number of words
–If blocks are chosen of equal volume ~ n
–Then query time ~ n and empirical entropy $H_{\mathrm{HYB}} \sim (1+\varepsilon) \cdot H_{\mathrm{INV}}$
HYB Summary:
+ IO-efficient (mere scans of data)
+ very good compression
–not output-sensitive
experimental results match perfectly
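
Query processing over such blocks amounts to a linear scan, which the following Python sketch tries to convey. The block contents, the name `hyb_query`, and the uncompressed list representation are assumptions for illustration. The actual structure stores gaps and ranks in compressed form.

```python
def hyb_query(blocks, doc_ids, lo, hi):
    """Scan every block that overlaps the word range [lo, hi] and filter by doc id."""
    docs = set(doc_ids)
    out = []
    for (b_lo, b_hi), postings in blocks:
        if b_hi < lo or hi < b_lo:
            continue  # block holds no word from the query range
        for doc, word in postings:  # one sequential pass over the block
            if doc in docs and lo <= word <= hi:
                out.append((doc, word))
    return sorted(out)

# Made-up blocks: ((first word, last word), postings sorted by doc id).
blocks = [
    (("A", "D"), [(1, "A"), (2, "C"), (3, "D"), (5, "B")]),
    (("E", "J"), [(1, "E"), (3, "G"), (4, "J")]),
    (("K", "N"), [(2, "K"), (5, "N")]),
]
print(hyb_query(blocks, [1, 3], "C", "G"))  # [(1, 'E'), (3, 'D'), (3, 'G')]
```

The scan reads whole blocks regardless of how many pairs match, which is why the method is IO-efficient but not output-sensitive.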

Conclusion
Context-sensitive prefix search
–core mechanism of the CompleteSearch engine
–simple enough to allow efficient realization
–powerful enough to support many advanced search features
Open problems
–solution which is both output-sensitive and IO-efficient
–implement the whole thing using MapReduce
–support yet more features
–…
Thank you!

Processing the query “beatles musician”
Text document
 Gitanes … legend says that John Lennon entity:john_lennon of the Beatles smoked Gitanes to deepen his voice …
Ontology document “John Lennon” (position, word)
 0 entity:john_lennon
 1 relation:is_a
 2 class:musician
 2 class:singer
 …
Two prefix queries
–beatles entity:* → entity:john_lennon, entity:1964, entity:liverpool, etc.
–entity:* . relation:is_a . class:musician → entity:wolfang_amadeus_mozart, entity:johann_sebastian_bach, entity:john_lennon, etc.
One join
–entity:john_lennon, etc.
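
The final join step can be sketched in a few lines of Python. The two input lists stand for the completions of the two prefix queries on this slide (truncated at "etc."), and the function name `join_entities` and the set-based join are assumptions, not a description of the engine's join.

```python
def join_entities(context_hits, class_hits):
    """Keep the entities that appear in both result lists (order of the first list)."""
    members = set(class_hits)
    return [e for e in context_hits if e in members]

# Completions of: beatles entity:*
near_beatles = ["entity:john_lennon", "entity:1964", "entity:liverpool"]
# Completions of: entity:* . relation:is_a . class:musician
musicians = ["entity:wolfang_amadeus_mozart", "entity:johann_sebastian_bach",
             "entity:john_lennon"]

print(join_entities(near_beatles, musicians))  # ['entity:john_lennon']
```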

Processing the query “beatles musician”
Problem: entity:* has a huge number of occurrences
–≈ 200 million for Wikipedia, which is ≈ 20% of all occurrences
–prefix search efficient only for up to ≈ 1% (explanation follows)
Solution: frontier classes
–classes at “appropriate” level in the hierarchy
–e.g.: artist, believer, worker, vegetable, animal, …
Text document
 Gitanes … legend says that John Lennon entity:john_lennon of the Beatles smoked Gitanes to deepen his voice …
Ontology document “John Lennon” (position, word)
 0 entity:john_lennon
 1 relation:is_a
 2 class:musician
 2 class:singer
 …
Queries
–beatles entity:*
–entity:* . relation:is_a . class:musician

Processing the query “beatles musician”
Text document
 Gitanes … legend says that John Lennon artist:john_lennon believer:john_lennon of the Beatles smoked …
Ontology document “John Lennon” (position, word)
 0 artist:john_lennon
 0 believer:john_lennon
 1 relation:is_a
 2 class:musician
 …
First figure out: musician → artist (easy)
Two prefix queries
–beatles artist:* → artist:john_lennon, artist:graham_greene, artist:pete_best, etc.
–artist:* . relation:is_a . class:musician → artist:wolfang_amadeus_mozart, artist:johann_sebastian_bach, artist:john_lennon, etc.
One join
–artist:john_lennon, etc.

INV vs. HYB: Space Consumption
Theorem: The empirical entropy of INV is $\sum_i n_i \cdot \left( 1/\ln 2 + \log_2 (n/n_i) \right)$
Theorem: The empirical entropy of HYB with block size $\varepsilon \cdot n$ is $\sum_i n_i \cdot \left( (1+\varepsilon)/\ln 2 + \log_2 (n/n_i) \right)$
where $n_i$ = number of documents containing the i-th word, $n$ = number of documents

Collection   docs         words        positions        raw size   INV       HYB
HOMEOPATHY   44,015       263,817      with positions   452 MB     13 MB     14 MB
WIKIPEDIA    2,866,503    6,700,119    with positions   7.4 GB     0.48 GB   0.51 GB
TREC.GOV     25,204,013   25,263,176   no positions     426 GB     4.6 GB    4.9 GB

Nice match of theory and practice
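
To get a feeling for the two formulas, they can be evaluated directly. The sketch below is a plain transcription into Python; the function names and the document frequencies in the example are made up and do not correspond to any of the three collections.

```python
import math

def entropy_inv(n, doc_freqs):
    """Sum over words of n_i * (1/ln 2 + log2(n / n_i)), in bits."""
    return sum(ni * (1 / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)

def entropy_hyb(n, doc_freqs, eps):
    """Same sum with (1 + eps)/ln 2 in place of 1/ln 2, in bits."""
    return sum(ni * ((1 + eps) / math.log(2) + math.log2(n / ni)) for ni in doc_freqs)

n = 1000                            # hypothetical number of documents
doc_freqs = [500, 200, 50, 10, 1]   # hypothetical n_i values
eps = 0.1

h_inv = entropy_inv(n, doc_freqs)
h_hyb = entropy_hyb(n, doc_freqs, eps)
print(f"INV: {h_inv:.0f} bits, HYB: {h_hyb:.0f} bits, ratio: {h_hyb / h_inv:.3f}")
```

The two sums differ by exactly $(\varepsilon / \ln 2) \cdot \sum_i n_i$, so the overhead of HYB is at most a factor of $1 + \varepsilon$.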

INV vs. HYB: Query Time
Experiment: type ordinary queries from left to right (db, dbl, dblp, dblp un, dblp uni, dblp univ, dblp unive, ...)

Collection   docs         words        queries                              INV avg / max               HYB avg / max
HOMEOPATHY   44,015       263,817      5,732 real queries, with proximity   0.03 secs / 0.38 secs       0.003 secs / 0.06 secs
WIKIPEDIA    2,866,503    6,700,119    100 random queries, with proximity   0.17 secs / 2.27 secs       0.05 secs / 0.49 secs
TREC.GOV     25,204,013   25,263,176   50 TREC queries, no proximity        0.58 secs / [missing] secs  0.11 secs / 0.86 secs

HYB beats INV by an order of magnitude

Engineering
Careful implementation in C++
–Experiment: sum over array of 10 million 4-byte integers (on a Linux PC with an approx. 2 GB/sec memory bandwidth)
 C++: 1800 MB/sec
 Java: 300 MB/sec
 MySQL: 16 MB/sec
 Perl: 2 MB/sec
With HYB, every query is essentially one block scan
–perfect locality of access, no sorting or merging, etc.
–balanced ratio of read, decompression, processing, etc.
 read 21%, decomp. 18%, intersect 11%, rank 15%, history 35%
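
For readers who want to repeat the summing experiment on their own machine, a Python version might look like the sketch below. Python was not part of the original comparison, the function name `sum_throughput` is hypothetical, and the measured rate depends on the machine, so no figure is claimed here.

```python
import time
from array import array

def sum_throughput(n=10_000_000):
    """Sum n machine integers and return (total, MB scanned per second)."""
    data = array("i", range(n))      # typecode 'i': C int, typically 4 bytes
    start = time.perf_counter()
    total = sum(data)
    elapsed = time.perf_counter() - start
    megabytes = n * data.itemsize / 1e6
    rate = megabytes / elapsed if elapsed > 0 else float("inf")
    return total, rate

if __name__ == "__main__":
    total, rate = sum_throughput()
    print(f"sum = {total}, about {rate:.0f} MB/sec")
```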

System Design: High Level View
Compute Server (C++), Web Server (PHP), User Client (JavaScript)
Debugging such an application is hell!