The Power of Prefix Search (with a nice open problem) Holger Bast Max-Planck-Institut für Informatik Saarbrücken, Germany Talk at ADS 2007 in Bertinoro,


The Power of Prefix Search (with a nice open problem)
Holger Bast, Max-Planck-Institut für Informatik, Saarbrücken, Germany
Talk at ADS 2007 in Bertinoro, October 3rd

Overview
Part 1
– Definition of our prefix search problem
– Applications
– Demos of our search engine
Part 2
– Problem definition again
– One way to solve it
– Another way to solve it
– Your way to solve it

Part 1 Definition, Applications, Demos

Problem Definition: Formal
Context-Sensitive Prefix Search
Preprocess
– a given collection of text documents such that queries of the following kind can be processed efficiently
Given
– an arbitrary set of documents D
– and a range of words W
Compute
– all word-in-document pairs (w, d) such that w ∈ W, d ∈ D, and w occurs in d
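As a reference point for the definition above, here is a minimal brute-force sketch in Python. It is not one of the talk's solutions; it only spells out what the answer to a query is. The function name `prefix_search_bruteforce` and the dictionary layout are assumptions made for illustration; the three toy documents use the single-letter word ids of the slides.

```python
# Toy collection: doc id -> list of word ids (single letters, as on the slides).
DOCS = {
    "D13": ["A", "O", "E", "W", "H"],
    "D17": ["B", "W", "U", "K", "A"],
    "D88": ["P", "A", "E", "G", "Q"],
}


def prefix_search_bruteforce(docs, doc_ids, word_lo, word_hi):
    """Return all (doc, word) pairs with doc in doc_ids and word_lo <= word <= word_hi.

    Hypothetical helper: it scans every word of every given document,
    so it is only a specification of the result, not an efficient index.
    """
    result = []
    for d in doc_ids:
        for w in sorted(set(docs.get(d, []))):
            if word_lo <= w <= word_hi:
                result.append((d, w))
    return result


pairs = prefix_search_bruteforce(DOCS, ["D13", "D17", "D88"], "C", "H")
print(pairs)
```

Any index structure for the problem has to return the same pairs; the solutions in Part 2 differ only in how fast and with how much space they do so.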

Problem Definition: Visual
Data is given as
– documents containing words
– documents have ids (D1, D2, …)
– words have ids (A, B, C, …)
Query
– given a sorted list of doc ids
– and a range of word ids
Answer
– all matching word-in-doc pairs
– with scores
– and positions
[Figure: the collection drawn as boxes of word ids, e.g. D13: A O E W H, D17: B W U K A, D88: P A E G Q; the query is the doc list D13 D17 D88 … with the word range C D E F G H; the answer includes (D13, E), (D88, E), (D88, G), …]

Application 1: Autocompletion
After each keystroke
– display completions of the last query word that lead to the best hits, together with the best such hits
– e.g., for the query "probabilistic alg" display "algorithm" and "algebra" and show hits for both
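Autocompletion reduces to the prefix search problem because, in a sorted vocabulary, all words starting with a given prefix form one contiguous range. A minimal sketch of that mapping, assuming a small hypothetical vocabulary and a hypothetical helper `prefix_to_range`:

```python
from bisect import bisect_left

# Hypothetical sorted vocabulary; a word's id is its position in this list.
VOCAB = ["aachen", "algae", "algarve", "algebra", "algol",
         "algorithm", "art", "probabilistic"]


def prefix_to_range(sorted_vocab, prefix):
    """Return the half-open range [lo, hi) of word ids that start with prefix.

    Assumption: no word contains the highest Unicode code point, which is
    appended to the prefix to form an upper bound for the binary search.
    """
    lo = bisect_left(sorted_vocab, prefix)
    hi = bisect_left(sorted_vocab, prefix + "\U0010ffff")
    return lo, hi


lo, hi = prefix_to_range(VOCAB, "alg")
completions = VOCAB[lo:hi]
print(lo, hi, completions)
```

The range of word ids, together with the doc ids matching the earlier query words, is exactly the (D, W) input of the problem definition.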

Application 2: Error Correction
As before, but also …
– … display spelling variants of completions that would lead to a hit
– e.g., for the query "probabilistic algorithm" also consider a document containing "probalistic aigorithm"
Implementation
– if, say, aigorithm occurs as a misspelling of algorithm, then for every occurrence of aigorithm in the index
    aigorithm   Doc. 17
  also add
    algorithm::aigorithm   Doc. 17
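The index augmentation described above can be sketched as follows. The posting layout (a list of `(word, doc_id)` tuples), the function name `add_correction_postings`, and the misspelling table are assumptions for illustration; how misspellings are detected is not covered here.

```python
def add_correction_postings(postings, misspellings):
    """Add a 'correct::misspelled' posting for every occurrence of a misspelling.

    postings:     list of (word, doc_id) tuples (hypothetical layout)
    misspellings: dict mapping a misspelled word to its correct form
    """
    extra = [(f"{misspellings[w]}::{w}", d)
             for (w, d) in postings if w in misspellings]
    return sorted(postings + extra)


index = add_correction_postings(
    [("aigorithm", 17), ("algorithm", 3)],
    {"aigorithm": "algorithm"},
)
# A prefix query for "alg" now also reaches the document with the misspelling.
docs_for_alg = {d for (w, d) in index if w.startswith("alg")}
print(index, docs_for_alg)
```

Because the added entry starts with the correct spelling, it falls into the same word range as the correct word, so no change to the query algorithm is needed.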

Application 3: Query Expansion
As before, but also …
– … display words related to completions that would lead to a hit
– e.g., for the query "russia metal" also consider documents containing "russia aluminium"
Implementation
– for, say, every occurrence of aluminium in the index
    aluminium   Doc. 17
  also add (once for every occurrence)
    s:67:aluminium   Doc. 17
  and (only once for the whole collection)
    s:aluminium:67   Doc. 00
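A sketch of how such entries could be generated. It assumes, as an interpretation of the slide, that 67 is the id of a group of related words and that doc id 0 is a special document holding the word-to-group entries; the names `expansion_postings` and `group_of` are hypothetical.

```python
def expansion_postings(postings, group_of):
    """Add the two kinds of 's:' entries shown above.

    postings: list of (word, doc_id) tuples (hypothetical layout)
    group_of: dict mapping a word to the id of its group of related words
              (assumed to be given; how groups are found is not covered)
    """
    extra = []
    for w, d in postings:
        if w in group_of:
            # once for every occurrence of the word
            extra.append((f"s:{group_of[w]}:{w}", d))
    for w, g in sorted(group_of.items()):
        # only once for the whole collection, in the special doc 0
        extra.append((f"s:{w}:{g}", 0))
    return sorted(postings + extra)


expanded = expansion_postings(
    [("aluminium", 17), ("russia", 17)],
    {"aluminium": 67},
)
print(expanded)
```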

Application 4: Faceted Search
As before, but also …
– … along with the completions and hits, display a breakdown of the result set by various categories
– e.g., for the query "algorithm" show (prominent) authors of articles containing these words
Implementation
– for, say, an article by Camil Demetrescu that appeared in SODA 2006, add
    author:Camil_Demetrescu   Doc. 17
    venue:SODA   Doc. 17
    year:2006   Doc. 17
– also add
    camil:author:Camil_Demetrescu   Doc. 17
    demetrescu:author:Camil_Demetrescu   Doc. 17
  etc.
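The facet entries above follow a simple pattern, sketched here. The function name `facet_postings` and its parameters are assumptions; only the entry formats are taken from the slide.

```python
def facet_postings(doc_id, author, venue, year):
    """Build the artificial index words for one article (hypothetical helper)."""
    author_token = author.replace(" ", "_")
    entries = [
        (f"author:{author_token}", doc_id),
        (f"venue:{venue}", doc_id),
        (f"year:{year}", doc_id),
    ]
    # One extra entry per name part, so that typing "dem" can complete
    # to the author facet of Camil Demetrescu.
    for name_part in author.lower().split():
        entries.append((f"{name_part}:author:{author_token}", doc_id))
    return entries


entries = facet_postings(17, "Camil Demetrescu", "SODA", 2006)
for word, doc in entries:
    print(word, doc)
```

With these entries, the prefix query `author:` in the context of the hits for "algorithm" returns the authors of the matching articles, which is the breakdown to display.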

Application 5: Semantic Search
As before, but also …
– … display "semantic" completions
– e.g., for the query "beatles musician" display instances of the class musician that occur together with the word beatles
Implementation
– cannot simply duplicate index entries of an entity for each category it belongs to, e.g. John Lennon is a singer, songwriter, person, human being, organism, guitarist, pacifist, vegetarian, entertainer, musician, …
– tricky combination of completions and joins → SIGIR'07
and still more applications …

Part 2 Solutions and Open Problem

Solution 1: Inverted Index
For example, probab* alg*
– given the documents: D13, D17, D88, … (ids of hits for probab*)
– and the word range: C D E F G (ids for alg*)
Iterate over all words from the given range
    C (algae)       D8, D23, D291, ...
    D (algarve)     D24, D36, D165, ...
    E (algebra)     D13, D24, D88, ...
    F (algol)       D56, D129, D251, ...
    G (algorithm)   D3, D15, D88, ...
Intersect each list with the given one and merge the results
    D13   D88   D88   …
    E     E     G     …
Running time: |D| ∙ |W| + log |W| ∙ merge volume
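The procedure above can be sketched in a few lines of Python, using the five inverted lists from the slide. The function names and the in-memory dictionary are assumptions for illustration; a real inverted index would store compressed lists.

```python
import heapq

# Inverted lists from the slide: word -> sorted list of doc ids.
INVERTED = {
    "algae":     [8, 23, 291],
    "algarve":   [24, 36, 165],
    "algebra":   [13, 24, 88],
    "algol":     [56, 129, 251],
    "algorithm": [3, 15, 88],
}


def intersect_sorted(a, b):
    """Intersect two sorted lists of doc ids with a linear two-pointer scan."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def inv_prefix_search(inverted, doc_ids, word_range):
    """Intersect the list of every word in the range with doc_ids, then merge."""
    per_word = [[(d, w) for d in intersect_sorted(inverted[w], doc_ids)]
                for w in word_range]
    # heapq.merge performs the multi-way merge of the (already sorted) results.
    return list(heapq.merge(*per_word))


hits = inv_prefix_search(INVERTED, [13, 17, 88], sorted(INVERTED))
print(hits)
```

The loop touches every word of the range even when its intersection is empty, which is where the |D| ∙ |W| term in the running time comes from.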

A General Idea
Precompute inverted lists for ranges of words
[Figure: one precomputed list for the word range A-D, containing the word ids D A C A B A C A D A A B C A C A]
Note
– each prefix corresponds to a word range
– ideally precompute list for each possible prefix
– too much space
– but lots of redundancy

Solution 2: AutoTree (SPIRE'06 / JIR'07)
Trick 1: Relative bit vectors
– the i-th bit of the root node corresponds to the i-th doc
– the i-th bit of any other node corresponds to the i-th set bit of its parent node
[Figure: a tree of bit vectors with nodes for the word ranges aachen-zyskowski, maakeb-zyskowski, and maakeb-stream; individual bits are labelled "corresponds to doc 5" and "corresponds to doc 10"]
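A small sketch of how relative bit vectors are interpreted. The bit patterns are made up for illustration (they are not the ones in the figure), and plain Python lists stand in for the succinct rank structures a real implementation would need.

```python
def rank1(bits, i):
    """Number of set bits among the first i positions (positions are 1-based)."""
    return sum(bits[:i])


def resolve_docs(root_bits, child_bits):
    """Translate a child's relative bit vector back to absolute doc ids.

    The i-th bit of the child belongs to the i-th set bit of its parent;
    here the parent is the root, whose i-th bit belongs to doc i.
    """
    parent_docs = [i for i, b in enumerate(root_bits, start=1) if b]
    return [d for d, b in zip(parent_docs, child_bits) if b]


# Hypothetical example: 10 docs, the root has bits set for docs 2, 5, 7, 8, 10.
ROOT = [0, 1, 0, 0, 1, 0, 1, 1, 0, 1]
# The child has one bit per set bit of the root: 2nd and 5th are set.
CHILD = [0, 1, 0, 0, 1]

print(resolve_docs(ROOT, CHILD))        # absolute doc ids of the child
print(rank1(ROOT, 5), rank1(ROOT, 10))  # positions of docs 5 and 10 in the child
```

Going down the tree, an absolute doc id is translated into a position in the child by a rank operation on the parent, which is why the structure makes heavy use of bit-rank operations.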

Solution 2: AutoTree (SPIRE'06 / JIR'07)
Trick 2: Push up the words
– for each node, by each set bit, store the leftmost word of that doc that is not already stored by a parent node
[Figure: the tree annotated with stored words (aachen, advance, algol, algorithm, art, manner, manning, maple, maximal, maximum, mazza, middle, …) and a query walk-through: D = 5, 7, 10 and W = max*; at the next node D = 5, 10 (→ 2, 5), report: maximum; then D = 5, report: Ø → STOP]

Solution 2: AutoTree (SPIRE'06 / JIR'07)
Trick 3: Divide into blocks
– and build a tree over each block as shown before
Theorem:
– query processing time O(|D| + |output|)
– uses no more space than an inverted index
AutoTree Summary:
+ output-sensitive
– not IO-efficient (heavy use of bit-rank operations)
– compression not optimal

Parenthesis
Despite its quadratic worst-case complexity, the inverted index is hard to beat in practice
– very simple code
– lists are highly compressible
– perfect locality of access
Number of operations is a deceptive measure
– 100 disk seeks take about half a second
– in that time one can read 200 MB of contiguous data (if stored compressed)
– main memory: 100 non-local accesses → 10 KB data block

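Solution 3: HYB (SIGIR'06 / IR'07)
Flat division of word range into blocks
[Figure: three blocks, each stored as one list of word ids: list for A-D (D A C A B A C A D A A B C A C A), list for E-J (E F G J H I I E F G H J I), list for K-N (L N M N N K L M N M K L M K L)]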

Solution 3: HYB (SIGIR'06 / IR'07)
Flat division of word range into blocks
Replace doc ids by gaps and words by frequency ranks:
    words: D   A   C   A   B   A   C   A   D   A   A   B   C   A   C   A
    ranks: 3rd 1st 2nd 1st 4th 1st 2nd 1st 3rd 1st 1st 4th 2nd 1st 2nd 1st
Encode both gaps and ranks such that x → log2 x bits
    gaps:  +0 → 0, +1 → …
    ranks: 1st (A) → 0, 2nd (C) → 10, 3rd (D) → …, 4th (B) → …
[Figure: an actual block of HYB]
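The transformation of one block into gaps and frequency ranks can be sketched as follows. The word sequence is the one from the slide; the doc ids are invented for illustration, ties in frequency are broken by first occurrence (an assumption that reproduces the slide's ranks), and the final bit-level encoding is left out.

```python
from collections import Counter


def hyb_block(postings):
    """Turn a block of (doc_id, word) postings, sorted by doc id, into
    doc-id gaps and word frequency ranks (1 = most frequent word)."""
    counts = Counter(w for _, w in postings)
    first_seen = {}
    for pos, (_, w) in enumerate(postings):
        first_seen.setdefault(w, pos)
    # Most frequent word first; ties broken by first occurrence (assumption).
    by_freq = sorted(counts, key=lambda w: (-counts[w], first_seen[w]))
    rank_of = {w: r for r, w in enumerate(by_freq, start=1)}

    gaps, prev = [], 0
    for d, _ in postings:
        gaps.append(d - prev)  # a gap of 0 means: same doc as the entry before
        prev = d
    ranks = [rank_of[w] for _, w in postings]
    return gaps, ranks, rank_of


WORDS = "DACABACADAABCACA"                                        # from the slide
DOC_IDS = [1, 3, 3, 5, 5, 6, 7, 8, 8, 9, 11, 11, 11, 12, 13, 15]  # hypothetical
gaps, ranks, rank_of = hyb_block(list(zip(DOC_IDS, WORDS)))
print(gaps)
print(ranks)
```

Small gaps and small ranks dominate, so a code that spends about log2 x bits on the value x compresses both sequences well.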

Solution 3: HYB (SIGIR'06 / IR'07)
Flat division of word range into blocks
Theorem:
– Let n = number of documents, m = number of words
– If blocks are chosen of equal volume ε ∙ n
– Then query time ~ ε ∙ n and empirical entropy $H_{HYB} \sim (1+\varepsilon) \cdot H_{INV}$
HYB Summary:
+ IO-efficient (mere scans of data)
+ very good compression
– not output-sensitive

Open Problem
A solution for context-sensitive prefix search which is both output-sensitive and IO-efficient
– Note: the interesting queries are those with large D and W but small result set
Similar situation for substring search / suffix arrays
– all algorithms with good compression have poor locality of access
But prefix search is easier …
– … and more relevant for text search
Thank you!

INV vs. HYB: Space Consumption
Theorem: The empirical entropy of INV is $\sum_i n_i \cdot \left( \frac{1}{\ln 2} + \log_2 \frac{n}{n_i} \right)$
Theorem: The empirical entropy of HYB with block size ε∙n is $\sum_i n_i \cdot \left( \frac{1+\varepsilon}{\ln 2} + \log_2 \frac{n}{n_i} \right)$
where n_i = number of documents containing the i-th word, n = number of documents

              HOMEOPATHY        WIKIPEDIA         TREC.GOV
    docs      44,015            2,866,503         25,204,013
    words     263,817           6,700,119         25,263,176
    index     with positions    with positions    no positions
    raw size  452 MB            7.4 GB            426 GB
    INV       13 MB             0.48 GB           4.6 GB
    HYB       14 MB             0.51 GB           4.9 GB

Nice match of theory and practice
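To get a feeling for the two bounds, they can be evaluated directly. The sketch below implements the two formulas as stated in the theorems; the document frequencies and the value of ε are made-up inputs, not data from the talk.

```python
from math import log, log2


def entropy_inv(n, doc_freqs):
    """Empirical entropy of INV in bits: sum of n_i * (1/ln 2 + log2(n/n_i))."""
    return sum(ni * (1 / log(2) + log2(n / ni)) for ni in doc_freqs)


def entropy_hyb(n, doc_freqs, eps):
    """Empirical entropy of HYB in bits, with block size eps * n."""
    return sum(ni * ((1 + eps) / log(2) + log2(n / ni)) for ni in doc_freqs)


# Hypothetical collection: 1000 docs, three words occurring in 500, 100, 10 docs.
N = 1000
FREQS = [500, 100, 10]
EPS = 0.1

h_inv = entropy_inv(N, FREQS)
h_hyb = entropy_hyb(N, FREQS, EPS)
print(h_inv, h_hyb, h_hyb / h_inv)
```

The two sums differ only by ε/ln 2 bits per posting, so the ratio stays below 1 + ε, in line with the small gap between the INV and HYB rows of the table.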

INV vs. HYB: Query Time
Experiment: type ordinary queries from left to right
    db, dbl, dblp, dblp un, dblp uni, dblp univ, dblp unive, ...

    HOMEOPATHY (44,015 docs, 263,817 words), 5,732 real queries with proximity
        INV   avg: 0.03 secs    max: 0.38 secs
        HYB   avg: .003 secs    max: 0.06 secs
    WIKIPEDIA (2,866,503 docs, 6,700,119 words), 100 random queries with proximity
        INV   avg: 0.17 secs    max: 2.27 secs
        HYB   avg: 0.05 secs    max: 0.49 secs
    TREC.GOV (25,204,013 docs, 25,263,176 words), 50 TREC queries, no proximity
        INV   avg: 0.58 secs    max: … secs
        HYB   avg: 0.11 secs    max: 0.86 secs

HYB beats INV by an order of magnitude

Engineering
Careful implementation in C++
– Experiment: sum over array of 10 million 4-byte integers (on a Linux PC with an approx. 2 GB/sec memory bandwidth)
    C++      1800 MB/sec
    Java      300 MB/sec
    MySQL      16 MB/sec
    Perl        2 MB/sec
With HYB, every query is essentially one block scan
– perfect locality of access, no sorting or merging, etc.
– balanced ratio of read, decompression, processing, etc.
    read 21%, decomp. 18%, intersect 11%, rank 15%, history 35%

System Design: High Level View
[Figure: three components: Compute Server (C++), Web Server (PHP), User Client (JavaScript)]
Debugging such an application is hell!

Basic Problem Definition
Definition: Context-sensitive prefix search and completion
Given a query consisting of
– sorted list D of doc ids, e.g. Doc15 Doc183 Doc185 Doc17351 …
– range W of word ids, e.g. Word1893 – Word7329
Compute as a result
– all (w, d) with w ∈ W, d ∈ D, sorted by doc id, e.g. (Doc15, Word7014), (Doc15, Word5112), (Doc…, Word2011), …
Refinements
– positions, e.g. Pos12 Pos73 Pos44 ...
– scores

Basic Problem Definition
For example, dblp uni
– set D = document ids from result for dblp
– range W = word ids of all words starting with uni
→ multi-dimensional query processed as sequence of 1½ dimensional queries
For example, intersect completions of results for conf:sigir author: and conf:sigmod author:
→ efficient, because completions are from small range
[Figure: the two results to be intersected, one with doc ids D11 D25 D57 D91 and word ids W25 W23 W24, the other with doc ids D23 D54 D56 D58 D69 and word ids W27 W23 W27]

Conclusions
Context-sensitive prefix search and completion
– is a fundamental operation: supports autocompletion search, semantic search, faceted search, DB-style selects and joins, ontology search, …
– efficient support via HYB index: very good compression properties, perfect locality of access
Some open issues
– integrate top-k query processing
– what else can we do with it?
– very short prefixes