Cool algorithms for a cool feature. Holger Bast, Max-Planck-Institut für Informatik (MPII), Saarbrücken, Germany. Joint work with Christian Mortensen and Ingmar Weber.

Searching with Autocompletion: cool algorithms for a cool feature. Holger Bast, Max-Planck-Institut für Informatik (MPII), Saarbrücken, Germany. Joint work with Christian Mortensen and Ingmar Weber. IISc Bangalore, 9th October 2004.

Searching with autocompletion  While typing a google-like query, show all completions of the beginning of the last word typed that co-occur with the previous words in some document  Useful in a variety of ways: –saves typing & avoids spelling errors –find unexpected variations of words that actually lead to a hit –explore the collection while formulating the query –and more! (end of talk)

A formalization …  Query –a set D of documents (=hits of first part of query) –a prefix p (=what has been typed of last word of query)  Answer –all completions of p that occur somewhere in D –those documents from D that contain one of these words  Objective –process queries as quickly as possible … –… by a (precomputed) data structure using as little space as possible
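The formalization above can be written down directly as executable pseudocode. The following is a minimal Python sketch (the function name `answer` and the toy representation of a collection as a dict from doc id to word set are assumptions for illustration, not the talk's data structure):

```python
# Direct transcription of the Query/Answer definition (toy sketch):
# collection maps doc id -> set of words, D is a set of doc ids, p a prefix.
def answer(collection, D, p):
    """Return (W, D2): W = all completions of p occurring somewhere in D,
    D2 = those docs of D containing one of these words."""
    W = {w for i in D for w in collection[i] if w.startswith(p)}
    D2 = {i for i in D if collection[i] & W}
    return W, D2
```

This is the specification the precomputed data structure has to answer fast; the naive version here scans all of D.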

… and its use for the actual problem  For example, suppose the user typed kurt mehl* alg ($ below is a special end-of-word symbol): 1. compute all completions of kurt$ → trivially only kurt itself, and the list D1 of documents containing kurt 2. compute all completions of mehl that occur in some doc of D1, and the list D2 of these documents (a subset of D1) 3. compute all completions of alg that occur in some doc of D2, and the list D3 of these documents (a subset of D2)  Autocompletion and prefix search simultaneously!
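The chained evaluation of the three steps above can be sketched as follows (a self-contained toy, with hypothetical doc ids and an inverted-index dict; for simplicity every query part is treated as a prefix here, whereas the talk treats the completed words as exact via the $ symbol):

```python
# Toy sketch of the chained evaluation for "kurt mehl* alg".
def completions_and_docs(index, D, p):
    """index: word -> set of doc ids. Return the completions of p that
    occur in some doc of D, and the docs of D containing one of them."""
    words = {w for w in index if w.startswith(p) and index[w] & D}
    docs = set().union(*(index[w] & D for w in words)) if words else set()
    return words, docs

index = {"kurt": {1, 2, 5}, "mehlhorn": {2, 5, 9}, "mehltau": {3},
         "algorithm": {2, 7}, "algebra": {5}}
D = set().union(*index.values())          # start with all documents
for p in ["kurt", "mehl", "alg"]:
    words, D = completions_and_docs(index, D, p)
```

After the loop, `D` holds the hits for the full query and `words` the completions of the last prefix that lead to a hit.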

A straightforward solution  The NAIVE data structure and algorithm: –for each word w, pre-compute the sorted list of all (ids of) documents containing w (the so-called inverted index) –for a given set of documents D and prefix p, fetch the list of each of the potential completions of p and intersect with D

  mehl:      4, 28, 78, 105, …
  mehlhorn:  5, 17, 51, 79, 102, …
  mehihorn:  23
  mehlhorns: 17, 102, 237, …
  mehltau:   21, 79, 157, …

  D = 5, 17, 23, 57, 102    p = meh

A straightforward solution  The NAIVE data structure and algorithm: –for each word w, pre-compute the sorted list of all (ids of) documents containing w (the so-called inverted index) –for a given set of documents D and prefix p, fetch the list of each of the potential completions of p and intersect with D

  mehl:      4, 28, 78, 105, …
  mehlhorn:  5, 17, 51, 79, 102, …
  mehihorn:  23
  mehlhorns: 17, 102, 237, …
  mehltau:   21, 79, 157, …

  D = 5, 17, 23, 57, 102    p = meh    hits = 5, 17, 23, 102

A straightforward solution  Space for the data structure –n∙L∙log n bits, where n is the number of documents and L is the average number of distinct words per document  Time to process a query –needs K intersections, where K is the number of all completions of p

  mehl:      4, 28, 78, 105, …
  mehlhorn:  5, 17, 51, 79, 102, …
  mehihorn:  23
  mehlhorns: 17, 102, 237, …
  mehltau:   21, 79, 157, …

  D = 5, 17, 23, 57, 102    p = meh    hits = 5, 17, 23, 102
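The NAIVE method performs K of these list intersections. The primitive itself is the standard linear-time merge of two sorted doc-id lists; a minimal sketch (hypothetical helper, not code from the talk):

```python
def intersect(sorted_a, sorted_b):
    """Merge-intersect two sorted doc-id lists in O(len(a) + len(b))."""
    i = j = 0
    out = []
    while i < len(sorted_a) and j < len(sorted_b):
        if sorted_a[i] == sorted_b[j]:
            out.append(sorted_a[i])
            i += 1
            j += 1
        elif sorted_a[i] < sorted_b[j]:
            i += 1
        else:
            j += 1
    return out
```

With the lists from the slide, intersecting the list of mehlhorn with D yields 5, 17, 102, and the union over all K completions gives the hits.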

Performance

              # intersections per query    # bits used by index
  NAIVE       K                            n∙L∙log n

  K = # completions overall, n = # documents, L = # distinct words per doc (on average)

Performance

              # intersections per query    # bits used by index
  NAIVE       K                            n∙L∙log n
  OUR GOAL    k                            ≤ n∙L∙log n

  K = # completions overall, k = # completions in hits, n = # documents,
  L = # distinct words per doc (on average)

  In our experiments we also measure total list length and, of course, running time per query.

What I like about this work  Multi-faceted (and each facet challenging) –1/3 getting the problem clear –1/3 algorithms and analysis –1/3 design and implementation  (the slide shows a diagram: the workflow was not a straight line through these parts, but iterated back and forth between them)

Performance

              # intersections per query    # bits used by index
  NAIVE       K                            n∙L∙log n
  OUR GOAL    k                            ≤ n∙L∙log n

  K = # completions overall, k = # completions in hits, n = # documents,
  L = # distinct words per doc (on average)

Building a tree over the words  For each internal node, precompute the union of the lists of all leaves in the subtree

mehl 4,28,78,105,… mehlhorn 5,17,51,102,… mehihorn 23 mehlhorns 17,102,237,… zürich … aachen-zürich 1,2,3,4,5,6,7,8,9,10,11,12,13,… mehl-mehlhorns 4,5,17,23,28,51,78,79,102,105,237,… Building a tree over the words  For each internal node, precompute the union of the lists of all leaves in the subtree aachen … D = 57, 115, 250 p = meh

mehl 4,28,78,105,… mehlhorn 5,17,51,102,… mehihorn 23 mehlhorns 17,102,237,… zürich … aachen-zürich 1,2,3,4,5,6,7,8,9,10,11,12,13,… mehl-mehlhorns 4,5,17,23,28,51,78,79,102,105,237,… Building a tree over the words  For each internal node, precompute the union of the lists of all leaves in the subtree aachen … subtree need not be explored further! D = 57, 115, 250 p = meh

mehl 4,28,78,105,… mehlhorn 5,17,51,102,… mehihorn 23 mehlhorns 17,102,237,… zürich … aachen-zürich 1,2,3,4,5,6,7,8,9,10,11,12,13,… mehl-mehlhorns 4,5,17,23,28,51,78,79,102,105,237,… Building a tree over the words  For each internal node, precompute the union of the lists of all leaves in the subtree aachen … D = 51, 115, 250 p = meh

mehl 4,28,78,105,… mehlhorn 5,17,51,102,… mehihorn 23 mehlhorns 17,102,237,… zürich … aachen-zürich 1,2,3,4,5,6,7,8,9,10,11,12,13,… mehl-mehlhorns 4,5,17,23,28,51,78,79,102,105,237,… Building a tree over the words  For each internal node, precompute the union of the lists of all leaves in the subtree aachen … D = 51, 115, 250 p = meh but up to log K intersections per hit!
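The pruning behaviour shown in the slides above can be sketched as a toy tree search (the `Node`/`search` names and the string-range test are assumptions for illustration, not the actual implementation):

```python
# Toy sketch of the TREE idea: each node stores the union of the doc
# lists of the leaves below it, so a whole subtree is pruned as soon
# as that union is disjoint from D.
class Node:
    def __init__(self, lo, hi, docs, children=()):
        self.lo, self.hi = lo, hi        # lexicographic word range covered
        self.docs = set(docs)            # union of the leaves' doc lists
        self.children = list(children)

def search(node, p, D, hits):
    if node.hi < p or node.lo > p + "\uffff":
        return                           # word range disjoint from p*
    common = node.docs & D
    if not common:
        return                           # subtree need not be explored further
    if not node.children:                # leaf = a single word
        if node.lo.startswith(p):
            hits[node.lo] = common
        return
    for child in node.children:
        search(child, p, D, hits)
```

On the slide's example, the subtrees for aachen and zürich are cut off by the word range, and the leaf mehl is cut off because its list is disjoint from D.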

Performance

              # intersections per query    # bits used by index
  NAIVE       K                            n∙L∙log n
  TREE        k∙log K                      ≈ n∙L∙log n∙log m
  OUR GOAL    k                            ≤ n∙L∙log n

  K = # completions overall, k = # completions in hits, n = # documents,
  L = # distinct words per doc (on average), m = # words

  Way too much space!

Trick 1: Relative bit vectors  Store each doc list as a bit vector, relative to its parent –the i-th bit of the root node corresponds to the i-th doc –the i-th bit of any other node corresponds to the i-th set bit of its parent node aachen-zürich … mehl-zürich … mehl-stream …

Trick 1: Relative bit vectors aachen-zürich … mehl-zürich … mehl-stream … corresponds to doc 5 corresponds to doc 10  Store each doc list as a bit vector, relative to its parent –the i-th bit of the root node corresponds to the i-th doc –the i-th bit of any other node corresponds to the i-th set bit of its parent node

Trick 1: Complexity  Instead of log n bits, each occurrence of a doc id now accounts for a mere 2 bits –wherever there was a whole doc id before, there is now a single bit, set to 1 –if a node has a bit set to 0, then its parent node has the bit for the corresponding doc set to 1, and the sibling bit cannot be 0 as well, so each 0-bit can be charged to a 1-bit one level up  But we can't start processing in the middle of the tree anymore –we first have to travel from the root to the respective node to get the set of doc ids it corresponds to –this costs up to an additional log m intersections per query!
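Trick 1 can be sketched in a few lines of Python (hypothetical helper names `relative_bits`/`resolve`; the point is only the encoding, not the actual bit-packed implementation):

```python
# Sketch of Trick 1: a node's doc list stored as a bit vector
# relative to its parent's doc list.
def relative_bits(parent_docs, child_docs):
    """parent_docs: sorted doc ids of the parent node. The i-th output
    bit is 1 iff the parent's i-th doc also occurs in the child."""
    child = set(child_docs)
    return [1 if d in child else 0 for d in parent_docs]

def resolve(parent_docs, bits):
    """Recover the child's doc ids from its relative bit vector."""
    return [d for d, b in zip(parent_docs, bits) if b]
```

Resolving a node's list thus requires walking down from the root, which is exactly the extra log m cost noted above.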

Performance

                # intersections per query    # bits used by index
  NAIVE         K                            n∙L∙log n
  TREE          k∙log K                      ≈ n∙L∙log n∙log m
  +BIT VECTORS  k∙log K + log m              ≈ n∙L∙2∙log m
  OUR GOAL      k                            ≤ n∙L∙log n

  K = # completions overall, k = # completions in hits, n = # documents,
  L = # distinct words per doc (on average), m = # words

  Can we get rid of the log K factor?

Trick 2: Push up the words  For each node, by each set bit, store the leftmost word of that doc that is not already stored by a parent node … … … aachen advance algol algorithm advance aachen art advance mehlhorn meeting mehl mehihorn mercury mix middle D = 5, 7, 10 p = meh

Trick 2: Push up the words  For each node, by each set bit, store the leftmost word of that doc that is not already stored by a parent node … … … aachen advance algol algorithm advance aachen art advance mehlhorn meeting mehl mehihorn mercury mix middle D = 5, 7, 10 p = meh D = 5, 10 ( → 2, 5) report: mehihorn D = 5 report: Ø → STOP

Trick 2: Complexity  Lemma: When processing a node under which all leaves are potential completions, either the intersection is empty and no node in the subtree needs to be explored, or at least one new word is reported –because for two nodes, one a descendant of the other, the lexicographically smallest word reported by the ancestor (higher up in the tree) is strictly smaller than the lexicographically smallest word reported by its descendant  The doc lists now need 2∙n∙L bits –each set bit now corresponds to exactly one word-in-doc pair  Storing the words by each set bit requires n∙L∙log m bits –this can be reduced to n∙L∙log(m/L), because a node at level i need not store the first i bits of the word ids
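The push-up rule behind the lemma can be illustrated for a single document (hypothetical `pushed_up_words` helper; node word ranges along a root-to-leaf path are given as string pairs, which is a simplification of the real tree):

```python
# Sketch of Trick 2 for one document: walking from the root down, each
# node stores the leftmost word of the doc within its word range that
# no ancestor has stored already.
def pushed_up_words(doc_words, path_ranges):
    """doc_words: sorted distinct words of one document.
    path_ranges: (lo, hi) word ranges of the nodes, root first."""
    stored, out = set(), []
    for lo, hi in path_ranges:
        w = next((w for w in doc_words
                  if lo <= w <= hi and w not in stored), None)
        if w is not None:
            stored.add(w)
            out.append(w)
    return out
```

Because each node's range is contained in its parent's, the stored words strictly increase down the path, which is exactly why a nonempty intersection always reports a new word.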

Performance

                # intersections per query    # bits used by index
  NAIVE         K                            n∙L∙log n
  TREE          k∙log K                      ≈ n∙L∙log n∙log m
  +BIT VECTORS  k∙log K + log m              ≈ n∙L∙2∙log m
  +PUSH UP      k + log m                    n∙L∙(2+log m/L)
  OUR GOAL      k                            ≤ n∙L∙log n

  K = # completions overall, k = # completions in hits, n = # documents,
  L = # distinct words per doc (on average), m = # words

  The log m still hurts!

Trick 3: divide into blocks  Divide the word list into blocks of size T, and maintain each block by a TREE with bit vectors, and words pushed up, as developed so far D = … p = me

Trick 3: divide into blocks meals meyers D = … p = me  Divide the word list into blocks of size T, and maintain each block by a TREE with bit vectors, and words pushed up, as developed so far

Trick 3: Complexity  Number of intersections is now Σ_i k_i∙log(T/k_i), summed over the blocks i that are inspected, which is ~ k + log T + K/T  Space requirement is now –n∙L∙2 bits for the doc lists, as before –n∙L∙log T bits for the word lists (can be reduced by saving initial bits, as before) –plus an additional (m/T)∙n bits for the (complete) lists in the root node of each block
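Routing a prefix query to the right blocks is a simple computation on the sorted vocabulary; a minimal sketch (hypothetical function name, using binary search over the word list):

```python
import bisect

# Sketch of Trick 3's routing: the sorted vocabulary is cut into blocks
# of size T; a prefix query only needs to inspect the blocks whose word
# range contains at least one completion of p.
def blocks_to_inspect(vocab_sorted, T, p):
    lo = bisect.bisect_left(vocab_sorted, p)
    hi = bisect.bisect_left(vocab_sorted, p + "\uffff")
    return list(range(lo // T, (hi - 1) // T + 1)) if hi > lo else []
```

Only about K/T blocks are touched beyond the two boundary blocks, which is where the k + log T + K/T intersection bound comes from.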

Performance

                # intersections per query    # bits used by index
  NAIVE         K                            n∙L∙log n
  TREE          k∙log K                      ≈ n∙L∙log n∙log m
  +BIT VECTORS  k∙log K + log m              ≈ n∙L∙2∙log m
  +PUSH UP      k + log m                    n∙L∙(2+log m/L)
  +BLOCKS       k + log T + K/T              n∙L∙(2+log T+m/(LT))
  OUR GOAL      k                            ≤ n∙L∙log n

  K = # completions overall, k = # completions in hits, n = # documents,
  L = # distinct words per doc (on average), m = # terms, T = block size (arbitrary)

Performance

                # intersections per query    # bits used by index
  NAIVE         K                            n∙L∙log n
  TREE          k∙log K                      > 2∙n∙L∙log n
  +BIT VECTORS  k∙log K + log m              ≈ n∙L∙2∙log m
  +PUSH UP      k + log m                    n∙L∙(2+log m/L)
  +BLOCKS       k + log T + K/T              n∙L∙(2+log T+m/(LT))
  OUR GOAL      k                            ≤ n∙L∙log n

  A nice time-space tradeoff + works fine in practice!

Where to go from here …  Theoretically –bring down complexity to the ideal k and ≤ n∙L∙log n  Practically –make software package available (almost done) –combine with other features: proximity done; concepts? XML? –there is a lot of potential in this autocompletion principle → explore it!

Where to go from here …  Theoretically –bring down complexity to the ideal k and ≤ n∙L∙log n  Practically –make software package available (almost done) –combine with other features: proximity done; concepts? XML? –there is a lot of potential in this autocompletion principle → explore it! Thank you !

Another obvious approach  For each word, precompute the set of words with which it co-occurs in some document –needs ~m² bits –query time ~k, but only for two-word queries!
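The precomputation behind this m² approach is a one-pass sweep over the collection; a minimal sketch (hypothetical function name; the toy dict-of-sets stands in for the m²-bit matrix):

```python
# Sketch of the m^2 approach: for every word, the set of words
# co-occurring with it in at least one document.
def cooccurring_words(collection):
    """collection: dict doc id -> set of words."""
    co = {}
    for words in collection.values():
        for w in words:
            co.setdefault(w, set()).update(words)
    return co
```

A two-word query w p* then only needs to filter co[w] by the prefix p, giving the ~k query time, but the approach does not extend beyond two words.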

Trick 2: Space complexity  The total number of bits in all the document lists is exactly 2∙n∙L, i.e., twice the number of word-in-doc pairs, because –each set bit has a word from a particular doc stored by it, and that word-doc pair is stored by no other bit –each zero bit has a set bit as its parent, and its sibling cannot be zero as well  Storing the words by each set bit requires n∙L∙log m bits –this can be reduced to n∙L∙log(m/L), because a node at level i need not store the first i bits of the word ids

Search + autocompletion  While typing the query, show all completions of the last prefix typed that co-occur with the previous words in some document  E.g., having typed "kurt meh", show –mehlhorn, mehlhorns –but none of the (many) other words starting with "meh" that do not co-occur with "kurt", e.g. "mehr"  SHOW DEMO: –kurt.meh –comput*..geometr

Building a tree over the words  For each internal node, precompute the union of the lists of all leaves in the subtree  mehl: 4, 28, 78, 105, … mehlhorn: 5, 17, 51, 79, 102, … mehihorn: 23 mehlhorns: 17, 102, 237, … aachen … zürich … aachen-zürich: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, … mehl-mehlhorns: 4, 5, 17, 23, 28, 51, 78, 79, 102, 105, 237, …

Trick 3: divide into blocks  Divide the word list into blocks of size T, and maintain each block by a TREE+BV+WL as developed so far  Leaves: aachen, algol, …, algorithm, advance, …, zoo, zürich; blocks: aachen-algol, algorithm-advance, …, zoo-zürich

Performance

            # intersections      space in bits
  NAÏVE     K                    n∙L∙log n
  TREE      k∙log(K/k)           ≈ n∙L∙log n∙log m
  IDEAL     k                    n∙L∙log n

  K = all potential completions, k = completions leading to a hit, n = number of documents,
  m = number of terms, L = distinct words per doc, T = free parameter

  Way too much space!

Performance

              # intersections            space in bits
  NAÏVE       K                          n∙L∙log n
  TREE        max(K, k∙log K)            ≈ n∙L∙log n∙log m
  TREE+BV     max(K, k∙log K) + log m    ≈ n∙L∙2∙log m
  TREE+BV+PU  k + log m                  n∙L∙(2+log m/L)
  IDEAL       k                          n∙L∙log n

  K = all potential completions, k = completions leading to a hit, n = number of documents,
  m = number of terms, L = distinct words per doc, T = free parameter

  But the "+ log m" really hurts in practice!

Performance

            # intersections      space in bits
  NAÏVE     K                    n∙L∙log n
  IDEAL     k                    n∙L∙log n

  K = all potential completions, k = completions leading to a hit, n = number of documents,
  m = number of terms, L = distinct words per doc, T = free parameter

Performance

            # intersections          space in bits
  NAÏVE     K                        n∙L∙log n
  TREE      k∙log(K/k)               ≈ n∙L∙log n∙log m
  TREE+BV   k∙log(K/k) + log m       ≈ n∙L∙2∙log m
  IDEAL     k                        n∙L∙log n

  K = all potential completions, k = completions leading to a hit, n = number of documents,
  m = number of terms, L = distinct words per doc, T = free parameter

  Can't we get rid of the log(K/k) factor?

Searching with Autocompletion: cool algorithms for a cool feature. Holger Bast, Max-Planck-Institut für Informatik (MPII), Saarbrücken, Germany. Joint work with Christian Mortensen and Ingmar Weber. MPI Saarbrücken, 6th October 2004.

A straightforward solution  Space for the data structure –n∙L∙log n bits, where n is the number of documents and L is the average number of distinct words per document  Time to process a query –needs K intersections, where K is the number of all completions of p

  mehl:      4, 28, 78, 105, …
  mehlhorn:  5, 17, 51, 79, 102, …
  mehihorn:  23
  mehlhorns: 17, 102, 237, …
  mehltau:   21, 79, 157, …

  Our goal is k, the number of completions in hits (3 in this case)

Searching with autocompletion  While typing a google-like query, show all completions of the beginning of the last word typed that co-occur with the previous words in some document  E.g., having typed kurt mehl* alg, show –algorithm, algorithms, algorithmics –but none of the (many) other words starting with alg that do not occur in hits for kurt mehl*  Useful in a variety of ways: –saves typing & avoids spelling errors –find unexpected variations of words that actually lead to a hit –explore the collection while formulating the query –and more! (end of talk)