
1 Searching with Autocompletion: cool algorithms for a cool feature  Holger Bast, Max-Planck-Institut für Informatik (MPII), Saarbrücken, Germany  joint work with Christian Mortensen and Ingmar Weber  IISc Bangalore, 9th October 2004

2 Searching with autocompletion  While typing a google-like query, show all completions of the beginning of the last word typed that co-occur with the previous words in some document  Useful in a variety of ways:
–saves typing & avoids spelling errors
–find unexpected variations of words that actually lead to a hit
–explore the collection while formulating the query
–and more! (end of talk)

3 A formalization …  Query:
–a set D of documents (= hits of the first part of the query)
–a prefix p (= what has been typed of the last word of the query)
Answer:
–all completions of p that occur somewhere in D
–those documents from D that contain one of these words
Objective:
–process queries as quickly as possible …
–… by a (precomputed) data structure using as little space as possible

4 … and its use for the actual problem  For example, given the user typed kurt mehl* alg:
1. compute all completions of kurt$ (trivially only kurt itself) and the list D1 of documents containing kurt
2. compute all completions of mehl that occur in some doc of D1, and the list D2 of these documents (subset of D1)
3. compute all completions of alg that occur in some doc of D2, and the list D3 of these documents (subset of D2)
Autocompletion and prefix search simultaneously! ($ is a special end-of-word symbol)
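The iteration just described can be written down directly. Here is a minimal Python sketch of the query-processing loop; all names are illustrative (not from the actual system), and complete(prefix, docs) is assumed to behave like the answer function of the formalization on the previous slide.

```python
def process_query(query, all_doc_ids, complete):
    """Process a query like 'kurt mehl* alg' word by word.

    complete(prefix, docs) is assumed to return the pair
    (completions of prefix occurring in docs,
     subset of docs containing at least one of these completions).
    """
    parts = query.split()
    docs = list(all_doc_ids)                 # D_0: every document
    words = []
    for i, part in enumerate(parts):
        if part.endswith("*"):
            prefix = part[:-1]               # explicit prefix search, e.g. mehl*
        elif i < len(parts) - 1:
            prefix = part + "$"              # finished word: append the end-of-word symbol
        else:
            prefix = part                    # last word: complete what has been typed so far
        words, docs = complete(prefix, docs) # D_{i+1} is a subset of D_i
    return words, docs                       # completions of the last prefix + final hit set
```

Each round only ever shrinks the document set, which is what makes autocompletion and prefix search run simultaneously.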

5 A straightforward solution  The NAIVE data structure and algorithm:
–for each word w pre-compute the sorted list of all (ids of) documents containing w (the so-called inverted index)
–for a given set of documents D and prefix p, fetch the list of each of the potential completions of p and intersect it with D
Example (D = 5, 17, 23, 57, 102; p = meh):
mehl: 4, 28, 78, 105, …
mehlhorn: 5, 17, 51, 79, 102, …
mehihorn: 23
mehlhorns: 17, 102, 237, …
mehltau: 21, 79, 157, …

6 A straightforward solution  The NAIVE data structure and algorithm:
–for each word w pre-compute the sorted list of all (ids of) documents containing w (the so-called inverted index)
–for a given set of documents D and prefix p, fetch the list of each of the potential completions of p and intersect it with D
Example (D = 5, 17, 23, 57, 102; p = meh):
mehl: 4, 28, 78, 105, …
mehlhorn: 5, 17, 51, 79, 102, …
mehihorn: 23
mehlhorns: 17, 102, 237, …
mehltau: 21, 79, 157, …
hits = 5, 17, 23, 102

7 A straightforward solution  Space for the data structure:
–n∙L∙log n bits, where n is the number of documents and L is the average number of distinct words per document
Time to process a query:
–needs K intersections, where K is the number of all completions of p
Example (D = 5, 17, 23, 57, 102; p = meh):
mehl: 4, 28, 78, 105, …
mehlhorn: 5, 17, 51, 79, 102, …
mehihorn: 23
mehlhorns: 17, 102, 237, …
mehltau: 21, 79, 157, …
hits = 5, 17, 23, 102
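As a concrete illustration of the NAIVE scheme, a small Python sketch (all names are made up for this example; the doc lists are the truncated ones from the slide): the vocabulary is kept sorted, so the K potential completions of p form a contiguous range, and each of their lists is intersected with D.

```python
from bisect import bisect_left

def naive_complete(prefix, docs, vocab, inverted_index):
    """vocab: sorted list of all words; inverted_index: word -> sorted list of doc ids."""
    docs = set(docs)
    words, hits = [], set()
    start = bisect_left(vocab, prefix)                     # first word >= prefix in the sorted vocabulary
    for word in vocab[start:]:
        if not word.startswith(prefix):
            break                                          # past the K potential completions of p
        common = docs.intersection(inverted_index[word])   # one intersection per potential completion
        if common:
            words.append(word)                             # a completion that actually occurs in D
            hits.update(common)
    return words, sorted(hits)

# example from the slide (lists truncated as shown there)
index = {"mehl": [4, 28, 78, 105], "mehlhorn": [5, 17, 51, 79, 102],
         "mehihorn": [23], "mehlhorns": [17, 102, 237], "mehltau": [21, 79, 157]}
print(naive_complete("meh", [5, 17, 23, 57, 102], sorted(index), index))
# -> (['mehihorn', 'mehlhorn', 'mehlhorns'], [5, 17, 23, 102])
```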

8 Performance
          # intersections per query   # bits used by index
NAIVE     K                           n∙L∙log n
K = # completions overall; n = # documents; L = # distinct words per doc (on average)

9 Performance
          # intersections per query   # bits used by index
NAIVE     K                           n∙L∙log n
OUR GOAL  k                           ≤ n∙L∙log n
K = # completions overall; k = # completions in hits; n = # documents; L = # distinct words per doc (on average)
In our experiments we also measure total list length and, of course, running time per query.

10 What I like about this work  Multi-faceted (and each facet challenging):
–1/3 getting the problem clear
–1/3 algorithms and analysis
–1/3 design and implementation
… and the workflow was not like this but more like this (refers to diagrams on the slide)

11 Performance
          # intersections per query   # bits used by index
NAIVE     K                           n∙L∙log n
OUR GOAL  k                           ≤ n∙L∙log n
K = # completions overall; k = # completions in hits; n = # documents; L = # distinct words per doc (on average)

12 Building a tree over the words  For each internal node, precompute the union of the lists of all leaves in the subtree

13 Building a tree over the words  For each internal node, precompute the union of the lists of all leaves in the subtree. Example (D = 57, 115, 250; p = meh):
root aachen-zürich: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
node mehl-mehlhorns: 4, 5, 17, 23, 28, 51, 78, 79, 102, 105, 237, …
leaves mehl: 4, 28, 78, 105, …; mehlhorn: 5, 17, 51, 102, …; mehihorn: 23; mehlhorns: 17, 102, 237, …; aachen: …; zürich: …

14 Building a tree over the words  For each internal node, precompute the union of the lists of all leaves in the subtree. Example (D = 57, 115, 250; p = meh):
root aachen-zürich: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
node mehl-mehlhorns: 4, 5, 17, 23, 28, 51, 78, 79, 102, 105, 237, … (its intersection with D is empty, so the subtree need not be explored further!)
leaves mehl: 4, 28, 78, 105, …; mehlhorn: 5, 17, 51, 102, …; mehihorn: 23; mehlhorns: 17, 102, 237, …; aachen: …; zürich: …

15 Building a tree over the words  For each internal node, precompute the union of the lists of all leaves in the subtree. Example, now with D = 51, 115, 250; p = meh:
root aachen-zürich: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
node mehl-mehlhorns: 4, 5, 17, 23, 28, 51, 78, 79, 102, 105, 237, … (the intersection with D is non-empty, so the subtree is explored)
leaves mehl: 4, 28, 78, 105, …; mehlhorn: 5, 17, 51, 102, …; mehihorn: 23; mehlhorns: 17, 102, 237, …; aachen: …; zürich: …

16 Building a tree over the words  For each internal node, precompute the union of the lists of all leaves in the subtree. Example, with D = 51, 115, 250; p = meh:
root aachen-zürich: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
node mehl-mehlhorns: 4, 5, 17, 23, 28, 51, 78, 79, 102, 105, 237, …
leaves mehl: 4, 28, 78, 105, …; mehlhorn: 5, 17, 51, 102, …; mehihorn: 23; mehlhorns: 17, 102, 237, …; aachen: …; zürich: …
but up to log K intersections per hit!
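For intuition, here is a minimal Python sketch of the TREE idea with the pruning rule from these slides (illustrative names; doc lists are kept as plain sets here, so none of the space savings of the next slides apply yet): a balanced binary tree over the sorted vocabulary, where a subtree is visited only if its word range still overlaps the prefix range and its doc list intersects D.

```python
from bisect import bisect_left

class Node:
    def __init__(self, lo, hi, docs, left=None, right=None):
        self.lo, self.hi = lo, hi            # range [lo, hi) of word indices covered by this node
        self.docs = docs                     # union of the doc lists of all leaves in the subtree
        self.left, self.right = left, right

def build(vocab, index, lo=0, hi=None):
    hi = len(vocab) if hi is None else hi
    if hi - lo == 1:
        return Node(lo, hi, set(index[vocab[lo]]))
    mid = (lo + hi) // 2
    left, right = build(vocab, index, lo, mid), build(vocab, index, mid, hi)
    return Node(lo, hi, left.docs | right.docs, left, right)

def tree_complete(root, vocab, prefix, D):
    """D: set of doc ids.  Returns {completion: sorted hit list}."""
    p_lo = bisect_left(vocab, prefix)                # prefix range [p_lo, p_hi) in the vocabulary
    p_hi = p_lo
    while p_hi < len(vocab) and vocab[p_hi].startswith(prefix):
        p_hi += 1
    out, stack = {}, [root]
    while stack:
        node = stack.pop()
        if node.hi <= p_lo or node.lo >= p_hi:
            continue                                 # no completion of p below this node
        common = node.docs & D
        if not common:
            continue                                 # empty intersection: subtree need not be explored
        if node.left is None:
            out[vocab[node.lo]] = sorted(common)     # leaf: report the completion and its hits
        else:
            stack.extend((node.left, node.right))
    return out
```

With the slide's example D = {57, 115, 250}, the intersection at the mehl-mehlhorns node is already empty, so none of its leaves is touched; with D = {51, 115, 250} the subtree is entered, at the cost of up to log K intersections per hit.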

17 Performance
               # intersections per query   # bits used by index
NAIVE          K                           n∙L∙log n
TREE           k ∙ log K                   ≈ n∙L∙log n∙log m
OUR GOAL       k                           ≤ n∙L∙log n
K = # completions overall; k = # completions in hits; n = # documents; L = # distinct words per doc (on average); m = # words
way too much space!

18 Trick 1: Relative bit vectors  Store each doc list as a bit vector, relative to its parent:
–the i-th bit of the root node corresponds to the i-th doc
–the i-th bit of any other node corresponds to the i-th set bit of its parent node
Example:
aachen-zürich: 1111111111111…
mehl-zürich: 1001000111101…
mehl-stream: 1001110…

19 Trick 1: Relative bit vectors  Store each doc list as a bit vector, relative to its parent:
–the i-th bit of the root node corresponds to the i-th doc
–the i-th bit of any other node corresponds to the i-th set bit of its parent node
Example (callouts on the slide mark the bits that correspond to doc 5 and doc 10):
aachen-zürich: 1111111111111…
mehl-zürich: 1001000111101…
mehl-stream: 1001110…

20 Trick 1: Complexity  Instead of log n bits, each occurrence of a doc id now accounts for a mere 2 bits:
–wherever there was a whole doc id before, there is now a single bit, set to 1
–if a node has a bit set to 0, then its parent node has the bit for the corresponding doc set to 1, and the sibling's corresponding bit is also set to 1
But we can't start processing in the middle of the tree anymore:
–we first have to travel from the root to the respective node to get the set of doc ids it corresponds to
–this costs up to an additional log m intersections per query!
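A tiny Python sketch of the decoding that this scheme implies (purely illustrative, not the actual memory layout): to know which documents a node's bits refer to, one walks down from the root, keeping at each level exactly the documents whose bit is set.

```python
def docs_of_node(bitvectors_from_root, all_doc_ids):
    """bitvectors_from_root: the relative bit vectors on the path from the root
    down to a node, each as a list of 0/1 bits (root first).
    all_doc_ids: the doc ids in the order of the root's bits.
    Returns the doc ids corresponding to the set bits of the last node on the path."""
    current = list(all_doc_ids)
    for bits in bitvectors_from_root:
        # the i-th bit of this node refers to the i-th document surviving so far
        current = [doc for doc, bit in zip(current, bits) if bit]
    return current
```

This is exactly why processing can no longer start in the middle of the tree: recovering a node's doc ids needs the whole root-to-node path.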

21 Performance
               # intersections per query   # bits used by index
NAIVE          K                           n∙L∙log n
TREE           k ∙ log K                   ≈ n∙L∙log n∙log m
+BIT VECTORS   k ∙ log K + log m           ≈ n∙L∙2∙log m
OUR GOAL       k                           ≤ n∙L∙log n
K = # completions overall; k = # completions in hits; n = # documents; L = # distinct words per doc (on average); m = # words
can we get rid of the log K factor?

22 Trick 2: Push up the words  For each node, by each set bit, store the leftmost word of that doc that is not already stored by a parent node. (Figure: the tree's bit vectors with one word stored by each set bit, e.g. aachen, advance, algol, algorithm, art, mehlhorn, meeting, mehl, mehihorn, mercury, mix, middle.)  Example query: D = 5, 7, 10; p = meh

23 Trick 2: Push up the words  For each node, by each set bit, store the leftmost word of that doc that is not already stored by a parent node. Walking the example query D = 5, 7, 10; p = meh down the tree of the previous slide: D = 5, 10 (→ set-bit positions 2, 5), report: mehihorn; one level further down D = 5, report: Ø → STOP

24 Trick 2: Complexity  Lemma: When processing a node under which all leaves are potential completions, either the intersection is empty and none of the nodes in the subtree needs to be explored, or at least one new word is reported:
–because for two nodes, one a descendant of the other, the lexicographically smallest word reported by the higher node (closer to the root) is strictly smaller than the lexicographically smallest word reported by its descendant
The doc lists now need 2∙n∙L bits:
–each set bit now corresponds to exactly one word-in-doc pair
Storing the words by each set bit requires n∙L∙log m bits:
–this can be reduced to n∙L∙log(m/L), because a node at level i need not store the first i bits of the word ids
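For intuition, a deliberately simplified Python sketch of the reporting step behind the lemma (the relative bit-vector encoding and the word-range pruning of the earlier slides are left out, so this only illustrates correctness, not the time bound; `stored` plays the role of the words pushed up to the set bits of a node).

```python
def pushup_query(node, D, is_completion, out):
    """node.stored   : {doc id: word pushed up to this node for that doc}
       node.children : list of child nodes (empty for a leaf)
       D             : set of doc ids still alive at this node
       is_completion : predicate telling whether a word is a completion of the prefix p
    """
    alive = D & set(node.stored)                     # the intersection computed at this node
    if not alive:
        return                                       # empty: the whole subtree is pruned
    for doc in alive:
        word = node.stored[doc]
        if is_completion(word):
            out.setdefault(word, set()).add(doc)     # report the pair; every word-in-doc pair
                                                     # is stored at exactly one node, so nothing
                                                     # is reported twice
    for child in node.children:
        pushup_query(child, alive, is_completion, out)
```

By the lemma above, on the nodes that lie entirely inside the prefix range a non-empty intersection always yields at least one word not reported higher up, which is what brings the number of intersections down to roughly k.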

25 Performance
               # intersections per query   # bits used by index
NAIVE          K                           n∙L∙log n
TREE           k ∙ log K                   ≈ n∙L∙log n∙log m
+BIT VECTORS   k ∙ log K + log m           ≈ n∙L∙2∙log m
+PUSH UP       k + log m                   n∙L∙(2 + log(m/L))
OUR GOAL       k                           ≤ n∙L∙log n
K = # completions overall; k = # completions in hits; n = # documents; L = # distinct words per doc (on average); m = # words
the log m still hurts!

26 Trick 3: divide into blocks  Divide the word list into blocks of size T, and maintain each block by a TREE with bit vectors, and words pushed up, as developed so far. Example: D = …; p = me

27 Trick 3: divide into blocks  Divide the word list into blocks of size T, and maintain each block by a TREE with bit vectors, and words pushed up, as developed so far. Example: D = …; p = me (the figure highlights the block boundary words meals … meyers)

28 Trick 3: divide into blocks  Divide the word list into blocks of size T, and maintain each block by a TREE with bit vectors, and words pushed up, as developed so far. Example: D = …; p = me (the figure highlights the block boundary words meals … meyers)

29 Trick 3: Complexity  Number of intersections is now
Σ_{blocks i to inspect} (k_i + 1 + log(T/K_i)) ≈ k + log T + K/T
Space requirement is now:
–n∙L∙2 bits for the doc lists, as before
–n∙L∙log T bits for the word lists (can be reduced by saving initial bits, as before)
–plus an additional (m/T)∙n bits for the (complete) lists in the root node of each block
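To see the time-space tradeoff that the block size T controls, here is a small back-of-the-envelope Python helper evaluating the two formulas above; the collection parameters are purely hypothetical, not the experimental figures.

```python
import math

def intersections_per_query(k, K, T):
    """~ k + log T + K/T, the approximation of the sum above."""
    return k + math.log2(T) + K / T

def index_bits(n, L, m, T):
    """n*L*2 bits for the doc lists + n*L*log T for the word lists + (m/T)*n for the block roots."""
    return n * L * (2 + math.log2(T)) + (m / T) * n

# hypothetical collection: 10M documents, 100 distinct words per doc, 1M distinct words
n, L, m = 10_000_000, 100, 1_000_000
for T in (2**8, 2**12, 2**16):
    gib = index_bits(n, L, m, T) / 8 / 2**30
    cost = intersections_per_query(k=10, K=10_000, T=T)
    print(f"T = {T:6d}: ~{cost:6.1f} intersections per query, ~{gib:.1f} GiB index")
```

Small T keeps the log T term low but blows up K/T and the per-block root lists; large T does the opposite, which is the tradeoff summarized on the next slide.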

30 Performance
               # intersections per query   # bits used by index
NAIVE          K                           n∙L∙log n
TREE           k ∙ log K                   ≈ n∙L∙log n∙log m
+BIT VECTORS   k ∙ log K + log m           ≈ n∙L∙2∙log m
+PUSH UP       k + log m                   n∙L∙(2 + log(m/L))
+BLOCKS        k + log T + K/T             n∙L∙(2 + log T + m/(LT))
OUR GOAL       k                           ≤ n∙L∙log n
K = # completions overall; k = # completions in hits; n = # documents; L = # distinct words per doc (on average); m = # terms; T = block size (arbitrary)

31 Performance
               # intersections per query   # bits used by index
NAIVE          K                           n∙L∙log n
TREE           k ∙ log K                   > 2∙n∙L∙log n
+BIT VECTORS   k ∙ log K + log m           ≈ n∙L∙2∙log m
+PUSH UP       k + log m                   n∙L∙(2 + log(m/L))
+BLOCKS        k + log T + K/T             n∙L∙(2 + log T + m/(LT))
OUR GOAL       k                           ≤ n∙L∙log n
nice time-space tradeoff + works fine in practice!

32 Where to go from here …  Theoretically:
–bring down complexity to the ideal k and ≤ n∙L∙log n
Practically:
–make software package available (almost done)
–combine with other features: proximity (done); concepts? XML?
–there is a lot of potential in this autocompletion principle → explore it!

33 Where to go from here …  Theoretically:
–bring down complexity to the ideal k and ≤ n∙L∙log n
Practically:
–make software package available (almost done)
–combine with other features: proximity (done); concepts? XML?
–there is a lot of potential in this autocompletion principle → explore it!
Thank you!

34

35 Another obvious approach  For each word, precompute the set of words with which it co-occurs in some document
–needs ~m² bits
–query time ~k, but only for two-word queries!
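A minimal Python sketch of this alternative (names are illustrative): the co-occurring word set is precomputed per word, which answers two-word queries directly but needs ~m² bits in the worst case and does not extend to longer queries.

```python
from collections import defaultdict

def build_cooccurrence(doc_to_words):
    """doc_to_words: doc id -> set of words occurring in that document."""
    cooc = defaultdict(set)
    for words in doc_to_words.values():
        for w in words:
            cooc[w] |= words              # every word sharing a document with w
    return cooc                           # up to ~m^2 bits of sets overall

def second_word_completions(first_word, prefix, cooc):
    """Completions of prefix that co-occur with first_word in some document (~k time)."""
    return sorted(w for w in cooc.get(first_word, set()) if w.startswith(prefix))
```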

36 Trick 2: Space complexity  The total number of bits in all the document lists is exactly twice n∙L = # word-in-doc pairs, because:
–each set bit has a word from a particular doc stored by it, and that word-in-doc pair is stored by no other bit
–each zero bit has a set bit as its parent, and its sibling cannot be zero as well
Storing the words by each set bit requires n∙L∙log m bits:
–this can be reduced to n∙L∙log(m/L), because a node at level i need not store the first i bits of the word ids

37 Search + autoCompletion  While typing the query, show all completions of the last prefix typed that co-occur with the previous words in some document  E.g., having typed "kurt meh", show:
–mehlhorn, mehlhorns
–but none of the (many) other words starting with "meh" that do not co-occur with "kurt", e.g. "mehr"
SHOW DEMO:
–kurt.meh
–comput*..geometr

38 Building a tree over the words  For each internal node, precompute the union of the lists of all leaves in the subtree:
root aachen-zürich: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
node mehl-mehlhorns: 4, 5, 17, 23, 28, 51, 78, 79, 102, 105, 237, …
leaves mehl: 4, 28, 78, 105, …; mehlhorn: 5, 17, 51, 79, 102, …; mehihorn: 23; mehlhorns: 17, 102, 237, …; aachen: …; zürich: …

39 Trick 3: divide into blocks  Divide the word list into blocks of size T, and maintain each block by a TREE+BV+WL as developed so far. (Figure: blocks with root nodes aachen-algol, algorithm-advance and zoo-zürich over the word list aachen, algol, …, algorithm, advance, …, zoo, zürich.)

40 Performance
         # intersections   space in bits
NAÏVE    K                 n∙L∙log n
TREE     k ∙ log(K/k)      ≈ n∙L∙log n∙log m
IDEAL    k                 n∙L∙log n
K = all potential completions; k = completions leading to a hit; n = number of documents; m = number of terms; L = distinct words per doc; T = free parameter
way too much space!

41 Performance
            # intersections           space in bits
NAÏVE       K                         n∙L∙log n
TREE        max(K, k∙log K)           ≈ n∙L∙log n∙log m
TREE+BV     max(K, k∙log K) + log m   ≈ n∙L∙2∙log m
TREE+BV+PU  k + log m                 n∙L∙(2 + log(m/L))
IDEAL       k                         n∙L∙log n
K = all potential completions; k = completions leading to a hit; n = number of documents; m = number of terms; L = distinct words per doc; T = free parameter
But the "+ log m" really hurts in practice!

42 Performance
         # intersections   space in bits
NAÏVE    K                 n∙L∙log n
IDEAL    k                 n∙L∙log n
K = all potential completions; k = completions leading to a hit; n = number of documents; m = number of terms; L = distinct words per doc; T = free parameter

43 Performance
          # intersections        space in bits
NAÏVE     K                      n∙L∙log n
TREE      k∙log(K/k)             ≈ n∙L∙log n∙log m
TREE+BV   k∙log(K/k) + log m     ≈ n∙L∙2∙log m
IDEAL     k                      n∙L∙log n
K = all potential completions; k = completions leading to a hit; n = number of documents; m = number of terms; L = distinct words per doc; T = free parameter
can't we get rid of the log(K/k) factor?

44 Searching with Autocompletion: cool algorithms for a cool feature  Holger Bast, Max-Planck-Institut für Informatik (MPII), Saarbrücken, Germany  joint work with Christian Mortensen and Ingmar Weber  MPI Saarbrücken, 6th October 2004

45 A straightforward solution  Space for the data structure:
–n∙L∙log n bits, where n is the number of documents and L is the average number of distinct words per document
Time to process a query:
–needs K intersections, where K is the number of all completions of p
mehl: 4, 28, 78, 105, …
mehlhorn: 5, 17, 51, 79, 102, …
mehihorn: 23
mehlhorns: 17, 102, 237, …
mehltau: 21, 79, 157, …
Our goal is k, the number of completions in hits (3 in this case)

46 Searching with autocompletion  While typing a google-like query, show all completions of the beginning of the last word typed that co-occur with the previous words in some document  E.g., having typed kurt mehl* alg, show:
–algorithm, algorithms, algorithmics
–but none of the (many) other words starting with alg that do not occur in hits for kurt mehl*
Useful in a variety of ways:
–saves typing & avoids spelling errors
–find unexpected variations of words that actually lead to a hit
–explore the collection while formulating the query
–and more! (end of talk)

