Published by Shanon Pope. Modified over 8 years ago.
1
Searching with Autocompletion
cool algorithms for a cool feature
Holger Bast
Max-Planck-Institut für Informatik (MPII), Saarbrücken, Germany
joint work with Christian Mortensen and Ingmar Weber
IISc Bangalore, 9th October 2004
2
Searching with autocompletion
While typing a Google-like query, show all completions of the beginning of the last word typed that co-occur with the previous words in some document
Useful in a variety of ways:
–saves typing & avoids spelling errors
–finds unexpected variations of words that actually lead to a hit
–lets you explore the collection while formulating the query
–and more! (end of talk)
3
A formalization …
Query
–a set D of documents (= hits of the first part of the query)
–a prefix p (= what has been typed of the last word of the query)
Answer
–all completions of p that occur somewhere in D
–those documents from D that contain one of these words
Objective
–process queries as quickly as possible …
–… by a (precomputed) data structure using as little space as possible
4
… and its use for the actual problem
For example, given the user typed kurt mehl* alg
1. compute all completions of kurt$ ($ is a special end-of-word symbol) → trivially only kurt itself
   and the list D1 of documents containing kurt
2. compute all completions of mehl that occur in some doc of D1
   and the list D2 of these documents (subset of D1)
3. compute all completions of alg that occur in some doc of D2
   and the list D3 of these documents (subset of D2)
Autocompletion and prefix search simultaneously!
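The three steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `complete` and `process_query`, the brute-force scan over all words, and the toy index are made up for this example and are not taken from the talk. Words are stored with the end-of-word symbol `$` appended, so that `kurt$` matches only the word kurt itself.

```python
END = "$"  # special end-of-word symbol, as on the slide

def complete(index, docs, prefix):
    """Brute force: all completions of prefix occurring in docs, plus the hits."""
    completions, hits = [], set()
    for word in sorted(index):
        if word.startswith(prefix):
            common = docs & index[word]
            if common:
                completions.append(word)
                hits |= common
    return completions, hits

def process_query(index, all_docs, query):
    """Process the query word by word; each step narrows the doc set D."""
    parts = query.split()
    docs, completions = set(all_docs), []
    for i, part in enumerate(parts):
        if part.endswith("*"):
            prefix = part[:-1]          # explicit prefix search
        elif i == len(parts) - 1:
            prefix = part               # last word: still being typed
        else:
            prefix = part + END         # completed word: exact match
        completions, docs = complete(index, docs, prefix)
    return [w[:-1] for w in completions], docs

# Hypothetical toy collection (word$ -> set of doc ids).
index = {
    "kurt$": {1, 2, 3},
    "mehl$": {5},
    "mehlhorn$": {1, 2},
    "algebra$": {4},
    "algorithm$": {1},
    "algorithms$": {2, 3},
}
all_docs = {1, 2, 3, 4, 5}
completions, hits = process_query(index, all_docs, "kurt mehl* alg")
print(completions, hits)
```

Each step only ever shrinks the document set, which is why D2 is a subset of D1 and D3 a subset of D2.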
6
A straightforward solution
The NAIVE data structure and algorithm:
–for each word w, precompute the sorted list of all (ids of) documents containing w (the so-called inverted index)
–for a given set of documents D and prefix p, fetch the list of each of the potential completions of p and intersect it with D
Example:
  mehl        4, 28, 78, 105, …
  mehlhorn    5, 17, 51, 79, 102, …
  mehihorn    23
  mehlhorns   17, 102, 237, …
  mehltau     21, 79, 157, …
  D = 5, 17, 23, 57, 102    p = meh
  hits = 5, 17, 23, 102
7
A straightforward solution
Space for data structure
–n∙L∙log n bits, where n is the number of documents and L is the average number of distinct words per document
Time to process a query
–needs K intersections, where K is the number of all completions of p
Example:
  mehl        4, 28, 78, 105, …
  mehlhorn    5, 17, 51, 79, 102, …
  mehihorn    23
  mehlhorns   17, 102, 237, …
  mehltau     21, 79, 157, …
  D = 5, 17, 23, 57, 102    p = meh
  hits = 5, 17, 23, 102
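A minimal sketch of NAIVE, assuming the inverted index is a plain Python dict from word to sorted doc-id list. The function names are made up for this illustration, and the lists contain only the ids visible on the slide (the elided "…" parts are left out). The counter shows that the number of intersections equals K, the number of potential completions, even though only k of them lead to a hit.

```python
def intersect(a, b):
    """Merge-style intersection of two sorted lists of doc ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def naive_complete(index, docs, prefix):
    """One intersection per potential completion of prefix."""
    completions, hits, n_intersections = [], set(), 0
    for word in sorted(index):
        if word.startswith(prefix):
            n_intersections += 1
            common = intersect(index[word], docs)
            if common:
                completions.append(word)
                hits.update(common)
    return completions, sorted(hits), n_intersections

# Lists as far as they are visible on the slide.
index = {
    "mehl": [4, 28, 78, 105],
    "mehlhorn": [5, 17, 51, 79, 102],
    "mehihorn": [23],
    "mehlhorns": [17, 102, 237],
    "mehltau": [21, 79, 157],
}
D = [5, 17, 23, 57, 102]
completions, hits, K = naive_complete(index, D, "meh")
print(completions, hits, K)
```

On this data K = 5 intersections are performed, but only k = 3 completions (mehihorn, mehlhorn, mehlhorns) occur in the hits.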
9
Performance
                # intersections per query   # bits used by index
NAIVE           K                           n∙L∙log n
OUR GOAL        k                           ≤ n∙L∙log n
K = # completions overall
k = # completions in hits
n = # documents
L = # distinct words per doc (on average)
In our experiments we also measure total list length and, of course, running time per query
10
What I like about this work
Multi-faceted (and each facet challenging)
–1/3 getting the problem clear
–1/3 algorithms and analysis
–1/3 design and implementation
and the workflow was not like this [figure omitted] but more like this [figure omitted]
14
Building a tree over the words
For each internal node, precompute the union of the lists of all leaves in the subtree
Leaves:
  aachen      …
  mehl        4,28,78,105,…
  mehlhorn    5,17,51,79,102,…
  mehihorn    23
  mehlhorns   17,102,237,…
  zürich      …
Internal nodes:
  aachen-zürich    1,2,3,4,5,6,7,8,9,10,11,12,13,…
  mehl-mehlhorns   4,5,17,23,28,51,78,79,102,105,237,…
D = 57, 115, 250    p = meh
→ subtree need not be explored further!
16
Building a tree over the words
For each internal node, precompute the union of the lists of all leaves in the subtree
Leaves:
  aachen      …
  mehl        4,28,78,105,…
  mehlhorn    5,17,51,79,102,…
  mehihorn    23
  mehlhorns   17,102,237,…
  zürich      …
Internal nodes:
  aachen-zürich    1,2,3,4,5,6,7,8,9,10,11,12,13,…
  mehl-mehlhorns   4,5,17,23,28,51,78,79,102,105,237,…
D = 51, 115, 250    p = meh
→ but up to log K intersections per hit!
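The pruning idea can be sketched as follows. This is a simplified illustration, not the implementation from the talk: it assumes a balanced binary tree over the sorted word list, stores the unions as Python sets, and uses hypothetical doc lists for aachen and zürich (they are elided on the slide). The names `Node` and `tree_complete` are made up for this example.

```python
class Node:
    """Binary tree over the sorted words; each node stores the union of
    the doc lists of all leaves (words) in its subtree."""
    def __init__(self, words, index):
        self.lo, self.hi = words[0], words[-1]
        if len(words) == 1:
            self.left = self.right = None
            self.docs = set(index[words[0]])
        else:
            mid = len(words) // 2
            self.left = Node(words[:mid], index)
            self.right = Node(words[mid:], index)
            self.docs = self.left.docs | self.right.docs

def tree_complete(node, docs, prefix, stats):
    # Skip subtrees whose word range cannot contain a completion.
    if node.hi < prefix or (node.lo > prefix and not node.lo.startswith(prefix)):
        return []
    stats["intersections"] += 1
    common = docs & node.docs
    if not common:
        return []                      # subtree need not be explored further
    if node.left is None:
        return [node.lo]               # a leaf that leads to a hit
    return (tree_complete(node.left, common, prefix, stats)
            + tree_complete(node.right, common, prefix, stats))

index = {
    "aachen": [1, 2, 3],               # hypothetical list
    "mehl": [4, 28, 78, 105],
    "mehlhorn": [5, 17, 51, 79, 102],
    "mehihorn": [23],
    "mehlhorns": [17, 102, 237],
    "zürich": [6, 7],                  # hypothetical list
}
root = Node(sorted(index), index)

stats_a = {"intersections": 0}
result_a = tree_complete(root, {57, 115, 250}, "meh", stats_a)
stats_b = {"intersections": 0}
result_b = tree_complete(root, {51, 115, 250}, "meh", stats_b)
print(result_a, stats_a, result_b, stats_b)
```

In the first query a single intersection suffices to rule out every completion. In the second query the path down to the leaf mehlhorn has to be followed, which is where the "up to log K intersections per hit" comes from.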
17
Performance
                # intersections per query   # bits used by index
NAIVE           K                           n∙L∙log n
TREE            k ∙ log K                   ≈ n∙L∙log n∙log m   (way too much space!)
OUR GOAL        k                           ≤ n∙L∙log n
K = # completions overall
k = # completions in hits
n = # documents
m = # words
L = # distinct words per doc (on average)
19
Trick 1: Relative bit vectors
Store each doc list as a bit vector, relative to its parent
–the i-th bit of the root node corresponds to the i-th doc
–the i-th bit of any other node corresponds to the i-th set bit of its parent node
  aachen-zürich   1111111111111…
  mehl-zürich     1001000111101…
  mehl-stream     1001110…
[figure annotations: one marked bit corresponds to doc 5, another to doc 10]
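A small sketch of how a relative bit vector is encoded and decoded, using only the bits visible on the slide (the parts elided by "…" are ignored) and assuming the root's bits stand for docs 1 to 13. The helper names `to_relative` and `from_relative` are made up for this illustration.

```python
def to_relative(parent_docs, child_docs):
    """Bit i is 1 iff the i-th doc of the parent also occurs in the child."""
    child = set(child_docs)
    return [1 if d in child else 0 for d in parent_docs]

def from_relative(parent_docs, bits):
    """Recover the child's doc ids; this needs the parent's doc ids first."""
    return [d for d, b in zip(parent_docs, bits) if b]

root_docs = list(range(1, 14))                       # aachen-zürich: 1111111111111
mid_bits = [int(c) for c in "1001000111101"]         # mehl-zürich
leaf_bits = [int(c) for c in "1001110"]              # mehl-stream

mid_docs = from_relative(root_docs, mid_bits)
leaf_docs = from_relative(mid_docs, leaf_bits)
print(mid_docs)    # doc ids of mehl-zürich
print(leaf_docs)   # doc ids of mehl-stream
```

Under these assumptions the 5th bit of mehl-zürich stands for doc 5, while the 5th bit of mehl-stream stands for the 5th set bit of its parent, which is doc 10. Decoding a node requires the doc ids of its parent, which is why processing has to start at the root.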
20
Trick 1: Complexity
Instead of log n bits, each occurrence of a doc id now accounts for a mere 2 bits
–wherever there was a whole doc id before, there is now a single bit, set to 1
–if a node has a bit set to 0, then its parent node has the bit for the corresponding doc set to 1, and the other sibling has the corresponding bit set to 1
But we can't start processing in the middle of the tree anymore
–we first have to travel from the root to the respective node to get the set of doc ids it corresponds to
–this costs up to an additional log m intersections per query!
21
Performance
                # intersections per query   # bits used by index
NAIVE           K                           n∙L∙log n
TREE            k ∙ log K                   ≈ n∙L∙log n∙log m
+BIT VECTORS    k ∙ log K + log m           ≈ n∙L∙2∙log m
OUR GOAL        k                           ≤ n∙L∙log n
K = # completions overall
k = # completions in hits
n = # documents
m = # words
L = # distinct words per doc (on average)
Can we get rid of the log K factor?
23
Trick 2: Push up the words
For each node, by each set bit, store the leftmost word of that doc that is not already stored by a parent node
[figure: three nodes with relative bit vectors
  1 1 1 1 1 1 1 1 1 1 …
  1 0 0 0 1 0 0 1 1 1 …
  1 0 0 1 1 …
and, stored by the set bits, the words: aachen advance algol algorithm advance aachen art advance mehlhorn meeting mehl mehihorn mercury mix middle]
Query steps shown in the figure:
–D = 5, 7, 10    p = meh
–D = 5, 10 ( → 2, 5)    report: mehihorn
–D = 5    report: Ø → STOP
24
Trick 2: Complexity
Lemma: When processing a node all of whose leaves are potential completions, either the intersection is empty and none of the nodes in the subtree needs to be explored, or at least one new word is reported
–because for two nodes, where one is a descendant of the other, the lexicographically smallest word reported by the older node (higher up in the tree) is strictly smaller than the lexicographically smallest word reported by its descendant
The doc lists now need 2∙n∙L bits
–each set bit now corresponds to exactly one word-in-doc pair
Storing the words by each set bit requires n∙L∙log m bits
–can be reduced to n∙L∙log(m/L) because a node at level i need not store the first i bits of word ids
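The push-up idea can be illustrated with a simplified sketch. Assumptions: explicit doc ids in a dict stand in for the relative bit vectors, the tree is a balanced binary tree over the sorted vocabulary, and the four-document collection is invented for this example. The names `PushUpNode`, `report` and `count_pairs` are hypothetical. Each node stores, for every doc that still has an unstored word in the node's range, the smallest such word.

```python
class PushUpNode:
    def __init__(self, words, pairs):
        # words: sorted slice of the vocabulary covered by this subtree
        # pairs: doc id -> sorted words of that doc inside this slice
        #        that no ancestor has stored yet
        self.lo, self.hi = words[0], words[-1]
        self.stored = {d: ws[0] for d, ws in pairs.items()}  # push up the leftmost word
        self.children = []
        if len(words) > 1:
            mid = len(words) // 2
            for part in (words[:mid], words[mid:]):
                inside = set(part)
                rest = {d: [w for w in ws[1:] if w in inside] for d, ws in pairs.items()}
                rest = {d: ws for d, ws in rest.items() if ws}
                if rest:
                    self.children.append(PushUpNode(part, rest))

def report(node, docs, prefix, found):
    """Collect all completions of prefix that occur in some doc of docs."""
    if node.hi < prefix or (node.lo > prefix and not node.lo.startswith(prefix)):
        return                          # range contains no completion
    common = {d for d in docs if d in node.stored}
    if not common:
        return                          # empty intersection: stop here
    for d in common:
        if node.stored[d].startswith(prefix):
            found.add(node.stored[d])
    for child in node.children:
        report(child, common, prefix, found)

def count_pairs(node):
    """Number of stored word-in-doc pairs (one per set bit)."""
    return len(node.stored) + sum(count_pairs(c) for c in node.children)

# Hypothetical collection: doc id -> distinct words.
doc_words = {
    1: ["aachen", "mehlhorn"],
    2: ["algol", "mehl"],
    3: ["mehihorn", "zoo"],
    4: ["aachen", "algol", "zoo"],
}
vocab = sorted({w for ws in doc_words.values() for w in ws})
root = PushUpNode(vocab, {d: sorted(ws) for d, ws in doc_words.items()})
found = set()
report(root, {1, 3}, "meh", found)
print(sorted(found), count_pairs(root))
```

In this toy collection there are 9 word-in-doc pairs and exactly 9 stored entries, matching the claim that each set bit corresponds to exactly one word-in-doc pair.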
25
Performance
                # intersections per query   # bits used by index
NAIVE           K                           n∙L∙log n
TREE            k ∙ log K                   ≈ n∙L∙log n∙log m
+BIT VECTORS    k ∙ log K + log m           ≈ n∙L∙2∙log m
+PUSH UP        k + log m                   n∙L∙(2 + log(m/L))
OUR GOAL        k                           ≤ n∙L∙log n
K = # completions overall
k = # completions in hits
n = # documents
m = # words
L = # distinct words per doc (on average)
The log m still hurts!
27
Trick 3: divide into blocks
Divide the word list into blocks of size T, and maintain each block by a TREE with bit vectors, and words pushed up, as developed so far
[figure: the word list divided into blocks; words marked: meals, meyers; D = …, p = me]
29
Trick 3: Complexity
Number of intersections is now
  Σ over blocks i to inspect of ( k_i + 1 + log(T/K_i) ) ~ k + log T + K/T
Space requirement is now
–n∙L∙2 bits for the doc lists as before
–n∙L∙log T bits for the word lists (can be reduced by saving initial bits, as before)
–plus an additional (m/T)∙n bits for the (complete) lists in the root node of each block
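To get a feeling for the tradeoff, the two bounds above can simply be evaluated for different block sizes T. The sketch below plugs numbers into k + log T + K/T and n∙L∙(2 + log T) + (m/T)∙n; the function name `blocks_cost` and all parameter values are made up for illustration and are not measurements from the talk.

```python
from math import log2

def blocks_cost(n, L, m, K, k, T):
    """Evaluate the +BLOCKS bounds for block size T.
    Returns (approx. # intersections per query, approx. # bits of the index)."""
    intersections = k + log2(T) + K / T
    bits = n * L * (2 + log2(T)) + (m / T) * n
    return intersections, bits

# Hypothetical collection parameters.
n, L, m = 1000, 100, 102400     # documents, distinct words per doc, words
K, k = 2048, 3                  # potential completions, completions in hits

for T in (64, 1024, 16384):
    intersections, bits = blocks_cost(n, L, m, K, k, T)
    print(T, intersections, bits)
```

Small blocks keep the log T term low but pay in the K/T term and in the extra root lists; large blocks do the opposite.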
30
Performance
                # intersections per query   # bits used by index
NAIVE           K                           n∙L∙log n
TREE            k ∙ log K                   ≈ n∙L∙log n∙log m
+BIT VECTORS    k ∙ log K + log m           ≈ n∙L∙2∙log m
+PUSH UP        k + log m                   n∙L∙(2 + log(m/L))
+BLOCKS         k + log T + K/T             n∙L∙(2 + log T + m/(L∙T))
OUR GOAL        k                           ≤ n∙L∙log n
K = # completions overall
k = # completions in hits
n = # documents
m = # terms
L = # distinct words per doc
T = block size (arbitrary)
31
Performance
                # intersections per query   # bits used by index
NAIVE           K                           n∙L∙log n
TREE            k ∙ log K                   > 2∙n∙L∙log n
+BIT VECTORS    k ∙ log K + log m           ≈ n∙L∙2∙log m
+PUSH UP        k + log m                   n∙L∙(2 + log(m/L))
+BLOCKS         k + log T + K/T             n∙L∙(2 + log T + m/(L∙T))
OUR GOAL        k                           ≤ n∙L∙log n
Nice time-space tradeoff + works fine in practice!
33
Where to go from here …
Theoretically
–bring down complexity to the ideal k and ≤ n∙L∙log n
Practically
–make software package available (almost done)
–combine with other features: proximity done; concepts? XML?
–there is a lot of potential in this autocompletion principle → explore it!
Thank you!
35
Another obvious approach
For each word, precompute the set of words with which it co-occurs in some document
–needs ~m^2 bits
–query time ~k only for two-word queries!
36
Trick 2: Space complexity
The total number of bits in all the document lists is exactly twice n∙L = # word-in-doc pairs, because
–each set bit has a word from a particular doc stored by it, and that word-doc pair is stored by no other bit
–each zero bit has a set bit as its parent, and its sibling cannot be zero as well
Storing the words by each set bit requires n∙L∙log m bits
–can be reduced to n∙L∙log(m/L) because a node at level i need not store the first i bits of word ids
37
Search + autoCompletion
While typing the query, show all completions of the last prefix typed that co-occur with the previous words in some document
E.g. having typed "kurt meh", show
–mehlhorn, mehlhorns
–but none of the (many) other words that start with "meh" but do not co-occur with "kurt", e.g. "mehr"
SHOW DEMO:
–kurt.meh
–comput*..geometr
38
Building a tree over the words
For each internal node, precompute the union of the lists of all leaves in the subtree
Leaves:
  aachen      …
  mehl        4,28,78,105,…
  mehlhorn    5,17,51,79,102,…
  mehihorn    23
  mehlhorns   17,102,237,…
  zürich      …
Internal nodes:
  aachen-zürich    1,2,3,4,5,6,7,8,9,10,11,12,13,…
  mehl-mehlhorns   4,5,17,23,28,51,78,79,102,105,237,…
39
Trick 3: divide into blocks
Divide the word list into blocks of size T, and maintain each block by a TREE+BV+WL as developed so far
[figure: word list aachen … algol, algorithm … advance, …, zoo … zürich, divided into the blocks aachen-algol, algorithm-advance, …, zoo-zürich]
40
Performance
          # intersections    space in bits
NAÏVE     K                  n∙L∙log n
TREE      k ∙ log(K/k)       ≈ n∙L∙log n∙log m   (way too much space!)
IDEAL     k                  n∙L∙log n
K = all potential completions
k = completions leading to a hit
n = number of documents
m = number of terms
L = distinct words per doc
T = free parameter
41
Performance
              # intersections             space in bits
NAÏVE         K                           n∙L∙log n
TREE          max(K, k∙log K)             ≈ n∙L∙log n∙log m
TREE+BV       max(K, k∙log K) + log m     ≈ n∙L∙2∙log m
TREE+BV+PU    k + log m                   n∙L∙(2 + log(m/L))
IDEAL         k                           n∙L∙log n
K = all potential completions
k = completions leading to a hit
n = number of documents
m = number of terms
L = distinct words per doc
T = free parameter
But the "+ log m" really hurts in practice!
42
Performance
          # intersections    space in bits
NAÏVE     K                  n∙L∙log n
IDEAL     k                  n∙L∙log n
K = all potential completions
k = completions leading to a hit
n = number of documents
m = number of terms
L = distinct words per doc
T = free parameter
43
Performance
           # intersections          space in bits
NAÏVE      K                        n∙L∙log n
TREE       k∙log(K/k)               ≈ n∙L∙log n∙log m
TREE+BV    k∙log(K/k) + log m       ≈ n∙L∙2∙log m
IDEAL      k                        n∙L∙log n
K = all potential completions
k = completions leading to a hit
n = number of documents
m = number of terms
L = distinct words per doc
T = free parameter
Can't we get rid of the log(K/k) factor?
44
Searching with Autocompletion
cool algorithms for a cool feature
Holger Bast
Max-Planck-Institut für Informatik (MPII), Saarbrücken, Germany
joint work with Christian Mortensen and Ingmar Weber
MPI Saarbrücken, 6th October 2004
45
A straightforward solution
Space for data structure
–n∙L∙log n bits, where n is the number of documents and L is the average number of distinct words per document
Time to process a query
–needs K intersections, where K is the number of all completions of p
Example:
  mehl        4, 28, 78, 105, …
  mehlhorn    5, 17, 51, 79, 102, …
  mehihorn    23
  mehlhorns   17, 102, 237, …
  mehltau     21, 79, 157, …
Our goal is k, the number of completions in hits (3 in this case)
46
Searching with autocompletion
While typing a Google-like query, show all completions of the beginning of the last word typed that co-occur with the previous words in some document
E.g., having typed kurt mehl* alg, show
–algorithm, algorithms, algorithmics
–but none of the (many) other words that start with alg but do not occur in hits for kurt mehl*
Useful in a variety of ways:
–saves typing & avoids spelling errors
–finds unexpected variations of words that actually lead to a hit
–lets you explore the collection while formulating the query
–and more! (end of talk)