Fast search for similar words
Klaus U. Schulz, CIS, LMU Munich. Joint work with Stoyan Mihov, Bulgarian Academy of Sciences.
Problem
Given a large lexicon D and possibly erroneous input tokens, for each input token v, efficiently find the "most similar" entries w of the lexicon.
Text correction. Input: text with orthographic errors (e.g., from OCR). For each non-lexical token v, find good correction suggestions in the lexicon.
Correction of search queries. "Lexicon": a list of known search queries. Input: new queries v, possibly with misspellings. If v is not known, check whether there are similar known queries.
Notions of similarity for words
Levenshtein distance (edit distance) between two words v, w: the minimal number of letter deletions, insertions, or substitutions needed to rewrite v into w. E.g., d(chold, child) = 1, d(cold, child) = 2.
Variants:
1. Allow other "edit operations": transpositions of letters, merging two letters, splitting one letter into two, ...
2. Assign non-uniform "costs" to the operations, depending on the symbols used. E.g., cost(I,i) = 0.35, cost(I,m) = 1, cost(m,in) = 0.22.
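Such symbol-dependent costs plug directly into the dynamic program shown on the following slides. A minimal sketch, assuming a hypothetical confusion table built from the example values above; a merge/split cost such as cost(m,in) = 0.22 would additionally need recurrence cases that consume two symbols at once:

```python
# Hypothetical substitution-cost table from the slide's example values.
# Symmetric lookup; unlisted pairs fall back to the uniform cost 1.
CONFUSION_COST = {("I", "i"): 0.35, ("I", "m"): 1.0}

def sub_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return CONFUSION_COST.get((a, b), CONFUSION_COST.get((b, a), 1.0))
```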
Formalized problem
Given a large lexicon D and a small bound n for the Levenshtein distance d, for each input token v, efficiently find all entries w of D such that d(v,w) ≤ n. (Variants of the Levenshtein distance are considered later.)
Structure of talk
1. An efficient solution for the "Decision Problem": given two words v, w, decide whether d(v,w) ≤ n.
2. An efficient solution for similarity search in very large lexica.
3. A refinement with substantially improved efficiency.
Decision Problem: standard solution
Wagner-Fischer method: compute the Levenshtein distance d(v,w) via dynamic programming, e.g. for v = ananas, w = arkansas:

         a  r  k  a  n  s  a  s
      0  1  2  3  4  5  6  7  8
   a  1  0  1  2  3  4  5  6  7
   n  2  1  1  2  3  3  4  5  6
   a  3  2  2  2  2  3  4  4  5
   n  4  3  3  3  3  2  3  4  5
   a  5  4  4  4  3  3  3  3  4
   s  6  5  5  5  4  4  3  4  3

The bottom-right entry gives d(ananas, arkansas) = 3.
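A minimal sketch of the Wagner-Fischer dynamic program (single-row variant, not the authors' code); for the weighted variant, replace the unit substitution cost with a function like sub_cost above:

```python
def levenshtein(v: str, w: str) -> int:
    """Levenshtein distance d(v, w) by dynamic programming."""
    # row[j] holds d(v[:i], w[:j]) for the row currently being built
    row = list(range(len(w) + 1))
    for i, vi in enumerate(v, start=1):
        prev_diag, row[0] = row[0], i
        for j, wj in enumerate(w, start=1):
            cost = 0 if vi == wj else 1
            prev_diag, row[j] = row[j], min(
                row[j] + 1,        # delete vi
                row[j - 1] + 1,    # insert wj
                prev_diag + cost,  # substitute (or match)
            )
    return row[-1]

assert levenshtein("ananas", "arkansas") == 3
assert levenshtein("chold", "child") == 1
```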
Decision Problem: nondeterministic "Levenshtein" automata
For each input token v, use a nondeterministic automaton A_n(v) that recognizes exactly the words w with d(v,w) ≤ n. (Example on the slide: v = chold, n = 2; automaton diagram not reproduced.)
Problems:
1. Nondeterminism of the automata.
2. A special automaton is needed for each garbled input word v.
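Before making things deterministic and universal, A_n(v) can be simulated directly. A minimal sketch (a baseline the following slides improve on, not the paper's construction): active states are pairs (i, e), read "i characters of v consumed with e errors"; deletions in v are the epsilon moves:

```python
def nfa_accepts(v: str, w: str, n: int) -> bool:
    """Simulate the nondeterministic Levenshtein automaton A_n(v) on w."""
    def closure(states):
        # epsilon transitions: delete a character of v without reading input
        stack, seen = list(states), set(states)
        while stack:
            i, e = stack.pop()
            if e < n and i < len(v) and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                stack.append((i + 1, e + 1))
        return seen

    active = closure({(0, 0)})
    for c in w:
        nxt = set()
        for i, e in active:
            if i < len(v) and v[i] == c:
                nxt.add((i + 1, e))          # match
            if e < n:
                nxt.add((i, e + 1))          # insert c into v
                if i < len(v):
                    nxt.add((i + 1, e + 1))  # substitute v[i] by c
        active = closure(nxt)
        if not active:
            return False
    return any(i == len(v) for i, e in active)  # all of v consumed

assert nfa_accepts("chold", "child", 2)
```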
Triangular areas
Given the automaton A_n, after reading a partial input u of length k, the set of active states is always a subset of the k-th "triangular area". (The slides illustrate this for the empty input and for the inputs "h", "hc", "hch", "hchi"; diagrams not reproduced.)
Input behaviour: finding the next set of active states
Given the current set of active states and an input symbol σ, the next set of active states depends only on the distribution of σ in a relevant subword of the pattern "chold". The slides step through this (automaton diagrams not reproduced):
- Empty input, next letter h: relevant subword c h o.
- Input "h", next letter c: relevant subword c h o l.
- Input "hc", next letter h: relevant subword c h o l d.
- Input "hch", next letter i: relevant subword h o l d.
Summing up
1. Introduce a new finite set of "universal" states: subsets of the triangular area (plus a second type of universal final states). This yields a universal set of states that does not depend on the input word v.
2. Introduce a universal transition function that depends only on the distribution of "red" and "green" horizontal transitions; encode red/green as a bitvector over 0/1, e.g. 0 0 1 1 0.
Final states
For n = 2 there are five final triangular areas T1, ..., T5. The characteristic vectors refer to "hold" in T1, to "old" in T2, to "ld" in T3, and to "d" in T4; in T5 no transitions are possible, so no vector is needed.
Since the characteristic input vectors shrink by one position per step near the pattern end, the length of the vector "tells us" in which triangular area we are: e.g., for input vector 01 we are in T3.
First idea: for each set of active states in the complete region T1-T5 that fits into one triangular area Ti, we introduce a new state of the universal automaton. A derived state is final iff it contains a final (red) state of the underlying nondeterministic automaton.
Example of a derived final state (cf. T2): using M as a parameter for the length of the pattern, the blue universal state is {(M-3)^2, (M-1)^2, M^1, M^2}, over the positions M-3, M-2, M-1, M (the superscript is the error count).
Example of a derived non-final state (cf. T1).
Observation: for the derived non-final states we can reuse the universal states we already have; the length of the input vectors then gives us the position. Example: a blue state corresponds to {(I-2)^2, (I-1)^2, I^1, I^2, (I+1)^2} for input vectors of length 4 (where I is M-2).
Example: a blue state corresponds to (I-2)^2 for input vectors of length 1, (I-1)^2 for length 2, I^2 for length 3, (I+1)^2 for length 4, and (I+2)^2 for length 5.
Summary
New final states: sets of active states that contain a final state of the underlying nondeterministic automaton. By using the abstract parameter symbol "M" for the pattern length, these states are "universal" as well. The earlier universal states, with abstract parameter symbol "I", obtain new transitions for the shorter characteristic vectors.
Summing up
We obtain a universal and deterministic automaton for each given bound n. Given an input word v and a second word w, each letter σ of w is translated into a bitvector encoding the occurrences of σ in the relevant subword of v. For bound n the bitvectors have length 2n+2; bitvectors of length 2n+1, ..., 2, 1 indicate that the end of the pattern has been reached. The bitvectors serve as input "symbols" for the universal automaton, which accepts the sequence of bitvectors iff d(v,w) ≤ n.
Example
Bound n = 1; garbled input word v = "chold", represented as $chold; second word w = child, translated into the series:
(c, $cho) = 0100
(h, chol) = 0100
(i, hold) = 0000
(l, old) = 010
(d, ld) = 01
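A minimal sketch of this translation step (the helpers char_vector and translate are assumed names, not from the slides), reproducing the series above:

```python
def char_vector(letter: str, window: str) -> str:
    """Characteristic bitvector: occurrences of letter in the window."""
    return "".join("1" if c == letter else "0" for c in window)

def translate(v: str, w: str, n: int) -> list:
    """Encode each letter of w against the padded pattern $...$v."""
    padded = "$" * n + v
    vectors = []
    for k, letter in enumerate(w):
        window = padded[k : k + 2 * n + 2]  # length 2n+2, shorter at the end
        vectors.append(char_vector(letter, window))
    return vectors

assert translate("chold", "child", 1) == ["0100", "0100", "0000", "010", "01"]
```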
New solution for the "Decision Problem": universal deterministic Levenshtein automata
For each distance bound n, a universal deterministic Levenshtein automaton is computed offline, once. It yields a very efficient method to decide, for arbitrary input words v and w, whether d(v,w) ≤ n. Its size depends only on the bound n; in practice, n < 8.
2. Similarity search in a large lexicon
Main ideas: represent the lexicon as a deterministic finite-state automaton. Given a pattern v, start a complete traversal of the lexicon automaton, but use the universal Levenshtein automaton as control: translate the letters of lexicon words into bitvectors using the pattern v. Failure in the Levenshtein automaton causes failure and backtracking in the search. If a final state is reached in both automata, output the lexicon word. (A simplified sketch follows.)
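A simplified sketch of the controlled traversal, with two deliberate substitutions versus the slides: the lexicon automaton is a plain dict-based trie rather than a minimized automaton, and the control is a Wagner-Fischer row carried along each path and pruned once no entry stays within the bound, standing in for the universal Levenshtein automaton. The backtracking structure of the search is the same:

```python
def build_trie(words):
    """Nested-dict trie; "" marks the end of a word (assumed convention)."""
    root = {}
    for word in words:
        node = root
        for c in word:
            node = node.setdefault(c, {})
        node[""] = True
    return root

def similarity_search(trie, v, n):
    """All (word, d(word, v)) in the trie with d(word, v) <= n."""
    results = []

    def visit(node, prefix, row):      # row[j] = d(prefix, v[:j])
        if "" in node and row[-1] <= n:
            results.append((prefix, row[-1]))
        for c, child in node.items():
            if c == "":
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(v) + 1):
                cost = 0 if v[j - 1] == c else 1
                new_row.append(min(
                    new_row[j - 1] + 1,   # skip v[j-1]
                    row[j] + 1,           # skip c
                    row[j - 1] + cost,    # substitute or match
                ))
            if min(new_row) <= n:         # else: fail and backtrack
                visit(child, prefix + c, new_row)

    visit(trie, "", list(range(len(v) + 1)))
    return results

lexicon = ["axis", "behave", "child", "church", "cold", "hate", "hold"]
assert similarity_search(build_trie(lexicon), "chold", 1) == \
    [("child", 1), ("cold", 1), ("hold", 1)]
```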
Example
Lexicon: axis, behave, child, church, cold, hate, hold, stored as a trie-shaped automaton (diagram not reproduced).

Example: input "chold", n = 1. The traversal visits the branches in turn; each transition label is translated into its characteristic vector:
- a-x-i-s: (a,$cho)=0000, (x,chol)=0000: no transition with 0000, backtrack.
- b-e-h-a-v-e: (b,$cho)=0000, (e,chol)=0000: no transition with 0000, backtrack.
- c-h-i-l-d: (c,$cho)=0100, (h,chol)=0100, (i,hold)=0000, (l,old)=010, (d,ld)=01: first output child.
- c-h-u-r-c-h: (u,hold)=0000, (r,old)=000: no transition with 000, backtrack.
- c-o-l-d: (o,chol)=0010, (l,hold)=0010, (d,old)=001: second output cold.
- h-o-l-d: (h,$cho)=0010, (o,chol)=0010, (l,hold)=0010, (d,old)=001: third output hold.
- h-a-t-e: (a,hold)=0000: no transition with 0000, backtrack.
Finished: only part of the lexicon automaton has been visited.
Stack-based implementation
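The slide's content is not reproduced; below is a sketch of what its title suggests: the same traversal as above, driven by an explicit stack of (node, prefix, row) frames instead of recursion:

```python
def similarity_search_stack(trie, v, n):
    """Iterative variant of similarity_search with an explicit stack."""
    results = []
    stack = [(trie, "", list(range(len(v) + 1)))]
    while stack:
        node, prefix, row = stack.pop()
        if "" in node and row[-1] <= n:
            results.append((prefix, row[-1]))
        for c, child in node.items():
            if c == "":
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(v) + 1):
                cost = 0 if v[j - 1] == c else 1
                new_row.append(min(new_row[j - 1] + 1,
                                   row[j] + 1,
                                   row[j - 1] + cost))
            if min(new_row) <= n:   # prune: no continuation within bound
                stack.append((child, prefix + c, new_row))
    return results
```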
Evaluation
Three lexica: BL (Bulgarian), GL (German), TL (book titles). (Result tables not reproduced.)
3. Improving efficiency
Structure of the lexicon automaton: enormous branching close to the start, almost no branching deeper inside. Even for bound n = 1, all transitions have to be visited and their labels translated into bitvectors.
Main idea, here for n = 1: split the input word v into two subwords v1, v2. If v contains at most one error, two cases can be distinguished:
1. v1 contains 0 errors, v2 at most 1 error.
2. v2 contains 0 errors, v1 exactly 1 error.
Subsearch for case 1: use the standard lexicon automaton as before.
Subsearch for case 2: use an automaton for the lexicon of reversed words (the backward lexicon).
In both cases no search/backtracking is needed in the "dense" part of the automaton. (A sketch follows the next slide.)
Subsearch 1: v1 with 0 errors, then v2 with at most 1 error, traversing the forward lexicon automaton.
Subsearch 2: v2 with 0 errors, then v1 with 1 error, traversing the backward lexicon automaton (i.e., matching the reversed word from the end).
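A minimal sketch of the split search for n = 1, reusing build_trie and similarity_search from the sketches above. Phase 1 follows the error-free part with no backtracking at all, which is exactly what removes the search effort in the dense region; phase 2 runs the approximate search only below the node reached. Duplicates (e.g., exact hits found by both cases) are removed by the set:

```python
def split_search(words, v):
    """Approximate search with bound n = 1 via the forward/backward split."""
    forward = build_trie(words)
    backward = build_trie([w[::-1] for w in words])
    v1, v2 = v[: len(v) // 2], v[len(v) // 2 :]

    def exact_then_fuzzy(trie, exact, fuzzy):
        node = trie
        for c in exact:               # phase 1: no backtracking
            if c not in node:
                return []
            node = node[c]
        return [(exact + suffix, d)   # phase 2: fuzzy search below node
                for suffix, d in similarity_search(node, fuzzy, 1)]

    hits = set(exact_then_fuzzy(forward, v1, v2))            # case 1
    hits |= {(w[::-1], d)                                    # case 2
             for w, d in exact_then_fuzzy(backward, v2[::-1], v1[::-1])}
    return sorted(hits)

lexicon = ["axis", "behave", "child", "church", "cold", "hate", "hold"]
assert split_search(lexicon, "chold") == [("child", 1), ("cold", 1), ("hold", 1)]
```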
Evaluation
(Result tables for the refined method not reproduced.)
Generalizations of the results
Similar results hold for many variants of the Levenshtein distance:
- adding transpositions as edit operations;
- adding merges and splits as edit operations;
- adding more complex edit operations (lll -> in);
- restricting edit operations to specific symbols or strings.
(With Petar Mitankin, Bulgarian Academy of Sciences.)
Conclusion
When looking for algorithmic solutions to problems on sequences or strings, we tend to consider only a "left-to-right" search strategy. Sometimes much better algorithms are obtained by going beyond the conventional "left-to-right" order.