Fast search for similar words
Klaus U. Schulz, CIS - LMU Munich
Joint work with Stoyan Mihov, Bulgarian Academy of Sciences

Problem
Given a large lexicon D and possibly erroneous input tokens, efficiently find, for each input token v, the "most similar" entries w of the lexicon.
Text correction
Input: text with orthographic errors (e.g., from OCR). For each non-lexical token v, find good correction suggestions in the lexicon.
Correction of search queries
"Lexicon": list of known search queries. Input: new queries v, possibly with misspellings. If v is not known, check whether there are similar known queries.

Notions of similarity for words
Levenshtein distance (edit distance) between two words v, w: the minimal number of letter deletions, insertions or substitutions needed to rewrite v into w. E.g. d(chold, child) = 1, d(cold, child) = 2.
Variants
1. Allow for other "edit operations": transpositions of letters, merging two letters into one, splitting one letter into two, …
2. Assign non-uniform "costs" to the operations depending on the symbols used. E.g. cost(I,i) = 0.35, cost(I,m) = 1, cost(m,in) = 0.22.
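Variant 1 can be illustrated with a small sketch. It assumes the restricted form of transposition (two adjacent letters swapped, no further edits inside the swapped pair), often called optimal string alignment. The function name `distance_with_transpositions` is ours and does not come from the talk.

```python
def distance_with_transpositions(v: str, w: str) -> int:
    """Edit distance with deletions, insertions, substitutions and
    adjacent transpositions (restricted variant, an assumption here)."""
    rows, cols = len(v) + 1, len(w) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all i letters of v
    for j in range(cols):
        d[0][j] = j          # insert all j letters of w
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if v[i - 1] == w[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # transposition of two adjacent letters
            if i > 1 and j > 1 and v[i - 1] == w[j - 2] and v[i - 2] == w[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(distance_with_transpositions("chidl", "child"))  # 1 (swap of d and l)
print(distance_with_transpositions("chold", "child"))  # 1
```

With plain Levenshtein distance the swap in "chidl" would count as two edits; here it counts as one.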

Formalized problem
Given a large lexicon D and a small bound n for the Levenshtein distance d, efficiently find, for each input token v, all entries w of D such that d(v,w) ≤ n. (Variants of the Levenshtein distance are considered later.)

Structure of talk
1. Efficient solution for the "Decision Problem": given two words v, w, decide if d(v,w) ≤ n.
2. Efficient solution for similarity search in very large lexica
3. Refinement with substantially improved efficiency

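Decision Problem
Standard solution
Wagner-Fischer method: compute the Levenshtein distance d(v,w) via dynamic programming. Table for v = ananas (rows) and w = arkansas (columns):
        a  r  k  a  n  s  a  s
     0  1  2  3  4  5  6  7  8
  a  1  0  1  2  3  4  5  6  7
  n  2  1  1  2  3  3  4  5  6
  a  3  2  2  2  2  3  4  4  5
  n  4  3  3  3  3  2  3  4  5
  a  5  4  4  4  3  3  3  3  4
  s  6  5  5  5  4  4  3  4  3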
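A minimal sketch of the Wagner-Fischer computation, assuming unit costs for deletion, insertion and substitution. The helper names `wagner_fischer`, `levenshtein` and `within_bound` are illustrative only.

```python
def wagner_fischer(v: str, w: str) -> list[list[int]]:
    """Full dynamic-programming table; entry [i][j] is d(v[:i], w[:j])."""
    table = [[0] * (len(w) + 1) for _ in range(len(v) + 1)]
    for i in range(len(v) + 1):
        table[i][0] = i
    for j in range(len(w) + 1):
        table[0][j] = j
    for i in range(1, len(v) + 1):
        for j in range(1, len(w) + 1):
            cost = 0 if v[i - 1] == w[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # delete v[i-1]
                table[i][j - 1] + 1,         # insert w[j-1]
                table[i - 1][j - 1] + cost,  # match or substitute
            )
    return table


def levenshtein(v: str, w: str) -> int:
    return wagner_fischer(v, w)[-1][-1]


def within_bound(v: str, w: str, n: int) -> bool:
    """Decision problem: is d(v, w) <= n?"""
    return levenshtein(v, w) <= n


print(levenshtein("chold", "child"))      # 1
print(levenshtein("cold", "child"))       # 2
print(within_bound("chold", "child", 1))  # True
```

The table has (|v|+1)(|w|+1) entries, so the decision costs time proportional to |v|·|w| per word pair, which is what the automaton-based approach tries to avoid.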

Decision Problem Use of nondeterministic “Levenshtein” automata

For each input token v, use a nondeterministic automaton A_n to recognize all words w such that d(v,w) ≤ n. E.g. v = chold, n = 2.

Problems
1. Nondeterminism of the automata
2. A special automaton is needed for each garbled input word v
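The nondeterministic automaton can be simulated by tracking its set of active states. The sketch below assumes the usual state layout (i, e), meaning "i letters of the pattern v consumed, e errors used so far"; the function names are ours.

```python
def epsilon_closure(states, m, n):
    """Add states reachable by skipping pattern letters (deletions in v)."""
    closed = set(states)
    stack = list(states)
    while stack:
        i, e = stack.pop()
        if i < m and e < n and (i + 1, e + 1) not in closed:
            closed.add((i + 1, e + 1))
            stack.append((i + 1, e + 1))
    return closed


def step(states, x, v, n):
    """Active states after reading the letter x."""
    m = len(v)
    nxt = set()
    for i, e in states:
        if i < m and v[i] == x:
            nxt.add((i + 1, e))          # match, no new error
        if e < n:
            nxt.add((i, e + 1))          # x is an extra letter
            if i < m:
                nxt.add((i + 1, e + 1))  # x replaces v[i]
    return epsilon_closure(nxt, m, n)


def accepts(v: str, w: str, n: int) -> bool:
    """True iff the automaton for pattern v and bound n accepts w."""
    states = epsilon_closure({(0, 0)}, len(v), n)
    for x in w:
        states = step(states, x, v, n)
        if not states:                   # no active state left
            return False
    return any(i == len(v) for i, _ in states)


print(accepts("chold", "child", 2))   # True
print(accepts("chold", "church", 2))  # False
```

Both problems listed above are visible here: every step handles a whole set of states, and the pattern v is baked into `step`.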

Triangular areas
Given the automaton A_n, for each partial input u of length k the "set of active states" is always a subset of the k-th "triangular area".

Triangular areas, shown in turn for: empty input, input "h", input "hc", input "hch", input "hchi".

Input behaviour: finding the next set of active states
Given a set of active states and an input symbol, the next set of active states depends only on the distribution of that symbol in a subword of the pattern "chold".

Finding the next set of active states, step by step for the pattern "chold":
Empty input, next letter h: relevant subword "cho"
Input "h", next letter c: relevant subword "chol"
Input "hc", next letter h: relevant subword "chold"
Input "hch", next letter i: relevant subword "hold"
Input "hchi"
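The relevant subwords listed in these steps can be reproduced with a short sketch. The window bounds are an assumption inferred from the examples for n = 2: after k input letters, only pattern positions k-n to k+n can be read by an active state. The names `relevant_subword` and `distribution` are ours.

```python
def relevant_subword(pattern: str, k: int, n: int) -> str:
    """Part of the pattern that matters after k input letters (bound n)."""
    return pattern[max(0, k - n): k + n + 1]


def distribution(letter: str, subword: str) -> list[int]:
    """1 where the subword carries the letter, 0 elsewhere."""
    return [1 if c == letter else 0 for c in subword]


for k, next_letter in enumerate("hchi"):
    sub = relevant_subword("chold", k, 2)
    print(k, next_letter, sub, distribution(next_letter, sub))
```

For k = 2 and next letter h this gives the subword "chold" with distribution [0, 1, 0, 0, 0].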

Summing up
1. Introduce a new finite set of "universal" states: subsets of a triangle (plus a second type of universal final states). This yields a universal set of states that does not depend on the input word v.

Summing up
2. Introduce a universal transition function that depends only on the distribution of "red" and "green" horizontal transitions. Encode red/green by a bit vector of symbols 0/1, e.g. 0 0 1 1 0.

Final states
For n = 2 there are five final triangular areas T1, T2, T3, T4, T5:
T1: characteristic vectors refer to "hold"
T2: characteristic vectors refer to "old"
T3: characteristic vectors refer to "ld"
T4: characteristic vectors refer to "d"
T5: no transitions possible, no vector needed

Final states
Since the characteristic input vectors change their length in each step, the length of a characteristic vector "tells us" which triangular area we are in: e.g., for input vector 01 we are in T3.

Final states
First idea: for each set of active states in the complete region T1-T5 that fits into one triangular area Ti, we introduce a new state for the universal automaton. The derived state is final if it contains a final (red) state of the underlying nondeterministic automaton.

Final states
Example of a derived final state (cf. T2): using M as a parameter for the length of the pattern, the blue universal state is {(M-3)^2, (M-1)^2, M^1, M^2}. (Figure labels: M, M-1, M-2, M-3.)

Final states
Example of a derived non-final state (cf. T1).

Final states
Observation: for the derived non-final states we can use the universal states we already have; the length of the input vectors gives us the position. Example: the blue state corresponds to {(I-2)^2, (I-1)^2, I^1, I^2, (I+1)^2} for an input vector of length 4 (where I is M-2).

Final states
Example: the blue state corresponds to
(I-2)^2 for input vector length 1
(I-1)^2 for input vector length 2
I^2 for input vector length 3
(I+1)^2 for input vector length 4
(I+2)^2 for input vector length 5

Summary
New final states: sets of active states that contain a final state of the underlying nondeterministic automaton. By using the abstract parameter symbol "M" for the pattern length, these states are also "universal". The earlier universal states with the abstract parameter symbol "I" obtain new transitions for shorter characteristic vectors.

Summing up
We obtain a universal and deterministic automaton for a given bound n.
Given an input word v and another word w, each letter of w is translated into a bit vector encoding the occurrences of that letter in the relevant subword of v.
For bound n, bit vectors have length 2n+2, 2n+1, …, 2, 1. Bit vectors of length 2n+1, …, 2, 1 indicate that we reach the pattern end.
The bit vectors are used as input "symbols" for the universal automaton.
The universal automaton accepts the sequence of bit vectors iff d(v,w) ≤ n.

Example
Bound n = 1
Garbled input word v = "chold", represented as $chold
Second word w = child
Translated into the series of characteristic vectors χ(letter, subword):
χ(c, $cho) = 0100
χ(h, chol) = 0100
χ(i, hold) = 0000
χ(l, old) = 010
χ(d, ld) = 01
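The translation in this example can be sketched as follows. The window length 2n+2 is taken from the slides; padding the pattern with exactly n copies of "$" is an assumption that matches the n = 1 example, and "$" is assumed not to occur in w. The function names are ours.

```python
def characteristic_vector(letter: str, subword: str) -> str:
    """Bit string with 1 at the positions where subword carries letter."""
    return "".join("1" if c == letter else "0" for c in subword)


def bitvector_sequence(v: str, w: str, n: int) -> list[str]:
    """Translate each letter of w into its bit vector relative to v."""
    padded = "$" * n + v   # assumption: n padding symbols in front of v
    width = 2 * n + 2      # vectors get shorter near the pattern end
    return [
        characteristic_vector(x, padded[k:k + width])
        for k, x in enumerate(w)
    ]


print(bitvector_sequence("chold", "child", 1))
# ['0100', '0100', '0000', '010', '01']
```

The shrinking vectors at the end (lengths 3 and 2) are what signals to the universal automaton that the pattern end is being reached.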

Universal deterministic Levenshtein automata
New solution for the "Decision Problem": for each distance bound n, a universal deterministic Levenshtein automaton is computed offline (once). This offers a very efficient method to decide, for arbitrary input words v, w, whether d(v,w) ≤ n. The size of the automaton depends on the bound n. In practice: n < 8.

2. Similarity search in large lexicon
Main ideas
Represent the lexicon as a deterministic finite-state automaton.
Given a pattern v, start a complete traversal of the lexicon automaton, but use the universal Levenshtein automaton as control: translate the letters of lexicon words into bit vectors using the pattern v.
Failure in the Levenshtein automaton causes failure and backtracking for the search.
If a final state is reached in both automata, output the lexicon word.

Example
Lexicon: axis, behave, child, church, cold, hate, hold
Input "chold", n = 1. Each letter read in the lexicon automaton is translated into its characteristic bit vector χ(letter, relevant subword of $chold).
1. a: χ(a, $cho) = 0000; x: χ(x, chol) = 0000. No transition with 0000: backtrack.
2. b: χ(b, $cho) = 0000; e: χ(e, chol) = 0000. No transition with 0000: backtrack.
3. c: χ(c, $cho) = 0100; h: χ(h, chol) = 0100; i: χ(i, hold) = 0000; l: χ(l, old) = 010; d: χ(d, ld) = 01. First output: child.
4. Back at "ch": u: χ(u, hold) = 0000; r: χ(r, old) = 000. No transition with 000: backtrack.
5. Back at "c": o: χ(o, chol) = 0010; l: χ(l, hold) = 0010; d: χ(d, old) = 001. Second output: cold.
6. h: χ(h, $cho) = 0010; o: χ(o, chol) = 0010; l: χ(l, hold) = 0010; d: χ(d, old) = 001. Third output: hold.
7. Back at "h": a: χ(a, chol) = 0000. No transition with 0000.
Finished. The final slide marks which parts of the lexicon automaton were visited and which were not.
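The traversal in this example can be sketched with an explicit stack. Two simplifications are assumptions of the sketch, not the method of the talk: the lexicon is stored as a plain trie rather than a minimized automaton, and one row of the dynamic-programming table per trie node stands in for the state of the universal Levenshtein automaton. A row whose minimum exceeds n plays the role of "no transition".

```python
class Node:
    def __init__(self):
        self.children = {}   # letter -> Node
        self.final = False   # True if a lexicon word ends here


def build_trie(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.final = True
    return root


def search(root, pattern, n):
    """All lexicon words w with d(pattern, w) <= n."""
    results = []
    # Stack entries: (trie node, word prefix read so far, table row)
    stack = [(root, "", list(range(len(pattern) + 1)))]
    while stack:
        node, prefix, row = stack.pop()
        if node.final and row[-1] <= n:
            results.append(prefix)
        for ch, child in node.children.items():
            new_row = [row[0] + 1]
            for j in range(1, len(row)):
                cost = 0 if pattern[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,
                                   row[j] + 1,
                                   row[j - 1] + cost))
            if min(new_row) <= n:   # otherwise: failure, backtrack
                stack.append((child, prefix + ch, new_row))
    return sorted(results)


lexicon = ["axis", "behave", "child", "church", "cold", "hate", "hold"]
print(search(build_trie(lexicon), "chold", 1))
# ['child', 'cold', 'hold']
```

Branches such as "ax", "be" and "ha" are cut after two letters, which matches the visited and not-visited regions of the example.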

Stack-based implementation

Evaluation

Three lexica: BL (Bulgarian), GL (German), TL (book titles)

3. Improving efficiency
Structure of lexicon automaton
Enormous branching close to the start
Almost no branching deeper inside
Even for bound n = 1, we have to visit all transitions and translate labels

3. Improving efficiency
Main idea, here for n = 1:
Input word v. Split v into two subwords v1, v2.
If v contains at most 1 error, distinguish two cases:
1. v1 contains 0 errors, v2 at most 1 error
2. v2 contains 0 errors, v1 one error
Subsearch for Case 1: use the standard lexicon automaton as before.
Subsearch for Case 2: use an automaton for the lexicon of reversed words (backward lexicon).
Both cases: no search/backtracking needed in the "dense" part of the automaton.

3. Improving efficiency
Subsearch 1: v1 with 0 errors, v2 with at most 1 error; v is read forward, using the forward lexicon automaton.

3. Improving efficiency
Subsearch 2: v1 with 1 error, v2 with 0 errors; v is read backward, using the backward lexicon automaton.
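The case split for n = 1 can be sketched as follows. This is a simplified illustration under stated assumptions: v is split in the middle, and a linear scan with `startswith` / `endswith` stands in for the exact walks through the forward and backward lexicon automata. The function names are ours.

```python
def levenshtein(v: str, w: str) -> int:
    prev = list(range(len(w) + 1))
    for i, a in enumerate(v, 1):
        cur = [i]
        for j, b in enumerate(w, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def search_one_error(lexicon, v):
    """All lexicon words w with d(v, w) <= 1, found via the two subsearches."""
    half = len(v) // 2
    v1, v2 = v[:half], v[half:]
    hits = set()
    for w in lexicon:
        # Subsearch 1: v1 matches exactly, the error (if any) is in the rest.
        if w.startswith(v1) and levenshtein(v2, w[len(v1):]) <= 1:
            hits.add(w)
        # Subsearch 2: v2 matches exactly at the end, the error is in front.
        if w.endswith(v2) and levenshtein(v1, w[:len(w) - len(v2)]) <= 1:
            hits.add(w)
    return sorted(hits)


lexicon = ["axis", "behave", "child", "church", "cold", "hate", "hold"]
print(search_one_error(lexicon, "chold"))
# ['child', 'cold', 'hold']
```

For "chold" the split is "ch" + "old": "child" is found by subsearch 1 (prefix "ch" exact), "cold" and "hold" by subsearch 2 (suffix "old" exact). In both subsearches the approximate part starts only after an exact match of half the word, which is why the densely branching start of the automaton needs no backtracking.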

Evaluation

Generalizations of results
Similar results for many variants of the Levenshtein distance:
Adding transpositions as edit operations
Adding merges and splits as edit operations
Adding more complex edit operations (lll -> in)
Restricting edit operations to specific symbols or strings
(With Petar Mitankin, Bulg. Ac. of Science)

Conclusion
When looking for algorithmic solutions to problems around sequences or strings, we tend to consider only a "left-to-right" search strategy. Sometimes much better algorithms are obtained by going beyond the conventional "left-to-right".
