1 String Edit Distance Matching Problem With Moves. Graham Cormode, S. Muthukrishnan. November 2001
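2 Pattern Matching: Text T of length n; pattern P of length m. Our goal: find “good” matches of P in T, as measured by some edit distance function d(_,_). For every i = 1, 2, …, n find $D[i] = \min_{i \le k \le n} d(P, T[i,k])$, the smallest distance between P and any substring of T that starts at position i.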
3 Pattern Matching example: T = “assasin”, P = “ssi”. D[1] = 2: P aligns against “assasin” as “_ssi” or as “ss i”. D[3] = 1: P aligns against “assa sin” as “s_si”. Naively, computing every D[i] takes O(mn^3) time.
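The brute-force computation of D[i] for this example can be sketched as follows. This is only an illustration under a loud assumption: it uses the classic Levenshtein distance (insert, delete, replace) and leaves out substring moves, so it is not the distance d used in these slides. The names `levenshtein` and `naive_D` are hypothetical.

```python
def levenshtein(x, y):
    """Classic edit distance: insert, delete, replace (ASSUMPTION: no substring moves)."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        cur = [i]
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,                 # delete cx from x
                           cur[j - 1] + 1,              # insert cy into x
                           prev[j - 1] + (cx != cy)))   # replace, or free match
        prev = cur
    return prev[-1]


def naive_D(text, pattern):
    """D[i] = min over k of distance(pattern, text[i..k]), with 0-based i."""
    n = len(text)
    return [
        min(levenshtein(pattern, text[i:k]) for k in range(i + 1, n + 1))
        for i in range(n)
    ]


D = naive_D("assasin", "ssi")
print(D[0], D[2])  # the slide's D[1] and D[3], shifted to 0-based indexing
```

Each of the O(n^2) substrings costs up to O(mn) to compare, which is where the naive O(mn^3) bound comes from.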
4 The main idea: d(X,Y) = the smallest number of operations needed to turn X into Y. The operations are insertion, deletion, or replacement of a single character, and moves of a substring. The idea is to approximate D[i] up to a factor of O(log n log* n).
5 General algorithm: Embed the string distance into L1 vector distance, up to an O(log n log* n) factor; the vector is computed with a single pass over the string. Find the vector representations of O(n) substrings. Do all this in O(n log n) time. (A deterministic algorithm with an approximate result.)
6 Edit Sensitive Parsing (ESP) for the embedding: We want to parse the string so that edit operations have a limited effect on the parsing (an edit operation on character i changes the parsing only in the “neighborhood” of i). For example, “abcbbabcbdfgj” is parsed as “a bc bb abc b dfgj”, and “xyzbbabcbdfgj” is parsed as “xy z bb abc b dfgj”.
7 Edit Sensitive Parsing (ESP) for the embedding: In practice, find landmarks in the string based only on their locality, and parse by them. Local maxima are good landmarks, but they may be far apart in large alphabets (for example in “abcegiklmrtabc”), so we will reduce the alphabet.
8 Edit Sensitive Parsing (ESP) choosing the landmarks: We use a technique called alphabet reduction. Text: c a b a g e f a c e d. [The Binary and Label rows of the slide's table were not recovered.] Label(A[i]) = 2l + bit(l, A[i]), where l is the location of the first bit that differs between A[i] and A[i-1], and bit(l, A[i]) is the value of bit l in A[i].
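One round of the labelling rule above can be sketched like this. Several details are assumptions that the slide does not fix: characters are encoded as a=0, b=1, and so on; bit locations are counted from the least significant bit; and the first character, which has no left neighbor, is labelled as if l = 0. The name `reduce_once` is hypothetical.

```python
def reduce_once(a):
    """One round of alphabet reduction over a list of non-negative integers.

    Requires that no two adjacent entries are equal.
    """
    if not a:
        return []
    labels = [a[0] & 1]  # ASSUMPTION: no left neighbor, so use l = 0
    for prev, cur in zip(a, a[1:]):
        diff = prev ^ cur
        if diff == 0:
            raise ValueError("adjacent characters must differ")
        l = (diff & -diff).bit_length() - 1      # lowest bit where prev and cur differ
        labels.append(2 * l + ((cur >> l) & 1))  # Label = 2l + bit(l, cur)
    return labels


text = "cabagefaced"
codes = [ord(ch) - ord("a") for ch in text]  # ASSUMPTION: a=0, b=1, ...
labels = reduce_once(codes)
print(labels)
```

Adjacent labels stay distinct: if two neighbors share the same l, their bits at position l differ by the definition of l.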
9 Edit Sensitive Parsing (ESP) Alphabet reduction (cont.): In each iteration the alphabet size is reduced from |Σ| to 2 log|Σ|. After log*|Σ| iterations we get |Σ| < 7. Then we reduce the alphabet from size 6 to size 3 in steps, ensuring that no adjacent pair of labels is identical. [The slide's “Final iteration” and “After reduction” label rows were not recovered.]
10 Edit Sensitive Parsing (ESP) Alphabet reduction (cont.) Properties of final labels: 1) Final alphabet is {0,1,2}. 2) No adjacent pair is identical. 3) Takes Log*|Σ| iterations. 4) Each label depends on O(Log*|Σ|) characters to its left.
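The whole reduction down to {0,1,2} can be sketched as below. Assumptions not taken from the slides: the a=0, b=1, ... encoding, least-significant-bit numbering, the treatment of the first character, the stopping test `max(labels) > 5`, and the choice to replace each surplus label 3, 4, 5 in turn by the smallest value in {0,1,2} that differs from both neighbors. The names `reduce_once` and `reduce_to_three` are hypothetical.

```python
def reduce_once(a):
    """One round: Label = 2l + bit(l, cur), l = lowest bit differing from the left neighbor."""
    labels = [a[0] & 1]  # ASSUMPTION: first character is labelled with l = 0
    for prev, cur in zip(a, a[1:]):
        diff = prev ^ cur
        l = (diff & -diff).bit_length() - 1
        labels.append(2 * l + ((cur >> l) & 1))
    return labels


def reduce_to_three(codes):
    """Reduce a non-empty sequence with no equal adjacent entries to labels in {0, 1, 2}."""
    labels = list(codes)
    while max(labels) > 5:          # repeat until at most 6 distinct labels remain
        labels = reduce_once(labels)
    for v in (3, 4, 5):             # remove the surplus labels one value at a time
        for i, lab in enumerate(labels):
            if lab == v:
                neighbors = set()
                if i > 0:
                    neighbors.add(labels[i - 1])
                if i + 1 < len(labels):
                    neighbors.add(labels[i + 1])
                labels[i] = min({0, 1, 2} - neighbors)
    return labels


codes = [ord(ch) - ord("a") for ch in "cabagefaced"]
final = reduce_to_three(codes)
print(final)
```

Because a label has at most two neighbors, one of the three values 0, 1, 2 is always free, so property 2 (no identical adjacent pair) is preserved at every replacement.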
11 Edit Sensitive Parsing (ESP) So how do we choose the landmarks? For repeats, parse in a regular way: aaaaaaa -> (aaa)(aa)(aa). For varying substrings, use alphabet reduction, then mark landmarks from the final labels.
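For the repeat case, one regular rule that reproduces the slide's example is sketched here. The exact rule is an assumption inferred from aaaaaaa -> (aaa)(aa)(aa): an odd-length run starts with one triple and continues with pairs, an even-length run is all pairs. The name `parse_repeat` is hypothetical.

```python
def parse_repeat(run):
    """Split a run of one repeated character (length >= 2) into blocks of 2 or 3."""
    if len(run) < 2:
        raise ValueError("a repeat needs at least two characters")
    blocks = []
    start = 0
    if len(run) % 2 == 1:       # odd length: take one triple first
        blocks.append(run[:3])
        start = 3
    for i in range(start, len(run), 2):
        blocks.append(run[i:i + 2])
    return blocks


print(parse_repeat("aaaaaaa"))
```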
14 Edit Sensitive Parsing (ESP) So how do we choose the landmarks? Consider the final labels. Mark any character that is a local maximum (greater than its left and right neighbors). Text: c a b a g e f a c e d. [The Final label row was not recovered.] Then mark any local minimum that is not adjacent to a marked character. Clearly, the distance between consecutive marked labels is 2 or 3.
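The two marking passes can be sketched as follows, for labels in {0,1,2} with no equal adjacent pair. How the first and last positions are treated is an assumption: this sketch only considers positions that have both neighbors. The name `mark_landmarks` and the sample label sequence are hypothetical.

```python
def mark_landmarks(labels):
    """Return the indices of landmarks chosen from the final labels."""
    n = len(labels)
    marked = [False] * n
    # Pass 1: local maxima (greater than left and right neighbor).
    for i in range(1, n - 1):
        if labels[i - 1] < labels[i] > labels[i + 1]:
            marked[i] = True
    # Pass 2: local minima that are not adjacent to an already marked position.
    for i in range(1, n - 1):
        if labels[i - 1] > labels[i] < labels[i + 1]:
            if not marked[i - 1] and not marked[i + 1]:
                marked[i] = True
    return [i for i, m in enumerate(marked) if m]


marks = mark_landmarks([0, 1, 2, 0, 1, 2, 1, 0, 2])  # hypothetical final labels
gaps = [b - a for a, b in zip(marks, marks[1:])]
print(marks, gaps)
```

With only three label values a monotone run has at most three positions, which is why consecutive landmarks end up 2 or 3 apart.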
15 Edit Sensitive Parsing (ESP) What did we achieve so far? By now the whole string has been arranged into pairs and triples. The important outcome is that two strings with a small edit distance will be parsed into “very close” arrangements (the parsing of each character depends only on a log* n neighborhood).
16 Edit Sensitive Parsing (ESP) constructing the ESP tree: We now re-label each pair or triple. This can be done by building a dictionary (Karp-Miller-Rosenberg) or by hashing (Karp-Rabin 1987): Hash(w[0…m-1]) = (integer value of w[0…m-1]) mod q, for some large q.
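The hashing option can be sketched as below. The concrete choices are assumptions, not values from the slides: the block is read as a base-256 integer over its UTF-8 bytes, and q defaults to the prime 2^61 - 1. The name `karp_rabin` is hypothetical.

```python
Q = (1 << 61) - 1   # ASSUMPTION: a large prime modulus, chosen for illustration
BASE = 256          # ASSUMPTION: treat each byte as one base-256 digit


def karp_rabin(w, q=Q, base=BASE):
    """Hash(w) = (integer value of w) mod q, computed digit by digit."""
    h = 0
    for byte in w.encode("utf-8"):
        h = (h * base + byte) % q
    return h


# Re-label every pair or triple of a parsed string by its fingerprint.
blocks = ["a", "bc", "bb", "abc", "b", "dfgj"]
names = [karp_rabin(b) for b in blocks]
print(names[1])
```

Identical blocks always receive identical names; distinct blocks may collide, which is the usual trade-off of fingerprinting against the dictionary approach.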
17 Edit Sensitive Parsing (ESP) constructing the ESP tree: In O(n log n) construction time we get a 2-3 tree. [Tree figure not recovered.]
18 How do we represent a 2-3 tree as a vector? The vector is the frequency of occurrence of each (level,label) pair: (0,a) (0,b) (0,c) (0,d) (0,e) (0,f) (0,g) (0,_) (1,2) (1,3) (1,6) (1,7) (1,8) (1,10) (1,12) (1,14) (1,16) (1,20) (1,21) (2,5) (2,7) (2,10) (2,13) (2,17) (2,20) (3,3) (3,15) (3,23) (4,10)
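A minimal sketch of this vector and of the L1 distance between two such vectors is given here. It assumes the tree is already available as a list of levels, each level being the list of node labels at that height; the two tiny trees and the names `tree_vector` and `l1_distance` are hypothetical.

```python
from collections import Counter


def tree_vector(levels):
    """Count how often each (level, label) pair occurs in the tree."""
    return Counter(
        (level, label)
        for level, labels in enumerate(levels)
        for label in labels
    )


def l1_distance(u, v):
    """Sum of absolute differences over all (level, label) coordinates."""
    return sum(abs(u[key] - v[key]) for key in u.keys() | v.keys())


# Hypothetical trees: level 0 holds characters, higher levels hold block names.
x_levels = [list("abab"), [1, 1], [2]]
y_levels = [list("abac"), [1, 3], [4]]
vx, vy = tree_vector(x_levels), tree_vector(y_levels)
print(l1_distance(vx, vy))
```

A sparse counter is used because almost all (level, label) coordinates are zero for any single string.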
19 Proof of correctness. Theorem: $\tfrac{1}{2}\, d(X,Y) \le \|V(X) - V(Y)\|_1 \le O(\log n \log^* n)\, d(X,Y)$
20 Upper bound proof: $\|V(X) - V(Y)\|_1 \le O(\log n \log^* n)\, d(X,Y)$. Insert, change, or delete a character: at most log* n + c nodes change per tree level. Move a substring: the only changes are at the fringes, so at most 4(log* n + c) nodes change per tree level. Since there are log n levels in the tree, each operation changes V by O(log n log* n), and there are d(X,Y) operations.
21 Lower bound proof: $d(X,Y) \le 2\|V(X) - V(Y)\|_1$. The idea is to transform X into Y using at most $2\|V(X) - V(Y)\|_1$ operations. We want to keep hold of large pieces of the string that are common to both X and Y, so we go through X and protect enough pieces of it that will be needed in Y.
22 Lower bound proof (cont.): We avoid making any changes inside the protected pieces. At the first level of the tree we add or remove characters as needed (if a character appears in Y and not in X, we add it to the end of X), so that we get $\|V_0(X) - V_0(Y)\|_1 = 0$. We then proceed inductively up the tree: to make any node at level i, we need to move at most 2 nodes from level i-1 (we know that at level i-1 we have enough nodes, i.e. $\|V_{i-1}(X) - V_{i-1}(Y)\|_1 = 0$).
23 Lower bound proof: example. Y: D AABCB E BA F CB E GH I. X: J CABBA F CB E CB E KM L. Mark protected pieces: Counter = 0.
24 Lower bound proof: example. Y: D AABCB E BA F CB E GH I. X: J CABBA F CB E CB E KM L. Remove and add characters as needed: Counter goes from 0 to 1 (deletion), then to 2 (insertion of A).
25 Lower bound proof: example. Y: D AABCB E BA F CB E GH I. X: J ABBA F CB E CB E KM L A. Move to level 2, moving nodes in level 1 as needed: Counter goes from 2 (insertion) to 3 (1 move).
26 Lower bound proof: example. Y: D AABCB E BA F CB E GH I. X: D ABBA F CB E CB E KM L A. Move to level 3, moving nodes in level 2 as needed (one node will not move): Counter goes from 3 (1 move) to 5 (2 moves).
27 Lower bound proof: example. Y: D AABCB E BA F CB E GH I. X: D ABCB E BA F CB E GH I A. Counter = 5 (2 moves). That’s it!
28 Lower bound proof (example with d(X,Y)=1): Did we achieve what we wanted? $2\|V(Y) - V(X)\|_1$ = 2(#red nodes + #green nodes) = 18, and we surely created X from Y in fewer than 18 moves. Y: D AABCB E BA F CB E GH I. X: J CABBA F CB E CB E KM L.
29 Application to string matching: To find D[i], we need to compare P against all possible substrings of T. We can reduce this to O(n) substrings: $d(T[1,m],P) \le d(T[1,m],T[1,r]) + d(T[1,r],P) = |r-m| + d(T[1,r],P) \le 2\,d(T[1,r],P)$, since we need at least |r–m| operations to make T[1,r] the same length as P. So we only need to consider the O(n) substrings of length m, and we get a 2-approximation of the optimal matching.
30 Application to string matching – Final algorithm: Create ESP trees for T and P. Find $\|V(T[1:m]) - V(P)\|_1$. Iteratively compute $D[i] \approx \|V(T[i:i+m-1]) - V(P)\|_1$. Overall we get O(n log n) time cost for the whole algorithm, and compute every D[i] up to a factor of O(log n log* n).
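The iterative step can be illustrated with a deliberately simplified sketch. The loud assumption is that only level 0 of the vector (plain character counts) is tracked, so this is not the full ESP vector and does not give the approximation guarantee above; it only shows how an L1 distance to V(P) can be updated as the length-m window slides one position. The name `sliding_l1` is hypothetical.

```python
from collections import Counter


def sliding_l1(text, pattern):
    """L1 distance between character counts of each length-m window and the pattern.

    ASSUMPTION: level-0 counts only, standing in for the full ESP vector.
    """
    m = len(pattern)
    if m == 0 or m > len(text):
        return []
    diff = Counter(text[:m])          # window counts minus pattern counts
    diff.subtract(Counter(pattern))
    dist = sum(abs(c) for c in diff.values())
    out = [dist]
    for i in range(m, len(text)):
        # One character leaves the window, one enters; only two coordinates change.
        for ch, delta in ((text[i - m], -1), (text[i], +1)):
            dist -= abs(diff[ch])
            diff[ch] += delta
            dist += abs(diff[ch])
        out.append(dist)
    return out


print(sliding_l1("assasin", "ssi"))
```

Each slide touches a constant number of coordinates here; in the full algorithm the corresponding update touches the tree nodes near the two window ends at every level.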
31 Application to string matching – Final algorithm example: