1 String Matching with Errors
Paper: Sellers, P. H., The Theory and Computation of Evolutionary Distances: Pattern Recognition, Journal of Algorithms, Vol. 1, No. 4, 1980, pp. 359-373.
Speaker: C. C. Lin
Adviser: R. C. T. Lee
2 This talk presents a problem closely related to the notion of edit distance. We first introduce edit distance.
3 In edit distance, there are three types of differences between two strings X and Y, each with cost 1. The operations are named from the point of view of transforming Y into X.
Insertion: a symbol of X has no counterpart in Y at the corresponding position.
    X: G C A
    Y: G - A
Substitution: the symbols at corresponding positions are distinct.
    X: A C C
    Y: T C C
Deletion: a symbol of Y has no counterpart in X at the corresponding position.
    X: A - T
    Y: A G T
4 Given two strings X and Y, the edit distance between X and Y is the minimum number of insertions, deletions and substitutions needed to transform Y to X.
5 String X: ATGAATCTTACCGCCTCG
String Y: ATGAGGCTCTGGCCCCTG
Transformation (from string Y to string X):
    X: A T G A A - - T C T T A C C G C C T C G
    Y: A T G A G G C T C T G G C C - C C T - G
EDIT(X, Y)=7 (2 insertions, 2 deletions and 3 substitutions).
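The operation counts of this alignment can be checked mechanically. The following is a minimal sketch, assuming the alignment is given as two equal-length rows with "-" as the gap symbol and that operations are named for transforming Y into X (a gap in the Y row is an insertion, a gap in the X row is a deletion). The function name `count_operations` is hypothetical and used only for illustration.

```python
def count_operations(x_row: str, y_row: str, gap: str = "-") -> dict:
    """Count the edit operations in a gapped alignment of X (top) and Y (bottom)."""
    if len(x_row) != len(y_row):
        raise ValueError("aligned rows must have the same length")
    counts = {"insert": 0, "delete": 0, "substitute": 0, "match": 0}
    for x_sym, y_sym in zip(x_row, y_row):
        if x_sym == gap and y_sym == gap:
            raise ValueError("a column cannot consist of two gaps")
        if y_sym == gap:
            counts["insert"] += 1      # X has a symbol that must be inserted into Y
        elif x_sym == gap:
            counts["delete"] += 1      # Y has a symbol that must be deleted
        elif x_sym != y_sym:
            counts["substitute"] += 1  # both rows have a symbol, but they differ
        else:
            counts["match"] += 1
    return counts


# The alignment from the slide, with "-" as the gap symbol.
x_row = "ATGAA--TCTTACCGCCTCG"
y_row = "ATGAGGCTCTGGCC-CCT-G"
ops = count_operations(x_row, y_row)
cost = ops["insert"] + ops["delete"] + ops["substitute"]
print(ops, cost)
```

Note that this only measures the cost of one given alignment; it does not by itself show that no cheaper alignment exists.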
6 Next, we will introduce a dynamic programming method to compute the edit distance between two strings X and Y.
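7 Dynamic Programming for Edit Distance:
Let EDIT[i, j] be the edit distance between the first i symbols of Y and the first j symbols of X, with EDIT[0, j]=j and EDIT[i, 0]=i. For i, j >= 1:
    EDIT[i, j] = min( EDIT[i-1, j] + 1,       (Delete)
                      EDIT[i, j-1] + 1,       (Insert)
                      EDIT[i-1, j-1] + d )    (Substitute)
where d=0 if the i-th symbol of Y equals the j-th symbol of X, and d=1 otherwise.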
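The three cases named on this slide can be transcribed directly into a recursive function. The following is a minimal sketch, assuming the usual unit-cost recurrence in which i counts symbols of Y and j counts symbols of X; the function names are hypothetical. Memoization with `functools.lru_cache` ensures that each (i, j) pair is evaluated only once.

```python
from functools import lru_cache


def edit_distance(x: str, y: str) -> int:
    """Edit distance between x and y, as a top-down reading of the recurrence."""

    @lru_cache(maxsize=None)
    def edit(i: int, j: int) -> int:
        # i = number of symbols of y considered, j = number of symbols of x considered
        if i == 0:
            return j  # only insertions remain
        if j == 0:
            return i  # only deletions remain
        delete = edit(i - 1, j) + 1
        insert = edit(i, j - 1) + 1
        substitute = edit(i - 1, j - 1) + (0 if y[i - 1] == x[j - 1] else 1)
        return min(delete, insert, substitute)

    return edit(len(y), len(x))


print(edit_distance("abcabba", "cbabac"))
```

The recursion depth grows with len(x) + len(y), so this form is suited to short strings such as the examples in this talk; a table-filling version avoids that limit.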
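8 Given X=abcabba, Y=cbabac. [Figure: dynamic programming table, columns labeled by X = a b c a b b a, rows labeled by Y = c b a b a c, filled in step by step.]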
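The table for this example can be reproduced with a short bottom-up routine. The following is a minimal sketch, assuming rows are indexed by the symbols of Y and columns by the symbols of X, with unit costs; `edit_table` and `format_table` are hypothetical helper names.

```python
from typing import List


def edit_table(x: str, y: str) -> List[List[int]]:
    """Fill the (len(y)+1) x (len(x)+1) edit distance table row by row."""
    rows, cols = len(y) + 1, len(x) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        table[0][j] = j  # top row: j insertions
    for i in range(rows):
        table[i][0] = i  # left column: i deletions
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if y[i - 1] == x[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # Delete
                table[i][j - 1] + 1,         # Insert
                table[i - 1][j - 1] + cost,  # Substitute or Match
            )
    return table


def format_table(x: str, y: str, table: List[List[int]]) -> str:
    """Render the table with X along the top and Y down the side."""
    lines = ["    " + " ".join(f"{c:>2}" for c in " " + x)]
    for label, row in zip(" " + y, table):
        lines.append(f"{label:>2}  " + " ".join(f"{v:>2}" for v in row))
    return "\n".join(lines)


table = edit_table("abcabba", "cbabac")
print(format_table("abcabba", "cbabac", table))
```

The bottom-right entry of the table is the edit distance between the two full strings.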
14 Given X=abcabba, Y=cbabac, EDIT(X, Y)=4. Tracing back from the bottom-right cell of the table recovers an optimal alignment one column at a time, from right to left.
    X: a
    Y: c
    Operation: Substitute
15 Traceback, continued:
    X: b a
    Y: a c
    Operation: Substitute
16 Traceback, continued:
    X: b b a
    Y: b a c
    Operation: Match
17 Traceback, continued:
    X: a b b a
    Y: a b a c
    Operation: Match
18 Traceback, continued:
    X: c a b b a
    Y: - a b a c
    Operation: Insert
19 Traceback, continued:
    X: b c a b b a
    Y: b - a b a c
    Operation: Match
20 Traceback, completed:
    X: a b c a b b a
    Y: c b - a b a c
    Operation: Substitute
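The traceback can be sketched in code as follows. This is a minimal illustration, assuming unit costs and a tie-breaking rule that is not stated in the slides: the diagonal move is preferred, then Insert, then Delete. The function name `align` is hypothetical. Other tie-breaking rules yield other optimal alignments with the same cost.

```python
from typing import List, Tuple


def align(x: str, y: str, gap: str = "-") -> Tuple[str, str, List[str]]:
    """Return one optimal alignment (X row, Y row) and its operations, left to right."""
    rows, cols = len(y) + 1, len(x) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        table[0][j] = j
    for i in range(rows):
        table[i][0] = i
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if y[i - 1] == x[j - 1] else 1
            table[i][j] = min(table[i - 1][j] + 1,
                              table[i][j - 1] + 1,
                              table[i - 1][j - 1] + cost)

    # Walk from the bottom-right cell back to the top-left cell.
    i, j = len(y), len(x)
    x_row, y_row, ops = [], [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + (
                0 if y[i - 1] == x[j - 1] else 1):
            ops.append("Match" if y[i - 1] == x[j - 1] else "Substitute")
            x_row.append(x[j - 1])
            y_row.append(y[i - 1])
            i, j = i - 1, j - 1
        elif j > 0 and table[i][j] == table[i][j - 1] + 1:
            ops.append("Insert")   # symbol of X with no counterpart in Y
            x_row.append(x[j - 1])
            y_row.append(gap)
            j -= 1
        else:
            ops.append("Delete")   # symbol of Y with no counterpart in X
            x_row.append(gap)
            y_row.append(y[i - 1])
            i -= 1
    # The columns were collected right to left, so reverse them.
    return "".join(reversed(x_row)), "".join(reversed(y_row)), ops[::-1]


x_row, y_row, ops = align("abcabba", "cbabac")
print(x_row)
print(y_row)
print(ops)
```

With this tie-breaking rule, the example strings produce the same alignment as the traceback shown above.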
21 Given X=abcabba, Y=cbabac, EDIT(X, Y)=4. Another optimal alignment:
    X: a b c a b b a -
    Y: c b - a b - a c
Operations: Substitute, Match, Insert, Match, Match, Insert, Match, Delete.
22 Given X=abcabba, Y=cbabac, EDIT(X, Y)=4. A further optimal alignment:
    X: a b c a b b a -
    Y: c b - a - b a c
23 The above algorithm computes the edit distance in O(mn) time and O(mn) space, where n and m are the lengths of the two strings (the text and the pattern, respectively, in the problem that follows).
24 Next, we introduce the main topic: the string matching with errors problem.
25 The definition of the problem: Given a pattern P of length m and a text T of length n, find a substring S of T such that EDIT(S, P) is minimal.
Given: T=abcabba, P=cbabac
Find: S=cabba, EDIT(S, P)=3
    P: c b a b a c
    S: c - a b b a
Another substring of T, K=bcabb, has EDIT(K, P)=4:
    P: - c b a b a c
    K: b c - a b - b
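The definition can be checked directly by brute force: compute EDIT(S, P) for every non-empty substring S of T and keep the minimum. The following is a minimal sketch for illustration only, not the method of the talk; the function names are hypothetical, and the running time is far worse than that of the dynamic programming approach.

```python
from typing import List, Tuple


def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance, keeping only two rows of the table."""
    prev = list(range(len(x) + 1))
    for i, y_sym in enumerate(y, start=1):
        cur = [i]
        for j, x_sym in enumerate(x, start=1):
            cost = 0 if x_sym == y_sym else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def best_substrings_bruteforce(text: str, pattern: str) -> Tuple[int, List[str]]:
    """Try every non-empty substring of text; return the minimum distance and all minimizers."""
    best = None
    winners = set()
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            candidate = text[start:end]
            d = edit_distance(candidate, pattern)
            if best is None or d < best:
                best, winners = d, {candidate}
            elif d == best:
                winners.add(candidate)
    return best, sorted(winners)


best, winners = best_substrings_bruteforce("abcabba", "cbabac")
print(best, winners)
```

For the example, the minimum is attained by more than one substring, which is why the problem may have several answers.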
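26 Dynamic Programming for the String Matching with Errors Problem:
Let SE[i, j] be the minimum edit distance between the first i symbols of P and any substring of T ending at position j, with SE[0, j]=0 and SE[i, 0]=i. For i, j >= 1:
    SE[i, j] = min( SE[i-1, j] + 1,
                    SE[i, j-1] + 1,
                    SE[i-1, j-1] + d )
where d=0 if the i-th symbol of P equals the j-th symbol of T, and d=1 otherwise.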
27 SE[i, j] is computed in the same way as EDIT[i, j] in the dynamic programming approach for the edit distance problem. The difference lies in the top row: EDIT[0, j]=j for the edit distance problem, and SE[0, j]=0 for the string matching with errors problem.
28 In the edit distance problem, we have EDIT[0, j]=j. In the string matching with errors problem, we set SE[0, j]=0, because a matching substring may begin at any position of T at no cost.
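The effect of the zero top row can be seen in a short sketch. It assumes unit costs, rows indexed by the symbols of P and columns by the symbols of T; `sellers_last_row` and `best_match_ends` are hypothetical names. Only the last row is returned, since its minimum is the answer to the matching problem.

```python
from typing import List, Tuple


def sellers_last_row(text: str, pattern: str) -> List[int]:
    """Last row of the SE table: entry j is the best distance between the
    whole pattern and some substring of text ending at position j."""
    prev = [0] * (len(text) + 1)  # SE[0, j] = 0: a match may start anywhere
    for i, p_sym in enumerate(pattern, start=1):
        cur = [i]                 # SE[i, 0] = i
        for j, t_sym in enumerate(text, start=1):
            cost = 0 if p_sym == t_sym else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev


def best_match_ends(text: str, pattern: str) -> Tuple[int, List[int]]:
    """Minimum distance and every end position in text that attains it."""
    last_row = sellers_last_row(text, pattern)
    best = min(last_row)
    return best, [j for j, value in enumerate(last_row) if value == best]


last_row = sellers_last_row("abcabba", "cbabac")
best, ends = best_match_ends("abcabba", "cbabac")
print(last_row, best, ends)
```

Changing the first line of `sellers_last_row` to `list(range(len(text) + 1))` turns the same loop back into the ordinary edit distance computation.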
29 T=abcabba, P=cbabac. [Figure: the SE table with a traceback path marked.] Since this path starts at the bottom row with value 3 and ends at the top row, where SE[0, j]=0, there exists a substring S of T such that EDIT(S, P)=3.
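30 We find the lowest value in the last row and trace back from that cell to the top row. Several cells may attain the lowest value, and a cell may admit several traceback paths, so the output may be several strings.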
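The procedure on this slide can be sketched as follows. This is a minimal illustration, assuming unit costs and reporting only one traceback path per minimal cell (diagonal move first, then left, then up); that tie-breaking rule and the name `approximate_matches` are assumptions, not part of the slides. The column where the path reaches the top row is the start of the matched substring.

```python
from typing import List, Tuple


def approximate_matches(text: str, pattern: str) -> Tuple[int, List[Tuple[int, int, str]]]:
    """Return the minimum distance and one (start, end, substring) per minimal cell."""
    n, m = len(text), len(pattern)
    table = [[0] * (n + 1)] + [[i] + [0] * n for i in range(1, m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            table[i][j] = min(table[i - 1][j] + 1,
                              table[i][j - 1] + 1,
                              table[i - 1][j - 1] + cost)

    best = min(table[m])
    matches = []
    for end in range(n + 1):
        if table[m][end] != best:
            continue
        # Trace back from the bottom row until the top row is reached.
        i, j = m, end
        while i > 0:
            if j > 0 and table[i][j] == table[i - 1][j - 1] + (
                    0 if pattern[i - 1] == text[j - 1] else 1):
                i, j = i - 1, j - 1   # match or substitution
            elif j > 0 and table[i][j] == table[i][j - 1] + 1:
                j -= 1                # text symbol with no counterpart in pattern
            else:
                i -= 1                # pattern symbol with no counterpart in text
        matches.append((j, end, text[j:end]))
    return best, matches


best, matches = approximate_matches("abcabba", "cbabac")
for start, end, substring in matches:
    print(start, end, substring)
```

Enumerating every traceback path, rather than one per minimal cell, would report all substrings at the minimum distance.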
31 T=abcabba, P=cbabac. One output is S=cabba, located in T as follows:
    T: a b c - a b b a
    P:     c b a b a c
32 T=abcabba, P=cbabac, S=cabba, EDIT(S, P)=3. [Figure: edit distance table for S=cabba and P=cbabac.]
    S: c - a b b a
    P: c b a b a c
33 T=abcabba, P=cbabac, EDIT(S, P)=3.
    S: c a b b a -
    P: c b a b a c
34 T=abcabba, P=cbabac, EDIT(S, P)=3.
    S: c - a b - -
    P: c b a b a c
35 T=abcabba, P=cbabac, EDIT(S, P)=3.
    S: - - a b - c
    P: c b a b a c
36 References
For edit distance computation:
[NW70] Needleman, S. B., and Wunsch, C. D., A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology 48 (1970).
For string matching with errors:
[S80] Sellers, P. H., The Theory and Computation of Evolutionary Distances: Pattern Recognition, Journal of Algorithms, Vol. 1, No. 4, 1980, pp. 359-373.
37 Thank you