1 String Matching with Errors The Theory and Computation of Evolutionary Distances: Pattern Recognition, Sellers, P. H., Journal of Algorithms, Vol. 20,


1 String Matching with Errors
The Theory and Computation of Evolutionary Distances: Pattern Recognition, Sellers, P. H., Journal of Algorithms, Vol. 20, No. 1, 1980, pp. 359-373.
Speaker: C. C. Lin
Adviser: R. C. T. Lee

2 In the following, we will present a problem related to the notion of edit distance. First, let us introduce edit distance.

3 In edit distance, there are three types of differences between two strings X and Y, each with cost 1. They are named from the point of view of transforming Y into X:

Insertion: a symbol of X is missing in Y at the corresponding position.
    X: G C A
    Y: G - A

Substitution: the symbols at corresponding positions are distinct.
    X: A C C
    Y: T C C

Deletion: a symbol of Y is missing in X at the corresponding position.
    X: A - T
    Y: A G T

4 Given two strings X and Y, the edit distance between X and Y, written EDIT(X, Y), is the minimum number of insertions, deletions, and substitutions needed to transform Y into X.

5 String X: ATGAATCTTACCGCCTCG
String Y: ATGAGGCTCTGGCCCCTG

Transformation (from string Y to string X):
    String X: A T G A A - - T C T T A C C G C C T C G
    String Y: A T G A G G C T C T G G C C - C C T - G

EDIT(X, Y)=7 (2 insertions, 2 deletions, and 3 substitutions).
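The operation counts on this slide can be checked mechanically by walking the two gapped rows column by column. The following is a minimal sketch; the function name count_ops and the use of "-" as the gap symbol are assumptions made for illustration and do not come from the slides.

```python
def count_ops(x_aligned, y_aligned, gap="-"):
    """Count the operations that turn Y into X, given a gapped alignment."""
    if len(x_aligned) != len(y_aligned):
        raise ValueError("aligned rows must have the same length")
    ops = {"insert": 0, "delete": 0, "substitute": 0, "match": 0}
    for a, b in zip(x_aligned, y_aligned):
        if b == gap:
            ops["insert"] += 1      # symbol of X has no partner in Y
        elif a == gap:
            ops["delete"] += 1      # symbol of Y has no partner in X
        elif a != b:
            ops["substitute"] += 1  # distinct symbols at the same position
        else:
            ops["match"] += 1
    return ops


ops = count_ops("ATGAA--TCTTACCGCCTCG", "ATGAGGCTCTGGCC-CCT-G")
cost = ops["insert"] + ops["delete"] + ops["substitute"]
print(ops, cost)
```

The cost of an alignment is the number of columns that are not matches, so the total here is the sum of the three non-match counters.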

6 Next, we will introduce a dynamic programming method to compute the edit distance between two strings X and Y.

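7 Dynamic Programming for Edit Distance:

EDIT[0, j] = j, EDIT[i, 0] = i
EDIT[i, j] = the minimum of
    EDIT[i-1, j] + 1      (Delete)
    EDIT[i, j-1] + 1      (Insert)
    EDIT[i-1, j-1] + c    (Substitute), where c = 0 if the two symbols are equal and c = 1 otherwise

Here i indexes the symbols of Y (rows of the table) and j indexes the symbols of X (columns of the table).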
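The recurrence can be turned into a short table-filling routine. This is a sketch under the conventions used on the following slides (X along the columns, Y along the rows); the function name edit_table is an assumption, not a name from the slides.

```python
def edit_table(x, y):
    """Fill the (len(y)+1) x (len(x)+1) edit-distance table, Y on rows, X on columns."""
    rows, cols = len(y) + 1, len(x) + 1
    d = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        d[0][j] = j                      # EDIT[0, j] = j
    for i in range(rows):
        d[i][0] = i                      # EDIT[i, 0] = i
    for i in range(1, rows):
        for j in range(1, cols):
            c = 0 if y[i - 1] == x[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # Delete
                d[i][j - 1] + 1,         # Insert
                d[i - 1][j - 1] + c,     # Substitute (or Match when c == 0)
            )
    return d


table = edit_table("abcabba", "cbabac")
for row in table:
    print(" ".join(str(v) for v in row))
print("EDIT(X, Y) =", table[-1][-1])
```

For X=abcabba and Y=cbabac the bottom-right entry is the edit distance.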

8 Given X=abcabba, Y=cbabac. Row 0 and column 0 are initialized:

        a  b  c  a  b  b  a
     0  1  2  3  4  5  6  7
  c  1
  b  2
  a  3
  b  4
  a  5
  c  6

9 Given X=abcabba, Y=cbabac. Row c so far: 1 1

10 Given X=abcabba, Y=cbabac. Row c so far: 1 1 2

11 Given X=abcabba, Y=cbabac. Row c so far: 1 1 2 2

12 Given X=abcabba, Y=cbabac. Row c so far: 1 1 2 2 3

13 Given X=abcabba, Y=cbabac. The completed table:

        a  b  c  a  b  b  a
     0  1  2  3  4  5  6  7
  c  1  1  2  2  3  4  5  6
  b  2  2  1  2  3  3  4  5
  a  3  2  2  2  2  3  4  4
  b  4  3  2  3  3  2  3  4
  a  5  4  3  3  3  3  3  3
  c  6  5  4  3  4  4  4  4

14 Given X=abcabba, Y=cbabac, EDIT(X, Y)=4. Tracing back from the last entry of the table, the alignment is built from right to left:
    X: a
    Y: c        (Substitute)

15  X: ba
    Y: ac       (Substitute)

16  X: bba
    Y: bac      (Match)

17  X: abba
    Y: abac     (Match)

18  X: cabba
    Y: -abac    (Insert)

19  X: bcabba
    Y: b-abac   (Match)

20  X: abcabba
    Y: cb-abac  (Substitute)

21 Another alignment with EDIT(X, Y)=4:
    X: abcabba-
    Y: cb-ab-ac
    Operations, left to right: Substitute, Match, Insert, Match, Match, Insert, Match, Delete

22 Another alignment with EDIT(X, Y)=4:
    X: abcabba-
    Y: cb-a-bac
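The traceback shown on slides 14 to 20 can be sketched as follows. The tie-breaking order (diagonal first, then Insert, then Delete) is an assumption; a different order can yield other optimal alignments such as those on slides 21 and 22. The names edit_table and traceback are illustrative only.

```python
def edit_table(x, y):
    """Edit-distance table with Y on rows and X on columns."""
    d = [[0] * (len(x) + 1) for _ in range(len(y) + 1)]
    for j in range(len(x) + 1):
        d[0][j] = j
    for i in range(len(y) + 1):
        d[i][0] = i
    for i in range(1, len(y) + 1):
        for j in range(1, len(x) + 1):
            c = 0 if y[i - 1] == x[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + c)
    return d


def traceback(x, y):
    """Recover one optimal alignment by walking back from the last table entry."""
    d = edit_table(x, y)
    i, j = len(y), len(x)
    ax, ay, ops = [], [], []
    while i > 0 or j > 0:
        c = 0 if (i > 0 and j > 0 and y[i - 1] == x[j - 1]) else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + c:
            ax.append(x[j - 1]); ay.append(y[i - 1])
            ops.append("Match" if c == 0 else "Substitute")
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ax.append(x[j - 1]); ay.append("-")   # gap in Y: Insert
            ops.append("Insert")
            j -= 1
        else:
            ax.append("-"); ay.append(y[i - 1])   # gap in X: Delete
            ops.append("Delete")
            i -= 1
    return "".join(reversed(ax)), "".join(reversed(ay)), ops[::-1]


ax, ay, ops = traceback("abcabba", "cbabac")
print(ax)
print(ay)
print(ops)
```

With this tie-breaking order the sketch reproduces the alignment built on slides 14 to 20 for X=abcabba and Y=cbabac.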

23 The above algorithm computes the edit distance in O(mn) time and O(mn) space, where n and m are the lengths of X and Y, respectively.

24 In the following, we will introduce the string matching with errors problem.

25 The definition of the problem: Given a pattern P of length m and a text T of length n, find a substring S of T such that EDIT(S, P) is minimal.

Given: T=abcabba, P=cbabac
Find: S=cabba, EDIT(S, P)=3
    P: cbabac
    S: c-abba

For comparison, T's substring K=bcabb has EDIT(K, P)=4:
    P: -cbabac
    K: bc-ab-b
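The definition can be checked directly, without any special algorithm, by computing the edit distance between P and every substring of T. This brute-force sketch is only meant to illustrate the problem statement; the names edit_distance and best_substrings are assumptions, and it is far slower than the dynamic programming method presented in these slides.

```python
def edit_distance(a, b):
    """Plain edit distance between a and b, keeping only two rows of the table."""
    prev = list(range(len(a) + 1))
    for i in range(1, len(b) + 1):
        cur = [i] + [0] * len(a)
        for j in range(1, len(a) + 1):
            c = 0 if b[i - 1] == a[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + c)
        prev = cur
    return prev[-1]


def best_substrings(t, p):
    """Try every substring of t and keep those with minimal distance to p."""
    best, hits = None, set()
    for start in range(len(t) + 1):
        for end in range(start, len(t) + 1):
            s = t[start:end]
            dist = edit_distance(s, p)
            if best is None or dist < best:
                best, hits = dist, {s}
            elif dist == best:
                hits.add(s)
    return best, sorted(hits)


best, hits = best_substrings("abcabba", "cbabac")
print(best, hits)
print(edit_distance("bcabb", "cbabac"))
```

Several substrings can tie for the minimum, so the sketch returns all of them rather than a single answer.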

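26 Dynamic Programming for the String Matching with Errors Problem:

SE[0, j] = 0, SE[i, 0] = i
SE[i, j] = the minimum of
    SE[i-1, j] + 1
    SE[i, j-1] + 1
    SE[i-1, j-1] + c, where c = 0 if the two symbols are equal and c = 1 otherwise

Here i indexes the symbols of P (rows of the table) and j indexes the symbols of T (columns of the table).

27 The difference between EDIT[i, j] and SE[i, j] lies in the first row: EDIT[0, j]=j for the edit distance problem and SE[0, j]=0 for the string matching with errors problem. Apart from this, the dynamic programming approach is the same as for the edit distance problem.

28 In the edit distance problem, we have EDIT[0, j]=j. In the string matching with errors problem, we set SE[0, j]=0.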
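The only change from the edit distance table is the first row, which the following sketch makes explicit. The function name se_table is an assumption; T runs along the columns and P along the rows.

```python
def se_table(t, p):
    """String-matching-with-errors table: P on rows, T on columns."""
    rows, cols = len(p) + 1, len(t) + 1
    se = [[0] * cols for _ in range(rows)]
    # SE[0, j] = 0: a match may start at any position of T at no cost.
    for i in range(rows):
        se[i][0] = i                     # SE[i, 0] = i
    for i in range(1, rows):
        for j in range(1, cols):
            c = 0 if p[i - 1] == t[j - 1] else 1
            se[i][j] = min(se[i - 1][j] + 1, se[i][j - 1] + 1, se[i - 1][j - 1] + c)
    return se


table = se_table("abcabba", "cbabac")
for row in table:
    print(" ".join(str(v) for v in row))
print("minimum of last row =", min(table[-1]))
```

Setting the first row to zero is what lets the match begin anywhere in T; the rest of the computation is unchanged.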

29 T=abcabba, P=cbabac

        a  b  c  a  b  b  a
     0  0  0  0  0  0  0  0
  c  1  1  1  0  1  1  1  1
  b  2  2  1  1  1  1  1  2
  a  3  2  2  2  1  2  2  1
  b  4  3  2  3  2  1  2  2
  a  5  4  3  3  3  2  2  2
  c  6  5  4  3  4  3  3  3

Since the traced path starts at the bottom row and ends at the top row, where SE[0, j]=0, this shows that there exists a substring S in T such that EDIT(S, P)=3.

30 We find the lowest value in the last row and trace back from that cell until the top row is reached. Several cells may share the lowest value, so the output may consist of several substrings.
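A minimal sketch of this step is given below: every cell of the last row that holds the minimum is used as an end position, and the trace back stops as soon as the top row is reached, which gives the start position. The tie-breaking order and the names se_table and matches are assumptions; other tie-breaking choices may report different, equally good substrings.

```python
def se_table(t, p):
    """String-matching-with-errors table: P on rows, T on columns, SE[0, j] = 0."""
    se = [[0] * (len(t) + 1) for _ in range(len(p) + 1)]
    for i in range(len(p) + 1):
        se[i][0] = i
    for i in range(1, len(p) + 1):
        for j in range(1, len(t) + 1):
            c = 0 if p[i - 1] == t[j - 1] else 1
            se[i][j] = min(se[i - 1][j] + 1, se[i][j - 1] + 1, se[i - 1][j - 1] + c)
    return se


def matches(t, p):
    """Return the minimal distance and one substring of t per minimal last-row cell."""
    se = se_table(t, p)
    m = len(p)
    best = min(se[m])
    found = []
    for end in range(len(t) + 1):
        if se[m][end] != best:
            continue
        i, j = m, end
        while i > 0:                     # stop once the top row is reached
            c = 0 if (j > 0 and p[i - 1] == t[j - 1]) else 1
            if j > 0 and se[i][j] == se[i - 1][j - 1] + c:
                i, j = i - 1, j - 1      # diagonal move
            elif j > 0 and se[i][j] == se[i][j - 1] + 1:
                j -= 1                   # move left: symbol of t without partner
            else:
                i -= 1                   # move up: symbol of p without partner
        found.append((j, end, t[j:end]))
    return best, found


best, found = matches("abcabba", "cbabac")
print(best)
for start, end, s in found:
    print(start, end, s)
```

Each reported triple gives the start index, the end index, and the substring T[start:end] whose distance to P equals the minimum.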

31 T=abcabba, P=cbabac

        a  b  c  a  b  b  a
     0  0  0  0  0  0  0  0
  c  1  1  1  0  1  1  1  1
  b  2  2  1  1  1  1  1  2
  a  3  2  2  2  1  2  2  1
  b  4  3  2  3  2  1  2  2
  a  5  4  3  3  3  2  2  2
  c  6  5  4  3  4  3  3  3

S=cabba (T: abc-abba, P: cbabac)

32 T=abcabba, P=cbabac. The edit distance table between S=cabba and P:

        c  a  b  b  a
     0  1  2  3  4  5
  c  1  0  1  2  3  4
  b  2  1  1  1  2  3
  a  3  2  1  2  2  2
  b  4  3  2  1  2  3
  a  5  4  3  2  2  2
  c  6  5  4  3  3  3

EDIT(S, P)=3
    S: c-abba
    P: cbabac

33 T=abcabba, P=cbabac, EDIT(S, P)=3
    S: cabba-
    P: cbabac

34 T=abcabba, P=cbabac, EDIT(S, P)=3
    S: c-ab--
    P: cbabac

35 T=abcabba, P=cbabac, EDIT(S, P)=3
    S: --ab-c
    P: cbabac

36 References

For edit distance computation:
[NW70] Needleman, S. B., and Wunsch, C. D., A general method applicable to the search for similarities in the amino acid sequence of two proteins, Journal of Molecular Biology, Vol. 48, 1970, pp. 443-453.

For string matching with errors:
[S80] Sellers, P. H., The Theory and Computation of Evolutionary Distances: Pattern Recognition, Journal of Algorithms, Vol. 20, No. 1, 1980, pp. 359-373.

37 Thank you

