1 Two Different Approximate String Matching Problems and Their Algorithms Speakers: C. W. Lu and Y. K. Shie Advisor: Richard Chia-Tung Lee.


Presentation transcript:

1 Two Different Approximate String Matching Problems and Their Algorithms Speakers: C. W. Lu and Y. K. Shie Advisor: Richard Chia-Tung Lee

2 Two different definitions of the approximate string matching problem:
–Given a text T, a pattern P and an error bound k, find all the substrings of T whose edit distance with P is less than or equal to k. (Denoted as Problem 1)
–Given a text T, a pattern P and an error bound k, find all positions i of T such that there exists a suffix of T(1, i) whose edit distance with P is less than or equal to k. (Denoted as Problem 2)

3 An example of Problem 1: T: a b a b c d b c d d P: abcd k = 1 Output: T(2, 6)=babcd, T(3, 5)=abc, T(3, 6)=abcd, T(3, 7)=abcdb, T(4, 6)=bcd, T(6, 9)=dbcd, T(7, 9)=bcd

4 An example of Problem 2: T: a b a b c d b c d d P: abcd k = 1 Output: Positions of T: 6, 7 and

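5 Computing the edit distance between two strings X and Y by using the dynamic programming method:

$$SE[i, j] = \min \begin{cases} SE[i-1, j] + 1 & \text{(Delete)} \\ SE[i, j-1] + 1 & \text{(Insert)} \\ SE[i-1, j-1] + \delta(y_i, x_j) & \text{(Substitute)} \end{cases}$$

with $SE[i, 0] = i$ and $SE[0, j] = j$, where $\delta(y_i, x_j) = 0$ if $y_i = x_j$ and 1 otherwise. Let us denote this method as DP1.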
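A minimal Python sketch of DP1, assuming the standard unit-cost edit distance (delete, insert and substitute each cost 1). The names `dp1_table` and `edit_distance` are illustrative and do not come from the slides.

```python
def dp1_table(x: str, y: str) -> list[list[int]]:
    """Return d where d[i][j] is the edit distance between x[:i] and y[:j]."""
    d = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        d[i][0] = i          # delete all i characters of x[:i]
    for j in range(1, len(y) + 1):
        d[0][j] = j          # insert all j characters of y[:j]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # substitute or match
    return d


def edit_distance(x: str, y: str) -> int:
    """Edit distance between the full strings x and y."""
    return dp1_table(x, y)[-1][-1]
```

The full table is returned because every cell is itself an edit distance between two prefixes, which is what the sliding-window method relies on.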

6 Example: Y = accgatgc, X = aaacga. [DP1 table with i running over Y and j running over X; the entries were not preserved.] We can find the edit distances between all prefixes of Y and all prefixes of X from this table.

7 Problem 1 can be solved by computing the edit distances between P and the prefixes of T(i, i+m-1+k) for all 1 ≤ i ≤ n, where m is the size of P and n is the size of T. That is, for i=1 to n, we perform DP1 on T(i, i+m-1+k) and P. Each DP1 table has m(m+k) entries and k < m, so the time complexity is $O(m^2 n)$. Thus, we always open a window of size m+k and slide it one position at a time.

8 T: a b a b c d b c P: a b c d k = 1. Window of size m+k at i = 1: ababc. Output: ∅

9 T: a b a b c d b c P: a b c d k = 1. Window of size m+k at i = 2: babcd. Output: T(2, 6)=babcd

10 T: a b a b c d b c P: a b c d k = 1. Window of size m+k at i = 3: abcdb. Output: T(3, 5)=abc, T(3, 6)=abcd, T(3, 7)=abcdb

11 T: a b a b c d b c P: a b c d k = 1. Window of size m+k at i = 4: bcdbc. Output: T(4, 6)=bcd

12 T: a b a b c d b c P: a b c d k = 1. Window at i = 5: cdbc. Last row of the DP1 table: 4 3 2 3 4. Output: ∅

13 T: a b a b c d b c P: a b c d k = 1. Window at i = 6: dbc. Last row of the DP1 table: 4 3 3 2. Output: ∅
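The sliding-window walk-through above can be sketched in Python as follows. This is an illustrative brute-force version under the assumption k < m; the function name `problem1` is hypothetical and positions are reported 1-based as in the slides.

```python
def problem1(text: str, pattern: str, k: int) -> list[tuple[int, int]]:
    """Return all 1-based (start, end) with ED(T(start, end), P) <= k.

    For every start i, DP1 is run on the window T(i, i+m-1+k); assumes k < m.
    """
    n, m = len(text), len(pattern)
    hits = []
    for i in range(n):
        window = text[i:i + m + k]
        prev = list(range(m + 1))          # ED("", P[:j]) = j
        for length, ch in enumerate(window, start=1):
            cur = [length] + [0] * m       # ED(window[:length], "") = length
            for j in range(1, m + 1):
                cost = 0 if ch == pattern[j - 1] else 1
                cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            if cur[m] <= k:                # this prefix of the window is a solution
                hits.append((i + 1, i + length))
            prev = cur
    return hits


print(problem1("ababcdbcdd", "abcd", 1))
```

Only one row of the table is kept at a time, since a prefix of the window is accepted as soon as its entry in the last column is at most k.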

14 Some algorithms try to avoid this exhaustive computation, for example the Navarro and Baeza-Yates Algorithm [NB2000], the Fredriksson and Navarro Algorithm [FN2004] and the Lu and Lee Algorithm [LL2008]. We shall explain these algorithms later.

15 Another approach, which does not use the sliding window, is the Wu and Manber Algorithm [WM92]. Actually, the idea proposed in [NB2000] was already mentioned in [WM92], but only briefly, as if it were trivial.

16 To solve Problem 2, we may use another DP algorithm, called DP2, which will be explained as follows.

17 Given strings Y and X, compute the minimal ED(S, X), where S is a suffix of the prefix Y(1, i), for all 1 ≤ i ≤ |Y|. Let us denote this method as DP2. Note that SE[i, 0]=i in DP1 and SE[i, 0]=0 in DP2.

18 Example: Y = accgatgc, X = aaacga. [DP2 table with i running over Y and j running over X; the entries were not preserved.]

19 Table(5, 6)=2 indicates that there is a substring, namely accga, ending at location 5 of Y, whose edit distance with X is 2, which is the smallest among all substrings of Y ending at location 5.

20 We need to trace back if we want to know which substring of Y is our solution.

21 If we set k=3, then from the table we can see that locations 4, 5 and 6 are the solutions for Problem 2. If k=4, the solutions are locations 2 to 8.
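The DP2 computation can be sketched in Python as follows, assuming unit edit costs. Only the boundary changes with respect to DP1: the entry for an empty pattern prefix is 0 at every text position, so a match may start anywhere. The names `dp2_last_row` and `problem2` are illustrative.

```python
def dp2_last_row(text: str, pattern: str) -> list[int]:
    """row[i] = min ED(S, pattern) over all suffixes S of text[:i], for i = 0..n."""
    m = len(pattern)
    col = list(range(m + 1))       # column for the empty text prefix
    row = [col[m]]
    for ch in text:
        new = [0] * (m + 1)        # boundary 0: the occurrence may start here
        for j in range(1, m + 1):
            cost = 0 if ch == pattern[j - 1] else 1
            new[j] = min(col[j] + 1, new[j - 1] + 1, col[j - 1] + cost)
        row.append(new[m])
        col = new
    return row


def problem2(text: str, pattern: str, k: int) -> list[int]:
    """1-based end positions i such that some suffix of T(1, i) is within k of P."""
    row = dp2_last_row(text, pattern)
    return [i for i in range(1, len(text) + 1) if row[i] <= k]


print(problem2("accgatgc", "aaacga", 3))
print(problem2("accgatgc", "aaacga", 4))
```

With Y = accgatgc and X = aaacga this reproduces the locations stated in the example for k=3 and k=4.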

22 Obviously, Problem 2 can be solved by DP2, and the time complexity is O(mn). Note that no window sliding is involved if DP2 is used directly. Again, if some of the positions of T could be ignored, this method would be more efficient. Some algorithms try to do this, for example the Navarro and Baeza-Yates Algorithm [NB99], the Tarhio and Ukkonen Algorithm [TU93] and the algorithm in Z. H. Pan's thesis. We shall explain them later.

23 HN Algorithm (Bit-parallel Witnesses and their Applications to Approximate String Matching, Heikki Hyyro and Gonzalo Navarro, Algorithmica, 2005 Vol 4, No 3.) For a substring S of T, they use the DP2 method to find the minimum ED(S, P') among all substrings P' of P. The HN Algorithm solves Problem 1, but it also uses the DP2 method. We will explain this in the next slide.

24 HN Algorithm: For a window of size m-k, if there exists a substring S in this window such that its edit distance with every substring of P is greater than k, we move P to S. This rule is called Rule 5 by Lee's group. [Figure: T with a window of size m-k containing S, and P aligned below it.] (This paper has not been reported yet. It is rumored that Mr. Ou-Dee should study and report it. Obviously, he is busy doing some crazy things.)

25 Example: T = abaeeaabcdeada, P = ababcd, k=1. [Figure: DP2 table for the window of size m-k of T against the reversed pattern dcbaba; the resulting distance is > k.] In this case, both the pattern and the text are reversed.

26 Both DP1 and DP2 can be improved by the LV algorithm (Fast Parallel and Serial Approximate String Matching, G. Landau and U. Vishkin, Journal of Algorithms, Vol.10 (1989), pp ), which takes O(nk) time. This algorithm avoids computing the entire DP table.

27 Diagonal d is defined as all of the $D_{i,j}$'s where d = i–j, where $D_{i,j}$ is the value of (i, j) in the DP table. [Figure: a DP table with Diagonal 2 marked.]

28 This algorithm is based on the following observations:
–The values of the elements on the same diagonal are non-decreasing.
–The value of every element on diagonal d is decided by the elements on diagonals d, d-1 and d+1: $D_{i,j}$ is computed from $D_{i-1,j-1}$ (diagonal d), $D_{i,j-1}$ (diagonal d+1) and $D_{i-1,j}$ (diagonal d-1), which correspond to the substitution, insert and delete operations.

29 –The values of the elements on the same diagonal are non-decreasing. Proof: Assume there exists a value $D_{i,j}$ such that $D_{i,j} > D_{i+1,j+1}$. Then we have $D_{i,j} \ge D_{i+1,j+1} + 1$, so $D_{i+1,j+1}$ cannot have been obtained from its diagonal neighbor $D_{i,j}$. By the definition of DP, we have either $D_{i+1,j+1} = D_{i+1,j} + 1$ or $D_{i+1,j+1} = D_{i,j+1} + 1$. That is, $D_{i+1,j} = D_{i+1,j+1} - 1$ or $D_{i,j+1} = D_{i+1,j+1} - 1$. Thus, $D_{i,j} - D_{i+1,j} \ge 2$ or $D_{i,j} - D_{i,j+1} \ge 2$. Both cases contradict another property: the value of any location in the DP table can be at most 1 larger than that of its neighbors.

30 [Figure: the four neighboring cells $D_{i,j}$, $D_{i+1,j}$, $D_{i,j+1}$ and $D_{i+1,j+1}$.]

31 Besides, the value of any location in the DP table can be at most 1 larger than that of its neighbors. [Figure: $D_{i,j}$ with its neighbors $D_{i-1,j-1}$ on diagonal d, $D_{i,j-1}$ on diagonal d+1 and $D_{i-1,j}$ on diagonal d-1.]

32 Let us consider the following table. Question: Assuming that we have already found all locations of i and j such that $D_{i,j}=0$, what is the largest j on diagonal 1 such that $D_{i,j}=1$? [Table: partial DP table; the entry in question on diagonal d=1 is marked "?".]

33 Let us consider the same table. Certainly $D_{4,3} \ne 0$ because we have found all 0's. $D_{4,3}$ must be greater than 0 and can be at most 1 larger than $D_{4,2}$. Thus $D_{4,3}=1$.

34 Question: Can $D_{5,4}=1$? –Since $T_5 = P_4$, $D_{5,4} = D_{4,3} = 1$. This step can be done by a lowest common ancestor query, which takes O(1) time [BF2000]. We explain it in the next slide.

35 Question (diagonal d=3): What is the longest common prefix of tac and taa? Answer: It is ta, whose length is 2. This means that $D_{6,3}$ and $D_{7,4}$ are both 1. We find this longest common prefix by using a suffix tree.

36 We concatenate the two strings gggtctac and gttaa and construct the suffix tree of the concatenation for finding the LCA of $S_1$ and $S_2$. $S_1$ = taa, $S_2$ = tacgttaa

37 S = gggtctacgttaa. [Figure: suffix tree of S with the leaves of $S_2$ and $S_1$ marked.] ta is found.
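The diagonal extension step can be sketched in Python with a plain character-by-character comparison. This is a simplification: the slides obtain the same length in O(1) time with a suffix tree and an LCA query, which is not implemented here. The names `lcp_length` and `extend_diagonal` are illustrative.

```python
def lcp_length(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b (naive scan)."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def extend_diagonal(text: str, pattern: str, i: int, j: int) -> int:
    """Number of cells after the 1-based cell (i, j) on its diagonal that keep
    the same DP value, i.e. the LCP of T(i+1, n) and P(j+1, m)."""
    return lcp_length(text[i:], pattern[j:])


# Text gggtctac and pattern gttaa from the slides: after cell (5, 2) the
# suffixes to compare are tac and taa.
print(extend_diagonal("gggtctac", "gttaa", 5, 2))
```

A returned length of 2 from cell (5, 2) means that the value carries over to cells (6, 3) and (7, 4), as stated in the example.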

38 Algorithms using the DP1 method to solve Problem 1.

39 A Hybrid Indexing Method for Approximate String Matching Journal of Discrete Algorithms, No. 1, Vol. 1, 2000, pp , Gonzalo Navarro and Ricardo Baeza-Yates Advisor: Prof. R. C. T. Lee Speaker: Y. K. Shieh

40 We divide P into several pieces. After the pattern is divided, it has the property stated in Lemma 1. Lemma 1: Let A and B be two strings such that $ed(A, B) \le k$. Let $A = A_1 A_2 \cdots A_j$, for any $j \ge 1$. Then at least one string $A_i$ appears in B with at most $\lfloor k/j \rfloor$ errors. By the above lemma, when j = k+1 we have $\lfloor k/j \rfloor = 0$, so we only need to find whether any piece of P appears exactly in T or not.

41 After we find all probable positions in T, we verify every substring at those positions. Example: P = CAAG, k=1, $P_1$ = CA, $P_2$ = AG. The probable positions of T are 3, 10, 13, 15 and 16. We use DP1 with window size m+k to verify whether any approximate match occurs between T and P at the above locations.

42 The algorithm would open windows whose sizes are all equal to m+k step by step. If a window does not contain any piece of P, that window is ignored. Thus it avoids an exhaustive search.

43 i=1. P = CAAG, k=1, window size = m+k. The probable positions of T are 3, 10, 13, 15 and 16. Window: GACAC. DP1 is used. No approximate match with k=1 is found.

44 i=2. Window: ACACG. DP1 is used. No approximate match with k=1 is found.

45 i=3. Window: CACGG. DP1 is used. CACG is found.

46 i=4. This window does not include any probable position. Therefore we can ignore this window.

47 i=5. The window does not include any probable position. Therefore we can shift the window directly.

48 i=12. Window: AAGCA. DP1 is used. AAG is found.
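The filtering step can be sketched in Python as follows. The sketch assumes that P is cut into k+1 contiguous pieces of nearly equal length, with the longer pieces first; the slides do not state the exact splitting rule, so this is an assumption. The names `split_pattern` and `probable_positions` are illustrative.

```python
def split_pattern(pattern: str, k: int) -> list[tuple[int, str]]:
    """Cut pattern into k+1 contiguous pieces; return (0-based offset, piece).

    Assumes len(pattern) >= k + 1 so that no piece is empty.
    """
    j = k + 1
    base, extra = divmod(len(pattern), j)
    pieces, pos = [], 0
    for idx in range(j):
        size = base + (1 if idx < extra else 0)   # the first pieces take the remainder
        pieces.append((pos, pattern[pos:pos + size]))
        pos += size
    return pieces


def probable_positions(text: str, pattern: str, k: int) -> list[int]:
    """1-based text positions where some piece of the pattern occurs exactly."""
    hits = set()
    for _, piece in split_pattern(pattern, k):
        start = text.find(piece)
        while start != -1:
            hits.add(start + 1)
            start = text.find(piece, start + 1)
    return sorted(hits)


print(split_pattern("CAAG", 1))
print(probable_positions("GACACGG", "CAAG", 1))
```

Only windows that contain one of the returned positions need to be verified with DP1; all other windows are skipped.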

49 Average-Optimal Multiple Approximate String Matching, Kimmo Fredriksson and Gonzalo Navarro, ACM Journal of Experimental Algorithmics, Vol 9, Article No. 1.4, 2004, Pages 1-47. Professor: R. C. T. Lee Speaker: K. W. Liu

50 This algorithm uses a checking window. For a checking window of size m-k, if there exists a substring S in this window such that its edit distance with every substring of P is greater than k, we move P to S. We are using Rule 5 now. The algorithm scans the window from the right. [Figure: T with a checking window of size m-k containing S, and P aligned below it.]

51 Note that this way of moving the window ensures that we do not miss anything and that the size of the window needs only to be m+k. During the checking phase, the window size is m-k. But how do we find such an S? We use a very useful lemma.

52 Lemma: Consider strings Q and P. Let Q be divided into $q_1, q_2, \ldots, q_n$, numbered from right to left. For each $q_i$, let $p_i$ be the substring in P such that $ED(q_i, p_i)$ is the smallest among all substrings in P.

53 Divide a window of T into pieces $\ldots, t_2, t_1$, numbered from right to left. [Figure: the pieces $t_1, t_2, \ldots$ of T and their closest substrings $p_1, p_2, \ldots$ in P.]

54 To apply this lemma, we open a checking window in T with size m-k, according to Rule 5. We now divide this window into substrings of length 2, called 2-grams. Note that for 2-grams, the Hamming distance is equal to the edit distance.

55 Example: T = ctagggaataatttacaatt, P = ttaatatat, k=1. The checking window of size m-k is ctagggaa, read in 2-grams from the right:
Smallest edit distance between aa and all substrings of P = 0
Smallest edit distance between gg and all substrings of P = 2 > k already.
Thus S = gg. According to Rule 5, we move P after S.

56 Example: T = ctagggaataatttacaatt, P = ttaatatat, k=1. The next checking window of size m-k is gaataatt:
Smallest edit distance between tt and all substrings of P = 0
Smallest edit distance between aa and all substrings of P = 0
Smallest edit distance between at and all substrings of P = 0
Smallest edit distance between ga and all substrings of P = 1 == k
No S is found. We extend the window to size m+k and examine, by using DP1, whether there is a prefix of the window whose edit distance with P is smaller than or equal to k.

57 Example: T = ctagggaataatttacaatt, P = ttaatatat, k=1. No prefix of the extended window of size m+k whose edit distance with P is smaller than or equal to k can be found. After checking, no matter whether a solution is found or not, we can only move P one step.

58 Example: T = ctagggaataatttacaatt, P = ttaatatat, k=1. [Figure: the checking window of size m-k moved one step to the right in T.]
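The checking phase of the example can be sketched in Python as follows. The sketch assumes that the distances of the 2-grams are accumulated from the right and that scanning stops as soon as the sum exceeds k, which is how the example proceeds; the published algorithm precomputes these distances in a table, which is not done here. The names `min_substring_distance` and `window_lower_bound` are illustrative.

```python
def min_substring_distance(gram: str, pattern: str) -> int:
    """Smallest edit distance between gram and any substring of pattern."""
    prev = [0] * (len(pattern) + 1)        # free starting point inside pattern
    for i, g in enumerate(gram, start=1):
        cur = [i] + [0] * len(pattern)
        for j in range(1, len(pattern) + 1):
            cost = 0 if g == pattern[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return min(prev)                       # free ending point inside pattern


def window_lower_bound(window: str, pattern: str, k: int, q: int = 2) -> tuple[int, int]:
    """Read q-grams of window from the right, summing their smallest distances.

    Stops once the sum exceeds k; returns (sum, number of q-grams read).
    """
    total, read, end = 0, 0, len(window)
    while end - q >= 0 and total <= k:
        total += min_substring_distance(window[end - q:end], pattern)
        end -= q
        read += 1
    return total, read


print(window_lower_bound("ctagggaa", "ttaatatat", 1))
print(window_lower_bound("gaataatt", "ttaatatat", 1))
```

For the first window the sum exceeds k after two 2-grams, so P can be moved past gg; for the second window all four 2-grams are read, the sum stays at k, and the window has to be verified with DP1.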

59 An Approximate String Matching Algorithm Based upon the Candidate Elimination Method C. W. Lu Thesis

60 Example: T = aaaacaacabacbacaaaaa. [Figure: a location i in T.]

61 For every location i of T, we only consider the substrings T(i, i+m-1-k), …, T(i, i+m-1+k). The solution size must be between m-k and m+k. Therefore, any solution starting from i must end in the range from i+m-1-k to i+m-1+k, and the window size is m+k. We use DP1 to decide whether there exists any prefix of the window whose edit distance with P is smaller than or equal to k.

62 In the following, we shall show that we may determine that no solution can be found in a window.

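63 Let $N_c(S)$ denote the number of occurrences of character c in S. If $N_c(T(i, i+m-1-k)) = y$, then $N_c(T(i, i+m-k)) \ge y$, $N_c(T(i, i+m-k+1)) \ge y$, …, and $N_c(T(i, i+m-1+k)) \ge y$. Example: T = aaaacaacabacbacaaaaa, m=9 and k=2. $N_a(T(1, 7)) = 6$, $N_b(T(1, 7)) = 0$, $N_c(T(1, 7)) = 1$. [Figure: the substrings of sizes m-2, m-1, m, m+1 and m+2 starting at location 1.]

64 Lemma 2. Let $C_1$ be the set of all characters c such that $N_c(T(i, i+m-1-k)) > N_c(P)$. If $\sum_{c \in C_1} [N_c(T(i, i+m-1-k)) - N_c(P)] > k$, then $ED(T(i, j), P) > k$ for $i+m-1-k \le j \le i+m-1+k$. Thus, to use Lemma 2, we use a checking window whose size is m-k.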

65 Example: T = babbabbcabacbacaaaaa, P = accacabcb, k=2, checking window size m-k=7. $N_a(P) = 3$, $N_b(P) = 2$, $N_c(P) = 4$. $N_a(T(1, 7)) = 2$, $N_b(T(1, 7)) = 5$, $N_c(T(1, 7)) = 0$. The number of occurrences of the character b in T(1, 7) is larger than that in P. $N_b(T(1, 7)) - N_b(P) = 5 - 2 = 3 > k$. Thus, the edit distances of all substrings starting at location 1 with P are larger than k.

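66 If $N_c(T(i, i+m-1+k)) = y$, then $N_c(T(i, i+m-2+k)) \le y$, $N_c(T(i, i+m-3+k)) \le y$, …, and $N_c(T(i, i+m-1-k)) \le y$. Example: T = aaaacaacabacbacaaaaa, m=9 and k=2. $N_a(T(1, 11)) = 8$, $N_b(T(1, 11)) = 1$, $N_c(T(1, 11)) = 2$. [Figure: the substrings of sizes m-2, m-1, m, m+1 and m+2 starting at location 1.]

67 Lemma 3. Let $C_2$ be the set of all characters c such that $N_c(T(i, i+m-1+k)) < N_c(P)$. If $\sum_{c \in C_2} [N_c(P) - N_c(T(i, i+m-1+k))] > k$, then $ED(T(i, j), P) > k$ for $i+m-1-k \le j \le i+m-1+k$. To apply this lemma, the checking window size is now m+k.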

68 Example: T = aaaacaacabacbacaaaaa, P = accacabcb, k=2, checking window size m+k=11. $N_a(P) = 3$, $N_b(P) = 2$, $N_c(P) = 4$. $N_a(T(1, 11)) = 8$, $N_b(T(1, 11)) = 1$, $N_c(T(1, 11)) = 2$. The numbers of occurrences of the characters b and c in T(1, 11) are smaller than those in P. $[N_b(P) - N_b(T(1, 11))] + [N_c(P) - N_c(T(1, 11))] = (2-1) + (4-2) = 3 > k$. Thus, the edit distances of all substrings starting at location 1 with P are larger than k.

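69 Theorem 1. If $\sum_{c \in C_1} [N_c(T(i, i+m-1-k)) - N_c(P)] > k$ or $\sum_{c \in C_2} [N_c(P) - N_c(T(i, i+m-1+k))] > k$, we can eliminate position i of T. That is, we prune all substrings T(i, j), for $i+m-1-k \le j \le i+m-1+k$, out of consideration.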

70 Example: T = aaaacaacabacbacaaaaa, P = accacabcb, k=2. $N_a(P) = 3$, $N_a(T(1, 7)) = 6$, $N_a(T(1, 11)) = 8$, $N_b(P) = 2$, $N_b(T(1, 7)) = 0$, $N_b(T(1, 11)) = 1$, $N_c(P) = 4$, $N_c(T(1, 7)) = 1$, $N_c(T(1, 11)) = 2$. m-k: $N_a(T(1, 7)) - N_a(P) = 6-3 = 3 > k$. m+k: $[N_b(P) - N_b(T(1, 11))] + [N_c(P) - N_c(T(1, 11))] = (2-1) + (4-2) = 3 > k$.

71 Example: T = aaaacaacabacbacaaaaa, P = accacabcb, k=2. $N_a(P) = 3$, $N_a(T(2, 8)) = 5$, $N_a(T(2, 12)) = 7$, $N_b(P) = 2$, $N_b(T(2, 8)) = 0$, $N_b(T(2, 12)) = 1$, $N_c(P) = 4$, $N_c(T(2, 8)) = 2$, $N_c(T(2, 12)) = 3$. m-k: $N_a(T(2, 8)) - N_a(P) = 5-3 = 2 \le k$. m+k: $[N_b(P) - N_b(T(2, 12))] + [N_c(P) - N_c(T(2, 12))] = (2-1) + (4-3) = 2 \le k$. DP1 is to be used now.

72 DP1 on the window T(2, 12) = aaacaacabac and P = accacabcb with k=2: the edit distances found are all > k. [DP1 table not preserved.]

73 Example: T = aaaacaacabacbacaaaaa, P = accacabcb, k=2. $N_a(P) = 3$, $N_a(T(4, 10)) = 4$, $N_a(T(4, 14)) = 6$, $N_b(P) = 2$, $N_b(T(4, 10)) = 1$, $N_b(T(4, 14)) = 2$, $N_c(P) = 4$, $N_c(T(4, 10)) = 2$, $N_c(T(4, 14)) = 3$. m-k: $N_a(T(4, 10)) - N_a(P) = 4-3 = 1 \le k$. m+k: $N_c(P) - N_c(T(4, 14)) = 4-3 = 1 \le k$.

74 DP1 on the window T(4, 14) = acaacabacba and P = accacabcb with k=2: an edit distance of 2 ≤ k is found. [DP1 table not preserved.]

75 Example: T = aaaacaacabacbacaaaaa, P = accacabcb, k=2. ED(T(4, 13), P) = 2 ≤ k.
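The character-counting test used in the examples above can be sketched in Python as follows. It assumes the sum form of the two conditions, as in the worked arithmetic of the examples: the excess of characters in the window of size m-k and the deficit of characters in the window of size m+k. The name `can_eliminate` is illustrative, and positions are 1-based.

```python
from collections import Counter


def can_eliminate(text: str, pattern: str, i: int, k: int) -> tuple[bool, int, int]:
    """Counting filter for 1-based position i; returns (prune, excess, deficit)."""
    m = len(pattern)
    p_counts = Counter(pattern)
    short = Counter(text[i - 1:i - 1 + m - k])   # checking window of size m-k
    long_ = Counter(text[i - 1:i - 1 + m + k])   # checking window of size m+k
    # Characters that occur more often in the short window than in P.
    excess = sum(max(0, short[c] - p_counts[c]) for c in short)
    # Characters that occur less often in the long window than in P.
    deficit = sum(max(0, p_counts[c] - long_[c]) for c in p_counts)
    return (excess > k or deficit > k), excess, deficit


T = "aaaacaacabacbacaaaaa"
P = "accacabcb"
for pos in (1, 2, 4):
    print(pos, can_eliminate(T, P, pos, 2))
```

Position 1 is pruned because both sums are 3 > k, while positions 2 and 4 pass the filter and are handed to DP1.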

76 Algorithms using the DP2 method to Solve Problem 2.

77 Very fast and simple approximate string matching Information Processing Letters, 72:65-70, G. Navarro and R. Baeza-Yates Advisor: Prof. R. C. T. Lee Speaker: H. M. Chen

78 Lemma 1 is used again. We divide P into several pieces. After the pattern is divided, it has the property stated in Lemma 1. Lemma 1: Let A and B be two strings such that $ed(A, B) \le k$. Let $A = A_1 A_2 \cdots A_j$, for any $j \ge 1$. Then at least one string $A_i$ appears in B with at most $\lfloor k/j \rfloor$ errors. By the above lemma, when j = k+1 we have $\lfloor k/j \rfloor = 0$, so we only need to find whether any piece of P appears exactly in T or not.

79 Suppose the window of size m starting at location i matches P exactly. We extend it to the left to i-k and to the right to i+m-1+k. Thus, the approximate solution lies in a window of size m+2k. We now use DP2 to decide whether there exists any substring in the window whose edit distance with P is smaller than or equal to k.

80 Although this algorithm looks like the NB Algorithm [NB2000], it is actually different from it because we are now solving Problem 2 while NB Algorithm solves Problem 1. In the NB Algorithm, DP1 is used. The window size must be m+k and the window is moved step by step.

81 But, in this algorithm, we solve Problem 2. Thus DP2 is used and the examination window size must be m+2k as shown below:

82 A full example: T = GACACTAGCCACACTGATCC, P = ACATCAGCC, k=1. By Lemma 1, we divide P into j = k+1 = 1+1 = 2 pieces. Therefore, we obtain $P_1$ = ACATC and $P_2$ = AGCC.

83 T = GACACTAGCCACACTGATCC, P = ACATCAGCC. We then open a window T(1, 1+m-1+2k)=T(1, 11) and use DP2.

84 DP2 on the window GACACTAGCCA and P = ACATCAGCC: from the table, we conclude that there is no solution for k=1. [DP2 table not preserved.]
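The full example can be traced with a short Python sketch. It assumes that the examination window is placed by aligning P with the exact occurrence of a piece and then extending k positions to each side, which yields the window T(1, 11) used above. The names `dp2_row` and `examination_window` are illustrative.

```python
def dp2_row(text: str, pattern: str) -> list[int]:
    """For each end position in text, the smallest ED between pattern and a
    substring of text ending there (DP2, unit costs)."""
    m = len(pattern)
    col = list(range(m + 1))
    row = []
    for ch in text:
        new = [0] * (m + 1)                # a match may start at any position
        for j in range(1, m + 1):
            cost = 0 if ch == pattern[j - 1] else 1
            new[j] = min(col[j] + 1, new[j - 1] + 1, col[j - 1] + cost)
        row.append(new[m])
        col = new
    return row


def examination_window(piece_start: int, piece_offset: int, m: int, k: int, n: int) -> tuple[int, int]:
    """1-based inclusive window of size at most m+2k around an aligned P.

    piece_start: 1-based text position of the piece occurrence.
    piece_offset: 0-based offset of the piece inside P.
    """
    i = piece_start - piece_offset         # where P starts under this alignment
    return max(1, i - k), min(n, i + m - 1 + k)


T, P, K = "GACACTAGCCACACTGATCC", "ACATCAGCC", 1
piece = "AGCC"                             # second piece of P
lo, hi = examination_window(T.find(piece) + 1, P.find(piece), len(P), K, len(T))
best = min(dp2_row(T[lo - 1:hi], P))
print((lo, hi), best)
```

The smallest distance in the window is 2, which is larger than k=1, in agreement with the conclusion of the example.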

85 Pan's Algorithm for Problem 2: In Pan's thesis, he does not divide P into k+1 pieces. Instead, he picks k+1 substrings with constant length C. If, for a window W, one such substring appears exactly, he then opens a window of size m+2k as shown below.

86 Pan's algorithm only checks those windows which are opened. The checking is done by DP2.

87 Approximate Boyer-Moore String Matching Source : SIAM Journal on Computing, Vol. 22, No. 2, 1993, pp J. Tarhio and E. Ukkonen Advisor: Prof. R. C. T. Lee Speaker: Kuei-hao Chen

88 In the following figure, y in T is located at i and x in T is located at j. But y does not appear within i-k to i+k in P and x does not appear within j-k to j+k in P. In this case, it can be seen that deleting any character in P will not result in an exact match. Thus, the edit distance between T and P must be larger than 1. Since k=1, it is impossible to have ED(T, P) ≤ k. [Figure: positions i and j in T with the ranges i-k to i+k and j-k to j+k in P, for k=1.]

89 Suppose a character x of a window of T is located at i. The range of P from i-k to i+k is called the 2k-range of x.

90 In this algorithm, we always open a window of size m and check whether there exist k+1 characters in the window which do not appear in their corresponding 2k-ranges. If yes, shift the window according to the Shifting Rule, which is explained later. If no, supposing the window starts at location i, use DP2 on the window T(i-k, i+m-1+k) of T with P. After this, shift the window according to the Shifting Rule.

91 The Shifting Rule is given in the next slides.

92 Case 1: There is one character in the (k+1)-suffix of the window which exists in P in such a way as shown below. Move the pattern to match these characters. Note that in such a situation, there are at most k mismatches between the (k+1)-suffix and its corresponding substring in P. An x-suffix (x-prefix) is a suffix (prefix) of size x.

93 A very tricky point here: an approximate solution of our problem may still start from a location to the left of i. In fact, it may start from any location between i-k and i+k. A similar argument applies to the ending point. Conclusion: The examination window size must be m+2k.

94 Case 2: No such character exists. Move the pattern in such a way that the k-prefix of P aligns with the k-suffix of W as shown below. In such a situation, again, there are at most k mismatches between the k-suffix of W and the k-prefix of P.

95 We perform a pre-processing similar to the bad character rule pre-processing done in BM Algorithm. For this algorithm, the checking window size is m and the examination window size is m+2k.

96 Complete example for approximate string matching. Let k=1, m=8, n=24. T: GCATCGCAGAGAGTATGCAGAGCG P: GCAGAGAG [Bad character rule table: rows D[i=8, a] and D[i=7, a], columns A, C, G and * over the alphabet Σ; the entries were not preserved.]

97 Example (1/15): T: GCATCGCAGAGAGTATGCAGAGCG P: GCAGAGAG k=1. $t_8$=A appears in P(7, 8). $t_7$=C does not appear in P(6, 8). $t_6$=G appears in P(5, 7). $t_5$=C does not appear in P(4, 6). The number of characters that do not appear in their 2k-ranges is > k. Shifting is needed now. We examine the (k+1)-suffix, which is a 2-suffix.

98 Example (2/15): T: GCATCGCAGAGAGTATGCAGAGCG P: GCAGAGAG k=1. $t_9$=G appears in P(7, 8). $t_8$=A appears in P(6, 8). $t_7$=C does not appear in P(5, 7). $t_6$=G appears in P(4, 6). $t_5$=C does not appear in P(3, 5). The number of characters that do not appear in their 2k-ranges is > k. Shifting is needed now.

99 Example (3/15): T: GCATCGCAGAGAGTATGCAGAGCG P: GCAGAGAG k=1. $t_{11}$=G appears in P(7, 8). $t_{10}$=A appears in P(6, 8). $t_9$=G appears in P(5, 7). $t_8$=A appears in P(4, 6). $t_7$=C does not appear in P(3, 5). $t_6$=G appears in P(2, 4). $t_5$=C appears in P(1, 3). $t_4$=T does not appear in P(1, 2). The number of characters that do not appear in their 2k-ranges is > k. Shifting is needed now.

100 T: GCATCGCAGAGAGTATGCAGAGCG P: GCAGAGAG k=1. The window does not contain k+1 characters that fail to appear in their 2k-ranges. DP2 is used on the examination window CGCAGAGAGT with P = GCAGAGAG. Output locations 12, 13 and 14.
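The checking step of the example can be sketched in Python as follows. Only the test on the 2k-ranges is shown; the Shifting Rule and its preprocessing table are not implemented. The function name `count_bad_characters` is illustrative, and window starts are 1-based as in the slides.

```python
def count_bad_characters(text: str, pattern: str, start: int, k: int) -> int:
    """Scan the window T(start, start+m-1) from right to left and count the
    characters that do not appear in their 2k-range of P; stop at k+1."""
    m = len(pattern)
    bad = 0
    for p in range(m, 0, -1):                  # p: 1-based position in the window
        ch = text[start + p - 2]               # text character aligned with p
        lo, hi = max(1, p - k), min(m, p + k)  # 2k-range P(lo, hi), clipped to P
        if ch not in pattern[lo - 1:hi]:
            bad += 1
            if bad > k:                        # k+1 bad characters: no match here
                break
    return bad


T = "GCATCGCAGAGAGTATGCAGAGCG"
P = "GCAGAGAG"
for s in (1, 2, 4, 6):
    print(s, count_bad_characters(T, P, s, 1))
```

A count above k means the window can be shifted without running DP2; a count of at most k, as for the window starting at location 6, triggers DP2 on the examination window of size m+2k.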

101 Summary
1. DP1 is used for Problem 1 and DP2 is used for Problem 2.
2. For both problems, algorithms try to avoid exhaustive search.
3. Most algorithms use checking windows to determine which region needs to be examined.
4. The NB Algorithm in [NB99] and the Pan Algorithm do not use checking windows to determine regions which need to be examined.

102 Summary
5. During the examination phase, when DP1 or DP2 is used, there is another window which may be called the examination window.
6. For Problem 1, where DP1 is used, the examination window size is m+k for all algorithms.
7. For Problem 2, where DP2 is used, the examination window size is m+2k for all algorithms.

103 The examination window size is m+2k when the solution may start and end as shown below:

104 Thank You