Rules for Approximate String Matching
R.C.T. Lee
Rule 1
Consider two substrings A_1 and A_2, where A_1 consists of a prefix P_1 followed by a suffix S_1, and A_2 consists of a prefix P_2 followed by a suffix S_2. Let ed denote the edit distance. If ed(A_1, A_2) ≤ k and S_1 = S_2, then ed(P_1, P_2) ≤ k.
Rule 1: [AKLLLR2000], [H2005], [HHLS2006], [JB2000], [LV89], [NB99], [NB2000], [S80], [TU93] and [WM92].
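
Rule 1 can be checked on small inputs with a textbook dynamic-programming edit distance. The following is a minimal sketch, not taken from the cited papers; the function names `edit_distance` and `strip_common_suffix` are illustrative assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))          # distance from "" to b[:j]
    for i, ca in enumerate(a, start=1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def strip_common_suffix(a1: str, a2: str):
    """Return (P1, P2): a1 and a2 with their longest common suffix removed."""
    n = 0
    while n < len(a1) and n < len(a2) and a1[-1 - n] == a2[-1 - n]:
        n += 1
    return a1[:len(a1) - n], a2[:len(a2) - n]


a1, a2, k = "kitten_xyz", "sitting_xyz", 3
p1, p2 = strip_common_suffix(a1, a2)
if edit_distance(a1, a2) <= k:
    # Rule 1: the prefixes must then also be within distance k.
    assert edit_distance(p1, p2) <= k
```

In practice the rule lets a matcher discard an equal suffix before running the quadratic computation on the shorter prefixes.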
Rule 2
Let m be the length of B. If ed(A, B) ≤ k, then the length of A must be between m-k and m+k.
Rule 2: [FN2004], [NB99], [NB2000] and [TU93].
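
Rule 2 is commonly used as a cheap pre-filter before any distance computation. A minimal sketch follows, where the helper name `passes_length_filter` is an assumption made for illustration.

```python
def passes_length_filter(a: str, m: int, k: int) -> bool:
    """Rule 2: only strings whose length lies in [m-k, m+k] can be within k edits of B."""
    return m - k <= len(a) <= m + k


# Keep only the candidates that could still match a pattern of length m = 4 with k = 1.
candidates = ["cat", "cart", "carpets", "c"]
survivors = [w for w in candidates if passes_length_filter(w, m=4, k=1)]
```

Strings rejected here certainly have edit distance larger than k; strings that pass still need to be verified.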
Rule 3
If a string S contains S_1 completely and the distance between S_1 and every substring of P is larger than k, then ed(S, P) > k.
Rule 3: [ALP2004].
Rule 4
For any substring S_1 in T, if there exists a substring S_2 in P to the left of S_1 such that ed(S_1, S_2) ≤ k and S_2 is the rightmost such substring, then move P to align S_1 and S_2.
[Figure: T with S_1 marked, and P with S_2 marked before and after the move.]
Rule 4: [ALP2004].
Based upon Rule 3 and Rule 2, we have
Rule 5
If the window size is m-k and there exists a substring S_1 in the window such that the distance between S_1 and every substring of P is larger than k, then we can safely move P as follows:
[Figure: T with S_1 inside a window of size m-k, and P before and after the move.]
If Rule 5 is not satisfied, it means the following: for every substring S_1 in the window of T, there exists a substring S_2 in P such that ed(S_1, S_2) ≤ k.
Rule 5-1
If Rule 5 is not satisfied, we can only move P by 1 step as follows:
[Figure: T with S_1 inside a window of size m-k, and P before and after the one-step move.]
Rule 5: [HN2005].
Rule 6
For strings A and B of equal length, Hamming Distance(A, B) ≥ Edit Distance(A, B).
Rule 6: [AKLLLR2000], [FN2004] and [TU93].
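
Because the Hamming distance is an upper bound on the edit distance, a linear-time Hamming check can sometimes confirm a match without running the quadratic edit distance computation. The sketch below illustrates this one-sided use; `quick_accept` is a hypothetical helper name, not something defined in the cited papers.

```python
from typing import Optional


def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def quick_accept(a: str, b: str, k: int) -> Optional[bool]:
    """Return True if Rule 6 already proves ed(a, b) <= k, else None (inconclusive)."""
    if len(a) == len(b) and hamming_distance(a, b) <= k:
        return True
    return None


print(quick_accept("karolin", "kathrin", 3))   # True: 3 mismatches, so ed <= 3
print(quick_accept("abcdef", "bcdefa", 2))     # None: 6 mismatches prove nothing
```

The second call shows why the test is only sufficient: the two strings differ in every position, yet one deletion and one insertion turn one into the other.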
Rule 7
For strings A and B, if there are k+1 characters in A which do not appear in B, then ed(A, B) > k.
Rule 7-1
Let A and B be two strings. Let there be k+1 characters a_1, a_2, …, a_{k+1} in A, where a_i is aligned with b_i in B. If no a_i appears in B[i-k, i+k], then ed(A, B) > k.
Rule 7: [TU93].
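
Both forms of Rule 7 can be turned into rejection tests. The following sketch counts the positions of A whose character cannot be matched; the function names are assumptions, indices are 0-based, and windows are clipped at the ends of B, which is an implementation choice rather than something stated in the rule.

```python
def rule7_rejects(a: str, b: str, k: int) -> bool:
    """Rule 7: reject if more than k characters of a do not occur anywhere in b."""
    chars_of_b = set(b)
    missing = sum(1 for ch in a if ch not in chars_of_b)
    return missing > k


def rule7_1_rejects(a: str, b: str, k: int) -> bool:
    """Rule 7-1: reject if more than k characters a[i] do not occur in b[i-k .. i+k]."""
    bad = 0
    for i, ch in enumerate(a):
        lo = max(0, i - k)                 # clip the window at the left end of b
        hi = min(len(b), i + k + 1)        # slice end is exclusive
        if ch not in b[lo:hi]:
            bad += 1
    return bad > k


# "abcd" and "dcba" share every character, so Rule 7 is inconclusive,
# but the positional version with k = 1 rejects the pair.
print(rule7_rejects("abcd", "dcba", 1))     # False
print(rule7_1_rejects("abcd", "dcba", 1))   # True
```

A rejection means ed(A, B) > k for certain; a non-rejection means the pair must still be verified.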
Rule 8
Let A and B be two strings. Let B be divided into j pieces B_1, B_2, …, B_j. If ed(A, B) ≤ k, then there is at least one substring A_i of A such that ed(A_i, B_i) ≤ ⌊k/j⌋.
Rule 8-1
Let A and B be two strings. Let B be divided into j pieces B_1, B_2, …, B_j. If for every B_i and every substring S of A, ed(S, B_i) > ⌊k/j⌋, then ed(A, B) > k.
Rule 8-2
Let A and B be two strings. Let the lengths of A and B be m+k and m respectively. Let B be divided into j pieces B_1, B_2, …, B_j. Let AP be a prefix of A. If for every B_i and every substring S of A, ed(S, B_i) > ⌊k/j⌋, then ed(AP, B) > k.
Rule 8: [NB99] and [NB2000].
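
Rule 8-1 can serve as a partition filter. The sketch below assumes that the per-piece threshold is ⌊k/j⌋, the bound of the partitioning lemma in [NB99], and that B is cut into j pieces of nearly equal length; both the splitting scheme and the function names are illustrative assumptions. The smallest distance between a piece and any substring of A is computed with a Sellers-style dynamic programme in which a match may start at any position of A.

```python
def best_substring_distance(text: str, piece: str) -> int:
    """Minimum of ed(S, piece) over all substrings S of text."""
    col = list(range(len(piece) + 1))   # column for the empty prefix of text
    best = col[-1]                      # the empty substring costs len(piece)
    for ch in text:
        new = [0]                       # a match may start anywhere in text
        for i, pc in enumerate(piece, start=1):
            new.append(min(col[i] + 1,                # skip ch in text
                           new[i - 1] + 1,            # skip pc in piece
                           col[i - 1] + (pc != ch)))  # match or substitute
        col = new
        best = min(best, col[-1])
    return best


def split_into_pieces(b: str, j: int) -> list:
    """Cut b into j consecutive pieces whose lengths differ by at most one."""
    base, extra = divmod(len(b), j)
    pieces, start = [], 0
    for idx in range(j):
        end = start + base + (1 if idx < extra else 0)
        pieces.append(b[start:end])
        start = end
    return pieces


def rule8_1_rejects(a: str, b: str, k: int, j: int) -> bool:
    """True if no piece of b occurs in a with at most k // j errors."""
    limit = k // j
    return all(best_substring_distance(a, p) > limit
               for p in split_into_pieces(b, j))


print(rule8_1_rejects("abxdefxh", "abcdefgh", k=1, j=2))   # True
print(rule8_1_rejects("xxabcdxx", "abcdefgh", k=1, j=2))   # False
```

With k = 1 and j = 2 the threshold is 0, so the filter reduces to exact search for the pieces; the first text contains neither "abcd" nor "efgh" and is rejected without computing ed(A, B).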
Rule 9
Let A and B be two strings with lengths m+k and m respectively. Let A' be the prefix of A with length m-k. Let there be j distinct characters a_1, a_2, …, a_j in A'. Let the number of times that a_i appears in A' and B be N(A', a_i) and N(B, a_i) respectively. Let C_i = N(A', a_i) - N(B, a_i). Let AP be any prefix of A. If [...], then ed(AP, B) > k.
Rule 9-1
Let A and B be two strings with lengths m+k and m respectively. Let there be j distinct characters a_1, a_2, …, a_j in A. Let the number of times that a_i appears in A and B be N(A, a_i) and N(B, a_i) respectively. Let C_i = N(B, a_i) - N(A, a_i). Let AP be any prefix of A. If [...], then ed(AP, B) > k.
Rule 10
Let P and T be two strings with lengths m and n respectively. If P matches a substring P' of T at position i, then any substring S of T[i-k, i+m+k], a region of length m+2k, may satisfy ed(S, P) ≤ k.
[Figure: T with P' at position i and the region from i-k to i+m+k, of length m+2k, marked.]
Rule 10: [NB99].
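
Rule 10 bounds the region of T that has to be verified once a candidate position i is known. A minimal sketch of the index arithmetic follows, assuming 0-based positions, a half-open Python slice, and clipping at the ends of T; the helper name `verification_window` is hypothetical.

```python
def verification_window(n: int, m: int, k: int, i: int):
    """Return (start, end) so that T[start:end] covers T[i-k, i+m+k], clipped to T."""
    start = max(0, i - k)
    end = min(n, i + m + k)
    return start, end


# A candidate at i = 10 for a pattern of length m = 5 with k = 2 errors
# yields a region of length m + 2k = 9.
start, end = verification_window(n=100, m=5, k=2, i=10)
print(start, end, end - start)   # 8 17 9
```

Only this region needs to be handed to an edit distance routine, rather than the whole text.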
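Rule 11
Let P and Q be two strings. Let P be divided into pieces P_1, P_2, …, P_n. Let Q_i be the substring of Q for which ed(P_i, Q_i) is the smallest.
[Figure: P divided into P_1, P_2, …, P_n, with the corresponding substrings Q_1, Q_2, …, Q_n of Q.]
If [...]
Application of Rule 11
[Figure: a window W of T containing substrings t_1, t_2, …, t_n, with P divided into P_1, P_2, …, P_n.]
Let t_i be the substring for which ed(t_i, P_i) is the smallest. If for some n, [...]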
[AKLLLR2000] Text Indexing and Dictionary Matching with One Error, Amir, A., Keselman, D., Landau, G. M., Lewenstein, M., Lewenstein, N. and Rodeh, M., Journal of Algorithms, Vol. 37, 2000, pp
[ALP2004] Faster Algorithms for String Matching with k Mismatches, Amir, A., Lewenstein, M. and Porat, E., Journal of Algorithms, Vol. 50, 2004, pp
[FN2004] Average-Optimal Multiple Approximate String Matching, Fredriksson, K. and Navarro, G., ACM Journal of Experimental Algorithmics, Vol. 9, Article No. 1.4, 2004, pp
[GG86] Improved String Matching with k Mismatches, Galil, Z. and Giancarlo, R., SIGACT News, Vol. 17, No. 4, 1986, pp
[H2005] Bit-parallel Approximate String Matching Algorithms with Transposition, Hyyrö, H., Journal of Discrete Algorithms, Vol. 3, 2005, pp
[HHLS2006] Approximate String Matching Using Compressed Suffix Arrays, Huynh, T. N. D., Hon, W. K., Lam, T. W. and Sung, W. K., Theoretical Computer Science, Vol. 352, 2006, pp
[HN2005] Bit-parallel Witnesses and their Applications to Approximate String Matching, Hyyrö, H. and Navarro, G., Algorithmica, Vol. 4, No. 3, 2005, pp
[JB2000] Approximate String Matching Using Factor Automata, Holub, J. and Melichar, B., Theoretical Computer Science, Vol. 249, 2000, pp
[LV86] String Matching with k Mismatches by Using Kangaroo Method, Landau, G. M. and Vishkin, U., Theoretical Computer Science, Vol. 43, 1986, pp
[LV89] Fast Parallel and Serial Approximate String Matching, Landau, G. and Vishkin, U., Journal of Algorithms, Vol. 10, 1989, pp
[NB99] Very Fast and Simple Approximate String Matching, Navarro, G. and Baeza-Yates, R., Information Processing Letters, Vol. 72, 1999, pp
[NB2000] A Hybrid Indexing Method for Approximate String Matching, Navarro, G. and Baeza-Yates, R., Vol. 1, No. 1, 2000, pp
[S80] String Matching with Errors, Sellers, P. H., Journal of Algorithms, Vol. 20, No. 1, 1980, pp
[TU93] Approximate Boyer-Moore String Matching, Tarhio, J. and Ukkonen, E., SIAM Journal on Computing, Vol. 22, No. 2, 1993, pp
[WM92] Fast Text Searching: Allowing Errors, Wu, S. and Manber, U., Communications of the ACM, Vol. 35, 1992, pp