1 The Colussi Algorithm. Advisor: Prof. R. C. T. Lee. Speaker: Y. L. Chen. Source: Colussi, L., "Correctness and Efficiency of Pattern Matching Algorithms", Information and Computation, Vol. 95, 1991, pp.
2 The main principle of the Colussi Algorithm: 1. We point out that there are positions where a large number of jumps is allowed. 2. We first process the positions where only a small number of jumps is allowed. It is obviously safe to do so. Besides, we may look into the future this way.
3 The Colussi Algorithm is a modification of the KMP Algorithm. In the KMP Algorithm, we always construct the KMP function. For instance, for the case of ATCATCATCA, the KMP function is as follows :
4 Condition for KMP[i] = -1. Condition A: p_0 = p_i. Condition B: p(0, j) is a suffix of p(0, i-1). Condition C: p_(j+1) = p_i. KMP[i] = -1 when Condition A holds and Condition C holds for every j that satisfies Condition B.
5 There is no suffix of p(0, 3) which is equal to a prefix of p(0, 3). p_0 = p_4 (Condition A). KMP[4] = -1 because it satisfies the condition. [Figure: the pattern P with positions 0 to 16 and the current position i.]
6 There are two suffixes of p(0, 14) which are equal to a prefix of p(0, 14): p(0, 1) = p(13, 14) and p(0, 5) = p(9, 14). For p(0, 5), we have p_6 = p_15; for p(0, 1), we have p_2 = p_15 (Conditions B and C). p_0 = p_15 (Condition A). KMP[15] = -1 because it satisfies the condition. [Figure: the pattern P with positions 0 to 16 and the current position i.]
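The conditions above can be turned into a small program. The following is a minimal sketch in Python, assuming 0-based positions and the convention that Kmp[i] is the length of the longest prefix of p that is also a suffix of p(0, i-1) and is followed by a character different from p_i, with -1 when no such prefix exists. The function name kmp_table and the helper array border are hypothetical names chosen for this illustration.

```python
def kmp_table(p):
    """Return the Kmp function of pattern p; -1 marks the positions
    where Conditions A, B and C leave no candidate."""
    m = len(p)
    # border[i] = length of the longest proper prefix of p[:i] that is also
    # a suffix of p[:i]; border[0] = -1 is a sentinel.
    border = [-1] + [0] * m
    k = -1
    for i in range(m):
        while k >= 0 and p[k] != p[i]:
            k = border[k]
        k += 1
        border[i + 1] = k

    kmp = []
    for i in range(m):
        k = border[i]
        # Walk from the longest candidate to shorter ones while the character
        # after the prefix equals p[i] (Condition C, and finally Condition A).
        while k >= 0 and p[k] == p[i]:
            k = border[k]
        kmp.append(k)
    return kmp


print(kmp_table("ATCATATCA"))
```

For the pattern ATCATATCA this sketch reports -1 at positions 0, 3 and 8.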
7 First, construct the preprocessing tables. They contain the Kmp, Kmin, Rmin and Shift functions. Second, the set of pattern positions is divided into two disjoint subsets, the noholes and the holes. Then each attempt consists of two phases. In the first phase the comparisons are performed from left to right with the text characters aligned with the pattern positions for which the value of the Kmp function is strictly greater than -1. These positions are called noholes. If all noholes match, we go to the second phase. If a mismatch happens in the first phase, we shift the pattern according to the Shift function. The second phase consists of comparing the remaining positions (called holes) from right to left. If a mismatch happens in the second phase, we again shift the pattern according to the Shift function.
8 Consider any location i where Kmp[i] = -1. If a mismatch occurs at this point, the KMP Algorithm shifts i - (-1) = i + 1 steps. If Kmp[i] = j with j ≠ -1, then j must be larger than -1, and the number of steps moved is i - j < i + 1.
9 If we ignore the locations i where Kmp[i] = -1, it is safe, because we then move by a smaller number of steps.
10 Ex: The pattern is "ATCATATCA". The Colussi algorithm uses three other preprocessing functions, namely Kmin, Rmin and Shift. Let us first recall the Kmp function of this pattern.
11 The Kmin function: if i is a nohole, we set Kmin[i] = i - Kmp[i] (the number of jumps in the KMP Algorithm). For example, Kmin[1] = 1 - 0 = 1.
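The Kmin definition can be checked with a few lines of Python. This is only a sketch: the list kmp below holds the Kmp values of ATCATATCA computed by hand for this illustration, and the use of None for holes (where Kmin is not set) is an assumption of the sketch, as is the function name kmin_table.

```python
def kmin_table(kmp):
    """Kmin[i] = i - Kmp[i] for noholes; holes (Kmp[i] == -1) are left as None."""
    return [i - k if k != -1 else None for i, k in enumerate(kmp)]


# Kmp values of the pattern "ATCATATCA", positions 0 to 8 (computed by hand).
kmp = [-1, 0, 0, -1, 0, 2, 0, 0, -1]
kmin = kmin_table(kmp)
noholes = [i for i, k in enumerate(kmp) if k != -1]

print(kmin)
print(noholes)
```

Position 1 gives Kmin[1] = 1 - 0 = 1, which agrees with the value on the slide.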
13 Definition of Period: an integer k is a period of a pattern p of length m if for any i, 0 <= i < m - k, p_i = p_(i+k). In other words, p(k, i-1) = p(0, i-k-1). According to the above definition, a given pattern p may have many periods. For instance, for the case of ATCATATCA, there are three periods, namely 5, 8, and 9. We can verify that p_(i+5) = p_i for i = 0 to 3. Note that the length of a pattern is trivially a period of it.
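The definition of a period translates directly into code. The sketch below, in Python, simply tests every candidate k against the definition; the function name periods is a hypothetical name, and the quadratic running time is accepted here for the sake of clarity.

```python
def periods(p):
    """Return every k in 1..m such that p[i] == p[i + k] for all 0 <= i < m - k."""
    m = len(p)
    return [k for k in range(1, m + 1)
            if all(p[i] == p[i + k] for i in range(m - k))]


print(periods("ATCATATCA"))
```

The last element of the result is always the pattern length, which reflects the remark that the length is trivially a period.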
14 The Rmin function: if i is a hole, Rmin[i] is the smallest period of p greater than i. (This is the number of jumps for a hole, under the condition that we have already matched all characters after i.) Rmin means that we can look into the future in the Colussi Algorithm. The periods of the pattern are 5, 8 and 9, so we set Rmin[0] = 5.
15 The Rmin function (continued): position 3 is a hole, and the smallest period greater than 3 is 5, so we set Rmin[3] = 5.
16 The Rmin function (continued): position 8 is a hole, and the smallest period greater than 8 is 9, so we set Rmin[8] = 9.
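The three Rmin values above can be reproduced with a short Python sketch. It assumes that the holes of ATCATATCA are the positions 0, 3 and 8 (the positions where Kmp is -1) and recomputes the periods by brute force; the function name rmin_table and the dictionary result are choices made for this illustration only.

```python
def rmin_table(p, holes):
    """For every hole i, return the smallest period of p that is greater than i."""
    m = len(p)
    all_periods = [k for k in range(1, m + 1)
                   if all(p[i] == p[i + k] for i in range(m - k))]
    # m itself is always a period, so the minimum below always exists.
    return {i: min(k for k in all_periods if k > i) for i in holes}


print(rmin_table("ATCATATCA", holes=[0, 3, 8]))
```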
17 The shift function: if (Kmp[i] == -1) shift[i] = Rmin[i]; else shift[i] = Kmin[i];
18 The shift function: for example, Kmp[0] = -1, so shift[0] = Rmin[0] and we can set shift[0] = 5.
19 The shift function
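The rule for the shift function can be sketched as follows in Python. The three input tables are written out as literals for the pattern ATCATATCA; they were computed by hand for this sketch and are assumptions of the example. The function name shift_table is hypothetical.

```python
def shift_table(kmp, kmin, rmin):
    """shift[i] = Rmin[i] for holes (Kmp[i] == -1) and Kmin[i] for noholes."""
    return [rmin[i] if kmp[i] == -1 else kmin[i] for i in range(len(kmp))]


# Tables for the pattern "ATCATATCA" (positions 0 to 8), computed by hand.
kmp = [-1, 0, 0, -1, 0, 2, 0, 0, -1]
kmin = {1: 1, 2: 2, 4: 4, 5: 3, 6: 6, 7: 7}   # defined for noholes only
rmin = {0: 5, 3: 5, 8: 9}                      # defined for holes only

shift = shift_table(kmp, kmin, rmin)
print(shift)
```

The values shift[0] = 5, shift[2] = 2 and shift[3] = 5 are the ones used later in the worked example.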
20 We give two kinds of examples where Kmp[i] = -1 to explain Rmin[i]. The condition is satisfied in this case. [Figure: the pattern P with positions 0 to 16 and the current position i; labels: Prefix, Suffix, Already matched.] If a mismatch occurs at p_4, we jump 4 steps for the MP algorithm, we jump 5 steps for the KMP algorithm, and we jump 9 steps for the Colussi algorithm because Rmin[i] = 9. But we must understand that, for the Colussi Algorithm, all points after p_4 have already been matched. Then we can look into the future.
21 We give two kinds of examples where Kmp[i] = -1 to explain Rmin[i]. The condition is satisfied in this case. [Figure: the pattern P with positions 0 to 16 and the current position i.] If a mismatch occurs at p_15, we jump 15 steps for the MP algorithm, we jump 16 steps for the KMP algorithm, and we jump 17 steps for the Colussi algorithm because Rmin[i] = 17.
22 The Colussi Algorithm uses the Rmin function. Actually, it is using the suffix to prefix rule implicitly. We shall explain this point in the following slides.
23 Note that Rmin is used when all of the locations where Kmp[i] ≠ -1 have been processed and have been found matched. For a location i where Kmp[i] = -1, we know that the KMP Algorithm may jump i + 1 steps. But for the Colussi algorithm we use Rmin[i], and Rmin[i] is always larger than i, so it is at least i + 1. Why?
24 Note that Rmin[i] is defined as the smallest period of p which is larger than i. Case 1: Rmin[i] is equal to the length of p. In this case, we know that no nonempty suffix of the matched part p(i+1, m-1) is equal to a prefix of p. Case 2: Rmin[i] is smaller than the length of p. In this case, there is a suffix of the matched part which is equal to a prefix of p.
25 Furthermore, Rmin is used when we scan from right to left. That is, all locations after location i have already been matched. Therefore, we may use the suffix to prefix rule now.
26 Rule 1: The Suffix to Prefix Rule. For a window to have any chance to match a pattern, in some way, there must be a suffix of the window which is equal to a prefix of the pattern. [Figure: text T and pattern P.]
27 The Implication of Rule 1: Find the longest suffix U of the window which is equal to some prefix of P. Skip the pattern as follows:
28 Example: T = GCATCGACAGACTATACAGTACG, P = GACGGAGAC. Because the longest suffix of the window which is equal to a prefix of P is "GAC" = p(1, 3), we slide the window by 6. [Figure: T and P before and after the slide.]
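The suffix to prefix rule in this example can be reproduced with a small Python sketch. The slide does not state where the window sits in T; the sketch assumes the window is T[3:12] = TCGACAGAC (0-based), which is the only length-9 window of T that ends in GAC. The function name suffix_prefix_shift is hypothetical.

```python
def suffix_prefix_shift(window, pattern):
    """Return (k, shift): k is the length of the longest proper prefix of
    pattern that is also a suffix of window, and shift = len(pattern) - k."""
    m = len(pattern)
    for k in range(m - 1, 0, -1):
        if window.endswith(pattern[:k]):
            return k, m - k
    return 0, m  # no suffix of the window is a prefix: skip the whole window


T = "GCATCGACAGACTATACAGTACG"
P = "GACGGAGAC"
window = T[3:12]  # assumed alignment of the window, see the note above
print(window, suffix_prefix_shift(window, P))
```

Under that assumption the longest such suffix has length 3 (GAC), so the window slides by 9 - 3 = 6.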
29 Example First attempt (first phase): Text : ATATCCTATCATATCA Pattern : ATCATATCA. If a mismatch occurs in the first phase, we shift the pattern by shift[i]. If all noholes match, we run the second phase. The first nohole (position 1) matches.
30 Example First attempt (first phase): Text : ATATCCTATCATATCA Pattern : ATCATATCA. The nohole at position 2 mismatches (pattern C against text A).
31 Example First attempt: Text : ATATCCTATCATATCA Pattern : ATCATATCA. Since shift[2] = 2, the pattern is shifted by 2 positions.
32 Example Second attempt (first phase): Text : ATATCCTATCATATCA Pattern : ATCATATCA, with the pattern aligned at text position 2 (0-based). The noholes at positions 1, 2, 4, 5, 6 and 7 are compared from left to right, and all six match.
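38 Example Second attempt (second phase): Text : ATATCCTATCATATCA Pattern : ATCATATCA. If a mismatch occurs in the second phase, we shift the pattern by shift[i]. If all holes match as well, an occurrence is found and we shift by shift[0]. The hole at position 8 matches.
39 Example Second attempt (second phase): Text : ATATCCTATCATATCA Pattern : ATCATATCA. The hole at position 3 mismatches (pattern A against text C).
40 Example Second attempt: Text : ATATCCTATCATATCA Pattern : ATCATATCA. Since shift[3] = 5, the pattern is shifted by 5 positions. The prefix ATCA of the pattern is then already known to match the text.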
41 Example Third attempt: Text : ATATCCTATCATATCA Pattern : ATCATATCA, with the pattern aligned at text position 7 (0-based). The prefix ATCA is already known to match, so only the noholes at positions 4, 5, 6 and 7 and the hole at position 8 are compared, and all five match.
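46 Example Third attempt: Text : ATATCCTATCATATCA Pattern : ATCATATCA. All positions match, so an occurrence is found. We then shift by shift[0] = 5, after which the prefix ATCA of the pattern is again known to match.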
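The three attempts of this example can be replayed with the following Python sketch. It is a simplified version of the searching phase and not the full Colussi algorithm: it uses the two phases and the shift function, but it does not remember which characters are already known to match after a shift, so it repeats some comparisons that the real algorithm avoids. The function names colussi_tables and colussi_search are hypothetical, and the pattern is assumed to be nonempty.

```python
def colussi_tables(p):
    """Return (order, shift): the comparison order (noholes left to right,
    then holes right to left) and the shift function of pattern p."""
    m = len(p)
    border = [-1] + [0] * m          # border[i] = longest border of p[:i]
    k = -1
    for i in range(m):
        while k >= 0 and p[k] != p[i]:
            k = border[k]
        k += 1
        border[i + 1] = k
    kmp = []
    for i in range(m):
        k = border[i]
        while k >= 0 and p[k] == p[i]:
            k = border[k]
        kmp.append(k)                # -1 marks a hole
    periods = [d for d in range(1, m + 1)
               if all(p[i] == p[i + d] for i in range(m - d))]
    shift = [min(d for d in periods if d > i) if kmp[i] == -1 else i - kmp[i]
             for i in range(m)]
    noholes = [i for i in range(m) if kmp[i] != -1]
    holes = [i for i in range(m - 1, -1, -1) if kmp[i] == -1]
    return noholes + holes, shift


def colussi_search(text, p):
    """Return (occurrences, attempts) as lists of 0-based text positions."""
    if not p:
        raise ValueError("pattern must be nonempty")
    order, shift = colussi_tables(p)
    n, m = len(text), len(p)
    found, attempts = [], []
    j = 0
    while j + m <= n:
        attempts.append(j)
        for i in order:              # phase 1 (noholes), then phase 2 (holes)
            if text[j + i] != p[i]:
                j += shift[i]
                break
        else:                        # no mismatch: an occurrence at j
            found.append(j)
            j += shift[0]
    return found, attempts


print(colussi_search("ATATCCTATCATATCA", "ATCATATCA"))
```

The attempts start at text positions 0, 2 and 7, which correspond to the shifts by shift[2] = 2 and shift[3] = 5 shown on the slides.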
47 Colussi Algorithm: time complexity. The preprocessing phase can be done in O(m) space and time. The searching phase can then be done in O(n) time, and at most (3/2)n text character comparisons are performed during the searching phase.
48 References
1. [B92] BRESLAUER, D., Efficient String Algorithmics, Ph.D. Thesis, Report CU, Computer Science Department, Columbia University, New York, NY.
2. [C91] COLUSSI, L., Correctness and efficiency of the pattern matching algorithms, Information and Computation 95(2), 1991, pp.
3. [CGG90] COLUSSI, L., GALIL, Z., GIANCARLO, R., On the exact complexity of string matching, in Proceedings of the 31st IEEE Annual Symposium on Foundations of Computer Science, 1990, pp.
4. [GG92] GALIL, Z., GIANCARLO, R., On the exact complexity of string matching: upper bounds, SIAM Journal on Computing, Vol. 21, No. 3, 1992, pp.