1 The Colussi Algorithm Advisor: Prof. R. C. T. Lee Speaker: Y. L. Chen Correctness and Efficiency of Pattern Matching Algorithms Information and Computation,

Presentation transcript:

1 The Colussi Algorithm. Advisor: Prof. R. C. T. Lee. Speaker: Y. L. Chen. Source: Colussi, L., Correctness and Efficiency of Pattern Matching Algorithms, Information and Computation, Vol. 95, 1991.

2 The main principle of the Colussi Algorithm: 1. We point out that there are positions where a large number of jumps is allowed. 2. We first process the positions where only a small number of jumps is allowed. It is obviously safe to do so. Besides, we may look into the future this way.

3 The Colussi Algorithm is a modification of the KMP Algorithm. In the KMP Algorithm, we always construct the KMP function. For instance, for the pattern ATCATATCA, the KMP function is as follows:
i       0  1  2  3  4  5  6  7  8
p[i]    A  T  C  A  T  A  T  C  A
KMP[i] -1  0  0 -1  0  2  0  0 -1
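As a minimal sketch of how the KMP function can be computed, the following Python routine uses the standard linear-time construction. The name kmp_next is ours, not from the slides, and the pattern ATCATATCA is the one used in the worked example later in the talk.

```python
def kmp_next(p):
    """Return the KMP function of pattern p as a list of len(p) + 1 values."""
    m = len(p)
    nxt = [-1] * (m + 1)
    i, j = 0, -1
    while i < m:
        # Fall back through shorter borders until p[i] extends one of them.
        while j > -1 and p[i] != p[j]:
            j = nxt[j]
        i += 1
        j += 1
        # If the next characters agree, a mismatch at i would repeat at j,
        # so reuse the value already stored for j.
        if i < m and p[i] == p[j]:
            nxt[i] = nxt[j]
        else:
            nxt[i] = j
    return nxt


print(kmp_next("ATCATATCA")[:9])  # [-1, 0, 0, -1, 0, 2, 0, 0, -1]
```

The positions with value -1 (here 0, 3 and 8) are the ones the Colussi Algorithm treats specially.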

4 Condition for KMP[i] = -1. Condition A: p[0] = p[i]. Condition B: p(0, j) is a suffix of p(0, i-1). Condition C: p[j+1] = p[i]. KMP[i] = -1 when Condition A holds and Condition C holds for every j that satisfies Condition B.

5 There is no suffix of p(0, 3) which is equal to a prefix of p(0, 3), so no j satisfies Condition B. p[0] = p[4] (Condition A). KMP[4] = -1 because it satisfies the condition. The pattern P is:
i     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
p[i]  b  c  b  a  b  c  b  a  e  b  c  b  a  b  c  b  a

6 There are two suffixes of p(0, 14) which are equal to a prefix of p(0, 14): p(0, 1) = p(13, 14) and p(0, 5) = p(9, 14). For p(0, 5), we have p[6] = p[15]; for p(0, 1), we have p[2] = p[15] (Condition C). p[0] = p[15] (Condition A). KMP[15] = -1 because it satisfies the condition. The pattern is the same as on the previous slide, P = bcbabcbaebcbabcba (positions 0 to 16).
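The condition can be checked mechanically. The sketch below tests Conditions A, B and C by brute force; the function name is ours, and the 17-character pattern is an assumption, reconstructed from the index and letter residue of the figure on slides 5 and 6.

```python
def kmp_is_minus_one(p, i):
    """Check the condition for KMP[i] = -1 directly from Conditions A, B, C."""
    # Condition A: p[0] = p[i].
    if p[0] != p[i]:
        return False
    # For every j such that p(0, j) is a proper suffix of p(0, i-1)
    # (Condition B), Condition C requires p[j+1] = p[i].
    for j in range(i - 1):
        if p[:j + 1] == p[i - 1 - j:i] and p[j + 1] != p[i]:
            return False
    return True


# Assumed pattern, reconstructed from the figure on slides 5 and 6.
P = "bcbabcbaebcbabcba"
print(kmp_is_minus_one(P, 4), kmp_is_minus_one(P, 15))  # True True
print([i for i in range(9) if kmp_is_minus_one("ATCATATCA", i)])  # [0, 3, 8]
```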

7 First, construct the preprocessing tables: the Kmp, Kmin, Rmin and Shift functions. Second, the set of pattern positions is divided into two disjoint subsets, and each attempt consists of two phases. In the first phase, the comparisons are performed from left to right at the pattern positions for which the value of the Kmp function is strictly greater than -1. These positions are called noholes. If a mismatch happens in the first phase, we move the pattern according to the shift function. If all noholes match, we go to the second phase. The second phase consists of comparing the remaining positions (called holes) from right to left. If a mismatch happens in the second phase, we again move the pattern according to the shift function.
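The order of comparisons in one attempt can be sketched as follows. This is an illustration only: kmp_value is a hypothetical brute-force helper that evaluates the Kmp function from its definition rather than with the linear-time construction.

```python
def kmp_value(p, i):
    """Brute-force Kmp[i]: the largest j such that p[:j] is a border of
    p[:i] and p[j] != p[i], or -1 if no such j exists."""
    for j in range(i - 1, -1, -1):
        if p[:j] == p[i - j:i] and p[j] != p[i]:
            return j
    return -1


def comparison_order(p):
    """Return (noholes left to right, holes right to left)."""
    m = len(p)
    kmp = [kmp_value(p, i) for i in range(m)]
    noholes = [i for i in range(m) if kmp[i] > -1]
    holes = [i for i in range(m - 1, -1, -1) if kmp[i] == -1]
    return noholes, holes


print(comparison_order("ATCATATCA"))  # ([1, 2, 4, 5, 6, 7], [8, 3, 0])
```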

8 Consider any location i where Kmp[i] = -1. If a mismatch occurs at this point, the KMP Algorithm shifts i - (-1) = i + 1 steps. If Kmp[i] = j is not -1, j must be larger than -1, and the number of steps moved is i - j < i + 1.

9 If we ignore the locations i where Kmp[i] = -1 and process the other locations first, it is safe, because a mismatch at any other location moves the pattern a smaller number of steps.

10 Ex: The pattern is "ATCATATCA". The Colussi algorithm uses three other preprocessing functions, namely Kmin, Rmin and Shift. Let us first recall the Kmp function: for i = 0, 1, ..., 8, Kmp[i] = -1, 0, 0, -1, 0, 2, 0, 0, -1.

11 The Kmin function. Kmin[i] = i - Kmp[i] (the number of jumps in the KMP Algorithm). Kmin[i] is set only if i is a nohole. For example, Kmin[1] = 1 - Kmp[1] = 1 - (0) = 1.

12 The Kmin function. Kmin[i] = i - Kmp[i] (the number of jumps in the KMP Algorithm). Kmin[i] is set only if i is a nohole.
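A small sketch of the Kmin table for ATCATATCA follows. Storing 0 at the holes is our own placeholder convention, since the slides leave Kmin undefined there, and kmp_value is a hypothetical brute-force helper.

```python
def kmp_value(p, i):
    """Brute-force Kmp[i]: the largest j such that p[:j] is a border of
    p[:i] and p[j] != p[i], or -1 if no such j exists."""
    for j in range(i - 1, -1, -1):
        if p[:j] == p[i - j:i] and p[j] != p[i]:
            return j
    return -1


def kmin_table(p):
    """Kmin[i] = i - Kmp[i] for noholes; 0 is stored as a placeholder for holes."""
    table = []
    for i in range(len(p)):
        k = kmp_value(p, i)
        table.append(i - k if k > -1 else 0)
    return table


print(kmin_table("ATCATATCA"))  # [0, 1, 2, 0, 4, 3, 6, 7, 0]
```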

13 Definition of Period. An integer k is a period of a pattern p of length m if p[i] = p[i+k] for every i with 0 <= i < m - k. In other words, p(k, m-1) = p(0, m-k-1). According to the above definition, a given pattern p may have many periods. For instance, ATCATATCA has three periods, namely 5, 8 and 9. We can verify that p[i+5] = p[i] for i = 0 to 3. Note that the length of a pattern is trivially a period of it.
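The definition translates directly into a brute-force check. The function name periods is ours; the sketch simply tests every candidate k.

```python
def periods(p):
    """Return every k in 1..m with p[i] == p[i + k] for all 0 <= i < m - k."""
    m = len(p)
    return [k for k in range(1, m + 1)
            if all(p[i] == p[i + k] for i in range(m - k))]


print(periods("ATCATATCA"))  # [5, 8, 9]
```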

14 The Rmin function. If i is a hole, Rmin[i] is the smallest period of p greater than i. (It is the number of jumps for a hole under the condition that we have already matched all characters after i.) Rmin implies that we can look into the future in the Colussi Algorithm. The periods are 5, 8 and 9, so we set Rmin[0] = 5.

15 The Rmin function. If i is a hole, Rmin[i] is the smallest period of p greater than i. The periods are 5, 8 and 9, so we set Rmin[3] = 5.

16 The Rmin function. If i is a hole, Rmin[i] is the smallest period of p greater than i. The periods are 5, 8 and 9, so we set Rmin[8] = 9.
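The three Rmin values above can be reproduced with a short sketch. The helper names are ours, and both the Kmp values and the periods are computed by brute force from their definitions.

```python
def kmp_value(p, i):
    """Brute-force Kmp[i]: the largest j such that p[:j] is a border of
    p[:i] and p[j] != p[i], or -1 if no such j exists."""
    for j in range(i - 1, -1, -1):
        if p[:j] == p[i - j:i] and p[j] != p[i]:
            return j
    return -1


def rmin_table(p):
    """Map each hole i to the smallest period of p greater than i."""
    m = len(p)
    # k is a period exactly when the suffix from k equals the prefix of length m - k.
    all_periods = [k for k in range(1, m + 1) if p[k:] == p[:m - k]]
    return {i: min(k for k in all_periods if k > i)
            for i in range(m) if kmp_value(p, i) == -1}


print(rmin_table("ATCATATCA"))  # {0: 5, 3: 5, 8: 9}
```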

17 The shift function  If (Kmp[i] = -1) shift[i] = Rmin[i] ; else shift[i] = Kmin[i] ;

18 The shift function. Kmp[0] = -1, so shift[0] = Rmin[0]. Then we can set shift[0] = 5.

19 The shift function
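Putting the rule of slide 17 together, the whole shift table for ATCATATCA can be sketched as below. The helper names are ours, and the tables are computed by brute force from the definitions rather than in O(m) time.

```python
def kmp_value(p, i):
    """Brute-force Kmp[i]: the largest j such that p[:j] is a border of
    p[:i] and p[j] != p[i], or -1 if no such j exists."""
    for j in range(i - 1, -1, -1):
        if p[:j] == p[i - j:i] and p[j] != p[i]:
            return j
    return -1


def shift_table(p):
    """shift[i] = Rmin[i] if Kmp[i] = -1, else Kmin[i] = i - Kmp[i]."""
    m = len(p)
    all_periods = [k for k in range(1, m + 1) if p[k:] == p[:m - k]]
    shift = []
    for i in range(m):
        k = kmp_value(p, i)
        if k == -1:
            shift.append(min(q for q in all_periods if q > i))  # Rmin[i]
        else:
            shift.append(i - k)                                  # Kmin[i]
    return shift


print(shift_table("ATCATATCA"))  # [5, 1, 2, 5, 4, 3, 6, 7, 9]
```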

20 We give two kinds of examples where Kmp[i] = -1 to explain Rmin[i]. The condition is satisfied in this case. The pattern is P = bcbabcbaebcbabcba (positions 0 to 16). If a mismatch occurs at p[4], we jump 4 steps for the MP algorithm, we jump 5 steps for the KMP algorithm, and we jump 9 steps for the Colussi algorithm because Rmin[4] = 9. But we must understand that for the Colussi Algorithm, all points after p[4] have already been matched. Then we can look into the future.

21 We give two kinds of examples where Kmp[i] = -1 to explain Rmin[i]. The condition is satisfied in this case. The pattern is P = bcbabcbaebcbabcba (positions 0 to 16). If a mismatch occurs at p[15], we jump 9 steps for the MP algorithm (15 minus 6, the length of the longest suffix of p(0, 14) which is equal to a prefix), we jump 16 steps for the KMP algorithm, and we jump 17 steps for the Colussi algorithm because Rmin[15] = 17.
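The KMP and Colussi jumps quoted for the two examples can be recomputed with the sketch below. The pattern string is an assumption, reconstructed from the residue of the figure, and the helper names are ours.

```python
def kmp_value(p, i):
    """Brute-force Kmp[i]: the largest j such that p[:j] is a border of
    p[:i] and p[j] != p[i], or -1 if no such j exists."""
    for j in range(i - 1, -1, -1):
        if p[:j] == p[i - j:i] and p[j] != p[i]:
            return j
    return -1


def smallest_period_above(p, i):
    """Rmin[i]: the smallest period of p that is greater than i."""
    m = len(p)
    return min(k for k in range(i + 1, m + 1) if p[k:] == p[:m - k])


# Assumed pattern, reconstructed from the figure on slides 20 and 21.
P = "bcbabcbaebcbabcba"
for i in (4, 15):
    # position, KMP jump i - Kmp[i], Colussi jump Rmin[i]
    print(i, i - kmp_value(P, i), smallest_period_above(P, i))
# 4 5 9
# 15 16 17
```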

22 The Colussi Algorithm uses the Rmin function. Actually, it is using the suffix to prefix rule implicitly. We shall explain this point in the following slides.

23 Note that Rmin is used when all of the locations where Kmp[i] > -1 have been processed and have been found matched. For a location i where Kmp[i] = -1, we know that we may jump i + 1 steps. But, for the Colussi algorithm, we use Rmin, and Rmin[i] is always larger than i, so it is never smaller than i + 1. Why?

24 Note that Rmin[i] is defined as the smallest period of p which is larger than i. Let m be the length of p. Case 1: Rmin[i] is equal to m. In this case, we know that no non-empty suffix of p shorter than m - i is equal to a prefix of p. Case 2: Rmin[i] is smaller than m. In this case, there is a suffix of p which is equal to a prefix of p, namely the one of length m - Rmin[i].

25  Furthermore, Rmin is used when we scan from right to left. That is, all locations after location i have already been matched. Therefore, we may use the suffix to prefix rule now.

26 Rule 1: The Suffix to Prefix Rule. For a window to have any chance to match a pattern, in some way, there must be a suffix of the window which is equal to a prefix of the pattern.

27 The Implication of Rule 1: Find the longest suffix U of the window which is equal to some prefix of P. Then move the pattern so that this prefix of P is aligned with U.

28 Example. T = GCATCGACAGACTATACAGTACG, P = GACGGAGAC. Since the longest suffix of the window which is equal to a prefix of P is "GAC" = p(1, 3), we slide the window by 9 - 3 = 6.
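The rule can be sketched in a few lines of Python. The function name is ours, and the window position is an assumption: the alignment in the figure is lost, and text positions 3 to 11 form the window that ends in "GAC" and gives the slide of 6 quoted above.

```python
def suffix_prefix_shift(window, p):
    """Return (length of the longest suffix of window equal to a prefix of p,
    number of steps to slide the window)."""
    m = len(p)
    for length in range(min(len(window), m), 0, -1):
        if window[-length:] == p[:length]:
            # A length of m means the window equals p, so the slide is 0.
            return length, m - length
    return 0, m


T = "GCATCGACAGACTATACAGTACG"
P = "GACGGAGAC"
window = T[3:3 + len(P)]  # assumed window position: TCGACAGAC
print(window, suffix_prefix_shift(window, P))  # TCGACAGAC (3, 6)
```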

29 Example, first attempt. Text: ATATCCTATCATATCA. Pattern: ATCATATCA, aligned with text position 0. The first nohole is compared: p[1] = T against the text character T at position 1, a match. If a mismatch happens in the first phase, we move according to shift[i]. If all noholes match, we run the second phase.

30 Example, first attempt. The next nohole is compared: p[2] = C against the text character A at position 2, a mismatch.

31 Example, first attempt. The mismatch is at nohole 2, so the pattern moves Shift[2] = 2 steps.

32 Example, second attempt. Text: ATATCCTATCATATCA. Pattern: ATCATATCA, now aligned with text position 2. First phase: p[1] = T against the text character T at position 3, a match.

33 Example, second attempt. p[2] = C against the text character C at position 4, a match.

34 Example, second attempt. p[4] = T against the text character T at position 6, a match.

35 Example, second attempt. p[5] = A against the text character A at position 7, a match.

36 Example, second attempt. p[6] = T against the text character T at position 8, a match.

37 Example, second attempt. p[7] = C against the text character C at position 9, a match. All noholes match.

38 If a mismatch happens in the second phase, we move according to shift[i]. If all holes match, we move shift[0] steps. Example, second attempt, second phase: the rightmost hole p[8] = A against the text character A at position 10, a match.

39 Example, second attempt. The next hole p[3] = A against the text character C at position 5, a mismatch.

40 Example, second attempt. The mismatch is at hole 3, so the pattern moves Shift[3] = 5 steps. After this shift, the prefix ATCA of the pattern is already known to match.

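41 Example, third attempt. Text: ATATCCTATCATATCA. Pattern: ATCATATCA, now aligned with text position 7. The prefix ATCA is already matched, so the comparisons resume with p[4] = T against the text character T at position 11, a match.

42 Example, third attempt. p[5] = A against the text character A at position 12, a match.

43 Example, third attempt. p[6] = T against the text character T at position 13, a match.

44 Example, third attempt. p[7] = C against the text character C at position 14, a match.

45 Example, third attempt. The hole p[8] = A against the text character A at position 15, a match.

46 Example, third attempt. All positions match, so an occurrence of the pattern is found at text position 7. The pattern then moves Shift[0] = 5 steps, after which the prefix ATCA of the pattern is again already matched.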
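The three attempts can be replayed with a simplified sketch of the search. It follows the two phases and the shift rule as described in these slides, but it is not Colussi's full algorithm: it does not remember the already matched prefix after a shift, so it recompares those characters and does not achieve the same comparison count. All names are ours, and the tables are built by brute force.

```python
def kmp_value(p, i):
    """Brute-force Kmp[i]: the largest j such that p[:j] is a border of
    p[:i] and p[j] != p[i], or -1 if no such j exists."""
    for j in range(i - 1, -1, -1):
        if p[:j] == p[i - j:i] and p[j] != p[i]:
            return j
    return -1


def colussi_search_simplified(p, t):
    """Return all positions where p occurs in t (p must be non-empty)."""
    m, n = len(p), len(t)
    if m == 0:
        return []
    kmp = [kmp_value(p, i) for i in range(m)]
    all_periods = [k for k in range(1, m + 1) if p[k:] == p[:m - k]]
    # shift[i] = Rmin[i] for holes, Kmin[i] = i - Kmp[i] for noholes.
    shift = [min(q for q in all_periods if q > i) if kmp[i] == -1 else i - kmp[i]
             for i in range(m)]
    # Phase 1: noholes left to right. Phase 2: holes right to left.
    order = ([i for i in range(m) if kmp[i] > -1] +
             [i for i in range(m - 1, -1, -1) if kmp[i] == -1])
    hits, pos = [], 0
    while pos + m <= n:
        for i in order:
            if t[pos + i] != p[i]:
                pos += shift[i]      # mismatch: move by shift[i]
                break
        else:
            hits.append(pos)         # every position matched
            pos += shift[0]          # move by the smallest period
    return hits


print(colussi_search_simplified("ATCATATCA", "ATATCCTATCATATCA"))  # [7]
```

On the example text the attempts start at text positions 0, 2 and 7, matching the shifts Shift[2] = 2 and Shift[3] = 5 shown above.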

47 Time complexity of the Colussi Algorithm. The preprocessing phase can be done in O(m) space and time. The searching phase can then be done in O(n) time, and furthermore at most 3n/2 text character comparisons are performed during the searching phase.

48 References
[B92] BRESLAUER, D., Efficient String Algorithmics, Ph.D. Thesis, Report CU, Computer Science Department, Columbia University, New York, NY.
[C91] COLUSSI, L., Correctness and efficiency of the pattern matching algorithms, Information and Computation 95(2), 1991.
[CGG90] COLUSSI, L., GALIL, Z., GIANCARLO, R., On the exact complexity of string matching, in Proceedings of the 31st IEEE Annual Symposium on Foundations of Computer Science, 1990.
[GG92] GALIL, Z., GIANCARLO, R., On the exact complexity of string matching: upper bounds, SIAM Journal on Computing, Vol. 21, No. 3, 1992.