The MaxSuffix-Matching Algorithm


1 The MaxSuffix-Matching Algorithm
W. Rytter, On maximal suffixes and constant-space versions of KMP algorithm, LATIN 2002: Theoretical Informatics: 5th Latin American Symposium, Cancun, Mexico, April 3-6, 2002, Proceedings.
Advisor: Prof. R. C. T. Lee
Reporter: L. Y. Huang

2 Maximal Suffix
The maximal suffix of a string is the suffix that is lexicographically largest among all of its suffixes. The maximal suffix of a string w is denoted by MaxSuf(w).
Ex: Consider the string w = abaaba.
The set of its suffixes: {a, ba, aba, aaba, baaba, abaaba}
Its suffixes in sorted order: {a, aaba, aba, abaaba, ba, baaba}
Thus MaxSuf(w) = baaba.
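The definition can be checked directly with a short sketch. It is a naive, quadratic illustration and not the constant-space method of the paper; the function names `sorted_suffixes` and `max_suffix` are chosen here for illustration only.

```python
def sorted_suffixes(w: str) -> list[str]:
    """Return all non-empty suffixes of w in lexicographic order."""
    return sorted(w[k:] for k in range(len(w)))


def max_suffix(w: str) -> str:
    """Return the lexicographically largest non-empty suffix of w (naive)."""
    # Python compares strings lexicographically, and a proper prefix
    # is smaller than the longer string, as the definition requires.
    return max(w[k:] for k in range(len(w)))


if __name__ == "__main__":
    print(sorted_suffixes("abaaba"))
    print(max_suffix("abaaba"))
```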

3 Self-Maximal String
A string w is said to be self-maximal if MaxSuf(w) = w.
Ex: Consider the strings w = abaaba and x = baaba.
– MaxSuf(w) = baaba.
– MaxSuf(x) = baaba.
Hence x is a self-maximal string but w is not.
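A minimal sketch of the self-maximality test, assuming the plain definition above (the string must be at least as large as each of its proper suffixes). The name `is_self_maximal` is hypothetical.

```python
def is_self_maximal(w: str) -> bool:
    """True if w is lexicographically not smaller than any of its proper suffixes."""
    return all(w >= w[k:] for k in range(1, len(w)))


if __name__ == "__main__":
    for s in ("abaaba", "baaba"):
        print(s, is_self_maximal(s))
```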

4 Important Properties of Self-Maximal Strings
By definition, we have the following observation about self-maximal strings: for a self-maximal string P, if a prefix P[1], P[2], …, P[i] of P is equal to a substring P[k], P[k+1], …, P[k+i−1] of P, then P[i+1] ≥ P[k+i].
[Figure: P contains two occurrences of a string u, the prefix followed by x and a later occurrence followed by y; if x ≠ y then x > y.]

5 Example:
TCATBTCATA is a self-maximal string. TBATATBATB is not, because the B that follows the later occurrence of TBAT is lexicographically larger than the A that follows the prefix TBAT.
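The property on the previous slide can be searched for mechanically. The sketch below is a brute-force illustration under the assumption that the property reads "the character after a prefix is never smaller than the character after a later copy of that prefix"; `find_violation` is a hypothetical helper, and positions are 0-based.

```python
from typing import Optional


def find_violation(p: str) -> Optional[tuple[int, int]]:
    """Return (i, k) such that the prefix p[:i] reoccurs at 0-based position k
    and is followed there by a larger character than the prefix is; None if
    no such pair exists."""
    n = len(p)
    for i in range(1, n):            # length of the prefix
        for k in range(1, n - i):    # start of a later copy with a following char
            if p[k:k + i] == p[:i] and p[i] < p[k + i]:
                return i, k
    return None


if __name__ == "__main__":
    print(find_violation("TCATBTCATA"))  # no violation expected
    print(find_violation("TBATATBATB"))  # the prefix TBAT reoccurs, followed by B
```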

6 The Period of a String
A period of a string w is an integer p, 0 < p ≤ |w|, such that w[i] = w[i + p] for all 1 ≤ i ≤ |w| − p.
Ex:
– bbabbabbabba → 3 and 6 are both periods.
– abcdefg → period = word length = 7
– abcdeab → period = 5
We define period(w) as the smallest period of w. If w = bbabbabbabba, period(w) is 3.
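The examples on this slide can be reproduced with a small sketch. It assumes the usual definition of a period (w[i] = w[i + p] wherever both positions exist, with 0 < p ≤ |w|); the names `periods` and `smallest_period` are illustrative only.

```python
def periods(w: str) -> list[int]:
    """All periods of a non-empty string w, in increasing order."""
    n = len(w)
    return [p for p in range(1, n + 1)
            if all(w[i] == w[i + p] for i in range(n - p))]


def smallest_period(w: str) -> int:
    """period(w): the smallest period of a non-empty string w."""
    return periods(w)[0]


if __name__ == "__main__":
    print(periods("bbabbabbabba"))
    print(smallest_period("abcdefg"), smallest_period("abcdeab"))
```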

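7 Given a string P, we are actually interested in the period of every prefix.
i:             1 2 3 4 5 6 7 8 9
P:             a b c a a b c a b
period:        1 2 3 3 4 4 4 4 7
prefix:        0 0 0 1 1 2 3 4 2
i − prefix(i): 1 2 3 3 4 4 4 4 7
Note that the period of the prefix of length i equals i − prefix(i), where prefix is the prefix function of the MP algorithm. It is the number of positions by which the pattern can be shifted. (The index starts from 1 in this case.)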
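The relation between the prefix function and the period of each prefix can be sketched as follows. This is the textbook prefix-function computation, shown only to make the relation concrete; the function names are chosen here, and the returned lists are 0-based (entry i − 1 belongs to the prefix of length i).

```python
def prefix_function(p: str) -> list[int]:
    """prefix[i-1] = length of the longest proper suffix of p[:i] that is also its prefix."""
    prefix = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = prefix[k - 1]        # follow a pointer back into the table
        if p[i] == p[k]:
            k += 1
        prefix[i] = k
    return prefix


def prefix_periods(p: str) -> list[int]:
    """Period of every prefix of p, computed as i - prefix(i)."""
    prefix = prefix_function(p)
    return [(i + 1) - prefix[i] for i in range(len(p))]


if __name__ == "__main__":
    print(prefix_function("abcaabcab"))
    print(prefix_periods("abcaabcab"))
```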

8 Why are we interested in the period function?
If the period function carries the same information as the prefix function of the MP algorithm, why are we interested in it? To calculate the prefix function, we must follow pointers that refer back to characters much earlier in the pattern. In the following, we introduce a naïve period function that needs no such table of pointers.

9 Naive-Period Function
The function Naive-Period computes the period of a string, provided that the string is self-maximal. For a general string, Naive-Period does not work. This is why the algorithm applies it only to self-maximal strings.

10 Algorithm of Naive-Period Function
Function Naive-Period(j); { computes the period of each prefix of a self-maximal pattern pat }
  period(1) := 1;
  for i := 2 to j do
    if pat[i] ≠ pat[i − period(i − 1)] then period(i) := i
    else period(i) := period(i − 1);
  return period;
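A runnable sketch of the function follows. The comparison inside the loop is missing from the slide text; the sketch assumes it is pat[i] ≠ pat[i − period(i − 1)], which is the test that appears in the MaxSuffix-Matching algorithm later in these slides. The result is only meaningful for self-maximal input.

```python
def naive_period(pat: str) -> list[int]:
    """Return a list whose entry i-1 is the value Naive-Period gives to the
    prefix of length i. Assumes pat is non-empty and self-maximal."""
    period = [1] * len(pat)
    for i in range(2, len(pat) + 1):          # i is the 1-based prefix length
        previous = period[i - 2]              # period(i - 1)
        # Compare pat[i] with pat[i - period(i - 1)] (1-based indices).
        if pat[i - 1] != pat[i - 1 - previous]:
            period[i - 1] = i
        else:
            period[i - 1] = previous
    return period


if __name__ == "__main__":
    print(naive_period("bbabbabbab"))
```

Only the previous value is ever read, so a single variable would suffice; the list is kept here to mirror the table used in the slides.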

11-19 An Example of Naive-Period Function
Consider the string w = bbabbabbab.
– w is a self-maximal string and period(w) = 3.
The function fills in the table from left to right, one column per value of i. At each step w[i] is compared with w[i − period(i − 1)]; the period changes only at i = 3, where w[3] = a differs from w[2] = b.
i:                 1 2 3 4 5 6 7 8 9 10
w:                 b b a b b a b b a b
i − period(i − 1): 0 1 2 1 2 3 4 5 6 7
period(i):         1 1 3 3 3 3 3 3 3 3

20 Why can the naïve period function work for self-maximal strings?
Given any pattern P, let k be the length of the longest proper suffix of P[1, i−1] that is equal to a prefix P[1, k] of P[1, i−1]. Let k’ be the length of the longest proper suffix of P[1, i] that is equal to a prefix P[1, k’] of P[1, i]. For any i, we consider the five possibilities listed on the next slide.
[Figure: P[1, i−1] with its matching prefix and suffix of length k, and P[1, i] with its matching prefix and suffix of length k’.]

21 Possible cases:
1. k ≠ 0 and P[k + 1] = P[i]: Period(i) = Period(i − 1)
2. k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0: Period(i) = i − k’
3. k ≠ 0, P[k + 1] ≠ P[i] and k’ = 0: Period(i) = i
4. k = 0 and k’ ≠ 0: Period(i) = i − k’
5. k = 0 and k’ = 0: Period(i) = i

22 1. k ≠ 0 and P[k + 1] = P[i]: Period(i) = Period(i − 1)
i:      1 2 3 4 5 6 7 8
P:      a b c a a b c a
period: 1 2 3 3 4 4 4 4
For i = 8, the substring “abc” of length 3 (k = 3) is the longest suffix of P[1, 7] that equals a prefix of P[1, 7], and P[8] = P[4]. Hence period(8) = period(7) = 4.

23 2. k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0: Period(i) = i − k’
i:      1 2 3 4 5 6 7 8 9
P:      a b c a a b c a b
period: 1 2 3 3 4 4 4 4 7
For i = 9, the substring “abca” of length 4 (k = 4) is the longest suffix of P[1, 8] that equals a prefix of P[1, 8], and P[9] ≠ P[5]. There is a suffix of P[1, 9] that equals a prefix of P[1, 9], namely P[1, 2] = ab of length 2 (k’ = 2). Hence period(9) = i − |P[1, 2]| = 9 − 2 = 7.

24 3. k ≠ 0, P[k + 1] ≠ P[i] and k’ = 0: Period(i) = i
i:      1 2 3 4 5 6 7 8 9
P:      a b c c a b c c b
period: 1 2 3 4 4 4 4 4 9
For i = 9, the substring “abcc” of length 4 (k = 4) is the longest suffix of P[1, 8] that equals a prefix of P[1, 8], and P[9] ≠ P[5]. There is no suffix of P[1, 9] that equals a prefix of P[1, 9] (k’ = 0). Hence period(9) = i = 9.

25 4. k = 0 and k’ ≠ 0: Period(i) = i − k’
i:      1 2 3 4 5 6 7 8 9
P:      a b c c b b c c a
period: 1 2 3 4 5 6 7 8 8
For i = 9, there is no suffix of P[1, 8] that equals a prefix of P[1, 8] (k = 0). The substring “a” of length 1 (k’ = 1) is a suffix of P[1, 9] that equals a prefix of P[1, 9], namely P[1, 1] = a. Hence period(9) = i − |P[1, 1]| = 9 − 1 = 8.

26 5. k = 0 and k’ = 0: Period(i) = i
i:      1 2 3 4 5 6 7 8 9
P:      a b c c b b c c b
period: 1 2 3 4 5 6 7 8 9
For i = 9, there is no suffix of P[1, 8] that equals a prefix of P[1, 8] (k = 0). There is no suffix of P[1, 9] that equals a prefix of P[1, 9] (k’ = 0). Hence period(9) = i = 9.
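The five cases and their examples can be reproduced with a brute-force sketch. The helper names `border` and `classify` are hypothetical; k and k’ are computed naively from their definitions, and i is the 1-based prefix length used in the slides.

```python
def border(s: str) -> int:
    """Length of the longest proper non-empty suffix of s equal to a prefix of s, or 0."""
    for length in range(len(s) - 1, 0, -1):
        if s[:length] == s[-length:]:
            return length
    return 0


def classify(p: str, i: int) -> tuple[int, int]:
    """Return (case number, Period(i)) for the prefix of length i >= 2 of p."""
    k = border(p[:i - 1])        # k  for P[1, i-1]
    k2 = border(p[:i])           # k' for P[1, i]
    if k != 0 and p[k] == p[i - 1]:      # P[k+1] = P[i]
        case = 1
    elif k != 0:
        case = 2 if k2 != 0 else 3
    else:
        case = 4 if k2 != 0 else 5
    return case, i - k2          # the period of P[1, i] is always i - k'


if __name__ == "__main__":
    for pattern in ("abcaabca", "abcaabcab", "abccabccb", "abccbbcca", "abccbbccb"):
        print(pattern, classify(pattern, len(pattern)))
```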

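27 In cases 2 and 4 we have k’ ≠ 0, so some proper suffix of P[1, i] is equal to a prefix of P[1, i]. Case 2 (k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0) cannot occur in a self-maximal string. Case 4 (k = 0 and k’ ≠ 0) can occur, but only with k’ = 1, and then Period(i) = i − 1 = Period(i − 1). Why?

28 2. k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0
Suppose that P is self-maximal. Let u be the longest proper suffix of P[1, i − 1] that is equal to a prefix of P[1, i − 1], and let j = k + 1, x = P[j] and y = P[i]. The prefix u is followed by x and the later occurrence of u is followed by y. Since x ≠ y and P is self-maximal, x > y.
[Figure: P = u x … u y, where the second u ends at position i − 1.]
Since k’ ≠ 0, there is a string vy that is the longest proper suffix of P[1, i] equal to a prefix of P[1, i].
[Figure: P begins with vy, and P[1, i] ends with vy.]

29 The string v is a suffix of P[1, i − 1] that equals a prefix, so |v| ≤ |u|; and |v| = |u| would give P[k + 1] = y = P[i], so |v| < |u|. Hence v is both a prefix and a proper suffix of u, and the copy of v at the end of the prefix u is followed by x, while the prefix v of P is followed by y.
[Figure: P = v y … v x … v y, where the middle v ends the prefix u.]
Since P is a self-maximal string, the prefix v gives y ≥ x. This contradicts x > y. Therefore k ≠ 0, P[k + 1] ≠ P[i] and k’ ≠ 0 cannot hold for self-maximal strings.

30 In case 4, k = 0 means that Period(i − 1) = i − 1. If k’ were larger than 1, removing the last character of the matching suffix of P[1, i] would give a nonempty matching suffix of P[1, i − 1], contradicting k = 0. So k’ = 1 and Period(i) = i − 1 = Period(i − 1). Thus we have the following: for self-maximal strings, Period(i) = Period(i − 1) or Period(i) = i. That is, the naïve period function works for self-maximal strings.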
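The conclusion can be checked exhaustively on small inputs. The sketch below compares a one-variable version of the naïve period function with the period computed from the definition, for every self-maximal string over the alphabet {a, b} up to a given length. All function names are chosen here for illustration, and abaab is used as an example of a string that is not self-maximal.

```python
from itertools import product


def true_period(w: str) -> int:
    """Smallest p > 0 with w[i] == w[i + p] wherever both positions exist."""
    n = len(w)
    return next(p for p in range(1, n + 1)
                if all(w[i] == w[i + p] for i in range(n - p)))


def naive_period(w: str) -> int:
    """Naive period of the whole string, using a single variable."""
    period = 1
    for i in range(1, len(w)):           # 0-based index of the new character
        if w[i] != w[i - period]:
            period = i + 1
    return period


def is_self_maximal(w: str) -> bool:
    return all(w >= w[k:] for k in range(1, len(w)))


def agrees_on_self_maximal(max_len: int = 8, alphabet: str = "ab") -> bool:
    """True if naive_period equals true_period on every self-maximal string tried."""
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            w = "".join(letters)
            if is_self_maximal(w) and naive_period(w) != true_period(w):
                return False
    return True


if __name__ == "__main__":
    print(agrees_on_self_maximal())
    print(naive_period("abaab"), true_period("abaab"))  # differ: abaab is not self-maximal
```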

31 What is the advantage of the naïve-period function?
It runs in linear time, and it never needs to follow pointers back to much earlier characters, as the computation of the prefix function in the MP algorithm does.

32 For a pattern that is not self-maximal, we use the following algorithm, called the MaxSuffix-Matching Algorithm.

33 MaxSuffix-Matching Algorithm
First, we decompose the pattern string P into u · v, where v = MaxSuf(P) and u is the remaining prefix of P. Note that v occurs only once in P, and this is a very important property.
Property 1: No nonempty suffix of u is equal to a prefix of v, since otherwise P would have a suffix larger than v.
Example: P = dababdadad
MaxSuf(P) = dadad
P = u · v = dabab · dadad
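The decomposition and the two properties on this slide can be illustrated with a naive sketch. It is quadratic and is not the linear-time, constant-space computation of the maximal suffix from the paper; `decompose` is a hypothetical name.

```python
def decompose(p: str) -> tuple[str, str]:
    """Split a non-empty pattern p into (u, v) with v = MaxSuf(p) and p = u + v (naive)."""
    start = max(range(len(p)), key=lambda k: p[k:])
    return p[:start], p[start:]


if __name__ == "__main__":
    u, v = decompose("dababdadad")
    print(u, v)
    # v occurs only once in P: its first occurrence is the suffix itself.
    print("dababdadad".find(v) == len(u))
    # Property 1: no nonempty suffix of u is a prefix of v.
    print(all(not v.startswith(u[k:]) for k in range(len(u))))
```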

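34 MaxSuffix-Matching Algorithm
If v is found in T, we next check by naive testing whether u occurs immediately to the left of v. Assume i is the position of an occurrence of v in T, and let prev denote the position of the previous occurrence of v. Because v occurs only once in P, u · v cannot end at the occurrence at i when i − prev < |u|.
[Figure: the text T with two occurrences of v, the earlier one at position prev and the current one at position i.]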

35 MaxSuffix-Matching Algorithm
Algorithm MaxSuffix-Matching
  i := 0; j := 0; period := 1; prev := 0;
  while i ≤ n − |v| do begin
    while j < |v| and v[j + 1] = T[i + j + 1] do begin
      j := j + 1;
      if j > period and v[j] ≠ v[j − period] then period := j   { Naive-Period function }
    end;
    { MATCH OF v }
    if j = |v| then begin
      if i − prev ≥ |u| and u = T[i − |u| + 1 … i] then report match at i − |u|;   { test u by using any algorithm }
      prev := i
    end;
    i := i + period;
    if j ≥ 2 · period then j := j − period
    else begin j := 0; period := 1 end
  end;
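A runnable sketch of the algorithm follows, under several assumptions that should be read as such: the inner comparison is taken to be v[j + 1] = T[i + j + 1] and the match test j = |v|; the test for u is made when i − prev ≥ |u|, so that an occurrence at the very beginning of the text is not skipped; the decomposition P = u · v is computed naively; and Python slices are used for u and v, so the sketch itself is not constant-space. Positions are 0-based.

```python
def maxsuffix_matching(text: str, pattern: str) -> list[int]:
    """Return the 0-based start positions of a non-empty pattern in text."""
    # Naive decomposition pattern = u + v with v the maximal suffix.
    start = max(range(len(pattern)), key=lambda k: pattern[k:])
    u, v = pattern[:start], pattern[start:]
    n, m = len(text), len(v)

    matches = []
    i = 0        # current shift of v against the text
    j = 0        # number of characters of v matched at this shift
    period = 1   # naive period of the matched prefix v[:j]
    prev = 0     # shift of the previous occurrence of v (0 initially)
    while i <= n - m:
        while j < m and v[j] == text[i + j]:
            j += 1
            # Naive-Period step on the self-maximal string v.
            if j > period and v[j - 1] != v[j - 1 - period]:
                period = j
        if j == m:  # occurrence of v at shift i
            if i - prev >= len(u) and all(
                text[i - len(u) + t] == u[t] for t in range(len(u))
            ):
                matches.append(i - len(u))
            prev = i
        i += period
        if j >= 2 * period:
            j -= period          # the matched prefix keeps the same period
        else:
            j, period = 0, 1
    return matches


if __name__ == "__main__":
    print(maxsuffix_matching("adadaddadabababadada", "abababadada"))
```

The only state carried between shifts is the four integers i, j, period and prev, which is what the constant-space claim in the slides refers to.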

36 Example
Text = adadaddadabababadada
P = u · v = abababa · dada
Case 1
– If i < |u|, there is no occurrence of u · v ending at this occurrence of v, because fewer than |u| characters precede it.
Text:  a d a d a d d a d a b a b a b a d a d a
v:       d a d a                                 (i = 2)

37 Example
Text = adadaddadabababadada
P = u · v = abababa · dada
Case 2
– If i − prev < |u|, then there is no occurrence of u · v at position i − |u|. This is because the maximal suffix v of P starts at only one position in P.
Text:  a d a d a d d a d a b a b a b a d a d a
v:       d a d a                                 (prev = 2)
v:                 d a d a                       (i = 7)
Here i = 7, |u| = 7 and prev = 2.

38 Example
Text = adadaddadabababadada
P = u · v = abababa · dada
So, in this example, we only need to check whether u occurs to the left of the third occurrence of v.
Text:  a d a d a d d a d a b a b a b a d a d a
v:       d a d a                                 (first occurrence)
v:                 d a d a                       (second occurrence)
v:                                     d a d a   (third occurrence)

39 Time Complexity and Space Complexity
Hence, the MaxSuffix-Matching Algorithm finds all occurrences of a pattern in linear time using O(1) extra space (the variables i, j, period and prev).

40 Reference
– Maxime Crochemore, String-matching on ordered alphabets, Theoretical Computer Science, v.92 n.1, p.33-47, Jan. 6, 1992
– Maxime Crochemore, Dominique Perrin, Two-way string-matching, Journal of the ACM (JACM), v.38 n.3, July 1991
– Maxime Crochemore, Wojciech Rytter, Text algorithms, Oxford University Press, Inc., New York, NY, 1994
– M. Crochemore, W. Rytter, Cubes, squares and time space efficient string matching, Algorithmica 13 (5) (1995)
– J.-P. Duval, Factorizing words over an ordered alphabet, J. Algorithms 4 (1983)

41 Reference
– Z. Galil, J. Seiferas, Time-space-optimal string matching, J. Comput. System Sci. 26 (1983)
– L. Gasieniec, W. Plandowski, W. Rytter, Constant-space string matching with smaller number of comparisons: sequential sampling, in: Z. Galil, E. Ukkonen (Eds.), Combinatorial Pattern Matching, 6th Annual Symposium, CPM 95, Lecture Notes in Computer Science, Vol. 937, Springer, Berlin, 1995
– Leszek Gasieniec, Wojciech Plandowski, Wojciech Rytter, The zooming method: a recursive approach to time-space efficient string-matching, Theoretical Computer Science, v.147 n.1-2, Aug. 7, 1995
– D.E. Knuth, J.H. Morris, V.R. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977)
– M. Lothaire, Combinatorics on Words, Addison-Wesley, Reading, MA, USA, 1983

42 ~Thank You~