
1 Speeding up on two string matching algorithms. Advisor: Prof. R. C. T. Lee. Speaker: Kuei-hao Chen. Paper: Crochemore, M., Czumaj, A., Gasieniec, L., Jarominek, S., Lecroq, T., Plandowski, W. and Rytter, W., Algorithmica, Vol. 12, 1994, pp. 247-267.

2 Problem Definition Input : A text T and a pattern P. Output : Find all occurrences of P in T

3 Rule 1: The Suffix to Prefix Rule. For a window to have any chance of matching the pattern, there must be a suffix of the window which is equal to a prefix of the pattern.

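4 Basic Ideas. Open a window W of size |P| in the text T and find the longest suffix of W that is also a prefix of the pattern P. Case 1: the suffix is the whole window, so W matches P. [Figure: window W of length |P| in T, aligned with P]

5 Case 2: the longest such suffix is shorter than |P|. We shift W so that this suffix lines up with the matching prefix of P. Case 3: if there is no such suffix, we move W by |P|. [Figure: window W in T before and after the shift]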
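The three cases can be sketched in Python as follows. This is a minimal sketch, not the paper's implementation: the names `longest_proper_overlap` and `search` are hypothetical, the overlap is found by brute force rather than with a suffix automaton, and after a full match the window is shifted by the longest proper overlap so that the shift is never zero (an assumption the slides do not spell out).

```python
def longest_proper_overlap(window, pattern):
    """Length of the longest suffix of `window`, shorter than the pattern,
    that is equal to a prefix of `pattern` (0 if there is none)."""
    for k in range(min(len(window), len(pattern)) - 1, 0, -1):
        if window.endswith(pattern[:k]):
            return k
    return 0


def search(text, pattern):
    """Return the start positions of all occurrences of `pattern` in `text`."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must not be empty")
    occurrences = []
    pos = 0
    while pos + m <= len(text):
        window = text[pos:pos + m]
        if window == pattern:          # Case 1: the whole window matches
            occurrences.append(pos)
        overlap = longest_proper_overlap(window, pattern)
        # Case 2: overlap > 0, align the prefix of P with that suffix of W.
        # Case 3: overlap == 0, move the window by |P|.
        pos += m - overlap
    return occurrences


print(search("GCATCGCAGAGAGTATACAGTACG", "GCAGAGAG"))
```

The shift `m - overlap` is safe because any occurrence starting inside the skipped range would make a longer suffix of the window equal to a prefix of the pattern.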

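6 Preprocessing phase. T=GCATCGCAGAGAGTATACAGTACG, P=GCAGAGAG. We construct the suffix automaton S for the pattern, built on the reversed pattern so that L(S), the set of strings it accepts, contains all prefixes of the pattern read from right to left. [Figure: suffix automaton with states 0 to 8]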

7 Preprocessing: Construct a Suffix Tree. Let P^R be the reversal string of P; for P=GCAGAGAG, P^R=GAGAGACG. We build the suffix tree for P^R. [Figure: suffix tree for P^R]
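The preprocessing step can be illustrated with a small suffix automaton for the reversed pattern. This is a sketch using the standard online construction, which may differ from the construction used in the paper; the class name `SuffixAutomaton` and its methods are hypothetical names chosen for the example.

```python
class SuffixAutomaton:
    """Suffix automaton of s: it reads exactly the substrings of s and
    accepts (ends in a terminal state on) exactly the suffixes of s."""

    def __init__(self, s):
        self.trans = [{}]    # trans[state][char] -> next state
        self.link = [-1]     # suffix link of each state
        self.length = [0]    # length of the longest string reaching each state
        last = 0
        for ch in s:
            last = self._extend(last, ch)
        # Terminal states lie on the suffix-link path from the last state.
        self.terminal = set()
        state = last
        while state != -1:
            self.terminal.add(state)
            state = self.link[state]

    def _new_state(self, length, trans=None, link=-1):
        self.trans.append(dict(trans or {}))
        self.link.append(link)
        self.length.append(length)
        return len(self.trans) - 1

    def _extend(self, last, ch):
        cur = self._new_state(self.length[last] + 1)
        p = last
        while p != -1 and ch not in self.trans[p]:
            self.trans[p][ch] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
            return cur
        q = self.trans[p][ch]
        if self.length[p] + 1 == self.length[q]:
            self.link[cur] = q
            return cur
        # Split state q: the clone keeps the shorter strings.
        clone = self._new_state(self.length[p] + 1, self.trans[q], self.link[q])
        while p != -1 and self.trans[p].get(ch) == q:
            self.trans[p][ch] = clone
            p = self.link[p]
        self.link[q] = clone
        self.link[cur] = clone
        return cur

    def _walk(self, word):
        state = 0
        for ch in word:
            state = self.trans[state].get(ch)
            if state is None:
                return None
        return state

    def is_substring(self, word):
        return self._walk(word) is not None

    def is_suffix(self, word):
        return self._walk(word) in self.terminal


automaton = SuffixAutomaton("GCAGAGAG"[::-1])   # automaton of the reversed pattern
print(automaton.is_suffix("ACG"), automaton.is_substring("GAC"), automaton.is_suffix("GAC"))
```

Reading a window from right to left through this automaton tells, at each step, whether the part read so far is a substring of the pattern and whether it is a prefix of the pattern.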

8 Example 1. W=GCATCGCA, P=GCAGAGAG. We want to find the longest suffix of W which is equal to a prefix of P. The reversals are W^R=ACGCTACG and P^R=GAGAGACG, and we use the suffix tree for P^R. We find that ACG (a prefix of W^R, so its reversal GCA is a suffix of W) is a suffix of P^R (so GCA is a prefix of P). Thus GCA is the longest suffix of W which is equal to a prefix of P.

9 Example 2. W=GTATACAG, P=GCAGAGAG, so the reversals are W^R=GACATATG and P^R=GAGAGACG. Using the suffix tree for P^R, we find that GAC is the longest prefix of W^R (thus its reversal CAG is the longest suffix of W) which is equal to a substring of P^R. But GAC is not a suffix of P^R, and GACA is not a suffix of P^R either (it is not even a substring).

10 W^R=GACATATG, P^R=GAGAGACG (the reversals of W and P). Luckily, a prefix of GACG, namely G, is also a suffix of P^R. G can be found by finding the lowest common ancestor of G and GACG in the suffix tree. Thus G is the longest prefix of W^R (suffix of W) which is equal to a suffix of P^R (prefix of P).

11 Let W^R and P^R be the reversals of W and P. Let X be the longest prefix of W^R (suffix of W) which is equal to a substring of P^R, but not a suffix of P^R. Let Y be the longest prefix of X (a suffix of W) which is equal to a suffix of P^R (prefix of P). Then Y is the longest suffix of W equal to a prefix of P.

12 Z is a suffix of the reversed pattern P^R which can be found in the suffix tree of P^R. Y may not exist. If it exists, it must be in the suffix tree of P^R and must have been found before X is found, because Y is a prefix of X.
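One attempt of the scan described above can be sketched in Python. This is a minimal sketch under stated assumptions: the function name `rf_attempt` is hypothetical, a plain substring test and a set of suffixes stand in for the suffix tree, and X here is simply the longest suffix of W that occurs in P (it may coincide with Y). Both results are returned in window orientation, so the reversed strings GAC and G of Example 2 appear as CAG and G.

```python
def rf_attempt(window, pattern):
    """Scan `window` from right to left against the reversed pattern.

    Returns (X, Y): X is the longest suffix of the window that is a
    substring of the pattern, Y is the longest suffix of the window that
    is a prefix of the pattern. Y is always a suffix of X.
    """
    reversed_pattern = pattern[::-1]
    reversed_window = window[::-1]
    # Suffixes of the reversed pattern are the reversed prefixes of the pattern.
    suffixes = {reversed_pattern[i:] for i in range(len(reversed_pattern))}

    x_len = 0
    y_len = 0
    for k in range(1, len(reversed_window) + 1):
        prefix = reversed_window[:k]
        if prefix not in reversed_pattern:   # no longer a substring: stop reading
            break
        x_len = k
        if prefix in suffixes:               # remember the last suffix seen
            y_len = k
    return window[len(window) - x_len:], window[len(window) - y_len:]


print(rf_attempt("GCATCGCA", "GCAGAGAG"))   # Example 1
print(rf_attempt("GTATACAG", "GCAGAGAG"))   # Example 2
```

Because Y is recorded while X is still being extended, Y is known before the scan stops, which is the observation made on the slide.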

13 Preprocessing phase: the worst-case time complexity is O(m). Searching phase: the worst-case time complexity is O(mn), but the average case needs only O(n log_r m / m) time, where r is the size of the alphabet, as shown in this paper.

14 For the average-case analysis of the RF algorithm, assume that the text is a random sequence over an alphabet of size r and that m is large enough. This assumption is reasonable. For example, let m=16, r=4.

15 Theorem. The expected time of the RF algorithm is O(n log_r m / m). Proof. Note that r>1, and. For a pattern with length m, there are no more than m substrings. Thus, there are at most m substrings with length.
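The counting step of the proof can be illustrated with the values m=16, r=4 from the previous slide. This is a sketch under an assumption: the length threshold is taken to be \ell = 2\log_r m, the usual choice in this kind of argument, since the threshold itself is not legible in the slide.

```latex
\begin{align*}
\ell &= 2\log_r m, \qquad r^{\ell} = m^{2} \quad \text{(number of words of length } \ell\text{)} \\
\#\{\text{substrings of } P \text{ of length } \ell\} &\le m - \ell + 1 \le m \\
\Pr[\text{a random word of length } \ell \text{ is a substring of } P] &\le \frac{m}{m^{2}} = \frac{1}{m} \\
m = 16,\ r = 4:\quad \ell &= 4, \qquad r^{\ell} = 256, \qquad \Pr \le \frac{16}{256} = \frac{1}{16}
\end{align*}
```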

16 Let L_i be the length of the shift in the ith attempt of the RF algorithm, and let X_i and Y_i be the X and the Y of the ith attempt, respectively. Let S_i be the length of the longest prefix of the reversed window which appears in the reversed pattern in the ith attempt. That is, S_i = |X_i|. Let A_i = |Y_i|, so that A_i <= S_i because Y_i is a prefix of X_i.

17 In the first attempt of RF algorithm,

18 Let us call the ith shift long if and only if and short otherwise. (It implies that L i is long if.)

19 When at least new symbols are being read at the current attempt, with probability there are at most characters of the suffix of the window can match a substring of P, which causes a long shift.

20 We divide all attempts into phases. Each phase ends on the first long shift. In other words, there is exactly one long shift in each phase.

21 There are two main ideas in the paper: (1)The number of all phases is. (2)We calculate the expected number of comparison of each phase. An expected number of comparison of each phase is. We shall discuss above two ideas in the next slides.

The number of all phases is. We know that the length of long shift is Then The number of all phases is

23 Claim 1: Assume that L i and L i+1 are both short. Proof. Suppose L i and L i+1 <, then the pattern is of the form where, w,. Then. That is, L i+2 is the end of a phase. Next, we calculate the expected number of comparisons in each phase.

24 Note that Y i denotes a longest suffix of the window W i which is equal to a prefix of the pattern, where W i is a window of the text of length m in the ith attempt. Let B i be the set of new symbols to read in the ith attempt. Note that the pattern is of the form. Then,,.

25 Let B i+1 be because there exists an overlap between Y i and Y i+1, and

26 Example: T=bbcabcabcabcabcadc P=cabcabcabcadd, w=ab,v=c, s=a,z=dd. Then,, When P shifts L i+1, the overlap of Y i and Y i+1 is

27. If there exists a word such that, then because is a minimal period of. Without loss of generality we can assume that is a minimal period of. Hence,

28 Example: P=abcabcabcabcabcabcabbc, w=cabc,v=ab, s=b,z=c. w 1 v 1 is a minimal period of P.

29 We can also assume (eventually changing wv and k) that and sz do not have a common prefix. We may therefore obtain a new fragment s 1 z 1 such that

30 A suffix of the read part of the text is of the form, and we have at least C=min(L i+1, L i ) new symbols to read in the (i+2)th attempt. Let e be a random word of length C to be read part of the text such that.

31 Note that If |B i |>|B i+1 |, then, otherwise,,.

32 We give an example when |B i |>|B i+1 |. T=bbbaaaaaaaacda P=aaaaaaaabc, w=a,v=a, s=a, z=bc.

33 We give another example when T=bbcabcabcabcabcadc P=cabcabcabcadd, w=ab,v=c, s=a, z=dd.

34 It is easy to see that if w 1 v 1 s 1 e is a substring of, then y must be either equal to pref(z 1 ) if, or otherwise.

35 In other words, by the above condition, if, w 1 v 1 s 1 e would only appear to the end of P. Therefore, e=pref(z 1 ). otherwise, w 1 v 1 s 1 e may appear to any position of P. Therefore,

36 The probability that reading e new symbols leads to a long (longer than L i +L i+1 which is less than ) substring of the pattern is no greater than. Note that

37 Therefore,

38 By Claim 1, the assumptions say that when the (k-1)th and (k-2)th shifts are both short, the kth shift is long with probability. It implies that the kth shift of the phase is short with probability for

39 Let F be the random variable which is the number of short shifts in the phase. What can we say about the probability distribution of F?

40 By claim 1, we know when (k-2)th and (k-1)th are both short,.

41 Let G be the random variable which is the number of comparison of the phase and let L be the number of comparison of a long shift of the phase. Then The problem is on how to find L.

42 For the number of comparison of a long shift of the phase, we know and. Note that S i is the length of the substring of the pattern that is matched in W i. Hence,

43 For the expected number of comparison of each phase, we have

According to above discussion, we know that there are phases in the algorithm and an expected number of comparison of each phase is. Therefore, the expected time of the RF algorithm is.

45 In this paper, the authors use X to analyze the average case of the RF algorithm, where X is the longest suffix of W which is equal to a substring of P. In fact, the main idea of the RF algorithm is to find Y, not X. Therefore, we re-analyze the algorithm through the expected length of Y_i. Note that the shift is L_i = m - |Y_i| = m - A_i. If A_i is small, L_i is large. We expect A_i to be very small.

46 Given a window W i of T in the ith attempt and a pattern P, the expected length of the longest suffix of W i equal to a prefix of P is …..(1) …..(2)

47 (2) － (1)

48 We can deduce that
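One derivation consistent with the value r/(r-1)^2 that heads a column of the experiment tables can be sketched as follows. It rests on assumptions not visible in the slides: a suffix of length k of the window is taken to equal the prefix of length k of P with probability r^{-k}, the expectation E is approximated by the sum of the terms k r^{-k}, and the finite sum is replaced by the infinite series.

```latex
\begin{align*}
E &= \sum_{k \ge 1} \frac{k}{r^{k}} = \frac{1}{r} + \frac{2}{r^{2}} + \frac{3}{r^{3}} + \cdots \tag{1} \\
rE &= \sum_{k \ge 1} \frac{k}{r^{k-1}} = 1 + \frac{2}{r} + \frac{3}{r^{2}} + \cdots \tag{2} \\
(2) - (1):\quad (r-1)E &= 1 + \frac{1}{r} + \frac{1}{r^{2}} + \cdots = \frac{r}{r-1} \\
E &= \frac{r}{(r-1)^{2}}, \qquad \text{e.g. } r = 4:\ E = \frac{4}{9} \approx 0.4444
\end{align*}
```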

49 In the first experiment, we randomly generate some texts and patterns using Knuth's random generating function.

| Data source | Text length | Pattern length | Alphabet size r | Total number of comparisons with matches | Number of windows | Expected number of comparisons per window, r/(r-1)^2 | Average number of comparisons per window |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Random | 1000 | 30 | 4 | 17 | 33 | 0.4444 | 0.515152 |
| Random | 10000 | 30 | 4 | 151 | 338 | 0.4444 | 0.446746 |
| Random | 100000 | 30 | 4 | 1316 | 3377 | 0.4444 | 0.389695 |
| Random | 1000000 | 30 | 4 | 13074 | 33769 | 0.4444 | 0.38716 |
| Random | 1000 | 50 | 5 | 6 | 20 | 0.3125 | 0.3 |
| Random | 10000 | 50 | 5 | 64 | 201 | 0.3125 | 0.318408 |
| Random | 100000 | 50 | 5 | 645 | 2012 | 0.3125 | 0.320577 |
| Random | 1000000 | 50 | 5 | 6045 | 20120 | 0.3125 | 0.300447 |

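50

| Data source | Text length | Pattern length | Alphabet size r | Total number of comparisons with matches | Number of windows | Expected number of comparisons per window, r/(r-1)^2 | Average number of comparisons per window |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Random | 1000 | 30 | 10 | 5 | 33 | 0.1234 | 0.151515 |
| Random | 10000 | 30 | 10 | 30 | 334 | 0.1234 | 0.08982 |
| Random | 100000 | 30 | 10 | 346 | 3344 | 0.1234 | 0.103469 |
| Random | 1000000 | 30 | 10 | 3930 | 33463 | 0.1234 | 0.117443 |
| Random | 1000 | 100 | 7 | 4 | 10 | 0.1944 | 0.4 |
| Random | 10000 | 100 | 7 | 14 | 100 | 0.1944 | 0.14 |
| Random | 100000 | 100 | 7 | 195 | 1001 | 0.1944 | 0.194805 |
| Random | 1000000 | 100 | 7 | 1865 | 10018 | 0.1944 | 0.186165 |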

51 In the second experiment, we take news reports from the CNN site as T and randomly pick a word as P.

| Data source | Text length | Pattern length | Alphabet size r | Total number of comparisons with matches | Number of windows | Expected number of comparisons per window, r/(r-1)^2 | Average number of comparisons per window |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CNN news | 3715 | 7 | 35 | 32 | 535 | 0.0302 | 0.0598 |
| CNN news | 2222 | 14 | 40 | 2 | 158 | 0.0262 | 0.0126 |

52 In the third experiment, we take three fragments of human chromosomes as T. The pattern is taken from part of T.

| Data source | Text length | Pattern length | Alphabet size r | Total number of comparisons with matches | Number of windows | Expected number of comparisons per window, r/(r-1)^2 | Average number of comparisons per window |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Human Chromosome 21, NT_011512.10 | 1627105 | 70 | 4 | 8942 | 23372 | 0.4444 | 0.3826 |
| Human Chromosome 22, NT_011515.11 | 3437231 | 70 | 4 | 24648 | 49455 | 0.4444 | 0.4984 |
| Human Chromosome X, NT_033330.7 | 754004 | 70 | 4 | 5029 | 10843 | 0.4444 | 0.4638 |

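53 The distribution of the length of the longest suffix of the window which is equal to a prefix of the pattern. T and P are the lengths of the text and the pattern; columns 0 to 70 are suffix lengths and the entries are numbers of windows.

| Data source | T | P | r | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 30 | 70 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random | 1000 | 30 | 4 | 23 | 4 | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 10000 | 30 | 4 | 240 | 60 | 27 | 7 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 100000 | 30 | 4 | 2429 | 661 | 225 | 48 | 9 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 1000000 | 30 | 4 | 24650 | 6200 | 2147 | 587 | 128 | 41 | 11 | 4 | 1 | 0 | 0 | 0 | 0 |
| Random | 1000 | 50 | 5 | 16 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 10000 | 50 | 5 | 150 | 40 | 9 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 100000 | 50 | 5 | 1508 | 395 | 88 | 15 | 1 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 1000000 | 50 | 5 | 15314 | 3799 | 812 | 163 | 27 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 1000 | 30 | 10 | 28 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 10000 | 30 | 10 | 305 | 28 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 100000 | 30 | 10 | 3027 | 289 | 27 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 1000000 | 30 | 10 | 29959 | 3119 | 353 | 27 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 1000 | 100 | 7 | 7 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 10000 | 100 | 7 | 86 | 14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Random | 100000 | 100 | 7 | 841 | 131 | 23 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |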

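54 The distribution of the length of the longest suffix of the window which is equal to a prefix of the pattern. T and P are the lengths of the text and the pattern; columns 0 to 70 are suffix lengths and the entries are numbers of windows.

| Data source | T | P | r | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 30 | 70 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CNN news | 3715 | 7 | 35 | 503 | 32 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| CNN news | 2222 | 14 | 40 | 156 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Human Chromosome 21, NT_011512.10 | 1627105 | 70 | 4 | 16265 | 5823 | 967 | 213 | 68 | 16 | 13 | 3 | 2 | 1 | 0 | 0 | 1 |
| Human Chromosome 22, NT_011515.11 | 3437231 | 70 | 4 | 32269 | 11843 | 3990 | 891 | 295 | 111 | 45 | 6 | 2 | 1 | 1 | 0 | 1 |
| Human Chromosome X, NT_033330.7 | 754004 | 70 | 4 | 7177 | 2722 | 701 | 164 | 54 | 17 | 7 | 0 | 0 | 0 | 0 | 0 | 1 |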

55 We calculate the distribution of the length of the longest suffix of the window which is equal to a prefix of the pattern in the above experiments. We find that almost all A_i are smaller than 5. Therefore, we conclude that the probability of finding a large A_i is very small.
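A measurement of this kind can be sketched in Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, Python's `random.Random` with a fixed seed stands in for the random generator used in the experiments, the overlap is computed by brute force, and a full match is counted in the bucket for length m while the window is then shifted by the longest proper overlap. It is not expected to reproduce the numbers in the tables.

```python
import random
from collections import Counter


def longest_proper_overlap(window, pattern):
    """Longest suffix of `window`, shorter than the pattern, equal to a prefix of `pattern`."""
    for k in range(min(len(window), len(pattern)) - 1, 0, -1):
        if window.endswith(pattern[:k]):
            return k
    return 0


def overlap_distribution(text, pattern):
    """Count, over all windows visited, the length A_i of the longest
    suffix of the window that is equal to a prefix of the pattern."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must not be empty")
    counts = Counter()
    pos = 0
    while pos + m <= len(text):
        window = text[pos:pos + m]
        overlap = longest_proper_overlap(window, pattern)
        counts[m if window == pattern else overlap] += 1
        pos += m - overlap          # shift L_i = m - A_i (proper overlap after a match)
    return counts


def random_string(length, alphabet, rng):
    return "".join(rng.choice(alphabet) for _ in range(length))


rng = random.Random(0)              # fixed seed so that the run is repeatable
alphabet = "ACGT"                   # r = 4
text = random_string(10000, alphabet, rng)
pattern = random_string(30, alphabet, rng)

counts = overlap_distribution(text, pattern)
windows = sum(counts.values())
average = sum(a * c for a, c in counts.items()) / windows
r = len(alphabet)
print(sorted(counts.items()))
print(windows, average, r / (r - 1) ** 2)
```

The last line prints the number of windows, the measured average of A_i per window, and the value r/(r-1)^2 for comparison.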

