1 Very fast and simple approximate string matching Information Processing Letters, 72:65-70, 1999. G. Navarro and R. Baeza-Yates Advisor: Prof. R. C. T.


1 Very fast and simple approximate string matching. Information Processing Letters, 72:65-70, 1999. G. Navarro and R. Baeza-Yates. Advisor: Prof. R. C. T. Lee. Speaker: H. M. Chen.

2 Our approximate string matching problem is defined as follows: given a pattern string P of length m, a text string T of length n, and a maximum number k of allowed errors, find all text positions where P matches with edit distance at most k.
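
As a point of reference for this problem statement, here is a minimal sketch of the classical column-by-column dynamic programming search (the approach usually attributed to Sellers). It is not the filtering algorithm of the paper; the function name `approx_search` and the choice to report 1-based end positions are assumptions made for illustration.

```python
def approx_search(pattern: str, text: str, k: int) -> list[int]:
    """Return 1-based end positions j such that some substring of text
    ending at j has edit distance at most k to pattern."""
    m = len(pattern)
    # col[i] = smallest edit distance between pattern[:i] and any
    # substring of the text that ends at the current text position.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        diag = col[0]  # old col[i-1]; col[0] stays 0 (a match may start anywhere)
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(diag + cost,     # match or substitution
                         old + 1,         # extra text character
                         col[i - 1] + 1)  # missing pattern character
            diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

With k = 0 this reduces to exact matching, for example `approx_search("ABC", "CAABCAAABDAACB", 0)` reports the single end position 5.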

3 This paper is based upon the following lemma, presented by the same authors in [A Hybrid indexing method for approximate string matching]. Lemma: Let T and P be two strings. Let P be divided into j pieces p_1, p_2, …, p_j. If ed(T, P) ≤ k, then there exist at least one p_i and a substring S of T such that ed(S, p_i) ≤ ⌊k/j⌋.

4 If we let j = k+1, then ⌊k/j⌋ = ⌊k/(k+1)⌋ = 0. In this case, if ed(T, P) ≤ k, then at least one p_i occurs in T exactly. If, in a certain window, we find an exact match of some p_i, we use the dynamic programming approach to determine whether there is an approximate match of P allowing k errors in this window.

5 If, in a window, we cannot find an exact match of any p_i, we ignore the window. That is, we do not have to check whether there is an approximate match inside the window.
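
The filtering idea can be sketched as follows. This is a minimal illustration, assuming the pattern is cut into k+1 consecutive pieces of (nearly) equal length; the helper names `split_pattern` and `may_contain_match` are hypothetical, and Python's built-in substring test stands in for the piece search.

```python
def split_pattern(pattern: str, pieces: int) -> list[str]:
    """Cut pattern into `pieces` consecutive parts of nearly equal length."""
    base, extra = divmod(len(pattern), pieces)
    parts, start = [], 0
    for i in range(pieces):
        length = base + (1 if i < extra else 0)  # first `extra` parts get one more char
        parts.append(pattern[start:start + length])
        start += length
    return parts


def may_contain_match(window: str, pattern: str, k: int) -> bool:
    """Filter step: if no piece occurs exactly, the window cannot contain
    a match with at most k errors and can be skipped."""
    return any(p in window for p in split_pattern(pattern, k + 1))
```

A window that passes this filter still has to be verified with dynamic programming; a window that fails it can be discarded without verification.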

6 Question: How large can the window be? Answer: The largest window size which is allowed to produce an approximate matching with edit distance smaller than or equal to k is m+2k where m is the length of the pattern. This can be explained in the following slide.

7 Consider the following case. Suppose P exactly matches a substring S in T. We may extend S by k characters to the right and k characters to the left. This forms a window of size m+2k. Any substring obtained by extending S to the right and to the left is an approximate match of P with edit distance less than or equal to k. [Figure: text T containing a substring S of length m that matches P, extended by k characters on each side.]

8 Let us consider the case where we limit the number of errors to at most k. Then we split the pattern P into k+1 pieces. Since each piece is rather small, there is a high probability that it appears exactly in T. Thus, when the pieces are small, as in this case, we cannot eliminate many substrings.

9 Our idea is as follows: after determining the occurrences of exact matches of the small pieces, we start to determine the occurrences of larger pieces of P in T. [Figure, k = 3: P = AAABBBCCCDDD is split into AAABBB and CCCDDD, which are split further into AAA, BBB, CCC and DDD.]

10 bc table. The only thing we need to do is construct a table for each piece of P, as follows. Let x be a character in the alphabet. If x occurs in the piece, we record the position of its last occurrence, counted from the right end of the piece. If x does not occur in the piece, we record m+1, where m here is the length of the piece.

11 Suppose we have P = ATCCTC with k = 2. We divide P into three pieces: p_1 = AT, p_2 = CC and p_3 = TC. To search for exact matches, we actually perform an exhaustive search. Let us assume that we search for AT. Note that there are three cases. Case 1: X = A. We move AT 2 steps. Case 2: X = T. We move AT 1 step. Case 3: X ≠ A and X ≠ T. We move AT 3 steps. [Figure: X is the character of T immediately to the right of the window aligned with AT.]

12 Let us assume that we search for CC. Case 1: X = C. We move CC 1 step. Case 2: X ≠ C. We move CC 3 steps. [Figure: X is the character of T immediately to the right of the window aligned with CC.]

13 Let us assume that we search for TC. Case 1: X = T. We move TC 2 steps. Case 2: X = C. We move TC 1 step. Case 3: X ≠ T and X ≠ C. We move TC 3 steps. [Figure: X is the character of T immediately to the right of the window aligned with TC.]

14 Based upon the three discussions above, we take the minimum value for each character and obtain the following shift table (* denotes any other character).
bc tables:
p_1 = AT: A = 2, T = 1, * = 3
p_2 = CC: C = 1, * = 3
p_3 = TC: T = 2, C = 1, * = 3
shift table:
A = 2, T = 1, C = 1, * = 3
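
The construction of the bc tables and the combined shift table can be sketched as below. This is an illustrative sketch that assumes all pieces have the same length; the function names `bc_table` and `shift_table` are hypothetical, and the default shift for characters absent from every piece is returned separately instead of being stored under `*`.

```python
def bc_table(piece: str) -> dict[str, int]:
    """Distance of the last occurrence of each character from the right
    end of the piece (rightmost character has distance 1)."""
    m = len(piece)
    # Later occurrences overwrite earlier ones, so the last one wins.
    return {c: m - i for i, c in enumerate(piece)}


def shift_table(pieces: list[str]) -> tuple[dict[str, int], int]:
    """Per-character minimum over all bc tables, plus the default shift
    (piece length + 1) for characters that occur in no piece."""
    m = len(pieces[0])  # assumption: every piece has length m
    table: dict[str, int] = {}
    for piece in pieces:
        for c, s in bc_table(piece).items():
            table[c] = min(table.get(c, s), s)
    return table, m + 1
```

For the pieces AT, CC and TC this yields A = 2, T = 1, C = 1 and a default of 3, which agrees with the table above.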

15 T = TCCAAGTTATAGCTC, p_1 = AT, p_2 = CC, p_3 = TC. Shift table: A = 2, T = 1, C = 1, * = 3. First step: we open a window of length 2 and compare it with AT, CC and TC. We find that it exactly matches p_3. We then shift the window by the shift-table value of the character at the next position after the window. Second step: we find that the window CC exactly matches p_2. Then we shift the window 2 positions. Third step: we cannot find AA among p_1, p_2 and p_3. Then we shift the window 3 positions and continue to compare.

16 T = TCCAAGTTATAGCTC, P = ATCCTC. Using this shift table (A = 2, T = 1, C = 1, * = 3), we find AT occurring at position 9 in T, CC occurring at position 2, and TC occurring at positions 1 and 14. Table d contains all text positions of P's pieces.
Table d:
AT: 9
CC: 2
TC: 1, 14
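
The whole piece search can be sketched as one loop: compare the current window with every piece, then shift by the table value of the character just after the window. This is a minimal sketch assuming equal-length pieces; the name `find_pieces` and the dictionary used for Table d are illustrative choices, not taken from the paper.

```python
def find_pieces(text: str, pieces: list[str]) -> dict[str, list[int]]:
    """Return, for each piece, the 1-based text positions where it occurs."""
    m = len(pieces[0])
    assert all(len(p) == m for p in pieces), "sketch assumes equal-length pieces"

    # Combined shift table: minimum distance from the right end over all pieces.
    shift: dict[str, int] = {}
    for p in pieces:
        for i, c in enumerate(p):
            shift[c] = min(shift.get(c, m + 1), m - i)

    table_d: dict[str, list[int]] = {p: [] for p in pieces}
    pos, n = 0, len(text)
    while pos + m <= n:
        window = text[pos:pos + m]
        if window in table_d:            # exhaustive comparison with all pieces
            table_d[window].append(pos + 1)
        if pos + m == n:                 # no character after the window
            break
        pos += shift.get(text[pos + m], m + 1)
    return table_d
```

On T = TCCAAGTTATAGCTC with the pieces AT, CC and TC, the windows examined start at positions 1, 2, 4, 7, 9, 11, 12, 13 and 14, and the result reproduces Table d.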

17 T = CAABCAAABDAACB, P = ABCACABCDDCA, k = 3. [Figure: P = ABCACABCDDCA is split into ABCACA and BCDDCA, which are split further into ABC, ACA, BCD and DCA.]
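
The levels of this example can be generated with a short sketch. It assumes the number of pieces doubles at each level, that the pattern length is divisible by the number of pieces, and that a level with j pieces is searched with ⌊k/j⌋ errors as the lemma allows; the function name `piece_levels` is hypothetical.

```python
def piece_levels(pattern: str, k: int) -> list[tuple[int, list[str]]]:
    """Return (allowed_errors, pieces) for j = 1, 2, 4, ... pieces,
    stopping at the first level whose error budget is 0."""
    levels = []
    j = 1
    while True:
        size = len(pattern) // j  # assumption: len(pattern) divisible by j
        pieces = [pattern[i * size:(i + 1) * size] for i in range(j)]
        budget = k // j           # floor(k / j) errors per piece
        levels.append((budget, pieces))
        if budget == 0:
            break
        j *= 2
    return levels
```

For P = ABCACABCDDCA and k = 3 the levels are the whole pattern with 3 errors, the two halves with 1 error each, and the four pieces ABC, ACA, BCD, DCA searched exactly.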

18 T = CAABCAAABDAACB, P = ABCACABCDDCA, k = 3.
Table d:
ABC: 3
ACA: NULL
BCD: NULL
DCA: NULL
shift table:
A = 1, B = 2, C = 1, D = 1, * = 4

19 T = CAABCAAABDAACB, P = ABCACABCDDCA. 1. Found ABC in T. Search for ABCACA in the window around this occurrence with k = 1. The length m of this piece is six, so the window length is m + 2k = 8. Found!
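
This verification step can be sketched with the dynamic programming approach restricted to the window. The sketch below assumes the window starts k characters before the position where the larger piece would begin and has length at most m + 2k; the names `best_match_distance` and `verify`, and the `piece_offset` parameter (offset of the found piece inside the larger piece), are hypothetical.

```python
def best_match_distance(pattern: str, window: str) -> int:
    """Smallest edit distance between pattern and any substring of window."""
    m = len(pattern)
    col = list(range(m + 1))  # column for the empty text prefix
    best = col[m]
    for c in window:
        diag = col[0]         # old col[i-1]; col[0] stays 0
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == c else 1
            col[i] = min(diag + cost, old + 1, col[i - 1] + 1)
            diag = old
        best = min(best, col[m])
    return best


def verify(text: str, pattern: str, piece_pos: int, piece_offset: int, k: int) -> bool:
    """Check the window of length m + 2k around an exact piece occurrence.
    piece_pos is the 1-based text position of the piece."""
    m = len(pattern)
    begin = piece_pos - 1 - piece_offset      # 0-based start of an aligned match
    start = max(0, begin - k)
    end = min(len(text), begin + m + k)
    return best_match_distance(pattern, text[start:end]) <= k
```

For the slide's example, `verify("CAABCAAABDAACB", "ABCACA", 3, 0, 1)` inspects the window AABCAAAB, where ABCAAA differs from ABCACA by one substitution.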

20 T = CAABCAAABDAACB, P = ABCACABCDDCA. 2. Search for ABCACABCDDCA with k = 3. We cannot find ABCACABCDDCA in T with k = 3, so we stop comparing.

21 Time complexity: the search cost is O(kn/m) = O(αn), where the error level is α = k/m.

24 Thank You