1 Rules for Approximate String Matching R.C.T. Lee.


1 Rules for Approximate String Matching R.C.T. Lee

2 Rule 1 Consider two strings A_1 = P_1 S_1 and A_2 = P_2 S_2, where P_1 and P_2 are prefixes and S_1 and S_2 are suffixes. If ed(A_1, A_2) ≤ k and S_1 = S_2, then ed(P_1, P_2) ≤ k.
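
Rule 1 can be checked numerically with a small sketch. The function names `edit_distance` and `strip_common_suffix` are illustrative and do not come from the slides; the distance is assumed to be the usual Levenshtein distance with unit costs.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def strip_common_suffix(a1: str, a2: str):
    """Return (P1, P2): the two strings with their longest common suffix removed."""
    n = 0
    while n < len(a1) and n < len(a2) and a1[-1 - n] == a2[-1 - n]:
        n += 1
    return a1[:len(a1) - n], a2[:len(a2) - n]


a1, a2 = "abcdxyz", "abedxyz"
p1, p2 = strip_common_suffix(a1, a2)
# Rule 1: the prefixes are no farther apart than the full strings.
print(p1, p2, edit_distance(p1, p2) <= edit_distance(a1, a2))
```

In practice this lets an algorithm drop an already matched suffix and continue comparing only the remaining prefixes against the same error budget k.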

3 Rule 1: [AKLLLR2000], [H2005], [HHLS2006], [JB2000], [LV89], [NB99], [NB2000], [S80], [TU93], and [WM92].

4 Rule 2 If ed(A, B) ≤ k, then the length of A must be between m-k and m+k, where m is the length of B.
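
Rule 2 is often used as a cheap pre-filter before any distance computation. A minimal sketch follows; the name `passes_length_filter` is hypothetical, and m is taken to be the length of B.

```python
def passes_length_filter(candidate: str, m: int, k: int) -> bool:
    """Rule 2: a candidate can be within k edits of a length-m string
    only if its own length lies in [m - k, m + k]."""
    return m - k <= len(candidate) <= m + k


pattern = "abcde"            # m = 5
for cand in ["abcd", "abcdefg", "ab"]:
    # Candidates that fail the filter can be discarded without computing ed.
    print(cand, passes_length_filter(cand, len(pattern), k=1))
```

The filter is necessary but not sufficient: a candidate that passes still has to be verified with a real edit-distance computation.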

5 Rule 2: [FN2004], [NB99], [NB2000] and [TU93].

6 Rule 3 If a string S contains S_1 completely and the distance between S_1 and any substring of P is larger than k, then ed(S, P) > k.

7 Rule 3: [ALP2004].

8 Rule 4 For any substring S_1 in T, if there exists a substring S_2 in P to the left of S_1 such that ed(S_1, S_2) ≤ k, and S_2 is the rightmost such substring, then move P to align S_1 and S_2. [Figure: T with S_1; P with S_2, before and after the shift.]

9 Rule 4: [ALP2004].

10 Based upon Rule 3 and Rule 2, we have Rule 5: If the window size is m-k and there exists a substring S_1 in the window such that the distance between S_1 and any substring of P is larger than k, then we can safely move P as follows: [Figure: T with S_1 inside a window of size m-k above P, before and after the shift.]

11 If Rule 5 is not satisfied, it means the following: for every substring S_1 in T, there exists a substring S_2 in P such that ed(S_1, S_2) ≤ k.

12 Rule 5-1 If Rule 5 is not satisfied, we can only move 1 step as follows: [Figure: T with S_1 inside a window of size m-k above P, before and after a shift of one position.]

13 Rule 5: [HN2005].

14 Rule 6 Hamming Distance(A, B) ≥ Edit Distance(A, B).
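
A short sketch comparing the two distances in Rule 6. It assumes A and B have equal length (Hamming distance is otherwise undefined) and unit-cost Levenshtein distance; the function names are illustrative.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


a, b = "abcde", "bcdea"
# Every position mismatches, but one deletion plus one insertion suffices.
print(hamming_distance(a, b), edit_distance(a, b))
```

The inequality holds because substituting every mismatched position is itself a valid edit script, so the edit distance can never exceed the mismatch count.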

15 Rule 6: [AKLLLR2000], [FN2004] and [TU93].

16 Rule 7 For strings A and B, if there are k+1 characters in A which do not appear in B, then ed(A, B) > k. Rule 7-1 Let A and B be two strings. Let there be k+1 characters a_1, a_2, …, a_{k+1} in A, where a_i is aligned with b_i in B. If every a_i does not appear in B[i-k, i+k], then ed(A, B) > k.
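
Both variants of Rule 7 can be written as rejection tests. This is a sketch under stated assumptions: positions are 0-based, the window B[i-k, i+k] is clipped at the ends of B, and the names `rule7_rejects` and `rule7_1_rejects` are hypothetical.

```python
def rule7_rejects(a: str, b: str, k: int) -> bool:
    """Rule 7: reject if at least k+1 character positions of A hold a
    character that occurs nowhere in B."""
    chars_of_b = set(b)
    return sum(ch not in chars_of_b for ch in a) >= k + 1


def rule7_1_rejects(a: str, b: str, k: int) -> bool:
    """Rule 7-1: reject if at least k+1 positions i of A hold a character
    that does not occur in the window B[i-k .. i+k]."""
    bad = 0
    for i, ch in enumerate(a):
        window = b[max(0, i - k): i + k + 1]
        if ch not in window:
            bad += 1
    return bad >= k + 1


print(rule7_rejects("abxyz", "abcde", k=2))    # three characters absent from B
print(rule7_rejects("cdeab", "abcde", k=1))    # same characters, so Rule 7 is silent
print(rule7_1_rejects("cdeab", "abcde", k=1))  # but they sit too far from their positions
```

A True result means ed(A, B) > k is guaranteed; a False result is inconclusive and the pair still needs verification.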

17 Rule 7: [TU93].

18 Rule 8 Let there be two strings A and B. Let B be divided into j pieces B_1, B_2, …, B_j. If ed(A, B) ≤ k, there is at least one substring A_i in A such that ed(A_i, B_i) ≤ ⌊k/j⌋.

19 Rule 8-1 Let A and B be two strings. Let B be divided into j pieces B_1, B_2, …, B_j. If for every B_i and every substring S of A, ed(S, B_i) > ⌊k/j⌋, then ed(A, B) > k.
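
Rule 8-1 can be sketched as a partition filter. The threshold ⌊k/j⌋ is assumed from the partitioning lemma in [NB99]; the helper names are hypothetical, and the substring search uses a Sellers-style dynamic program rather than any particular method from the cited papers.

```python
def split_into_pieces(b: str, j: int):
    """Split b into j consecutive pieces of nearly equal length."""
    base, extra = divmod(len(b), j)
    pieces, start = [], 0
    for idx in range(j):
        end = start + base + (idx < extra)
        pieces.append(b[start:end])
        start = end
    return pieces


def min_substring_distance(piece: str, text: str) -> int:
    """Smallest ed(piece, S) over all substrings S of text."""
    col = list(range(len(piece) + 1))   # distances against the empty substring
    best = col[-1]
    for ct in text:
        new = [0]                       # an occurrence may start at any text position
        for i, cp in enumerate(piece, start=1):
            new.append(min(col[i] + 1, new[i - 1] + 1, col[i - 1] + (cp != ct)))
        col = new
        best = min(best, col[-1])
    return best


def rule8_1_rejects(a: str, b: str, k: int, j: int) -> bool:
    """True if every piece of B is farther than floor(k/j) from every substring of A."""
    return all(min_substring_distance(p, a) > k // j
               for p in split_into_pieces(b, j))


print(rule8_1_rejects("zzzzzz", "abcdef", k=1, j=2))   # no piece occurs: reject
print(rule8_1_rejects("xxabcxx", "abcdef", k=1, j=2))  # "abc" occurs exactly: keep
```

With j = k+1 the threshold becomes 0, so the filter reduces to exact search for the pieces, which is the cheapest case to implement.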

20 Rule 8-2 Let A and B be two strings. Let the lengths of A and B be m+k and m respectively. Let B be divided into j pieces B_1, B_2, …, B_j. Let AP be a prefix of A. If for every B_i and every substring S of A, ed(S, B_i) > ⌊k/j⌋, then ed(AP, B) > k.

21 Rule 8: [NB99] and [NB2000].

22 Rule 9 Let A and B be two strings with lengths m+k and m respectively. Let A' be the prefix of A with length m-k. Let there be j characters a_1, a_2, …, a_j in A'. Let the number of times that a_i appears in A' and B be N(A', a_i) and N(B, a_i) respectively. Let C_i = N(A', a_i) - N(B, a_i). Let AP be any prefix of A. If […], then ed(AP, B) > k.

23 Rule 9-1 Let A and B be two strings with lengths m+k and m respectively. Let there be j characters a_1, a_2, …, a_j in A. Let the number of times that a_i appears in A and B be N(A, a_i) and N(B, a_i) respectively. Let C_i = N(B, a_i) - N(A, a_i). Let AP be any prefix of A. If […], then ed(AP, B) > k.

24 Rule 10 Let P and T be two strings with lengths m and n respectively. If P matches a substring P' of T at position i, then any substring S of T[i-k, i+m+k] may satisfy ed(S, P) ≤ k. [Figure: T with P' at position i above P; the region from i-k to i+m+k has length m+2k.]
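
Rule 10 bounds the region that has to be verified around a candidate position. A minimal sketch follows, assuming 0-based positions and a half-open slice so that the unclipped region has length m+2k; the name `verification_window` is hypothetical.

```python
def verification_window(n: int, m: int, i: int, k: int):
    """Return (start, end) such that T[start:end] covers T[i-k .. i+m+k),
    clipped to the bounds of a text of length n."""
    start = max(0, i - k)
    end = min(n, i + m + k)
    return start, end


text = "x" * 100
start, end = verification_window(len(text), m=5, i=10, k=2)
# Only this slice has to be checked with a full edit-distance computation.
print(start, end, len(text[start:end]))
```

Restricting verification to this window keeps the expensive dynamic program proportional to m+2k per candidate instead of the full text length.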

25 Rule 10: [NB99].

26 Rule 11 Let P and Q be two strings. Let P be divided into n pieces P_1, P_2, …, P_n. For each i, let Q_i be the substring in Q for which ed(P_i, Q_i) is the smallest. If […]

27 Application of Rule 11 [Figure: text T with window W and substrings t_1, t_2, …, t_n; pattern pieces P_1, P_2, …, P_n.] For each i, ed(t_i, P_i) is the smallest. If for some n, […]

28 [AKLLLR2000] Text Indexing and Dictionary Matching with One Error, Amir, A., Keselman, D., Landau, G. M., Lewenstein, M., Lewenstein, N. and Rodeh, M., Journal of Algorithms, Vol. 37, 2000, pp.
[ALP2004] Faster Algorithms for String Matching with k Mismatches, Amir, A., Lewenstein, M. and Porat, E., Journal of Algorithms, Vol. 50, 2004, pp.
[FN2004] Average-Optimal Multiple Approximate String Matching, Kimmo Fredriksson and Gonzalo Navarro, ACM Journal of Experimental Algorithmics, Vol. 9, Article No. 1.4, 2004, pp.

29 [GG86] Improved String Matching with k Mismatches, Galil, Z. and Giancarlo, R., SIGACT News, Vol. 17, No. 4, 1986, pp.
[H2005] Bit-parallel Approximate String Matching Algorithms with Transposition, Heikki Hyyrö, Journal of Discrete Algorithms, Vol. 3, 2005, pp.
[HHLS2006] Approximate String Matching Using Compressed Suffix Arrays, Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp.

30 [HN2005] Bit-parallel Witnesses and their Applications to Approximate String Matching, Heikki Hyyrö and Gonzalo Navarro, Algorithmica, Vol 4, No. 3, 2005, pp.
[JB2000] Approximate String Matching Using Factor Automata, Jan Holub and Borivoj Melichar, Theoretical Computer Science, Vol. 249, 2000, pp.
[LV86] String Matching with k Mismatches by Using Kangaroo Method, Landau, G. M. and Vishkin, U., Theoretical Computer Science, Vol. 43, 1986, pp.

31 [LV89] Fast Parallel and Serial Approximate String Matching, G. Landau and U. Vishkin, Journal of Algorithms, Vol. 10, 1989, pp.
[NB99] Very Fast and Simple Approximate String Matching, G. Navarro and R. Baeza-Yates, Information Processing Letters, Vol. 72, 1999, pp.
[NB2000] A Hybrid Indexing Method for Approximate String Matching, Gonzalo Navarro and Ricardo Baeza-Yates, Vol. 1, No. 1, 2000, pp.

32 [S80] String Matching with Errors, Sellers, P. H., Journal of Algorithms, Vol. 20, No. 1, 1980, pp.
[TU93] Approximate Boyer-Moore String Matching, J. Tarhio and E. Ukkonen, SIAM Journal on Computing, Vol. 22, No. 2, 1993, pp.
[WM92] Fast Text Searching: Allowing Errors, Sun Wu and Udi Manber, Communications of the ACM, Vol. 35, 1992, pp.