UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 1: Exact String Matching.


UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 1: Exact String Matching Instructor: Dr. Rose January 14, 2003

Exact String Matching
Types of Abstract Problems:
– Pattern matching, i.e., find pattern P in string S.
– Similarity comparison, i.e., what is the longest common substring in S1 and S2?
– Can we find P´ ~ P in S? We can think of P´ as a mutation of P.
– What are the regions of similarity in S1 and S2? We can also do this with mutations.

Exact String Matching
Q: What is the underlying theme common to these abstract problems?
A: Correlation, i.e., correlation between two signals, strings, etc.

Exact String Matching
Q: What is the simplest way to compare two strings?
A: Look for a mapping of one string into the other.

Exact String Matching
Given two strings S1 and S2, where length(S1) <= length(S2):
– Start at the beginning of S1 and S2.
– Compare corresponding characters, i.e., S1[1] & S2[1], S1[2] & S2[2], etc.
Continue until either:
1) all the characters in S1 have been matched, or
2) a mismatch is found.

Exact String Matching
If there is a mismatch, shift S1 one character position along S2 and start over, e.g., compare S1[1] & S2[2], S1[2] & S2[3], etc. Keep doing this until a match is found or the possible starting positions in S2 are exhausted.
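The shift-and-compare procedure described so far can be sketched in Python as follows. This is only an illustration: the function name `naive_find_all` is made up for this example, and, unlike the slides (which stop at the first match), it keeps going and reports every match position, using the 1-based positions of the slides.

```python
def naive_find_all(s1: str, s2: str) -> list[int]:
    """Return every 1-based start position in s2 at which s1 occurs."""
    m, n = len(s1), len(s2)
    positions = []
    for start in range(n - m + 1):      # each possible alignment of s1 against s2
        j = 0
        # Compare corresponding characters until s1 is exhausted or a mismatch occurs.
        while j < m and s1[j] == s2[start + j]:
            j += 1
        if j == m:                      # all characters of s1 matched
            positions.append(start + 1)
    return positions


print(naive_find_all("adab", "abaracadabara"))  # [7]
```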

Exact String Matching
Example: S1 = adab, S2 = abaracadabara
S2: a b a r a c a d a b a r a
1:  a d                          d != b
2:    a                          a != b
3:      a d                      d != r
4:        a                      a != r
5:          a d                  d != c
6:            a                  a != c
7:              a d a b          Finally!!!!
Q: How many comparisons? A: 13 (2 + 1 + 2 + 1 + 2 + 1 + 4). Looks OK?

Exact String Matching
Example 2: S1 = aaaaab, length(S1) = 6; S2 = aaaaaaaaaaab, length(S2) = 12
S2: a a a a a a a a a a a b
1:  a a a a a b                  b != a
2:    a a a a a b                b != a
3:      a a a a a b              b != a
4:        a a a a a b            b != a
5:          a a a a a b          b != a
6:            a a a a a b        b != a

Exact String Matching
Example 2, continued from the previous slide
S2: a a a a a a a a a a a b
7:              a a a a a b      Finally!!!
Q: How many comparisons were made?
A: 42 = 7 × 6 = (12 – 6 + 1) × 6 = (N – M + 1) × M, where length(S2) = N and length(S1) = M.
Q: Where did this come from?
A: There are N – M + 1 possible match positions in S2, and in this example each one costs M comparisons.
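The comparison counts quoted for the two examples can be checked with a small instrumented version of the naive method. This is a sketch for checking the arithmetic only; the helper name `count_naive_comparisons` is an assumption of this example, and it stops at the first match, as the slides do.

```python
def count_naive_comparisons(s1: str, s2: str):
    """Run the naive method up to the first match.

    Returns (1-based match position or None, number of character comparisons).
    """
    m, n = len(s1), len(s2)
    comparisons = 0
    for start in range(n - m + 1):
        for j in range(m):
            comparisons += 1
            if s1[j] != s2[start + j]:
                break                   # mismatch: shift by one and start over
        else:
            return start + 1, comparisons   # no break: all of s1 matched
    return None, comparisons


print(count_naive_comparisons("adab", "abaracadabara"))   # (7, 13)
print(count_naive_comparisons("aaaaab", "aaaaaaaaaaab"))  # (7, 42)
```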

Exact String Matching
Bottom line: the worst-case time complexity is O(NM).
Observation: Notice that many of the characters in S2 are involved in multiple comparisons. WHY???
A: Because the naïve approach doesn't learn from previous matches. In Example 2, by the time the first mismatch occurs, we know what the first 6 characters of S1 and S2 are.

Exact String Matching
Note: A smarter approach would not involve the first 6 characters of S2 in subsequent comparisons. Fast matching algorithms take advantage of this insight.
Q: Where does this insight come from?
A: Preprocessing either S1 or S2.

Exact String Matching
Insight: if a match fails,
1) don't make redundant comparisons, and
2) skip to the next possible match position.
Note: the next possible match position may be more than one character shift away. Let's consider both of these ideas with respect to examples 1 and 2.

Exact String Matching
Let's review example 2:
S2: a a a a a a a a a a a b
1:  a a a a a b                  b != a; we have now seen the first 6 characters of S2
2:    a a a a a b                b != a; we already know the a's match, so we only need to try to match the 'b'
3:      a a a a a b              b != a; ditto
4:        a a a a a b            b != a; ditto
5:          a a a a a b          b != a; ditto
6:            a a a a a b        b != a; ditto
7:              a a a a a b      Finally!!!
The number of comparisons is 12 (6 for the first alignment, then 1 for each of the other 6) instead of the previous 42.

Exact String Matching
Let's review example 1: S1 = adab, S2 = abaracadabara
S2: a b a r a c a d a b a r a
1:  a d                          d != b; we have seen the first 2 characters of S2
    The next possible match must be at least two positions away.
2:      a d                      d != r; we have seen the first 4 characters of S2
    The next possible match must be at least two positions away.
3:          a d                  d != c; we have seen the first 6 characters of S2
    The next possible match must be at least two positions away.
4:              a d a b          Finally!!!!
Q: How many comparisons? A: 10 (2 + 2 + 2 + 4). The previous approach took 13 comparisons.

Preprocessing a String
Core Idea: For each position i > 1 in string S, identify the maximal substring starting at i that matches a prefix of S.
Q: Why do we want to do this?
A: We will use this information in two ways:
1) It tells us how far to skip for the next possible match. (Recall example 1.)
2) Knowledge of prefix matches allows us to avoid redundant comparisons. (Recall example 2.)
Do we need to go back and review examples 1 and 2?

Preprocessing a String
Let M(S_i) denote the maximal substring starting at position i > 1 that matches a prefix of S.
Example: S = aabcaabxaaz (from textbook)
M(S_2) = a
M(S_3) = Ø
M(S_4) = Ø
M(S_5) = aab

Preprocessing a String
Let Z(S_i) denote the length of the maximal substring M(S_i) starting from position i > 1 that matches a prefix of S. (Later slides abbreviate this as Z_i.)
Example: S = aabcaabxaaz (from textbook)
Z(S_2) = 1, since M(S_2) = a
Z(S_3) = 0, since M(S_3) = Ø
Z(S_4) = 0, since M(S_4) = Ø
Z(S_5) = 3, since M(S_5) = aab
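The definition of the Z values can be turned directly into code. The sketch below is a deliberately simple quadratic-time version that follows the definition literally; it is not a linear-time method. The function name `z_values_naive` and the choice of a dictionary keyed by 1-based position are assumptions made for this illustration.

```python
def z_values_naive(s: str) -> dict[int, int]:
    """Return {i: Z_i} for every 1-based position i > 1 of s, straight from the definition."""
    n = len(s)
    z = {}
    for i in range(2, n + 1):
        length = 0
        # Extend while the substring starting at position i keeps matching the prefix of s.
        while i + length <= n and s[length] == s[i - 1 + length]:
            length += 1
        z[i] = length
    return z


z = z_values_naive("aabcaabxaaz")
print(z[2], z[3], z[4], z[5])  # 1 0 0 3
```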

Preprocessing a String
Consider the figure above, depicting string S and two maximal substrings α and β, starting at positions j and k, respectively, that match prefixes of S. Z_j is the length of α, and Z_k is the length of β. Gusfield refers to these boxes as Z-boxes.

Preprocessing a String
Let's look at a concrete instance of this abstraction.

Preprocessing a String
For all i > 1, r_i denotes the right-most endpoint of the Z-boxes that begin at or before position i. Note that while i is in both Z-boxes shown, r_i is the endpoint of whichever box extends farther to the right.

Preprocessing a String
Let's compare the abstract depiction with our concrete example.

Preprocessing a String
l_i is the left end of the Z-box ending at r_i.
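To make r_i and l_i concrete, the following sketch derives them from a table of Z values. It is an illustration only: the helper name `z_box_bounds` is hypothetical, the Z values for the textbook string aabcaabxaaz are hard-coded (worked out by hand from the definition), the pair (0, 0) stands for "no Z-box found yet", and when two boxes end at the same position the earlier one is kept, which is one of several acceptable choices.

```python
def z_box_bounds(z: dict[int, int]) -> dict[int, tuple[int, int]]:
    """Map each 1-based position i > 1 to (l_i, r_i), given {i: Z_i}."""
    l, r = 0, 0                         # no Z-box seen so far
    bounds = {}
    for i in sorted(z):
        end = i + z[i] - 1              # right end of the Z-box starting at i
        if z[i] > 0 and end > r:        # this box reaches farther right than any before
            l, r = i, end
        bounds[i] = (l, r)
    return bounds


# Z values of S = aabcaabxaaz for positions 2..11.
z = {2: 1, 3: 0, 4: 0, 5: 3, 6: 1, 7: 0, 8: 0, 9: 2, 10: 1, 11: 0}
bounds = z_box_bounds(z)
print(bounds[6])   # (5, 7): position 6 lies in the Z-box aab spanning positions 5..7
```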

Preprocessing a String
Again, compare the abstract with the concrete.

Preprocessing a String
We will now consider how to find the Z-boxes in linear time, O(|S|). We can use this to find exact matches in linear time.

Preprocessing a String
We start by computing Z_2, explicitly comparing characters S[1] & S[2], S[2] & S[3], etc., until a mismatch occurs. If Z_2 > 0, then let r = r_2 and l = l_2 = 2; otherwise let r = 0 and l = 0.

Preprocessing a String
Iterate to compute all subsequent Z_k. When Z_k is computed, all previous Z_i, 1 < i <= k-1, are already known.

The Z Algorithm
1. If k > r, then k is not in any Z-box that has yet been found.
– We must find Z_k by comparing characters starting at position k with characters starting at position 1 in S.

The Z Algorithm
2. If k <= r, then k is contained in a previously found Z-box, say α.
– Then the substring β from k to r matches a substring of α's matching prefix, from k´ to Z_l, where k´ = k – l + 1.

The Z Algorithm
Here is a concrete example where k <= r. We see that k is contained in a previously found Z-box α. The substring β from k to r matches the substring from k´ to Z_l.

The Z Algorithm
We need to check if the value of Z_k´ is nonzero. Why? If Z_k´ is nonzero, then the substring starting at k´ matches a prefix of S. This means that the substring starting at k must also match a prefix of S.

The Z Algorithm
Here is a concrete example where the value of Z_k´ is zero. The substring starting at k´ does not match a prefix of S, nor does the substring starting at k.

The Z Algorithm
If Z_k´ is nonzero, how long is the prefix match starting at k? It is at least as long as the smaller of Z_k´ and |β|. Of course, it may be longer.

The Z Algorithm
The prefix match starting at k is at least as long as the smaller of Z_k´ and |β|.
Case 2a: If Z_k´ < |β|, then its length is exactly Z_k´, as depicted in the figure below. In this case, r and l remain unchanged.

The Z Algorithm
Here is a concrete example where Z_k´ < |β|. In this case, 3 < 6. The length of the prefix match starting at k is exactly Z_k´, i.e., 3. In this case, r and l remain unchanged.

The Z Algorithm
Case 2b: If Z_k´ >= |β|, then β is a prefix of S, as depicted in the figure below. It could be the case that Z_k > |β|. This can only be determined by extending the match past r.

The Z Algorithm
Here is a concrete example where Z_k´ = |β|, i.e., 3 = 3. We can see that β is a prefix of S.

The Z Algorithm
Here is a concrete example where Z_k´ > |β|. We can see that β is a prefix of S, and so is this longer substring starting at k. Only by extending the match past r are we able to distinguish between Z_k = |β| and Z_k > |β|.

The Z Algorithm
In extending the match past r, say a mismatch occurs at q, q >= r + 1. Set Z_k = q – k, r = q – 1, and l = k, as shown in the figure below.

The Z Algorithm
Using our concrete example: in extending the match past r, a mismatch occurs at q, q = r + 2. Set Z_k = q – k, r = q – 1, and l = k.

The Z Algorithm
Continue to iterate through the entire string. Computing subsequent Z_k will entail only the cases we discussed:
– Case 1: k > r, k is not in a known Z-box. Find Z_k by explicitly matching with the start of S. Set r & l accordingly.
– Case 2a: Z_k´ < |β|, the prefix match at k is wholly contained in α. r & l are not changed.
– Case 2b: Z_k´ >= |β|, β is a prefix of S. Try to extend the match. Set l = k and r = q – 1, where q is the position of the first mismatch.
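The three cases can be put together as in the sketch below. It is one possible rendering of the procedure described on these slides, not code from the textbook; the function name `z_algorithm` and the choice of returning a list indexed by 1-based position (entries 0 and 1 unused) are assumptions of this example.

```python
def z_algorithm(s: str) -> list[int]:
    """Return z with z[k] = Z_k for 1-based k >= 2 (z[0] and z[1] are unused zeros)."""
    n = len(s)
    z = [0] * (n + 1)
    l = r = 0                           # no Z-box found yet
    for k in range(2, n + 1):
        if k > r:
            # Case 1: k is in no known Z-box; compare explicitly with the start of s.
            q = k
            while q <= n and s[q - 1] == s[q - k]:
                q += 1
            z[k] = q - k
            if z[k] > 0:
                l, r = k, q - 1
        else:
            k_prime = k - l + 1         # position in the prefix corresponding to k
            beta_len = r - k + 1        # |beta|, the length of S[k..r]
            if z[k_prime] < beta_len:
                # Case 2a: the match at k lies wholly inside the known Z-box.
                z[k] = z[k_prime]
            else:
                # Case 2b: beta is a prefix of s; try to extend the match past r.
                q = r + 1
                while q <= n and s[q - 1] == s[q - k]:
                    q += 1              # q ends at the first mismatch (or n + 1)
                z[k] = q - k
                l, r = k, q - 1
    return z


print(z_algorithm("aabcaabxaaz")[2:])  # [1, 0, 0, 3, 1, 0, 0, 2, 1, 0]
```

Each character comparison that succeeds moves r to the right, and each iteration ends with at most one failed comparison, which is the intuition behind the linear running time.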

The Z Algorithm
Theorem: Using algorithm Z, value Z_k is correctly computed and variables r and l are correctly updated. Proof on page 9 of text.
Theorem: All the Z_k(S) values are computed by algorithm Z in O(|S|), i.e., linear time. Proof on page 10 of text.

A Simple Linear-Time Exact String Matching Algorithm
We can use algorithm Z by itself as a simple linear-time string matching algorithm. Let S = P$T where:
– T is the target string, |T| = m
– P is the pattern string, |P| = n, n <= m
– $ is a character not appearing in either P or T.
Apply algorithm Z to S.

A Simple Linear-Time Exact String Matching Algorithm
Since $ does not appear in P or T, no prefix match in S can be longer than n, i.e., |P|. We only need to consider Z_i(S) for i in T, i.e., i > n + 1. Any value of i, such that i > n + 1, where Z_i(S) = n, indicates a match of P starting at position i – (n + 1) in T. All Z_i(S) are computed in O(m + n) = O(m) time.
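As a sketch of this matching procedure: build S = P$T, compute the Z values, and report every i > n + 1 with Z_i = n. The names `z_algorithm` and `find_occurrences` and the `sep` parameter are assumptions of this example; the Z computation is written compactly here so that the block runs on its own, and positions are 1-based as on the slides.

```python
def z_algorithm(s: str) -> list[int]:
    """Compact Z computation: z[k] = Z_k for 1-based k >= 2."""
    n = len(s)
    z = [0] * (n + 1)
    l = r = 0
    for k in range(2, n + 1):
        if k <= r and z[k - l + 1] < r - k + 1:
            z[k] = z[k - l + 1]         # case 2a: value copied, l and r unchanged
            continue
        q = max(k, r + 1)               # case 1 starts at k, case 2b at r + 1
        while q <= n and s[q - 1] == s[q - k]:
            q += 1
        z[k] = q - k
        if q > k:                       # a non-empty Z-box was found at k
            l, r = k, q - 1
    return z


def find_occurrences(pattern: str, text: str, sep: str = "$") -> list[int]:
    """Return the 1-based positions in text where pattern occurs."""
    if sep in pattern or sep in text:
        raise ValueError("separator must not appear in pattern or text")
    n = len(pattern)
    s = pattern + sep + text
    z = z_algorithm(s)
    # Position i in S (i > n + 1) corresponds to position i - (n + 1) in the text.
    return [i - (n + 1) for i in range(n + 2, len(s) + 1) if z[i] == n]


print(find_occurrences("adab", "abaracadabara"))  # [7]
```

If `$` can occur in the input, any other character absent from both strings can be passed as `sep`.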