UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12.3: Exclusion Methods.


Lecturer: Dr. Rose
Slides by: Dr. Rose
February 20, 2003

Overview
Exclusion methods: fast expected time O(m)
– Partition approaches: BYP algorithm
– Aho-Corasick exact matching algorithm
  » Keyword trees
– Back to Aho-Corasick exact matching algorithm
  » Algorithm for computing failure links
– Back to BYP algorithm

Exclusion Methods
Q: Can we improve on the Θ(km) time we have seen for k-mismatch and k-difference?
A: On average, yes. (Are we quibbling?)
We adopt an algorithm whose expected running time is better than Θ(km); its worst case may still be no better than Θ(km).

Exclusion Methods
Partition idea: exclude much of T from the search.
Preliminaries:
Let σ = |Σ|, where Σ is the alphabet used in P and T.
Let n = |P| and m = |T|.
Defn. An approximate occurrence of P is an occurrence with at most k mismatches or differences.
General partition algorithm: three phases
1. Partition phase
2. Search phase
3. Check phase

Exclusion Methods
1. Partition phase
Partition either T or P into r-length regions (which one depends on the particular algorithm).
2. Search phase
Use exact matching to search T for r-length intervals that could be part of an approximate occurrence of P. These are the potential targets. Eliminate as many intervals as possible.
3. Check phase
Use approximate matching to check for an approximate occurrence of P around each interval that survives the search phase.

BYP Method
The BYP method has O(m) expected running time.
Partition P into r-length regions, r = ⌊n/(k+1)⌋.
Q: How many r-length regions of P are there?
A: k+1; there may be an additional short region.
Suppose there is a match of P & T with at most k differences.
Q: What can we deduce about the corresponding r-length regions?
A: At least one of the k+1 r-length regions of P must match its interval of T exactly, since k differences can touch at most k of the regions.

BYP Method
BYP Algorithm:
1. Let 𝒫 be the set of the first k+1 substrings of P's partitioning.
2. Build a keyword tree for the set of patterns 𝒫.
3. Use Aho-Corasick to find I, the set of starting locations in T where a pattern in 𝒫 occurs exactly.
4. …
Oops! We haven't talked about keyword trees or Aho-Corasick. Sooooo let's do that now.

Keyword Trees (section 3.4)
Defn. The keyword tree for a set of patterns 𝒫 is a rooted directed tree K satisfying:
1. Each edge is labeled with exactly one character.
2. Any two edges out of the same node have distinct labels.
3. Every pattern P_i in 𝒫 maps to some node v of K such that the path from the root to v spells out P_i.
4. Every leaf of K is mapped to by some pattern in 𝒫.

Keyword Trees
Example (from the textbook): 𝒫 = {potato, poetry, pottery, science, school}

Keyword Trees (section 3.4)
Observation: there is a one-to-one correspondence between the distinct prefixes of patterns in 𝒫 and the nodes of K.
1. Every node corresponds to a prefix of a pattern in 𝒫.
2. Conversely, every prefix of a pattern maps to a node in K.

Keyword Trees (section 3.4)
If n is the total length of all patterns in 𝒫, then we can construct K in O(n) time, assuming a fixed alphabet Σ.
Let K_i denote the partial keyword tree that encodes patterns P_1, …, P_i of 𝒫.

Keyword Trees (section 3.4)
Consider the partial keyword tree K_1:
– It consists of a single path of |P_1| edges out of the root r.
– Each edge is labeled with one character of P_1.
– Reading from the root to the leaf spells out P_1.
– The leaf is labeled 1.

Keyword Trees (section 3.4)
Creating K_2 from K_1:
1. Find the longest path from the root of K_1 that matches a prefix of P_2.
2. This path ends by either
a) exhausting the characters of P_2, or
b) ending at some existing node v in K_1 where no extending match is possible.
In case 2a), label the node where the path ends with 2.
In case 2b), create a new path out of v, labeled by the remaining characters of P_2, and label its last node 2.

Keyword Trees (section 3.4)
Example: P_1 is potato
a) P_2 is pot
b) P_2 is potty
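The insertion procedure can be sketched in Python. This is a minimal illustration rather than the textbook's code: the names Node, build_keyword_tree and count_nodes are made up for this sketch. The two pattern lists are cases a) and b) of the example.

```python
class Node:
    """One node of a keyword tree."""
    def __init__(self):
        self.children = {}      # edge label (one character) -> child Node
        self.pattern_id = None  # 1-based number of the pattern ending here, if any


def build_keyword_tree(patterns):
    """Insert the patterns one at a time, turning K_(i-1) into K_i."""
    root = Node()
    for number, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            # Follow the existing path while it matches; branch off when it stops.
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
        node.pattern_id = number  # the node where the pattern's path ends
    return root


def count_nodes(node):
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(count_nodes(child) for child in node.children.values())


tree_a = build_keyword_tree(["potato", "pot"])    # case 2a: no new nodes
tree_b = build_keyword_tree(["potato", "potty"])  # case 2b: new path "ty"
print(count_nodes(tree_a), count_nodes(tree_b))   # 7 9
```

In case a) the path for pot ends at an existing node, which just receives the label 2. In case b) the shared path pot is extended by two new nodes.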

Keyword Trees (section 3.4)
Use of keyword trees for matching
Finding occurrences of patterns in 𝒫 that start at position l in T:
– Starting at the root r of K, follow the unique path that matches a substring of T starting at l.
– Numbered nodes along this path indicate patterns in 𝒫 that occur starting at position l.
– For a single position l this takes time proportional to min(n, m).
– Traversing K for each position l in T gives O(nm).
– This can be improved!
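A minimal Python sketch of this naive search follows. The dictionary layout, the function names and the sample text and patterns are assumptions made for illustration. Positions are 1-based, as on the slides.

```python
def build_trie(patterns):
    """Keyword tree as nested dicts: 'next' maps a character to a child node."""
    root = {"next": {}, "id": None}
    for number, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            node = node["next"].setdefault(ch, {"next": {}, "id": None})
        node["id"] = number
    return root


def naive_search(text, patterns):
    """Restart at the root for every start position l; O(nm) in the worst case."""
    root = build_trie(patterns)
    hits = []
    for l in range(1, len(text) + 1):
        node = root
        for ch in text[l - 1:]:
            if ch not in node["next"]:
                break                      # the path from position l ends here
            node = node["next"][ch]
            if node["id"] is not None:     # a numbered node: a pattern starts at l
                hits.append((l, patterns[node["id"] - 1]))
    return hits


print(naive_search("hotpotato", ["potato", "pot", "tat"]))
# [(4, 'pot'), (4, 'potato'), (6, 'tat')]
```

Every start position walks down from the root again, which is the repeated work the failure links are meant to remove.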

Keyword Tree Speedup
Observation: our naïve keyword tree search is like the naïve approach to string comparison. Every time we increment l, we start all over at the root of K ⇒ O(nm).
Recall: KMP avoided O(nm) by shifting to get a speedup.
Q: Is there an analogous operation we can perform in K?
A: Of course, why else would I ask a rhetorical question?

Keyword Tree Speedup
First, we assume P_i ⊄ P_j for all combinations P_i, P_j in 𝒫, i.e., no pattern is a proper substring of another. (This assumption will be relaxed later in the full Aho-Corasick algorithm.)
Next, each node v in K is labeled with the string formed by concatenating the letters on the path from the root to v.
Defn. Let L(v) denote the label of node v.
Defn. Let lp(v) denote the length of the longest proper suffix of L(v) that is a prefix of some pattern in 𝒫.

Keyword Tree Speedup
Example: L(v) = potat, lp(v) = 2; the suffix "at" is a prefix of P_4.

Keyword Tree Speedup
Note: if α is the lp(v)-length suffix of L(v), then there is a unique node labeled α.
Example: "at" is the lp(v)-length suffix of L(v), and w is the unique node labeled "at".

Keyword Tree Speedup
Defn: For a node v of K, let n_v be the unique node in K labeled with the suffix of L(v) of length lp(v). When lp(v) = 0, n_v is the root of K.
Defn: The ordered pair (v, n_v) is called a failure link.

Aho-Corasick (section 3.4.6)
Algorithm AC search (we assume P_i ⊄ P_j for all combinations P_i, P_j in 𝒫)
l = 1; c = 1; w = root of K;
Repeat {
    While there is an edge (w, w´) labeled with character T(c) {
        if w´ is numbered by pattern i then
            report that P_i occurs in T starting at position l;
        w = w´ and c = c + 1;
    }
    l = c - lp(w) and w = n_w;
} Until c > m;
Note: if the mismatch occurs at the root, increment c and go through the repeat loop again.

Aho-Corasick
Example: T = hotpotattach
When l = 4, the prefix potat of potato is matched, but the next position fails. At this point c = 9.
The failure link points to the node labeled "at", and lp(v) = 2.
⇒ l = c – lp(v) = 9 – 2 = 7

Computing n_v in Linear Time
Note: if v is the root r or one character away from r, then n_v = r.
Imagine n_v has been computed for every node that is k or fewer edges from r.
How can we compute n_v for v, a node k+1 edges from r? (We also want L(n_v).)

Computing n_v in Linear Time
We are looking for n_v and L(n_v).
Let v´ be the parent of v in K and x the character on the edge connecting them.
n_v´ is known, since v´ is k edges from r.
Clearly, L(n_v) must be a (not necessarily proper) suffix of L(n_v´) followed by x.
– First check if there is an edge (n_v´, w´) with label x.
– If so, then n_v = w´.
– Otherwise, L(n_v) is a proper suffix of L(n_v´) followed by x. Examine n_(n_v´) for an outgoing edge labeled x. If no joy, keep repeating with the next failure link, finally setting n_v = r if we run out of edges.

Computing n_v in Linear Time
Algorithm n_v
/* Initialization */
v´ = parent(v) in K, and x = the character on the edge (v´, v)
w = n_v´
/* Computation */
While ((∄ edge labeled x out of node w) & (w ≠ r))
    w = n_w
if (∃ edge (w, w´) labeled x)
    n_v = w´
else
    n_v = r
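Algorithm n_v and the AC search loop can be sketched together in Python. This is a hedged illustration, not the slides' code. The pattern set {potato, attach} is hypothetical, since the figure with the real pattern set is not in the transcript. It was picked so that lp(potat) = 2 as in the example and so that no pattern is a substring of another. All names (Node, build, set_failure_links, ac_search, node_for) are made up for the sketch.

```python
from collections import deque


class Node:
    def __init__(self, depth=0):
        self.children = {}      # character -> child Node
        self.pattern_id = None  # number of the pattern ending at this node
        self.fail = None        # failure link target n_v
        self.depth = depth      # |L(v)|, so lp(v) equals self.fail.depth


def build(patterns):
    root = Node()
    for number, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            if ch not in node.children:
                node.children[ch] = Node(node.depth + 1)
            node = node.children[ch]
        node.pattern_id = number
    return root


def set_failure_links(root):
    """Breadth-first, so n_v' is already known when a child v is handled."""
    root.fail = root
    queue = deque()
    for child in root.children.values():
        child.fail = root                 # nodes one edge from the root
        queue.append(child)
    while queue:
        parent = queue.popleft()          # v'
        for x, v in parent.children.items():
            w = parent.fail               # w = n_v'
            while x not in w.children and w is not root:
                w = w.fail
            v.fail = w.children.get(x, root)
            queue.append(v)


def ac_search(text, patterns):
    """AC search, assuming no pattern is a proper substring of another."""
    root = build(patterns)
    set_failure_links(root)
    hits, w, c, m = [], root, 1, len(text)
    while c <= m:
        while c <= m and text[c - 1] in w.children:
            w = w.children[text[c - 1]]
            c += 1
            if w.pattern_id is not None:
                hits.append((c - w.depth, w.pattern_id))  # start position l
        if w is root:
            c += 1        # mismatch at the root: skip this character
        else:
            w = w.fail    # corresponds to l = c - lp(w)
    return hits


def node_for(root, label):
    """Helper for inspection: the node v with L(v) == label."""
    node = root
    for ch in label:
        node = node.children[ch]
    return node


root = build(["potato", "attach"])
set_failure_links(root)
print(node_for(root, "potat").fail.depth)                 # lp(v) = 2
print(ac_search("hotpotattach", ["potato", "attach"]))    # [(7, 2)]
```

With this pattern set the search reproduces the trace from the example: the mismatch happens at c = 9, the failure link leads to the node labeled "at", and the match that is eventually reported starts at l = 7.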

Computing n_v in Linear Time
Thm. Algorithm n_v takes O(n) time when applied to all nodes in K, where n is the total length of all patterns in 𝒫.
Q: How can we demonstrate this?
Consider a pattern P in 𝒫, where t is the length of P.
1. Since lp(v) ≤ lp(v´) + 1, lp() increases by at most t in total along the path for P.
2. But the assignment (w = n_w) decreases lp(). If w is assigned k times, then lp(v) ≤ lp(v´) + 1 – k.
⇒ Since lp() is never negative, the total number of assignments (w = n_w) along the path is bounded by t.

Computing n_v in Linear Time
Thm. cont.
3. The total time is proportional to the number of iterations of the loop:
While ((∄ edge labeled x out of node w) & (w ≠ r))
    w = n_w
Since the loop is bounded by t, the length of P, all failure links on the path for P are set in O(t) time.
We can apply the same logic to the other patterns in 𝒫 to yield a linear-time computation of all failure links.

Aho-Corasick (section 3.4.6)
Relaxing the substring assumption, i.e., that P_i ⊄ P_j for all combinations P_i, P_j in 𝒫.
Consider: 𝒫 = {acatt, ca}, T = acatg
T is matched along K until the letter g is reached. For the current node v, L(v) = acat. There is no edge labeled g out of v.
No proper suffix of acat is a prefix of acatt or ca.
⇒ Therefore n_v is the root.
⇒ Return to the root and set l = 5.
⇒ At this point the algorithm will fail to match g.
⇒ It does not find the occurrence of ca in T.
Q: What went wrong?

Aho-Corasick (section 3.4.6)
A: The problem is that the algorithm increases l (shifts) to match the longest suffix of L(v) with a prefix of some P in 𝒫. A pattern P is not necessarily a suffix of the matched part of T, even if P is embedded in it.
Soln: Report fully embedded patterns as they appear.

Aho-Corasick (section 3.4.6)
Q: How do we find the fully embedded patterns as they appear?
Lemma: Pattern P_i must occur in T ending at position c if node v is reached and there is a directed path of failure links in K from v to a node numbered with pattern i.
Lemma: If node v is reached, then pattern P_i occurs in T ending at position c only if v is numbered i or there is a directed path of failure links from v to a node numbered i.

Aho-Corasick (section 3.4.6)
Algorithm full AC search
(No assumption that P_i ⊄ P_j for all combinations P_i, P_j in 𝒫.)
(Report embedded patterns as they appear.)
l = 1; c = 1; w = root of K;
Repeat {
    While there is an edge (w, w´) labeled with character T(c) {
        if w´ is numbered by pattern i, or there is a directed path of failure links from w´ to a node numbered i, then
            report that P_i occurs in T ending at position c;
        w = w´ and c = c + 1;
    }
    l = c - lp(w) and w = n_w;
} Until c > m;

Aho-Corasick (section 3.4.6)
Q: How do we recognize directed failure-link paths to pattern-numbered nodes?
Idea: associate with each node its pattern-numbered node, if there is one.
Q: Where should this be done?
⇒ Algorithm n_v must be extended.
Caveat: the time cannot be more than linear!

Aho-Corasick (section 3.4.6)
Idea: associate with each node its pattern-numbered node, if there is one.
Mechanism: create an output link from the node to its pattern-numbered node.
How: Compute the failure link to n_v for node v.
– If n_v is a pattern-numbered node, set v's output link to n_v.
– Otherwise, if n_v has an output link to w, set v's output link to w.
– Otherwise, v has no output link.
This takes O(n) time.
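A sketch of the output-link rule, applied to the acatt/ca example from the earlier slide. The search loop here is the common one-character-at-a-time formulation, not the Repeat/While layout of the slides. All names are assumptions for illustration.

```python
from collections import deque


class Node:
    def __init__(self, depth=0):
        self.children = {}
        self.pattern_id = None
        self.depth = depth
        self.fail = None   # failure link n_v
        self.out = None    # output link: nearest numbered node along failure links


def build_automaton(patterns):
    root = Node()
    for number, pattern in enumerate(patterns, start=1):
        node = root
        for ch in pattern:
            if ch not in node.children:
                node.children[ch] = Node(node.depth + 1)
            node = node.children[ch]
        node.pattern_id = number
    root.fail = root
    queue = deque()
    for child in root.children.values():
        child.fail = root
        queue.append(child)
    while queue:
        parent = queue.popleft()
        for x, v in parent.children.items():
            w = parent.fail
            while x not in w.children and w is not root:
                w = w.fail
            v.fail = w.children.get(x, root)
            # Output link rule: n_v itself if it is numbered, else n_v's link.
            v.out = v.fail if v.fail.pattern_id is not None else v.fail.out
            queue.append(v)
    return root


def full_ac_search(text, patterns):
    """Report (end position c, pattern) for every occurrence, embedded or not."""
    root = build_automaton(patterns)
    hits, w = [], root
    for c, ch in enumerate(text, start=1):
        while ch not in w.children and w is not root:
            w = w.fail
        w = w.children.get(ch, root)
        node = w if w.pattern_id is not None else w.out
        while node is not None:            # walk the chain of output links
            hits.append((c, patterns[node.pattern_id - 1]))
            node = node.out
    return hits


print(full_ac_search("acatg", ["acatt", "ca"]))   # [(3, 'ca')]
```

The occurrence of ca that the restricted algorithm missed is now reported when the node labeled aca is reached, because that node's failure link leads to the numbered node for ca.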

Aho-Corasick (section 3.4.6)
Analysis:
1. Preprocessing of the patterns in 𝒫 can be done in O(n) time.
2. All occurrences in T of patterns P ∈ 𝒫 can be found in O(m + k) time, where
– m is the length of T
– k is the number of occurrences of patterns P ∈ 𝒫 (here k is not the bound on mismatches or differences).
Here we are counting the time to output each occurrence.
Overall, we get O(n + m + k) time.

BYP Method
The BYP method has O(m) expected running time.
Partition P into r-length regions, r = ⌊n/(k+1)⌋.
Q: How many r-length regions of P are there?
A: k+1; there may be an additional short region.
Lemma: Suppose there is a match of P & T with at most k differences.
Q: What can we deduce about the corresponding r-length regions?
A: At least one of the k+1 r-length regions of P must match its interval of T exactly.

BYP Method
BYP Algorithm:
1. Let 𝒫 be the set of the first k+1 substrings of P's partitioning.
2. Build a keyword tree for the set of patterns 𝒫.
3. Use Aho-Corasick to find I, the set of starting locations in T where a pattern in 𝒫 occurs exactly.
4. For each i ∈ I, use approximate matching to locate the end points of approximate occurrences of P in T[i-n-k..i+n+k].
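The four steps can be sketched end to end in Python. Two simplifications are assumed here and are not part of the slides: exact matching uses str.find as a stand-in for the keyword tree and Aho-Corasick, and the check phase uses a plain dynamic-programming table for the k-difference problem. The names k_diff_ends and byp, and the sample strings, are hypothetical.

```python
def k_diff_ends(pattern, text, k):
    """1-based end positions in text of substrings within k differences of pattern."""
    n = len(pattern)
    prev = [0] * (len(text) + 1)          # row 0: an occurrence may start anywhere
    for i in range(1, n + 1):
        cur = [i] + [0] * len(text)
        for j in range(1, len(text) + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return [j for j in range(1, len(text) + 1) if prev[j] <= k]


def byp(pattern, text, k):
    """Partition, search and check phases; requires k < len(pattern)."""
    n, m = len(pattern), len(text)
    r = n // (k + 1)                      # region length r = floor(n/(k+1))
    regions = {pattern[j * r:(j + 1) * r] for j in range(k + 1)}
    ends = set()
    for region in regions:
        i = text.find(region)             # stand-in for Aho-Corasick (assumption)
        while i != -1:
            lo = max(0, i - n - k)        # window T[i-n-k..i+n+k], clipped to T
            hi = min(m, i + n + k)
            ends.update(lo + e for e in k_diff_ends(pattern, text[lo:hi], k))
            i = text.find(region, i + 1)
    return sorted(ends)


result = byp("potato", "hotpotattach", 1)
print(result)
print(result == k_diff_ends("potato", "hotpotattach", 1))
```

Running the dynamic program over all of T, as the last line does for comparison, is what the exclusion method tries to avoid: the table is filled only for windows around exact region hits.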

BYP Method
Q: Why do we choose the range i-n-k..i+n+k in T, i.e., T[i-n-k..i+n+k]?
1. What is n?
2. What is k?
3. Why –(n + k) to +(n + k)?

BYP Method
BYP considers all possible places for an occurrence of P in T (by the lemma).
Step 2: Building the keyword tree takes O(n).
Step 3: Aho-Corasick takes O(m) (since O(m+k) is O(m)).
Note: there are other approaches for steps 2 & 3 (pg 272).
Step 4: takes time O(n) for each index in I.
Recall that we are interested in an expected running time of O(m). The worst case may be larger.

BYP: Expected Time
From the previous slide: steps 2 & 3 are already O(m) worst case. (Why is this true?)
Analysis of step 4. Assume:
– The size of our alphabet is σ (σ = |Σ|).
– T has a uniform distribution of characters.
– P can be any arbitrary string.
1. All p ∈ 𝒫 have the same length, r.
2. Q: What is the expected number of occurrences of an arbitrary length-r string in T if |T| = r?
3. A: 1/σ^r (see next slide for explanation)

BYP: Expected Time
A: 1/σ^r, because we assume a uniform distribution of characters in T.
1. The probability of any specific character is 1/σ.
2. The probability of any specific sequence of k characters is 1/σ^k.
However, |T| ≠ r; in fact |T| ≫ r.
Q: If there are m substrings of length r in T, what is the expected number of exact occurrences of p in T?
A: m/σ^r
Q: What is the expected number of occurrences in T of patterns in 𝒫?
A: m(k+1)/σ^r (recall there are k+1 patterns in 𝒫)
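To get a feel for the formula m(k+1)/σ^r, it can be evaluated for a few values of k. The numbers below (a DNA alphabet with σ = 4, |T| = 1,000,000, |P| = 60) are hypothetical and chosen only for illustration, as is the function name.

```python
def expected_region_hits(m, k, sigma, n):
    """Expected exact occurrences in T of the k+1 regions: m(k+1)/sigma^r."""
    r = n // (k + 1)                 # region length r = floor(n/(k+1))
    return m * (k + 1) / sigma ** r


# Hypothetical setting: sigma = 4, m = 1,000,000, n = 60.
for k in (2, 5, 9):
    print(k, expected_region_hits(1_000_000, k, 4, 60))
```

As k grows, the regions get shorter and the expected number of surviving intervals rises sharply, which is why the O(m) bound only holds for a restricted range of k.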

BYP: Expected Time
Q: How long does it take to check for a single approximate occurrence of P in T in step 4?
A: Dynamic programming gives O(n^2) (global alignment).
The expected checking time in step 4 is then n^2 m(k+1)/σ^r.
For O(m) time, we need to choose k s.t. n^2 m(k+1)/σ^r ≤ cm for some constant c.

BYP: Expected Time
To simplify, substitute n-1 for k, and solve for r in n^3 m/σ^r = cm.
Then σ^r = n^3/c, so r = log_σ n^3 - log_σ c.
But we are given r = ⌊n/(k+1)⌋.

BYP: Expected Time
So for k = n/(log n), BYP has an expected running time of O(m).
Q: Where did n/(log n) come from?
A: It is a simple expression for k in terms of n that satisfies our equation.
To see this, substitute n/(log n) for k into:

BYP: Expected Time
Letting σ^r = n^3/c, you get:
Which simplifies:
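The equations for these last steps are not present in the transcript. One reading that is consistent with the surrounding text, offered as an assumption rather than as the slides' own derivation, is that k = n/(log n) is substituted into the expected checking time n^2 m(k+1)/σ^r:

```latex
\begin{align*}
\frac{n^2\, m\,(k+1)}{\sigma^r}
  &= \frac{n^2\, m \left(\frac{n}{\log n} + 1\right)}{\sigma^r}
     && \text{substituting } k = \frac{n}{\log n} \\
  &= \frac{c\, n^2\, m \left(\frac{n}{\log n} + 1\right)}{n^3}
     && \text{letting } \sigma^r = \frac{n^3}{c} \\
  &= c\, m \left(\frac{1}{\log n} + \frac{1}{n}\right)
     && \text{which is } O(m).
\end{align*}
```

Under that reading, the factor in parentheses shrinks as n grows, so the expected checking time stays within a constant multiple of m.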