UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12.3: Exclusion Methods Lecturer: Dr. Rose Slides by: Dr. Rose February 20, 2003
Overview
Exclusion methods: fast expected time O(m)
– Partition approaches: BYP algorithm
– Aho-Corasick exact matching algorithm
  » Keyword trees
– Back to Aho-Corasick exact matching algorithm
  » Algorithm for computing failure links
– Back to BYP algorithm
Exclusion Methods
Q: Can we improve on the Θ(km) time we have seen for k-mismatch and k-difference?
A: On average, yes. (Are we quibbling?)
We adopt an algorithm whose expected running time is less than Θ(km); its worst case may be no better than Θ(km).
Exclusion Methods
Partition idea: exclude much of T from the search.
Preliminaries:
Let σ = |Σ|, where Σ is the alphabet used in P and T.
Let n = |P| and m = |T|.
Defn. An approximate occurrence of P is an occurrence with at most k mismatches or differences.
General partition algorithm: three phases
1. Partition phase
2. Search phase
3. Check phase
Exclusion Methods
1. Partition phase
Partition either T or P into r-length regions (which one depends on the particular algorithm).
2. Search phase
Use exact matching to search T for r-length intervals. These are potential targets for approximate occurrences of P. Eliminate as many intervals as possible.
3. Check phase
Use approximate matching to check for an approximate occurrence of P around each interval that survives the search phase.
BYP Method
The BYP method has O(m) expected running time.
Partition P into r-length regions, where r = ⌊n/(k+1)⌋.
Q: How many r-length regions of P are there?
A: k+1; there may be an additional short region.
Suppose there is a match of P and T with at most k differences.
Q: What can we deduce about the corresponding r-length regions?
A: At least one of the k+1 r-length regions of P must match exactly, because k differences can touch at most k of the regions.
BYP Method
BYP Algorithm:
1. Let 𝒫 be the set of the first k+1 substrings of P's partitioning.
2. Build a keyword tree for the set of patterns 𝒫.
3. Use Aho-Corasick to find I, the set of starting locations in T where a pattern in 𝒫 occurs exactly.
4. …
Oops! We haven't talked about keyword trees or Aho-Corasick. Sooooo let's do that now.
Keyword Trees (section 3.4)
Defn. The keyword tree for a set of patterns 𝒫 is a rooted directed tree K satisfying:
1. Each edge is labeled with one character.
2. Any two edges out of the same node have distinct labels.
3. Every pattern P_i in 𝒫 maps to some node v of K such that the path from the root to v spells out P_i.
4. Every leaf in K is mapped to by some pattern in 𝒫.
Keyword Trees
Example (from the textbook): 𝒫 = {potato, poetry, pottery, science, school}
Keyword Trees (section 3.4)
Observation: there is a one-to-one correspondence between the distinct prefixes of patterns in 𝒫 and the nodes of K.
1. Every node corresponds to a prefix of a pattern in 𝒫.
2. Conversely, every prefix of a pattern maps to a node in K.
Keyword Trees (section 3.4)
If n is the total length of all patterns in 𝒫, then we can construct K in O(n) time, assuming a fixed alphabet Σ.
Let K_i denote the partial keyword tree that encodes patterns P_1, ..., P_i of 𝒫.
Keyword Trees (section 3.4)
Consider the partial keyword tree K_1:
– It consists of a single path of |P_1| edges out of the root r.
– Each edge is labeled with one character of P_1.
– Reading from the root to the leaf spells out P_1.
– The leaf is labeled 1.
Keyword Trees (section 3.4)
Creating K_2 from K_1:
1. Find the longest path from the root of K_1 that matches a prefix of P_2.
2. This path ends by either
a) exhausting the characters of P_2, or
b) reaching some existing node v in K_1 from which no extending match is possible.
In case 2a), label the node where the path ends with 2.
In case 2b), create a new path out of v, labeled by the remaining characters of P_2, and label its final node 2.
Keyword Trees (section 3.4)
Example: P_1 is potato.
a) P_2 is pot: the characters of P_2 are exhausted at an existing node, which is labeled 2.
b) P_2 is potty: the match ends after pot, and a new path labeled ty is created.
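The incremental construction can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's code: the parallel-list representation and the names build_keyword_tree, children and number are assumptions made for the example, and the patterns are assumed to be distinct and non-empty.

```python
def build_keyword_tree(patterns):
    """Build a keyword tree as parallel lists indexed by node id (root is 0)."""
    children = [{}]      # children[v] maps an edge character to a child node id
    number = [None]      # number[v] is i if the path to v spells out P_i (1-based)
    for i, pattern in enumerate(patterns, start=1):
        v = 0
        for ch in pattern:
            if ch not in children[v]:
                # Case 2b: no extending match, so grow a new path out of v.
                children.append({})
                number.append(None)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        # Case 2a, or the end of a newly created path: label the node with i.
        number[v] = i
    return children, number


children, number = build_keyword_tree(["potato", "pot", "potty"])
print(len(children))  # 9 nodes: the root, six for potato, two more for "ty"
```

Each character of each pattern is handled once with a dictionary lookup, which is where the O(n) construction bound comes from for a fixed alphabet.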
Keyword Trees (section 3.4)
Use of keyword trees for matching
Finding occurrences of patterns in 𝒫 that start at position l in T:
– Starting at the root r of K, follow the unique path that matches a substring of T starting at l.
– Numbered nodes along this path indicate patterns in 𝒫 that occur starting at position l.
– This takes time proportional to min(n, m).
– Traversing K for each position l in T gives O(nm).
– This can be improved!
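A sketch of this naïve O(nm) search, under the same assumptions as before (hypothetical helper names, distinct non-empty patterns, 1-based positions as on the slides):

```python
def build_keyword_tree(patterns):
    """Keyword tree as parallel lists; node 0 is the root."""
    children, number = [{}], [None]
    for i, pattern in enumerate(patterns, start=1):
        v = 0
        for ch in pattern:
            if ch not in children[v]:
                children.append({})
                number.append(None)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        number[v] = i
    return children, number


def naive_search(text, patterns):
    """Return (l, i) pairs: pattern P_i occurs in text starting at position l."""
    children, number = build_keyword_tree(patterns)
    hits = []
    for l in range(1, len(text) + 1):
        v = 0                          # restart at the root for every l
        for c in range(l - 1, len(text)):
            v = children[v].get(text[c])
            if v is None:              # no edge matches: stop this walk
                break
            if number[v] is not None:  # numbered node: P_i starts at l
                hits.append((l, number[v]))
    return hits


print(naive_search("acatg", ["acatt", "ca"]))  # [(2, 2)]
```

The inner walk restarts at the root for every l, which is exactly the wasted work that failure links remove.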
Keyword Tree Speedup
Observation: our naïve keyword tree search is like the naïve approach to string comparison. Every time we increment l, we start all over at the root of K, giving O(nm).
Recall: KMP avoided O(nm) by shifting to get a speedup.
Q: Is there an analogous operation we can perform in K?
A: Of course, why else would I ask a rhetorical question?
Keyword Tree Speedup
First, we assume that no pattern P_i in 𝒫 is a proper substring of another pattern P_j in 𝒫. (This assumption will be relaxed later in the full Aho-Corasick algorithm.)
Next, each node v in K is labeled with the string formed by concatenating the letters on the path from the root to v.
Defn. Let L(v) denote the label of node v.
Defn. Let lp(v) denote the length of the longest proper suffix of the string L(v) that is a prefix of some pattern in 𝒫.
Keyword Tree Speedup
Example: L(v) = potat, lp(v) = 2; the suffix "at" is a prefix of P_4.
Keyword Tree Speedup
Note: if α is the lp(v)-length suffix of L(v), then there is a unique node in K labeled α.
Example: "at" is the lp(v)-length suffix of L(v), and w is the unique node labeled "at".
Keyword Tree Speedup
Defn. For a node v of K, let n_v be the unique node in K labeled with the suffix of L(v) of length lp(v). When lp(v) = 0, n_v is the root of K.
Defn. The ordered pair (v, n_v) is called a failure link.
Aho-Corasick (section 3.4.6)
Algorithm AC search (we assume that no pattern in 𝒫 is a proper substring of another)
l = 1; c = 1; w = root of K;
Repeat {
    While there is an edge (w, w´) labeled with character T(c) {
        if w´ is numbered by pattern i then report that P_i occurs in T starting at position l;
        w = w´ and c = c + 1;
    }
    l = c - lp(w) and then w = n_w;
} Until c > m;
Note: if the root itself has no edge labeled T(c), increment c and go around the Repeat loop again.
Aho-Corasick
Example: T = hotpotattach
When l = 4 there is a match of pot, and the path continues through potat; the next character, T(9) = t, fails to match.
At this point c = 9 and v is the node labeled potat.
The failure link points to the node labeled "at", and lp(v) = 2.
l = c - lp(v) = 9 - 2 = 7
Computing n_v in Linear Time
Note: if v is the root r or is one character away from r, then n_v = r.
Imagine n_v has been computed for every node that is k or fewer edges from r.
How can we compute n_v for a node v that is k+1 edges from r? (We also want L(n_v).)
Computing n_v in Linear Time
We are looking for n_v and L(n_v).
Let v´ be the parent of v in K, and let x be the character on the edge connecting them.
n_v´ is known, since v´ is k edges from r.
Clearly, L(n_v) must be a (not necessarily proper) suffix of L(n_v´) followed by x.
– First check whether there is an edge (n_v´, w´) with label x.
– If so, then n_v = w´.
– O/w, L(n_v) is a proper suffix of L(n_v´) followed by x. Examine the node that n_v´'s own failure link points to for an outgoing edge labeled x.
If no joy, keep following failure links, finally setting n_v = r if we reach the root and it has no edge labeled x.
Computing n_v in Linear Time
Algorithm n_v (for a node v two or more edges from r)
/* Initialization */
v´ = parent(v) in K, and x = the character on the edge (v´, v)
w = n_v´
/* Computation */
While ((there is no edge labeled x out of node w) and (w ≠ r))
    w = n_w
if (there is an edge (w, w´) labeled x)
    n_v = w´
else
    n_v = r
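Applying Algorithm n_v to the nodes in breadth-first order guarantees that n_v´ is known before v is processed. A minimal Python sketch, assuming the same list-based keyword tree as in the earlier illustrations (the names failure_links, fail and node_for are hypothetical); the pattern set in the usage lines is the textbook's figure for this section as far as I recall it, so treat it as an assumption too:

```python
from collections import deque


def build_keyword_tree(patterns):
    """Keyword tree as parallel lists; node 0 is the root."""
    children, number = [{}], [None]
    for i, pattern in enumerate(patterns, start=1):
        v = 0
        for ch in pattern:
            if ch not in children[v]:
                children.append({})
                number.append(None)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        number[v] = i
    return children, number


def failure_links(children):
    """Return fail, where fail[v] is the node n_v, for every node v."""
    fail = [0] * len(children)           # root and depth-1 nodes fail to the root
    queue = deque(children[0].values())  # start from the depth-1 nodes
    while queue:
        parent = queue.popleft()
        for x, v in children[parent].items():
            w = fail[parent]
            # Follow failure links until a node has an edge labeled x,
            # or until we are stuck at the root.
            while x not in children[w] and w != 0:
                w = fail[w]
            fail[v] = children[w].get(x, 0)
            queue.append(v)
    return fail


def node_for(children, label):
    """Walk from the root along label and return the node reached."""
    v = 0
    for ch in label:
        v = children[v][ch]
    return v


children, _ = build_keyword_tree(["potato", "tattoo", "theater", "other"])
fail = failure_links(children)
print(fail[node_for(children, "potat")] == node_for(children, "tat"))  # True
```

With this pattern set the longest proper suffix of potat that is also a prefix of a pattern is tat, so lp(v) = 3 and the failure link from potat lands on the node labeled tat.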
Computing n_v in Linear Time
Thm. Algorithm n_v takes O(n) time when applied to all nodes in K, where n is the total length of all patterns in 𝒫.
Q: How can we demonstrate this?
Consider a pattern P in 𝒫, where t is the length of P, and follow the path for P in K.
1. Since lp(v) ≤ lp(v´) + 1, lp() is increased by at most t in total along the path.
2. But each assignment (w = n_w) decreases lp(). If w is assigned k times while processing v, then lp(v) ≤ lp(v´) - k + 1.
Since lp() is never negative, the total number of assignments (w = n_w) along the path is bounded by t.
Computing n_v in Linear Time
Thm. (cont.)
3. The total time is proportional to the number of iterations of the loop:
While ((there is no edge labeled x out of node w) and (w ≠ r))
    w = n_w
Since the loop is bounded by t, the length of P, all failure links on the path for P are set in O(t) time.
We can apply the same logic to the other patterns in 𝒫 to yield a linear-time computation for all failure links.
Aho-Corasick (section 3.4.6)
Relaxing the substring assumption, i.e., that no pattern in 𝒫 is a proper substring of another.
Consider: 𝒫 = {acatt, ca}, T = acatg
T is matched along K until the letter g is reached.
For the current node v, L(v) = acat.
There is no edge labeled g out of v.
No proper suffix of acat is a prefix of acatt or ca.
Therefore n_v is the root.
Return to the root and set l = 5.
At this point the algorithm will fail to match g.
It does not find the occurrence of ca in T.
Q: What went wrong?
Aho-Corasick (section 3.4.6)
A: The problem is that the algorithm increases l (shifts) to align the longest suffix of L(v) with a prefix of some pattern P in 𝒫.
A pattern P that is embedded in the matched part of T is not necessarily a suffix of it, so the shift can skip over P.
Soln: Report fully embedded patterns as they appear.
Aho-Corasick (section 3.4.6)
Q: How do we find the fully embedded patterns as they appear?
Two lemmas:
Lemma. Pattern P_i must occur in T ending at position c if node v is reached and there is a directed path of failure links in K from v to a node numbered with pattern i.
Lemma. If node v is reached, then pattern P_i occurs in T ending at position c only if v is numbered i or there is a directed path of failure links from v to a node numbered i.
Aho-Corasick (section 3.4.6)
Algorithm full AC search
(No longer assumes that no pattern in 𝒫 is a proper substring of another; embedded patterns are reported as they appear.)
l = 1; c = 1; w = root of K;
Repeat {
    While there is an edge (w, w´) labeled with character T(c) {
        if w´ is numbered by pattern i, or there is a directed path of failure links from w´ to a node numbered i,
            then report that P_i occurs in T ending at position c;
        w = w´ and c = c + 1;
    }
    l = c - lp(w) and then w = n_w;
} Until c > m;
Aho-Corasick (section 3.4.6)
Q: How do we recognize directed failure-link paths to pattern-numbered nodes?
Idea: associate with each node its pattern-numbered node, i.e., the first numbered node on its failure-link path, if there is one.
Q: Where should this be done?
A: Algorithm n_v must be extended.
Caveat: the time cannot be more than linear!
Aho-Corasick (section 3.4.6)
Idea: associate with each node its pattern-numbered node, if there is one.
Mechanism: create an output link from the node to its pattern-numbered node.
How: compute the failure link n_v for node v.
– If n_v is a pattern-numbered node, set v's output link to n_v.
– O/w, if n_v has an output link to w, set v's output link to w.
– O/w, v has no output link.
This takes O(n) time.
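Putting the keyword tree, failure links and output links together gives a compact sketch of the full search. It is an illustration under stated assumptions rather than the slides' exact loop: the scan is written in the common one-character-at-a-time form instead of tracking l explicitly, the names aho_corasick, fail and out are hypothetical, and patterns are assumed distinct and non-empty.

```python
from collections import deque


def aho_corasick(text, patterns):
    """Return sorted (end, i) pairs: P_i occurs in text ending at position end (1-based)."""
    # Keyword tree as parallel lists; node 0 is the root.
    children, number = [{}], [None]
    for i, pattern in enumerate(patterns, start=1):
        v = 0
        for ch in pattern:
            if ch not in children[v]:
                children.append({})
                number.append(None)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
        number[v] = i

    # Failure links (fail) and output links (out), computed breadth-first.
    fail = [0] * len(children)
    out = [None] * len(children)
    queue = deque(children[0].values())
    while queue:
        parent = queue.popleft()
        for x, v in children[parent].items():
            w = fail[parent]
            while x not in children[w] and w != 0:
                w = fail[w]
            fail[v] = children[w].get(x, 0)
            # Output link: n_v itself if it is numbered, else n_v's output link.
            nv = fail[v]
            out[v] = nv if number[nv] is not None else out[nv]
            queue.append(v)

    # Scan the text once, reporting every numbered node reachable by output links.
    hits = []
    w = 0
    for c, ch in enumerate(text, start=1):
        while ch not in children[w] and w != 0:
            w = fail[w]
        w = children[w].get(ch, 0)
        u = w if number[w] is not None else out[w]
        while u is not None:
            hits.append((c, number[u]))
            u = out[u]
    return sorted(hits)


print(aho_corasick("acatg", ["acatt", "ca"]))  # [(3, 2)]
```

On the slide's counterexample the embedded pattern ca is now found, ending at position 3, because the node labeled aca has an output link to the node labeled ca.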
Aho-Corasick (section 3.4.6)
Analysis:
1. Preprocessing of the patterns in 𝒫 can be done in O(n) time.
2. All occurrences in T of patterns P ∈ 𝒫 can be found in O(m + k) time, where
– m is the length of T, and
– k is the number of occurrences of patterns P ∈ 𝒫.
Here we are counting the time to output each occurrence.
Overall, we get O(n + m + k) time.
BYP Method
The BYP method has O(m) expected running time.
Partition P into r-length regions, where r = ⌊n/(k+1)⌋.
Q: How many r-length regions of P are there?
A: k+1; there may be an additional short region.
Lemma. Suppose there is a match of P and T with at most k differences.
Q: What can we deduce about the corresponding r-length regions?
A: At least one of the k+1 r-length regions of P must match exactly.
BYP Method
BYP Algorithm:
1. Let 𝒫 be the set of the first k+1 substrings of P's partitioning.
2. Build a keyword tree for the set of patterns 𝒫.
3. Use Aho-Corasick to find I, the set of starting locations in T where a pattern in 𝒫 occurs exactly.
4. For each i ∈ I, use approximate matching to locate end points of approximate occurrences of P in T[i-n-k..i+n+k].
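The three phases can be sketched end to end in Python. This is a simplified illustration with several assumptions: str.find stands in for the Aho-Corasick search to keep the block short, the check phase is a plain O(n²) dynamic program whose first row is all zeros so that an occurrence may start anywhere in the window, and the function name byp_search is hypothetical.

```python
def byp_search(pattern, text, k):
    """Return sorted 1-based end positions in text of occurrences of pattern
    with at most k differences (insertions, deletions, substitutions)."""
    n, m = len(pattern), len(text)
    r = n // (k + 1)
    if r == 0:
        raise ValueError("need k + 1 <= len(pattern)")

    # Partition phase: the first k+1 regions of length r.
    regions = {pattern[j * r:(j + 1) * r] for j in range(k + 1)}

    # Search phase: exact occurrences (str.find stands in for Aho-Corasick).
    starts = set()
    for region in regions:
        i = text.find(region)
        while i != -1:
            starts.add(i)                      # 0-based start in text
            i = text.find(region, i + 1)

    # Check phase: approximate matching in a window around each surviving start.
    ends = set()
    for i in starts:
        lo, hi = max(0, i - n - k), min(m, i + n + k)
        window = text[lo:hi]
        prev = [0] * (len(window) + 1)         # row 0: free start anywhere
        for a in range(1, n + 1):
            cur = [a] + [0] * len(window)
            for b in range(1, len(window) + 1):
                cost = 0 if pattern[a - 1] == window[b - 1] else 1
                cur[b] = min(prev[b - 1] + cost, prev[b] + 1, cur[b - 1] + 1)
            prev = cur
        for b in range(1, len(window) + 1):
            if prev[b] <= k:
                ends.add(lo + b)               # 1-based end position in text
    return sorted(ends)


print(byp_search("potato", "hotpotattach", 1))
```

Parts of the text that contain no exact copy of a region never reach the dynamic program, which is the whole point of the exclusion approach.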
BYP Method
Q: Why do we choose the range i-n-k..i+n+k in T, i.e., T[i-n-k..i+n+k]?
1. What is n?
2. What is k?
3. Why -(n + k) and +(n + k)?
BYP Method
BYP considers all possible places for an occurrence of P in T (by the lemma).
Step 2: Building the keyword tree takes O(n) time.
Step 3: Aho-Corasick takes O(m) time (the O(m + k) bound is O(m) here, since all regions have length r and so at most m occurrences are reported).
Note: there are other approaches for steps 2 & 3 (pg 272).
Step 4: takes O(n²) time for each index in I, using dynamic programming over a window of length O(n).
Recall that we are interested in an expected running time of O(m). The worst case may be larger.
BYP: Expected Time
From the previous slide: steps 2 & 3 are already O(m) worst case. (Why is this true?)
Analysis of step 4: assume
– The size of our alphabet is σ (σ = |Σ|).
– T has a uniform distribution of characters.
– P can be any arbitrary string.
1. All p ∈ 𝒫 have the same length, r.
2. Q: What is the expected number of occurrences of an arbitrary length-r substring in T if |T| = r?
3. A: 1/σ^r (see next slide for explanation)
BYP: Expected Time
A: 1/σ^r, because we assume a uniform distribution of characters in T.
1. The probability of any specific character is 1/σ.
2. The probability of any specific sequence of k characters is 1/σ^k.
However, |T| ≠ r; in fact |T| ≫ r.
Q: If there are m substrings of length r in T, what is the expected number of exact occurrences of p in T?
A: m/σ^r
Q: What is the expected number of occurrences in T of patterns in 𝒫?
A: m(k+1)/σ^r (recall there are k+1 patterns in 𝒫)
BYP: Expected Time
Q: How long does it take to check for a single approximate occurrence of P in T in step 4?
A: Dynamic programming gives O(n²) (global alignment).
Expected checking time in step 4 is then n² m(k+1)/σ^r.
For O(m) time, we need to choose k such that n² m(k+1)/σ^r ≤ cm for some constant c.
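To get a feel for the sizes involved, the two expectations can be evaluated numerically. The sketch below simply plugs numbers into m(k+1)/σ^r and n² m(k+1)/σ^r under the uniform-text assumption; the function name expected_costs and the sample values (a DNA-sized alphabet, a pattern of length 50) are illustrative choices, not values from the slides.

```python
def expected_costs(n, m, k, sigma):
    """Return (expected exact region hits, expected checking work) for BYP."""
    r = n // (k + 1)                     # region length
    hits = m * (k + 1) / sigma ** r      # m(k+1)/sigma^r
    check_time = n * n * hits            # about n^2 table cells per hit
    return hits, check_time


hits, check_time = expected_costs(n=50, m=1_000_000, k=4, sigma=4)
print(f"{hits:.2f} expected hits, about {check_time:.0f} table-cell updates")
```

With these numbers r = 10, so fewer than five of the million text positions are expected to survive the search phase, and the expected checking work is far below m.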
BYP: Expected Time
To simplify, substitute n-1 for k, and solve for r in m n³/σ^r = cm.
Then σ^r = n³/c, so r = log_σ n³ - log_σ c.
But we are given r = ⌊n/(k+1)⌋.
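Written out in full, the chain of steps runs as follows. This is a reconstruction under two stated assumptions: the condition being solved is m n³/σ^r = cm for some constant c (the expected checking time with k replaced by n-1), and the floor in r = ⌊n/(k+1)⌋ is ignored.

```latex
\begin{align*}
\frac{m\,n^{3}}{\sigma^{r}} = c\,m
  &\;\Longrightarrow\; \sigma^{r} = \frac{n^{3}}{c}
   \;\Longrightarrow\; r = 3\log_{\sigma} n - \log_{\sigma} c, \\
r = \frac{n}{k+1}
  &\;\Longrightarrow\; k + 1 = \frac{n}{3\log_{\sigma} n - \log_{\sigma} c}
   \;\Longrightarrow\; k = O\!\left(\frac{n}{\log n}\right).
\end{align*}
```

So the number of allowed differences can grow almost linearly with the pattern length while the expected running time stays O(m).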
BYP: Expected Time
So for k = O(n / log n), BYP has an expected running time of O(m).
Q: Where did n / log n come from?
A: It is a simple expression for k in terms of n that satisfies our equation, up to constant factors.
To see this, substitute n / log n for k into:
BYP: Expected Time
Letting σ^r = n³/c, you get:
Which simplifies: