UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12.3: Exclusion Methods.

Lecturer: Dr. Rose
Slides by: Dr. Rose
February 20, 2003

Overview
Exclusion methods: fast expected time O(m)
- Partition approaches: BYP algorithm
- Aho-Corasick exact matching algorithm
  » Keyword trees
- Back to Aho-Corasick exact matching algorithm
  » Algorithm for computing failure links
- Back to BYP algorithm

Exclusion Methods
Q: Can we improve on the Θ(km) time we have seen for k-mismatch and k-difference?
A: On average, yes. (Are we quibbling?)
We adopt an algorithm whose expected running time is below Θ(km) ⇒ the worst case may not be better than Θ(km).

Exclusion Methods
Partition idea: exclude much of T from the search.
Preliminaries:
- Let σ = |Σ|, where Σ is the alphabet used in P and T.
- Let n = |P| and m = |T|.
Defn. An approximate occurrence of P is an occurrence with at most k mismatches or differences.
General partition algorithm: three phases
1. Partition phase
2. Search phase
3. Check phase

Exclusion Methods
1. Partition phase: partition either T or P into r-length regions (which one depends on the particular algorithm).
2. Search phase: use exact matching to search T for r-length intervals. These are potential targets for approximate occurrences of P. Eliminate as many intervals as possible.
3. Check phase: use approximate matching to check for an approximate occurrence of P around each interval that survives the search phase.

BYP Method
BYP method has O(m) expected running time.
Partition P into r-length regions, r = ⌊n/(k+1)⌋.
Q: How many r-length regions of P are there?
A: k+1; there may be an additional short region.
Suppose there is a match of P and T with at most k differences.
Q: What can we deduce about the corresponding r-length regions?
A: At least one r-length region must match its interval of T exactly, because k differences can touch at most k of the k+1 regions.
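The partition step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `partition_pattern` and the 0-based offsets are choices made here, not part of the slides, and the optional short region at the end of P is simply dropped.

```python
def partition_pattern(p: str, k: int) -> list[tuple[int, str]]:
    """Split p into k+1 consecutive regions of length r = floor(n / (k+1)).

    Returns (offset, region) pairs with 0-based offsets into p.
    Any leftover short region at the end of p is ignored.
    """
    n = len(p)
    r = n // (k + 1)
    if r == 0:
        raise ValueError("k is too large: the regions would be empty")
    return [(j * r, p[j * r:(j + 1) * r]) for j in range(k + 1)]


# n = 10 and k = 2 give r = 3: three full regions plus the short leftover "j".
print(partition_pattern("abcdefghij", 2))
# [(0, 'abc'), (3, 'def'), (6, 'ghi')]
```

With at most k = 2 differences, at least one of the three regions above has to appear unchanged in any approximate occurrence.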

BYP Method
BYP Algorithm:
1. Let 𝒫 be the set of the first k+1 substrings of P's partitioning.
2. Build a keyword tree for the set of patterns 𝒫.
3. Use Aho-Corasick to find I, the set of starting locations in T where a pattern in 𝒫 occurs exactly.
4. ...
Oops! We haven't talked about keyword trees or Aho-Corasick. So let's do that now.

Keyword Trees (section 3.4)
Defn. The keyword tree for a set of patterns 𝒫 is a rooted directed tree K satisfying:
1. Each edge is labeled with one character.
2. Any two edges out of the same node have distinct labels.
3. Every pattern P_i in 𝒫 maps to some node v of K such that the path from the root to v spells out P_i.
4. Every leaf in K is mapped to by some pattern in 𝒫.

Keyword Trees
Example (from the textbook): 𝒫 = {potato, poetry, pottery, science, school}

Keyword Trees (section 3.4)
Observation: there is a one-to-one correspondence between the distinct prefixes of patterns in 𝒫 and the nodes in K.
1. Every node corresponds to a prefix of a pattern in 𝒫.
2. Conversely, every prefix of a pattern maps to a node in K.

Keyword Trees (section 3.4)
If n is the total length of all patterns in 𝒫, then we can construct K in O(n) time, assuming a fixed Σ.
Let K_i denote the partial keyword tree that encodes patterns P_1, ..., P_i of 𝒫.

Keyword Trees (section 3.4)
Consider the partial keyword tree K_1:
- It consists of a single path of |P_1| edges out of the root r.
- Each edge is labeled with one character of P_1.
- Reading from the root to the leaf spells out P_1.
- The leaf is labeled 1.

Keyword Trees (section 3.4)
Creating K_2 from K_1:
1. Find the longest path from the root of K_1 that matches a prefix of P_2.
2. This path ends by either
   a) exhausting the characters of P_2, or
   b) ending at some existing node v in K_1 where no extending match is possible.
In case 2a), label the node where the path ends with 2.
In case 2b), create a new path out of v, labeled by the remaining characters of P_2, and label its last node with 2.

Keyword Trees (section 3.4)
Example: P_1 is potato
a) P_2 is pot
b) P_2 is potty

Keyword Trees (section 3.4)
Use of keyword trees for matching
Finding occurrences of patterns in 𝒫 that start at position l in T:
- Starting at the root r in K, follow the unique path that matches a substring of T that starts at l.
- Numbered nodes along this path indicate matched patterns in 𝒫 that start at position l.
- This takes time proportional to min(n, m).
- Traversing K for each position l in T gives O(nm).
- This can be improved!
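The construction of K and the naive position-by-position search can be sketched as follows. The class and function names (`Node`, `build_keyword_tree`, `naive_search`) and the example pattern set are assumptions made for illustration; positions are 1-based to match the slides.

```python
class Node:
    def __init__(self):
        self.children = {}    # edge label (one character) -> child Node
        self.pattern = None   # number i if the path from the root spells P_i


def build_keyword_tree(patterns):
    """Insert the patterns one at a time, as in the K_1, K_2, ... construction."""
    root = Node()
    for i, pat in enumerate(patterns, start=1):
        node = root
        for ch in pat:
            if ch not in node.children:   # case 2b: start a new path here
                node.children[ch] = Node()
            node = node.children[ch]
        node.pattern = i                  # number the node where P_i ends
    return root


def naive_search(text, root):
    """Restart at the root for every start position l: O(nm) in the worst case."""
    hits = []
    for l in range(1, len(text) + 1):
        node = root
        for ch in text[l - 1:]:
            node = node.children.get(ch)
            if node is None:
                break
            if node.pattern is not None:
                hits.append((l, node.pattern))
    return hits


root = build_keyword_tree(["pot", "potato", "tat"])
print(naive_search("hotpotato", root))
# [(4, 1), (4, 2), (6, 3)]
```

Each hit is a pair (l, i) meaning that P_i occurs in T starting at position l. The inner loop is the part that failure links later avoid repeating.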

Keyword Tree Speedup
Observation: our naïve keyword tree search is like the naïve approach to string comparison.
⇒ Every time we increment l, we start all over at the root of K ⇒ O(nm)
Recall: KMP avoided O(nm) by shifting to get a speedup.
Q: Is there an analogous operation we can perform in K?
A: Of course, why else would I ask a rhetorical question?

Keyword Tree Speedup
First, we assume P_i ⊄ P_j for all combinations P_i, P_j in 𝒫, i.e., no pattern is a proper substring of another. (This assumption will be relaxed later in the full Aho-Corasick algorithm.)
Next, each node v in K is labeled with the string formed by concatenating the letters from the root to v.
Defn. Let L(v) denote the label of node v.
Defn. Let lp(v) denote the length of the longest proper suffix of string L(v) that is a prefix of some pattern in 𝒫.

Keyword Tree Speedup
Example: L(v) = potat, lp(v) = 2; the suffix at is a prefix of P_4.

Keyword Tree Speedup
Note: if α is the lp(v)-length suffix of L(v), then there is a unique node labeled α.
Example: at is the lp(v)-length suffix of L(v), and w is the unique node labeled at.

Keyword Tree Speedup
Defn. For a node v of K, let n_v be the unique node in K labeled with the suffix of L(v) of length lp(v). When lp(v) = 0, n_v is the root of K.
Defn. The ordered pair (v, n_v) is called a failure link.

Aho-Corasick (section 3.4.6)
Algorithm AC search
(we assume P_i ⊄ P_j for all combinations P_i, P_j in 𝒫)
    l = 1; c = 1; w = root of K;
    Repeat {
        While there is an edge (w, w') labeled with character T(c) {
            if w' is numbered by pattern i then
                report that P_i occurs in T starting at position l;
            w = w' and c = c + 1;
        }
        l = c - lp(w) and then w = n_w;
    } Until c > m;
Note: lp(w) is taken at the node where the match failed, before following its failure link. If no edge out of the root matches T(c), increment c and run the repeat loop again.
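A compact way to experiment with this search is to identify each node of K with its label L(v), so that the keyword tree is just the set of all pattern prefixes. The sketch below is written under that assumption and computes n_v directly from the definition of lp(v) rather than in linear time; it uses the common one-pass formulation in which l is recovered as c minus the depth of the current node plus 1. The name `ac_search_simple` and the pattern set are hypothetical.

```python
def ac_search_simple(text, patterns):
    """AC search assuming no pattern is a proper substring of another.

    Nodes are represented by their labels L(v); the empty string is the root.
    Returns (l, i) pairs: P_i occurs in text starting at 1-based position l.
    """
    number = {p: i for i, p in enumerate(patterns, start=1)}
    nodes = {p[:j] for p in patterns for j in range(len(p) + 1)}

    def fail(label):
        # n_v: the longest proper suffix of L(v) that is a prefix of a pattern.
        for s in range(1, len(label) + 1):
            if label[s:] in nodes:
                return label[s:]
        return ""

    hits = []
    w = ""
    for c, ch in enumerate(text, start=1):
        # Follow failure links until an edge labeled ch exists or we hit the root.
        while w and w + ch not in nodes:
            w = fail(w)
        if w + ch in nodes:
            w += ch
            if w in number:
                hits.append((c - len(w) + 1, number[w]))
    return hits


print(ac_search_simple("hotpotattach", ["potato", "attach"]))
# [(7, 2)]
```

In this run the match of potat fails at c = 9, the failure link leads to the node labeled at, and the search continues from there without rereading positions 7 and 8.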

Aho-Corasick
Example: T = hotpotattach
When l = 4 there is a match of potat (positions 4 through 8), but the next position fails. At this point c = 9.
The failure link points to the node labeled at, and lp(v) = 2.
⇒ l = c - lp(v) = 9 - 2 = 7

Computing n_v in Linear Time
Note: if v is the root r or one character away from r, then n_v = r.
Imagine n_v has been computed for every node that is k or fewer edges from r.
How can we compute n_v for v, a node k+1 edges from r? (We also want L(n_v).)

Computing n_v in Linear Time
We are looking for n_v and L(n_v).
Let v' be the parent of v in K and x the character on the edge connecting them.
n_{v'} is known since v' is k edges from r.
Clearly, L(n_v) must be a suffix of L(n_{v'}) followed by x.
- First check if there is an edge (n_{v'}, w') with label x.
- If so, then n_v = w'.
- Otherwise, L(n_v) is a proper suffix of L(n_{v'}) followed by x. Examine n_{n_{v'}} for an outgoing edge labeled x. If no joy, keep repeating, finally setting n_v = r if we run out of edges.

Computing n_v in Linear Time
Algorithm n_v
    /* initialization */
    v' = parent(v) in K and x = the character on the edge (v', v)
    w = n_{v'}
    /* computation */
    While ((∄ edge labeled x out of node w) & (w ≠ r))
        w = n_w
    if (∃ edge (w, w') labeled x)
        n_v = w'
    else
        n_v = r
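Applied to all nodes in breadth-first order, Algorithm n_v can be sketched as below. This is an illustrative version, not the textbook's code: the `Node` fields, the helper `lp`, and the pattern set are assumptions. Nodes one edge from the root are handled separately, as the earlier note requires, so that a node never becomes its own failure target.

```python
from collections import deque


class Node:
    def __init__(self, depth=0):
        self.children = {}    # edge label -> child Node
        self.depth = depth    # |L(v)|, the number of edges from the root
        self.fail = None      # n_v, the target of v's failure link


def build_with_failure_links(patterns):
    root = Node()
    for pat in patterns:
        node = root
        for ch in pat:
            if ch not in node.children:
                node.children[ch] = Node(node.depth + 1)
            node = node.children[ch]

    root.fail = root
    queue = deque()
    for child in root.children.values():   # one character away from r
        child.fail = root
        queue.append(child)

    while queue:
        vp = queue.popleft()                # v': its failure link is known
        for x, v in vp.children.items():
            w = vp.fail
            while x not in w.children and w is not root:
                w = w.fail
            v.fail = w.children.get(x, root)
            queue.append(v)
    return root


def lp(root, label):
    """lp(v) for the node v with L(v) = label, read off as the depth of n_v."""
    node = root
    for ch in label:
        node = node.children[ch]
    return node.fail.depth


root = build_with_failure_links(["potato", "attach"])
print(lp(root, "potat"))   # 2
```

Breadth-first order guarantees that n_{v'} and every failure link reachable from it are already set when v is processed.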

Computing n_v in Linear Time
Thm. Algorithm n_v takes O(n) time when applied to all nodes in K, where n is the total length of all patterns in 𝒫.
Q: How can we demonstrate this?
Consider a pattern P in 𝒫, where t is the length of P.
1. Since lp(v) ≤ lp(v') + 1 ⇒ along the path for P, lp() is increased by at most t in total.
2. But the assignment (w = n_w) decreases lp(). If w is assigned k times then lp(v) ≤ lp(v') + 1 - k.
⇒ Since lp() is never negative, the number of assignments (w = n_w) is bounded by t.

Computing n_v in Linear Time
Thm. (cont.)
3. The total time is proportional to the number of iterations of the loop:
    While ((∄ edge labeled x out of node w) & (w ≠ r))
        w = n_w
Since the loop is bounded by t, the length of P ⇒ all failure links on the path for P are set in O(t) time.
We can apply the same logic to the other patterns in 𝒫 to yield a linear-time computation for all failure links.

Aho-Corasick (section 3.4.6)
Relaxing the substring assumption ⇒ i.e., P_i ⊄ P_j for all combinations P_i, P_j in 𝒫.
Consider: 𝒫 = {acatt, ca}, T = acatg
T is matched along K until the letter g is reached.
For the current node v, L(v) = acat.
There is no edge labeled g out of v.
No proper suffix of acat is a prefix of acatt or ca.
⇒ Therefore n_v is the root.
⇒ Return to the root and set l = 5.
⇒ At this point the algorithm will fail to match g.
⇒ It does not find the occurrence of ca in T.
Q: What went wrong?

Aho-Corasick (section 3.4.6)
A: The problem is that the algorithm increases l (shifts) to match the longest suffix of L(v) with a prefix of some P in 𝒫.
P is not necessarily a suffix of the matched part of T, even if P is embedded in it.
Soln: Report fully embedded patterns as they appear.

Aho-Corasick (section 3.4.6)
Q: How do we find the fully embedded patterns as they appear?
A: Lemmas 3.4.2 and 3.4.3.
Lemma 3.4.2. Pattern P_i must occur in T ending at position c if node v is reached and there is a directed path of failure links in K from v to a node numbered with pattern i.
Lemma 3.4.3. If node v is reached, then pattern P_i occurs in T ending at position c only if v is numbered i or there is a directed path of failure links from v to a node numbered i.

Aho-Corasick (section 3.4.6)
Algorithm full AC search
(No assumption that P_i ⊄ P_j for all combinations P_i, P_j in 𝒫.)
(Report embedded patterns as they appear.)
    l = 1; c = 1; w = root of K;
    Repeat {
        While there is an edge (w, w') labeled with character T(c) {
            if w' is numbered by pattern i, or there is a directed path of
            failure links from w' to a node numbered with i, then
                report that P_i occurs in T ending at position c;
            w = w' and c = c + 1;
        }
        l = c - lp(w) and then w = n_w;
    } Until c > m;

Aho-Corasick (section 3.4.6)
Q: How do we recognize directed failure-link paths to pattern-numbered nodes?
Idea: associate with each node its pattern-numbered node, if there is one.
Q: Where should this be done?
⇒ Algorithm n_v must be extended.
Caveat: the time cannot be more than linear!

Aho-Corasick (section 3.4.6)
Idea: associate with each node its pattern-numbered node, if there is one.
Mechanism: create an output link from the node to its pattern-numbered node.
How: compute the failure link to n_v for node v.
- If n_v is a pattern-numbered node, set v's output link to n_v.
- Otherwise, if n_v has an output link to w, set v's output link to w.
- Otherwise, v has no output link.
This takes O(n) time.
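The output-link mechanism and the full search can be sketched together. The array-based layout below (parallel lists `goto`, `fail`, `out`, `number`, `depth`, with node 0 as the root) is an assumption made to keep the example short; it is one possible encoding of K, not the one used in the textbook. Occurrences are reported as (start, i) pairs, with the start recovered from the end position c.

```python
from collections import deque


def build_automaton(patterns):
    goto, fail, out, number, depth = [{}], [0], [None], [None], [0]
    for i, pat in enumerate(patterns, start=1):
        v = 0
        for ch in pat:
            if ch not in goto[v]:
                goto.append({})
                fail.append(0)
                out.append(None)
                number.append(None)
                depth.append(depth[v] + 1)
                goto[v][ch] = len(goto) - 1
            v = goto[v][ch]
        number[v] = i

    queue = deque(goto[0].values())   # depth-1 nodes: failure link is the root
    while queue:
        vp = queue.popleft()
        for x, v in goto[vp].items():
            w = fail[vp]
            while x not in goto[w] and w != 0:
                w = fail[w]
            fail[v] = goto[w].get(x, 0)
            # Output link: n_v itself if it is numbered, else n_v's output link.
            nv = fail[v]
            out[v] = nv if number[nv] is not None else out[nv]
            queue.append(v)
    return goto, fail, out, number, depth


def full_ac_search(text, patterns):
    goto, fail, out, number, depth = build_automaton(patterns)
    hits, w = [], 0
    for c, ch in enumerate(text, start=1):
        while ch not in goto[w] and w != 0:
            w = fail[w]
        w = goto[w].get(ch, 0)
        # Report w if numbered, then everything reachable via output links.
        u = w if number[w] is not None else out[w]
        while u is not None:
            hits.append((c - depth[u] + 1, number[u]))
            u = out[u]
    return hits


print(full_ac_search("acatg", ["acatt", "ca"]))
# [(2, 2)]
```

On the earlier example the embedded pattern ca is now reported when the node labeled aca is reached, through that node's output link.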

Aho-Corasick (section 3.4.6)
Analysis:
1. Preprocessing of the patterns in 𝒫 can be done in O(n) time.
2. All occurrences in T of patterns P ∈ 𝒫 can be found in time O(m + k), where
   - m is the length of T
   - k is the number of occurrences of patterns P ∈ 𝒫.
   Here we are counting the time to output each occurrence.
Overall, we get O(n + m + k) time.

BYP Method
BYP method has O(m) expected running time.
Partition P into r-length regions, r = ⌊n/(k+1)⌋.
Q: How many r-length regions of P are there?
A: k+1; there may be an additional short region.
Lemma 12.3.1. Suppose there is a match of P and T with at most k differences.
Q: What can we deduce about the corresponding r-length regions?
A: At least one r-length region must match its interval of T exactly.

BYP Method
BYP Algorithm:
1. Let 𝒫 be the set of the first k+1 substrings of P's partitioning.
2. Build a keyword tree for the set of patterns 𝒫.
3. Use Aho-Corasick to find I, the set of starting locations in T where a pattern in 𝒫 occurs exactly.
4. For each i ∈ I, use approximate matching to locate end points of approximate occurrences of P in T[i-n-k..i+n+k].
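The four steps can be put together in a small end-to-end sketch. Two simplifications are assumed here and should be read as stand-ins: the exact search uses Python's built-in `str.find` in place of a keyword tree with Aho-Corasick, and the check phase uses a plain dynamic program for k-differences in which an occurrence may start anywhere in the window. The function names are hypothetical; reported end positions are 1-based.

```python
def k_diff_ends(p, t, k):
    """1-based end positions in t of substrings within edit distance k of p."""
    prev = [0] * (len(t) + 1)            # row 0: a match may start anywhere
    for i in range(1, len(p) + 1):
        cur = [i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cur[j] = min(prev[j] + 1,                          # skip p[i]
                         cur[j - 1] + 1,                       # skip t[j]
                         prev[j - 1] + (p[i - 1] != t[j - 1]))  # (mis)match
        prev = cur
    return [j for j in range(1, len(t) + 1) if prev[j] <= k]


def byp(p, t, k):
    n, m = len(p), len(t)
    r = n // (k + 1)
    if r == 0:
        raise ValueError("k is too large: the regions would be empty")

    # Step 1: the first k+1 regions of length r.
    regions = {p[j * r:(j + 1) * r] for j in range(k + 1)}

    # Steps 2-3 (stand-in for Aho-Corasick): exact occurrences of any region.
    starts = set()
    for region in regions:
        pos = t.find(region)
        while pos != -1:
            starts.add(pos)              # 0-based start in t
            pos = t.find(region, pos + 1)

    # Step 4: approximate matching only inside T[i-n-k .. i+n+k].
    ends = set()
    for i in starts:
        lo = max(0, i - n - k)
        hi = min(m, i + n + k)
        for e in k_diff_ends(p, t[lo:hi], k):
            ends.add(lo + e)
    return sorted(ends)


text = "xxabcdefghxxabcxefghxx"
print(byp("abcdefgh", text, 1) == k_diff_ends("abcdefgh", text, 1))
# True
```

The final comparison checks the exclusion idea on one input: restricting the dynamic program to windows around exact region hits reports the same end positions as running it over all of T.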

BYP Method
Q: Why do we choose the range i-n-k..i+n+k in T, i.e., T[i-n-k..i+n+k]?
1. What is n?
2. What is k?
3. Why from -(n + k) to +(n + k)?

BYP Method
BYP considers all possible places for an occurrence of P in T (Lemma 12.3.1).
Step 2: building the keyword tree takes O(n).
Step 3: Aho-Corasick takes O(m) (since O(m+k) is O(m)).
Note: there are other approaches for steps 2 and 3 (pg 272).
Step 4: takes time O(n) for each index in I.
Recall that we are interested in the expected running time, O(m). The worst case may be larger.

BYP: Expected Time
From the previous slide: steps 2 and 3 are already O(m) in the worst case. (Why is this true?)
Analysis of step 4: assume
- The size of our alphabet is σ (σ = |Σ|).
- T has a uniform distribution of characters.
- P can be any arbitrary string.
1. All p ∈ 𝒫 have the same length, r.
2. Q: What is the expected number of occurrences of an arbitrary length-r substring in T if |T| = r?
3. A: 1/σ^r (see next slide for explanation)

BYP: Expected Time
A: 1/σ^r, because we assume a uniform distribution of characters in T.
1. The probability of any specific character is 1/σ.
2. The probability of any specific sequence of k characters is 1/σ^k.
However, |T| ≠ r; in fact |T| ≫ r.
Q: If there are m substrings of length r in T, what is the expected number of exact occurrences of p in T?
A: m/σ^r
Q: What is the expected number of occurrences in T of patterns in 𝒫?
A: m(k+1)/σ^r (recall there are k+1 patterns in 𝒫)
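To get a feel for the size of m(k+1)/σ^r, the formula can be evaluated directly. The numbers below are made up for illustration (a DNA-like alphabet and a text of one million characters), and the function name is hypothetical.

```python
def expected_region_hits(m: int, k: int, r: int, sigma: int) -> float:
    """Expected number of exact occurrences in T of the k+1 length-r regions,
    assuming uniformly distributed characters: m (k+1) / sigma^r."""
    return m * (k + 1) / sigma ** r


# Hypothetical setting: sigma = 4, m = 1,000,000, k = 3 differences, r = 10.
hits = expected_region_hits(m=1_000_000, k=3, r=10, sigma=4)
print(round(hits, 2))   # 3.81
```

Under these assumptions only about four places in T would survive the search phase and need the more expensive check phase.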

BYP: Expected Time
Q: How long does it take to check for a single approximate occurrence of P in T in step 4?
A: Dynamic programming gives O(n^2) (global alignment).
Expected checking time in step 4 is then n^2 m(k+1)/σ^r.
For O(m) time, we need to choose k such that, for some constant c,
\[ \frac{m\,n^{2}(k+1)}{\sigma^{r}} = c\,m \]

BYP: Expected Time
To simplify, substitute n-1 for k, and solve for r in
\[ \frac{m\,n^{3}}{\sigma^{r}} = c\,m \]
Then σ^r = n^3/c, so r = log_σ n^3 - log_σ c.
But we are given
\[ r = \left\lfloor \frac{n}{k+1} \right\rfloor \]
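The algebra on this slide can be written out as one short chain. This is a sketch under two assumptions that are not stated on the slide: the floor in r = ⌊n/(k+1)⌋ is dropped, and σ and c are treated as constants.

```latex
\begin{align*}
\frac{m\,n^{3}}{\sigma^{r}} = c\,m
  &\;\Longrightarrow\; \sigma^{r} = \frac{n^{3}}{c}
  \;\Longrightarrow\; r = \log_\sigma n^{3} - \log_\sigma c, \\
r = \frac{n}{k+1}
  &\;\Longrightarrow\; k = \frac{n}{\log_\sigma n^{3} - \log_\sigma c} - 1
   = \Theta\!\left(\frac{n}{\log n}\right).
\end{align*}
```

Since log_σ n^3 = 3 log_σ n, the denominator grows like log n, which is where a bound of the form n/(log n) on k comes from.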

BYP: Expected Time
So for k = n/(log n), BYP has an expected running time of O(m).
Q: Where did n/(log n) come from?
A: It is a simple expression for k in terms of n that satisfies our equation.
To see this, substitute n/(log n) for k into:
\[ \frac{m\,n^{2}(k+1)}{\sigma^{r}} \]

BYP: Expected Time
Letting σ^r = n^3/c, you get:
\[ \frac{c\,m\,n^{2}\left(\frac{n}{\log n} + 1\right)}{n^{3}} \]
Which simplifies to:
\[ c\,m\left(\frac{1}{\log n} + \frac{1}{n}\right) = O(m) \]
