Presentation is loading. Please wait.

Presentation is loading. Please wait.

UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12.3: Exclusion Methods.

Similar presentations


Presentation on theme: "UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12.3: Exclusion Methods."— Presentation transcript:

1 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12.3: Exclusion Methods Lecturer: Dr. Rose Slides by: Dr. Rose February 20, 2003

2 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Overview Exclusion methods: fast expected time O(m) –Partition approaches: BYP algorithm –Aho-Corasick exact matching algorithm »Keyword trees –Back to Aho-Corasick exact matching algorithm »Algorithm for computing failure links Back to BYP algorithm

3 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Exclusion Methods Q: Can we improve on the  (km) time we have seen for k-mismatch and k-difference? A: On average, yes. (Are we quibbling?) We adopt a fast expected algorithm <  (km)  the worst case may not be better than  (km)

4 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Exclusion Methods Partition Idea: exclude much of T from the search Preliminaries: Let  = |  |, where  is the alphabet used in P and T. Let n = | P |, and m = | T |. Defn. an approximate occurrence of P is an occurrence with at most k mismatches or differences. General Partition algorithm: three phases 1.Partition phase 2.Search Phase 3.Check Phase

5 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Exclusion Methods 1.Partition phase Partition either T or P into r-length regions (depends on particular algorithm) 2.Search Phase Use exact matching to search T for r-length intervals These are potential targets for approximate occurrences of P. Eliminate as many intervals as possible. 3.Check Phase Use approximate matching to check for an approximate occurrence of P around each surviving interval for the search phase.

6 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology BYP Method BYP method has O(m) expected running time. Partition P into r-length regions, r =  n/(k+1)  Q: How many r-length regions of P are there? A: k+1, there may be an additional short region. Suppose there is a match of P & T with at most k differences. Q: What can we deduce about the corresponding r-length regions? A:There must be at least one r-length interval that exactly matches.

7 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology BYP Method BYP Algorithm: 1.Let P be the set of the first k+1 substrings of P’s partitioning. 2.Build a keyword tree for the set of patterns P. 3.Use Aho-Corasik to find I, the set of starting locations in T where a pattern in P occurs exactly. 4.….. Oops! We haven’t talked about keyword trees or Aho-Corasik. Sooooo let’s do that now.

8 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Keyword Trees (section 3.4) Defn. The keyword tree for set P is a rooted directed tree K satisfying: 1.Each edge is labeled with one character 2.Any two edges out of the same node have distinct labels. 3.Every pattern P i in P maps to some node v of K s.t. the path from the root to v spells out P i 4.Every leaf in K is mapped by some pattern in P.

9 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Keyword Trees Example: From textbook P = {potato, poetry, pottery, science, school}

10 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Keyword Trees (section 3.4) Observation: there is an isomorphic mapping between distinct prefixes of patterns in P and nodes in K. 1.Every node corresponds to a prefix of a pattern in P. 2.Conversely, every prefix of a pattern maps to a node in K.

11 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Keyword Trees (section 3.4) If n is the total length of all patterns in P, then we can construct K in O(n), assuming a fixed . Let K i denote the partial keyword tree that encodes patterns P 1,.. P i of P.

12 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Keyword Trees (section 3.4) Consider partial keyword tree K 1 –comprised of a single path of |P 1 | edges out of root r. –Each edge is labeled with one character of P 1 –Reading from the root to the leaf spells out P 1 –The leaf is labeled 1

13 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Keyword Trees (section 3.4) Creating K 2 from K 1 : 1.Find the longest path from the root of K 1 that matches a prefix of P 2. 2.This paths ends by a)Either exhausting the characters of P 2 or b)Ending at some existing node v in K 1 where no extending match is possible. In case 2a) label the node where the path ends 2. In case 2b) create a new path out of v, labeled by the remaining characters of P 2.

14 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Keyword Trees (section 3.4) Example: P 1 is potato a) P 2 is pot b) P 2 is potty

15 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Keyword Trees (section 3.4) Use of keyword trees for matching Finding occurrences of patterns in P that occur starting at position l in T: –Starting at the root r in K, follow the unique path that matches a substring of T that starts at l. –Numbered nodes along this path indicate matched patterns in P that start at position l. –This takes time proportional to min(n, m) –Traversing K for each position l in T gives O(nm) –This can be improved!

16 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Keyword Tree Speedup Observation: Our naïve keyword tree is like the naïve approach to string comparison.  Every time we increment l, we start all over at the root of K  O(nm) Recall: KMP avoided O(nm) by shifting to get a speedup. Q: Is there an analogous operation we can perform in K ? A: Of course, why else would I ask a rhetorical question?

17 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Keyword Tree Speedup First, we assume P i  P j for all combinations P i,P j in P. (This assumption will be relaxed later in the full Aho-Corasick Alg.) Next, each node v in K is labeled with the string formed by concatenating the letters from the root to v. Defn. Let L (v) denote the label of node v. Defn. Let lp(v) denote the length of the longest proper suffix of string L (v) that is a prefix of some pattern in P.

18 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Keyword Tree Speedup Example: L (v) = potat, lp(v) = 2, the suffix at is the prefix of P 4.

19 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Keyword Tree Speedup Note: if  is the lp(v)-length suffix of L (v), then there is a unique node labeled . Example: at is the lp(v)-length suffix of L (v), w is the unique node labeled at.

20 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Keyword Tree Speedup Defn: For node v of K let n v be the unique node in K labeled with the suffix of L (v) of length lp(v). When lp(v) = 0 then n v is the root of K. Defn: The ordered pair (v,n v ) is called a failure link. Example:

21 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Aho-Corasick (section 3.4.6) Algorithm AC search (we assume P i  P j for all combinations P i,P j in P.) l = 1; c = 1; w = root of K ; Repeat { While there is an edge (w,w´) labeled character T(c) { if w´ is numbered by pattern i then report that P i occurs in T starting at position l; w= w´ and c = c + 1; } w = n w and l = c - lp(w); } Until c > m; Note: if the root fails to match increment c and the repeat loop again.

22 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Aho-Corasick Example: T = hotpotattach When l = 4 there is a match of pot, but the next position fails. At this point c = 9. The failure link points to the node labeled at and lp(v) = 2.  l = c – lp(v) = 9 – 2 = 7

23 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Computing n v in Linear Time Note: if v is the root r or 1 character away from r, then n v = r. Imagine n v has been computed for for every node that is exactly k or fewer edges from r. How can we compute n v for v, a node k+1 edges from r? (We also want L (n v ).)

24 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Computing n v in Linear Time We are looking for n v and L (n v ). Let v´ be the parent of v in K and x the character on the edge connecting them. n v´ is known since v´ is k edges from r. Clearly, L (n v ) must be a suffix of L (n v´ ) followed by x. –First check if there is an edge (n v´,w´) with label x. –If so, then n v = w´. –O/w L (n v ) is a proper suffix of L (n v´ ) followed by x. Examine n n v´ for an outgoing edge labeled x. If no joy, keep repeating, finally setting n v = r, if we run out of edges.

25 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Computing n v in Linear Time Algorithm n v /* Initialization */ v´ = parent(v) in K and x the character on the edge (v´,v) w = n v´ /* computation */ While ((  edge labeled x out of node w) & (w  r)) w = n w if (  edge (w,w´) label x) n v = w´ else n v = r

26 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Computing n v in Linear Time Thm. Alg. n v takes O(n) when applied to all nodes in K, where n is the length of all patterns in K. Q: How can we demonstrate this? Consider pattern P in P, where t is the length of P. 1.Since lp(v)  lp(v´) + 1  lp() is increased by at most t. 2.But the assignment (w = n w ) decreases lp(). If w is assigned k times then lp(v)  lp(v´) – k.  Since lp() is never negative, the assignment (w = n w ) is bound by t.

27 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Computing n v in Linear Time Thm. Cont. 3.The total time is proportional to the loop: While ((  edge labeled x out of node w) & (w  r)) w = n w Since the loops is bound by t, the length of P,  all failure links on the path for P are set in O(t) time. We can apply the same logic to the other patterns in P to yield a linear computation for all failure links.

28 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Aho-Corasick (section 3.4.6) Relaxing the substring assumption  i.e., P i  P j for all combinations P i,P j in P. Consider: P = {acatt, ca}, T = acatg T is matched along K until the letter g is reached. For the current node v, L (v) = acat. There is no edge labeled g out of v. No proper suffix of acat is a prefix of acatt or ca  Therefore n v is the root.  Return to the root and set l = 5  At this point the algorithm will fail to match g.  It does not find the occurrence of ca in T. Q: What went wrong?

29 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Aho-Corasick (section 3.4.6) A: The problem is that the algorithm is increases l (shifts) to match the longest suffix of L (v) with a prefix of some P in P. P is not necessarily a suffix of T even if P is embedded in T. Soln: Report fully embedded patterns as they appear.

30 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Aho-Corasick (section 3.4.6) Q: How do we find the fully embedded patterns as they appear? Lemmas 3.4.2 & 3.4.3 Lemma 3.4.2 Pattern P i must occur in T ending at position c if node v is reached and there is a directed path of failure links in K from a node v to a node numbered with pattern i. Lemma 3.4.3 If node v is reached then pattern P i occurs in T ending at position c only if v is numbered i or there is a directed path of failure links v to a node numbered i.

31 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Aho-Corasick (section 3.4.6) Algorithm full AC search (No assumption that P i  P j for all combinations P i,P j in P.) (Report embedded patterns as they appear.) l = 1; c = 1; w = root of K ; Repeat { While there is an edge (w,w´) labeled character T(c) { if w´ is numbered by pattern i or there is a directed path of failure links from w´ to a node numbered with i then report that P i occurs in T ending at position c; w= w´ and c = c + 1; } w = n w and l = c - lp(w); } Until c > m;.

32 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Aho-Corasick (section 3.4.6) Q: How do we recognize directed failure-link paths to pattern-numbered nodes? Idea: associate with each node its its pattern numbered node, if there is one. Q: Where should this be done?  Algorithm n v must be extended. Caveat: the time can not be more than linear!

33 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Aho-Corasick (section 3.4.6) Idea: associate with each node its pattern-numbered node, if there is one. Mechanism: create an output link from the node to its pattern- numbered node. How: Compute the failure link to n v for node v. If n v is a pattern-numbered node v’s set output link to n v. O/w if n v has an output link to w, set v’s output link to w. O/w v has no output link. This takes O(n) time.

34 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Aho-Corasick (section 3.4.6) Analysis: 1.Preprocessing of patterns in P can be done in O(n) 2.All occurrences in T of P  P can be found in time O(m + k). 1.m is the length of T 2.k is the number of occurrences of patterns P  P. Here we are counting the time to output each occurrence. Overall, we get O(n+m+k) time.

35 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology BYP Method BYP method has O(m) expected running time. Partition P into r-length regions, r =  n/(k+1)  Q: How many r-length regions of P are there? A: k+1, there may be an additional short region. Lemma 12.3.1 Suppose there is a match of P & T with at most k differences. Q: What can we deduce about the corresponding r-length regions? A:There must be at least one r-length interval that exactly matches.

36 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology BYP Method BYP Algorithm: 1.Let P be the set of the first k+1 substrings of P’s partitioning. 2.Build a keyword tree for the set of patterns P. 3.Use Aho-Corasik to find I, the set of starting locations in T where a pattern in P occurs exactly. 4.For each i  I use approximate matching to locate end points of approximate occurrences of P in T[i-n-k..i+n+k]

37 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology BYP Method Q: Why do we choose the range i-n-k..i+n+k in T, i.e., T[i-n-k..i+n+k] ? 1.What is n? 2.What is k? 3.Why –(n + k)  (n + k) ?

38 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology BYP Method BYP considers all possible places for an occurrence of P in T. (lemma 12.3.1) Step b: Building the keyword tree takes O(n) Step c: Aho-Corasick takes O(m) (since O(m+k) is O(m)) Note: there are other approaches for steps b & c (pg 272) Step d: takes time O(n) for each index in I. Recall, that we are interested in expected running time O(m). Worst case may be larger.

39 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology BYP: Expected Time From the previous slide: steps b & c are already O(m) worst case. (Why is this true?) Ananlysis of Step d: assume The size of our alphabet is  (  = |  | ) T has uniform distribution of characters P can be any arbitrary string 1.All p  P have the same length, r. 2.What is the expected occurrence of an arbitrary length r substring in T if |T| = r? 3.A: 1/  r (see next slide for explanation)

40 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology BYP: Expected Time A: 1/  r because we assume a uniform distribution of characters in T. 1.The probability of any specific character is 1/ . 2.The probability of any sequence of k characters is 1/  k. However, |T|  r, in fact |T|  r. Q: If there are m substrings of length r in T, what is the expected number of exact occurrences of p in T? A: m/  r Q: What are the expected number of occurrences in T of patterns in P ? A: m(k+1)/  r (recall there are k+1 patterns in P )

41 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology BYP: Expected Time Q: How long does it take to check for a single approximate occurrence of P in T in step d? A: dynamic programming gives O(n 2 ) (global alignment) Expected checking time in step d is then n 2 m(k+1)/  r For O(m) time, we need to choose k s.t.

42 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology BYP: Expected Time To simplify, substitute n-1 for k, and solve for r in Then  r = n 3 /c, so r = log  n 3 - log  c But we are given

43 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology BYP: Expected Time So for k = n/(log n), BYP has an expected running time of O(m) Q: Where did n/(log n) come from? A: It is a simple expression for k in terms of n that satisfies our equation. To see this, substitute n/(log n) for k into:

44 UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology BYP: Expected Time Letting  r = n 3 /c, you get: Which simplifies:


Download ppt "UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 12.3: Exclusion Methods."

Similar presentations


Ads by Google