Designing Spaced Seeds
Vineet Bafna, March 2006
Project/Exam deadlines
May 2
– Send me the title of your project.
May 9
– Each student/group gives a 10 min. presentation on their proposed project.
– Show preliminary computations. What is the test plan? What is the data like, and how much is there?
Last week of classes:
– A 20 min. presentation from each group
– A written report on the project
– A take-home exam, due electronically on the date of the final exam
Accuracy
Consider a 64 bp sequence that is 70% similar to the query.
– Pr(a contiguous 11-mer seed matches somewhere) = 0.3
– Pr(a spaced seed matches) = [value not preserved]
This non-intuitive result leads to the selection of spaced words that are an order of magnitude faster for identical specificity and sensitivity.
Implemented in PATTERNHUNTER.
How to compute a spaced seed
No good algorithm is known. Iterate over all (M choose W) seeds of span M and weight W.
– Use a computation to decide Pr(match) for each seed.
– Choose the seed that maximizes this probability.
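The exhaustive search above can be sketched in Python as follows. This is only an illustration under stated assumptions, not PatternHunter's implementation: `hit_probability` and `best_seed` are hypothetical names, the probability is computed with a simple dynamic program over the last M-1 bits of the match string (exponential in M, so practical only for short seeds), and the first and last seed positions are fixed to 1 so that the span is exactly M.

```python
from itertools import combinations


def hit_probability(seed, L, p):
    """Pr(seed matches somewhere in a random 0/1 match string of length L)."""
    M = len(seed)
    need = int(seed, 2)            # bit mask of the positions that must be 1
    keep = (1 << (M - 1)) - 1      # mask selecting the last M-1 bits
    alive = {0: 1.0}               # window state -> Pr(no hit so far)
    for i in range(1, L + 1):
        nxt = {}
        for state, mass in alive.items():
            for bit, pr in ((1, p), (0, 1.0 - p)):
                window = (state << 1) | bit
                if i >= M and (window & need) == need:
                    continue       # a hit: this mass leaves the no-hit pool
                key = window & keep
                nxt[key] = nxt.get(key, 0.0) + mass * pr
        alive = nxt
    return 1.0 - sum(alive.values())


def best_seed(M, W, L, p):
    """Try every seed of span M and weight W (W >= 2); return the best one."""
    best, best_prob = None, -1.0
    for inner in combinations(range(1, M - 1), W - 2):
        ones = {0, M - 1, *inner}  # endpoints are always 1
        seed = "".join("1" if i in ones else "0" for i in range(M))
        prob = hit_probability(seed, L, p)
        if prob > best_prob:
            best, best_prob = seed, prob
    return best, best_prob
```

For example, `best_seed(18, 11, 64, 0.7)` would search all spans-18, weight-11 seeds for the setting on the Accuracy slide, at a noticeable running-time cost in pure Python.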
Prob. Computation for Spaced Seeds
Given a specific seed Q(M,W), compute the probability of a hit in a sequence of length L.
We can assume that each position matches independently with probability p.
The match/mismatch string is then a random binary string of length L in which each bit is 1 with probability p.
Prob. Computation for Spaced Seeds
Given a specific seed Q(M,W), compute the probability of a hit in a sequence of length L.
– Q is a binary string of length M, with W 1s.
We try to match Q against the binary 'match string' S, a random binary string of length L in which each bit is 1 with probability p.
Q matches S at a location if S has a 1 wherever Q has a 1.
$P_Q$ = Prob.(Q matches random S at some location)
How can we compute $P_Q$?
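Before turning to an exact computation, the definition of the hit probability can be checked by simulation. The sketch below is a Monte Carlo estimate, not an exact method; the function names, the default trial count, and the fixed random seed are assumptions made for illustration.

```python
import random


def seed_matches_at(seed, match_string, start):
    """True if every 1 of the seed lines up with a 1 of match_string."""
    return all(match_string[start + k] == 1
               for k, c in enumerate(seed) if c == "1")


def seed_hits(seed, match_string):
    """True if the seed matches at some location of the match string."""
    last_start = len(match_string) - len(seed)
    return any(seed_matches_at(seed, match_string, s)
               for s in range(last_start + 1))


def estimate_hit_probability(seed, L, p, trials=20000, rng=None):
    """Estimate Pr(hit) by sampling random match strings of length L."""
    rng = rng or random.Random(0)   # fixed seed so runs are repeatable
    hits = 0
    for _ in range(trials):
        s = [1 if rng.random() < p else 0 for _ in range(L)]
        hits += seed_hits(seed, s)
    return hits / trials
```

A call such as `estimate_hit_probability("1" * 11, 64, 0.7)` gives a rough value for the contiguous 11-mer case; the estimate carries sampling error, so it is only a sanity check for an exact algorithm.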
Computing f(i,b)
For a specific string b, define
$f(i,b)$ = Prob.(Q matches a random string S of length i, given that S ends in b)
Why is it sufficient to compute $f(i,b)$ for all i, b?
– $P_Q = f(L, \epsilon)$, where $\epsilon$ is the empty string.
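Computing f(i,b)
Define $B_1$ as the set of all strings b that match a suffix of Q, that is, b has a 1 wherever the last |b| positions of Q have a 1.
We have two possibilities:
– $b \in B_1$: b is consistent with a suffix of Q.
– $b \in B_0 = B - B_1$, where B is the set of all binary strings of length at most M.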
Computing f(i,b)
Case $b \in B_0$:
– $f(i,b) = f(i-1,\, b \gg 1)$, where $b \gg 1$ is b with its last bit dropped
Case $b \in B_1$ and $|b| = M$:
– $f(i,b) = 1$
Computing f(i,b)
Case $b \in B_1$, $|b| < M$:
– $f(i,b) = p\,f(i,1b) + (1-p)\,f(i,0b)$, since 1b and 0b are longer suffixes of the same length-i string
– Note that if $b \in B_1$, then $1b \in B_1$.
– However, it is possible that $0b \notin B_1$.
We want to iterate only over $b \in B_1$:
– Find the smallest j s.t. $0b \gg j \in B_1$.
– $f(i,b) = p\,f(i,1b) + (1-p)\,f(i-j,\, 0b \gg j)$
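The three cases can be transcribed into a memoized recursion. The sketch below is a direct, unoptimized reading of the recurrence, taking i to be the full length of S so that prepending a bit to b leaves i unchanged: $f(i,b) = p\,f(i,1b) + (1-p)\,f(i,0b)$. It applies the $B_0$ rule one bit at a time instead of jumping by j, and it assumes $f(i,b) = 0$ whenever $i < M$. The names `hit_probability` and `in_B1` are hypothetical.

```python
from functools import lru_cache


def hit_probability(seed, L, p):
    """P_Q = f(L, empty string) for a seed given as a 0/1 string."""
    M = len(seed)

    def in_B1(b):
        """b has a 1 wherever the last len(b) positions of the seed do."""
        tail = seed[M - len(b):]
        return all(x == "1" for x, q in zip(b, tail) if q == "1")

    @lru_cache(maxsize=None)
    def f(i, b):
        if i < M:                      # S is too short to contain the seed
            return 0.0
        if not in_B1(b):               # no hit can end at position i
            return f(i - 1, b[:-1])    # drop the last bit of b
        if len(b) == M:                # b itself is a hit
            return 1.0
        # Condition on the bit immediately to the left of b.
        return p * f(i, "1" + b) + (1 - p) * f(i, "0" + b)

    return f(L, "")
```

The recursion depth grows roughly linearly with L, so very long sequences would need an iterative table instead of recursion.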
Efficiency
$|B_1| \le M\,2^{M-W}$
The iteration proceeds for all i and all $b \in B_1$, and each comparison needs O(M) steps:
$O(M\,2^{M-W} \cdot L \cdot M) = O(M^2\,2^{M-W} L)$
More efficient algorithm for spaced seed design
Due to Buhler, Keich, and Sun.
Consider a seed $\pi$ (weight w, span s).
Let $Q_\pi$ be the set of all $2^{s-w}$ strings matching $\pi$.
Ex: $\pi = 1001$ gives $Q_\pi = \{1001, 1011, 1101, 1111\}$.
Trie construction
Our goal is to make an automaton that accepts all strings which contain a string from $Q_\pi$.
Make a trie T from $Q_\pi$. T is a DFA that accepts precisely $Q_\pi$.
Can we convert T to a DFA that accepts all strings that have a string from $Q_\pi$ as a suffix?
Ex: $\pi = 1001$
Failure links
Use of failure links allows us to traverse any string till $Q_\pi$ is reached.
Note: the DFA has special structure. Does it help?
No failure links are needed when a node has an outgoing edge labeled 0, because such a node also has an edge labeled 1. Therefore, we fail only when we see a 0 at a node whose only outgoing edge is 1.
Ex: $\pi = 1001$
Substring automaton
We started with a trie that accepts only $Q_\pi$.
Next, we use failure links to accept any string with a suffix from $Q_\pi$.
Finally, make every accepting state an absorbing one (a self-loop on 0,1), to accept all strings containing a string from $Q_\pi$ as a substring.
Ex: $\pi = 1001$
Computing sensitivity of $\pi$
Compute the probability that a 'random' string of length l will match $\pi$.
Equivalently: what is the probability that a random string of length l that starts at the begin node will end in an accepting state of $A_\pi$?
Case 1: Each bit of S is 1 with probability p.
– $P(q,t)$ = probability that we reach q after reading the first t bits.
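The automaton-based computation can be sketched as below. The state probabilities follow the recurrence $P(q', t+1) = \sum_{(q,x):\,\delta(q,x)=q'} P(q,t)\,\Pr(x)$, with $\Pr(1) = p$ and $\Pr(0) = 1-p$. This is a plain Aho-Corasick construction written for illustration, not the authors' implementation; `seed_language`, `build_dfa`, and `sensitivity` are hypothetical names, and enumerating the full seed language explicitly is an assumption that only suits seeds with few 0 positions.

```python
from collections import deque
from itertools import product


def seed_language(seed):
    """All 0/1 strings of len(seed) that have a 1 wherever the seed does."""
    free = [i for i, c in enumerate(seed) if c == "0"]
    words = []
    for bits in product("01", repeat=len(free)):
        w = list(seed)
        for i, b in zip(free, bits):
            w[i] = b
        words.append("".join(w))
    return words


def build_dfa(words):
    """Trie plus failure links, compiled into a full transition table."""
    delta, accepting = [[-1, -1]], [False]   # -1 marks a missing trie edge
    for w in words:
        q = 0
        for ch in w:
            x = int(ch)
            if delta[q][x] == -1:
                delta.append([-1, -1])
                accepting.append(False)
                delta[q][x] = len(delta) - 1
            q = delta[q][x]
        accepting[q] = True
    fail = [0] * len(delta)
    queue = deque()
    for x in (0, 1):
        if delta[0][x] == -1:
            delta[0][x] = 0                  # stay at the root
        else:
            queue.append(delta[0][x])
    while queue:                             # breadth-first over the trie
        q = queue.popleft()
        accepting[q] = accepting[q] or accepting[fail[q]]
        for x in (0, 1):
            r = delta[q][x]
            if r == -1:
                delta[q][x] = delta[fail[q]][x]   # follow the failure link
            else:
                fail[r] = delta[fail[q]][x]
                queue.append(r)
    return delta, accepting


def sensitivity(seed, L, p):
    """Pr(a random length-L string ends in an absorbing accepting state)."""
    delta, accepting = build_dfa(seed_language(seed))
    prob = [0.0] * len(delta)
    prob[0] = 1.0
    hit = 0.0
    for _ in range(L):
        nxt = [0.0] * len(delta)
        for q, mass in enumerate(prob):
            if mass == 0.0:
                continue
            for x, px in ((1, p), (0, 1.0 - p)):
                r = delta[q][x]
                if accepting[r]:
                    hit += mass * px         # absorbed: never leaves
                else:
                    nxt[r] += mass * px
        prob = nxt
    return hit
```

Treating accepting states as absorbing is done here by moving their probability mass into `hit` instead of adding self-loops, which gives the same result with less bookkeeping.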
Complexity
Size of the automaton: $W\,2^{M-W}$
What is the in-degree?
Claimed complexity = $O(W\,2^{M-W} L)$
– A factor of $O(M^2/W)$ faster than the previous algorithm
Generalizing the match string
The match string may have a different distribution.
– Errors do not fall independently at random.
Instead of independent Bernoulli trials, we can have a higher-order Markov process generating the match string.
The algorithm of Keich et al. cannot deal with this extension, but it is natural in Mandala.
Experimental Results with Mandala
428 human/mouse genomic aligned regions.
Repeat-mask the alignments and separate them into coding/non-coding regions.
A total of [count not preserved] similarities (alignments) were pulled.
These are used to check the sensitivity (accuracy) of filters.
Effect of Span
[Figure legend]
– Solid line: 0-th order model
– Dashed line: 5-th order model
W=11 throughout: larger span implies more gaps; span=11 implies an ungapped (BLASTN) seed.
Accuracy of different seeds
[Figure panels: Non-coding, Coding]
Model order
[Figure legend] Non-coding: solid line; coding: dashed line
What about multiple keywords?
All of the analysis is for ungapped alignments. With indels, multiple words might be more sensitive.
Mandala works for multiple keywords also.
Can we make the algorithm more efficient?
– In particular, there is an explosion of states in making a deterministic automaton.
– Can we match against a non-deterministic automaton?
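Regular Expressions
Concise representation of a set of strings over alphabet $\Sigma$.
Described by a string over $\Sigma$ and the operators +, ·, *.
R is a r.e. if and only if it has one of the following forms, where $R_1$ and $R_2$ are regular expressions:
– $R = \{\epsilon\}$
– $R = \{\sigma\}$, $\sigma \in \Sigma$
– $R = R_1 + R_2$
– $R = R_1 \cdot R_2$
– $R = R_1^*$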
Regular Expression
Q: Let $\Sigma$ = {A,C,E}
– Is (A+C)*EEC* a regular expression?
– *(A+C)?
– AC*..E?
Q: When is a string s in a regular expression?
– R = (A+C)*EEC*
– Is CEEC in R?
– AEC?
– ACEE?
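The membership questions can be checked mechanically with Python's `re` module. The sketch assumes the usual translation of the slide notation into Python syntax, where union `+` becomes `|` and concatenation is implicit; `in_language` is a hypothetical helper name.

```python
import re

# (A+C)*EEC* in the slide notation; '+' (union) is written '|' in Python.
R = re.compile(r"(A|C)*EEC*")


def in_language(s):
    """True if the whole string s belongs to the language of R."""
    return R.fullmatch(s) is not None


for s in ("CEEC", "AEC", "ACEE"):
    print(s, in_language(s))
```

`fullmatch` is used rather than `search` because the question is whether the entire string is in R, not whether some substring is.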
Regular Expression & Automata
Every R.E. can be expressed by an automaton (a directed graph) with the following properties:
– The automaton has a start and an end node.
– Each edge is labeled with a symbol from $\Sigma$, or $\epsilon$.
Suppose R is described by automaton A.
$s \in R$ if and only if there is a path from start to end in A, labeled with s.
Examples: Regular Expression & Automata
(A+C)*EEC*
[Figure: automaton from start to end with edges labeled A, C, E, E, C]
Constructing automata from R.E.
[Figures: one automaton per case]
– $R = \{\epsilon\}$
– $R = \{\sigma\}$, $\sigma \in \Sigma$
– $R = R_1 + R_2$
– $R = R_1 \cdot R_2$
– $R = R_1^*$
Regular Expression Matching
Given a database D, and a regular expression R, is a substring of D in R?
Is there a string D[l..c] that is accepted by the automaton of R?
Simpler Q: Is D[1..c] accepted by the automaton of R?
Alg. For matching R.E.
If D[1..c] is accepted by the automaton $R_A$:
– There is a path labeled D[1]…D[c] that goes from START to END in $R_A$.
– There is a path labeled D[1]..D[c-1] from START to some node u, and a path labeled D[c] from u to END.
D.P. to match regular expression
Define:
– $A[u,\sigma]$ = automaton node reached from u after reading $\sigma$
– Eps(u): set of all nodes reachable from node u using epsilon transitions
– N[c] = subset of nodes reachable from the START node after reading D[1..c]
Q: when is $v \in N[c]$?
D.P. to match regular expression
Q: when is $v \in N[c]$?
A: If for some $u \in N[c-1]$, $w = A[u,D[c]]$ and $v \in \{w\} \cup Eps(w)$.
Algorithm
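One possible rendering of the N[c] computation is sketched below; it is an illustration built from the definitions of A[u,σ], Eps(u), and N[c], not a reconstruction of the original slide. The dictionary-based automaton encoding, the function names, and the hand-built automaton for (A+C)*EEC* are all assumptions, and `STEP` maps to a set of nodes so that more than one edge with the same label is allowed.

```python
def eps_closure(nodes, eps):
    """Nodes reachable from `nodes` using only epsilon transitions."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def accepted_prefix_ends(D, start, end, step, eps):
    """All c (1-based) such that D[1..c] is accepted by the automaton."""
    N = eps_closure({start}, eps)          # N[0]
    ends = []
    for c, symbol in enumerate(D, 1):
        moved = set()
        for u in N:                        # w = A[u, D[c]] for u in N[c-1]
            moved |= step.get((u, symbol), set())
        N = eps_closure(moved, eps)        # add Eps(w) for every w
        if end in N:
            ends.append(c)
    return ends


# Hypothetical hand-built automaton for (A+C)*EEC*: start 0, end 3.
STEP = {(0, "A"): {0}, (0, "C"): {0}, (0, "E"): {1},
        (1, "E"): {2}, (2, "C"): {2}}
EPS = {2: {3}}
```

With this automaton, `accepted_prefix_ends("CEEC", 0, 3, STEP, EPS)` reports the prefixes CEE and CEEC as accepted. Each step touches every node of N[c-1] once, so the running time is proportional to the length of D times the size of the automaton.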
The final step
We have answered the question:
– Is D[1..c] accepted by R?
– Yes, if $END \in N[c]$
We need to answer:
– Is D[l..c] (for some l, and some c) accepted by R?
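One common way to handle an arbitrary start position l, offered here as an assumption and not necessarily the approach taken in the lecture, is to re-add the START node to the active set at every position, so that a match may begin anywhere in D. The sketch reuses the same hypothetical dictionary encoding of the automaton for (A+C)*EEC* and ignores the empty substring.

```python
def eps_closure(nodes, eps):
    """Nodes reachable from `nodes` using only epsilon transitions."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def substring_match_ends(D, start, end, step, eps):
    """All c (1-based) such that some non-empty D[l..c] is accepted."""
    N = eps_closure({start}, eps)
    ends = []
    for c, symbol in enumerate(D, 1):
        moved = set()
        for u in N:
            moved |= step.get((u, symbol), set())
        if end in eps_closure(moved, eps):
            ends.append(c)
        # Restart: a new match may begin at position c + 1.
        N = eps_closure(moved | {start}, eps)
    return ends


# Hypothetical hand-built automaton for (A+C)*EEC*: start 0, end 3.
STEP = {(0, "A"): {0}, (0, "C"): {0}, (0, "E"): {1},
        (1, "E"): {2}, (2, "C"): {2}}
EPS = {2: {3}}
```

For instance, the prefix test alone rejects every prefix of "EAEEA", while `substring_match_ends("EAEEA", 0, 3, STEP, EPS)` finds a match ending at position 4. The cost stays proportional to the length of D times the automaton size, because the restart only adds one node per step.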