Finding Subtle Motifs by Branching from Sample Strings Xuan Qi Computer Science Dept. Utah State Univ.
Preface This presentation is based on the paper “Finding subtle motifs by branching from sample strings” by Alkes Price, Sriram Ramabhadran and Pavel A. Pevzner.
Outline The motif finding problem. Methods that have been proposed to address it. The contribution of the method presented in this paper. The algorithms proposed in this paper. Experimental results. Advantages and disadvantages of the proposed method. Future research directions.
Motif Finding Problem Given a set of DNA sequences, find a set of l-mers, one from each sequence, that maximizes the consensus score. Input: A t × n matrix of DNA (t sequences of length n), and l, the length of the pattern to find. Output: An array of t starting positions s = (s1, s2, …, st) maximizing Score(s, DNA). Subtle motif: a motif whose occurrences score low, so the pattern does not stand out among the sequences and is harder to identify.
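The slide does not spell out Score(s, DNA). The sketch below assumes the usual consensus score: align the t chosen l-mers, take the count of the most frequent nucleotide in each of the l columns, and sum those counts. The function name `consensus_score`, the toy sequences, and the 0-based start positions are illustrative assumptions.

```python
from collections import Counter

def consensus_score(dna, starts, l):
    """Assumed consensus score: sum over columns of the majority-nucleotide count."""
    # Take one l-mer per sequence, beginning at the given 0-based start position.
    lmers = [seq[s:s + l] for seq, s in zip(dna, starts)]
    score = 0
    # zip(*lmers) walks the alignment column by column.
    for column in zip(*lmers):
        score += Counter(column).most_common(1)[0][1]
    return score

# Toy input (hypothetical): every sequence contains ACGT at a different offset.
dna = ["ACGTACGT", "TTACGTAA", "GGGACGTC"]
print(consensus_score(dna, [0, 2, 3], 4))  # perfectly aligned copies of ACGT
print(consensus_score(dna, [0, 0, 0], 4))  # poorly aligned choice of positions
```

With t = 3 and l = 4 the best possible score is t · l = 12, reached when all three chosen l-mers are identical.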
Methods Proposed Category 1: Searching over possible starting positions of the motif. Methods: CONSENSUS, Gibbs sampling. Disadvantages: The search space is very large, and these methods do not always find the optimal motif. Category 2: Searching over possible samples of the motif. Methods: Vanet et al. 2000, Marsan and Sagot 2000, Pavesi et al. 2001, Apostolico et al. 2002, Eskin and Pevzner 2002. Advantages: A smaller search space. Disadvantages: The computational cost is still high, especially for long motifs, and the selected sample may converge to a local optimum instead of the global optimum. An alternative: the extended sample-driven approach, which exhaustively searches the neighbors of all samples.
Contribution of This Paper Basic idea: branching from the sample strings. Contribution: Much more efficient than previous algorithms, and effective at finding subtle motifs.
Comparison between the Methods
The Algorithms Proposed Two ways to model a motif: 1. as a pattern 2. as a profile: 4*l matrix Two algorithms proposed: 1. Pattern-Branching algorithm 2. Profile-Branching algorithm
Pattern-Branching Algorithm
Distance between a motif M and a sample string A_0: d(M, A_0) = k (Hamming distance).
D_{=k}(A_0): the set of patterns at distance exactly k from A_0.
Neighbor: any pattern in D_{=1}(A_0), obtained by changing a single nucleotide of A_0. E.g., ATTGCCAG has the neighbors ATTGCCTG and GTTGCCAG.
Score of a pattern: its total distance from the sequences.
1. For each sequence s_i, d(A, s_i) = min{d(A, P) | P ∈ s_i}, where P ranges over the l-mers (patterns of length l) in s_i.
2. The total distance of A from S is d(A, S) = ∑_{s_i ∈ S} d(A, s_i).
BestNeighbor(A): the pattern B ∈ D_{=1}(A) with the lowest total distance d(B, S).
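The definitions on this slide translate almost directly into code. The following is a minimal sketch, assuming Hamming distance for d and the alphabet ACGT; the function names and the three toy sequences are illustrative and do not come from the paper.

```python
def hamming(a, b):
    """Number of positions at which two equal-length patterns differ."""
    return sum(x != y for x, y in zip(a, b))

def dist_to_sequence(pattern, seq):
    """d(A, s_i): distance from A to its closest l-mer in one sequence."""
    l = len(pattern)
    return min(hamming(pattern, seq[i:i + l]) for i in range(len(seq) - l + 1))

def total_distance(pattern, seqs):
    """d(A, S): sum of d(A, s_i) over all sequences in the sample."""
    return sum(dist_to_sequence(pattern, s) for s in seqs)

def neighbors(pattern, alphabet="ACGT"):
    """D_{=1}(A): every pattern obtained by changing exactly one nucleotide."""
    for i, old in enumerate(pattern):
        for new in alphabet:
            if new != old:
                yield pattern[:i] + new + pattern[i + 1:]

def best_neighbor(pattern, seqs):
    """The neighbor with the lowest total distance (ties broken arbitrarily)."""
    return min(neighbors(pattern), key=lambda b: total_distance(b, seqs))

# Toy sample (hypothetical): the three patterns from the slide, each embedded once.
seqs = ["CCATTGCCAGCC", "TTATTGCCTGAA", "GGGTTGCCAGTT"]
print(total_distance("ATTGCCAG", seqs))
print(best_neighbor("ATTGCCAG", seqs))
```

A pattern of length l has 3l neighbors, so one BestNeighbor call scores 3l candidate patterns against every sequence.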
Pattern-Branching Algorithm
Input: A set of sequences S, the motif length l, and the number of mutations k.
Output: A motif of length l with k mutations.
Algorithm: PatternBranching(S, l, k)
  M ← arbitrary motif pattern
  For each sample l-mer A_0 in the sequences of S
    For j ← 0 to k
      If d(A_j, S) < d(M, S)
        M ← A_j
      A_{j+1} ← BestNeighbor(A_j)
  Output M
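The loop above can be sketched in runnable form as follows. This is an illustration under stated assumptions rather than the authors' implementation: Hamming distance, the alphabet ACGT, and a starting motif of `None` with infinite distance standing in for the "arbitrary motif pattern". The toy sequences each hide a copy of ATTGCCAG with one mutation.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def total_distance(pattern, seqs):
    """d(A, S): sum over sequences of the distance to the closest l-mer."""
    l = len(pattern)
    return sum(
        min(hamming(pattern, s[i:i + l]) for i in range(len(s) - l + 1))
        for s in seqs
    )

def best_neighbor(pattern, seqs, alphabet="ACGT"):
    """Lowest-scoring pattern among those one substitution away."""
    candidates = (
        pattern[:i] + c + pattern[i + 1:]
        for i, old in enumerate(pattern)
        for c in alphabet
        if c != old
    )
    return min(candidates, key=lambda b: total_distance(b, seqs))

def pattern_branching(seqs, l, k):
    best, best_d = None, float("inf")  # stand-in for the arbitrary initial motif
    for seq in seqs:
        for i in range(len(seq) - l + 1):
            a = seq[i:i + l]           # sample string A_0
            for j in range(k + 1):     # evaluate A_0 .. A_k
                d = total_distance(a, seqs)
                if d < best_d:
                    best, best_d = a, d
                if j < k:              # A_{k+1} is never scored, so skip it
                    a = best_neighbor(a, seqs)
    return best

# Hypothetical sample: each sequence contains ATTGCCAG with one substitution.
seqs = ["CCATTGCCTGCC", "TTGTTGCCAGAA", "GGATTCCCAGTT"]
motif = pattern_branching(seqs, 8, 1)
print(motif, total_distance(motif, seqs))
```

Each sample string follows a single greedy path of k steps instead of exploring every pattern within distance k, which is where the speed-up over exhaustive neighborhood search comes from.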
Profile-Branching Algorithm Similar to Pattern-Branching Some changes: 1. convert each sample string to a profile X(A 0 ) 2. generalize the scoring method to score profiles 3. modify the branching method to apply to profiles 4. use the top-scoring profile we find as a seed to the EM algorithm
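Profile-Branching Algorithm
Convert a sample string A_0 to a profile X(A_0). Each column gives probability 1/2 to the sample's nucleotide and 1/6 to each of the other three. E.g., for A_0 = ATGCCAT:
      A    T    G    C    C    A    T
  A  1/2  1/6  1/6  1/6  1/6  1/2  1/6
  T  1/6  1/2  1/6  1/6  1/6  1/6  1/2
  G  1/6  1/6  1/2  1/6  1/6  1/6  1/6
  C  1/6  1/6  1/6  1/2  1/2  1/6  1/6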
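As a small illustration, the conversion can be written in a few lines. The sketch assumes the 1/2 and 1/6 weights shown in the slide's example; the function name `profile_from_sample` and the dictionary-of-rows layout are choices made here, not taken from the paper.

```python
from fractions import Fraction

def profile_from_sample(sample, alphabet="ATGC"):
    """Build a 4 x l profile: 1/2 for the sample's nucleotide, 1/6 otherwise."""
    half, sixth = Fraction(1, 2), Fraction(1, 6)
    # One row per nucleotide v, one entry per position w of the sample.
    return {v: [half if c == v else sixth for c in sample] for v in alphabet}

X = profile_from_sample("ATGCCAT")
for v, row in X.items():
    print(v, " ".join(str(p) for p in row))
```

Using `Fraction` keeps the entries exact, so every column can be checked to sum to exactly 1.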
Profile-Branching Algorithm
Use entropy to score profiles: given a profile X = (x_vw) and a pattern P = p_1 … p_l, let e(X, P) be the log probability of sampling P from X, i.e. e(X, P) = ∑_{w=1..l} log(x_{p_w, w}).
Example with X = X(ATGCCAT) and P = GTGACAT:
      A    T    G    C    C    A    T
  A  1/2  1/6  1/6  1/6  1/6  1/2  1/6
  T  1/6  1/2  1/6  1/6  1/6  1/6  1/2
  G  1/6  1/6  1/2  1/6  1/6  1/6  1/6
  C  1/6  1/6  1/6  1/2  1/2  1/6  1/6
  P:            G    T    G    A    C    A    T
  x_{p_w, w}:  1/6  1/2  1/2  1/6  1/2  1/2  1/2
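The example on this slide can be checked numerically. The sketch below assumes natural logarithms (the slide writes only "log") and the 1/2 and 1/6 sample profile; `sample_profile` and `log_prob` are illustrative names.

```python
import math

def sample_profile(sample, alphabet="ATGC"):
    """X(A_0): 1/2 for the sample's nucleotide in each column, 1/6 otherwise."""
    return {v: [0.5 if c == v else 1 / 6 for c in sample] for v in alphabet}

def log_prob(profile, pattern):
    """e(X, P): sum over positions w of log x_{p_w, w}."""
    return sum(math.log(profile[p][w]) for w, p in enumerate(pattern))

X = sample_profile("ATGCCAT")
pattern = "GTGACAT"
# Per-position probabilities of sampling each nucleotide of P from X.
probs = [X[p][w] for w, p in enumerate(pattern)]
print(probs)
print(log_prob(X, pattern))
```

The pattern agrees with the sample string at five positions and differs at two, so the score is 5 log(1/2) + 2 log(1/6).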
Profile-Branching Algorithm For each sequence s_i in the sample S = {s_1, …, s_n}, let e(X, s_i) = max{e(X, P) | P ∈ s_i}, where P ranges over the l-mers in s_i. Then the entropy score of X is e(X, S) = ∑_{s_i ∈ S} e(X, s_i). Intuitively, e(X, S) describes how well X matches its best occurrence in each sequence of the sample.
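A minimal sketch of the entropy score follows, again assuming natural logarithms and the 1/2 and 1/6 sample profile. The helper names and the two toy sequences are hypothetical.

```python
import math

def sample_profile(sample, alphabet="ATGC"):
    return {v: [0.5 if c == v else 1 / 6 for c in sample] for v in alphabet}

def log_prob(profile, pattern):
    """e(X, P) for a single l-mer P."""
    return sum(math.log(profile[p][w]) for w, p in enumerate(pattern))

def sequence_score(profile, seq):
    """e(X, s_i): the best e(X, P) over all l-mers P in the sequence."""
    l = len(next(iter(profile.values())))  # profile width
    return max(log_prob(profile, seq[i:i + l]) for i in range(len(seq) - l + 1))

def entropy_score(profile, seqs):
    """e(X, S): sum of the per-sequence best scores."""
    return sum(sequence_score(profile, s) for s in seqs)

X = sample_profile("ATGCCAT")
# The first sequence contains ATGCCAT exactly; the second contains GTGACAT.
seqs = ["CCATGCCATCC", "TTGTGACATAA"]
print(entropy_score(X, seqs))
```

Because the score is a sum of log probabilities it is never positive, and values closer to zero indicate a better match, which is why the profile algorithm maximizes e(X, S) where the pattern algorithm minimizes d(A, S).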
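Profile-Branching Algorithm
Branching from the sample string:
1. Amplify only one column w of the profile (which corresponds to one position in the sample pattern), and amplify a nucleotide v only if x_vw < […].
2. Make sure that the relative entropy ∑_v x_vw log(x'_vw / x_vw) = […]. We use […] = […].
Example: amplifying T in the first column of X(ATGCCAT).
Before:
      A    T    G    C    C    A    T
  A  1/2  1/6  1/6  1/6  1/6  1/2  1/6
  T  1/6  1/2  1/6  1/6  1/6  1/6  1/2
  G  1/6  1/6  1/2  1/6  1/6  1/6  1/6
  C  1/6  1/6  1/6  1/2  1/2  1/6  1/6
After:
      A     T    G    C    C    A    T
  A  0.27  1/6  1/6  1/6  1/6  1/2  1/6
  T  0.55  1/2  1/6  1/6  1/6  1/6  1/2
  G  0.09  1/6  1/2  1/6  1/6  1/6  1/6
  C  0.09  1/6  1/6  1/2  1/2  1/6  1/6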
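The numbers in the example are consistent with one simple amplification rule: raise the chosen nucleotide's probability and shrink the other three proportionally so the column still sums to 1. The sketch below assumes that rule, which is inferred from the example and not stated on the slide. It also takes the amplified probability as an explicit argument `new_prob`, a hypothetical parameter, instead of deriving it from the relative-entropy constraint.

```python
def amplify(column, v, new_prob):
    """Raise column[v] to new_prob and rescale the other entries proportionally.

    Assumption: proportional rescaling, inferred from the slide's numbers.
    """
    scale = (1.0 - new_prob) / (1.0 - column[v])
    return {u: (new_prob if u == v else p * scale) for u, p in column.items()}

# First column of X(ATGCCAT): A is the sample's nucleotide.
column = {"A": 0.5, "T": 1 / 6, "G": 1 / 6, "C": 1 / 6}
amplified = amplify(column, "T", 0.55)
print({u: round(p, 2) for u, p in amplified.items()})
```

Rounded to two decimals, the result reproduces the first column of the "after" profile: 0.27, 0.55, 0.09, 0.09.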
Profile-Branching Algorithm
Algorithm: ProfileBranching(S, l, k)
  Motif ← arbitrary motif profile
  For each l-mer A_0 in S
    X_0 ← X(A_0)
    For j ← 0 to k
      If e(X_j, S) > e(Motif, S)
        Motif ← X_j
      X_{j+1} ← BestNeighbor(X_j)
  Run EM algorithm with Motif as seed
Results on Implanted Motifs Pattern-Branching algorithm vs. previous pattern-based motif finding algorithms. WINNOWER, SP-STAR: unable to find subtle motifs. Also compared: PROJECTION, MITRA, MULTIPROFILER.
Results on Implanted Motifs Profile-Branching algorithm vs. previous profile-based motif finding algorithms. Performance coefficient: let K be the set of positions of the n implanted motifs, and let P be the set of predicted motif positions. The performance coefficient is defined to be |K ∩ P|/|K ∪ P|.
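A short sketch of the performance coefficient is given below. It assumes that positions are counted per nucleotide, as (sequence index, offset) pairs covered by each motif occurrence. The helper `motif_positions`, the 0-based starts, and the treatment of two empty sets as a score of 0 are assumptions made for this illustration.

```python
def motif_positions(starts, l):
    """All (sequence index, nucleotide offset) pairs covered by the l-mers."""
    return {(i, s + off) for i, s in enumerate(starts) for off in range(l)}

def performance_coefficient(known, predicted):
    """|K ∩ P| / |K ∪ P| for two sets of positions."""
    known, predicted = set(known), set(predicted)
    union = known | predicted
    if not union:  # assumption: nothing implanted and nothing predicted scores 0
        return 0.0
    return len(known & predicted) / len(union)

# Hypothetical case: two sequences, motif length 4, second prediction off by 2.
K = motif_positions([10, 20], 4)
P = motif_positions([10, 22], 4)
print(performance_coefficient(K, P))
```

The coefficient is 1 only when the predicted occurrences coincide exactly with the implanted ones, and it gives partial credit to predictions that overlap them.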
Results on Biological Samples Pattern-Branching Algorithm: Profile-Branching Algorithm: The pattern returned by profile-branching matches the reference motif.
Discussion Advantages: Much more efficient than previous algorithms. Effective at finding subtle motifs. Disadvantages: 1. Pattern-Branching has difficulty finding motifs with many degenerate positions, although Profile-Branching handles them well. 2. Profile-Branching is effective at finding subtle motifs but is comparatively slow.
Future Work Apply the Pattern-Branching and Profile-Branching algorithms to more challenging biological samples: 1. Larger samples 2. Corrupted samples. Extend the algorithms to motif finding over an alphabet that includes not only A, T, G, C but also purine (R), pyrimidine (Y), weak bond (W) and strong bond (S).