Finding Subtle Motifs by Branching from Sample Strings Xuan Qi Computer Science Dept. Utah State Univ.


1 Finding Subtle Motifs by Branching from Sample Strings Xuan Qi Computer Science Dept. Utah State Univ.

2 Preface
- This presentation is based on the paper “Finding subtle motifs by branching from sample strings” by Alkes Price, Sriram Ramabhadran and Pavel A. Pevzner.

3 Outline
- The motif finding problem.
- Methods that have been proposed to address this problem.
- The contribution of the method presented in this paper.
- The algorithms proposed in this paper.
- Experimental results.
- Discussion of the advantages and disadvantages of the proposed method.
- Future research directions.

4 Motif Finding Problem
- Given a set of DNA sequences, find a set of l-mers, one from each sequence, that maximizes the consensus score.
  Input: a t × n matrix of DNA (t sequences of length n) and l, the length of the pattern to find.
  Output: an array of t starting positions $s = (s_1, s_2, \dots, s_t)$ maximizing Score(s, DNA).
- Subtle motif: a motif whose occurrences score low, so the pattern does not stand out among the sequences and is more difficult to identify.
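A minimal sketch of the scoring step, assuming Score(s, DNA) is the usual consensus score: align the l-mers that start at the chosen positions and add up, column by column, the count of the most frequent nucleotide. The function name, the 0-based positions and the toy sequences are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

def consensus_score(starts, dna, l):
    """Sum over the l alignment columns of the most common nucleotide's count.

    starts: one 0-based start position per sequence (assumption: the slides
    do not say whether positions are 0- or 1-based).
    """
    lmers = [seq[s:s + l] for seq, s in zip(dna, starts)]
    # zip(*lmers) iterates over the alignment column by column.
    return sum(Counter(column).most_common(1)[0][1] for column in zip(*lmers))

dna = ["CCATGCTT", "GATGCAAA", "TTTTATGA"]
# The chosen l-mers are ATGC, ATGC and ATGA.
print(consensus_score([2, 1, 4], dna, 4))
```

The best possible score is t × l (every column unanimous); a subtle motif is one whose true positions score well below that.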

5 Methods Proposed
- Category 1: searching possible starting points of the motif
  Methods: CONSENSUS, Gibbs sampling
  Disadvantages: the search space is very large, and these methods do not always find the optimal motif.
- Category 2: searching possible samples of the motif
  Methods: Vanet et al. 2000, Marsan and Sagot 2000, Pavesi et al. 2001, Apostolico et al. 2002, Eskin and Pevzner 2002
  Advantages: reduces the search space.
  Disadvantages: the computational cost is still high, especially for long motifs, and the selected sample may converge only to a local optimum instead of the global optimum.
- An alternative: the extended sample-driven approach, which exhaustively searches the neighbors of all samples.

6 Contribution of This Paper
- Basic idea: branching from the sample strings.
- Contribution: much more efficient than previous algorithms, and effective at finding subtle motifs.

7 Comparison between the Methods

8 The Algorithms Proposed  Two ways to model a motif: 1. as a pattern 2. as a profile: 4*l matrix  Two algorithms proposed: 1. Pattern-Branching algorithm 2. Profile-Branching algorithm

9 Pattern-Branching Algorithm
- Distance between the motif M and a sample $A_0$: $d(M, A_0) = k$ (Hamming distance).
- $D_{=k}(A_0)$: the set of patterns at distance exactly k from $A_0$.
- Neighbor: a pattern in $D_{=1}(A_0)$, obtained by changing a single nucleotide of $A_0$. E.g., ATTGCCAG has neighbors ATTGCCTG and GTTGCCAG.
- Score of a pattern: its total distance from the sequences.
  1. For each sequence $s_i$, $d(A, s_i) = \min\{d(A, P) \mid P \in s_i\}$, where P ranges over the l-mers (patterns of length l) in $s_i$.
  2. The total distance of A from S is $d(A, S) = \sum_{s_i \in S} d(A, s_i)$.
- BestNeighbor(A): the pattern $B \in D_{=1}(A)$ with the lowest total distance $d(B, S)$.
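The definitions on this slide can be sketched directly in code. This is an illustrative reading, not the authors' implementation: all function names are hypothetical, and ties in BestNeighbor are broken by whichever neighbor is generated first, which the slide does not specify.

```python
def hamming(a, b):
    """Number of positions at which two equal-length patterns differ."""
    return sum(x != y for x, y in zip(a, b))

def dist_to_seq(pattern, seq):
    """d(A, s_i): distance from the pattern to its closest l-mer in seq."""
    l = len(pattern)
    return min(hamming(pattern, seq[i:i + l]) for i in range(len(seq) - l + 1))

def total_distance(pattern, seqs):
    """d(A, S): sum of d(A, s_i) over all sequences."""
    return sum(dist_to_seq(pattern, s) for s in seqs)

def neighbors(pattern, alphabet="ACGT"):
    """D_{=1}(A): every pattern that differs from A in exactly one position."""
    for i, old in enumerate(pattern):
        for new in alphabet:
            if new != old:
                yield pattern[:i] + new + pattern[i + 1:]

def best_neighbor(pattern, seqs):
    """The neighbor with the lowest total distance from the sequences."""
    return min(neighbors(pattern), key=lambda b: total_distance(b, seqs))

# Toy data (an assumption): every sequence contains ATTGCCTG exactly.
seqs = ["TTATTGCCTGAA", "GGATTGCCTGCC", "ATTGCCTGTTTT"]
print(total_distance("ATTGCCAG", seqs), best_neighbor("ATTGCCAG", seqs))
```

A pattern of length l has 3l neighbors, so one BestNeighbor call costs 3l total-distance evaluations.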

10 Pattern-Branching Algorithm
- Input: a set of sequences S, the motif length l, and the number of mutations k.
- Output: a motif of length l with k mutations.
- Algorithm: PatternBranching(S, l, k)
    M ← arbitrary motif pattern
    for each l-mer A_0 in S          (each sample of the motif in the sequences)
        for j ← 0 to k
            if d(A_j, S) < d(M, S)
                M ← A_j
            A_{j+1} ← BestNeighbor(A_j)
    output M
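A runnable sketch of the pseudocode above, assuming Hamming distance and the alphabet ACGT. The helper names and the toy sequences are hypothetical; the "arbitrary" initial motif is represented by an infinite starting distance, and the last BestNeighbor call of each branch is skipped because its result would never be scored.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def total_distance(pattern, seqs):
    """d(A, S): for each sequence take the closest l-mer, then sum."""
    l = len(pattern)
    return sum(
        min(hamming(pattern, s[i:i + l]) for i in range(len(s) - l + 1))
        for s in seqs
    )

def best_neighbor(pattern, seqs, alphabet="ACGT"):
    """Single-nucleotide change with the lowest total distance."""
    candidates = (
        pattern[:i] + new + pattern[i + 1:]
        for i, old in enumerate(pattern)
        for new in alphabet
        if new != old
    )
    return min(candidates, key=lambda b: total_distance(b, seqs))

def pattern_branching(seqs, l, k):
    motif, motif_d = None, float("inf")   # stands in for "arbitrary motif"
    for s in seqs:
        for i in range(len(s) - l + 1):
            a = s[i:i + l]                # sample string A_0
            for j in range(k + 1):        # A_0, A_1, ..., A_k
                d = total_distance(a, seqs)
                if d < motif_d:
                    motif, motif_d = a, d
                if j < k:
                    a = best_neighbor(a, seqs)
    return motif, motif_d

# Toy data (an assumption): ATTGCCTG implanted with one mutation per sequence.
seqs = ["TTATTGCCAGAA", "GGGTTGCCTGCC", "ATTGCCTCTTTT"]
print(pattern_branching(seqs, 8, 1))
```

Each sample explores only one path of length k rather than its whole k-neighborhood, which is where the efficiency gain over exhaustive neighbor search comes from.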

11 Profile-Branching Algorithm
- Similar to Pattern-Branching, with these changes:
  1. Convert each sample string $A_0$ to a profile $X(A_0)$.
  2. Generalize the scoring method to score profiles.
  3. Modify the branching method to apply to profiles.
  4. Use the top-scoring profile found as a seed for the EM algorithm.

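12 Profile-Branching Algorithm
- Convert a sample string $A_0$ to a profile $X(A_0)$: in each column, the entry for the sample's own nucleotide is 1/2 and every other entry is 1/6. For $A_0$ = ATGCCAT:

       A    T    G    C    C    A    T
  A   1/2  1/6  1/6  1/6  1/6  1/2  1/6
  T   1/6  1/2  1/6  1/6  1/6  1/6  1/2
  G   1/6  1/6  1/2  1/6  1/6  1/6  1/6
  C   1/6  1/6  1/6  1/2  1/2  1/6  1/6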
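The conversion can be sketched in a few lines. The function name and the dict-of-rows layout are assumptions made for illustration; exact fractions are used so the 1/2 and 1/6 entries print as on the slide.

```python
from fractions import Fraction

NUCLEOTIDES = "ATGC"  # row order used on the slide

def profile_from_sample(a0):
    """X(A_0): 1/2 where the row nucleotide matches the sample, 1/6 elsewhere."""
    half, sixth = Fraction(1, 2), Fraction(1, 6)
    return {v: [half if c == v else sixth for c in a0] for v in NUCLEOTIDES}

X = profile_from_sample("ATGCCAT")
for v in NUCLEOTIDES:
    print(v, *X[v])
```

Every column sums to 1/2 + 3 × 1/6 = 1, so each column is a probability distribution over the four nucleotides.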

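13 Profile-Branching Algorithm
- Use entropy to score profiles: given a profile $X = (x_{vw})$ and a pattern $P = p_1 \dots p_l$, let $e(X, P)$ be the log probability of sampling P from X, i.e. $e(X, P) = \sum_{w=1}^{l} \log x_{p_w w}$.
- Example with the profile X(ATGCCAT):

       A    T    G    C    C    A    T
  A   1/2  1/6  1/6  1/6  1/6  1/2  1/6
  T   1/6  1/2  1/6  1/6  1/6  1/6  1/2
  G   1/6  1/6  1/2  1/6  1/6  1/6  1/6
  C   1/6  1/6  1/6  1/2  1/2  1/6  1/6

  For P = GTGACAT the sampled entries $x_{p_w w}$ are 1/6, 1/2, 1/2, 1/6, 1/2, 1/2, 1/2.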
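As a worked check of the example, the sketch below evaluates e(X, P) for P = GTGACAT against X(ATGCCAT). Natural logarithms are an assumption (the slide does not state the base), and the function names are illustrative.

```python
import math

def profile_from_sample(a0, alphabet="ATGC"):
    """X(A_0) as a dict of rows: 1/2 on the sample's nucleotide, 1/6 elsewhere."""
    return {v: [0.5 if c == v else 1 / 6 for c in a0] for v in alphabet}

def log_prob(profile, pattern):
    """e(X, P): sum over positions w of log x_{p_w, w} (natural log assumed)."""
    return sum(math.log(profile[p][w]) for w, p in enumerate(pattern))

X = profile_from_sample("ATGCCAT")
# Five matching positions contribute log(1/2), two mismatches contribute log(1/6).
print(log_prob(X, "GTGACAT"))
print(5 * math.log(1 / 2) + 2 * math.log(1 / 6))
```

The sample string itself scores 7 log(1/2), the maximum any pattern can reach under this profile.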

14 Profile-Branching Algorithm
- For each sequence $S_i$ in the sample $S = \{S_1, \dots, S_n\}$, let $e(X, S_i) = \max\{e(X, P) \mid P \in S_i\}$.
- Then the entropy score of X is $e(X, S) = \sum_{S_i \in S} e(X, S_i)$.
- Intuitively, $e(X, S)$ describes how well X matches its best occurrence in each sequence of the sample.
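A short sketch of the entropy score, under the same assumptions as before (natural logarithms, hypothetical function names, toy sequences made up for illustration): slide a window of length l over each sequence, keep the best log probability, and sum over sequences.

```python
import math

def profile_from_sample(a0, alphabet="ATGC"):
    return {v: [0.5 if c == v else 1 / 6 for c in a0] for v in alphabet}

def log_prob(profile, pattern):
    """e(X, P) for a single l-mer P."""
    return sum(math.log(profile[p][w]) for w, p in enumerate(pattern))

def score_in_sequence(profile, seq):
    """e(X, S_i): the best-scoring l-mer anywhere in the sequence."""
    l = len(next(iter(profile.values())))
    return max(log_prob(profile, seq[i:i + l]) for i in range(len(seq) - l + 1))

def entropy_score(profile, seqs):
    """e(X, S): sum of the per-sequence best scores."""
    return sum(score_in_sequence(profile, s) for s in seqs)

X = profile_from_sample("ATGCCAT")
seqs = ["CCATGCCATGG", "ATGCCATTTTT", "GGGGATGCCAT"]  # each contains ATGCCAT
print(entropy_score(X, seqs))
```

Because the score is a sum of log probabilities it is always negative; values closer to zero mean a better match.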

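15 Profile-Branching Algorithm
- Branching from the sample string:
  1. Amplify only one column of the profile (corresponding to one position in the sample pattern), and amplify a nucleotide v in column w only if $x_{vw} < 0.5$.
  2. Choose the new column $x'_{vw}$ so that the relative entropy $\sum_v x_{vw} \log(x'_{vw} / x_{vw})$ equals a fixed constant; the value used is -0.3.
- Example: amplifying T in the first column of X(ATGCCAT).

  Before:
       A    T    G    C    C    A    T
  A   1/2  1/6  1/6  1/6  1/6  1/2  1/6
  T   1/6  1/2  1/6  1/6  1/6  1/6  1/2
  G   1/6  1/6  1/2  1/6  1/6  1/6  1/6
  C   1/6  1/6  1/6  1/2  1/2  1/6  1/6

  After:
       A     T    G    C    C    A    T
  A   0.27  1/6  1/6  1/6  1/6  1/2  1/6
  T   0.55  1/2  1/6  1/6  1/6  1/6  1/2
  G   0.09  1/6  1/2  1/6  1/6  1/6  1/6
  C   0.09  1/6  1/6  1/2  1/2  1/6  1/6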
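One way to realize the amplification step is sketched below. It rests on an assumption the slide does not confirm: the non-amplified entries are all shrunk by a common factor c, the amplified entry takes up the slack, and c is found by bisection so that the relative entropy hits the target. Natural logarithms and all names are likewise assumptions. With natural logs, the rounded column shown on the slide (0.27, 0.55, 0.09, 0.09) evaluates to roughly -0.31.

```python
import math

def relative_entropy(old, new):
    """sum_v old[v] * log(new[v] / old[v]) for a single profile column."""
    return sum(old[v] * math.log(new[v] / old[v]) for v in old)

def amplify(column, v, target=-0.3, iters=60):
    """Raise column[v]; shrink the other entries by a common factor c.

    ASSUMPTION: uniform shrinking is an illustrative choice, not necessarily
    the scheme used in the paper.
    """
    if column[v] >= 0.5:
        raise ValueError("only nucleotides with x_vw < 0.5 are amplified")

    def rescale(c):
        new = {u: c * x for u, x in column.items()}
        new[v] = 1.0 - c * (1.0 - column[v])   # keeps the column summing to 1
        return new

    # On (0, 1] the relative entropy rises monotonically from -inf to 0.
    lo, hi = 1e-9, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if relative_entropy(column, rescale(mid)) < target:
            lo = mid
        else:
            hi = mid
    return rescale(hi)

col = {"A": 1 / 2, "T": 1 / 6, "G": 1 / 6, "C": 1 / 6}
new = amplify(col, "T")
print({u: round(x, 3) for u, x in new.items()})
print(relative_entropy(col, {"A": 0.27, "T": 0.55, "G": 0.09, "C": 0.09}))
```

Fixing the relative entropy makes every branching step move the profile by the same amount, playing the role that "one mutation" plays in Pattern-Branching.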

16 Profile-Branching Algorithm
- Algorithm: ProfileBranching(S, l, k)
    Motif ← arbitrary motif profile
    for each l-mer A_0 in S
        X_0 ← X(A_0)
        for j ← 0 to k
            if e(X_j, S) > e(Motif, S)
                Motif ← X_j
            X_{j+1} ← BestNeighbor(X_j)
    run the EM algorithm with Motif as seed
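The branching loop can be sketched as follows. Several simplifications are assumptions rather than the authors' method: profiles are stored as a list of column dicts, amplification shrinks the other entries by a fixed factor of 0.55 instead of solving the relative-entropy constraint for each column, natural logarithms are used, and the final EM refinement is left out entirely.

```python
import math

ALPHABET = "ACGT"

def profile_from_sample(a0):
    """X(A_0) as a list of columns, each a dict nucleotide -> probability."""
    return [{v: (0.5 if v == c else 1 / 6) for v in ALPHABET} for c in a0]

def entropy_score(profile, seqs):
    """e(X, S): best log probability in each sequence, summed over sequences."""
    l = len(profile)
    def e(p):
        return sum(math.log(col[ch]) for col, ch in zip(profile, p))
    return sum(max(e(s[i:i + l]) for i in range(len(s) - l + 1)) for s in seqs)

def amplify(col, v, c=0.55):
    # ASSUMPTION: a fixed shrink factor stands in for the relative-entropy rule.
    new = {u: c * x for u, x in col.items()}
    new[v] = 1.0 - c * (1.0 - col[v])
    return new

def best_neighbor(profile, seqs):
    """Try every single-column amplification and keep the top-scoring profile."""
    candidates = [
        profile[:w] + [amplify(col, v)] + profile[w + 1:]
        for w, col in enumerate(profile)
        for v in ALPHABET
        if col[v] < 0.5
    ]
    return max(candidates, key=lambda x: entropy_score(x, seqs))

def profile_branching(seqs, l, k):
    motif, motif_score = None, -math.inf   # stands in for "arbitrary profile"
    for s in seqs:
        for i in range(len(s) - l + 1):
            x = profile_from_sample(s[i:i + l])
            for j in range(k + 1):
                score = entropy_score(x, seqs)
                if score > motif_score:
                    motif, motif_score = x, score
                if j < k:
                    x = best_neighbor(x, seqs)
    return motif, motif_score   # the slides then seed EM with this profile

seqs = ["TTATTGCCAGAA", "GGGTTGCCTGCC", "ATTGCCTCTTTT"]  # toy data
motif, score = profile_branching(seqs, 8, 1)
print(round(score, 3))
```

The control flow mirrors Pattern-Branching; only the representation, the score and the notion of neighbor change.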

17 Results on Implanted Motifs
- Pattern-Branching algorithm vs. previous pattern-based motif finding algorithms:
  WINNOWER, SP-STAR: unable to find subtle motifs
  PROJECTION, MITRA, MULTIPROFILER

18 Results on Implanted Motifs
- Profile-Branching algorithm vs. previous profile-based motif finding algorithms
- Performance coefficient: let K be the set of known positions of the n implanted motifs, and let P be the set of predicted motif positions. The performance coefficient is defined as $|K \cap P| / |K \cup P|$.
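A small sketch of the performance coefficient, assuming K and P are sets of (sequence index, nucleotide position) pairs covered by the known and predicted motif occurrences. The helper `positions` and the example start positions are hypothetical.

```python
def positions(starts, l):
    """Set of (sequence index, nucleotide position) pairs covered by the l-mers."""
    return {(i, s + j) for i, s in enumerate(starts) for j in range(l)}

def performance_coefficient(known, predicted):
    """|K ∩ P| / |K ∪ P| for two non-empty sets of positions."""
    known, predicted = set(known), set(predicted)
    return len(known & predicted) / len(known | predicted)

K = positions([2, 5], 4)   # implanted occurrences start at 2 and 5
P = positions([2, 6], 4)   # prediction is off by one in the second sequence
print(performance_coefficient(K, P))
```

The coefficient is 1 only when every predicted position matches an implanted one, and it degrades gradually for predictions that are shifted by a few nucleotides.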

19 Results on Biological Samples  Pattern-Branching Algorithm:  Profile-Branching Algorithm: The pattern returned by profile-branching matches the reference motif.

20 Discussion
- Advantages: much more efficient than previous algorithms, and effective at finding subtle motifs.
- Disadvantages:
  1. Pattern-Branching has difficulty finding motifs with many degenerate positions, although Profile-Branching handles them well.
  2. Profile-Branching is effective at finding subtle motifs but is comparatively slow.

21 Future Work
- Apply the Pattern-Branching and Profile-Branching algorithms to more challenging biological samples:
  1. Larger samples
  2. Corrupted samples
- Extend the algorithms to motif finding over an alphabet that includes not only A, T, G and C but also purine (R), pyrimidine (Y), weak bond (W) and strong bond (S).

