Designing Multiple Simultaneous Seeds for DNA Similarity Search Yanni Sun, Jeremy Buhler Washington University in Saint Louis.


1 Designing Multiple Simultaneous Seeds for DNA Similarity Search Yanni Sun, Jeremy Buhler Washington University in Saint Louis

2 Outline
WashU. Laboratory for Computational Genomics
- Problem of multi-seed design
- Methods
  - Greedy covering algorithm
  - Compute conditional match probabilities
- Experiments and results
- Conclusion and future work

3 Sequence Alignment
- Functional regions are conserved despite DNA mutations over time.
- A conserved region can be aligned with a high score.
- Exact solution: dynamic programming (DP); time complexity O(MN) for sequences of lengths M and N.
- Fast but heuristic solution: seeded alignment algorithm.
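As an illustration of the O(MN) dynamic program mentioned on this slide, here is a minimal sketch of local alignment scoring in the Smith-Waterman style. The slide does not name a specific algorithm or scoring scheme, so the function name `local_alignment_score` and the match, mismatch, and gap values are assumptions chosen only for the example.

```python
def local_alignment_score(a: str, b: str, match: int = 1,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b (linear gap cost, assumed scores)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):          # M rows
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):      # N columns, so O(MN) cells in total
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # 0 lets a local alignment restart anywhere
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("TAGGACCTAACC", "GACCACCTTTT"))
```

Every cell of the M by N table is filled once, which is why the exact method becomes expensive on genome-scale inputs and motivates the seeded heuristic.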

4 Seeded Alignment Algorithm
- BLAST is the most popular tool.
- Step 1: find a word match.
- Step 2: extend the match to find the high-similarity pair.
- Example sequences (figure): TAGGACCTAACC and GACCACCTTTT
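The two steps can be sketched roughly as follows. This is not BLAST's actual implementation: the helper names `word_matches` and `extend`, the word length, the scores, and the drop-off threshold `xdrop` are all assumptions made for illustration, and the extension here is ungapped only.

```python
from collections import defaultdict


def word_matches(query, target, w):
    """Step 1: all (i, j) with query[i:i+w] == target[j:j+w], via a w-mer index."""
    index = defaultdict(list)
    for j in range(len(target) - w + 1):
        index[target[j:j + w]].append(j)
    return [(i, j)
            for i in range(len(query) - w + 1)
            for j in index.get(query[i:i + w], [])]


def extend(query, target, i, j, w, match=1, mismatch=-1, xdrop=3):
    """Step 2: ungapped extension of hit (i, j) in both directions.

    Returns (score, query_start, query_end) of the best-scoring extension.
    """
    def walk(qi, tj, step):
        score = best = best_len = n = 0
        while 0 <= qi < len(query) and 0 <= tj < len(target):
            score += match if query[qi] == target[tj] else mismatch
            n += 1
            if score > best:
                best, best_len = score, n
            if best - score > xdrop:   # stop once the score falls too far below the best
                break
            qi += step
            tj += step
        return best, best_len

    right, right_len = walk(i + w, j + w, +1)
    left, left_len = walk(i - 1, j - 1, -1)
    return w * match + left + right, i - left_len, i + w + right_len


query, target = "TAGGACCTAACC", "GACCACCTTTT"
for i, j in word_matches(query, target, 4):
    print((i, j), extend(query, target, i, j, 4))
```

On the slide's example sequences with an assumed word length of 4, the word matches found are GACC and ACCT.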

5 Seed and Similarity
Example of a similarity and a single seed:
  tgcagaaatgcagaggca
  | || |    | ||||
  tacacaggcaccgaggag
Similarity: 101101000010111100
Seed: 11*1, weight = 3, span = 4
The seed detects (matches) this similarity.
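The definitions on this slide can be checked with a short sketch. The function names `similarity_string` and `detects` are hypothetical; the encoding (1 for a matching column, 0 for a mismatch, `*` for a don't-care seed position) follows the slide.

```python
def similarity_string(a: str, b: str) -> str:
    """Encode an ungapped alignment as 1 (match) or 0 (mismatch) per column."""
    return "".join("1" if x == y else "0" for x, y in zip(a, b))


def detects(seed: str, similarity: str) -> bool:
    """True if, at some offset, every '1' position of the seed falls on a match."""
    required = [k for k, c in enumerate(seed) if c == "1"]
    return any(
        all(similarity[start + k] == "1" for k in required)
        for start in range(len(similarity) - len(seed) + 1)
    )


sim = similarity_string("tgcagaaatgcagaggca", "tacacaggcaccgaggag")
seed = "11*1"
print(sim)                          # 101101000010111100
print(seed.count("1"), len(seed))   # weight 3, span 4
print(detects(seed, sim))           # True
```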

6 Seed Choice is Important
[Figure: a significant alignment and a seed match]

7 Seed Design: Previous Work
- Traditional seed: word (e.g. 11111111111)
- Discontiguous patterns of matching bases: [CR1993]; [MTL’02] {111010010100110111}
- Our work on a single discontiguous seed: [BKS’03]

8 Multiple Simultaneous Seeds
- Multiple simultaneous seeds are defined as a set of seeds: ∏ = {seed_1, seed_2, ..., seed_i, ..., seed_n}
- ∏ detects a similarity if at least one of its component seeds detects the similarity.
- Example: the simultaneous seeds {11*1, 1*11} detect the similarities 100110100001, 1000010110001, and 1101001011001.
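A minimal sketch of the "at least one component seed" rule, applied to the slide's own example. The helper names `seed_detects` and `seed_set_detects` are hypothetical.

```python
def seed_detects(seed: str, similarity: str) -> bool:
    """True if every '1' position of the seed lands on a match at some offset."""
    required = [k for k, c in enumerate(seed) if c == "1"]
    return any(
        all(similarity[start + k] == "1" for k in required)
        for start in range(len(similarity) - len(seed) + 1)
    )


def seed_set_detects(seeds, similarity: str) -> bool:
    """A seed set detects a similarity if any one of its seeds does."""
    return any(seed_detects(s, similarity) for s in seeds)


seeds = ["11*1", "1*11"]
for sim in ["100110100001", "1000010110001", "1101001011001"]:
    matching = [s for s in seeds if seed_detects(s, sim)]
    print(sim, seed_set_detects(seeds, sim), matching)
```

In this example the first similarity is detected only by 11*1, the second only by 1*11, and the third by both, which shows how the seeds in a set can complement each other.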

9 Multi-seed Design: Balance Sensitivity with Specificity
- Let A be the set of biologically meaningful alignments that are also seed matches (the overlap of the two sets in the figure).
- Sensitivity = |A| / |biologically meaningful alignments|
- Specificity = |A| / |seed matches|
- Ways to increase sensitivity:
  - Decrease the weight of a single seed
  - Use multiple seeds
- Both methods hurt specificity.
- Hypothesis: a set of multiple seeds gives a better tradeoff between sensitivity and specificity than a single seed.

10 Our Work: Design Multiple Simultaneous Seeds Efficiently
- Use a new local search method to optimize the seed set.
- Design an efficient algorithm to calculate conditional match probabilities.
- Empirically verify that multiple simultaneous seeds have a better tradeoff between sensitivity and specificity.

11 Multi-seed Design Problem
- Input:
  - Ungapped alignments sampled from two genomic DNA sequences
  - Resource constraints on the seeds: weight, span, number
- Goal: find a set of seeds ∏ that maximizes the detection probability Pr[∏ detects S] for a similarity S, where
  Pr[∏ detects S] = Pr[(seed_1 detects S) or (seed_2 detects S) or ... or (seed_n detects S)]

12 Outline
- Problem of multi-seed design
- Methods
  - Greedy covering algorithm
  - Compute conditional match probabilities
- Experiments and results
- Conclusion and future work

13 Computing Match Probability for Specified Seeds [BKS’03]
- Learn a kth-order Markov model M from the similarities.
- Build a DFA that accepts only strings containing a match to the given seeds.
- Use DP to compute the probability that the DFA accepts a string chosen randomly from model M.
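The DP idea can be sketched in a simplified setting. The code below is not the method of [BKS’03]: it assumes a 0th-order model in which each alignment column matches independently with probability `p` (rather than a learned kth-order Markov model), and it replaces the explicit DFA with an implicit automaton whose state is the most recent span-1 bits. The names `ends_with_match` and `detection_probability` are hypothetical.

```python
def ends_with_match(seeds, window):
    """True if some seed matches with its last position on the last bit of window."""
    for seed in seeds:
        if len(seed) <= len(window):
            tail = window[len(window) - len(seed):]
            if all(bit == 1 for c, bit in zip(seed, tail) if c == "1"):
                return True
    return False


def detection_probability(seeds, length, p):
    """Pr[seed set detects a random similarity of the given length] (i.i.d. model)."""
    keep = max(len(s) for s in seeds) - 1   # bits of history the automaton must remember
    states = {(): 1.0}                      # history -> probability, undetected strings only
    detected = 0.0                          # absorbing "accept" state
    for _ in range(length):
        nxt = {}
        for history, prob in states.items():
            for bit, pb in ((1, p), (0, 1.0 - p)):
                window = history + (bit,)
                if ends_with_match(seeds, window):
                    detected += prob * pb
                else:
                    key = window[-keep:] if keep else ()
                    nxt[key] = nxt.get(key, 0.0) + prob * pb
        states = nxt
    return detected


print(detection_probability(["11*1"], 64, 0.7))
print(detection_probability(["11*1", "1*11"], 64, 0.7))
```

Because detection is an absorbing event, the probability mass of a prefix that already contains a seed match never needs to be tracked further, which keeps the state space bounded by 2^(span-1) histories.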

14 Seek the Locally Optimal Set of Seeds
- Original local search
- Greedy covering algorithm: a faster local search strategy
  - Efficient computation of conditional match probability

15 Find Optimal Set of Seeds by Original Local Search
Seed space with span <= 8, weight = 3. Seed pairs shown in the figure:
  1*1***1, 1*****11   Pr = 0.70
  1**1**1, 1*****11   Pr = 0.67
  1***1*1, 1*****11   Pr = 0.75
  1****11, 1*****11   Pr = 0.71

16 Greedy Covering Algorithm
Design 3 simultaneous seeds {s_1, s_2, s_3}:
  s_1 = argmax_x Pr(x)
  s_2 = argmax_x Pr(x | ~s_1)
  s_3 = argmax_x Pr(x | ~{s_1, s_2})
Here Pr(x | ~s_1) is the probability that seed x detects a similarity given that s_1 does not.
[Figure: the similarity space, with the regions of similarities detected by s_1, s_2, and s_3]
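The greedy covering idea can be sketched on a finite sample of similarities. This is an empirical stand-in, not the presented method: the talk scores candidates with conditional match probabilities under a model, whereas this sketch simply counts how many still-undetected sample similarities each candidate detects. The names `candidate_seeds`, `detects`, and `greedy_cover` are hypothetical.

```python
from itertools import combinations


def candidate_seeds(weight, max_span):
    """All seeds of the given weight (>= 2) with span <= max_span; both ends are '1'."""
    seeds = []
    for span in range(weight, max_span + 1):
        for inner in combinations(range(1, span - 1), weight - 2):
            ones = {0, span - 1, *inner}
            seeds.append("".join("1" if k in ones else "*" for k in range(span)))
    return seeds


def detects(seed, similarity):
    required = [k for k, c in enumerate(seed) if c == "1"]
    return any(
        all(similarity[start + k] == "1" for k in required)
        for start in range(len(similarity) - len(seed) + 1)
    )


def greedy_cover(samples, candidates, n):
    """Pick n seeds; each maximizes detections among samples missed so far."""
    chosen, uncovered = [], list(samples)
    for _ in range(n):
        # Empirical analogue of argmax_x Pr(x | ~chosen)
        best = max(candidates, key=lambda s: sum(detects(s, x) for x in uncovered))
        chosen.append(best)
        uncovered = [x for x in uncovered if not detects(best, x)]
    return chosen, 1.0 - len(uncovered) / len(samples)


samples = ["100110100001", "1000010110001", "1101001011001"]
print(greedy_cover(samples, candidate_seeds(3, 4), 2))
```

Each round only has to score single seeds against what is left uncovered, rather than re-evaluating whole seed sets, which is the intuition behind the speedup over searching the space of seed sets directly.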

17 Calculate Conditional Match Probabilities
Challenge: how can the conditional probability be calculated efficiently?
- Seeds with small span: exact computation via DFAs
- Seeds with large span: Monte Carlo estimation
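For the Monte Carlo case, a minimal sketch might look like the following. It assumes similarities are drawn from a simple i.i.d. model with per-column match probability `p`, which is a simplification of the Markov model used in the talk; the function name `conditional_match_probability` and its parameters are hypothetical.

```python
import random


def detects(seed, similarity):
    required = [k for k, c in enumerate(seed) if c == "1"]
    return any(
        all(similarity[start + k] == "1" for k in required)
        for start in range(len(similarity) - len(seed) + 1)
    )


def conditional_match_probability(x, chosen, length, p, trials=20000, rng_seed=0):
    """Estimate Pr(x detects S | no seed in chosen detects S) by sampling."""
    rng = random.Random(rng_seed)
    undetected = hits = 0
    for _ in range(trials):
        sim = "".join("1" if rng.random() < p else "0" for _ in range(length))
        if any(detects(s, sim) for s in chosen):
            continue                 # condition on the chosen seeds missing S
        undetected += 1
        hits += detects(x, sim)
    if undetected == 0:
        raise ValueError("no sampled similarity escaped the chosen seeds")
    return hits / undetected


print(conditional_match_probability("1*11", ["11*1"], length=32, p=0.7))
```

The estimate uses only the samples that the already chosen seeds miss, so its accuracy degrades as those seeds become more sensitive and fewer samples survive the conditioning.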

18 Calculate Conditional Match Probability via DFA
- Pr(x | ~∏) = Pr(x and ~∏) / Pr(~∏), where ∏ is the set of seeds chosen so far
- Build the DFA corresponding to (x and ~∏) by using the cross product and complementation of DFAs.
- Efficiency: during the local search for the optimal single seed x, Pr(~∏) can be precomputed.
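The conditional probability on this slide can also be written in terms of plain detection probabilities. The notation below is an assumption for illustration: Π stands for the set of seeds already chosen, x for the candidate seed, and S for a random similarity. The second equality uses only the identity Pr(Π ∪ {x} detects S) = Pr(Π detects S) + Pr(x detects S and Π does not).

```latex
\[
\Pr(x \mid \lnot\Pi)
  \;=\; \frac{\Pr(x \land \lnot\Pi)}{\Pr(\lnot\Pi)}
  \;=\; \frac{\Pr(\Pi \cup \{x\}\ \text{detects}\ S) - \Pr(\Pi\ \text{detects}\ S)}
             {1 - \Pr(\Pi\ \text{detects}\ S)}
\]
```

The denominator does not depend on x, which is consistent with the slide's remark that it can be precomputed once while the local search varies x.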

19 Outline
- Problem of multi-seed design
- Methods
  - Greedy covering algorithm
  - Compute conditional match probabilities
- Experiments and results
- Conclusion and future work

20 Greedy Covering vs. Original Local Search
[Figure: detection probability]

21 Greedy Covering is Much Faster
When n = 5, on the same hardware platform (P4):
- Greedy covering needs 20 minutes.
- The original local search needs 2.4 hours.

22 Experimental Setup
- The ungapped alignments are sampled uniformly from human and mouse syntenies.
- For a specified seed set:
  - Sensitivity: the number of significant gapped alignments found by our BLAST-like alignment tool
  - False positive rate: approximated by the number of seed matches

23 Results: Verify the Hypothesis on Noncoding Sequences
seed weight | number of seeds | gapped alignments found (sensitivity) | % improvement in sensitivity | total seed matches (approximation of false positives)
11          | 1               | 251941                                | ----                         | 1.57 x 10^9
10          | 1               | 273831                                | 8.7                          | 5.88 x 10^9
11          | 3               | 292093                                | 15.9                         | 4.56 x 10^9
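The percentage column can be reproduced from the raw counts on this slide. The sketch below uses only the numbers shown in the results; the function name `improvement_percent` is hypothetical, and the single weight-11 seed is taken as the baseline.

```python
rows = [
    # (seed weight, number of seeds, gapped alignments found, total seed matches)
    (11, 1, 251941, 1.57e9),
    (10, 1, 273831, 5.88e9),
    (11, 3, 292093, 4.56e9),
]
base_found, base_matches = rows[0][2], rows[0][3]


def improvement_percent(found, baseline):
    """Relative gain in alignments found over the baseline, in percent."""
    return 100.0 * (found - baseline) / baseline


for weight, n_seeds, found, matches in rows[1:]:
    print(weight, n_seeds,
          round(improvement_percent(found, base_found), 1),   # % sensitivity gain
          round(matches / base_matches, 2))                   # seed matches relative to baseline
```

Compared with lowering the weight to 10, the three weight-11 seeds find more gapped alignments (15.9% vs. 8.7% improvement) while producing fewer total seed matches (4.56 x 10^9 vs. 5.88 x 10^9).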

24 Summary of Contributions
- Efficient algorithms to design multiple simultaneous seeds at reasonable cost
- Empirical verification: multiple simultaneous seeds have a better tradeoff between sensitivity and specificity

25 Future Work
- Design a better evaluation platform for different seeds
- Investigate the utility of seeds in multiple sequence alignment

26 Acknowledgements
- Dr. Jeremy Buhler (advisor), Ben Westover, Rachel Nordgren, Joseph Lancaster, and Christopher Swope
- Laboratory for Computational Genomics at Washington University in Saint Louis
- http://www.cse.wustl.edu/~jbuhler/mandala

