Designing Multiple Simultaneous Seeds for DNA Similarity Search Yanni Sun, Jeremy Buhler Washington University in Saint Louis.

Presentation transcript:

Slide 1: Designing Multiple Simultaneous Seeds for DNA Similarity Search
Yanni Sun, Jeremy Buhler
Washington University in Saint Louis

WashU. Laboratory for Computational Genomics
Slide 2: Outline
- Problem of multi-seed design
- Methods
  - Greedy covering algorithm
  - Compute conditional match probabilities
- Experiments and results
- Conclusion and future work

Slide 3: Sequence Alignment
- Functional regions are conserved despite DNA mutations over time.
- A conserved region can be aligned with a high score.
- Exact solution: dynamic programming (DP), with time complexity O(MN).
- Fast but heuristic solution: seeded alignment algorithm.
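The O(MN) cost of the exact DP solution can be seen in a minimal sketch. The slide does not say which DP variant is meant; the code below assumes a Smith-Waterman-style local alignment with linear gap costs, and the scoring values (match=1, mismatch=-1, gap=-2) and the function name are illustrative assumptions only.

```python
def local_alignment_score(a: str, b: str, match: int = 1,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of a and b (Smith-Waterman style).

    Fills an (M+1) x (N+1) table one row at a time, so the running time
    is O(MN) while only two rows are kept in memory.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Every cell of the table is visited once, which is why this approach is too slow for genome-scale comparison and motivates the seeded heuristic.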

Slide 4: Seeded Alignment Algorithm
- BLAST is the most popular tool.
- Step 1: word match.
- Step 2: extend the match to find the high-similarity pair.
- Example sequences: TAGGACCTAACC and GACCACCTTTT

Slide 5: Seed and Similarity
Example of a similarity and a single seed:

  tgcagaaatgcagaggca
  | || |    | ||||
  tacacaggcaccgaggag

- Similarity: 101101000010111100
- Seed: 11*1, weight = 3, span = 4
- The seed detects (matches) this similarity.
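The seed test on this slide can be sketched in a few lines. The function names are hypothetical, and the encoding is an assumption drawn from the example: a similarity is written as a string with '1' at matching columns and '0' elsewhere, and a seed requires a '1' wherever it has a '1' while '*' positions are ignored.

```python
def similarity_string(a: str, b: str) -> str:
    """Encode an ungapped alignment as '1' (match) / '0' (mismatch) per column."""
    if len(a) != len(b):
        raise ValueError("ungapped alignment rows must have equal length")
    return "".join("1" if x == y else "0" for x, y in zip(a, b))


def seed_hits(seed: str, similarity: str) -> list[int]:
    """Return every start offset at which the seed matches the similarity."""
    required = [i for i, c in enumerate(seed) if c == "1"]  # '*' = don't care
    last_start = len(similarity) - len(seed)
    return [start for start in range(last_start + 1)
            if all(similarity[start + i] == "1" for i in required)]


def detects(seed: str, similarity: str) -> bool:
    """A seed detects a similarity if it matches at one or more offsets."""
    return bool(seed_hits(seed, similarity))


sim = similarity_string("tgcagaaatgcagaggca", "tacacaggcaccgaggag")
print(sim)                     # 101101000010111100
print(seed_hits("11*1", sim))  # [2, 12]
```

The weight of a seed is its number of '1' positions and the span is its total length, so 11*1 has weight 3 and span 4.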

Slide 6: Seed Choice is Important
[Figure labels: significant alignment; seed match]

Slide 7: Seed Design: Previous Work
- Traditional seed: word (e.g. )
- Discontiguous patterns of matching bases: [CR1993]; [MTL’02] { }
- Our work on single discontiguous seed: [BKS’03]

Slide 8: Multiple Simultaneous Seeds
- Multiple simultaneous seeds are defined as a set of seeds: ∏ = {seed_1, seed_2, ..., seed_i, ..., seed_n}
- ∏ detects a similarity if at least one of its component seeds detects the similarity.
- Example: the simultaneous seeds {11*1, 1*11} detect similarities , ,
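The "at least one component seed" rule can be sketched as follows. The function names and the three test similarities are made up for illustration (they are not the examples from the slide), and similarities are assumed to be 0/1 strings with '1' marking a matching column.

```python
def seed_detects(seed: str, similarity: str) -> bool:
    """True if the seed matches the 0/1 similarity string at some offset."""
    required = [i for i, c in enumerate(seed) if c == "1"]
    return any(all(similarity[start + i] == "1" for i in required)
               for start in range(len(similarity) - len(seed) + 1))


def seed_set_detects(seeds: list[str], similarity: str) -> bool:
    """A seed set detects a similarity if any one of its seeds does."""
    return any(seed_detects(seed, similarity) for seed in seeds)


seeds = ["11*1", "1*11"]
for sim in ("1101", "1011", "1001"):  # hypothetical similarities
    print(sim, seed_set_detects(seeds, sim))
```

Here 1101 is found only by 11*1 and 1011 only by 1*11, which illustrates how a second seed can pick up similarities the first one misses.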

Slide 9: Multi-seed Design – Balance Sensitivity with Specificity
- A: the biologically meaningful alignments that are also seed matches
- Sensitivity = A / biologically meaningful alignments
- Specificity = A / seed matches
- Increase sensitivity:
  - Decrease the weight of a single seed
  - Use multiple seeds
- Both methods hurt specificity.
- Hypothesis: a set of multiple seeds has a better tradeoff of sensitivity vs. specificity than a single seed.
[Figure: two overlapping sets, biologically meaningful alignments and seed matches, with overlap A]
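Written out as set ratios, the two measures on this slide look as follows. The symbols M (the set of biologically meaningful alignments) and H (the set of seed matches) are introduced here only for illustration; A is the overlap shown in the figure.

```latex
\[
A = M \cap H, \qquad
\text{sensitivity} = \frac{|A|}{|M|}, \qquad
\text{specificity} = \frac{|A|}{|H|}
\]
```

Lowering the seed weight or adding seeds enlarges H. That can only increase |A|, and with it sensitivity, but specificity drops whenever H grows faster than A.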

Slide 10: Our Work – Design Multiple Simultaneous Seeds Efficiently
- Use a new local search method to optimize the seed set.
- Design an efficient algorithm to calculate conditional match probability.
- Empirical verification that multiple simultaneous seeds have a better tradeoff of sensitivity vs. specificity.

Slide 11: Multi-seed Design Problem
- Input:
  - Ungapped alignments sampled from two genomic DNA sequences
  - Resource constraints on the seeds: weight, span, number
- Goal: find a set of seeds ∏ that maximizes the detection probability Pr[∏ detects S] for a similarity S, where
  Pr[∏ detects S] = Pr[(seed_1 detects S) or (seed_2 detects S) or ... or (seed_n detects S)]

Slide 12: Outline
- Problem of multi-seed design
- Methods
  - Greedy covering algorithm
  - Compute conditional match probabilities
- Experiments and results
- Conclusion and future work

Slide 13: Computing Match Probability for Specified Seeds [BKS ’03]
- Learn a kth-order Markov model M from similarities.
- Build a DFA that accepts only strings containing a match to the given seeds.
- Compute, by DP, the probability that the DFA accepts a string chosen randomly from model M.
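The DP step can be illustrated with a deliberately simplified sketch. It is not the [BKS ’03] construction: it assumes a 0th-order model in which every column matches independently with probability p_match (instead of a learned kth-order Markov model), and it uses the most recent bits as the automaton state instead of an explicit DFA. The function name and parameters are hypothetical.

```python
def hit_probability(seeds: list[str], length: int, p_match: float) -> float:
    """Probability that one or more seeds match a random 0/1 similarity.

    Simplifying assumption: each of the `length` columns is '1'
    independently with probability p_match.
    """
    k = max(len(s) for s in seeds) - 1  # history needed to test the next bit
    required = [(len(s), [i for i, c in enumerate(s) if c == "1"])
                for s in seeds]
    # alive[h] = probability of having seen no hit so far, with recent bits h
    alive = {"": 1.0}
    for _ in range(length):
        nxt: dict[str, float] = {}
        for hist, pr in alive.items():
            for bit, pb in (("1", p_match), ("0", 1.0 - p_match)):
                window = hist + bit
                n = len(window)
                # Does some seed match a window ending at the current column?
                hit = any(n >= span and
                          all(window[n - span + i] == "1" for i in req)
                          for span, req in required)
                if not hit:  # on a hit the mass is absorbed as "detected"
                    key = window[-k:] if k > 0 else ""
                    nxt[key] = nxt.get(key, 0.0) + pr * pb
        alive = nxt
    return 1.0 - sum(alive.values())


print(hit_probability(["11*1"], 4, 0.5))  # 0.125
```

The state space grows as 2^(span-1) in this form, which hints at why exact computation is reserved for seeds with small span.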

Slide 14: Seek the Locally Optimal Set of Seeds
- Original local search
- Greedy covering algorithm – a faster local search strategy
  - Efficient computation of conditional match probability

Slide 15: Find Optimal Set of Seeds by Original Local Search
Seed space with span <= 8, weight = 3. Seed pairs and their detection probabilities:
  {1*1***1, 1*****11}  Pr = 0.70
  {1**1**1, 1*****11}  Pr = 0.67
  {1***1*1, 1*****11}  Pr = 0.75
  {1****11, 1*****11}  Pr = 0.71

Slide 16: Greedy Covering Algorithm
- Design 3 simultaneous seeds: {s_1, s_2, s_3}
  - s_1 = argmax_x Pr(x)
  - s_2 = argmax_x Pr(x | ~s_1)
  - s_3 = argmax_x Pr(x | ~{s_1, s_2})
[Figure: similarity space, with regions for the similarities detected by s_1, s_2 and s_3]
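The greedy covering idea can be sketched on a finite sample of similarities. This is an empirical stand-in, not the method from the talk: the conditional probabilities are replaced by counts over the similarities that no chosen seed has detected yet, and the candidate space is enumerated exhaustively. All function names are hypothetical, similarities are assumed to be 0/1 strings, and `candidate_seeds` assumes weight >= 2.

```python
from itertools import combinations


def detects(seed: str, similarity: str) -> bool:
    """True if the seed matches the 0/1 similarity string at some offset."""
    required = [i for i, c in enumerate(seed) if c == "1"]
    return any(all(similarity[start + i] == "1" for i in required)
               for start in range(len(similarity) - len(seed) + 1))


def candidate_seeds(weight: int, max_span: int) -> list[str]:
    """All seeds of the given weight (>= 2) with span <= max_span."""
    seeds = []
    for span in range(weight, max_span + 1):
        # First and last positions are always '1'; choose the interior ones.
        for inner in combinations(range(1, span - 1), weight - 2):
            chars = ["*"] * span
            for i in (0, span - 1, *inner):
                chars[i] = "1"
            seeds.append("".join(chars))
    return seeds


def greedy_cover(similarities: list[str], candidates: list[str],
                 n_seeds: int) -> tuple[list[str], float]:
    """Pick seeds one at a time, each maximizing newly detected similarities."""
    uncovered = list(similarities)
    chosen: list[str] = []
    for _ in range(n_seeds):
        remaining = [s for s in candidates if s not in chosen]
        if not remaining or not uncovered:
            break
        # Counting hits among uncovered similarities plays the role of
        # argmax_x Pr(x | ~{seeds chosen so far}).
        best = max(remaining,
                   key=lambda s: sum(detects(s, x) for x in uncovered))
        chosen.append(best)
        uncovered = [x for x in uncovered if not detects(best, x)]
    return chosen, 1.0 - len(uncovered) / len(similarities)
```

Each round costs one pass over the candidate space, rather than a joint search over all n seeds at once.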

Slide 17: Calculate Conditional Match Probabilities
- Challenge: how to calculate the conditional probability efficiently?
  - Seeds with small span: exact computation via DFAs
  - Seeds with large span: Monte Carlo
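A Monte Carlo estimate of a conditional match probability might look like the sketch below. The slide gives no details of the sampling scheme, so everything here is an assumption: similarities are drawn from an independent per-column match model (rather than a learned Markov model), and the function and parameter names are hypothetical.

```python
import random


def detects(seed: str, similarity: str) -> bool:
    """True if the seed matches the 0/1 similarity string at some offset."""
    required = [i for i, c in enumerate(seed) if c == "1"]
    return any(all(similarity[start + i] == "1" for i in required)
               for start in range(len(similarity) - len(seed) + 1))


def monte_carlo_conditional(x: str, chosen: list[str], length: int,
                            p_match: float, trials: int,
                            rng: random.Random) -> float:
    """Estimate Pr(x detects S | no seed in `chosen` detects S) by sampling."""
    missed = 0       # samples that every already-chosen seed fails to detect
    hit = 0          # ... of which the candidate seed x does detect
    for _ in range(trials):
        sim = "".join("1" if rng.random() < p_match else "0"
                      for _ in range(length))
        if any(detects(s, sim) for s in chosen):
            continue  # outside the conditioning event
        missed += 1
        hit += detects(x, sim)
    return hit / missed if missed else float("nan")


rng = random.Random(0)
print(monte_carlo_conditional("1*1**1", ["11*1"], 64, 0.7, 5000, rng))
```

The cost depends on the number of samples and the span only linearly, which is the appeal for seeds whose automata would be too large to build exactly. The estimate becomes noisy when the chosen seeds already detect almost every sample, because few samples remain in the conditioning event.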

Slide 18: Calculate Conditional Match Probability via DFA
- Pr(x | ~S) = Pr(x and ~S) / Pr(~S), where S is the set of seeds already chosen and ~S is the event that none of them detects the similarity.
- Build the DFA corresponding to (x and ~S) by using the cross product and complementation of DFAs.
- Efficiency: during the local search for the optimal single seed x, Pr(~S) can be precomputed.
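Reading the conditional probability on this slide as Pr(x | ~S), with S the set of seeds already chosen (an interpretation based on the greedy covering slide), a short derivation shows what the precomputation buys. Below, \Pr(S) is shorthand, introduced here, for the probability that at least one seed in S detects the similarity.

```latex
\[
\Pr(x \mid \lnot S)
  = \frac{\Pr(x \land \lnot S)}{\Pr(\lnot S)}
  = \frac{\Pr(S \cup \{x\}) - \Pr(S)}{1 - \Pr(S)}
\]
```

The second step uses the fact that the event "S or x detects" splits into the disjoint events "S detects" and "x detects but S does not". Since \Pr(S) does not depend on the candidate x, it is computed once per greedy round, and choosing the x with the largest conditional probability is the same as choosing the x that maximizes \Pr(S \cup \{x\}).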

Slide 19: Outline
- Problem of multi-seed design
- Methods
  - Greedy covering algorithm
  - Compute conditional match probabilities
- Experiments and results
- Conclusion and future work

Slide 20: Greedy Covering vs. Original Local Search
[Figure: detection probability]

Slide 21: Greedy Covering is Much Faster
- When n = 5, on the same hardware platform (P4):
  - Greedy covering needs 20 minutes.
  - The original local search needs 2.4 hours.

Slide 22: Experimental Setup
- The ungapped alignments are sampled uniformly from human and mouse syntenies.
- For a specified seed set:
  - Sensitivity: the number of significant gapped alignments found by our BLAST-like alignment tool
  - False positive rate: approximated by the number of seed matches

Slide 23: Results: Verify the Hypothesis on Noncoding Sequences
[Table columns: seed weight | number of seeds | # gapped alignments found (sensitivity) | % improvement of sensitivity | total seed matches (approximation of false positives). Surviving cell fragments: x x x10 9]

Slide 24: Summary of Contributions
- Efficient algorithms to design multiple simultaneous seeds at reasonable cost
- Empirical verification: multiple simultaneous seeds have a better tradeoff between sensitivity and specificity

Slide 25: Future Work
- Design a better evaluation platform for different seeds
- Investigate the utility of seeds in multiple sequence alignment

Slide 26: Acknowledgements
- Dr. Jeremy Buhler (advisor), Ben Westover, Rachel Nordgren, Joseph Lancaster and Christopher Swope
- Laboratory for Computational Genomics at Washington University in Saint Louis