Seeds for Similarity Search Presentation by: Anastasia Fedynak.

Homology Search
- Homology search consumes 10% of the world’s supercomputing time
- The NCBI Blast server processes 10^5 queries/day
- GenBank doubles in size every 18 months
- Completed genomes: human, mouse, rice, fly, etc.
- Software must be scalable to large datasets

Homology Search Tools
- Identify short seed matches (k consecutive bases) between DNA sequences, which are then extended
- BLAST, FASTA: too slow and miss many alignments
- Smith-Waterman DP: too slow
- MegaBlast: high speed, works well for highly similar sequences
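The seed-matching step described above can be sketched as a k-mer index lookup. This is only an illustration of the general idea, not the implementation of any of the tools named; the function name `seed_hits` and the example sequences are made up for this sketch.

```python
from collections import defaultdict

def seed_hits(query, target, k):
    """Return (query_pos, target_pos) pairs where k consecutive bases agree."""
    # Index every k-mer of the target by its start position.
    index = defaultdict(list)
    for t in range(len(target) - k + 1):
        index[target[t:t + k]].append(t)
    # Look up every k-mer of the query in the index.
    hits = []
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            hits.append((q, t))
    return hits

# Each hit would be the starting point for an extension step (not shown).
print(seed_hits("ACGTAC", "TTACGTT", 4))
```

Each reported pair is a candidate starting point; the extension phase decides whether it grows into a real alignment.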

Discontiguous Seeds
- Require matching pairs of bases at a subset of positions
- Califano and Rigoutsos (1993): random discontiguous patterns in FLASH
- Buhler (2001): sensitivity of random patterns in the LSH-ALL-PAIRS comparison algorithm
- Blastz, underlying the PipMaker program (2000)
- PatternHunter (Ma, Tromp, and Li, 2002)

Resource-constrained paradigm of seed design
Given a collection of ungapped genomic sequence similarities of fixed length l, modeled by a kth-order Markov model M, find n seeds π_1 … π_n such that the probability of detecting a similarity is maximized.

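Problem Definition
- Let C be a collection of ungapped genomic sequence similarities of l bases, each written as a bit string: 1 = match, 0 = mismatch
- A similarity is the starting point for gapped extension
- Example: aligning AATGC with ATTAC gives the similarity 10101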
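A minimal sketch of how an ungapped alignment is turned into a similarity bit string, using the two sequences from the slide. The function name `similarity_string` is an assumption made for illustration.

```python
def similarity_string(a, b):
    """Encode an ungapped alignment of two equal-length sequences as 0/1 bits."""
    if len(a) != len(b):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    # 1 where the aligned bases agree, 0 where they differ.
    return "".join("1" if x == y else "0" for x, y in zip(a, b))

print(similarity_string("AATGC", "ATTAC"))  # 10101
```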

Problem Definition
- Similarity is modeled by a kth-order Markov process M, which gives the probability that the next bit seen will be a 1 (match)
- Coding regions exhibit the pattern {1, 1, 0}: protein-coding sequence with silent mutations at the 3rd base position of the codon

Problem Definition
- Devise a seed π, an ordered list of w positions {x_1 … x_w}, with weight w and span s
- Example: π = {1,3,4,6,7} has w = 5, s = 7
- π detects S iff there is an offset j such that S[j + x_i] = 1 for 1 ≤ i ≤ w, i.e. at offset j, S must contain a matching base at every position of π
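The detection condition can be checked directly by sliding the seed along the similarity. The sketch below assumes 1-based seed positions (as in the example π = {1,3,4,6,7}) and a Python string of '0'/'1' characters for S; the name `detects` is chosen for illustration.

```python
def detects(seed, similarity):
    """True iff some offset j has similarity[j + x] == '1' for every seed position x."""
    span = max(seed)
    for j in range(len(similarity) - span + 1):
        # Seed positions are 1-based, Python strings are 0-based.
        if all(similarity[j + x - 1] == "1" for x in seed):
            return True
    return False

seed = (1, 3, 4, 6, 7)               # weight 5, span 7
print(detects(seed, "01011011"))     # hit at offset 1
print(detects(seed, "1111101"))      # position 6 is a mismatch, no hit
```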

Problem Definition
Find a seed π that maximizes sensitivity to S drawn from model M, i.e. maximize the detection probability Pr_{S~M}[π detects S].

Selecting Good Seeds
Seed length is determined by a tradeoff between speed and sensitivity:
1. Larger k = faster, lower sensitivity
2. Smaller k = slower, higher sensitivity
Blast uses k consecutive letters as seeds: k = 11 in Blastn and k = 28 in MegaBlast

Selecting Good Seeds
- INDEPENDENCE: the probabilities of matches at different offsets are not independent
- Generally, the fewer bases a seed shares with its shifted copies, the higher the sensitivity
- Consecutive models → low sensitivity

PatternHunter
- Optimal model via DP: w = 11, s = 18
- A shifted copy shares 5 bases

Spaced vs. Consecutive Seeds
LEMMA: The expected number of hits of a seed with weight w and span s within a length-l region of similarity p (0 ≤ p ≤ 1) is (l – s + 1) p^w.
Example: in a region of length 64 and similarity 0.7
[Table: columns Pr(1H), # hits; rows π_c, π_ph]
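The lemma can be evaluated directly for the example region (length 64, similarity 0.7). The sketch assumes a consecutive seed of weight 11, whose span equals its weight, and the spaced seed with w = 11, s = 18 from the PatternHunter slide; `expected_hits` is a name chosen for illustration.

```python
def expected_hits(length, span, weight, p):
    """Expected number of seed hits in a region: (l - s + 1) * p**w."""
    return (length - span + 1) * p ** weight

consecutive = expected_hits(64, 11, 11, 0.7)  # span == weight for a consecutive seed
spaced = expected_hits(64, 18, 11, 0.7)       # w = 11, s = 18
print(f"consecutive: {consecutive:.3f}  spaced: {spaced:.3f}")
```

Under these assumptions the two seeds have nearly the same expected number of hits (about 1.07 versus 0.93), so any difference in sensitivity has to come from how the hits are distributed across regions rather than from how many there are.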

Quality Comparison

Performance Comparison
Seq1             Seq2             PH             MB28            Blastn
M. pneu (828K)   M. gen (529K)    10s / 65M      1s / 88M        47s / 45M
A. thal (19.6M)  A. thal (17.5M)  5020s / 279M   21720s / 1087M  ∞
H. sap (35M)     H. sap (26.2M)   14512s / 419M  ∞               ∞

Mandala – Seed Selection
- Let π = {x_1 … x_w} be the current seed
- Define the local neighbourhood of π as the set of all seeds π’ that differ from π in one position
- Hill climbing with random restarts is used to find a near-optimal seed
- Evaluation is based on a probability calculation
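A minimal sketch of hill climbing with random restarts over single-position neighbours. It is not Mandala's code, and it makes several simplifying assumptions: matches are i.i.d. with probability p rather than drawn from a kth-order Markov model, the first and last seed positions are held fixed so the span stays s, and the first improving neighbour is accepted. All function names are hypothetical.

```python
import random

def detection_probability(seed, length, p):
    """Exact Pr[seed hits a random 0/1 string of given length], i.i.d. match prob p."""
    span = max(seed)
    mask = sum(1 << (span - x) for x in seed)   # newest bit is bit 0
    keep = (1 << (span - 1)) - 1
    alive = {0: 1.0}                            # undetected mass, keyed by last span-1 bits
    for _ in range(length):
        nxt = {}
        for state, mass in alive.items():
            for bit, pr in ((1, p), (0, 1.0 - p)):
                window = (state << 1) | bit
                if window & mask == mask:
                    continue                    # seed hit: mass leaves the undetected pool
                key = window & keep
                nxt[key] = nxt.get(key, 0.0) + mass * pr
        alive = nxt
    return 1.0 - sum(alive.values())

def improve(seed, pr, span, length, p):
    """Return the first neighbour (one interior position moved) that beats pr, or None."""
    for old in sorted(seed - {1, span}):
        for new in range(2, span):
            if new not in seed:
                cand = (seed - {old}) | {new}
                cand_pr = detection_probability(cand, length, p)
                if cand_pr > pr + 1e-12:
                    return cand, cand_pr
    return None

def hill_climb(weight, span, length, p, restarts=5, rng=None):
    rng = rng or random.Random(0)
    best_seed, best_pr = None, -1.0
    for _ in range(restarts):
        # Random start: endpoints fixed, interior positions sampled.
        seed = frozenset({1, span} | set(rng.sample(range(2, span), weight - 2)))
        pr = detection_probability(seed, length, p)
        step = improve(seed, pr, span, length, p)
        while step is not None:
            seed, pr = step
            step = improve(seed, pr, span, length, p)
        if pr > best_pr:
            best_seed, best_pr = seed, pr
    return sorted(best_seed), best_pr

print(hill_climb(weight=4, span=7, length=20, p=0.7))
```

The exact evaluator tracks 2^(s-1) states, so this version is only practical for short spans; it is meant to show the search loop, not to scale.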

Detection Probabilities
- The detection probability computation encodes the overlap structure of a seed into a DFA
- DP computes the probability that the DFA accepts a random similarity of length l from the kth-order Markov model M
- P(q, t, δ·b) is the probability of reaching state q after reading t bits of an input S, the last k+1 of which are δ·b
- For a state q, let Φ_b(q) be the set of all states that transition to q on bit b
$$P(q, t, \delta \cdot b) = \Pr\left(S[t] = b \mid S[t-k' \ldots t-1] = \delta\right) \times \sum_{q' \in \Phi_b(q)} \sum_{b_0 \in \{0,1\}} P(q', t-1, b_0 \cdot \delta)$$

Performance Comparison – Non-Coding DNA Sequence
[Table: columns Seed, Pr[det], Alignments found, Time (s); seeds compared: π_c, π_ph, π_N]

Performance Comparison – Coding DNA Sequence
[Table: columns Seed, Pr[det], Alignments found, Time (s); seeds compared include π_c, π_ph, π_C]

Influence of Model Order
- M_5 model (solid line): exploits nearest-neighbor correlation
- M^c_5 model (dashed line): exploits correlation arising from codon structure

Multi-Seed Design – Why?
- Seed matching heuristics optimize a tradeoff between sensitivity (true positive rate) and specificity (1 – false positive rate)
- True positive: an alignment contains a seed match
- False positive: a match occurs by chance, with probability ~ 1/4^w for w bases
- Increasing w reduces π’s false positive rate, but lowers sensitivity

Multi-Seed Design – Why?
- Multiple seeds provide a more attractive way to trade sensitivity for specificity
- Use a set Π of seeds with weight w’ > w
- The expected number of chance matches is |Π|/4^w’
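A small worked example of the chance-match estimate |Π|/4^w’. It assumes uniformly distributed, independent bases, which is what the 1/4 per position implies; the function name is hypothetical.

```python
def expected_chance_matches(num_seeds, weight):
    """Chance-match rate per pair of positions for num_seeds seeds of equal weight."""
    return num_seeds / 4 ** weight

single = expected_chance_matches(1, 11)    # one seed of weight 11
multiple = expected_chance_matches(4, 12)  # four seeds of weight 12
print(single, multiple)
```

Under this estimate, four seeds of weight 12 produce the same chance-match rate as one seed of weight 11, so the comparison between the two designs comes down to sensitivity.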

Problem Definition
- A seed π matches alignment α: event E_π(α); a mismatch is the complementary event Ē_π(α)
- The match probability of π in M is Pr_{α~M}(E_π(α))
- A set Π matches α if at least one of its seeds matches: event E_Π(α)

Problem Definition
Find a set Π of n seeds that maximizes sensitivity to S drawn from model M, i.e. maximize the detection probability Pr_{S~M}[Π detects S].

Algorithms for Multi-Seed Design
1. Local approach, used in Mandala
2. Greedy covering
3. Beam search

Mandala’s Local Search Algorithm
- Given w and s, begin with a set Π_0 of n randomly chosen seeds with common w and s
- Choose i and j, where 1 ≤ i ≤ n and 2 ≤ j ≤ w, then find the best seed set Π_1 in the neighbourhood of Π_0 by deleting position x_j of the i-th seed π_i ∈ Π_0 and replacing it with a position between 1 and s-1 not currently inspected by π_i
- Iterate through i and j until no further improvement is possible

Greedy Heuristic for Computing Seed Sets
- Given a partial seed set Π, choose the next seed π that maximizes the conditional match probability under alignment model M: Pr(E_π | Ē_Π), i.e. the probability that π matches an alignment not already matched by some seed in the current set
- Start from a single locally optimal seed
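A minimal sketch of the greedy step under simplifying assumptions: matches are i.i.d. with probability p instead of following model M, every candidate seed has its first and last positions fixed, candidates are enumerated exhaustively (feasible only for tiny spans), and the first seed is the best single candidate rather than a locally optimal one. Because Pr(E_Π) is fixed while a round is evaluated, maximizing the joint probability Pr(E_{Π∪π}) picks the same seed as maximizing Pr(E_π | Ē_Π). Function names are hypothetical.

```python
from itertools import combinations

def set_detection_probability(seeds, length, p):
    """Pr that at least one seed hits a random 0/1 string, i.i.d. match prob p.

    Every seed is assumed to contain position 1.
    """
    span = max(max(s) for s in seeds)
    # One bit mask per seed over the most recent bits (newest bit is bit 0).
    masks = [sum(1 << (max(s) - x) for x in s) for s in seeds]
    keep = (1 << (span - 1)) - 1
    alive = {0: 1.0}  # undetected probability mass, keyed by the last span-1 bits
    for _ in range(length):
        nxt = {}
        for state, mass in alive.items():
            for bit, pr in ((1, p), (0, 1.0 - p)):
                window = (state << 1) | bit
                if any(window & m == m for m in masks):
                    continue  # some seed hit: mass leaves the undetected pool
                key = window & keep
                nxt[key] = nxt.get(key, 0.0) + mass * pr
        alive = nxt
    return 1.0 - sum(alive.values())

def greedy_seed_set(weight, span, n, length, p):
    """Grow a seed set one seed at a time, always adding the best next seed."""
    candidates = [(1, *mid, span) for mid in combinations(range(2, span), weight - 2)]
    chosen = []
    for _ in range(n):
        best = max((c for c in candidates if c not in chosen),
                   key=lambda c: set_detection_probability(chosen + [c], length, p))
        chosen.append(best)
    return chosen, set_detection_probability(chosen, length, p)

print(greedy_seed_set(weight=3, span=5, n=2, length=12, p=0.7))
```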

Extension to Beam Search
- Initially find a number of locally optimal single seeds
- The best b are saved and used in the next optimization round
- For each saved seed π_0, find N seeds π, each of which locally optimizes Pr(E_π | Ē_Π)
- The b seed pairs {π_0, π} with the highest match probability over all b·N pairs are again saved
- The best seed set overall is chosen

Performance

Computing Conditional Match Probabilities
1. Construct a DFA A_π that accepts alignments containing a seed match to π
2. By DP, compute the probability that A_π accepts a random alignment of length l from M
3. Compute Pr(E_π | Ē_Π) for seed π and set Π:
$$\Pr(E_\pi \mid \bar{E}_\Pi) = \frac{\Pr(E_{\Pi \cup \pi}) - \Pr(E_\Pi)}{1 - \Pr(E_\Pi)}$$
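The conditional probability formula can be sanity-checked on a tiny case by brute-force enumeration instead of the DFA-based DP. The sketch assumes i.i.d. matches with probability p (a simplification of model M), 1-based seed positions, and alignments given as tuples of 0/1; all names are hypothetical.

```python
from itertools import product

def detects(seed, bits):
    """True iff the seed (1-based positions) hits the 0/1 tuple at some offset."""
    span = max(seed)
    return any(all(bits[j + x - 1] == 1 for x in seed)
               for j in range(len(bits) - span + 1))

def event_probability(event, length, p):
    """Sum the probability of every 0/1 tuple of the given length satisfying event."""
    total = 0.0
    for bits in product((0, 1), repeat=length):
        if event(bits):
            ones = sum(bits)
            total += p ** ones * (1.0 - p) ** (length - ones)
    return total

def conditional_match_probability(seed, seed_set, length, p):
    """(Pr(E_{set ∪ seed}) - Pr(E_set)) / (1 - Pr(E_set))."""
    pr_set = event_probability(
        lambda s: any(detects(t, s) for t in seed_set), length, p)
    pr_union = event_probability(
        lambda s: detects(seed, s) or any(detects(t, s) for t in seed_set), length, p)
    return (pr_union - pr_set) / (1.0 - pr_set)

print(conditional_match_probability((1, 3, 4), [(1, 2, 4)], length=8, p=0.7))
```

Enumeration costs 2^l evaluations, so it only serves as a check on short alignments; the slide's DFA and DP steps are what make the computation feasible at realistic lengths.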

Detection Probabilities
1. Let π be a seed of weight w and span s
2. Let Q_π be the set of all s-bit strings matching π
3. Construct a trie T_π from the strings of Q_π
4. Convert T_π to a DFA A_π (Aho-Corasick algorithm) that accepts a similarity S iff π detects S
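The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the presented implementation: the DFA is built naively by searching for the longest suffix that is a trie node (which yields the same automaton as Aho-Corasick failure links but is slower), accepting states are made absorbing, and the acceptance probability uses i.i.d. matches with probability p rather than a kth-order Markov model. All names are hypothetical.

```python
from itertools import product

def matching_strings(seed):
    """Q_pi: all span-bit strings with a 1 at every (1-based) seed position."""
    span = max(seed)
    free = [x for x in range(1, span + 1) if x not in seed]
    strings = set()
    for bits in product("01", repeat=len(free)):
        s = ["1"] * span
        for pos, b in zip(free, bits):
            s[pos - 1] = b
        strings.add("".join(s))
    return strings

def build_dfa(seed):
    """Return (delta, accepting); states are the trie nodes, i.e. prefixes of Q_pi."""
    q_pi = matching_strings(seed)
    nodes = {w[:i] for w in q_pi for i in range(len(w) + 1)}

    def step(u, b):
        if u in q_pi:
            return u              # accepting states are absorbing
        v = u + b
        while v not in nodes:     # fall back to the longest suffix that is a trie node
            v = v[1:]
        return v

    delta = {(u, b): step(u, b) for u in nodes for b in "01"}
    return delta, q_pi

def accept_probability(seed, length, p):
    """Pr that the DFA accepts a random similarity of the given length."""
    delta, accepting = build_dfa(seed)
    dist = {"": 1.0}              # probability mass per DFA state
    for _ in range(length):
        nxt = {}
        for u, mass in dist.items():
            for b, pr in (("1", p), ("0", 1.0 - p)):
                v = delta[(u, b)]
                nxt[v] = nxt.get(v, 0.0) + mass * pr
        dist = nxt
    return sum(m for u, m in dist.items() if u in accepting)

print(sorted(matching_strings((1, 3, 4))))
print(accept_probability((1, 3, 4), length=6, p=0.5))
```

Because every string in Q_π has the same length s, the DFA reaches an accepting state exactly when the last s bits read form a string in Q_π, which is the detection condition for π.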