PatternHunter II: Highly Sensitive and Fast Homology Search Bioinformatics and Computational Molecular Biology (Fall 2005): Representation R94922059 林語君.

Presentation transcript:

Slide 1: PatternHunter II: Highly Sensitive and Fast Homology Search
Ming Li, Bin Ma, Derek Kisman, John Tromp
Presented by 林語君

Slide 2: Overview
Homology search
Local alignment algorithms
PH I
PH II
Multiple spaced seeds
  Computing hit probability
  Finding a good seed set
PH II
  Design
  Performance

Slide 3: Local alignment
Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987)
  SSearch
FastA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985)
BLAST (Altschul et al., 1990; Altschul et al., 1997)
  BLAST family: BLASTN, BLASTP, etc.
  MEGABLAST

Slide 4: PatternHunter
Seed
  Tradeoff: sensitivity vs. computation
Consecutive k letters
  k=11 in Blastn, k=28 in MegaBlast
Nonconsecutive k letters
  Spaced seed: a model with k required match positions; k is its weight
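As a minimal sketch of the seed model on this slide, the Python below encodes a seed as a string of 1s (positions that must match) and 0s (positions that are ignored), and an aligned region as a string of 1s (matches) and 0s (mismatches). The function names and the string encoding are assumptions made for illustration, not part of the slides; the weight-11 pattern used in the example is the one commonly cited for the original PatternHunter.

```python
def weight(seed: str) -> int:
    """Weight of a seed = number of required match positions (the 1s)."""
    return seed.count("1")


def hits_at(seed: str, region: str, pos: int) -> bool:
    """True if every required position of the seed lands on a match at offset pos."""
    return all(region[pos + j] == "1" for j, c in enumerate(seed) if c == "1")


def seed_hits(seed: str, region: str) -> bool:
    """True if the seed hits the region at any offset."""
    return any(hits_at(seed, region, i)
               for i in range(len(region) - len(seed) + 1))


if __name__ == "__main__":
    ph_seed = "111010010100110111"   # spaced seed, length 18
    print(weight(ph_seed))           # number of 1s in the pattern
    region = "0110110"               # 1 = match, 0 = mismatch
    print(seed_hits("111", region))  # consecutive seed of weight 3
    print(seed_hits("1101", region)) # spaced seed of weight 3
```

In the toy region above, the consecutive seed 111 finds no hit while the spaced seed 1101 of the same weight does, which is the kind of sensitivity gain the spaced-seed idea is after.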

Slide 5: PatternHunter II
Genome Informatics 14 (2003)
Extends the single optimized spaced seed of PH to multiple seeds
Speed: comparable to BLASTN (MEGABLAST)
Sensitivity: approaching Smith-Waterman (SSearch)

Slide 6: Definition
A homologous region, R
A seed hits R
A seed set A = {a_1, …, a_k} hits R
Similarity: R has p = x% identities
Sensitivity: hit probability
  Optimal (DP) = 1

Slide 7: Computing Hit Probability
NP-hard on multiple seeds
DP on 1 seed
Extend DP to multiple seeds

Slide 8: Computing Hit Probability of Multiple Seeds
Let A = {a_1, …, a_k} be a set of k seeds and R a random region of length L with similarity level p.
f(i, b): the probability that A hits R[0:i], given that the binary string b is a suffix of R[0:i]
Answer: f(L, ε), where ε is the empty string

Slide 9: Computing Hit Probability of Multiple Seeds

Slide 10: Computing Hit Probability of Multiple Seeds
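The hit probability of a seed set can be illustrated with a small dynamic program. The sketch below is not the recurrence f(i, b) from the slides: it is a simplified variant that scans the region left to right, keeps the probability mass of every "no hit so far" state keyed by the most recent bits, and discards mass as soon as some seed ends in a hit. The function names and the 0/1 string encoding are assumptions, and the state space grows exponentially with the longest seed, so it is only meant for short seeds.

```python
def ends_with_hit(seed: str, window: str) -> bool:
    """True if the seed, aligned to the end of window, sees a match at every 1."""
    if len(seed) > len(window):
        return False
    start = len(window) - len(seed)
    return all(window[start + j] == "1" for j, c in enumerate(seed) if c == "1")


def hit_probability(seeds, L: int, p: float) -> float:
    """Probability that at least one seed hits a random region of length L
    in which each position is a match independently with probability p."""
    span = max(len(s) for s in seeds)
    # state: last (span - 1) bits of the prefix -> probability of reaching
    # that prefix without any seed having hit yet
    states = {"": 1.0}
    for _ in range(L):
        nxt = {}
        for suffix, prob in states.items():
            for bit, q in (("1", p), ("0", 1.0 - p)):
                window = suffix + bit
                if any(ends_with_hit(s, window) for s in seeds):
                    continue  # a seed hit here: remove from the no-hit mass
                key = window[-(span - 1):] if span > 1 else ""
                nxt[key] = nxt.get(key, 0.0) + prob * q
        states = nxt
    return 1.0 - sum(states.values())


if __name__ == "__main__":
    # toy comparison at 70% similarity on a length-16 region
    print(hit_probability(["111"], 16, 0.7))
    print(hit_probability(["1101"], 16, 0.7))
    print(hit_probability(["111", "1101"], 16, 0.7))
```

Adding a seed to the set can only keep or raise the hit probability, since every region hit by the smaller set is still hit by the larger one.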

Slide 11: Finding a Good Seed Set
NP-hard for both optimal seed and multiple seeds
Greedy

Slide 12: Finding a Good Seed Set
Compute the 1st seed a_1 which maximizes the hit probability of {a_1}
Compute the 2nd seed a_2 which maximizes the hit probability of {a_1, a_2}
Repeat until one of the following:
  Reach the desired number of seeds
  Reach the desired hit probability
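The greedy procedure above can be sketched in a few lines of Python. Everything here is a toy under stated assumptions: the candidate pool (all seeds of a given weight up to a maximum span), the brute-force hit probability over all 2^L regions, and the function names are hypothetical choices for illustration, not PatternHunter II's actual search, which works at weight 11 and L = 64 where brute force is infeasible.

```python
from itertools import combinations, product


def candidate_seeds(weight: int, max_span: int):
    """All seeds with the given weight (>= 2) that start and end with 1
    and are at most max_span long."""
    seeds = []
    for span in range(weight, max_span + 1):
        for middle in combinations(range(1, span - 1), weight - 2):
            ones = {0, span - 1, *middle}
            seeds.append("".join("1" if i in ones else "0" for i in range(span)))
    return seeds


def seed_hits(seed: str, region: str) -> bool:
    return any(all(region[i + j] == "1" for j, c in enumerate(seed) if c == "1")
               for i in range(len(region) - len(seed) + 1))


def hit_probability(seeds, L: int, p: float) -> float:
    """Brute force over all 2^L match/mismatch strings; tiny L only."""
    total = 0.0
    for bits in product("01", repeat=L):
        region = "".join(bits)
        if any(seed_hits(s, region) for s in seeds):
            ones = region.count("1")
            total += p ** ones * (1.0 - p) ** (L - ones)
    return total


def greedy_seed_set(candidates, L, p, max_seeds, target=1.0):
    """Add one seed at a time, each maximizing the combined hit probability,
    until max_seeds seeds are chosen or the target probability is reached."""
    chosen, best = [], 0.0
    while len(chosen) < max_seeds and best < target:
        remaining = [c for c in candidates if c not in chosen]
        if not remaining:
            break
        nxt = max(remaining, key=lambda c: hit_probability(chosen + [c], L, p))
        chosen.append(nxt)
        best = hit_probability(chosen, L, p)
    return chosen, best


if __name__ == "__main__":
    seeds, prob = greedy_seed_set(candidate_seeds(3, 5), L=8, p=0.7, max_seeds=2)
    print(seeds, prob)
```

Because each step fixes the seeds already chosen, the result is not guaranteed to maximize the combined hit probability over all seed sets of that size.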

Slide 13: Finding a Good Seed Set
May not optimize the combined hit probability
Good enough
  Optimal: 16 weight, 11 seeds, L=64, similarity=70%, first four seeds: { , , , }
  Greedy: 16 weight, 12 seeds, L=64, similarity=70%, first four seeds: { , , , }

Slide 14: Performance of the seeds
Figure legend, curves from low to high:
  Solid: weight-11, k=1,2,4,8,16 seeds
  Dashed: 1 seed, weight=10,9,8,7

Slide 15: Performance of the seeds
Reducing the weight by 1
  Increases the expected number of hits by a factor of 4
Doubling the number of seeds
  Increases the expected number of hits by a factor of 2
Better: multiple seeds
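The two factors on this slide follow from a simple count of chance hits. The sketch below assumes unrelated DNA with uniform, independent bases, so each required seed position matches by chance with probability 1/4, and it ignores edge effects and overlap between seeds. The function name and parameters are hypothetical.

```python
def expected_random_hits(weight: int, num_seeds: int,
                         query_len: int, db_len: int) -> float:
    """Rough expected number of chance hits between two unrelated DNA
    sequences: one trial per (query offset, database offset, seed)."""
    return num_seeds * query_len * db_len * 0.25 ** weight


if __name__ == "__main__":
    base = expected_random_hits(11, 1, 1000, 1000)
    lighter = expected_random_hits(10, 1, 1000, 1000)   # weight reduced by 1
    doubled = expected_random_hits(11, 2, 1000, 1000)   # twice as many seeds
    print(lighter / base, doubled / base)
```

Under this model, lowering the weight and adding seeds both buy sensitivity, but an extra seed costs half as many additional chance hits as dropping one unit of weight.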

Slide 16: PH II Performance
Compared with BLAST (Blastn) and Smith-Waterman (SSearch)
Sensitivity of SSearch = 1
Alignment score
BLAST methods (hash, DP)
match=1, mismatch=-1, gapopen=-5, gapextension=-1
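To make the scoring parameters concrete, here is a small sketch that scores an already-aligned pair of sequences with match=1, mismatch=-1, gapopen=-5, gapextension=-1. It assumes the convention that a gap of length g costs gapopen + g * gapextension; the slide does not say which convention was used, so treat that, along with the function name, as an assumption.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap_open: int = -5, gap_extend: int = -1) -> int:
    """Score two aligned rows of equal length, with '-' marking gaps.
    Assumed gap cost: gap_open once per gap plus gap_extend per gap column."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    prev_gap_row = None  # which row held a gap in the previous column
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            raise ValueError("column with gaps in both rows")
        if x == "-" or y == "-":
            row = "a" if x == "-" else "b"
            if row != prev_gap_row:
                score += gap_open      # a new gap starts in this row
            score += gap_extend
            prev_gap_row = row
        else:
            score += match if x == y else mismatch
            prev_gap_row = None
    return score


if __name__ == "__main__":
    print(alignment_score("ACGTT", "ACGTT"))
    print(alignment_score("ACGTT", "A--TT"))
```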

Slide 17: PH II Performance
Figure legend, curves from low to high:
  Solid: PH II, 1, 2, 4, 8 seeds, weight 11
  Dashed: Blastn, seed weight 11

Slide 18: Complexity Proof
Finding optimal spaced seeds: NP-hard
Finding one optimal seed: NP-hard
Computing the hit probability of multiple seeds: NP-hard