Homology Search Tools Kun-Mao Chao (趙坤茂)

Slides:



Advertisements
Similar presentations
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Advertisements

Gapped BLAST and PSI-BLAST Altschul et al Presenter: 張耿豪 莊凱翔.
Bioinformatics Tutorial I BLAST and Sequence Alignment.
BLAST Sequence alignment, E-value & Extreme value distribution.
1 CAP5510 – Bioinformatics Database Searches for Biological Sequences or Imperfect Alignments Tamer Kahveci CISE Department University of Florida.
Sequence Alignment Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National Taiwan University, Taiwan
Seeds for Similarity Search Presentation by: Anastasia Fedynak.
Space/Time Tradeoff and Heuristic Approaches in Pairwise Alignment.
. Class 4: Fast Sequence Alignment. Alignment in Real Life u One of the major uses of alignments is to find sequences in a “database” u Such collections.
Sequence Alignment Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for similarities. Searching databases.
Heuristic alignment algorithms and cost matrices
Fa05CSE 182 L3: Blast: Keyword match basics. Fa05CSE 182 Silly Quiz TRUE or FALSE: In New York City at any moment, there are 2 people (not bald) with.
We continue where we stopped last week: FASTA – BLAST
Sequence Alignment vs. Database Task: Given a query sequence and millions of database records, find the optimal alignment between the query and a record.
From Pairwise Alignment to Database Similarity Search.
Heuristic Approaches for Sequence Alignments
Protein Sequence Comparison Patrice Koehl
Sequence alignment, E-value & Extreme value distribution
From Pairwise Alignment to Database Similarity Search.
Heuristic methods for sequence alignment in practice Sushmita Roy BMI/CS 576 Sushmita Roy Sep 27 th,
Speed Up DNA Sequence Database Search and Alignment by Methods of DSP
BLAT – The B LAST- L ike A lignment T ool Kent, W.J. Genome Res : Presenter: 巨彥霖 田知本.
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
PatternHunter: faster and more sensitive homology search By Bin Ma, John Tromp and Ming Li B 鍾承宏 B 王凱平 B 莊謹譽 B 張智翔 B
Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National.
Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.
Dynamic-Programming Strategies for Analyzing Biomolecular Sequences.
Dynamic Programming Method for Analyzing Biomolecular Sequences Tao Jiang Department of Computer Science University of California - Riverside (Typeset.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Computational Biology, Part 9 Efficient database searching methods Robert F. Murphy Copyright  1996, 1999, All rights reserved.
Database Searches BLAST. Basic Local Alignment Search Tool –Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990) –Altschul, Madden, Schaffer,
Sequence Alignment Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National Taiwan University, Taiwan
PatternHunter II: Highly Sensitive and Fast Homology Search Bioinformatics and Computational Molecular Biology (Fall 2005): Representation R 林語君.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
BLAST, which stands for basic local alignment search tool, is a heuristic algorithm that is used to find similar sequences of amino acids or nucleotides.
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
PatternHunter: A Fast and Highly Sensitive Homology Search Method Bin Ma Department of Computer Science University of Western Ontario.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Heuristic Methods for Sequence Database Searching BMI/CS 576 Colin Dewey Fall 2015.
Construction of Substitution matrices
Doug Raiford Phage class: introduction to sequence databases.
Heuristic Methods for Sequence Database Searching BMI/CS 576 Colin Dewey Fall 2010.
Heuristic Alignment Algorithms Hongchao Li Jan
Space-Saving Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National Taiwan.
Heuristic Methods for Sequence Database Searching BMI/CS 776 Mark Craven February 2002.
Alignments and Phylogenetic tree Reading: Introduction to Bioinformatics. Arthur M. Lesk. Fourth Edition Chapter 5.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
Homology Search Ming Li Canada Research Chair in Bioinformatics
Homology Search Tools Kun-Mao Chao (趙坤茂)
BLAST Anders Gorm Pedersen & Rasmus Wernersson.
paper study for class presentation on Nov16th, 2005 slider by 陳奕先
LSM3241: Bioinformatics and Biocomputing Lecture 4: Sequence analysis methods revisited Prof. Chen Yu Zong Tel:
Dynamic-Programming Strategies for Analyzing Biomolecular Sequences
Homology Search Tools Kun-Mao Chao (趙坤茂)
Fast Sequence Alignments
SMA5422: Special Topics in Biotechnology
Sequence Alignment Kun-Mao Chao (趙坤茂)
Sequence alignment, Part 2
Lecture #7: FASTA & LFASTA
Basic Local Alignment Search Tool (BLAST)
BIOINFORMATICS Fast Alignment
Space-Saving Strategies for Analyzing Biomolecular Sequences
PatternHunter: faster and more sensitive homology search
Space-Saving Strategies for Analyzing Biomolecular Sequences
Basic Local Alignment Search Tool
Homology Search Tools Kun-Mao Chao (趙坤茂)
Sequence alignment, E-value & Extreme value distribution
Multiple Sequence Alignment
Presentation transcript:

Homology Search Tools Kun-Mao Chao (趙坤茂) Department of Computer Science and Information Engineering National Taiwan University, Taiwan WWW: http://www.csie.ntu.edu.tw/~kmchao

Homology Search Tools Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987) FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985) BLAST (Altschul et al., 1990; Altschul et al., 1997) BLAT (Kent, 2002) PatternHunter (Li et al., 2004)

Finding Exact Word Matches Hash Tables Suffix Trees Suffix Arrays

Hash Tables

Suffix Trees (I)

Suffix Trees (II)

Suffix Arrays

FASTA Find runs of identities and identify regions with the highest density of identities. Re-score using PAM matrix and keep top scoring segments. Eliminate segments that are unlikely to be part of the alignment. Optimize the alignment in a band.

FASTA Step 1: Find runes of identities and identify regions with the highest density of identities. Sequence B Sequence A

FASTA Step 2: Re-score using PAM matrix and keep top scoring segments.

FASTA Step 3: Eliminate segments that are unlikely to be part of the alignment.

FASTA Step 4: Optimize the alignment in a band.

Band Alignment (Joint work with W. Pearson and W. Miller) Sequence A Sequence B

Band Alignment in Linear Space The remaining subproblems are no longer only half of the original problem. In worst case, this could cause an additional log n factor in time. W O(log n) O(nW)*(1+1+…+1) =O(nW log n)

Band Alignment in Linear Space

Splitting the problem into a few subproblems

Parallelogram

Parallelogram

Yet another partition line Band width W

Yet another partition line

Arbitrary region

Arbitrary region

BLAST Basic Local Alignment Search Tool (by Altschul, Gish, Miller, Myers and Lipman) The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.

The maximal segment pair measure A maximal segment pair (MSP) is defined to be the highest scoring pair of identical length segments chosen from 2 sequences. (for DNA: Identities: +5; Mismatches: -4) The MSP score may be computed in time proportional to the product of their lengths. (How?) An exact procedure is too time consuming. BLAST heuristically attempts to calculate the MSP score. the highest scoring pair

A matrix of similarity scores

A maximum-scoring segment

BLAST Build the hash table for Sequence A. Scan Sequence B for hits. Extend hits.

BLAST Step 1: Build the hash table for Sequence A. (3-tuple example) For DNA sequences: Seq. A = AGATCGAT 12345678 AAA AAC .. AGA 1 .. ATC 3 .. CGA 5 .. GAT 2 6 .. TCG 4 .. TTT For protein sequences: Seq. A = ELVIS Add xyz to the hash table if Score(xyz, ELV) ≧ T; Add xyz to the hash table if Score(xyz, LVI) ≧ T; Add xyz to the hash table if Score(xyz, VIS) ≧ T;

BLAST Step2: Scan sequence B for hits.

BLAST Step2: Scan sequence B for hits. Step 3: Extend hits. BLAST 2.0 saves the time spent in extension, and considers gapped alignments. hit Terminate if the score of the extension fades away. (That is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions.)

Gapped BLAST (I) The two-hit method

Gapped BLAST (II) Confining the dynamic-programming

BLAT

PatternHunter – Spaced Seed

Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 Define the Seed Defining the seed: w -> weight or number of positions to match Blastn: 11 MegaBlast: 28 model -> relative position of letters for each w l-> length of model “window”

Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 Seed Parameters: w = 11 letters: *, 1 1 1 1 * 1 * * 1 * 1 * * 1 1 * 1 1 1 l = 18 model 1 – exact match required * – no match required, any value Patternhunter most sensitive model Blastn seed is all “1”s

Consecutive vs. Nonconsecutive? Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 Consecutive vs. Nonconsecutive? The non-consecutive seed is the primary difference and strength of Patternhunter Blastn: 1 1 1 1 1 1 1 1 1 1 1 PatternHunter: 1 1 1 * 1 * * 1 * 1 * * 1 1 * 1 1 1

Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 Example: Consider the following two sequences: GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT || ||||||||| |||||||| |||||| |||||| GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT What’s the differences in finding the seed between Blast and PatternHunter?

BLAST uses “consecutive seeds” Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 BLAST uses “consecutive seeds” In BLAST, we often use the consecutive model with weight 11. GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT || ||||||||| |||||||| |||||| |||||| GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT → 11111111111 → … →… … → 11111111111 ← However, it fails to find the alignment in the two sequence.

Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 Consecutive seeds There’s also a dilemma for BLAST type of search. Dilemma Sensitivity – needs shorter seeds too many random hits, slow computation Speed – needs longer seeds lose distant homologies

PatternHunter uses “non-consecutive seed” Reference Bin Ma, John Tromp, Ming Li Bioinformatics Vol. 18 no. 3 2002 PatternHunter uses “non-consecutive seed” In PatternHunter, we often use the spaced model with weight 11 and length 18. GAGTACTCAACACCAACATCAGTGGGCAATGGAAAAT || ||||||||| |||||||| |||||| |||||| GAATACTCAACAGCAACATCAATGGGCAGCAGAAAAT 111*1**1*1**11*111

A trivial comparison between spaced and consecutive seed Reference Ming Li, NHC2005 A trivial comparison between spaced and consecutive seed Consider 111 and 11*1. To fail seed 111, we can use 110110110110… 66.66% similarity But we can prove, seed 11*1 will hit every region with 61% similarity for sufficient long region.

Proof Suppose there is a length 100 region which is not hit by 11*1. Reference Ming Li, NHC2005 Proof Suppose there is a length 100 region which is not hit by 11*1. Skipping the first 0’s, we can break the region into blocks of 1a0b. Besides the last block, the other blocks have the following few cases: 10b for b>=1 110b for b>=2 1110b for b>=2 In each block, similarity <= 3/5. The last block has at most 3 matches. So, in total there are at most 61 matches in 100 positions. The similarity is <=61%. floor(97*0.6)+3=61

PatternHunter (I)

Reference Ming Li, NHC2005 Formalize Given i.i.d. sequence (homology region) with Pr(1)=p and Pr(0)=1-p for each bit: 1100111011101101011101101011111011101 Which seed is more likely to hit this region: BLAST seed: 11111111111 Spaced seed: 111*1**1*1**11*111 111*1**1*1**11*111

Reference Ming Li, NHC2005 Expect Less, Get More Lemma: The expected number of hits of a weight W length M seed model within a length L region with homology level p is (L-M+1)pW Proof. E(#hits) = ∑i=1 … L-M+1 pW ■ Example: In a region of length 64 with p=0.7 Pr(BLAST seed hits)=0.3 E(# of hits by BLAST seed)=1.07 Pr(optimal spaced seed hits)=0.466, 50% more E(# of hits by spaced seed)=0.93, 14% less

Why Is Spaced Seed Better? Reference Ming Li, NHC2005 A wrong, but intuitive, proof: seed s, interval I, similarity p E(#hits) = Pr(s hits) E(#hits | s hits) Thus: Pr(s hits) = Lpw / E(#hits | s hits) For optimized spaced seed, E(#hits | s hits) 111*1**1*1**11*111 Non overlap Prob 111*1**1*1**11*111 6 p6 111*1**1*1**11*111 6 p6 111*1**1*1**11*111 6 p6 111*1**1*1**11*111 7 p7 ….. For spaced seed: the divisor is 1+p6+p6+p6+p7+ … For BLAST seed: the divisor is bigger: 1+ p + p2 + p3 + …

PatternHunter (II) 010111 (23)

Remarks Filtering is based on the observation that a good alignment usually includes short identical or very similar fragments. The idea of filtration was used in FASTA, BLAST, BLAT, and PatternHunter.