Gapped BLAST and PSI-BLAST : a new generation of protein database search programs Team2 邱冠儒 黃尹柔 田耕豪 蕭逸嫻 謝朝茂 莊閔傑 2014/05/12 1.

Slides:



Advertisements
Similar presentations
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Advertisements

Gapped BLAST and PSI-BLAST Altschul et al Presenter: 張耿豪 莊凱翔.
1 Genome information GenBank (Entrez nucleotide) Species-specific databases Protein sequence GenBank (Entrez protein) UniProtKB (SwissProt) Protein structure.
Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.
BLAST, PSI-BLAST and position- specific scoring matrices Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and.
Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.
Space/Time Tradeoff and Heuristic Approaches in Pairwise Alignment.
Searching Sequence Databases
. Class 4: Fast Sequence Alignment. Alignment in Real Life u One of the major uses of alignments is to find sequences in a “database” u Such collections.
Heuristic alignment algorithms and cost matrices
Multiple sequence alignment Conserved blocks are recognized Different degrees of similarity are marked.
Database searching. Purposes of similarity search Function prediction by homology (in silico annotation) Function prediction by homology (in silico annotation)
. Class 4: Fast Sequence Alignment. Alignment in Real Life u One of the major uses of alignments is to find sequences in a “database” u Such collections.
1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment III CIS 667 February 10, 2004.
Multiple sequence alignment Conserved blocks are recognized Different degrees of similarity are marked.
Heuristic Approaches for Sequence Alignments
Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 Sep 16 th, 2014.
Blast heuristics Morten Nielsen Department of Systems Biology, DTU.
Heuristic methods for sequence alignment in practice Sushmita Roy BMI/CS 576 Sushmita Roy Sep 27 th,
TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Pairwise Alignment How do we tell whether two sequences are similar? BIO520 BioinformaticsJim Lund Assigned reading: Ch , Ch 5.1, get what you can.
An Introduction to Bioinformatics
BLAST What it does and what it means Steven Slater Adapted from pt.
Protein Sequence Alignment and Database Searching.
BLAST Workshop Maya Schushan June 2009.
Scoring Matrices Scoring matrices, PSSMs, and HMMs BIO520 BioinformaticsJim Lund Reading: Ch 6.1.
Gapped BLAST and PSI- BLAST: a new generation of protein database search programs By Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Local alignment, BLAST and Psi-BLAST October 25, 2012 Local alignment Quiz 2 Learning objectives-Learn the basics of BLAST and Psi-BLAST Workshop-Use BLAST2.
Database Searches BLAST. Basic Local Alignment Search Tool –Altschul, Gish, Miller, Myers, Lipman, J. Mol. Biol. 215 (1990) –Altschul, Madden, Schaffer,
Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?
CISC667, F05, Lec9, Liao CISC 667 Intro to Bioinformatics (Fall 2005) Sequence Database search Heuristic algorithms –FASTA –BLAST –PSI-BLAST.
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
Chapter 3 Computational Molecular Biology Michael Smith
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schäffer, Jinghui.
Bioinformatics Ayesha M. Khan 9 th April, What’s in a secondary database?  It should be noted that within multiple alignments can be found conserved.
Tutorial 4 Substitution matrices and PSI-BLAST 1.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
BLAST, which stands for basic local alignment search tool, is a heuristic algorithm that is used to find similar sequences of amino acids or nucleotides.
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Point Specific Alignment Methods PSI – BLAST & PHI – BLAST.
Heuristic Methods for Sequence Database Searching BMI/CS 576 Colin Dewey Fall 2015.
Sequence Alignment.
Pairwise Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 4, 2004 ChengXiang Zhai Department of Computer Science University.
Step 3: Tools Database Searching
Heuristic Methods for Sequence Database Searching BMI/CS 576 Colin Dewey Fall 2010.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Dynamic programming with more complex models When gaps do occur, they are often longer than one residue.(biology) We can still use all the dynamic programming.
Computational Biology, Part 3 Representing and Finding Sequence Features using Frequency Matrices Robert F. Murphy Copyright  All rights reserved.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier.
Heuristic Methods for Sequence Database Searching BMI/CS 776 Mark Craven February 2002.
Your friend has a hobby of generating random bit strings, and finding patterns in them. One day she come to you, excited and says: I found the strangest.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
Blast heuristics, Psi-Blast, and Sequence profiles Morten Nielsen Department of systems biology, DTU.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
Blast Basic Local Alignment Search Tool
LSM3241: Bioinformatics and Biocomputing Lecture 4: Sequence analysis methods revisited Prof. Chen Yu Zong Tel:
Fast Sequence Alignments
CSE 5290: Algorithms for Bioinformatics Fall 2009
Searching Sequence Databases
Presentation transcript:

Gapped BLAST and PSI-BLAST : a new generation of protein database search programs Team2 邱冠儒 黃尹柔 田耕豪 蕭逸嫻 謝朝茂 莊閔傑 2014/05/12 1

BLAST Basic local alignment search tool Enable to compare query sequence of amino-acid or protein or DNA to database To find similarity between sequences 2

BLAST 1.Make a k-letter word list of the query sequence 2.Using the scoring matrix (for protein,BLOSOM62 is often used) to score the comparison of k-letter word and possible matching word 3.Words which score above threshold T is remained 4.Scan the database sequences for exact matches with the remaining high-scoring words 3

BLAST 5.Extend the exact matches to high-scoring segment pair (HSP) 6.Evaluate the significance of the HSP score 7.Show local alignments of the query and each of the matched database sequences 4

refinement (i) criterion for extending word pairs is modified - lower threshold T - two word pairs 5

refinement (ii)The ability to generate gapped alignments has been added - allow threshold T increase - increase speed 6

refinement (iii)Position-specific score matrix - BLAST is easily generalized - speed and ease of operation 7

Statistical Preliminaries S’ : normalized score λ 、 K is determined by score matrix E : expect score N=m*n ex: for query protein length 250 and database 50million residues, to achieve E-value of 0.05,S’ is around 38 8

Original BLAST (Algorithm) extract each substring of length k from the query sequence “word” For proteins, k=3; for DNA, k=11 9

Original BLAST (Algorithm) For each word in the query sequence, find a set of word with score higher than k (when aligned with the word in the query sequence) Score: scoring matrix 10

11

Original BLAST (Algorithm) Scan the database, for these words which has score>T when aligned with some words in the query sequence “hit” 12

Original BLAST (Algorithm) Extend the hit in both direction, to generate HSP (HSP: High-scoring Segment Pair) Find HSP of statistical significance 13

Refinement: the Two-Hit method Extension is the most time-consuming part in the algorithm (>90% execution time) Observation: HSP length >>word length(k)  an HSP may contain multiple hits on the same diagonal and close to each other 14

Refinement: the Two-Hit method (Query) …PQGEFGVVVEEFQEE… (Database) …PVGGVGPFEEEFVQE… (Query) …PQGEFGVVVEEFQEE… (Database)…PVGGEEFVGPFEVQE… (Query) …PQGEFGVVVEEFQEE… (Database)…EEFVQEPVGGVGPFE… 15

16

Refinement: the Two-Hit method Method: Choose a window length A, invoke an ungapped extension when 2 hits are found within distance A and on the same diagonal T value need to be lowered to get sufficient sensitivity 17

Results Sensitivity Speed~2 times faster! 18

Gapped Alignment (Original BLAST): Several distinct HSPs on the same database sequence are combined Several HSP with only moderate score can be combined to have great significance But for HSPs with moderate scores, the sensitivity is low…  compensate by lowering T? 19

Gapped Alignment Define a moderate score S g, trigger gapped extension for all HSP with score> S g Can tolerate probability of missing HSP Ex: result should > 0.95, p: miss probability of HSP (Orignial) need to detect both HSPs: (1-p)(1-p)>0.95 p<0.025 (New) only need to detect one HSP:p 2 <0.05 p<0.22  T can be raised to speed up! 20

Summary Ungapped extension: Require 2 hits of score >T, and within distance A to evoke Gapped extension: For HSPs with score higher than Sg, gapped extension is evoked 21

Gapped Local Alignment EX. Broad bean leghemoglobin I V.S. Horse beta globin 1.Construct a k-letter “word” list 2.Scan database for “hit” 3.Extend “hit” ─ Two-Hit Method 22

4.Finding the seed < 11 central residue > 11 central residue of highest-scoring length-11 HSP

5.Gapped extensions - Considering cells for which the optimal local alignment score falls no more than Xg 24

6.Optimal Local Alignment 25

Relative Time Spent by BLAST & Gapped BLAST 26

Position-specific iterated BLAST Construct a position-specific score matrix(profile or motif) from the output of a BLAST run More sensitive Detection of weak relationships For each iteration takes little more than the same time to run

Position-specific score matrix architecture

Improved estimation of the probabilities with which amino acid occur at various pattern position More sensitive Relatively precise definition of the boundaries of important motifs Local alignment The size of the search space may be greatly reduced, lowering random noise Gap score In each iteration, employ the same gap scores that are used in the first BLAST run

Multiple alignment construction Collect all database sequence segments with E-value below a threshold(0.01) Multiple alignment M Retain any rows that are >98% identical to one another Gap characters inserted into the query are simply ignored Reduced multiple alignment M C

Sequence weights A mistake to give all sequences of the alignment equal weight Assign weights to the various sequences Any columns consisting of identical residues are ignored in calculating weights Column’s observed residue frequency The effective number of independent observations The relative number N C

Target frequency estimation

BLAST applied to position-specific score matrices

SWISS-PROT 108

Performance of PSI-BLAST-shuffled database test CCCCCCCCCCDDDDDDDDDDEEEEEEEEEEFFFFFFFFFF Original sequence DDCCCFFEFEEDFCCEFCCEDEEFCEDFFEDDDEDFFDCC shuffled sequence EDECDECDFDFCCDCDFCDECCDDEFEFFDCFFECEEFFE EDDFFDCEFECEFCCFCCFEDECDCDCDEDFDFDFEFCEE... Query compare with Database Significant alignments Significant alignments compare with Shuffled Database Score E value Test the accuracy of PSI-BLAST

(Lowest E-value) Performance of PSI-BLAST-shuffled database test It can automate the construction of position-specific score matrices during multiple iterations of the PSI- BLAST program.

Normalized speed Gapped BLAST is fast PSI-BLAST finds weak homologs PSI - Gapped

HIT and GALT protein Member of HIT family protein HIT protein & GALT protein alignment Lowest E value = 2x10 -4 PSI-BLASTStructure GALTP43424 HIT Family protein

HIT protein HIT protein / H.Influenza GALT HIT protein / yeast 5’’,5’’’-P1,P4-tetraphosphate phosphorylase I E = 4x10 -5 E = 2x10 -4

DISCUSSION In addition to the major algorithmic changes described above, we have modified an aspect of the original BLAST program’s output routine that on occasion caused important similarities to be overlooked. 41

Future Improvement Gap costs- In many cases, the new gap costs generate local alignments that are both more accurate and more statistically significant. Realignment-the realignment procedure can prevent inaccurate pairwise alignments from corrupting the evolving multiple alignment, and can accelerate the recognition of related sequences, all for very little computational cost. 42

CONCLUSION the new gapped version of BLAST is both considerably faster than the original one, and able to produce gapped alignments. For many queries, the PSI-BLAST extension can greatly increase sensitivity to weak but biologically relevant sequence relationships. PSI-BLAST retains the ability to report accurate statistics, per iteration runs in times not much greater than gapped BLAST, and can be used both iteratively and fully automatically. 43