BLAST: Basic Local Alignment Search Tool. Altschul et al., J. Mol. Biol., 1990. CS 466, Saurabh Sinha.



Motivation
Sequence homology to a known protein suggests the function of a newly sequenced protein.
The bioinformatics task is to find homologous sequences in a database of sequences.
Sequence databases are growing fast.

Alignment
The natural way to check whether the "query sequence" is homologous to a sequence in the database is to compute the alignment score of the two sequences.
The alignment score accounts for gaps (insertions, deletions) and replacements.
Finding the best-scoring alignment corresponds to minimizing the evolutionary distance.

Alignment
Global alignment: optimize the overall similarity of the two sequences.
Local alignment: find only relatively conserved subsequences.
Local similarity measures are preferred for database searches.
– Distantly related proteins may share only isolated regions of similarity.

Alignment
Dynamic programming is the standard approach to sequence alignment.
Its running time is quadratic: proportional to the product of the lengths of the two sequences.
This is not practical for searches against a very large database of sequences (e.g., a whole genome).
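To make the quadratic cost concrete, here is a minimal sketch of local alignment by dynamic programming (the Smith-Waterman recurrence). The function name, the linear gap penalty of -6, and the +5/-4 DNA scores used as defaults are assumptions for illustration; the slides do not specify a gap model.

```python
def local_alignment_score(a, b, match=5, mismatch=-4, gap=-6):
    """Best local alignment score of strings a and b (linear gap penalty).

    Fills a len(a) x len(b) table one row at a time, so the running time
    is proportional to len(a) * len(b).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + s,    # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("TTACGTAA", "GGACGTCC"))
```

Every cell of the table is visited once, which is why running this against each sequence of a large database is too slow and a heuristic such as BLAST is attractive.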

Scoring alignments
Scoring matrix: 4 x 4 matrix (DNA) or 20 x 20 matrix (protein).
Amino acid sequences: "PAM" matrix.
– Consider amino acid sequence alignments for very closely related proteins, extract replacement frequencies (probabilities), and extrapolate to greater evolutionary distances.
DNA sequences: match = +5, mismatch = -4.

BLAST: the MSP
Given two sequences of the same length, the similarity score of their alignment (without gaps) is the sum of the similarity values for each pair of aligned residues.
Maximal segment pair (MSP): the highest-scoring pair of identical-length segments from the two sequences.
The similarity score of an MSP is called the MSP score.
BLAST heuristically aims to find this.
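The MSP definition can be checked by brute force on small inputs. The sketch below is not how BLAST works; it simply scans every diagonal (every relative offset of the two sequences) and keeps the best-scoring ungapped run. The function name and the +5/-4 DNA scores are illustrative assumptions.

```python
def msp_score(a, b, match=5, mismatch=-4):
    """Exact MSP score of a and b by exhaustive search over all diagonals.

    A segment pair is a pair of equal-length substrings aligned without
    gaps, so it lies on one diagonal (fixed offset j - i). Along each
    diagonal the best run is found with a maximum-subarray scan.
    """
    best = 0  # the empty segment pair scores 0
    for offset in range(-(len(a) - 1), len(b)):
        i, j = (-offset, 0) if offset < 0 else (0, offset)
        running = 0
        while i < len(a) and j < len(b):
            s = match if a[i] == b[j] else mismatch
            running = max(0, running) + s  # restart if the prefix is a net loss
            best = max(best, running)
            i += 1
            j += 1
    return best


# Eight matches and one mismatch on the main diagonal.
print(msp_score("ACGTTACGT", "ACGTAACGT"))
```

This exhaustive search costs time proportional to the product of the sequence lengths, which is exactly the cost BLAST's word-based heuristic is designed to avoid.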

Locally maximal segment pair
A molecular biologist may be interested in all conserved regions shared by two proteins, not just their highest-scoring pair.
A segment pair (segments of identical length) is locally maximal if its score cannot be improved by extending or shortening it in either direction.
BLAST attempts to find all locally maximal segment pairs above some score cutoff.

Rapid approximation of MSP score
The goal is to report those database sequences that have an MSP score above some threshold S.
Statistics tells us the highest threshold S at which "chance similarities" are still likely to appear.
Tractability to statistical analysis is one of the attractive features of the MSP score.

Rapid approximation of MSP score
BLAST minimizes the time spent on database sequences whose similarity with the query has little chance of exceeding the cutoff S.
Main strategy: seek only segment pairs (one segment from the database, one from the query) that contain a word pair with score >= T.
Intuition: if a segment pair is to score above S, its best-matching word pair (of some predetermined small length w) should score at least T.
Lower T means fewer false negatives, but also more word pairs to analyze.

Implementation
1. Compile a list of high-scoring words.
2. Scan the database for hits to this word list.
3. Extend hits.

Step 1: Compiling list of words from query sequence
For proteins: the list of all w-length words that score at least T when compared to some word in the query sequence.
Question: does every word in the query sequence make it to the list?
For DNA: the list of all w-length words in the query sequence, often with w = 12.
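A minimal sketch of both word lists follows. The function names are hypothetical, and `toy_score` (+5 for identical letters, -4 otherwise) is a stand-in chosen so the example runs without a PAM matrix; a real protein search would plug in a 20 x 20 substitution matrix instead.

```python
from itertools import product


def dna_words(query, w):
    """Map every w-length word of the query to its start positions."""
    table = {}
    for i in range(len(query) - w + 1):
        table.setdefault(query[i:i + w], []).append(i)
    return table


def toy_score(x, y):
    # Illustrative stand-in for a substitution matrix entry s(x, y).
    return 5 if x == y else -4


def neighborhood_words(query, w, T, alphabet, score):
    """Map every word scoring >= T against some query word to the
    start positions of the query words it matches."""
    table = {}
    for cand in map("".join, product(alphabet, repeat=w)):
        for i in range(len(query) - w + 1):
            s = sum(score(x, y) for x, y in zip(cand, query[i:i + w]))
            if s >= T:
                table.setdefault(cand, []).append(i)
    return table


print(dna_words("ACGTACG", 3))
print(sorted(neighborhood_words("ACG", 3, 6, "ACGT", toy_score)))
```

With T = 6 the list for the query word ACG contains ACG itself and every word one substitution away. With T = 16 the list is empty, because even ACG against itself scores only 15. This illustrates the slide's question: a query word is on the list only if its own self-score reaches T.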

Step 2: Scanning the database for hits
Find exact matches to the words in the list.
This can be done in linear time.
– Two methods (next slides).
Each word in the list from the previous step points to all occurrences of that word in the query.

Scanning the database for hits
Method 1: let w = 4, so there are $20^4$ possible words.
Each integer in $0 \ldots 20^4 - 1$ is an index into an array.
Each array element points to the list of all occurrences of that word in the query.
Not all $20^4$ elements of the array are populated.
– Only the ones in the word list from the previous step.
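One way to realize Method 1 is to read each word as a base-20 number. The sketch below assumes the 20 standard amino acid letters in an arbitrary fixed order and a database sequence that contains only those letters; all names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 residues
CODE = {aa: k for k, aa in enumerate(AMINO_ACIDS)}


def word_index(word):
    """Interpret a word as a base-20 integer in 0 .. 20**len(word) - 1."""
    idx = 0
    for aa in word:
        idx = idx * 20 + CODE[aa]
    return idx


def build_lookup(word_positions, w=4):
    """Array of size 20**w; only entries for listed words are populated."""
    table = [None] * (20 ** w)
    for word, positions in word_positions.items():
        table[word_index(word)] = positions
    return table


def scan(database_seq, table, w=4):
    """Return (query_pos, database_pos) for every word hit."""
    hits = []
    for j in range(len(database_seq) - w + 1):
        positions = table[word_index(database_seq[j:j + w])]
        if positions:
            hits.extend((i, j) for i in positions)
    return hits


table = build_lookup({"ACDE": [0], "CDEF": [1]})
print(scan("GGACDEFGG", table))
```

For brevity this sketch recomputes the index of every window from scratch; updating the index incrementally as the window slides would avoid the extra factor of w.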

Scanning the database for hits
Method 2: use a "deterministic finite automaton" or "finite state machine".
This is similar to the keyword trees seen in the course.
Build the finite state machine out of all words in the word list from the previous step.
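As a rough illustration of Method 2, the sketch below builds a keyword tree with failure links (the Aho-Corasick construction) and scans a sequence in a single pass. This is a textbook stand-in rather than the automaton used in the BLAST implementation, and the function names are hypothetical.

```python
from collections import deque


def build_automaton(words):
    """Keyword tree with failure links over a list of distinct words."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # longest proper suffix that is also a tree prefix
    out = [[]]    # words recognised on entering each state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def scan(text, automaton):
    """Return (start_position, word) for every occurrence, in one pass."""
    goto, fail, out = automaton
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((pos - len(word) + 1, word))
    return hits


print(scan("ACGTACG", build_automaton(["ACG", "CGT", "GTA"])))
```

Each database character is read once, so the scan time grows linearly with the database length plus the number of hits reported.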

Step 3: Extending hits
Once a word pair with score >= T has been found, extend it in each direction.
Extend in the hope of obtaining a segment pair with score >= S.
During extension, the score may go up, then down, then up again.
Terminate the extension if the score goes down too much (a certain distance below the best score found for shorter extensions).
One implementation allows gaps during extension.
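The termination rule can be sketched as follows for ungapped extension. The parameter name `drop_off`, its default of 10, and the +5/-4 scores are assumptions for illustration; the slides do not give the actual cutoff.

```python
def extend_one_way(q, s, drop_off, score):
    """Extend along q and s from their first characters.

    Returns (best_gain, best_length). Stops once the running score falls
    more than drop_off below the best score seen so far.
    """
    best = running = best_len = 0
    for k, (x, y) in enumerate(zip(q, s), start=1):
        running += score(x, y)
        if running > best:
            best, best_len = running, k
        elif best - running > drop_off:
            break
    return best, best_len


def extend_hit(query, subject, qi, si, w, drop_off=10, match=5, mismatch=-4):
    """Extend a word hit of length w at query[qi:], subject[si:].

    Returns (score, query_start, subject_start, length) of the
    resulting ungapped segment pair.
    """
    def score(x, y):
        return match if x == y else mismatch

    seed = sum(score(query[qi + k], subject[si + k]) for k in range(w))
    right, rlen = extend_one_way(query[qi + w:], subject[si + w:], drop_off, score)
    # Extending left is extending right on the reversed prefixes.
    left, llen = extend_one_way(query[:qi][::-1], subject[:si][::-1], drop_off, score)
    return seed + left + right, qi - llen, si - llen, w + llen + rlen


print(extend_hit("AAACGTACGTTT", "CCACGTTCGTGG", qi=2, si=2, w=4))
print(extend_hit("AAACGTACGTTT", "CCACGTTCGTGG", qi=2, si=2, w=4, drop_off=3))
```

In the first call the extension tolerates the single mismatch after the seed and picks up the three matches beyond it. In the second call the tighter cutoff stops the extension at that mismatch, so only the seed is reported. A smaller cutoff is faster but can miss segment pairs whose score dips before recovering.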

BLAST: approximating the MSP
BLAST may not find all segment pairs above the threshold S.
It is trying to approximate the MSP.
Bounds on the error: not hard bounds, but statistical bounds.
– "Highly likely" to find the MSP.

Statistics
Suppose the MSP has been calculated by BLAST (and suppose this is the true MSP).
Suppose this observed MSP scores S.
What are the chances that the MSP score for two unrelated sequences would be >= S?
If the chances are very low, we can be confident that the two sequences are related.

Statistics
Given two random sequences of lengths m and n, what is the probability that they will produce an MSP score >= x?

Statistics
The number of separate SPs with score >= x is Poisson distributed with mean
$y(x) = K m n \, e^{-\lambda x}$,
where $\lambda$ is the positive solution of
$\sum_{i,j} p_i \, p_j \, e^{\lambda s(i,j)} = 1$.
K is a constant, s(i,j) is the scoring matrix, and $p_i$ is the frequency of residue i in random sequences.
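The equation for $\lambda$ has no closed form in general, but it can be solved numerically. The sketch below uses bisection and assumes the expected score per aligned pair is negative, which is what guarantees a unique positive root. The uniform DNA frequencies and +5/-4 scores are illustrative inputs, and K is left as a caller-supplied parameter because the slides do not give its value.

```python
import math


def solve_lambda(freqs, score, iterations=200):
    """Positive root of sum_ij p_i p_j exp(lambda * s(i, j)) = 1.

    freqs maps each residue to its background frequency; score(a, b)
    returns s(a, b). Assumes a negative expected score.
    """
    def f(lam):
        return sum(pa * pb * math.exp(lam * score(a, b))
                   for a, pa in freqs.items()
                   for b, pb in freqs.items()) - 1.0

    lo, hi = 1e-6, 1.0
    while f(hi) <= 0:          # grow the bracket until f changes sign
        hi *= 2
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


def expected_segment_pairs(x, m, n, lam, K):
    """Poisson mean y(x) = K * m * n * exp(-lam * x)."""
    return K * m * n * math.exp(-lam * x)


uniform_dna = {base: 0.25 for base in "ACGT"}
lam = solve_lambda(uniform_dna, lambda a, b: 5 if a == b else -4)
print(round(lam, 4))
```

Because y(x) decays exponentially in x, raising the score threshold by a fixed amount divides the expected number of chance segment pairs by a fixed factor.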

Statistics
Poisson distribution with mean $y$: $\Pr(k) = e^{-y} y^{k} / k!$
$\Pr(\#\text{SPs} \ge c) = 1 - \Pr(\#\text{SPs} \le c - 1)$

Statistics
For $c = 1$: $\Pr(\#\text{SPs} \ge 1) = 1 - e^{-y(x)}$
Choose S such that $1 - e^{-y(S)}$ is small.
Suppose the probability of having at least 1 SP with score >= S is some small value. This seems reasonably small.
However, if you test enough random sequences, you still expect several (10 in this example) to cross the threshold.
Therefore, require the "E-value" to be small. That is, the expected number of random sequence pairs with score >= S should be small.
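The relationship between the per-pair probability and the expected number of chance hits across a database can be written out directly. In the sketch below the function names are hypothetical, and the numbers in the usage lines (a Poisson mean of 0.001 and 10,000 random sequences) are chosen purely for illustration.

```python
import math


def poisson_tail(y, c):
    """Pr(#SPs >= c) = 1 - Pr(#SPs <= c - 1) for a Poisson mean y."""
    below = sum(math.exp(-y) * y ** k / math.factorial(k) for k in range(c))
    return 1.0 - below


def p_value(y):
    """Pr(#SPs >= 1) = 1 - exp(-y), computed stably for small y."""
    return -math.expm1(-y)


def expected_chance_hits(p, num_sequences):
    """Expected number of random sequences that cross the threshold."""
    return p * num_sequences


y = 0.001                      # illustrative Poisson mean y(S)
p = p_value(y)
print(p)                       # close to y when y is small
print(expected_chance_hits(p, 10_000))
```

For small y the probability is almost equal to y itself, so a per-comparison probability that looks negligible still yields roughly ten chance hits once it is multiplied by ten thousand comparisons. This is why the threshold is set on the expected count rather than on the per-pair probability.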

More statistics
We just saw how to choose the threshold S. How do we choose T?
BLAST is trying to find segment pairs (SPs) scoring above S.
If an SP scores S, what is the probability that it will contain a w-word match of score T or more?
We want this probability to be high.

More statistics: Choosing T
Given a segment pair (from two random sequences) that scores S, what is the probability q that it contains no w-word match scoring at least T?
We want q to be low.
q is obtained from simulations.
It is found to decrease exponentially as S increases.

BLAST is one of the most widely used bioinformatics tools.