BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier.

Slides:



Advertisements
Similar presentations
Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Advertisements

Gapped BLAST and PSI-BLAST Altschul et al Presenter: 張耿豪 莊凱翔.
BLAST Sequence alignment, E-value & Extreme value distribution.
Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.
Space/Time Tradeoff and Heuristic Approaches in Pairwise Alignment.
Searching Sequence Databases
Lecture outline Database searches
Heuristic alignment algorithms and cost matrices
. Class 4: Fast Sequence Alignment. Alignment in Real Life u One of the major uses of alignments is to find sequences in a “database” u Such collections.
1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.
From Pairwise Alignment to Database Similarity Search.
Introduction to bioinformatics
Sequence similarity.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
1-month Practical Course Genome Analysis Lecture 3: Residue exchange matrices Centre for Integrative Bioinformatics VU (IBIVU) Vrije Universiteit Amsterdam.
From Pairwise Alignment to Database Similarity Search.
Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 Sep 16 th, 2014.
Blast heuristics Morten Nielsen Department of Systems Biology, DTU.
Sequence alignment, E-value & Extreme value distribution
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Practical algorithms in Sequence Alignment Sushmita Roy BMI/CS 576 Sep 17 th, 2013.
Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
1 BLAST: Basic Local Alignment Search Tool Jonathan M. Urbach Bioinformatics Group Department of Molecular Biology.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Inferring function by homology The fact that functionally important aspects of sequences are conserved across evolutionary time allows us to find, by homology.
An Introduction to Bioinformatics
Protein Sequence Alignment and Database Searching.
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
BLAST : Basic local alignment search tool B L A S T !
Biology 224 Tom Peavy Sept 20 & 22, 2010
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Indexing DNA sequences for local similarity search Joint work of Angela, Dr. Mamoulis and Dr. Yiu 17/5/2007.
1 Lecture outline Database searches –BLAST –FASTA Statistical Significance of Sequence Comparison Results –Probability of matching runs –Karin-Altschul.
Local alignment, BLAST and Psi-BLAST October 25, 2012 Local alignment Quiz 2 Learning objectives-Learn the basics of BLAST and Psi-BLAST Workshop-Use BLAST2.
Last lecture summary. Window size? Stringency? Color mapping? Frame shifts?
BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Function preserves sequences Christophe Roos - MediCel ltd Similarity is a tool in understanding the information in a sequence.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.
Significance in protein analysis
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Database Similarity Search. 2 Sequences that are similar probably have the same function Why do we care to align sequences?
Biology 4900 Biocomputing.
Sequence Alignment.
Construction of Substitution matrices
Step 3: Tools Database Searching
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Using BLAST To Teach ‘E-value-tionary’ Concepts Cheryl A. Kerfeld 1, 2 and Kathleen M. Scott 3 1.Department of Energy-Joint Genome Institute, Walnut Creek,
Introduction to Bioinformatics
Introduction to Bioinformatics DNA and Protein Database Searching BLAST: Basic local alignment search tool Xiaolong Wang College of Life Sciences Ocean.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
Courtesy of Jonathan Pevsner
Basic Local Alignment Sequence Tool (BLAST)
Blast Basic Local Alignment Search Tool
Identifying templates for protein modeling:
Johns Hopkins School of Medicine
Basic Local Alignment Search Tool
BLAST Slides adapted & edited from a set by
Sequence alignment, E-value & Extreme value distribution
BLAST Slides adapted & edited from a set by
1-month Practical Course Genome Analysis Iterative homology searching
Searching Sequence Databases
Presentation transcript:

BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier

What is BLAST? BLAST (Basic Local Alignment Search Tool) allows rapid sequence comparison of a query sequence against a database. Smith-Waterman is rigorous and it is guaranteed to find an optimal alignment. But also time and space consuming. It is especially inefficient in database searches. BLAST provides a rapid alternative.o S-W

Why do we use BLAST? To understanding the relatedness of any protein or DNA sequence (query sequence) to other known sequences (database) Identify sequences with a common ancestor (orthologs) and paralogs Discover new genes or proteins Explore protein structure and function …

The BLAST Algorithm S. F. Altschul, et al., 1997, Nucleic Acids Research, 25:3389 “The central idea of the BLAST algorithm is to confine attention to segment pairs that contain a word pair of length w with a score of at least T.” Altschul et al. (1990)

How the original BLAST algorithm works: Step 1. size w words in the query sequence Look at the query sequence by a moving window of size w Example: for a human RBP query …FSGTWYA… (query word is in yellow) The moving window of words: FSG SGT GTW TWY WYA page 101

Step 1: compile a list of words scoring at least T with query word GTW 6,5,11 22 ASW 6,1,11 18 word hitsATW 0,5,1116 > threshold NTW 0,5,1116 GTY 6,5,213 GNW10 GAW9 word hits < threshold (T=11)

3. Extend: when you manage to find a “hit” extend the hit in either direction. Keep track of the score (use a scoring matrix) Stop when the score drops below some cutoff. KENFDKARFSGTWYAMAKKDPEG 50 RBP (query) MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit) Hit! extend 2. Scan the database for entries that contains any word from the compiled hit list.

Alignment Score It is important to assess the statistical significance of search results. For local alignments (including BLAST search results), the scores follow an extreme value distribution (EVD). E-value is closely related to the analysis of the distribution of alignment score Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes." Proc. Natl. Acad. Sci. USA 87: (PubMed)(PubMed)

Alignment score as a random walk

Max Score in an Excursion P i : frequency of residue i in 1 st seq. P k ’ : frequency of residue k in 2 nd seq.

Protein Scoring

Dist’n of any excursion Y an exponential distribution

The Max of n variables Y 1, Y 2, …, Y n are identical & independently distributed Y max is the max of the above all. Then Prob( Y max<y)=(Prob(Y<y)) n

The Max of n Exponential variables Y 1, Y 2, …, Y n are independent exponential variables Y max is the max of the above all. Then Prob( Y max<y)=(Prob(Y<y)) n =(1-e -λy ) n Prob( Y max>y)=1-(Prob(Y<y)) n =1-(1-e -λy ) n

In a database of n seq.s Number of sequences: n Y 1, Y 2,…, Y n are i.i.d. exponential What happens when n is large? Using a widely used rule: (1+x/n) n  exp(x)  1-exp(-nCe λy ) Probability of scores/excursion higher than y The distribution of Y max follows an extreme value dist’n

x probability normal distribution the sum of a large number of independent identically distributed (i.i.d) random variables tends to a normal distribution,

x probability extreme value distribution normal distribution the maximum of a large number of i.i.d. random variables tends to an extreme value distribution

Expected number of better scores/higher excursions  1-exp(-nCe λy ) = E-value p-value

E is the number of hits you would expect from your search with scores greater than S where: K is a constant m is the size of the query n is the size of the database being searched scales for the specific scoring matrix used (decay constant from the extreme value distribution) E = Kmn e - S

Very small E values are very similar to p values. E values of about 1 to 10 are far easier to interpret than corresponding p values. Ep=1-exp(-E) (about 0.1) (about 0.05) (about 0.001) How to interpret BLAST: E values and p values

End

Extreme value distribution The distribution: The area to the right of S: Scaling to a particular type of score: where μ is the mode and λ is a scale factor.

Extreme value distribution The distribution: The area to the right of S: Scaling to a particular type of score: where μ is the mode and λ is a scale factor. Compute this value for x=0.

Extreme value distribution The distribution: The area to the right of S: Scaling to a particular type of score: where μ is the mode and λ is a scale factor. Compute this value for x = 0. Solution: exp[-1] = 0.368

An example You run BLAST and get a score of 45. You then run BLAST on a shuffled version of the database, and fit an extreme value distribution to the resulting empirical distribution. The parameters of the EVD are μ = 25 and λ = What is the p-value associated with 45?

An example You run BLAST and get a score of 45. You then run BLAST on a shuffled version of the database, and fit an extreme value distribution to the resulting empirical distribution. The parameters of the EVD are μ = 25 and λ = What is the p-value associated with 45?

Another example You run BLAST and get a score of 23. You then run BLAST on a shuffled version of the database, and fit an extreme value distribution to the resulting empirical distribution. The parameters of the EVD are μ = 20 and λ = What is the p-value associated with 23?

Another example You run BLAST and get a score of 23. You then run BLAST on a shuffled version of the database, and fit an extreme value distribution to the resulting empirical distribution. The parameters of the EVD are μ = 20 and λ = What is the p-value associated with 23?

BLAST: optional parameters You can... choose the organism to search turn filtering on/off change the substitution matrix change the expect (e) value change the word size change the output format

Choosing Gap Penalty 1. Choice must be made corresponding to each type of scoring system to place gaps where they will increase the overall alignment score. 2. There are no hard and fast rules for choosing gap penalties. 3. Both g (opening penalty) and r (extension penalty) should be non-zero. 4. The value of g + r should be greater than the maximum score used for a match if insertions and deletions are considered to be rarer than nucleotide substitutions. 5. The value of g strongly influences the number of gaps introduced into a region separating two closely matching regions.

Comparing Scoring Matrix PAM Homologous seq.s during evolution Based on extrapolation of a small evol. Period Track evolutionary origins BLOSUM Conserved blocks Based on a range of evol. Periods Find conserved domains

Another way to compare perform a search of a sequence database with a known member of a protein family and to find how many members of the family are found. When gap penalty was not considered, the BLOSUM62 matrix outperformed the PAM250 matrix in finding more members of 504 different families on the Prosite database.

Extension: In the original (1990) implementation of BLAST, hits were extended in either direction. In a 1997 refinement of BLAST, two independent hits are required. The hits must occur in close proximity to each other. With this modification, only one seventh as many extensions occur, greatly speeding the time required for a search. BLAST Phase 3

How a BLAST search works: threshold You can modify the threshold parameter. The default value for blastp is 11. To change it, enter “-f 16” or “-f 5” in the advanced options.

The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. An E value is related to a probability value p. The key equation describing an E value is: E = Kmn e - S How to interpret a BLAST search: expect value

This equation is derived from a description of the extreme value distribution S = the score E = the expect value = the expected number of HSPs with a score >= S m, n = the length of two sequences, K = Karlin Altschul statistics E = Kmn e - S

Some properties of the equation E = Kmn e - S The value of E decreases exponentially with increasing S (higher S values correspond to better alignments). Very high scores correspond to very low E values. The E value for aligning a pair of random sequences must be negative! Otherwise, long random alignments would acquire great scores Parameter K describes the search space (database). For E=1, one match with a similar score is expected to occur by chance. For a very much larger or smaller database, you would expect E to vary accordingly

From raw scores to bit scores There are two kinds of scores: raw scores (calculated from a substitution matrix) and bit scores (normalized scores) Bit scores are comparable between different searches because they are normalized to account for the use of different scoring matrices and different database sizes S’ = bit score = ( S - lnK) / ln2 The E value corresponding to a given bit score is: E = mn 2 -S’ Bit scores allow you to compare results between different database searches, even using different scoring matrices.

The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. A p value is a different way of representing the significance of an alignment. p = 1 - e -  How to interpret BLAST: E values and p values