BLAST (Basic Local Alignment Search Tool)


1 1. BLAST (Basic Local Alignment Search Tool)
Heuristic. Only some parts of a protein are frequently subject to mutation. For example, active sites (the parts that bind with other proteins) are rarely mutated, while the regions between active sites (which help to shape the protein) mutate often. The Smith-Waterman algorithm is well suited to detecting the conserved parts, except that it runs in $O(n^2)$ time. Comparing a sequence against a whole database of sequences calls for an algorithm that runs in about $O(n)$ time. Heuristics are therefore required: the sequences to which the heuristic assigns high scores are then rescored with the Smith-Waterman algorithm.

2 BLAST (cont. 1)
BLAST relies on two assumptions:
- High-scoring alignments contain one or more high-scoring pairs of three-letter substrings (words, in BLAST terminology). These pairs serve as seeds (pairs of words obtained by comparing the query with a database sequence).
- Homologous proteins contain significant gap-free alignments (the most recent versions select sequences on a gap-free basis, but then apply an optimized version of the Smith-Waterman algorithm).
BLAST returns a list of high-scoring segment pairs (gap-free alignments) between the query sequence and sequences in the database. It proceeds in two steps.

3 Preprocessing of the query string
Construct a hash table (the query index) with 20 * 20 * 20 = 8000 possible keys (all possible three-letter words). The value of a key is the list of all positions in the query string of the words that give a high score when aligned (without gaps) with the key word.
Example: query string CINCINNATI, key NCT. If good scores are those of at least 11 (a good threshold with the BLOSUM62 scoring matrix), then the value of the key is [3, 7]. With BLOSUM62, the alignment between NCI (position 3) and NCT scores 6 + 9 + (-1) = 14, and the alignment between NAT (position 7) and NCT scores 6 + 0 + 5 = 11.
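The index entry for one key can be sketched as follows. This is only an illustration under stated assumptions: the function names `word_score` and `index_value` are hypothetical, the score table holds just the BLOSUM62 entries for the residues A, C, I, N and T that appear in this example, and positions are reported 1-based as on the slide. A real query index would precompute such an entry for all 8000 keys.

```python
# BLOSUM62 scores restricted to the residues that occur in the
# query CINCINNATI and in the key NCT (assumption: enough for this example).
RESIDUES = "ACINT"
ROWS = [
    [ 4,  0, -1, -2,  0],  # A
    [ 0,  9, -1, -3, -1],  # C
    [-1, -1,  4, -3, -1],  # I
    [-2, -3, -3,  6,  0],  # N
    [ 0, -1, -1,  0,  5],  # T
]
SCORE = {(a, b): ROWS[i][j]
         for i, a in enumerate(RESIDUES)
         for j, b in enumerate(RESIDUES)}


def word_score(u, v):
    """Score of the gap-free alignment of two words of equal length."""
    return sum(SCORE[a, b] for a, b in zip(u, v))


def index_value(query, key, threshold=11):
    """1-based positions of the query words scoring >= threshold against key."""
    w = len(key)
    return [pos + 1
            for pos in range(len(query) - w + 1)
            if word_score(query[pos:pos + w], key) >= threshold]


print(index_value("CINCINNATI", "NCT"))  # [3, 7]
```

The threshold test is written as `>=` because the word NAT scores exactly 11 and is still listed in the value of the key.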

4 Scanning the target string
The first goal is to identify hits: the locations of high-scoring word pairs.
Example: use the BLOSUM62 scoring matrix, with a good score being at least 11. Query string: CINCINNATI. Target string: PRECINCTS. The value of the key NCT is [3, 7], and NCT occurs at position 6 of the target, so two hits are found for NCT: (3, 6) and (7, 6).
The hits are then extended (to the right and to the left) to whatever positions maximize the overall alignment score (a tolerance can be introduced).
Example: from the hit at query position 3 we get the segment pair (CINCIN, CINCTS) with score 9 + 4 + 6 + 9 + (-1) + 1 = 28, but the hit at query position 7 (NAT, NCT) cannot be extended.
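A minimal sketch of the gap-free extension of a hit is given below. It is not BLAST's actual implementation: the names `one_way` and `extend` are hypothetical, the "tolerance" is modelled as a drop-off parameter `x_drop` (an assumption), and the score table contains only the BLOSUM62 entries for the nine residues found in CINCINNATI and PRECINCTS.

```python
# BLOSUM62 scores restricted to the residues of the two example strings.
RESIDUES = "ACEINPRST"
ROWS = [
    [ 4,  0, -1, -1, -2, -1, -1,  1,  0],  # A
    [ 0,  9, -4, -1, -3, -3, -3, -1, -1],  # C
    [-1, -4,  5, -3,  0, -1,  0,  0, -1],  # E
    [-1, -1, -3,  4, -3, -3, -3, -2, -1],  # I
    [-2, -3,  0, -3,  6, -2,  0,  1,  0],  # N
    [-1, -3, -1, -3, -2,  7, -2, -1, -1],  # P
    [-1, -3,  0, -3,  0, -2,  5, -1, -1],  # R
    [ 1, -1,  0, -2,  1, -1, -1,  4,  1],  # S
    [ 0, -1, -1, -1,  0, -1, -1,  1,  5],  # T
]
SCORE = {(a, b): ROWS[i][j]
         for i, a in enumerate(RESIDUES)
         for j, b in enumerate(RESIDUES)}


def one_way(query, target, i, j, step, x_drop):
    """Walk from (i, j) in direction step; return (best gain, its length)."""
    best = run = best_len = n = 0
    while 0 <= i < len(query) and 0 <= j < len(target):
        run += SCORE[query[i], target[j]]
        n += 1
        if run > best:
            best, best_len = run, n
        if best - run > x_drop:  # tolerance exceeded: stop extending
            break
        i += step
        j += step
    return best, best_len


def extend(query, target, qpos, tpos, w=3, x_drop=5):
    """Extend the hit (qpos, tpos), given as 1-based word start positions."""
    q0, t0 = qpos - 1, tpos - 1
    seed = sum(SCORE[query[q0 + k], target[t0 + k]] for k in range(w))
    left, n_left = one_way(query, target, q0 - 1, t0 - 1, -1, x_drop)
    right, n_right = one_way(query, target, q0 + w, t0 + w, +1, x_drop)
    return (seed + left + right,
            query[q0 - n_left:q0 + w + n_right],
            target[t0 - n_left:t0 + w + n_right])


print(extend("CINCINNATI", "PRECINCTS", 3, 6))  # (28, 'CINCIN', 'CINCTS')
print(extend("CINCINNATI", "PRECINCTS", 7, 6))  # (11, 'NAT', 'NCT')
```

Keeping the best running score in each direction means an extension is only accepted up to the point where it last improved the total, which is why the second hit stays at its three-letter seed.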

5 MSP (Maximum Segment Pair)
A maximum segment pair between two sequences is a segment pair of maximum score. Its score (the MSP score) is a measure of sequence similarity, whose significance has to be assessed statistically. The alignments reported between the query sequence and the target sequences come from MSPs.

6 2. Statistics of BLAST Database Searches
The goal is to assess statistically how good an ungapped local alignment is, such as one returned by BLAST (empirically, gapped local alignments seem to behave similarly). We first study a case where substitution matrices (like PAM) play no role (typically DNA sequences), and then consider the general case.

7 Scores for random DNA
The issue: let Q be the query sequence and D the target sequence, with a match scoring 1 and a mismatch scoring -1. A gap-free local alignment with score 13 is found (by BLAST, for example). Does this mean that Q and D are related, or did it simply occur by chance?
The solution: express the probability that an alignment with such a score occurs by chance, and give a way to estimate and compute this probability. The lower this probability, the more strongly we believe that Q and D are related.

8 Scores for random DNA (cont. 1)
a) The first goal is to approximate, by a function of the score $s$, the probability $P_s(i, j)$ that a local alignment ending at positions $i$ and $j$ has a score of at least $s$:
$$P_s(i, j) = \Pr(\mathrm{best}(i, j) \ge s) \quad \text{for } s = 0, 1, 2, \dots$$
where $\mathrm{best}(i, j)$ is the best score of a local alignment ending at the $i$th position of D and the $j$th position of Q.
"By chance" is interpreted as Q and D being random sequences, i.e. (1) each position is equally likely to be A, C, G or T, and (2) each position is independent of every other.
Because of (1): $\Pr(D_i = Q_j) = 1 - \Pr(D_i \ne Q_j) = 1/4$.
Because of (2):
$$P_s(i, j) = \Pr(D_i = Q_j)\,\Pr(\mathrm{best}(i, j) \ge s \mid D_i = Q_j) + \Pr(D_i \ne Q_j)\,\Pr(\mathrm{best}(i, j) \ge s \mid D_i \ne Q_j)$$

9 Scores for random DNA (cont. 2)
The conditional probabilities are easily expressed in terms of the $P_s(i, j)$:
$$\Pr(\mathrm{best}(i, j) \ge s \mid D_i = Q_j) = P_{s-1}(i-1, j-1)$$
$$\Pr(\mathrm{best}(i, j) \ge s \mid D_i \ne Q_j) = P_{s+1}(i-1, j-1)$$
So $P_s(i, j)$ can be defined by:
$$P_s(i, j) = \tfrac{3}{4}\, P_{s+1}(i-1, j-1) + \tfrac{1}{4}\, P_{s-1}(i-1, j-1) \quad \text{with } P_0(i, j) = 1$$
Also, in our case, $P_s(i, j) = 0$ if $i < s$ or $j < s$ (a score of $s$ needs at least $s$ matching positions).

10 Scores for random DNA (cont. 3)
One can then compute, for example, $P_2(3, 5)$:
$$P_2(3, 5) = \tfrac{3}{4} P_3(2, 4) + \tfrac{1}{4} P_1(2, 4) = 0 + \tfrac{1}{4} P_1(2, 4) = \tfrac{1}{4}\left(\tfrac{3}{4} P_2(1, 3) + \tfrac{1}{4} P_0(1, 3)\right) = \tfrac{1}{4} \cdot \tfrac{1}{4} \cdot 1 = \tfrac{1}{16}$$
Note that $P_2(3, 5)$ and $P_2(5, 3)$ give the same answer. More generally, $P_s(i, j) = P_s(j, i) = P_s(m, m) = P_{s,m}$ where $m = \min(i, j)$.
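The recurrence and its boundary conditions can be checked with a few lines of exact arithmetic. This is a sketch only; the function name `p` is hypothetical, and treating every $s \le 0$ as probability 1 is an assumption that extends the stated condition $P_0(i, j) = 1$.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def p(s, i, j):
    """P_s(i, j) for random DNA with match = +1 and mismatch = -1."""
    if s <= 0:
        return Fraction(1)   # the empty alignment already scores 0
    if i < s or j < s:
        return Fraction(0)   # too few positions left to reach score s
    return (Fraction(3, 4) * p(s + 1, i - 1, j - 1)
            + Fraction(1, 4) * p(s - 1, i - 1, j - 1))


print(p(2, 3, 5))  # 1/16
print(p(2, 5, 3))  # 1/16
```

Each call decreases $i$ and $j$ by one, so the recursion always reaches one of the two boundary cases.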

11 Scores for random DNA (cont. 4)
We get:
$$P_{s,m} = \tfrac{3}{4}\, P_{s+1,m-1} + \tfrac{1}{4}\, P_{s-1,m-1}$$
An upper bound on $P_{s,m}$ (we are being cautious) is $P_s = \lim_{m \to \infty} P_{s,m}$. Since
$$\lim_{m \to \infty} P_{s,m} = \tfrac{3}{4} \lim_{m \to \infty} P_{s+1,m-1} + \tfrac{1}{4} \lim_{m \to \infty} P_{s-1,m-1}$$
one can study $P_s$ by using
$$P_s = \tfrac{3}{4}\, P_{s+1} + \tfrac{1}{4}\, P_{s-1}$$
whose characteristic equation is $3x^2 - 4x + 1 = 0$, with roots $r_0 = 1$ and $r_1 = 1/3$. So $P_s = c_0 r_0^s + c_1 r_1^s = c_0 + c_1 (1/3)^s$, where $c_0$ and $c_1$ are rational constants.

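12 Scores for random DNA (cont. 5)
To determine $c_0$ and $c_1$, we use $P_0$ and $P_1$:
$$P_0 = 1 = c_0 + c_1$$
$$P_1 = \sum_{s \ge 1} P_s - \sum_{s \ge 2} P_s = \tfrac{1}{4}(P_0 + P_1)$$
The last equality applies the recurrence to every term of $\sum_{s \ge 1} P_s$, which gives $\tfrac{3}{4} \sum_{s \ge 2} P_s + \tfrac{1}{4} \sum_{s \ge 0} P_s$. Solving for $P_1$ yields $P_1 = 1/3 = c_0 + \tfrac{1}{3} c_1$.
We get $c_0 = 0$ and $c_1 = 1$. So $P_s = (1/3)^s$ is an upper bound on $P_s(i, j)$, and a good estimate of it unless $i$ or $j$ is quite small.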
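The claim that $(1/3)^s$ is the limit of $P_{s,m}$ can be checked numerically by iterating the recurrence $P_{s,m} = \tfrac{3}{4} P_{s+1,m-1} + \tfrac{1}{4} P_{s-1,m-1}$ from $m = 0$. The sketch below uses a hypothetical helper `p_table` and floating-point numbers; it is a sanity check, not a proof.

```python
def p_table(max_s, m):
    """Return [P_{0,m}, ..., P_{max_s,m}] by iterating the recurrence m times."""
    size = max_s + m + 2              # large enough that P_{size,m} is 0
    prev = [1.0] + [0.0] * size       # m = 0: only score 0 is reachable
    for _ in range(m):
        cur = [1.0] + [0.0] * size    # P_{0,m} = 1 for every m
        for s in range(1, size):
            cur[s] = 0.75 * prev[s + 1] + 0.25 * prev[s - 1]
        prev = cur
    return prev[:max_s + 1]


for m in (3, 10, 200):
    row = p_table(4, m)
    print(m, [round(x, 6) for x in row])
print([round((1 / 3) ** s, 6) for s in range(5)])
```

For small $m$ the values stay below $(1/3)^s$, and they approach it as $m$ grows, which is consistent with using $(1/3)^s$ as a cautious estimate.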

13 Scores for random DNA (cont. 6)
b) The second goal is to approximate the probability that a high score occurs at any of the positions $(i, j)$. Let $Z_{smn}$ be the number of positions $(i, j)$, with $1 \le i \le m$ and $1 \le j \le n$, such that $\mathrm{best}(i, j) \ge s$. This probability is $\Pr(Z_{smn} \ge 1)$.
We use the Markov inequality to get an upper bound on $\Pr(Z_{smn} \ge 1)$:
$$\Pr(Z_{smn} \ge \alpha\, E[Z_{smn}]) \le 1/\alpha$$
$E[Z_{smn}]$ is the expected number of position pairs with a score of at least $s$. The goal is then to express it in terms of $m$, $n$ and $s$; doing so yields $1/\alpha$.

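14 Scores for random DNA (cont. 7)
$$E[Z_{smn}] = E\Big[\sum_{i=1}^{m} \sum_{j=1}^{n} I_{sij}\Big] = \sum_{i=1}^{m} \sum_{j=1}^{n} E[I_{sij}] = \sum_{i=1}^{m} \sum_{j=1}^{n} P_{s,\min(i, j)} \le m\, n\, P_s$$
where $I_{sij} = 1$ if $\mathrm{best}(i, j) \ge s$ and $0$ otherwise.
Since $(m\, n\, P_s)^{-1} E[Z_{smn}] \le 1$, we get:
$$\Pr(Z_{smn} \ge 1) \le \Pr\big(Z_{smn} \ge (m\, n\, P_s)^{-1} E[Z_{smn}]\big)$$
and then, by the Markov inequality with $\alpha = (m\, n\, P_s)^{-1}$:
$$\Pr(Z_{smn} \ge 1) \le m\, n\, P_s$$
$E[Z_{smn}]$ is the E-value $E$; $\Pr(Z_{smn} \ge 1)$ is the P-value $P$.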

15 Scores for random DNA (cont. 8)
If the scores at the different positions were independent, statistics tells us that we would have (extreme value distribution): $P = 1 - e^{-E}$.
Accepting this approximation, we get $P = 1 - (1 - E + E^2/2 - \dots) = E - E^2/2 + \dots$, so P and E are about the same when small (within 5% when either is less than 0.1).
Conclusion from our example, where D and Q have a gap-free alignment whose score is 13, using $E \approx |D| \cdot |Q| \cdot (1/3)^{13}$ with $(1/3)^{13} \approx 10^{-6}$:
- $|D| = |Q| = 10^3$: $E \approx 10^3 \cdot 10^3 \cdot 10^{-6} \approx 1$. The score 13 is not surprising.
- $|D| = |Q| = 10^2$: $E \approx 10^2 \cdot 10^2 \cdot 10^{-6} = 10^{-2}$. D and Q seem related.
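The two conclusions can be reproduced numerically. The sketch below assumes the random DNA model of this section, with $E = m\, n\, (1/3)^s$ and $P = 1 - e^{-E}$; the function names `e_value` and `p_value` are hypothetical.

```python
import math


def e_value(s, m, n, r=1 / 3):
    """Bound on the expected number of positions with score >= s: m * n * r**s."""
    return m * n * r ** s


def p_value(e):
    """P = 1 - exp(-E), computed with expm1 for accuracy when E is small."""
    return -math.expm1(-e)


for length in (1000, 100):
    e = e_value(13, length, length)
    print(length, e, p_value(e))

# P and E differ by less than 5% when E is at most 0.1.
print((0.1 - p_value(0.1)) / 0.1)
```

With sequences of length 1000 the E-value is about 0.63, so a score of 13 is expected to occur by chance; with length 100 it drops to about 0.006.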

16 BLAST scores for random residues
This time, the amino acid sequences have non-uniform background probabilities and a scoring matrix. Example: suppose we have only three amino acids A, B and C with $p_A = 1/3$, $p_B = 1/6$, $p_C = 1/2$ and the following symmetric scoring matrix sc:

         A    B    C
    A    2   -3   -3
    B   -3    1    0
    C   -3    0    1

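17 BLAST scores for random residues (cont. 1)
As before, the issue is to approximate $P_{s,m} = \Pr(\mathrm{best}(m, m) \ge s)$ by a function of $s$. We get:
$$P_{s,m} = \Pr(Q_m = A \wedge D_m = A)\,\Pr(\mathrm{best}(m, m) \ge s \mid Q_m = A \wedge D_m = A) + \Pr(Q_m = A \wedge D_m = B)\,\Pr(\mathrm{best}(m, m) \ge s \mid Q_m = A \wedge D_m = B) + \dots$$
$$= \tfrac{1}{3} \cdot \tfrac{1}{3}\, P_{s - sc(A,A),\, m-1} + \tfrac{1}{3} \cdot \tfrac{1}{6}\, P_{s - sc(A,B),\, m-1} + \dots$$
$$= \tfrac{1}{9}\, P_{s-2,\, m-1} + \tfrac{1}{9}\, P_{s+3,\, m-1} + \dots$$
where the last line groups the symmetric pairs (A, B) and (B, A), each of probability 1/18.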

18 BLAST scores for random residues (cont. 2)
Passing to the limit, we get:
$$P_s = \tfrac{4}{9}\, P_{s+3} + \tfrac{1}{6}\, P_s + \tfrac{5}{18}\, P_{s-1} + \tfrac{1}{9}\, P_{s-2}$$
The characteristic equation is then:
$$8x^5 - 15x^2 + 5x + 2 = 0$$
And finally: $P_s \approx 0.9 \cdot (0.6287)^s$.
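The coefficients of the recurrence and the root 0.6287 can be recovered from the background probabilities. The sketch below assumes the symmetric scores implied by those coefficients (sc(A,A) = 2, sc(B,B) = sc(C,C) = 1, sc(B,C) = 0, sc(A,B) = sc(A,C) = -3); all function names are hypothetical, and the root is located by plain bisection on an interval chosen by inspection.

```python
from fractions import Fraction

FREQ = {"A": Fraction(1, 3), "B": Fraction(1, 6), "C": Fraction(1, 2)}
SC = {("A", "A"): 2, ("A", "B"): -3, ("A", "C"): -3,
      ("B", "B"): 1, ("B", "C"): 0, ("C", "C"): 1}


def pair_score(a, b):
    """Symmetric lookup in the upper-triangular score table."""
    return SC[a, b] if (a, b) in SC else SC[b, a]


def step_distribution():
    """Probability of each score when two random residues are aligned."""
    dist = {}
    for a, pa in FREQ.items():
        for b, pb in FREQ.items():
            d = pair_score(a, b)
            dist[d] = dist.get(d, Fraction(0)) + pa * pb
    return dist


def char_poly(x, dist):
    """P_s = sum_d Pr(d) * P_{s-d} with P_s = x**s, multiplied by x**max(d)."""
    top = max(dist)
    return sum(float(p) * x ** (top - d) for d, p in dist.items()) - x ** top


def bisect_root(f, lo, hi, iters=60):
    """Bisection; assumes f(lo) and f(hi) have opposite signs."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


dist = step_distribution()
print(dist)  # scores 2, -3, 0, 1 with probabilities 1/9, 4/9, 1/6, 5/18
root = bisect_root(lambda x: char_poly(x, dist), 0.5, 0.9)
print(round(root, 4))  # about 0.6287
```

Multiplying `char_poly` by 18 gives exactly $8x^5 - 15x^2 + 5x + 2$, so the bisection works on the same equation as the slide, up to a constant factor. The root $x = 1$ is excluded by the search interval.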

19 3. BLAST output
Roughly speaking, BLAST defines E-values such that:
$$E \approx K\, m\, n\, e^{-\lambda s} \quad \text{where } K = c_1 \text{ and } \lambda = -\ln r_1$$
An E-value must be lower than a cutoff value for the match to be considered evidence of homology.
BLAST also introduces a bit score $s'$ such that $E = m\, n\, 2^{-s'}$, that is, $s' = (\lambda s - \ln K) / \ln 2$.
The higher the bit score, the more similar the sequences. Matches with a bit score below 50 are very unreliable.
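The relation between the raw score, the bit score and the E-value can be illustrated as follows. This is a sketch with hypothetical function names; the parameter values are those of the two toy models in this deck ($K = 1$, $r_1 = 1/3$ for random DNA and $K = 0.9$, $r_1 = 0.6287$ for the three-letter alphabet), not values reported by BLAST.

```python
import math


def e_value(s, m, n, K, lam):
    """E = K * m * n * exp(-lam * s)."""
    return K * m * n * math.exp(-lam * s)


def bit_score(s, K, lam):
    """s' = (lam * s - ln K) / ln 2, so that E = m * n * 2**(-s')."""
    return (lam * s - math.log(K)) / math.log(2)


models = {"random DNA": (1.0, -math.log(1 / 3)),
          "three-letter alphabet": (0.9, -math.log(0.6287))}
for name, (K, lam) in models.items():
    s, m, n = 13, 1000, 1000
    bits = bit_score(s, K, lam)
    print(name, round(bits, 2), e_value(s, m, n, K, lam), m * n * 2 ** -bits)
```

The last two printed numbers agree for each model, which shows that the bit score absorbs $K$ and $\lambda$ and makes scores from different scoring systems comparable.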

20 BLAST output (cont. 1)
The output also gives:
- a listing of the database sequences with high-scoring alignments, along with the bit scores and the E-values of the alignments;
- the alignments themselves with details (for example, the number of identical residue pairs and the number of pairs with positive scores);
- statistics related to the query and the database as a whole, for example the values of $\lambda$ and $K$ for the substitution matrix used for the alignment ($P_s \approx K e^{-\lambda s}$).