1 Lecture outline Database searches –BLAST –FASTA Statistical Significance of Sequence Comparison Results –Probability of matching runs –Karlin-Altschul.

Presentation transcript:

1 Lecture outline Database searches –BLAST –FASTA Statistical Significance of Sequence Comparison Results –Probability of matching runs –Karlin-Altschul statistics –Extreme value distribution

2 The purpose of sequence alignment Homology Function identification –about 70% of the genes of M. jannaschii were assigned a function using sequence similarity (1997)

3 Similarity How similar do the sequences have to be to infer homology? Two possibilities when similarity is detected: –The similarity is by chance –They evolved from a common ancestor, and hence are likely to have similar functions

4 Measures of similarity Percent identity: –40% similar, 70% similar –problems with percent identity? Scoring matrices –matching of some amino acids may be more significant than matching of other amino acids –PAM matrices in the 1970s, BLOSUM in 1992 –problems?

5 Statistical Significance Goal: to provide a universal measure for inferring homology –How different is the result from a random match, or a match between unrelated sequences? –Given a set of sequences not related to the query (or a set of random sequences), what is the probability of finding a match with the same alignment score by chance? Different statistical measures –p-value –E-value –z-score

6 Statistical significance measures p-value: the probability that at least one sequence will produce the same or a better score by chance E-value: the expected number of sequences that will produce the same or a better score by chance z-score: how many standard deviations the score lies above the mean of the score distribution

7

8 Search Significance Scores A search will always return some hits. How can we determine how “unusual” a particular alignment score is? –ORF’s Assumptions

9 Assessing significance requires a distribution I have an apple of diameter 5”. Is that unusual? [Figure: histogram of apple diameters; x-axis: Diameter (cm), y-axis: Frequency]

10 Is a match significant? Match scores for aligning my sequence with random sequences. Depends on: –Scoring system –Database –Sequence to search for (length, composition) How do we determine the random sequences? [Figure: histogram of match scores; x-axis: Match score, y-axis: Frequency]

11 Generating “random” sequences Random uniform model: P(G) = P(A) = P(C) = P(T) = 0.25 –Doesn’t reflect nature Use sequences from a database –Might have genuine homology –We want unrelated sequences Random shuffling of sequences –Preserves composition –Removes true homology
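The shuffling option can be sketched in a few lines of Python. This is only an illustration: the query string and the helper name `shuffled_copy` are made up for the example, and real tools may shuffle in more careful ways (for example preserving dipeptide composition).

```python
import random
from collections import Counter

def shuffled_copy(seq, rng):
    """Return a random permutation of seq: same composition, order destroyed."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

rng = random.Random(0)  # fixed seed only so that the sketch is repeatable
query = "HKIYHLQSKVPTFVRMLAPEGALNIHEK"  # hypothetical protein fragment
decoy = shuffled_copy(query, rng)

# Composition is preserved exactly, so any remaining similarity to the
# original ordering is due to chance.
same_composition = Counter(decoy) == Counter(query)
print(decoy, same_composition)
```

Scoring the query against many such decoys gives an empirical background distribution of match scores.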

12 What distribution do we expect to see? The mean of n random events tends towards a Gaussian distribution. –Example: Throw n dice and compute the mean. [Figure: distribution of means for n = 2 and for n = 1000]

13 The extreme value distribution This means that if we get the match scores for our sequence with n other sequences, the mean would follow a Gaussian distribution. The maximum of n random events tends towards the extreme value distribution as n grows large.
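A small simulation can make the contrast between means and maxima concrete. The sketch below assumes, purely for illustration, that individual "match scores" are standard normal draws; the point is only that the means look symmetric while the maxima are shifted and right-skewed.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

def sample_scores(n):
    # Stand-in for n match scores: standard normal draws (an assumption).
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def skewness(xs):
    """Sample skewness: about 0 for a symmetric distribution."""
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return statistics.fmean([((x - mu) / sd) ** 3 for x in xs])

trials, n = 2000, 100
means = [statistics.fmean(sample_scores(n)) for _ in range(trials)]
maxima = [max(sample_scores(n)) for _ in range(trials)]

print("skewness of means :", round(skewness(means), 2))
print("skewness of maxima:", round(skewness(maxima), 2))
```

A database search reports the best score over many comparisons, which is why the maximum, not the mean, is the relevant statistic.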

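14 Comparing distributions (parameters μ and σ) Extreme value: $f(x) = \frac{1}{\sigma} e^{-(x-\mu)/\sigma} e^{-e^{-(x-\mu)/\sigma}}$ Gaussian: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}$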

15 How to compute statistical significance? Significance of a match-run –Erdős-Rényi Significance of local alignments without gaps –Karlin-Altschul statistics –Scoring matrices revisited Significance of local alignments with gaps Significance of global alignments

16 Analysis of coin tosses Let black circles indicate heads. Let p be the probability of a “head” –For a “fair” coin, p = 0.5 The probability of 5 heads in a row is (1/2)^5 = 0.031. A run of 5 can start at any of 14 − 5 + 1 = 10 positions in the 14 coin tosses above, so the expected number of times that 5H occurs is 10 × 0.031 = 0.31.

17 Analysis of coin tosses The expected number of length-l runs of heads in n tosses is approximately $n p^l$. What is the expected length R of the longest run in n tosses?

18 Analysis of coin tosses (Erdős-Rényi) If there are n throws, then the expected length R of the longest run of heads is $R = \log_{1/p}(n)$

19 Example Suppose n = 20 for a “fair” coin: $R = \log_2(20) = 4.32$ –In other words: in 20 coin tosses we expect about one run of heads of length 4.32. The trick is how to model DNA (or amino acid) sequence alignments as coin tosses.
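The coin-toss numbers from the last few slides can be checked with a short Python sketch. The function names are made up for this example; the formulas are the ones used on the slides (number of start positions times $p^l$, and $R = \log_{1/p}(n)$).

```python
import math

def expected_run_count(n, l, p):
    """Expected number of length-l windows that are all heads in n tosses."""
    return (n - l + 1) * p ** l

def expected_longest_run(n, p):
    """Erdős-Rényi estimate R = log_{1/p}(n) for the longest run of heads."""
    return math.log(n) / math.log(1.0 / p)

# 14 tosses, runs of 5 heads, fair coin: 10 start positions * (1/2)^5
print(expected_run_count(14, 5, 0.5))            # 0.3125
# 20 tosses of a fair coin
print(round(expected_longest_run(20, 0.5), 2))   # 4.32
```

Setting the expected count $n p^R$ to 1 and solving for R gives the logarithmic formula, which is why the longest run grows only slowly with n.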

20 Analysis of an alignment Probability of an individual match p = 0.05 Expected number of matches: 10x8x0.05 = 4 Expected number of two successive matches 10x8x0.05x0.05 = 0.2

21 Matching runs in sequence alignments Consider two sequences $a_{1..m}$ and $b_{1..n}$. If the probability of occurrence for every symbol is p, then the probability that residue $a_i$ matches $b_j$ is p, and the probability of a match of length l from $a_i, b_j$ to $a_{i+l-1}, b_{j+l-1}$ is $p^l$. The head-run problem of coin tosses corresponds to the longest run of matches along the diagonals.

22 Matching runs in sequence alignments There are $(m-l+1) \times (n-l+1)$ places where the match could start. The expected length of the longest match can be approximated as $R = \log_{1/p}(mn)$, where m and n are the lengths of the two sequences.

23 Matching runs in sequence alignments So suppose m = n = 10 and we’re looking at DNA sequences: $R = \log_4(100) = 3.32$. This analysis assumes a uniform base composition and no gaps, but it’s a good estimate.
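The same arithmetic carries over from coin tosses to two sequences. The Python sketch below uses hypothetical function names; note that slide 20 uses the simpler count m·n·p^l, which ignores the edge correction (m−l+1)(n−l+1) from slide 22, so both are shown.

```python
import math

def expected_longest_match(m, n, p):
    """Estimate R = log_{1/p}(m*n) for the longest exact matching run."""
    return math.log(m * n) / math.log(1.0 / p)

def expected_match_runs(m, n, l, p, edge_correction=True):
    """Expected number of length-l matching runs along the diagonals."""
    starts = (m - l + 1) * (n - l + 1) if edge_correction else m * n
    return starts * p ** l

# Uniform DNA (p = 1/4), two sequences of length 10
print(round(expected_longest_match(10, 10, 0.25), 2))    # 3.32

# Slide 20: lengths 10 and 8, p = 0.05, no edge correction
print(expected_match_runs(10, 8, 1, 0.05, edge_correction=False))  # about 4
print(expected_match_runs(10, 8, 2, 0.05, edge_correction=False))  # about 0.2
```

With the edge correction the count for l = 2 drops slightly (9 × 7 start positions instead of 80), which matters little for long sequences.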

24 Statistics for matching runs Length versus score? –Suppose every mismatch receives a score of −∞ and a match of $a_i$ with $b_j$ receives a positive score $s_{i,j}$. What is the expected number of matching runs with a score of x or higher? –Using this theory of matching runs, Karlin and Altschul developed statistics for local alignments without gaps, extending the theory to allow for mismatches.

25 Statistics of local alignments without gaps The scoring matrix must satisfy the following constraint: –The expected score of aligning a random pair of residues must be negative. –Otherwise, arbitrarily long random sequences will get higher scores just because they are long, not because there’s a significant match. If this requirement is met, then the expected number of alignments with score x or higher is given by: $E(S \ge x) = K m n e^{-\lambda x}$

26 Statistics of local alignments without gaps –K < 1 is a proportionality constant that corrects the mn “space factor” for the fact that there are not really mn independent places that could have produced score S ≥ x. –K has little effect on the statistical significance of a similarity score. –λ is closely related to the scoring matrix used; it takes into account that scoring matrices do not contain actual probabilities of co-occurrence, but a scaled version of those values. To understand how λ is computed, we have to look at the construction of scoring matrices.

27 Scoring Matrices In the 1970s there were few protein sequences available. Dayhoff used multiple alignments of a limited set of protein families to infer mutation likelihoods.

28 Scoring Matrices Dayhoff represented the similarity of amino acids as a log odds ratio: $s_{ij} = \log \frac{q_{ij}}{p_i p_j}$, where $q_{ij}$ is the observed frequency of co-occurrence, and $p_i$, $p_j$ are the individual frequencies.

29 Example Suppose M occurs in the sequences with frequency 0.01 and L occurs with frequency 0.1. By random pairing, you expect a fraction 0.01 × 0.1 = 0.001 of amino acid pairs to be M-L. If the observed frequency of M-L is actually 0.003, the score for matching M-L is $\log_2(0.003/0.001) = \log_2 3 = 1.585$ bits, or $\ln 3 = 1.1$ nats. Since scoring matrices are usually provided as integer matrices, these values are scaled by a constant factor. λ is approximately the inverse of the original scaling factor.
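The M-L example can be reproduced directly. This is a minimal Python sketch; `log_odds` is a name chosen for the example, not a function from any library.

```python
import math

def log_odds(q_ij, p_i, p_j, base=2.0):
    """Log-odds score: observed pair frequency over the chance expectation."""
    return math.log(q_ij / (p_i * p_j)) / math.log(base)

p_M, p_L = 0.01, 0.1     # individual frequencies from the example
q_ML = 0.003             # observed M-L pair frequency

bits = log_odds(q_ML, p_M, p_L)               # base 2
nats = log_odds(q_ML, p_M, p_L, base=math.e)  # natural log
print(round(bits, 3), round(nats, 3))         # 1.585 1.099
```

An integer matrix entry would be obtained by multiplying such a value by a fixed scale factor and rounding; that scale factor is what λ later undoes.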

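30 How to compute λ Recall that: $s_{ij} = \frac{1}{\lambda} \ln \frac{q_{ij}}{p_i p_j}$ and hence: $q_{ij} = p_i p_j e^{\lambda s_{ij}}$. The sum of the observed frequencies is 1: $\sum_{i,j} p_i p_j e^{\lambda s_{ij}} = 1$. Given the frequencies of individual amino acids and the scores in the matrix, λ can be estimated.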
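One way to estimate λ numerically is to solve $\sum_{i,j} p_i p_j e^{\lambda s_{ij}} = 1$ for its positive root. The Python sketch below uses plain bisection on a toy DNA scoring scheme (+1 match, −1 mismatch, uniform base frequencies); the scheme and the function name `solve_lambda` are assumptions made for illustration, not the procedure of any particular program.

```python
import math

def solve_lambda(freqs, score, lo=1e-6, hi=10.0, tol=1e-12):
    """Positive root of sum_ij p_i p_j exp(lam * s_ij) = 1, by bisection."""
    def f(lam):
        return sum(freqs[a] * freqs[b] * math.exp(lam * score(a, b))
                   for a in freqs for b in freqs) - 1.0
    # lam = 0 is always a trivial root. With a negative expected score,
    # f is negative just above 0 and positive for large lam.
    if not (f(lo) < 0.0 < f(hi)):
        raise ValueError("no sign change: is the expected score negative?")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

dna = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # toy background
match_mismatch = lambda a, b: 1 if a == b else -1     # toy scores

lam = solve_lambda(dna, match_mismatch)
print(round(lam, 4))   # 1.0986, i.e. ln 3 for this toy scheme
```

For this toy case the equation reduces to 0.25·e^λ + 0.75·e^(−λ) = 1, whose positive solution is λ = ln 3, so the numerical answer can be checked by hand.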

31 Extreme value distribution Consider an experiment that obtains the maximum value of locally aligning a random string with query string (without gaps). Repeat with another random string and so on. Plot the distribution of these maximum values. The resulting distribution is an extreme value distribution, called a Gumbel distribution.

32 Normal vs. Extreme Value Distribution Extreme value distribution: $y = e^{-x - e^{-x}}$ Normal distribution: $y = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ [Figure: the two density curves, labelled Normal and Extreme Value]
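The practical difference between the two curves is in the right tail. The Python sketch below evaluates the standardized densities given on the slide and their upper-tail probabilities; the tail formulas (1 − exp(−e^(−x)) for the extreme value distribution and the complementary error function for the normal) are standard results added here for illustration.

```python
import math

def evd_pdf(x):
    """Standardized extreme value (Gumbel) density: exp(-x - exp(-x))."""
    return math.exp(-x - math.exp(-x))

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def evd_tail(x):
    """P(X >= x) for the standardized EVD: 1 - exp(-exp(-x))."""
    return -math.expm1(-math.exp(-x))

def normal_tail(x):
    """P(X >= x) for the standard normal."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

for x in (2.0, 4.0, 6.0):
    print(x, f"{evd_tail(x):.2e}", f"{normal_tail(x):.2e}")
```

Treating alignment scores as normally distributed would therefore make high scores look far more significant than they are.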

33 Local alignments with gaps The extreme value distribution is not always observed. The theory of local alignments with gaps is not as well developed as the theory without gaps; results are mostly empirical. For example, BLAST allows only a certain range of gap penalties.

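34 Comparing distributions (parameters μ and σ) Extreme value: $f(x) = \frac{1}{\sigma} e^{-(x-\mu)/\sigma} e^{-e^{-(x-\mu)/\sigma}}$ Gaussian: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/(2\sigma^2)}$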

35 Determining P-values If we can estimate μ and σ, then we can determine, for a given match score x, the probability that a random match with score x or greater would have occurred in the database. For sequence matches, a scoring system and database can be parameterized by two parameters, K and λ, related to μ and σ. –It would be nice if we could compare hit significance without regard to the database and scoring system used!

36 Bit Scores The expected number of hits with score ≥ S is: $E = K m n e^{-\lambda S}$ –where m and n are the sequence lengths. Normalize the raw score using: $S' = \frac{\lambda S - \ln K}{\ln 2}$. This yields a “bit score” S’, with a standard set of units. The new E-value is: $E = m n 2^{-S'}$

37 P values and E values BLAST reports E-values. E = 5 and E = 10 versus P = 0.993 and P = 0.99995 (from $P = 1 - e^{-E}$). When E < 0.01, P-values and E-values are nearly identical.

38 BLAST parameters Lowering the neighborhood word threshold (T) allows more distantly related sequences to be found, at the expense of increased noise in the result set. Raising the segment extension cutoff (X) returns longer extensions for each hit. Changing the E-value threshold changes which hits are reported.

39 BLAST statistics Pre-computed λ and K values for different scoring matrices and gap penalties are used for faster computation. The raw score S is converted to a bit score: $S' = \frac{\lambda S - \ln K}{\ln 2}$. The E-value is then computed from the bit score together with m, the query size, n, the database size, and L, the typical length of a maximal scoring alignment.

40 Evaluating BLAST Results A BLAST search in a sequence database might produce hundreds of candidate alignments. How do we know which are meaningful, i.e. homologous? BLAST provides: –Raw scores –Bit scores –E-values Probability is the basic element of tests for statistical significance.

41 Raw scores: the sum of the scores of the maximal-scoring segment pairs (MSPs) that make up the alignment. Because scoring matrices differ, raw scores are not directly comparable. Bit scores: raw scores that have been converted from the log base of the scoring matrix that creates the alignment to log base 2. This rescaling makes bit scores comparable. E-values: the E-value is not the likelihood that an alignment is significant; it is the number of alignments one expects to find with a score equal to or greater than the given one in a search against a random database. A large E-value (5 or 10) indicates that the alignment is probably due to chance. E-values of 0.1 or 0.05 are typical cutoff values for database searches. Proteins with less than 25% identity are often not similar enough for a reliable BLAST analysis, and structural comparison may be needed.

42 [Figure: probability density functions of the extreme value distribution (characteristic value u = 0 and decay constant λ = 1) and of the normal distribution; x-axis: x, y-axis: probability] (page 103)

43 page 104

44 How to interpret a BLAST search: expect value The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. An E value is related to a probability value p. The key equation describing an E value is: $E = K m n e^{-\lambda S}$ (page 105)

45 This equation is derived from a description of the extreme value distribution: $E = K m n e^{-\lambda S}$, where S = the score; E = the expect value = the number of HSPs expected to occur with a score of at least S; m, n = the lengths of the two sequences; λ, K = Karlin-Altschul statistical parameters.

46 Some properties of the equation $E = K m n e^{-\lambda S}$ The value of E decreases exponentially with increasing S (higher S values correspond to better alignments). Very high scores correspond to very low E values. The expected score for aligning a random pair of residues must be negative! Otherwise, long random alignments would acquire high scores. Parameter K is a scaling factor for the search space (database). For E = 1, one match with a similar score is expected to occur by chance. For a much larger or smaller database, you would expect E to vary accordingly.

47 From raw scores to bit scores There are two kinds of scores: raw scores (calculated from a substitution matrix) and bit scores (normalized scores). Bit scores are comparable between different searches because they are normalized to account for the use of different scoring matrices and different database sizes: $S' = \text{bit score} = \frac{\lambda S - \ln K}{\ln 2}$. The E value corresponding to a given bit score is: $E = m n 2^{-S'}$. Bit scores allow you to compare results between different database searches, even using different scoring matrices. (page 106)
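The two formulas on this slide can be checked against the raw-score equation E = Kmn·e^(−λS). The Python sketch below uses illustrative parameter values (λ = 0.267, K = 0.041, and made-up query and database lengths); in practice λ and K depend on the scoring matrix and gap penalties, so treat every number here as an assumption.

```python
import math

def bit_score(raw, lam, K):
    """S' = (lam * S - ln K) / ln 2"""
    return (lam * raw - math.log(K)) / math.log(2.0)

def e_value_from_raw(raw, m, n, lam, K):
    """E = K * m * n * exp(-lam * S)"""
    return K * m * n * math.exp(-lam * raw)

def e_value_from_bits(bits, m, n):
    """E = m * n * 2^(-S')"""
    return m * n * 2.0 ** (-bits)

lam, K = 0.267, 0.041        # illustrative Karlin-Altschul parameters
m, n = 250, 1_000_000        # hypothetical query and database lengths
raw = 100                    # hypothetical raw alignment score

bits = bit_score(raw, lam, K)
e_raw = e_value_from_raw(raw, m, n, lam, K)
e_bits = e_value_from_bits(bits, m, n)
print(round(bits, 1), e_raw, e_bits)
```

Both routes give the same E value, which is the point of the normalization: once a score is in bits, λ and K are no longer needed to interpret it.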

48 How to interpret BLAST: E values and p values The expect value E is the number of alignments with scores greater than or equal to score S that are expected to occur by chance in a database search. A p value is a different way of representing the significance of an alignment: $p = 1 - e^{-E}$ (page 106)

49 How to interpret BLAST: E values and p values Very small E values are very similar to p values. E values of about 1 to 10 are far easier to interpret than the corresponding p values ($p = 1 - e^{-E}$). [Table with columns E and p; the surviving annotations on the p column are: about 0.1, about 0.05, about 0.001] (page 107)
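The relation p = 1 − e^(−E) is easy to tabulate. The Python sketch below uses a set of E values chosen for illustration; it is not a reproduction of the original table.

```python
import math

def p_from_e(e):
    """p = 1 - exp(-E); expm1 keeps precision when E is tiny."""
    return -math.expm1(-e)

for e in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
    print(f"E = {e:<7} p = {p_from_e(e):.8f}")
```

For E below about 0.01 the two numbers agree to several digits, while for E of 1 or more the p value saturates near 1 and stops being informative.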

50 How to interpret BLAST: getting to the bottom page 107

51 [Figure: annotated footer of a BLAST output. Labels: threshold score = 11; EVD parameters; matrix; effective search space = mn = length of query × database length; 10.0 is the E value; gap penalties; cut-off parameters] (page 108)

52 Changing E, T & matrix for blastp nr RBP
Expect: 10 (T=11); 1 (T=11); 10,000 (T=11); 10 (T=5); 10 (T=11); 10 (T=16); 10 (BL45); 10 (PAM70)
#hits to db: 129m; 112m; 386m; 129m
#sequences: 1,043,455; 1.0m; 907, m
#extensions: 5.2m; 508m; 4.5m; 73, m; 19.5m
#successful extensions: 8,367; 11,484; 7,288; 1,147; 9,088; 13,873
#sequences better than E: ,
#HSPs>E (no gapping): 53466,
#HSPs gapped: ,
X1, X2, X3: 16 (7.4 bits); 38 (14.6 bits); 64 (24.7 bits)

58 BLAST search strategies General concepts: How to evaluate the significance of your results; How to handle too many results; How to handle too few results; BLAST searching with HIV-1 pol, a multidomain protein; BLAST searching with lipocalins using different matrices

59 Sometimes a real match has an E value > 1 … try a reciprocal BLAST to confirm (page 110)

60 Sometimes a similar E value occurs for a short exact match and for a long, less exact match (page 111)

61 Assessing whether proteins are homologous RBP4 and PAEP: low bit score, E value 0.49, 24% identity (“twilight zone”). But they are indeed homologous. Try a BLAST search with PAEP as a query, and find many other lipocalins. (page 111)

62 page 112

63 Searching with a multidomain protein, pol (page 114)

64

65 Searching bacterial sequences with pol

66 Protein sequence Motifs or Patterns