Significance in protein analysis


Significance in protein analysis Swapan ‘Shop’ Mallick Bioinformatics Group Institute of Biotechnology University of Helsinki

Overview
The need for statistics
Example: BLOSUM
What do the scores mean?
How can you compare two scores?
Example: BLAST
Problems with BLAST
Review of distributions
Distribution of random BLAST results
P-values and E-values
Statistics of BLAST
Summary and conclusion
Exercise

The need for statistics
Statistics is very important for bioinformatics. It is easy to have a computer analyze the data and return a result; the problem is deciding whether that result is any good at all.
Questions: How statistically significant is the answer? What is the probability that this answer could have been obtained by random chance? What does this depend on?
Statistics underlie many of the tools we use for protein analysis, including alignment tools (e.g. BLAST) and protein classification (e.g. PFAM, HMM).
Other questions: Is a certain pattern of amino acids or nucleotides important information that tells you something about a sequence, or is it nothing more than a fluctuation in the random background noise? These are the underlying questions you need to ask whenever you do a database search or any other type of bioinformatics analysis.


Basics
[Figure: a population of size N and a sample of size n, labelled "Descriptive statistics" and "Probability".]

Example: BLOSUM
The BLOSUM matrix assigns a probability-based score to each residue pair in an alignment, based on the frequency with which that pairing is known to occur within conserved blocks of related proteins. This case is simple because the size of the population equals the size of the sample. BLOSUM matrices are constructed from observations, which give observed probabilities.

BLOSUM substitution matrices
BLOSUM matrices are used in ‘log-odds’ form, based on actually observed substitutions, for two reasons.
Ease of use: ‘scores’ can simply be added (the raw probabilities would have to be multiplied).
Ease of interpretation (observed vs. expected):
S = 0: the substitution is observed exactly as often as expected by chance.
S < 0: the substitution is observed less often than expected by chance.
S > 0: the substitution is observed more often than expected by chance.
What follows describes what the numbers mean and how they relate to one another. For example, we intuitively know that 6 is better than 5. But how much better?
http://www.dina.kvl.dk/%7Esestoft/bsa/graphalign.html shows traceback.

Substitution matrices
The score of amino acid a with amino acid b is
$$S_{ab} = \frac{1}{\lambda} \ln \frac{P_{ab}}{f_a f_b}$$
$P_{ab}$ is the observed frequency with which residues a and b are correlated because of homology.
$f_a f_b$ is the expected frequency of seeing residues a and b paired together, which is just the product of the frequency of residue a and the frequency of residue b.
$\lambda$ is a scaling factor equal to 0.347, set so that the scores can be rounded off to sensible integers.
http://www.dina.kvl.dk/%7Esestoft/bsa/graphalign.html shows traceback.
Source: Eddy S., "Where did the BLOSUM62 alignment score matrix come from?", Nat. Biotech. 22, Aug 2004.



The observed/expected (O/E) ratio is an exponential function of the score: $O/E = e^{\lambda S}$, with $\lambda = 0.347$.
i) S = 0: O/E ratio = 1.
ii) Compare S = 5 and S = 10: O/E ratio = 5.7 for S = 5 and 32.1 for S = 10.
iii) S = -10: O/E ratio = 0.031 ≈ 1/32, i.e. the correspondence of the two amino acids in an alignment that accurately represents homology (evolutionary descent) is about one thirty-second as frequent as the chance alignment of these amino acids.
iv) The ratio of two scores S1, S2 in terms of probabilities of observed/random is $e^{\lambda (S_1 - S_2)}$. E.g. the ratio of probabilities for the scores 10 and 5 is about 5.7.
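The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function names `odds_ratio` and `relative_odds` are made up for illustration, and the only input taken from the slides is the scaling factor 0.347.

```python
import math

LAMBDA = 0.347  # scaling factor quoted on the slides


def odds_ratio(score, lam=LAMBDA):
    """Observed/expected ratio implied by a log-odds score S: exp(lambda * S)."""
    return math.exp(lam * score)


def relative_odds(s1, s2, lam=LAMBDA):
    """How many times larger the O/E ratio of score s1 is than that of s2."""
    return math.exp(lam * (s1 - s2))


for s in (0, 5, 10, -10):
    print(f"S = {s:>3}: O/E = {odds_ratio(s):.3f}")
print(f"S = 10 versus S = 5: {relative_odds(10, 5):.2f} times more likely")
```

Because the scores are logarithms, a difference of 5 score units always corresponds to the same multiplicative factor, whatever the two scores are.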

Example: BLAST
Motivations: Exact algorithms are exhaustive but computationally expensive. They are impractical for comparing a query sequence to millions of other sequences in a database (database scanning), so database scanning requires a heuristic alignment algorithm, at the cost of optimality.

Interpret BLAST results - Description
ID (GI #, RefSeq #, DB-specific ID #): click to access the record in GenBank.
Gene/sequence definition.
Bit score: higher is better. Click to access the pairwise alignment.
Expect value (E-value): lower is better. It is the number of hits with at least this score expected by chance, so it indicates how likely this is to be a random hit.
Links.

Problems with BLAST
Why do results change? How can you compare results from different BLAST tools, which may report different types of values? How are results (e.g. the E-value) affected by the query? There are many values reported in the output; what do they mean?

Example: Importance of BLAST statistics
But first, a review.

Review What is a distribution? A plot showing the frequency of a given variable or observation.


Features of a Normal Distribution
Symmetric distribution. Has an average or mean value (μ) at the centre. Has a characteristic width called the standard deviation (S.D. = σ). Most common type of distribution known.

Standard Deviations (Z-score)
The Z value is the number of standard deviations an observation lies away from the mean: $Z = (x - \mu)/\sigma$.
Disadvantages of Z-scores: the absolute value is lost; the same score in a different sample gives a different Z-score.
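A small sketch of the second disadvantage, using Python's standard `statistics` module. The two samples are invented numbers chosen only to make the point; the helper name `z_score` is not from the slides.

```python
import statistics


def z_score(x, sample):
    """Number of standard deviations that x lies away from the sample mean."""
    mu = statistics.mean(sample)
    sigma = statistics.pstdev(sample)  # population standard deviation
    return (x - mu) / sigma


# Two made-up score samples; the same raw score of 40 is judged against each.
sample_a = [20, 25, 30, 35, 40]
sample_b = [30, 40, 50, 60, 70]

print(f"Z of 40 in sample A: {z_score(40, sample_a):+.3f}")
print(f"Z of 40 in sample B: {z_score(40, sample_b):+.3f}")
```

The raw score is identical in both calls, yet it sits above the mean of one sample and below the mean of the other, which is why a Z-score cannot be compared across samples without knowing where it came from.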

Mean, Median & Mode
[Figure: a distribution with the positions of the mode, median and mean marked.]

Mean, Median, Mode
In a normal distribution the mean, mode and median are all equal. In skewed distributions they are unequal.
Mean: the average value, affected by extreme values in the distribution.
Median: the “middlemost” value, usually lying between the mode and the mean.
Mode: the most common value.
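As a quick illustration, the three measures can be computed for a small right-skewed data set. The numbers below are invented for the example.

```python
import statistics

# Made-up, right-skewed data: most values are small, one is very large.
scores = [1, 2, 2, 2, 3, 3, 4, 5, 9, 19]

mean = statistics.mean(scores)      # pulled upwards by the extreme value 19
median = statistics.median(scores)  # the middlemost value
mode = statistics.mode(scores)      # the most common value

print(f"mode = {mode}, median = {median}, mean = {mean}")
```

Here the median falls between the mode and the mean, and removing the single extreme value would move the mean far more than the median.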

Different Distributions
[Figure: examples of a unimodal and a bimodal distribution.]

Other Distributions Binomial Distribution Poisson Distribution Extreme Value Distribution

Binomial Distribution
Binomial coefficients (Pascal's triangle):
1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
$P(x) = (p + q)^n$
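The link between the triangle and the formula can be sketched as follows: expanding $(p + q)^n$ gives one term per outcome x, each weighted by a coefficient from row n of the triangle. The function names are illustrative only.

```python
from math import comb


def pascal_row(n):
    """Row n of Pascal's triangle: the binomial coefficients C(n, 0..n)."""
    return [comb(n, x) for x in range(n + 1)]


def binomial_pmf(x, n, p):
    """Probability of x successes in n trials: one term of (p + q)^n."""
    q = 1 - p
    return comb(n, x) * p**x * q ** (n - x)


for n in range(6):
    print(pascal_row(n))

# The terms of the expansion sum to (p + q)^n = 1.
print(sum(binomial_pmf(x, 5, 0.3) for x in range(6)))
```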

Poisson Distribution
[Figure: Poisson distributions, P(x) (proportion of samples) against x, for μ = 0.1, 1, 2, 3 and 10.]
The Poisson distribution is like a normal distribution at high values of μ.
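The claim about high values of μ can be checked numerically. The sketch below compares the Poisson probabilities with a normal density of the same mean and variance (both equal to μ) and reports the largest disagreement; the helper names are made up for this example.

```python
import math


def poisson_pmf(x, mu):
    """P(x) for a Poisson distribution with mean mu (computed in log space)."""
    return math.exp(x * math.log(mu) - mu - math.lgamma(x + 1))


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and S.D. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def max_gap(mu):
    """Largest absolute difference between Poisson(mu) and Normal(mu, sqrt(mu))."""
    upper = int(mu + 10 * math.sqrt(mu)) + 1
    return max(
        abs(poisson_pmf(x, mu) - normal_pdf(x, mu, math.sqrt(mu)))
        for x in range(upper + 1)
    )


for mu in (0.1, 1, 2, 3, 10, 100):
    print(f"mu = {mu:>5}: largest gap = {max_gap(mu):.4f}")
```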

Review
What is a distribution? A plot showing the frequency of a given variable or observation.
What is a null hypothesis? A statistician’s way of characterizing “chance”: generally, a mathematical model of randomness with respect to a particular set of observations. The purpose of most statistical tests is to determine whether the observed data can be explained by the null hypothesis.


Review Examples of null hypotheses: Sequence comparison using shuffled sequences. A normal distribution of log ratios from a microarray experiment. LOD scores from genetic linkage analysis when the relevant loci are randomly sprinkled throughout the genome.

Empirical score distribution
The picture shows a distribution of scores from a real database search using BLAST. This distribution contains scores from both non-homologous and homologous pairs. The high scores come from homology.

Empirical null score distribution This distribution is similar to the previous one, but generated using a randomized sequence database.


Review
What is a p-value? The probability of observing an effect as strong as or stronger than the one you observed, given the null hypothesis, i.e. “How likely is this effect to occur by chance?”
$\Pr(x \ge S \mid \text{null})$
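One way to make this definition concrete is to estimate the p-value empirically from a null distribution built with shuffled sequences. The sketch below is a toy under loud assumptions: the two sequences are invented, the scoring (+1 match, -1 mismatch, no gaps) is a stand-in for a real substitution matrix, and the add-one correction in `empirical_p_value` is a common convention rather than anything taken from the slides.

```python
import random


def best_ungapped_score(a, b, match=1, mismatch=-1):
    """Score of the best ungapped local segment pair between a and b (toy scoring)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            step = match if ch_a == ch_b else mismatch
            cur[j] = max(0, prev[j - 1] + step)  # extend the diagonal or restart
            best = max(best, cur[j])
        prev = cur
    return best


def shuffled_null(query, target, trials=200, seed=0):
    """Null scores: the query against shuffled copies of the target."""
    rng = random.Random(seed)
    letters = list(target)
    scores = []
    for _ in range(trials):
        rng.shuffle(letters)
        scores.append(best_ungapped_score(query, "".join(letters)))
    return scores


def empirical_p_value(observed, null_scores):
    """Fraction of null scores at least as large as the observed score."""
    hits = sum(1 for s in null_scores if s >= observed)
    return (hits + 1) / (len(null_scores) + 1)


# Hypothetical sequences, invented for this example only.
query = "MKVLAAGIVGLCAQEPTDWR"
target = "MKVLSAGIVGLCAKEPTEWR"

observed = best_ungapped_score(query, target)
null = shuffled_null(query, target)
print(f"observed score = {observed}, largest null score = {max(null)}")
print(f"empirical p-value = {empirical_p_value(observed, null):.4f}")
```

With only 200 shuffles the smallest p-value this estimate can report is 1/201, which is one reason tools such as BLAST use an analytical distribution instead of shuffling.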

Review
What is the name of the distribution created by sequence similarity scores, and what does it look like?
The extreme value distribution (EVD), or Gumbel distribution. It looks similar to a normal distribution, but it has a larger tail on the right. It arises from sampling the extreme end of a distribution such as the normal, so it is “skewed” by this selective sampling. The skew can be either right or left.
In the limit of sufficiently large sequence lengths m and n, the statistics of HSP (high-scoring segment pair) scores are characterized by two parameters, K and λ. Most simply, the expected number of HSPs with score at least S is given by the formula
$$E = K m n e^{-\lambda S}$$
We call this the E-value for the score S. This formula makes eminently intuitive sense. Doubling the length of either sequence should double the number of HSPs attaining a given score. Also, for an HSP to attain the score 2x it must attain the score x twice in a row, so one expects E to decrease exponentially with score. The parameters K and λ can be thought of simply as natural scales for the search space size and the scoring system respectively.
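The two intuitions in this paragraph (linear in the lengths, exponential in the score) can be verified directly from the formula. In this sketch the values of K and λ are illustrative placeholders, not parameters taken from the slides; real values depend on the scoring scheme and are reported in BLAST output.

```python
import math

# Illustrative placeholder parameters (assumed for this example only).
K = 0.13
LAM = 0.32


def e_value(score, m, n, k=K, lam=LAM):
    """Expected number of HSPs with score at least `score`: K*m*n*exp(-lambda*S)."""
    return k * m * n * math.exp(-lam * score)


m, n, s = 250, 1_000_000, 60
base = e_value(s, m, n)

print(f"E(S={s})            = {base:.3e}")
print(f"query twice as long = {e_value(s, 2 * m, n):.3e}")
print(f"score twice as high = {e_value(2 * s, m, n):.3e}")
```

Doubling either length doubles E, while doubling the score squares the exponential factor, which is why high-scoring hits become vanishingly unlikely by chance.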


Statistics
BLAST scores (and also local alignment scores, e.g. from Smith-Waterman and BLAT) between random, unrelated sequences follow the Gumbel extreme value distribution (EVD):
$$\Pr(s > S) = 1 - \exp\left(-K m n e^{-\lambda S}\right)$$
This is the probability of randomly encountering a score greater than S.
S: alignment score
m, n: length of the query sequence and length of the database, respectively
K, λ: parameters depending on the scoring scheme and sequence composition
Bit score: $S' = \frac{\lambda S - \ln K}{\ln 2}$
We are interested in high scores S. Note that the bigger S gets, the smaller $e^{-\lambda S}$ gets; the smaller that gets, the closer $\exp(-Kmne^{-\lambda S})$ gets to 1, and the closer $\Pr(s > S)$ gets to zero. That is, big S yields small P. Notice that the function $1 - \exp(-Kmne^{-\lambda S})$ is not the distribution itself, but the area under its right tail. Recall that areas are associated with probabilities. In addition, if $Kmne^{-\lambda S}$ is close to zero (i.e. as S gets bigger), then $\exp(-Kmne^{-\lambda S})$ is well approximated by $1 - Kmne^{-\lambda S}$. In that case the probability above is well approximated by $Kmne^{-\lambda S}$. This value is called the expect. According to Setubal and Meidanis in their book "Introduction to Computational Molecular Biology", it is interpreted as the expected number of distinct segment pairs between two random sequences with score above S.
Nice website: http://www.bio.brandeis.edu/InterpGenes/Project/stat10.htm
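The approximation described in these notes, that the tail probability is close to the expect when the expect is small, can be seen numerically. This is a minimal sketch; `p_value_from_expect` is a name chosen for illustration.

```python
import math


def p_value_from_expect(expect):
    """Pr(s > S) = 1 - exp(-E), where E = K*m*n*exp(-lambda*S) is the expect."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-expect)


for expect in (10.0, 1.0, 0.1, 0.001, 1e-10):
    p = p_value_from_expect(expect)
    print(f"E = {expect:<8g} P = {p:.6g}")
```

For E below about 0.01 the two numbers are practically interchangeable; for large E the probability saturates at 1 while E keeps growing, because E counts expected hits rather than measuring a probability.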

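BLAST output revisited
[Figure: BLAST output annotated with the bit score S’, the raw score S, the E-value E, the lengths n and m, and the parameters λ and K. From: ExPASy BLAST.]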

Review
EVD for random BLAST scores. Upper tail behaviour:
$$\Pr(s > S) \sim K m n e^{-\lambda S}$$
This is the EXPECT value (E-value) that you see on the NCBI web site.

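How to Calculate E-values
Think of the databank as one very long random sequence of length G. Alignments with s > S occur randomly across the databank, with a Poisson distribution.
$\Pr(\text{highest-scoring alignment has } s > S) \sim K m G e^{-\lambda S}$
$\Pr(\text{no alignment with } s > S) \sim 1 - K m G e^{-\lambda S}$
The expected number $\mu$ of alignments with s > S follows from the Poisson property $\Pr(\text{no alignment}) = e^{-\mu}$:
$e^{-\mu} \sim 1 - K m G e^{-\lambda S}$, so for small $\mu$, $\mu \sim K m G e^{-\lambda S}$
$\ln \mu \sim \ln(K m G) - \lambda S$
Threshold: $S \sim [\ln(K m G) - \ln \mu] / \lambda$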
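The threshold relation can be turned around to ask what raw score is needed for a given expected number of chance hits. The sketch below assumes the small-expect form μ ≈ K·m·G·e^(-λS); the values of K and λ are illustrative placeholders and the function names are invented for this example.

```python
import math

# Illustrative placeholder parameters (assumed for this example only).
K = 0.13
LAM = 0.32


def expected_hits(score, m, g, k=K, lam=LAM):
    """Expected number of chance alignments with s > score."""
    return k * m * g * math.exp(-lam * score)


def score_threshold(target_hits, m, g, k=K, lam=LAM):
    """Raw score at which the expected number of chance alignments is target_hits."""
    return (math.log(k * m * g) - math.log(target_hits)) / lam


m = 300                # query length
for g in (1e6, 1e9):   # databank length
    s = score_threshold(0.01, m, g)
    print(f"G = {g:.0e}: need S > {s:.1f} for 0.01 expected chance hits")
```

Every doubling of the databank length raises the required score by the same fixed amount, ln(2)/λ, so the threshold grows only logarithmically with databank size.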

Summary
We want to be able to compare scores across sequences of different compositions or across different scoring schemes.
Score: S = sum(match) - sum(gap costs)
The bit score is better for such comparisons than the raw score, because it is normalized by K and λ. Notice, however, that the E-value is a function of the database size: as the database grows, the bit score for the same alignment stays the same, but its E-value rises.


Summary
We want to be able to compare scores across sequences of different compositions or across different scoring schemes.
Score: S = sum(match) - sum(gap costs)
Bit score: $S' = \frac{\lambda S - \ln K}{\ln 2}$
E-value of bit score: $E = m n 2^{-S'}$
The score and bit score grow linearly with the length of the alignment.
The E-value shrinks really fast (exponentially) as the bit score grows.
The E-value grows linearly with the product of target and query sizes.
Doubling the target set size and doubling the query length have the same effect on the E-value.
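The summary formulas can be tied together in a short sketch that converts a raw score to a bit score and then to an E-value, and checks that this agrees with the raw-score form K·m·n·e^(-λS). The parameter values and lengths are illustrative assumptions, and the function names are invented for this example.

```python
import math

# Illustrative placeholder parameters (assumed for this example only).
K = 0.13
LAM = 0.32


def bit_score(raw, k=K, lam=LAM):
    """Normalized score S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2^(-S'): depends only on the bit score and the search space."""
    return m * n * 2.0 ** (-bits)


raw, m, n = 80, 300, 5_000_000
bits = bit_score(raw)

print(f"bit score                = {bits:.1f}")
print(f"E from bit score         = {e_value_from_bits(bits, m, n):.3e}")
print(f"E from raw score         = {K * m * n * math.exp(-LAM * raw):.3e}")
print(f"E with database doubled  = {e_value_from_bits(bits, m, 2 * n):.3e}")
print(f"E with one more bit      = {e_value_from_bits(bits + 1, m, n):.3e}")
```

Because K and λ are folded into the bit score, two results reported with bit scores can be compared using only m and n, even if they came from different scoring schemes.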

Conclusion
You should now be able to compare BLAST results from different databases, converting values if they are reported differently (which happens frequently). You should now know why BLAST results might change from one day to the next, even on the same server. You should also understand the dependence of the E-value on query length. Statistical rankings are reported for (almost) every database search tool. When making comparisons between databases or between sequences, it is useful to know how the statistics are derived, in order to know whether the comparisons are meaningful.

THE END

Supplemental Section

What is the structure of my sequence?
Look through:
Patterns in sequences (searching for information within sequences), some common problems and their solutions: http://lepo.it.da.ut.ee./~mremm/kurs/pattern.htm
What is the structure of my sequence? http://speedy.embl-heidelberg.de/gtsp/flowchart2.html (clickable!)