Statistical significance of alignment scores Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering.

Slides:



Advertisements
Similar presentations
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Advertisements

Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
PTP 560 Research Methods Week 9 Thomas Ruediger, PT.
BLAST Sequence alignment, E-value & Extreme value distribution.
Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.
Modeling and Simulation Monte carlo simulation 1 Arwa Ibrahim Ahmed Princess Nora University.
BLAST, PSI-BLAST and position- specific scoring matrices Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and.
OUTLINE Scoring Matrices Probability of matching runs Quality of a database match.
STAT 135 LAB 14 TA: Dongmei Li. Hypothesis Testing Are the results of experimental data due to just random chance? Significance tests try to discover.
Random Walks and BLAST Marek Kimmel (Statistics, Rice)
Measuring the degree of similarity: PAM and blosum Matrix
Bioinformatics Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington
Searching Sequence Databases
Lecture outline Database searches
Heuristic alignment algorithms and cost matrices
1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.
Evaluating Hypotheses
Lecture 9: One Way ANOVA Between Subjects
Sequence Similarity Search 2005/10/ Autumn / YM / Bioinformatics.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Fa05CSE 182 CSE182-L5: Scoring matrices Dictionary Matching.
Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning 1 Evaluating Hypotheses.
Protein Sequence Comparison Patrice Koehl
BCOR 1020 Business Statistics
Sequence alignment, E-value & Extreme value distribution
On Comparing Classifiers: Pitfalls to Avoid and Recommended Approach Published by Steven L. Salzberg Presented by Prakash Tilwani MACS 598 April 25 th.
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Whole genome alignments Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas
Motif search and discovery Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Sequence comparison: Local alignment Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Traceback and local alignment Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington.
Multiple testing correction
Chapter 26: Comparing Counts AP Statistics. Comparing Counts In this chapter, we will be performing hypothesis tests on categorical data In previous chapters,
Protein Sequence Alignment and Database Searching.
Motif search Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
1 Lecture outline Database searches –BLAST –FASTA Statistical Significance of Sequence Comparison Results –Probability of matching runs –Karin-Altschul.
BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.
Comp. Genomics Recitation 3 The statistics of database searching.
Confidence intervals and hypothesis testing Petter Mostad
Testing Hypothesis That Data Fit a Given Probability Distribution Problem: We have a sample of size n. Determine if the data fits a probability distribution.
Chapter 13 - ANOVA. ANOVA Be able to explain in general terms and using an example what a one-way ANOVA is (370). Know the purpose of the one-way ANOVA.
M ONTE C ARLO SIMULATION Modeling and Simulation CS
Using BLAST for Genomic Sequence Annotation Jeremy Buhler For HHMI / BIO4342 Tutorial Workshop.
Significance in protein analysis
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Sequence Alignment.
Step 3: Tools Database Searching
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
Comp. Genomics Recitation 3 (week 4) 26/3/2009 Multiple Hypothesis Testing+Suffix Trees Based in part on slides by William Stafford Noble.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
BLAST: Database Search Heuristic Algorithm Some slides courtesy of Dr. Pevsner and Dr. Dirk Husmeier.
Tests of Significance We use test to determine whether a “prediction” is “true” or “false”. More precisely, a test of significance gets at the question.
Statistical principles: the normal distribution and methods of testing Or, “Explaining the arrangement of things”
Slide Slide 1 Hypothesis Testing 8-1 Overview 8-2 Basics of Hypothesis Testing Chapter 8.
Your friend has a hobby of generating random bit strings, and finding patterns in them. One day she come to you, excited and says: I found the strangest.
Comp. Genomics Recitation 2 (week 3) 19/3/09. Outline Finding repeats Branch & Bound for MSA Multiple hypotheses testing.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Pairwise sequence comparison
BLAST Anders Gorm Pedersen & Rasmus Wernersson.
Sequence comparison: Significance of similarity scores
Motif p-values GENOME 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble.
Sequence comparison: Multiple testing correction
Hypothesis Testing.
Sequence comparison: Multiple testing correction
Sequence comparison: Significance of similarity scores
False discovery rate estimation
Sequence alignment, E-value & Extreme value distribution
Presentation transcript:

Statistical significance of alignment scores Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and Engineering University of Washington

Outline Responses from last class Revision Hypothesis testing and the p-value Extreme value distribution Python

One-minute responses Be patient with our programming skills. Explain the Python code before we start solving problems. We should do an evening tutorial with the tutors for Python dictionaries and sys.argv. I am not used to the sys library and working from a.txt file. Otherwise, Python programming is fine. What are the things being aligned? Are they proteins?

Are these proteins homologs? SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY | | | | | | | | | SEQ 2: QFFPLMPPAPYWILATDYENLPLVYSCTTFFWLF SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY | | | ||||||||| | | | SEQ 2: QFFPLMPPAPYWILDATYKNYALVYSCTTFFWLF SEQ 1: RVVNLVPS--FWVLDATYKNYAINYNCDVTYKLY ||| | || | ||||||||| | ||||||| SEQ 2: RVVPLMPSAPYWILDATYKNYALVYSCDVTYKLF YES (score = 153) MAYBE (score = 104) NO (score = 69)

Significance of scores Sequence alignment algorithm HPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKT LENENQGKCTIAEYKYDGKKASVYNSFVSNGVKE 45 Low score = unrelated High score = homologs How high is high enough?

An empirical histogram Roll a 20-sided die 1000 times. Keep a tally of the number of times each value is observed.

A table of results

A table of results A histogram plots the number of times or the frequency with which each value of a given variable (e.g., the die value) is observed.

After 21,000 rolls As the number of observations increases, the histogram becomes smoother.

Frequency histogram = distribution Divide through by the total number of observations to get frequencies. The resulting histogram is called a distribution. The sum of the bars in the distribution is 1. This particular distribution is “ uniform. ”

Distribution from two dice There is only one way to get a score of 2. How many ways can you get a 7 from two six-sided dice? This distribution is approximately normal (Gaussian).

Normal distribution This is probably the most common distribution in science. The area under a continuous distribution curve is 1.

The null hypothesis We are interested in characterizing the distribution of scores from sequence comparison algorithms. We would like to measure how surprising a given score is, assuming that the two sequences are not related. The assumption is called the null hypothesis. The purpose of most statistical tests is to determine whether the observed results provide a reason to reject the hypothesis that the results are merely a product of chance factors.

Sequence similarity score distribution Search a randomly generated database of DNA sequences using a randomly generated DNA query. What will be the form of the resulting distribution of pairwise sequence comparison scores? Sequence comparison score Frequency ?

Empirical score distribution The picture shows a distribution of scores from a real database search using BLAST. This distribution contains scores from non-homologous and homologous pairs. High scores from homology. Score Frequency

Empirical null score distribution This distribution is similar to the previous one, but generated using a randomized sequence database.

Computing a p-value The probability of observing a score >X is the area under the curve to the right of X. This probability is called a p-value. p-value = Pr(data|null) Out of 1685 scores, 28 receive a score of 20 or better. Thus, the p-value associated with a score of 20 is approximately 28/1685 =

Problems with empirical distributions We are interested in very small probabilities. These are computed from the tail of the distribution. Estimating a distribution with accurate tails is computationally very expensive.

A solution Solution: Characterize the form of the distribution mathematically. Fit the parameters of the distribution empirically, or compute them analytically. Use the resulting distribution to compute accurate p-values.

Extreme value distribution This distribution is characterized by a larger tail on the right.

Computing a p-value The probability of observing a score >4 is the area under the curve to the right of 4. This probability is called a p-value. p-value = Pr(data|null)

Extreme value distribution Compute this value for x=4.

Computing a p-value Calculator keys: 4, +/-, inv, ln, +/-, inv, ln, +/-, +, 1, = Solution:

Scaling the EVD An extreme value distribution derived from, e.g., the Smith- Waterman algorithm will have a characteristic mode μ and scale parameter λ. These parameters depend upon the size of the query, the size of the target database, the substitution matrix and the gap penalties.

An example You run the Smith-Waterman algorithm and get a score of 45. You then run Smith-Waterman with the same query but using a shuffled version of the database, and you fit an extreme value distribution to the resulting empirical distribution. The parameters of the EVD are μ = 25 and λ = What is the p-value associated with 45?

An example You run BLAST and get a score of 45. You then run BLAST on a shuffled version of the database, and fit an extreme value distribution to the resulting empirical distribution. The parameters of the EVD are μ = 25 and λ = What is the p- value associated with 45?

What p-value is significant? The most common thresholds are 0.01 and A threshold of 0.05 means you are 95% sure that the result is significant. Is 95% enough? It depends upon the cost associated with making a mistake. Examples of costs: – Doing expensive wet lab validation. – Making clinical treatment decisions. – Misleading the scientific community. Most sequence analysis uses more stringent thresholds because the p-values are not very accurate.

Multiple testing Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations. Assume that all of the observations are explainable by the null hypothesis. What is the chance that at least one of the observations will receive a p-value less than 0.05?

Multiple testing Say that you perform a statistical test with a 0.05 threshold, but you repeat the test on twenty different observations. Assuming that all of the observations are explainable by the null hypothesis, what is the chance that at least one of the observations will receive a p-value less than 0.05? Pr(making a mistake) = 0.05 Pr(not making a mistake) = 0.95 Pr(not making any mistake) = = Pr(making at least one mistake) = = There is a 64.2% chance of making at least one mistake.

Bonferroni correction Assume that individual tests are independent. Divide the desired p-value threshold by the number of tests performed. For the previous example, 0.05 / 20 = Pr(making a mistake) = Pr(not making a mistake) = Pr(not making any mistake) = = Pr(making at least one mistake) = =

Database searching Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p-value threshold should you use?

Database searching Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p-value threshold should you use? Say that you want to use a conservative p-value of Recall that you would observe such a p-value by chance approximately every 1000 times in a random database. A Bonferroni correction would suggest using a p- value threshold of / 1,000,000 = =

E-values A p-value is the probability of making a mistake. The E-value is the expected number of times that the given score would appear in a random database of the given size. One simple way to compute the E-value is to multiply the p-value times the size of the database. Thus, for a p-value of and a database of 1,000,000 sequences, the corresponding E-value is × 1,000,000 = 1,000. BLAST actually calculates E-values in a more complex way.

Summary A distribution plots the frequency of a given type of observation. The area under the distribution is 1. Most statistical tests compare observed data to the expected result according to the null hypothesis. Sequence similarity scores follow an extreme value distribution, which is characterized by a larger tail. The p-value associated with a score is the area under the curve to the right of that score. Selecting a significance threshold requires evaluating the cost of making a mistake. Bonferroni correction: Divide the desired p-value threshold by the number of statistical tests performed. The E-value is the expected number of times that the given score would appear in a random database of the given size.

Python resources nglish2e/ nglish2e/

Why write programs clearly? Many people may want to read and understand your program – Collaborators and colleagues – Future lab members – People who read your published paper – Your supervisor – Your future self Well written code can more easily be re-used later.

White space The layout of the program on the page should reflect the logical structure of the program. Insert blank lines to indicate related lines of code.

Comments Above all, use comments. Don ’ t use too many comments. Don ’ t use too few comments. Adopt a consistent commenting style. Be sure to comment particularly tricky lines of code.

Naming Naming is more important than you think. Avoid single letter variable names. Use abbreviations sparingly. Think about your variables names Don ’ t be afraid to change confusing names.

#!/usr/bin/python import sys USAGE = """USAGE: revcomp.py In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'. The reverse complement of a DNA string s is the string formed by reversing the symbols of s, then taking the complement of each symbol. Given: A DNA string s of length at most 1000 bp. Return: The reverse complement of s. """ # Dictionary storing the complementarity of DNA nucleotides. revComp = { "A":"T", "T":"A", "G":"C", "C":"G" } # Parse the command line. if (len(sys.argv) != 2): sys.stderr.write(USAGE) sys.exit(1) dna = sys.argv[1] # Do the reverse complement. for index in range(len(dna) - 1, -1, -1): char = dna[index] if char in revComp: sys.stdout.write(revComp[char]) else: sys.stderr.write("Cannot complement %s.\n" % char) sys.exit(1) sys.stdout.write("\n”)

#!/usr/bin/python import sys USAGE = """USAGE: revcomp.py In DNA strings, symbols 'A' and 'T' are complements of each other, as are 'C' and 'G'. The reverse complement of a DNA string s is the string formed by reversing the symbols of s, then taking the complement of each symbol. Given: A DNA string s of length at most 1000 bp. Return: The reverse complement of s. """ # Dictionary storing the complementarity of DNA nucleotides. revComp = { "A":"T", "T":"A", "G":"C", "C":"G" } # Parse the command line. if (len(sys.argv) != 2): sys.stderr.write(USAGE) sys.exit(1) dna = sys.argv[1] # Do the reverse complement. for index in range(len(dna) - 1, -1, -1): char = dna[index] if char in revComp: sys.stdout.write(revComp[char]) else: sys.stderr.write("Cannot complement %s.\n" % char) sys.exit(1) sys.stdout.write("\n”)

Problem #1 Given: – a bin width – a file containing one number per line, and Return: the frequency associated with each bin File: Bin width: 1 Output: