False discovery rate estimation


False discovery rate estimation Genome 559: Introduction to Statistical and Computational Genomics Prof. William Stafford Noble

One-minute responses
Overall I feel I'm getting the hang of it.
Great sequence of problem solving. I like the questions that build on previous code.
Today's class was helpful in terms of learning how to break down problems.
Relevant programming makes me happy.
Good pace.
I liked the integration of the programming with concepts.

One-minute responses: questions and requests
A bit fast. (x3)
Not enough time for any of the practice problems today.
Today's problem was very challenging. I tend to grasp the math parts of lecture after class.
Are there any resources you could recommend for stats & comp bio? I added a pointer for today's content. You can also look at the "Points of Significance" column in Nature Methods (http://blogs.nature.com/methagora/2013/08/giving_statistics_the_attention_it_deserves.html).
Please go over the math stuff again.
Still having a tricky time looping over matrices.
Screenshots of the text editor would be more helpful than using text in PowerPoint for sample problems.
Can we spend more time on the motif practice problem next class?
What will the exam look like? Similar to the homeworks, except the programming will be on paper and the questions will be short. Open book.
For the EVD, do the distributions always look the same given that the DP matrix can be different? No, the distribution for a given motif will not always precisely follow an EVD, and it does differ from one motif to the next.
It would be helpful if we could see the DP for motifs written out in the solution.
Can we go over how to fill the histogram in class?

Can you discuss the conditionals we add to for loops, like "and" and "or", and whether the order of the conditionals matters?
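As a quick follow-up to this question, here is a short sketch (with made-up values) of combining conditions with "and"/"or" inside a loop, and of why the order of the tests can matter under Python's short-circuit evaluation:

```python
# Combining conditions inside a loop with "and".
values = [12, -3, 0, 7, 100, -8]

# Keep values that are positive AND below 50.
kept = []
for v in values:
    if v > 0 and v < 50:
        kept.append(v)
print(kept)  # [12, 7]

# Order matters because Python short-circuits: the second test
# runs only if the first one passes.  Here the length check must
# come first, or words[0] would crash on an empty list.
words = []
if len(words) > 0 and words[0] == "ATG":
    print("starts with ATG")
else:
    print("empty or different start")  # this branch runs
```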

Converting scores to p-values
[Figure: dynamic programming matrix with one row per nucleotide (A, C, G, T) and one column per score value; the bottom row accumulates counts of sequences.]
In the end, the bottom row contains the scores for all possible sequences of length N. Use these scores to compute a p-value.

Cumulative distribution of counts in the final row of the DP matrix
[Figure: histogram with motif score on the x-axis and number of sequences on the y-axis.]
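A small sketch of the p-value computation from such a distribution: the p-value of an observed score is the fraction of all length-N sequences whose score is at least as large. The counts below are made up for illustration; the real counts would come from the final row of the DP matrix.

```python
# Hypothetical counts: score -> number of length-N sequences
# achieving that score (as in the final row of the DP matrix).
counts = {0: 5, 1: 20, 2: 40, 3: 25, 4: 10}
total = sum(counts.values())  # 100 sequences in all

def score_pvalue(observed, counts, total):
    """P-value = fraction of sequences scoring >= the observed score."""
    at_least = sum(n for score, n in counts.items() if score >= observed)
    return at_least / total

print(score_pvalue(3, counts, total))  # 0.35  (25 + 10 out of 100)
```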

False discovery rate: Motivation Scenario #1: You have used PSI-BLAST to identify a new protein homology, and you plan to publish a paper describing this result. Scenario #2: You have used PSI-BLAST to discover many potential homologs of a single query protein, and you plan to carry out a wet lab experiment to validate your findings. The experiment can be done in parallel on 96 proteins.

Types of errors False positive: the algorithm indicates that the sequences are homologs, but actually they are not. False negative: the sequences are homologs, but the algorithm indicates that they are not. Both types of errors are defined relative to some confidence threshold. Typically, researchers are more concerned about false positives.

False discovery rate
The false discovery rate (FDR) is the expected percentage of target sequences above the threshold that are false positives. In the context of sequence database searching, the false discovery rate is the percentage of sequences above the threshold that are not homologous to the query.
[Figure: search results above and below the threshold, containing 5 FP, 13 TP, 33 TN, and 5 FN, with homologs and non-homologs of the query sequence marked.]
FDR* = FP / (FP + TP) = 5/18 = 27.8%
*Technically, this is the false discovery proportion (FDP), and the FDR is the expectation of the FDP.
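The arithmetic on this slide can be written out directly; a minimal sketch using the slide's counts:

```python
def false_discovery_proportion(fp, tp):
    """FDP = FP / (FP + TP): the fraction of predictions above
    the threshold that are wrong."""
    return fp / (fp + tp)

# The slide's example: 5 false positives and 13 true positives
# above the threshold.
fdp = false_discovery_proportion(fp=5, tp=13)
print(round(100 * fdp, 1))  # 27.8 (percent)
```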

Bonferroni vs. FDR Bonferroni controls the family-wise error rate; i.e., the probability of at least one false positive among the sequences that score better than the threshold. The Benjamini-Hochberg procedure controls the false discovery rate; i.e., the expected percentage of false positives among the target sequences that score better than the threshold.

Controlling the FDR
Order the unadjusted p-values p(1) ≤ p(2) ≤ … ≤ p(m). To control the FDR at level α, find the largest index j* such that p(j*) ≤ (j*α)/m, then reject the null hypothesis for j = 1, …, j*. (Benjamini & Hochberg, 1995)

Benjamini-Hochberg example (α = 0.05, m = 1000)

Rank   (jα)/m    p-value
1      0.00005   0.0000008
2      0.00010   0.0000012
3      0.00015   0.0000013
4      0.00020   0.0000056
5      0.00025   0.0000078
6      0.00030   0.0000235
7      0.00035   0.0000945
8      0.00040   0.0002450
9      0.00045   0.0004700
10     0.00050   0.0008900
…
1000   0.05000   1.0000000

Choose the largest rank j for which the p-value is less than (jα)/m; here that is rank 8. Approximately 5% of the examples above the line are expected to be false positives.
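The procedure on these slides can be sketched in a few lines of Python (the function name is mine; the p-values below are the ten smallest from the table, with m = 1000 and α = 0.05 as in the example):

```python
def bh_significant(pvalues, alpha, m=None):
    """Benjamini-Hochberg: return the number of p-values declared
    significant at FDR level alpha.  m defaults to len(pvalues);
    pass a larger m if only the smallest p-values from a bigger
    family of tests are supplied."""
    if m is None:
        m = len(pvalues)
    ranked = sorted(pvalues)
    # Find the largest rank j (1-based) with p_(j) <= j * alpha / m.
    j_star = 0
    for j, p in enumerate(ranked, start=1):
        if p <= j * alpha / m:
            j_star = j
    return j_star

# The ten smallest p-values from the table (m = 1000, alpha = 0.05):
pvals = [0.0000008, 0.0000012, 0.0000013, 0.0000056, 0.0000078,
         0.0000235, 0.0000945, 0.0002450, 0.0004700, 0.0008900]
print(bh_significant(pvals, alpha=0.05, m=1000))  # 8
```

Ranks 1 through 8 fall below the (jα)/m line, so eight tests are declared significant, matching the table.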

Benjamini-Hochberg test
[Figure: 100 uniformly distributed p-values (p-values from non-significant results) plotted as blue dots, with the significance threshold for FDR = 0.2 shown as a red line.]
An idealized experiment in which 100 cases, none of which are significant, are tested with the Benjamini-Hochberg procedure, controlling the false discovery rate at 20%. The blue dots are the ranked p-values from the 100 cases, and the red line is the significance threshold established by the Benjamini-Hochberg procedure. None of the cases can be declared significant.
www.complextrait.org/Powerpoint/ctc2002/KenAffyQTL2002.ppt

Benjamini-Hochberg test
[Figure: 10 low p-values (significant results) mixed with 90 p-values from non-significant results, plotted as blue dots, with the significance threshold for FDR = 0.2 shown as a red line. Eleven cases fall below the threshold and are declared significant.]
An idealized experiment in which 10 cases with significantly low p-values are mixed with 90 cases that are not significant. All cases can be declared significant up to the highest-ranked case that falls below the significance threshold.

Summary
Selecting a significance threshold requires evaluating the cost of making a mistake.
Bonferroni correction divides the desired p-value threshold by the number of statistical tests performed.
The E-value is the expected number of times that the given score would appear in a random database of the given size.
The false discovery rate is the percentage of false positives among the target sequences that score better than the threshold.
Use Bonferroni correction when you want to avoid making a single mistake; use Benjamini-Hochberg when you can tolerate a certain percentage of mistakes.

Sample problem #1
Given: a confidence threshold and a list of p-values.
Return: the set of p-values that passes at the specified false discovery rate.
./compute-fdr.py 0.1 pvalues.txt
Recall the Benjamini-Hochberg rule: choose the largest rank j for which the p-value is less than (jα)/m.
Hint: <list>.sort() will sort your list.

Sample problem #2 Modify your program so that it will work with an arbitrarily large collection of p-values. You may assume that the p-values are given in sorted order. Read the file twice: once to find out how many p-values there are, and a second time to do the actual calculation.
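One possible shape for the two-pass approach (a sketch, not the official solution; the file format, with one p-value per line in sorted ascending order, is an assumption taken from the problem statement):

```python
def fdr_threshold_streaming(path, alpha):
    """Two-pass Benjamini-Hochberg over a file of sorted p-values,
    one per line, without loading them all into memory.
    Returns the number of p-values declared significant."""
    # Pass 1: count the p-values to get m.
    with open(path) as handle:
        m = sum(1 for line in handle if line.strip())
    # Pass 2: find the largest rank j with p_(j) <= j * alpha / m.
    j_star = 0
    j = 0
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue
            j += 1
            if float(line) <= j * alpha / m:
                j_star = j
    return j_star
```

Counting in the first pass fixes m before any thresholds are computed, so the second pass needs only one float in memory at a time.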