Estimating the False Discovery Rate in Genome-wide Studies BMI/CS 576 Colin Dewey Fall 2008.

Slides:



Advertisements
Similar presentations
Estimating the False Discovery Rate in Multi-class Gene Expression Experiments using a Bayesian Mixture Model Alex Lewin 1, Philippe Broët 2 and Sylvia.
Advertisements

Multiple testing and false discovery rate in feature selection
Differential Expression Analysis Introduction to Systems Biology Course Chris Plaisier Institute for Systems Biology.
Statistical Modeling and Data Analysis Given a data set, first question a statistician ask is, “What is the statistical model to this data?” We then characterize.
1 Statistics Achim Tresch UoC / MPIPZ Cologne treschgroup.de/OmicsModule1415.html
From the homework: Distribution of DNA fragments generated by Micrococcal nuclease digestion mean(nucs) = bp median(nucs) = 110 bp sd(nucs+ = 17.3.
Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005.
Chapter 9 Hypothesis Testing Testing Hypothesis about µ, when the s.t of population is known.
Differentially expressed genes
Basic Elements of Testing Hypothesis Dr. M. H. Rahbar Professor of Biostatistics Department of Epidemiology Director, Data Coordinating Center College.
Hypothesis Testing: Type II Error and Power.
Probability & Statistics for Engineers & Scientists, by Walpole, Myers, Myers & Ye ~ Chapter 10 Notes Class notes for ISE 201 San Jose State University.
Hypothesis Tests for Means The context “Statistical significance” Hypothesis tests and confidence intervals The steps Hypothesis Test statistic Distribution.
Significance Tests P-values and Q-values. Outline Statistical significance in multiple testing Statistical significance in multiple testing Empirical.
Chapter 9 Hypothesis Testing.
Q-Vals (and False Discovery Rates) Made Easy Dennis Shasha Based on the paper "Statistical significance for genomewide studies" by John Storey and Robert.
Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 5 – Testing for equivalence or non-inferiority. Power.
Different Expression Multiple Hypothesis Testing STAT115 Spring 2012.
Statistical hypothesis testing – Inferential statistics I.
False Discovery Rate (FDR) = proportion of false positive results out of all positive results (positive result = statistically significant result) Ladislav.
Multiple testing in high- throughput biology Petter Mostad.
Confidence Intervals and Hypothesis Testing - II
Lucio Baggio - Lucio Baggio - False discovery rate: setting the probability of false claim of detection 1 False discovery rate: setting the probability.
Essential Statistics in Biology: Getting the Numbers Right
Lecture 7 Introduction to Hypothesis Testing. Lecture Goals After completing this lecture, you should be able to: Formulate null and alternative hypotheses.
Hypothesis testing Chapter 9. Introduction to Statistical Tests.
Statistical significance for genomewide studies John D. Storey and Robert Tibshirani Saurabh Paliwal Topics in Bioinformatics class presentation 11/14/06.
1 If we live with a deep sense of gratitude, our life will be greatly embellished.
Agresti/Franklin Statistics, 1 of 122 Chapter 8 Statistical inference: Significance Tests About Hypotheses Learn …. To use an inferential method called.
Multiple Testing in Microarray Data Analysis Mi-Ok Kim.
Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.
Significance Test A claim is made. Is the claim true? Is the claim false?
Back to basics – Probability, Conditional Probability and Independence Probability of an outcome in an experiment is the proportion of times that.
Hypothesis Testing State the hypotheses. Formulate an analysis plan. Analyze sample data. Interpret the results.
Multiple Testing Matthew Kowgier. Multiple Testing In statistics, the multiple comparisons/testing problem occurs when one considers a set of statistical.
Statistical Testing with Genes Saurabh Sinha CS 466.
Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 13: One-way ANOVA Marshall University Genomics Core.
Suppose we have T genes which we measured under two experimental conditions (Ctl and Nic) in n replicated experiments t i * and p i are the t-statistic.
Hypothesis Testing. “Not Guilty” In criminal proceedings in U.S. courts the defendant is presumed innocent until proven guilty and the prosecutor must.
Comp. Genomics Recitation 10 4/7/09 Differential expression detection.
Spatial Smoothing and Multiple Comparisons Correction for Dummies Alexa Morcom, Matthew Brett Acknowledgements.
The Broad Institute of MIT and Harvard Differential Analysis.
AP Statistics Chapter 11 Notes. Significance Test & Hypothesis Significance test: a formal procedure for comparing observed data with a hypothesis whose.
Statistical Inference Statistical inference is concerned with the use of sample data to make inferences about unknown population parameters. For example,
Don’t be wordy; mind your style! “The results that were taken from a sample of people go to show some results that tell us some facts about how women and.
Multiple Testing in Impact Evaluations: Discussant Comments IES Research Conference June 11, 2008 Larry L. Orr.
Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 6 –Multiple hypothesis testing Marshall University Genomics.
Hypothesis Testing Steps for the Rejection Region Method State H 1 and State H 0 State the Test Statistic and its sampling distribution (normal or t) Determine.
A Quantitative Overview to Gene Expression Profiling in Animal Genetics Armidale Animal Breeding Summer Course, UNE, Feb Analysis of (cDNA) Microarray.
1 A Discussion of False Discovery Rate and the Identification of Differentially Expressed Gene Categories in Microarray Studies Ames, Iowa August 8, 2007.
Copyright © 2010 Pearson Education, Inc. Publishing as Prentice Hall Statistics for Business and Economics 7 th Edition Chapter 9 Hypothesis Testing: Single.
9.3 Hypothesis Tests for Population Proportions
Multiple Testing Methods for the Analysis of Microarray Data
Differential Gene Expression
Statistical Testing with Genes
Q-Vals (and False Discovery Rates) Made Easy
BMI/CS 776 Spring 2018 Anthony Gitter
Two-sided p-values (1.4) and Theory-based approaches (1.5)
Multiple Testing Methods for the Analysis of Gene Expression Data
Q-Vals (and False Discovery Rates) Made Easy
St. Edward’s University
Slides by JOHN LOUCKS St. Edward’s University.
CHAPTER 12 Inference for Proportions
CHAPTER 12 Inference for Proportions
False discovery rate estimation
Testing Hypotheses about a Population Proportion
Statistical Testing with Genes
Statistical chart of significantly differentially expressed genes
Testing Hypotheses about a Population Proportion
Presentation transcript:

Estimating the False Discovery Rate in Genome-wide Studies BMI/CS Colin Dewey Fall 2008

Expression in BRCA1 and BRCA2 Mutation-Positive Tumors 7 patients with BRCA1 mutation-positive tumors vs. 7 patients with BRCA2 mutation-positive tumors 5631 genes assayed Hedenfalk et al., New England Journal of Medicine 344: , 2001.

Expression in BRCA1 and BRCA2 Mutation-Positive Tumors Key question: which genes are differentially expressed in these two sets of tumors? Methodology: for each gene, use a statistical test to assess the hypothesis that the expression levels differ in the two sets

Hypothesis Testing consider two competing hypotheses for a given gene: –null hypothesis: the expression levels in the first set come from the same distribution as the levels in the second set –alternative hypothesis: they come from different distributions we first calculate a test statistic for these measurements, and then determine its p-value p-value: the probability of observing a test statistic that is as extreme or more extreme than the one we have, assuming the null hypothesis is true

Calculating a p-value 1.calculate test statistic (e.g. T statistic) where 2.see how much mass in null distribution with value this extreme or more if test statistic is here, p = 0.034

The Multiple Testing Problem if we’re testing one gene, the p-value is a useful measure of whether the variation of the gene’s expression across two groups is significant suppose that most genes are not differentially expressed (this is the typical situation) if we’re testing 5000 genes that don’t have a significant change in their expression (i.e. the null hypothesis holds), we’d still expect about 250 of them to have p-values ≤ 0.05 Can think of p-value as the false positive rate over null genes

Family-wise error rate One way to deal with the multiple testing problem is to control the probability of rejecting at least one null hypothesis when all genes are null This is the family-wise error rate (FWER) Simplest approach (Bonferroni correction) –Set threshold for calling a p-value significant to α/g, where g is the number of genes –Then FWER is ≤ α

Loss of power with FWER FWER, and Bonferroni in particular, reduce our power to reject null hypotheses –As g gets large, p-value threshold gets very small For expression analysis, FWER and false positive rate are not really the primary concern –We can live with false positives –We just don’t want too many of them relative to the total number of genes called significant

The False Discovery Rate genep-valuerank C F G J I B A D H0.519 E suppose we pick a threshold, and call genes above this threshold “significant” the false discovery rate is the expected fraction of these that are mistakenly called significant (i.e. are truly null) [Benjamini & Hochberg ‘95; Storey & Tibshirani ‘02]

The False Discovery Rate genep-valuerank C F G J I B A D H0.519 E t # genes

The False Discovery Rate to compute the FDR for a threshold t, we need to estimate E[ F(t) ]and E[ S(t) ] estimate by the observed S(t) so how can we estimate E[ F(t) ]?

What Fraction of the Genes are Truly Null? consider the histogram of p-values from Hedenfalk et al. –includes both null and alternative genes –but we expect null p-values to be uniformly distributed estimated proportion of null p-values Figure from Storey & Tibshirani PNAS 100(16), 2002.

The False Discovery Rate q-value genep-valuerank C F G J I B A D H0.519 E t estimated proportion of null p-values # genes

q-values vs. p-values for Hedenfalk et al. Figure from Storey & Tibshirani PNAS 100(16), 2002.

FDR Summary in many high-throughput experiments, we want to know what is different across a two sets of conditions/individuals (e.g. which genes are differentially expressed) because of the multiple testing problem, p-values may not be so informative in such cases the FDR, however, tells us which fraction of significant features are likely to be null q-values based on the FDR can be readily computed from p-values (see Storey’s package QVALUE)