ODP and SVA European Institute of Statistical Genetics Liege, Belgium September 4, 2007 Greg Gibson.



What's the matter with t-tests?
1. SAM and ANOVA assume that all tests are independent, but they are not.
2. Some within-sample variances are underestimated, which artificially inflates test statistics; others are overestimated, which reduces power.
3. They fail to maximize the expected number of true positives (ETP) for a given FDR.

Optimal Discovery Procedure
Storey, Dai and Leek (2007) Biostatistics 8
The ODP is defined as the testing procedure that maximizes the ETP for each fixed EFP level. A consequence of this optimality is that the rate of "missed discoveries" is minimized for each FDR level.
Neyman–Pearson lemma: given a single set of observed data, the optimal single-testing procedure is based on the likelihood ratio of the alternative density to the null density.
The ODP statistic is similar, but evaluates the data for a single feature at all true probability density functions.
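As a sketch of the two statistics named above, consistent with the description in the Fig. 1 caption: the notation is an assumption made here, with f denoting null densities, g alternative densities, m tests in total, and tests 1 to m_0 taken to be the true nulls.

```latex
% Neyman-Pearson statistic for a single test with observed data x
S_{\mathrm{NP}}(x) = \frac{g(x)}{f(x)}

% ODP statistic: the same data x evaluated at every true density
S_{\mathrm{ODP}}(x) = \frac{g_{m_0+1}(x) + g_{m_0+2}(x) + \cdots + g_m(x)}
                           {f_1(x) + f_2(x) + \cdots + f_{m_0}(x)}
```

When all true nulls share one null density f, the denominator is proportional to f(x), which gives the "sum of alternative densities over the null density" form described for panel (b) of Fig. 1.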

ODP Principle Fig. 1. Plots comparing the NP testing approach to the ODP testing approach through a simple example. (a) NP approach. The null (gray) and alternative (black) probability density functions of a single test. For observed data x and y, the statistics are calculated by taking the ratio of the alternative to the null densities at each respective point. In this NP approach, the test with data y is more significant than the test with data x. (b) ODP approach. The common null density (gray) for true null tests and the alternative densities (black) for several true alternative tests. For observed data x and y, the statistics are calculated by taking the ratio of the sum of alternative densities to the null density evaluated at each respective point. In this ODP approach, the test with data x is now more significant than the test with data y because multiple alternative densities have similar positive means even though each one is smaller than the single alternative density with negative mean.

ODP Performance: BRCA data A comparison of the ODP approach to five leading methods for identifying differentially expressed genes (described in the text). The number of genes found to be significant by each method over a range of estimated q-value cutoffs is shown. The methods involved in the comparison are the proposed ODP, SAM, the traditional t-test/ F-test, a shrunken t-test/F-test, a nonparametric empirical Bayes "local FDR" method, and a model-based empirical Bayes method. A color version of the figure is given in the supplementary material available at Biostatistics online, Figure 9. (a) Results for identifying differential expression between the BRCA1 and BRCA2 groups in the Hedenfalk and others data. (b) Results for identifying differential expression between the BRCA1, BRCA2, and Sporadic groups in the Hedenfalk and others data. The model-based empirical Bayes method has not been detailed for a three-sample analysis, so it is omitted in this panel.

ODP Table 1
Table 1. Improvements of the ODP approach over existing thresholding methods. Shown are the minimum, median, and maximum percentage increases in the number of genes called significant by the proposed ODP approach relative to the existing approaches among FDR levels 2%, 3%, ..., 10%. The exact same FDR methodology (Storey, 2002; Storey and Tibshirani, 2003) was applied to each gene-ranking method in order to make the comparisons fair. The model-based Bayesian method (Lonnstedt and Speed, 2002) is not defined for a three-sample analysis, so that case is omitted.
Columns: thresholding method; % increase by ODP, 2-sample (minimum, median, maximum); % increase by ODP, 3-sample (minimum, median, maximum).
Thresholding methods compared:
SAM (Tusher et al., 2001)
t/F-test (Dudoit et al., 2002; Kerr et al., 2000)
Shrunken t/F-test (Cui and others, 2005)
Bayesian local FDR (Efron and others, 2001)
Posterior probability (Lonnstedt and Speed, 2002); 3-sample columns not defined
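The FDR methodology cited in the Table 1 caption turns a ranked list of p-values into q-values. The following is a minimal sketch of that idea, not the published implementation: the function name `q_values` and the fixed tuning value `lam=0.5` are assumptions made here, whereas the published method chooses the proportion of true nulls more carefully.

```python
def q_values(p_values, lam=0.5):
    """Storey-style q-values from a list of p-values (simplified sketch).

    pi0, the proportion of true nulls, is estimated from the p-values
    above `lam`; a fixed lam is an assumption of this sketch.
    """
    m = len(p_values)
    pi0 = min(1.0, sum(p > lam for p in p_values) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the smallest
    # estimated FDR among all thresholds that would still call it significant.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[i] / rank)
        q[i] = running_min
    return q
```

With the estimated proportion of nulls fixed at 1, this reduces to the Benjamini-Hochberg adjustment. Calling genes significant at a q-value cutoff of, say, 0.05 then corresponds to one of the FDR levels in the table.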

ODP algorithm
1. Estimate which null hypotheses are true from the distribution of p-values from Kruskal-Wallis (KW) rank tests for all genes.
2. Determine the maximum likelihood distributions for all genes according to standard methods.
3. Evaluate the ODP statistic for each gene.
4. Use bootstrap resampling to obtain null statistics.
5. Contrast observed and expected ODP statistics to obtain q-values.
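Steps 2 and 3 of the algorithm above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: expression values are modeled as normal with one variance per gene, each gene is mean-centered first, and `null_weights` (1 for genes estimated to be true nulls in step 1, 0 otherwise) defaults to all ones. All function names are hypothetical. The bootstrap null and the q-value step are not shown.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def normal_loglik(values, means, var):
    """Log-likelihood of values under independent normals with a shared variance."""
    sq = sum((v - m) ** 2 for v, m in zip(values, means))
    return -0.5 * len(values) * math.log(2 * math.pi * var) - 0.5 * sq / var


def fit_gene(values, groups, var_floor=1e-8):
    """Plug-in fits for one centered gene: (null_means, null_var, alt_means, alt_var)."""
    n = len(values)
    null_means = [0.0] * n  # data are centered, so the null mean is 0
    null_var = max(sum(v * v for v in values) / n, var_floor)
    group_mean = {}
    for g in set(groups):
        members = [v for v, lab in zip(values, groups) if lab == g]
        group_mean[g] = sum(members) / len(members)
    alt_means = [group_mean[g] for g in groups]
    alt_var = max(sum((v - m) ** 2 for v, m in zip(values, alt_means)) / n, var_floor)
    return null_means, null_var, alt_means, alt_var


def odp_log_statistics(expression, groups, null_weights=None):
    """Log of the estimated ODP statistic for every gene (one row per gene)."""
    centered = []
    for row in expression:
        mean = sum(row) / len(row)
        centered.append([v - mean for v in row])
    fits = [fit_gene(row, groups) for row in centered]
    if null_weights is None:
        null_weights = [1.0] * len(fits)
    stats = []
    for row in centered:
        # Numerator: this gene's data under every gene's alternative fit.
        num = log_sum_exp([normal_loglik(row, am, av) for _, _, am, av in fits])
        # Denominator: the same data under the null fits of the estimated nulls.
        den = log_sum_exp([
            math.log(w) + normal_loglik(row, nm, nv)
            for w, (nm, nv, _, _) in zip(null_weights, fits) if w > 0
        ])
        stats.append(num - den)
    return stats
```

Because every gene's data are scored against the fitted densities of all genes, a gene whose pattern resembles that of many other differentially expressed genes gains significance. This borrowing of strength across genes is what a gene-by-gene t-test lacks.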

ODP Performance: simulated data A comparison of the ODP approach to five leading methods for identifying differentially expressed genes (described in the text and Figure 2) based on simulated data. The number of genes found to be significant by each method over a range of estimated q-value cutoffs is shown for a single, representative data set from each scenario. The proposed ODP approach is in black and the other methods are in gray. In general, the data sets increase in complexity from panels (a) to (d). (a) In this scenario, two groups are compared, there is perfectly symmetric differential expression, and the variances are simulated from a unimodal, well-behaved distribution. (b) Two groups are compared, there is moderate asymmetry in the differential expression, and the variances are simulated from a bimodal distribution. (c) Three groups are compared, there is slight asymmetry in differential expression, and the variances are simulated from a unimodal, well-behaved distribution. (d) Three groups are compared, there is moderate asymmetry in differential expression, and the variances are simulated from a bimodal distribution.

Surrogate Variable Analysis
Leek and Storey (2007) PLoS Genetics, in press
In addition to the primary measured variables that are estimated as fixed or random effects in an analysis, there are usually also unmodeled factors that contribute to expression heterogeneity (EH). For example, age, time of day, and nutrition probably all affect an analysis without being directly studied, yet they are more predictable than gene-specific noise. Sometimes the variable of interest is confounded with the hidden factors (e.g. batch with population). In many situations, SVA can be used to improve power.

SVA Simulation Simulated Example of Expression Heterogeneity (A) A heatmap of a simulated microarray study consisting of 1,000 genes measured on 20 arrays. (B) Genes in this simulated study are differentially expressed between two hypothetical treatment groups; here the two groups are shown as an indicator variable for each array. (C) Genes in each simulated study are affected by an independent factor that causes EH. This factor is distinct from, but possibly correlated with the group variable. Here the factor is shown as a quantitative variable, but it could also be an indicator variable or some linear or nonlinear function of the covariates.

SVA Table 1 The results of the significance analysis in the three real gene expression studies. The results of the genetics of gene expression study include the number of significant cis-linkages before and after adjusting for surrogate variables. The disease class results report the number of genes differentially expressed between BRCA1 and BRCA2 before and after adjusting for surrogate variables. For the timecourse study, the number of genes differentially expressed with respect to age are shown for an unadjusted analysis, an analysis adjusted for tissue type, and an SVA adjusted analysis. An SVA-adjusted analysis may result in an increase or decrease in the number of significant results depending on the direction and degree to which the unmodeled factors (now captured by surrogate variables) were confounded with the primary variables.

SVA Performance Impact of Expression Heterogeneity One thousand gene expression data sets containing EH were simulated, tested, and ranked for differential expression as detailed in Simulated Examples. (A) A boxplot of the standard deviation of the ranks of each gene for differential expression over repeated simulated studies. Results are shown for analyses that ignore expression heterogeneity (Unadjusted), take expression heterogeneity into account by SVA (Adjusted), and for simulated data unaffected by expression heterogeneity (Ideal). (B) For each simulated data set, a Kolmogorov-Smirnov test was employed to assess whether the p-values of null genes followed the correct null Uniform distribution (Supplementary Text). A quantile-quantile plot of the one thousand Kolmogorov-Smirnov p-values is shown for the SVA adjusted analysis (solid line) and the unadjusted analysis (dashed line). The SVA adjusted analysis provides correctly distributed null p-values, whereas the unadjusted analysis does not, because of EH. (C) A plot of expected true positives versus false discovery rate for the SVA adjusted (solid) and unadjusted (dashed) analyses. The SVA adjusted analysis shows increased power to detect true differential expression.
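The check described in panel (B), testing whether null p-values follow a Uniform(0, 1) distribution, can be illustrated with a small Kolmogorov-Smirnov sketch. The function name `ks_uniform` is hypothetical, and the p-value uses the asymptotic Kolmogorov series, an approximation assumed here that is reasonable only for fairly large samples.

```python
import math


def ks_uniform(p_values, terms=200):
    """Return (D, approximate p-value) for a KS test against Uniform(0, 1)."""
    n = len(p_values)
    ordered = sorted(p_values)
    # D is the largest gap between the empirical CDF and the uniform CDF F(p) = p,
    # checked just before and just after each jump of the empirical CDF.
    d = max(max((i + 1) / n - p, p - i / n) for i, p in enumerate(ordered))
    lam = math.sqrt(n) * d
    # Asymptotic Kolmogorov series, truncated after `terms` terms.
    tail = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
        for k in range(1, terms + 1)
    )
    return d, min(1.0, max(0.0, tail))
```

A small KS p-value for the null genes of a data set indicates that their p-values are not uniform, which is the behavior reported above for the unadjusted analysis.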

SVA Procedure
1. Remove the signal due to the primary variable(s) of interest to obtain a residual expression matrix.
2. Apply a decomposition to the residual expression matrix to identify signatures of EH in terms of an orthogonal basis of singular vectors.
3. Use a statistical test to determine the singular vectors that represent significantly more variation than would be expected by chance.
4. Identify the subset of genes driving each orthogonal signature of EH.
5. For each subset of genes, build a surrogate variable based on the full EH signature of that subset in the original data.
6. Include all significant surrogate variables as covariates in subsequent regression analyses, allowing for gene-specific coefficients for each surrogate variable.
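Steps 1 and 2 of the procedure above can be sketched in a few lines. This is a simplified illustration, not the published algorithm: it assumes the primary variable is a group label, removes it by subtracting per-group means for each gene, and recovers only the leading right singular vector by power iteration. The significance test, gene subsetting, and surrogate construction of steps 3 to 6 are not shown, and all function names are hypothetical.

```python
import math


def residualize(expression, groups):
    """Step 1: subtract each gene's group means to remove the primary-variable signal."""
    residuals = []
    for row in expression:
        sums, counts = {}, {}
        for v, g in zip(row, groups):
            sums[g] = sums.get(g, 0.0) + v
            counts[g] = counts.get(g, 0) + 1
        residuals.append([v - sums[g] / counts[g] for v, g in zip(row, groups)])
    return residuals


def top_right_singular_vector(matrix, iterations=200):
    """Step 2 (partial): leading right singular vector via power iteration on M^T M."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [float(j + 1) for j in range(n_cols)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        u = [sum(r * x for r, x in zip(row, v)) for row in matrix]  # M v
        w = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # M^T u
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # no residual variation left to decompose
            break
        v = [x / norm for x in w]
    return v


def pearson(a, b):
    """Pearson correlation, used here to compare a signature with a known factor."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)
```

The returned vector has one entry per array. In a simulation where the hidden factor is known, its correlation with that factor (up to sign) indicates how well the EH signature has been captured.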

SVA: Trans-eQTL detection SVA Captures EH Due to Genotype (A) A plot of significant linkage peaks (p-value < 1e-7) for expression QTL in the Brem et al. [10, 21] study by marker location (x-axis) and expression trait location (y-axis). (B) Significant linkage peaks (p-value < 1e-7) after adjusting for surrogate variables. Large trans-linkage peaks on Chromosomes II, III, VII, XII, XIV and XV have been eliminated without reducing cis-linkage peaks.

SVA: Breast Cancer Study Surrogate Variables from Human Studies (A) A plot of the top surrogate variable estimated from the breast cancer data [22]. The BRCA1 group is relatively homogeneous (triangles), but the BRCA2 group shows substantial heterogeneity (pluses). (B) A plot of tissue type versus array for the Rodwell et al. [7] study (dotted line) and the top surrogate variable estimated from the expression data when tissue was ignored (dashed line). There is strong correlation between the top surrogate variable and the tissue type variable.

SVA: Moroccan study