Controlling FDR in Second Stage Analysis Catherine Tuglus Work with Mark van der Laan UC Berkeley Biostatistics.

Presentation transcript:

Controlling FDR in Second Stage Analysis Catherine Tuglus Work with Mark van der Laan UC Berkeley Biostatistics

Outline
- What is a Second Stage Analysis
- Issues with MTP for Secondary Analysis
- Proposed solution for Marginal FDR controlling procedure
- Simulations
- Data Example: Golub et al 1999

Second Stage Analysis
- Given a large dataset (50,000 variables)
- Dimension reduction is performed using supervised analysis
  - Univariate regression
  - RandomForest selection, etc.
- Additional analysis is applied to the reduced dataset (~1000 variables)
  - "Secondary Analysis"
  - Variable Importance Methods, for instance
- Would like to adjust for multiple testing

MTP for Secondary Analysis
- Supervised reduction of the data invalidates standard MTPs
  - Adds bias to the analysis
  - Standard MTPs cannot account for the initial screening
- A standard MTP will not control Type I and Type II error appropriately

Marginal FDR controlling MTP for Secondary Analysis
Process
- Given (Y,W)~P, where W contains M variables
- Initial analysis reduces the set to N variables
- Complete the secondary analysis on the reduced dataset (N variables), obtaining N p-values
- Add (M-N) 1's to the list of p-values
  - Thus, all tests not completed are treated as insignificant
- Apply the marginal Benjamini & Hochberg step-up FDR controlling procedure to all M p-values
If FDR applied to all variables would select a subset of the N variables, then this two-stage FDR method is equivalent to applying FDR to all variables. Thus, loss in power only occurs if the N variables exclude significant variables.
- Should be generous in the reduction of the data
- To maximize power, the reduced dataset should include all significant variables.
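The process on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `bh_adjust` and `two_stage_bh` and the example p-values are assumptions made up for this sketch. It pads the N secondary p-values with (M-N) 1's and computes Benjamini & Hochberg step-up adjusted p-values over all M.

```python
def bh_adjust(pvalues):
    """Benjamini & Hochberg step-up adjusted p-values, returned in input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def two_stage_bh(secondary_pvalues, m_total):
    """Pad N secondary p-values with (M - N) ones, then apply BH over all M.

    `m_total` is M, the number of variables before the initial screening
    (hypothetical parameter name used only in this sketch).
    """
    n = len(secondary_pvalues)
    if m_total < n:
        raise ValueError("m_total must be at least the number of secondary p-values")
    padded = list(secondary_pvalues) + [1.0] * (m_total - n)
    # Only the first N adjusted values belong to tests that were actually run.
    return bh_adjust(padded)[:n]


# Made-up example: N = 4 variables survive screening out of M = 1000.
secondary = [0.0001, 0.0004, 0.02, 0.3]
print(two_stage_bh(secondary, m_total=1000))
print(bh_adjust(secondary))  # naive BH that ignores the screening
```

With these made-up inputs the padded procedure gives adjusted p-values of about 0.1, 0.2, 1.0 and 1.0, while naive BH on the four values alone gives about 0.0004 for the smallest one, which shows how much ignoring the screening step can understate the adjustment.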

Simulations: Set-up
- Simulate 100 variables from a Multivariate Normal Distribution with random mean and a covariance matrix equal to the identity scaled to variance 10
- Y is dependent on 10 variables, equally
- Using results from univariate linear regression, apply the VIM method to variable subsets with raw p-values less than 0.05, 0.1, 0.2, 0.3, and 1
- MTP for secondary analysis is applied to p-values from all 5 sets of VIM results
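A rough sketch of this set-up, under several assumptions the slide does not state: 200 observations, means drawn uniformly from [-5, 5], a coefficient of 1 on each of the 10 true variables, standard normal noise, and a normal approximation to the t-test for the regression slope. The VIM method is not specified here, so the sketch reuses the univariate p-values as a stand-in for the secondary analysis. All function names are hypothetical.

```python
import math
import random


def slope_pvalue(x, y):
    """Approximate two-sided p-value for the slope of y regressed on x.

    Uses t = r * sqrt((n - 2) / (1 - r^2)) with a normal approximation
    to the t distribution (an assumption made for this sketch).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))


def bh_adjust(pvalues):
    """Benjamini & Hochberg step-up adjusted p-values, in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted, running_min = [0.0] * m, 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def simulate(n_obs=200, n_vars=100, n_true=10, seed=0):
    """Independent normal variables with variance 10; Y sums the first n_true."""
    rng = random.Random(seed)
    sd = math.sqrt(10.0)
    means = [rng.uniform(-5.0, 5.0) for _ in range(n_vars)]
    columns = [[rng.gauss(means[j], sd) for _ in range(n_obs)]
               for j in range(n_vars)]
    y = [sum(columns[j][i] for j in range(n_true)) + rng.gauss(0.0, 1.0)
         for i in range(n_obs)]
    return columns, y


def run(cutoffs=(0.05, 0.1, 0.2, 0.3, 1.0), alpha=0.05):
    """For each cutoff: screen, pad with 1's, apply BH, report rejections."""
    columns, y = simulate()
    raw = [slope_pvalue(x, y) for x in columns]
    m = len(raw)
    results = {}
    for c in cutoffs:
        # "<=" rather than "<" so that the cutoff of 1 keeps every variable.
        kept = [j for j in range(m) if raw[j] <= c]
        # Stand-in for the VIM step: reuse the raw p-values of kept variables.
        padded = [raw[j] for j in kept] + [1.0] * (m - len(kept))
        adjusted = bh_adjust(padded)[:len(kept)]
        rejected = sorted(j for j, a in zip(kept, adjusted) if a <= alpha)
        results[c] = (len(kept), rejected)
    return results


for cutoff, (n_kept, rejected) in run().items():
    print(cutoff, n_kept, rejected)
```

Because the stand-in secondary p-values are the screening p-values themselves, the set rejected at FDR level 0.05 should be identical for every cutoff: any variable rejected by BH on all 100 p-values has a raw p-value of at most 0.05 and therefore survives each screen. With a genuine VIM step the secondary p-values would differ from the screening ones and this need not hold exactly.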

Simulations: Results (Ranking of P-values)
[Figure: Sensitivity (Power) and Type I error (1-Specificity) plotted against P-value Rank]

Simulations: Results (P-value cut-off)
[Figure: panels showing Sensitivity (Power) and Type I error (1-Specificity); axis labels: P-value cut-off, P-value Rank]

Application: Golub et al. 1999
- Classification of AML vs ALL using microarray gene expression data
- 38 individuals (27 ALL, 11 AML)
- Originally 6817 human genes, reduced to 3051 genes using pre-processing methods outlined in Dudoit et al 2003
- Objective: Identify biomarkers which are differentially expressed (ALL vs AML)
- Univariate generalized linear regression is applied
- VIM method is applied to subsets with raw p-values less than 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, and 1
- MTP for secondary analysis is applied to p-values from all 7 sets of VIM results

Application: Results
[Figure: FDR adjusted p-values plotted against P-value rank]

Summary
- Assuming all significant variables are present in the reduced set of variables, the MTP for secondary analysis has Power and Type I error control equivalent to applying the FDR procedure to all variables
- Can still control FDR even if the secondary analysis is only completed on a subset of the original variables

References
- “Short Note: FDR Controlling Multiple Testing Procedure for Secondary Analysis” (Tech Report...)
- Y. Ge, S. Dudoit, and T. P. Speed (2003). Resampling-based multiple testing for microarray data analysis. TEST, Vol. 12, No. 1 (plus discussion). Tech report #633.
- Golub et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, Vol. 286.