Controlling FDR in Second Stage Analysis Catherine Tuglus Work with Mark van der Laan UC Berkeley Biostatistics.


1 Controlling FDR in Second Stage Analysis
Catherine Tuglus
Work with Mark van der Laan
UC Berkeley Biostatistics

2 Outline
- What is a Second Stage Analysis
- Issues with MTP for Secondary Analysis
- Proposed solution for Marginal FDR controlling procedure
- Simulations
- Data Example: Golub et al 1999

3 Second Stage Analysis
- Given a large dataset (50,000 variables)
- Dimension reduction is performed using supervised analysis
  - Univariate regression
  - RandomForest selection, etc.
- Additional analysis is applied to the reduced dataset (~1000 variables)
  - "Secondary Analysis"
  - Variable Importance Methods, for instance
- Would like to adjust for multiple testing

4 MTP for Secondary Analysis
- Supervised reduction of the data invalidates standard multiple testing procedures (MTPs)
  - Adds bias to the analysis
  - Standard MTPs cannot account for the initial screening
- A standard MTP will not control Type I and Type II error appropriately

5 Marginal FDR controlling MTP for Secondary Analysis
- Process
  - Given (Y,W)~P, where W contains M variables
  - Initial analysis reduces the set to N variables
  - Complete the secondary analysis on the reduced dataset (N variables), obtaining p-values
  - Append (M-N) 1's to the list of p-values
    - Thus, all tests not completed are treated as non-significant
  - Apply the marginal Benjamini & Hochberg step-up FDR controlling procedure to the full list of M p-values
- If FDR applied to all variables would select a subset of the N variables, then this two-stage FDR method is equivalent to applying FDR to all variables. Thus, loss in power only occurs if the N variables exclude significant variables.
- Should be generous in the reduction of the data
  - To maximize power, the reduced dataset should include all significant variables.
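The process on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `two_stage_bh`, its arguments, and the example p-values are all assumptions made for the sketch. It pads the N secondary-analysis p-values with (M-N) ones and computes Benjamini & Hochberg step-up adjusted p-values over all M tests.

```python
def two_stage_bh(screened_pvalues, m_total, alpha=0.05):
    """BH step-up on the screened p-values, padded with ones up to m_total tests.

    Hypothetical helper for illustration. Returns (rejected, adjusted),
    two lists aligned with screened_pvalues.
    """
    n = len(screened_pvalues)
    if n > m_total:
        raise ValueError("m_total must be at least the number of screened tests")
    # Tests that were never run get p-value 1, so they can never be rejected.
    padded = list(screened_pvalues) + [1.0] * (m_total - n)
    order = sorted(range(m_total), key=lambda i: padded[i])
    adjusted = [1.0] * m_total
    running_min = 1.0
    # Walk from the largest p-value down so adjusted p-values stay monotone.
    for rank in range(m_total, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, padded[idx] * m_total / rank)
        adjusted[idx] = running_min
    adjusted = adjusted[:n]
    rejected = [a <= alpha for a in adjusted]
    return rejected, adjusted


# Made-up p-values for N = 5 screened variables out of M = 1000.
p_screened = [1e-5, 4e-5, 0.02, 0.03, 0.6]
rejected_two_stage, _ = two_stage_bh(p_screened, m_total=1000)
rejected_naive, _ = two_stage_bh(p_screened, m_total=5)  # ignores the screening
print(sum(rejected_two_stage), sum(rejected_naive))
```

With these made-up numbers, accounting for all M = 1000 original tests rejects two hypotheses, while treating the five screened tests as the whole family rejects four, which is the over-optimism the padding is meant to prevent.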

6 Simulations: Set-up
- Simulate 100 variables from a Multivariate Normal Distribution with random mean and a diagonal (identity-structured) covariance matrix with variance 10
- Y is dependent on 10 variables, equally
- Using results from univariate linear regression, apply the VIM method to variable subsets with raw p-values less than 0.05, 0.1, 0.2, 0.3, and 1
- MTP for secondary analysis is applied to p-values from all 5 sets of VIM results
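A rough sketch of the first stage of this set-up follows. The slide does not give the sample size, the distribution of the random means, the noise level, or the coefficients, so `n_obs=200`, means drawn uniformly from [-5, 5], unit-variance noise, and unit coefficients are all assumptions. The p-values use a normal approximation to the t statistic to keep the example dependency-free, and the VIM step itself is not implemented here.

```python
import math
import random


def simulate(n_obs=200, n_vars=100, n_signal=10, seed=0):
    """Simulate independent normal covariates and an outcome (assumed settings)."""
    rng = random.Random(seed)
    means = [rng.uniform(-5, 5) for _ in range(n_vars)]  # assumed range of random means
    sd = math.sqrt(10.0)  # variance 10 on the diagonal
    X = [[rng.gauss(means[j], sd) for j in range(n_vars)] for _ in range(n_obs)]
    # Y depends equally on the first n_signal variables; unit noise is an assumption.
    y = [sum(row[:n_signal]) + rng.gauss(0.0, 1.0) for row in X]
    return X, y


def univariate_pvalue(x, y):
    """Two-sided p-value for the slope of y ~ x (normal approximation to t)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return math.erfc(abs(slope / se) / math.sqrt(2.0))


X, y = simulate()
raw_p = [univariate_pvalue([row[j] for row in X], y) for j in range(len(X[0]))]

# One screened subset per cut-off; the cut-off of 1 keeps every variable.
cutoffs = [0.05, 0.1, 0.2, 0.3, 1.0]
subsets = {c: [j for j, p in enumerate(raw_p) if p <= c] for c in cutoffs}
for c in cutoffs:
    print(c, len(subsets[c]))
```

Each subset would then be passed to the secondary (VIM) analysis, and the resulting p-values padded with ones back up to 100 tests before applying the FDR procedure.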

7 Simulations: Results
Ranking of P-values
[Figure: Sensitivity (Power) and Type I error (1-Specificity) panels; axis label: P-value Rank]

8 Simulations: Results
P-value cut-off
[Figure: Sensitivity (Power) and Type I error (1-Specificity) panels; axis label: P-value Rank]

9 Application: Golub et al. 1999
- Classification of AML vs ALL using microarray gene expression data
- 38 individuals (27 ALL, 11 AML)
- Originally 6817 human genes, reduced to 3051 genes using pre-processing methods outlined in Dudoit et al 2003
- Objective: Identify biomarkers which are differentially expressed (ALL vs AML)
- Univariate generalized linear regression is applied
- VIM method is applied to subsets with raw p-values less than 0.01, 0.025, 0.05, 0.1, 0.2, 0.3, and 1
- MTP for secondary analysis is applied to p-values from all 7 sets of VIM results

10 Application: Results
Ranked vs P-value
[Figure: FDR adjusted p-values; axis label: P-value rank]

11 Summary
- Assuming all significant variables are present in the reduced set of variables, MTP for secondary analysis has Power and Type I error control equivalent to applying the MTP to all variables
- Can still control FDR even if secondary analysis is only completed on a subset of the original variables
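The equivalence claim can be checked on a toy example. The sketch below assumes that the secondary analysis yields the same p-value for a variable whether or not screening took place; the helper names and the ten p-values are invented for illustration. It compares the Benjamini & Hochberg step-up rejections on all M p-values with those obtained after replacing the unscreened p-values by 1.

```python
def bh_rejections(pvalues, alpha=0.05):
    """Indices rejected by the BH step-up rule: largest k with p_(k) <= k*alpha/m."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            k = rank
    return set(order[:k])


def pad_unscreened(pvalues, keep):
    """Replace p-values of variables outside `keep` by 1 (tests not completed)."""
    return [p if i in keep else 1.0 for i, p in enumerate(pvalues)]


# Invented p-values for M = 10 variables.
full = [0.001, 0.004, 0.012, 0.09, 0.3, 0.45, 0.6, 0.7, 0.85, 0.95]
full_rejections = bh_rejections(full)

# Generous screen: every variable BH would reject survives, so nothing is lost.
generous = {i for i, p in enumerate(full) if p <= 0.1}
generous_rejections = bh_rejections(pad_unscreened(full, generous))

# Overly strict screen: significant variables are dropped, so power is lost.
strict = {i for i, p in enumerate(full) if p <= 0.003}
strict_rejections = bh_rejections(pad_unscreened(full, strict))

print(sorted(full_rejections), sorted(generous_rejections), sorted(strict_rejections))
```

Padding can only raise p-values, so the two-stage procedure never rejects more hypotheses than the full procedure would; it rejects the same set when the screen retains everything the full procedure rejects.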

12 References
- "Short Note: FDR Controlling Multiple Testing Procedure for Secondary Analysis" (Tech Report...)
- Y. Ge, S. Dudoit, and T. P. Speed (2003). Resampling-based multiple testing for microarray data analysis. TEST, Vol. 12, No. 1, p. 1-44 (plus discussion p. 44-77). Tech report #633.
- Golub et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, Vol. 286:531-537.

