Multiple Testing and Prediction and Variable Selection Class web site: Statistics for Microarrays

cDNA gene expression data
Data on G genes for n mRNA samples, arranged as a genes × samples matrix. The expression level of gene i in mRNA sample j is the (normalized) log(Red intensity / Green intensity).
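As a minimal sketch (with made-up intensity values), the expression matrix of log-ratios can be computed as:

```python
import numpy as np

# Hypothetical background-corrected channel intensities: rows = genes, columns = samples
red = np.array([[1200.0, 300.0, 900.0],
                [800.0, 1600.0, 400.0]])
green = np.array([[600.0, 600.0, 900.0],
                  [800.0, 400.0, 800.0]])

# Expression matrix: log ratio of the two channels for each gene/sample
M = np.log2(red / green)
```

Each entry of M is then the (unnormalized) log-ratio for one gene in one sample; normalization would be applied to M before any testing.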

Multiple Testing Problem
Simultaneously test G null hypotheses, one for each gene j: H_j: no association between the expression level of gene j and the covariate or response. Because microarray experiments simultaneously monitor the expression levels of thousands of genes, there is a large multiplicity issue. We would like some sense of how 'surprising' the observed results are.

Hypothesis Truth vs. Decision

                     # not rejected    # rejected    totals
  # true H           U                 V (F+)        m0
  # non-true H       T                 S             m1
  totals             m - R             R             m

Type I (False Positive) Error Rates
Per-family Error Rate: PFER = E(V)
Per-comparison Error Rate: PCER = E(V)/m
Family-wise Error Rate: FWER = P(V ≥ 1)
False Discovery Rate: FDR = E(Q), where Q = V/R if R > 0 and Q = 0 if R = 0
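A small simulation makes the distinction between these rates concrete (the number of tests, the split into true and false nulls, and the alternative p-value distribution are all invented for illustration): V counts false positives, R counts all rejections.

```python
import numpy as np

rng = np.random.default_rng(0)
m, m0, alpha, n_sim = 100, 90, 0.05, 2000   # 90 true nulls, 10 false nulls

V_list, R_list = [], []
for _ in range(n_sim):
    # p-values: uniform under true nulls, stochastically small under false nulls
    p_null = rng.uniform(size=m0)
    p_alt = rng.beta(0.1, 1.0, size=m - m0)   # skewed toward 0
    V_sim = (p_null < alpha).sum()            # false positives
    R_sim = V_sim + (p_alt < alpha).sum()     # all rejections
    V_list.append(V_sim)
    R_list.append(R_sim)

V, R = np.array(V_list), np.array(R_list)
PFER = V.mean()                               # E(V)
PCER = V.mean() / m                           # E(V)/m
FWER = (V >= 1).mean()                        # P(V >= 1)
Q = np.where(R > 0, V / np.maximum(R, 1), 0.0)
FDR = Q.mean()                                # E(Q)
```

With 90 independent true nulls tested at level 0.05, the per-family rate is near 90 × 0.05 = 4.5 while the family-wise rate is close to 1, illustrating why unadjusted testing is untenable at this scale.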

Strong vs. Weak Control All probabilities are conditional on which hypotheses are true Strong control refers to control of the Type I error rate under any combination of true and false nulls Weak control refers to control of the Type I error rate only under the complete null hypothesis (i.e. all nulls true) In general, weak control without other safeguards is unsatisfactory

Comparison of Type I Error Rates
In general, for a given multiple testing procedure, PCER ≤ FWER ≤ PFER and FDR ≤ FWER, with FDR = FWER under the complete null.

Adjusted p-values (p*)
If interest is in controlling, e.g., the FWER, the adjusted p-value for hypothesis H_j is p*_j = inf{α : H_j is rejected at FWER level α}. Hypothesis H_j is rejected at FWER α if p*_j ≤ α. Adjusted p-values for other Type I error rates are defined analogously.

Some Advantages of p-value Adjustment Test level (size) does not need to be determined in advance Some procedures most easily described in terms of their adjusted p-values Usually easily estimated using resampling Procedures can be readily compared based on the corresponding adjusted p-values

A Little Notation
For hypothesis H_j, j = 1, …, G: observed test statistic t_j; observed unadjusted p-value p_j. Order the observed |t_j| decreasingly: {r_j} such that |t_{r_1}| ≥ |t_{r_2}| ≥ … ≥ |t_{r_G}|. Order the observed p_j increasingly: {r_j} such that p_{r_1} ≤ p_{r_2} ≤ … ≤ p_{r_G}. The corresponding random variables are denoted by upper-case letters (T, P).

Control of the FWER
Bonferroni single-step adjusted p-values: p*_j = min(G p_j, 1)
Holm (1979) step-down adjusted p-values: p*_{r_j} = max_{k = 1, …, j} { min((G − k + 1) p_{r_k}, 1) }
Hochberg (1988) step-up adjusted p-values (based on Simes' inequality): p*_{r_j} = min_{k = j, …, G} { min((G − k + 1) p_{r_k}, 1) }
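The three adjustments above can be sketched in a few lines; `adjust_pvalues` is a hypothetical helper name, and the vectorized running max/min mirrors the step-down/step-up structure of the formulas:

```python
import numpy as np

def adjust_pvalues(p, method="holm"):
    """Bonferroni, Holm (step-down), or Hochberg (step-up) adjusted p-values."""
    p = np.asarray(p, dtype=float)
    G = p.size
    order = np.argsort(p)                 # indices of p sorted ascending
    p_sorted = p[order]
    mult = G - np.arange(G)               # multipliers G - k + 1 for k = 1..G
    if method == "bonferroni":
        adj_sorted = np.minimum(G * p_sorted, 1.0)
    elif method == "holm":
        # step-down: running max of (G - k + 1) * p_(k) from the smallest p up
        adj_sorted = np.minimum(np.maximum.accumulate(mult * p_sorted), 1.0)
    elif method == "hochberg":
        # step-up: running min of (G - k + 1) * p_(k) from the largest p down
        scaled = np.minimum(mult * p_sorted, 1.0)
        adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    else:
        raise ValueError(method)
    adj = np.empty(G)
    adj[order] = adj_sorted               # back to the original gene order
    return adj
```

On p = (0.01, 0.04, 0.045), for example, Holm gives (0.03, 0.08, 0.08) while Hochberg gives the uniformly smaller (0.03, 0.045, 0.045), reflecting its extra power under Simes' inequality.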

Control of the FWER
Westfall & Young (1993) step-down minP adjusted p-values: p*_{r_j} = max_{k = 1, …, j} { P( min_{l ∈ {r_k, …, r_G}} P_l ≤ p_{r_k} | H_0^C ) }
Westfall & Young (1993) step-down maxT adjusted p-values: p*_{r_j} = max_{k = 1, …, j} { P( max_{l ∈ {r_k, …, r_G}} |T_l| ≥ |t_{r_k}| | H_0^C ) }
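A resampling sketch of the step-down maxT adjustment (the function name, the two-sample difference-of-means statistic, and the data layout are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def maxT_adjusted(X, labels, n_perm=1000, seed=0):
    """Westfall-Young step-down maxT adjusted p-values by permutation (sketch).

    X: (G genes x n samples) expression matrix; labels: binary group indicator.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)

    def stats(lab):
        # illustrative test statistic: absolute difference of group means
        return np.abs(X[:, lab == 1].mean(axis=1) - X[:, lab == 0].mean(axis=1))

    t_obs = stats(labels)
    G = t_obs.size
    order = np.argsort(-t_obs)            # genes from largest |t| down
    count = np.zeros(G)
    for _ in range(n_perm):
        t_perm = stats(rng.permutation(labels))
        # step-down: successive maxima over the less significant genes
        succ_max = np.maximum.accumulate(t_perm[order][::-1])[::-1]
        count += succ_max >= t_obs[order]
    p_adj_sorted = np.maximum.accumulate(count / n_perm)   # enforce monotonicity
    p_adj = np.empty(G)
    p_adj[order] = p_adj_sorted
    return p_adj
```

Because the permutations preserve the correlation between genes, the adjustment takes the joint distribution of the statistics into account, which is exactly what Bonferroni-type adjustments ignore.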

Westfall & Young (1993) Adjusted p-values Step-down procedures: successively smaller adjustments at each step Take into account the joint distribution of the test statistics Less conservative than Bonferroni, Holm, or Hochberg adjusted p-values Can be estimated by resampling but computer-intensive (especially for minP)

maxT vs. minP The maxT and minP adjusted p-values are the same when the test statistics are identically distributed (id) When the test statistics are not id, maxT adjustments may be unbalanced (not all tests contribute equally to the adjustment) maxT more computationally tractable than minP maxT can be more powerful in ‘small n, large G’ situations

Control of the FDR
Benjamini & Hochberg (1995): step-up procedure which controls the FDR under some dependency structures: p*_{r_j} = min_{k = j, …, G} { min((G/k) p_{r_k}, 1) }
Benjamini & Yekutieli (2001): conservative step-up procedure which controls the FDR under general dependency structures: p*_{r_j} = min_{k = j, …, G} { min((G Σ_{i=1}^{G} (1/i) / k) p_{r_k}, 1) }
Yekutieli & Benjamini (1999): resampling-based adjusted p-values for controlling the FDR under certain types of dependency structures
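The Benjamini & Hochberg step-up adjustment above can be sketched directly (hypothetical helper name):

```python
import numpy as np

def bh_adjust(p):
    """Benjamini-Hochberg step-up FDR-adjusted p-values."""
    p = np.asarray(p, dtype=float)
    G = p.size
    order = np.argsort(p)
    ranks = np.arange(1, G + 1)                      # k = 1..G for sorted p
    scaled = np.minimum(G / ranks * p[order], 1.0)   # (G/k) * p_(k)
    # step-up: running minimum from the largest p down
    adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adj = np.empty(G)
    adj[order] = adj_sorted
    return adj
```

Rejecting all genes with bh_adjust(p) ≤ q is then equivalent to the usual BH rule of finding the largest k with p_(k) ≤ (k/G) q.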

Identification of Genes Associated with Survival
Data: survival time y_i and gene expression x_ij for individuals i = 1, …, n and genes j = 1, …, G
Fit a Cox model for each gene singly: h(t) = h_0(t) exp(β_j x_ij)
For any gene j = 1, …, G, we can test H_j: β_j = 0
Complete null H_0^C: β_j = 0 for all j = 1, …, G
The H_j are tested on the basis of the Wald statistics t_j and their associated p-values p_j

Datasets
Lymphoma (Alizadeh et al.): 40 individuals, 4026 genes
Melanoma (Bittner et al.): 15 individuals, 3613 genes
Both available at

Results: Lymphoma

Results: Melanoma

Other Proposals from the Microarray Literature
'Neighborhood Analysis', Golub et al.: in general, gives only weak control of the FWER
'Significance Analysis of Microarrays (SAM)' (2 versions): Efron et al. (2000): weak control of the PFER; Tusher et al. (2001): strong control of the PFER
SAM also estimates an 'FDR', but this 'FDR' is defined as E(V | H_0^C)/R, not E(V/R)

Controversies Whether multiple testing methods (adjustments) should be applied at all Which tests should be included in the family (e.g. all tests performed within a single experiment; define ‘experiment’) Alternatives –Bayesian approach –Meta-analysis

Situations where inflated error rates are a concern
It is plausible that all nulls may be true
A serious claim will be made whenever any p < .05 is found
Much data manipulation may be performed to find a 'significant' result
The analysis is planned to be exploratory, but there is a wish to claim that 'significant' results are real
The experiment is unlikely to be followed up before serious actions are taken

References
Alizadeh et al. (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403:
Benjamini and Hochberg (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSSB 57:
Benjamini and Yekutieli (2001) The control of the false discovery rate in multiple testing under dependency. Annals of Statistics
Bittner et al. (2000) Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406:
Efron et al. (2000) Microarrays and their use in a comparative experiment. Tech report, Dept. of Statistics, Stanford
Golub et al. (1999) Molecular classification of cancer. Science 286:

References
Hochberg (1988) A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75:
Holm (1979) A simple sequentially rejective multiple test procedure. Scand. J. Statistics 6:
Ihaka and Gentleman (1996) R: A language for data analysis and graphics. J. Comp. Graph. Stats 5:
Tusher et al. (2001) Significance analysis of microarrays applied to the ionizing radiation response. PNAS 98:
Westfall and Young (1993) Resampling-based multiple testing: Examples and methods for p-value adjustment. New York: Wiley
Yekutieli and Benjamini (1999) Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. J. Stat. Plan. Inf. 82:

(BREAK)

Prediction and Variable Selection Substantial statistical literature on model selection for minimizing prediction error Most of the focus is on linear models Almost universally assumed (in the statistics literature) that n > (or >>) p, the number of available predictors Other fields (e.g. chemometrics) have been dealing with the n << p problem

Model Selection (Generic) Select the class of models to be considered (e.g. linear models, regression trees, etc) Use a procedure to compare models in the class Search the model space Assess prediction error

Model Selection and Assessment The generalization performance of a learning method relates to its prediction capability on independent test data This performance guides model choice Performance is a measure of quality of the chosen model

Bias, Variance, and Model Complexity
Test error (or generalization error) is the expected prediction error over an independent test sample: Err = E[L(Y, f̂(X))]
Training error is the average loss over the training sample: err = (1/n) Σ_{i=1}^{n} L(y_i, f̂(x_i))

Error vs. Complexity (figure): prediction error as a function of model complexity for training and test samples; training error decreases with complexity, while test error is U-shaped. Low complexity gives high bias / low variance; high complexity gives low bias / high variance.
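The figure's qualitative behavior can be reproduced with a small simulation (the data-generating process and the choice of polynomial degrees are invented for illustration): fits of increasing degree to noisy data drive training error down while test error eventually rises.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_test = 30, 5000
x = rng.uniform(-1, 1, n)
y = np.sin(3 * x) + rng.normal(0, 0.3, n)             # noisy nonlinear signal
x_test = rng.uniform(-1, 1, n_test)
y_test = np.sin(3 * x_test) + rng.normal(0, 0.3, n_test)

train_err, test_err = {}, {}
for degree in (1, 5, 15):                              # model complexity = degree
    coefs = np.polyfit(x, y, degree)
    train_err[degree] = np.mean((y - np.polyval(coefs, x)) ** 2)
    test_err[degree] = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
```

The linear fit underfits (high bias), the degree-15 fit chases the noise in the 30 training points (high variance), and the intermediate model does best on the independent test sample.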

Using the data Ideally, divide data into 3 sets: –Training set: used to fit models –Validation set: used to estimate prediction error for model selection –Test set: used to assess the generalization error for the final model How much training data are ‘enough’ depends on signal-noise ratio, model complexity, etc. Most microarray data sets too small for dividing further

Approximate Validation Methods
Analytic methods include: Akaike information criterion (AIC), Bayesian information criterion (BIC), Minimum description length (MDL)
Sample re-use methods: Cross-validation, Bootstrap
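A generic k-fold cross-validation loop is only a few lines; the helper name and the fit/predict callback interface here are illustrative:

```python
import numpy as np

def kfold_cv_mse(x, y, fit, predict, k=5, seed=0):
    """k-fold cross-validation estimate of prediction MSE (sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))                 # shuffle before splitting
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(x[train], y[train])           # fit on k-1 folds
        errs.append(np.mean((y[test] - predict(model, x[test])) ** 2))
    return np.mean(errs)                          # average held-out error
```

Passing, e.g., np.polyfit/np.polyval callbacks lets the same loop compare models of different complexity on the validation criterion.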

Some Approaches when n < p
Some kind of initial screening is essential for many types of models
Rank genes by variance (or coefficient of variation) across samples and use only the largest
Dimensionality reduction through principal components: use the first few PCs as variables
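Both screening approaches are direct to sketch on a synthetic expression matrix (all sizes and the variance inflation below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 40))          # G = 2000 genes x n = 40 samples (synthetic)
X[:50] *= 3.0                            # give the first 50 genes inflated variance

# Screening: keep the 100 genes with largest variance across samples
top = np.argsort(X.var(axis=1))[::-1][:100]
X_screened = X[top]

# Dimensionality reduction: PC scores of the samples in gene space
Xc = X - X.mean(axis=1, keepdims=True)   # center each gene across samples
U, s, Vt = np.linalg.svd(Xc.T, full_matrices=False)
pcs = U[:, :5] * s[:5]                   # first 5 principal component scores per sample
```

The screened matrix or the first few PC score columns can then serve as the (now n > p) predictor set for a downstream model.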

Parametric Variable Selection (I)
Forward selection: start with no variables; add variables one at a time according to some criterion
Backward elimination: start with all variables; delete variables one at a time according to some criterion until some stopping rule is satisfied
'Stepwise': after each variable is added, test whether any previously selected variable may be deleted without appreciable loss of explanatory power
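Forward selection for a linear model can be sketched as follows (hypothetical helper; residual sum of squares as the greedy criterion):

```python
import numpy as np

def forward_select(X, y, max_vars=5):
    """Greedy forward selection for linear regression by RSS (sketch)."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    while remaining and len(selected) < max_vars:
        best, best_rss = None, np.inf
        for j in remaining:
            cols = selected + [j]
            A = np.column_stack([np.ones(n), X[:, cols]])   # intercept + candidates
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ beta) ** 2)
            if rss < best_rss:
                best, best_rss = j, rss
        selected.append(best)
        remaining.remove(best)
    return selected
```

In practice the stopping rule would use a criterion such as an F-test, AIC, or cross-validated error rather than a fixed max_vars.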

Parametric Variable Selection (II)
Sequential replacement: see whether any selected variable can be replaced by another, according to some criterion
Generating all subsets: feasible provided the number of variables is not too large and the performance criterion is not too difficult or time-consuming to compute
Branch and bound: divide the possible subsets into groups (branches); the search of some sub-branches may be avoided if they exceed a bound on some criterion

An Intriguing Approach
Gabriel and Pun (1979) suggested that when an exhaustive search is infeasible, it may be possible to separate the variables into groups for which an exhaustive search is feasible
For a linear model, the grouping would be such that the regression sum of squares is additive for variables in different groups (orthogonal; also under certain other conditions)
But it is hard to see how to extend this to other types of models, e.g. survival models

Tree-based Variable Selection
Tree-based models are most often used for prediction, with little attention to the details of the chosen model
Trees can be used to identify subsets of variables with good discriminatory power via importance statistics
One idea is to use bagging to generate a collection of tree predictors and importance statistics for each variable; variables can then be ranked by their (median, say) importance
Create a prediction-accuracy criterion for inclusion of variables in the final subset
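As a minimal stand-in for the bagging idea (decision stumps instead of full trees; all names and choices here are illustrative, not the method the slides describe), importance can be scored by how often each variable yields the best single split on a bootstrap sample:

```python
import numpy as np

def bagged_stump_importance(X, y, n_trees=200, seed=0):
    """Variable importance from a bag of decision stumps (sketch).

    Counts how often each variable provides the best single median split
    on a bootstrap sample of the data."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_trees):
        idx = rng.integers(0, n, n)              # bootstrap sample
        Xb, yb = X[idx], y[idx]
        best_j, best_err = 0, np.inf
        for j in range(p):
            pred = Xb[:, j] > np.median(Xb[:, j])
            # misclassification error, allowing either split polarity
            err = min(np.mean(pred != yb), np.mean(pred == yb))
            if err < best_err:
                best_j, best_err = j, err
        counts[best_j] += 1
    return counts / n_trees                      # selection frequency per variable
```

The resulting frequencies play the role of importance statistics: variables can be ranked by them, and a prediction-accuracy criterion applied to decide how many to keep.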

Genomic Computing for Variable Selection
A type of evolutionary computing algorithm
The goal is to evolve simple rules with high explanatory power
May do better than tree-based methods, where variables are selected on the basis of their individual importance (though bagging may improve this)

The Basic Strategy of Evolutionary Computing

What this course covered Biological basics of (mostly cDNA) microarray technology Special problems arising, particularly regarding normalization of arrays and multiple hypothesis testing Some ways that standard statistical techniques may be useful Some ways that more sophisticated techniques have been/may be applied Examples of areas where more research is needed

What was left out Pathway modeling –This is a very active field, as there is much interest in picking out genes working together based on expression –My view is that progress here will not come from generic ‘black box’ methods, but will instead require highly collaborative, directed modeling A comprehensive review of methods developed for analysis of microarray data –Instead, we have covered what are, in my opinion, some of the most important and fundamentally justifiable methods

Perspectives on the future
Technologies are evolving; don't get too 'locked in' to any particular technology
Keep an open mind to various problem-solving approaches…
…but that doesn't mean not to think!

Important Applications Include… Identification of therapeutic targets Molecular classification of cancers Host-parasite interactions Disease process pathways Genomic response to pathogens Many others

Acknowledgements Debashis Ghosh Erin Conlon Sandrine Dudoit José Correa