Design & Analysis of Microarray Studies for Diagnostic & Prognostic Classification
Richard Simon, D.Sc.
Chief, Biometric Research Branch
National Cancer Institute
http://linus.nci.nih.gov/brb
http://linus.nci.nih.gov/brb
PowerPoint presentations
Reprints & technical reports
BRB-ArrayTools software
BRB-ArrayTools data archive
Sample size planning for targeted clinical trials
Simon R, Korn E, McShane L, Radmacher M, Wright G, Zhao Y. Design and Analysis of DNA Microarray Investigations. Springer-Verlag, 2003.
Radmacher MD, McShane LM, Simon R. A paradigm for class prediction using gene expression profiles. Journal of Computational Biology 9:505-511, 2002.
Simon R, Radmacher MD, Dobbin K, McShane LM. Pitfalls in the analysis of DNA microarray data. Journal of the National Cancer Institute 95:14-18, 2003.
Dobbin K, Simon R. Comparison of microarray designs for class comparison and class discovery. Bioinformatics 18:1462-69, 2002; 19:803-810, 2003; 21:2430-37, 2005; 21:2803-4, 2005.
Dobbin K, Simon R. Sample size determination in microarray experiments for class comparison and prognostic classification. Biostatistics 6:27-38, 2005.
Dobbin K, Shih J, Simon R. Questions and answers on design of dual-label microarrays for identifying differentially expressed genes. Journal of the National Cancer Institute 95:1362-69, 2003.
Wright G, Simon R. A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics 19:2448-55, 2003.
Korn EL, Troendle JF, McShane LM, Simon R. Controlling the number of false discoveries. Journal of Statistical Planning and Inference 124:379-398, 2004.
Molinaro A, Simon R, Pfeiffer R. Prediction error estimation: a comparison of resampling methods. Bioinformatics 21:3301-7, 2005.
Simon R. Using DNA microarrays for diagnostic and prognostic prediction. Expert Review of Molecular Diagnostics 3(5):587-595, 2003.
Simon R. Diagnostic and prognostic prediction using gene expression profiles in high dimensional microarray data. British Journal of Cancer 89:1599-1604, 2003.
Simon R, Maitournam A. Evaluating the efficiency of targeted designs for randomized clinical trials. Clinical Cancer Research 10:6759-63, 2004.
Maitournam A, Simon R. On the efficiency of targeted clinical trials. Statistics in Medicine 24:329-339, 2005.
Simon R. When is a genomic classifier ready for prime time? Nature Clinical Practice Oncology 1:4-5, 2004.
Simon R. An agenda for clinical trials: clinical trials in the genomic era. Clinical Trials 1:468-470, 2004.
Simon R. Development and validation of therapeutically relevant multi-gene biomarker classifiers. Journal of the National Cancer Institute 97:866-867, 2005.
Simon R. A roadmap for developing and validating therapeutically relevant genomic classifiers. Journal of Clinical Oncology (in press).
Freidlin B, Simon R. Adaptive signature design. Clinical Cancer Research (in press).
Simon R. Validation of pharmacogenomic biomarker classifiers for treatment selection. Disease Markers (in press).
Simon R. Guidelines for the design of clinical studies for development and validation of therapeutically relevant biomarkers and biomarker classification systems. In: Hayes DF, Gasparini G (eds), Biomarkers in Breast Cancer. Humana Press (in press).
Myth: Microarray investigations should be unstructured data-mining adventures without clear objectives.
Good microarray studies have clear objectives, but generally not gene-specific mechanistic hypotheses. Design and analysis methods should be tailored to the study objectives.
Good Microarray Studies Have Clear Objectives
Class comparison: find genes whose expression differs among predetermined classes.
Class prediction: predict a predetermined class (phenotype) using information from the gene expression profile.
Class discovery: discover clusters of specimens having similar expression profiles; discover clusters of genes having similar expression profiles.
Class Comparison and Class Prediction
These are not clustering problems: the global similarity measures generally used for clustering arrays may not distinguish the classes, and clustering neither controls multiplicity nor separates the data used for classifier development from the data used for classifier evaluation.
They are supervised methods and require multiple biological samples from each class.
Levels of Replication
Technical replicates: an RNA sample divided into multiple aliquots and re-arrayed.
Biological replicates: multiple subjects, or replication of the tissue culture experiment.
Biological conclusions generally require independent biological replicates. The power of statistical methods for microarray data depends on the number of biological replicates. Technical replicates are useful insurance that at least one good-quality array will be obtained for each specimen.
Class Prediction Predict which tumors will respond to a particular treatment Predict which patients will relapse after a particular treatment
Class prediction methods usually include gene selection as a component. The criteria for gene selection differ between class prediction and class comparison: for class comparison the false discovery rate is important; for class prediction, predictive accuracy is important.
Clarity of Objectives is Important
Patient selection: many microarray studies developing classifiers are not "therapeutically relevant".
Analysis methods: many microarray studies use cluster analysis inappropriately or misleadingly.
Microarray Platforms for Developing Predictive Classifiers
Single-label arrays: Affymetrix GeneChips.
Dual-label arrays using a common reference design; dye swaps are unnecessary.
Common Reference Design

         Array 1   Array 2   Array 3   Array 4
RED      A1        A2        B1        B2
GREEN    R         R         R         R

Ai = ith specimen from class A; Bi = ith specimen from class B; R = aliquot from reference pool
The reference generally serves to control variation in the size of corresponding spots on different arrays and variation in sample distribution over the slide. The reference provides a relative measure of expression for a given gene in a given sample that is less variable than an absolute measure. The reference is not the object of comparison. The relative measure of expression will be compared among biologically independent samples from different classes.
Myth: For two-color microarrays, each sample of interest should be labeled once with Cy3 and once with Cy5 in dye-swap pairs of arrays.
Dye Bias
Average differences among dyes in label concentration, labeling efficiency, photon emission efficiency, and photon detection are corrected by normalization procedures. Gene-specific dye bias may not be corrected by normalization.
Dye-swap technical replicates of the same two RNA samples are rarely necessary. With a common reference design, dye-swap arrays are not needed for valid comparisons of classes, since specimens labeled with different dyes are never compared. For two-label direct-comparison designs comparing two classes, it is more efficient to balance the dye-class assignments across independent biological specimens than to perform dye-swap technical replicates.
Balanced Block Design

         Array 1   Array 2   Array 3   Array 4
RED      A1        B2        A3        B4
GREEN    B1        A2        B3        A4

Ai = ith specimen from class A; Bi = ith specimen from class B
Detailed comparisons of the effectiveness of designs:
Dobbin K, Simon R. Comparison of microarray designs for class comparison and class discovery. Bioinformatics 18:1462-9, 2002.
Dobbin K, Shih J, Simon R. Statistical design of reverse dye microarrays. Bioinformatics 19:803-10, 2003.
Dobbin K, Shih J, Simon R. Questions and answers on the design of dual-label microarrays for identifying differentially expressed genes. JNCI 95:1362-1369, 2003.
Common reference designs are very effective for many microarray studies: they are robust, permit comparisons among separate experiments, and support many types of comparisons and analyses. For simple two-class comparison problems, balanced block designs require many fewer arrays than common reference designs, but their efficiency decreases for more than two classes, they are more difficult to apply to complicated class comparison problems, and they are not appropriate for class discovery or class prediction. Loop designs are less robust, are dominated by either common reference designs or balanced block designs, and are not suitable for class prediction or class discovery.
What We Will Not Discuss Today Image analysis Normalization Clustering methods Class comparison SAM and other methods of gene finding FDR (false discovery rate) and methods for controlling the number of false positive genes
Simple Procedures
If each gene is tested for significance at level α and there are k genes, then the expected number of false discoveries is kα.
To control E(FD) ≤ u: conduct each of the k tests at level α = u/k.
Bonferroni control of the familywise error (FWE) rate at level 0.05: conduct each of the k tests at level 0.05/k; then we are at least 95% confident that FD = 0.
False Discovery Rate (FDR) FDR = Expected proportion of false discoveries among the tests declared significant Studied by Benjamini and Hochberg (1995):
                        Not rejected   Rejected                  Total
True null hypotheses    890            10 (false discoveries)    900
False null hypotheses   10             90 (true discoveries)     100
Total                                                            1000
Controlling the Expected False Discovery Rate
Compare classes separately by gene and compute significance levels {p_i}.
Rank the genes in order of significance: p(1) ≤ p(2) ≤ ... ≤ p(k).
Find the largest index i for which p(i) ≤ (i/k) × FDR.
Declare the genes with the i smallest p values statistically significant.
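This step-up rule is easy to implement. A minimal sketch in Python, assuming a numpy array `pvals` of per-gene p-values (the function name and the 0.10 default are illustrative, not from the slides):

```python
import numpy as np

def benjamini_hochberg(pvals, fdr=0.10):
    """Boolean mask of genes declared significant at the given FDR."""
    p = np.asarray(pvals, dtype=float)
    k = p.size
    order = np.argsort(p)                         # rank genes by p-value
    thresholds = (np.arange(1, k + 1) / k) * fdr  # (i/k) * FDR for i = 1..k
    below = p[order] <= thresholds
    significant = np.zeros(k, dtype=bool)
    if below.any():
        last = np.max(np.nonzero(below)[0])       # largest qualifying rank
        significant[order[:last + 1]] = True      # genes with the i smallest p
    return significant
```

For example, with k = 1000 genes and FDR = 0.10, the gene ranked 20th by p-value qualifies on its own only if p(20) ≤ (20/1000) × 0.10 = 0.002, though a later qualifying rank also brings it in.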
Problems With Simple Procedures
Bonferroni control of the FWE is very conservative.
p values based on normal theory are not accurate at extreme quantiles, and it is difficult to achieve extreme quantiles for permutation p values of individual genes.
Controlling the expected number or proportion of false discoveries may not provide adequate control, because the distributions of FD and FDP may have large variances.
Multiple comparisons are controlled by adjusting univariate (single-gene) p values, so these procedures may not take advantage of correlation among genes.
Additional Procedures
"SAM" - Significance Analysis of Microarrays (Tusher et al., PNAS, 2001): estimates the FDR; statistical properties unclear.
Empirical Bayes (Efron et al., JASA, 2001): related to FDR; ad hoc aspects.
Multivariate permutation tests (Korn et al., 2001; http://linus.nci.nih.gov/brb): control the number or proportion of false discoveries; the confidence level of control can be specified.
Multivariate Permutation Procedures (Korn et al., 2001)
Allow statements like:
FD procedure: "We are 95% confident that the (actual) number of false discoveries is no greater than 5."
FDP procedure: "We are 95% confident that the (actual) proportion of false discoveries does not exceed 0.10."
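A simplified, single-step sketch of the FD procedure under the complete null (the published method adds step-down refinements; `X` is a hypothetical samples-by-genes numpy array, `y` holds class labels coded 0/1, and scipy is assumed):

```python
import numpy as np
from scipy import stats

def fd_threshold(X, y, u=5, gamma=0.05, n_perm=1000, seed=0):
    """gamma-quantile of the permutation distribution of the (u+1)-th
    smallest p-value; rejecting genes below it controls FD <= u with
    confidence 1 - gamma (single-step, complete-null version)."""
    rng = np.random.default_rng(seed)
    order_stats = []
    for _ in range(n_perm):
        yp = rng.permutation(y)
        _, p = stats.ttest_ind(X[yp == 0], X[yp == 1], axis=0)
        order_stats.append(np.sort(p)[u])   # (u+1)-th smallest p-value
    return float(np.quantile(order_stats, gamma))
```

Genes whose unpermuted p-values fall below the returned threshold form a list for which we are (1 − γ)-confident that the number of false discoveries is at most u.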
Class Prediction Model
Given a sample with an expression profile vector x of log-ratios or log signals and unknown class, predict which class the sample belongs to.
The class prediction model is a function f that maps the set of vectors x to the set of class labels {1, 2} (if there are two classes). f generally uses only some components of x (i.e., only some of the genes).
Specifying the model f involves specifying some parameters (e.g., regression coefficients) by fitting the model to the data (training).
Problems With Many Diagnostic/Prognostic Marker Studies
They are not reproducible: retrospective, non-focused analysis; multiplicity problems; inter-laboratory assay variation.
They have no impact: not therapeutically relevant questions; not therapeutically relevant groups of patients; black-box predictors.
Components of Class Prediction
Feature (gene) selection: which genes will be included in the model.
Selecting the model type: e.g., diagonal linear discriminant analysis, nearest-neighbor, ...
Fitting parameters (regression coefficients) for the model.
Selecting values of tuning parameters.
Do Not Confuse Statistical Methods Appropriate for Class Comparison with Those Appropriate for Class Prediction Demonstrating statistical significance of prognostic factors is not the same as demonstrating predictive accuracy. Demonstrating goodness of fit of a model to the data used to develop it is not a demonstration of predictive accuracy. Statisticians are used to inference, not prediction Most statistical methods were not developed for p>>n prediction problems
Feature Selection
Select genes that are differentially expressed among the classes at significance level α (e.g. 0.01).
The level α is chosen only to control the number of genes in the model.
t-test Comparisons of Gene Expression
For gene j: x_j ~ N(μ_j1, σ_j²) in class 1; x_j ~ N(μ_j2, σ_j²) in class 2.
H0j: μ_j1 = μ_j2
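These per-gene tests vectorize directly. A minimal sketch (scipy assumed; `X1`, `X2` are hypothetical samples-by-genes arrays for the two classes):

```python
import numpy as np
from scipy import stats

def select_genes(X1, X2, alpha=0.01):
    """Indices of genes whose two-sample t-test p-value is below alpha."""
    # one test per gene (column); equal_var=True by default, matching the
    # common-variance model for each gene above
    t, p = stats.ttest_ind(X1, X2, axis=0)
    return np.nonzero(p < alpha)[0]
```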
Estimation of Within-Class Variance
Estimate separately for each gene: limited degrees of freedom; the gene list is dominated by genes with small fold changes and small variances.
Assume all genes have the same variance: a poor assumption.
Random (hierarchical) variance model (Wright GW and Simon R, Bioinformatics 19:2448-2455, 2003): assumes an inverse gamma distribution of residual variances; results in an exact t (or F) distribution of the test statistics with increased degrees of freedom for the error variance; applies to any normal linear model.
Feature Selection
Finding a small subset of genes that together give the most accurate predictions: combinatorial optimization algorithms, e.g., genetic algorithms.
There is little evidence that complex feature selection is useful in microarray problems: such methods often fail to compare against simpler alternatives, and some published complex methods for selecting combinations of features do not appear to have been properly evaluated.
Linear Classifiers for Two Classes
l(x) = Σ_{i ∈ F} w_i x_i, where F is the set of selected features (genes) and the w_i are weights; classification is based on comparing l(x) to a threshold d.
Linear Classifiers for Two Classes
Fisher linear discriminant analysis: requires estimating correlations among all genes selected for the model.
Diagonal linear discriminant analysis (DLDA): assumes the features are uncorrelated.
Naïve Bayes classifier.
The compound covariate predictor (Radmacher) and Golub's method are similar to DLDA in that they can be viewed as weighted voting of univariate classifiers.
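A minimal DLDA sketch for two classes with equal priors (array names illustrative); the diagonal assumption means only per-gene pooled variances are estimated, never gene-gene correlations:

```python
import numpy as np

def dlda_fit(X1, X2):
    """Weights and threshold for diagonal LDA; X1, X2 are samples x genes."""
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled per-gene variance (no covariances, per the diagonal assumption)
    s2 = ((n1 - 1) * X1.var(axis=0, ddof=1) +
          (n2 - 1) * X2.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    w = (m1 - m2) / s2
    threshold = w @ (m1 + m2) / 2       # midpoint between class means
    return w, threshold

def dlda_predict(x, w, threshold):
    """Class 1 if the linear score exceeds the midpoint threshold."""
    return 1 if w @ x > threshold else 2
```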
Linear Classifiers for Two Classes
Compound covariate predictor: uses weights w_i = t_i, the two-sample t statistic for gene i, instead of the DLDA weight w_i = (x̄_i1 − x̄_i2)/s_i².
Linear Classifiers for Two Classes
Support vector machines with an inner-product kernel are linear classifiers whose weights are determined to separate the classes with a hyperplane that minimizes the length of the weight vector.
Support Vector Machine
Minimize ||w||² subject to y_i(w′x_i + b) ≥ 1 for every training case i, where y_i = ±1 codes the class.
Perceptrons
Perceptrons are neural networks with no hidden layer and linear transfer functions between input and output.
The number of input nodes equals the number of genes selected; the number of output nodes equals the number of classes minus 1.
The inputs may be the major principal components of all genes, or of the informative genes.
Perceptrons are linear classifiers.
Naïve Bayes Classifier
Expression profiles for class j are assumed normal with mean vector m_j and diagonal covariance matrix D.
The likelihood of expression profile vector x is l(x; m_j, D).
The posterior probability of class j for a case with expression profile vector x is proportional to π_j l(x; m_j, D), where π_j is the prior probability of class j.
Compound Covariate Bayes Classifier
Compound covariate y = Σ_i t_i x_i, summing over the genes selected as differentially expressed.
x_i = expression level of the ith selected gene for the case whose class is to be predicted; t_i = the t statistic for testing differential expression of the ith gene.
Proceed as for the naïve Bayes classifier, but using the single compound covariate as the predictive variable.
GW Wright et al., PNAS 2005.
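A sketch of the (non-Bayes) compound covariate predictor built from the definitions above, thresholding y midway between its class means under equal priors (the names and the midpoint rule are illustrative; X1, X2 hold the selected genes only):

```python
import numpy as np
from scipy import stats

def ccp_fit(X1, X2):
    """t-statistic weights and midpoint threshold for the compound covariate."""
    t, _ = stats.ttest_ind(X1, X2, axis=0)   # t_i for the selected genes
    y1, y2 = X1 @ t, X2 @ t                  # compound covariate per training case
    return t, (y1.mean() + y2.mean()) / 2

def ccp_predict(x, t, threshold):
    return 1 if t @ x > threshold else 2
```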
When p >> n the Linear Model is Too Complex
It is always possible to find a set of features and a weight vector for which the classification error on the training set is zero. Why consider more complex models?
Myth: Complex classification algorithms such as neural networks perform better than simpler methods for class prediction.
Artificial intelligence sells to journal reviewers and peers who cannot distinguish hype from substance when it comes to microarray data analysis. Comparative studies have shown that simpler methods work as well or better for microarray problems because they avoid overfitting the data.
Other Simple Methods Nearest neighbor classification Nearest k-neighbors Nearest centroid classification Shrunken centroid classification
Nearest Neighbor Classifier
To classify a sample in the validation set as being in outcome class 1 or outcome class 2, determine which sample in the training set its gene expression profile is most similar to.
The similarity measure is based on genes selected as univariately differentially expressed between the classes; correlation similarity or Euclidean distance is generally used.
Classify the sample as being in the same class as its nearest neighbor in the training set.
Other Methods
Neural networks
Top-scoring pairs
CART
Random Forest
Genetic algorithm based classification
Apparent Dimension Reduction Based Methods Principal component regression Supervised principal component regression Partial least squares Stepwise logistic regression
When There Are More Than 2 Classes Nearest neighbor type methods Decision tree of binary classifiers
Decision Tree of Binary Classifiers
Partition the set of classes {1, 2, ..., K} into two disjoint subsets S1 and S2.
Develop a binary classifier for distinguishing the composite classes S1 and S2.
Compute the cross-validated classification error for distinguishing S1 and S2.
Repeat the above steps for all possible partitions to find the partition (S1, S2) that minimizes the cross-validated classification error.
If S1 and S2 are not singleton sets, repeat all of the above steps separately within S1 and within S2 to optimally partition each of them.
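A sketch of the top-level partition search (the recursion on non-singleton subsets is omitted); `cv_error(X, labels01)` is a hypothetical callback returning the cross-validated error of your two-class classifier, with gene selection repeated inside each fold, and X, y are numpy arrays:

```python
import numpy as np
from itertools import combinations

def best_partition(classes, X, y, cv_error):
    """Find the split {S1, S2} of `classes` minimizing cross-validated error."""
    classes = sorted(classes)
    first, rest = classes[0], classes[1:]
    best = None
    # fixing `first` in S1 enumerates each unordered partition exactly once
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            S1 = {first, *combo}
            S2 = set(classes) - S1
            mask = np.isin(y, classes)               # restrict to these classes
            labels01 = np.where(np.isin(y[mask], list(S1)), 0, 1)
            err = cv_error(X[mask], labels01)
            if best is None or err < best[0]:
                best = (err, S1, S2)
    return best  # recurse on S1 and S2 if they are not singletons
```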
Evaluating a Classifier
"Prediction is difficult, especially the future." - Niels Bohr
Fit of a model to the same data used to develop it is no evidence of prediction accuracy for independent data.
Evaluating a Classifier Fit of a model to the same data used to develop it is no evidence of prediction accuracy for independent data Goodness of fit vs prediction accuracy Demonstrating statistical significance of prognostic factors is not the same as demonstrating predictive accuracy Demonstrating stability of identification of gene predictors is not necessary for demonstrating predictive accuracy
Evaluating a Classifier The classification algorithm includes the following parts: Determining what type of classifier to use Gene selection Fitting parameters Optimizing with regard to tuning parameters If a re-sampling method such as cross-validation is to be used to estimate predictive error of a classifier, all aspects of the classification algorithm must be repeated for each training set and the accuracy of the resulting classifier scored on the corresponding validation set
Split-Sample Evaluation
Training-set: used to select features, select the model type, and determine parameters and cut-off thresholds.
Test-set: withheld until a single model is fully specified using the training-set; the fully specified model is then applied to the expression profiles in the test-set to predict class labels, and the number of errors is counted.
Ideally, the test-set data are from different centers than the training data and assayed at a different time.
Leave-one-out Cross Validation Omit sample 1 Develop multivariate classifier from scratch on training set with sample 1 omitted Predict class for sample 1 and record whether prediction is correct
Leave-one-out Cross Validation Repeat analysis for training sets with each single sample omitted one at a time e = number of misclassifications determined by cross-validation Subdivide e for estimation of sensitivity and specificity
Cross-validation is valid only if the test set is not used in any way in the development of the model. Using the complete set of samples to select genes violates this assumption and invalidates cross-validation. With proper cross-validation, the model must be developed from scratch for each leave-one-out training set; this means that feature selection must be repeated for each leave-one-out training set (see the sketch below). The cross-validated estimate of misclassification error is an estimate of the prediction error for the model obtained by applying the specified algorithm to the full dataset. If you use cross-validated estimates of prediction error for a set of algorithms indexed by a tuning parameter and select the algorithm with the smallest estimate, you do not have a valid estimate of the prediction error for the selected model.
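A sketch of LOOCV done correctly, with gene selection repeated from scratch in every fold; it reuses the dlda_fit sketch above, the level α = 0.001 is illustrative, and classes are coded 1 and 2:

```python
import numpy as np
from scipy import stats

def loocv_error(X, y, alpha=0.001):
    """Fully cross-validated LOOCV error; X is samples x genes, y in {1, 2}."""
    n, errors = len(y), 0
    for i in range(n):
        keep = np.arange(n) != i                  # leave case i out
        Xtr, ytr = X[keep], y[keep]
        # feature selection on the training fold ONLY
        _, p = stats.ttest_ind(Xtr[ytr == 1], Xtr[ytr == 2], axis=0)
        genes = np.nonzero(p < alpha)[0]
        if genes.size == 0:
            errors += 1                           # no usable classifier
            continue
        w, thr = dlda_fit(Xtr[ytr == 1][:, genes], Xtr[ytr == 2][:, genes])
        pred = 1 if w @ X[i, genes] > thr else 2
        errors += int(pred != y[i])
    return errors / n
```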
Prediction on Simulated Null Data
Generation of gene expression profiles: 14 specimens (P_i is the expression profile for specimen i); log-ratio measurements on 6000 genes; P_i ~ MVN(0, I_6000).
Can we distinguish between the first 7 specimens (Class 1) and the last 7 (Class 2)?
Prediction method: compound covariate prediction (described above), with the compound covariate built from the log-ratios of the 10 most differentially expressed genes.
Invalid Criticisms of Cross-Validation
"You can always find a set of features that will provide perfect prediction for the training and test sets."
For complex models, there may be many sets of features that provide zero training errors. A modeling strategy that either selects among those sets or aggregates among those models will have a generalization error that is validly estimated by cross-validation.
Simulated Data: 40 cases, 10 genes selected from 5000

Method             Estimate   Std Deviation
True               .078
Resubstitution     .007       .016
LOOCV              .092       .115
10-fold CV         .118       .120
5-fold CV          .161       .127
Split sample 1-1   .345       .185
Split sample 2-1   .205       .184
.632+ bootstrap    .274       .084
DLBCL Data

Method            Bias    Std Deviation   MSE
LOOCV             -.019   .072            .008
10-fold CV        -.007   .063            .006
5-fold CV         .004    .07             .007
Split 1-1         .037    .117            .018
Split 2-1         .001    .119            .017
.632+ bootstrap   -.006   .049
Simulated Data: 40 cases

Method               Estimate   Std Deviation
True                 .078
10-fold              .118       .120
Repeated 10-fold     .116       .109
5-fold               .161       .127
Repeated 5-fold      .159       .114
Split 1-1            .345       .185
Repeated split 1-1   .371       .065
Permutation Distribution of the Cross-validated Misclassification Rate of a Multivariate Classifier
Randomly permute the class labels and repeat the entire cross-validation.
Re-do this for all (or 1000) random permutations of the class labels.
The permutation p value is the fraction of random permutations that gave as few misclassifications as the e obtained for the real data.
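This wraps directly around the loocv_error sketch above: the whole cross-validation, including gene selection, is redone for every relabeling (the 1000-permutation default is illustrative):

```python
import numpy as np

def permutation_pvalue(X, y, n_perm=1000, seed=0):
    """Fraction of label permutations with CV error <= the observed error."""
    rng = np.random.default_rng(seed)
    observed = loocv_error(X, y)
    hits = sum(loocv_error(X, rng.permutation(y)) <= observed
               for _ in range(n_perm))
    return hits / n_perm
```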
Gene-Expression Profiles in Hereditary Breast Cancer (cDNA Microarrays, Parallel Gene Expression Analysis)
Breast tumors studied: 7 BRCA1+ tumors, 8 BRCA2+ tumors, 7 sporadic tumors.
Log-ratio measurements of 3226 genes for each tumor after initial data filtering.
RESEARCH QUESTION: Can we distinguish BRCA1+ from BRCA1– cancers and BRCA2+ from BRCA2– cancers based solely on their gene expression profiles?
Classification of BRCA2 Germline Mutations

Classification Method                    LOOCV Prediction Error
Compound Covariate Predictor             14%
Fisher LDA                               36%
Diagonal LDA                             —
1-Nearest Neighbor                       9%
3-Nearest Neighbor                       23%
Support Vector Machine (linear kernel)   18%
Classification Tree                      45%
Common Problems With Internal Classifier Validation
Pre-selection of genes using the entire dataset.
Failure to treat optimization of tuning parameters as part of the classification algorithm (Varma & Simon, BMC Bioinformatics 2006).
Erroneous use of the predicted class in a regression model.
Incomplete (incorrect) Cross-Validation
Publications are using all the data to select genes and then cross-validating only the parameter-estimation component of model development. This is highly biased.
Many published complex methods make strong claims based on incorrect cross-validation; this is frequently seen in complex feature-set selection algorithms.
Some software encourages inappropriate cross-validation.
Incomplete (incorrect) Cross-Validation
Let M(b, D) denote a classification model developed on a dataset D, where the model is of a particular type parameterized by a scalar b.
Use cross-validation to estimate the classification error of M(b, D) for a grid of values of b: Err(b). Select the value b* that minimizes Err(b).
Caution: Err(b*) is a biased estimate of the prediction error of M(b*, D). This error is made in some commonly used methods.
Complete (correct) Cross-Validation
Construct a learning set D as a subset of the full set S of cases.
Use cross-validation restricted to D to estimate the classification error of M(b, D) for a grid of values of b: Err(b). Select the value b* that minimizes Err(b).
Use the model M(b*, D) to predict for the cases in S but not in D (S − D) and compute the error rate in S − D.
Repeat this full procedure for different learning sets D1, D2, ... and average the error rates of the models M(bi*, Di) over the corresponding validation sets S − Di.
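A sketch of this complete (nested) procedure, with the inner cross-validation confined to each learning set D; `fit`, `error_rate`, and `inner_cv_error` are hypothetical callbacks for your classifier type, and X, y are numpy arrays:

```python
import numpy as np

def nested_cv_error(X, y, b_grid, fit, error_rate, inner_cv_error,
                    n_splits=10, seed=0):
    """Average held-out error of models tuned only on their learning sets."""
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_splits)
    errs = []
    for held_out in folds:
        D = np.setdiff1d(np.arange(n), held_out)     # learning set indices
        # choose b* using cross-validation restricted to D
        b_star = min(b_grid, key=lambda b: inner_cv_error(X[D], y[D], b))
        model = fit(X[D], y[D], b_star)
        # score only on cases the tuning never saw (S - D)
        errs.append(error_rate(model, X[held_out], y[held_out]))
    return float(np.mean(errs))
```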
Does an Expression Profile Classifier Predict More Accurately Than Standard Prognostic Variables? Not an issue of which variables are significant after adjusting for which others or which are independent predictors Predictive accuracy and inference are different The two classifiers can be compared by ROC analysis as functions of the threshold for classification The predictiveness of the expression profile classifier can be evaluated within levels of the classifier based on standard prognostic variables
Does an Expression Profile Classifier Predict More Accurately Than Standard Prognostic Variables?
Some publications fit a logistic model to standard covariates plus the cross-validated predictions of an expression profile classifier. This is valid only with split-sample analysis, because the cross-validated predictions are not independent.
External Validation Should address clinical utility, not just predictive accuracy Therapeutic relevance Should incorporate all sources of variability likely to be seen in broad clinical application Expression profile assay distributed over time and space Real world tissue handling Patients selected from different centers than those used for developing the classifier
Survival Risk Group Prediction
Evaluate individual genes by fitting single-variable proportional hazards (PH) regression models to the log signal or log ratio for each gene.
Select genes based on a p-value threshold for the single-gene PH regressions.
Compute the first k principal components of the selected genes.
Fit a PH regression model with the k PCs as predictors; let b1, ..., bk denote the estimated regression coefficients.
To predict for a case with expression profile vector x, compute the k supervised PCs y1, ..., yk and the predictive index b1·y1 + ... + bk·yk.
Survival Risk Group Prediction
LOOCV loop: create the training set by omitting the ith case; develop the supervised PC PH model on the training set; compute the cross-validated predictive index for the ith case using the PH model developed on the training set; compute the risk percentile of the ith case's predictive index among the predictive indices for the cases in the training set.
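One iteration of this loop might look like the sketch below, a minimal sketch that assumes the lifelines package for the PH fits (lifelines, its CoxPHFitter API, and all names, the α = 0.001 cutoff, and k = 2 components are assumptions; any PH fitter would do):

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

def supervised_pc_index(Xtr, t_tr, e_tr, x_new, alpha=0.001, k=2):
    """Cross-validated predictive index for one left-out case x_new."""
    # 1. single-gene PH regressions on the training set; keep p < alpha
    pvals = np.empty(Xtr.shape[1])
    for j in range(Xtr.shape[1]):           # slow but simple: one fit per gene
        df = pd.DataFrame({"x": Xtr[:, j], "time": t_tr, "event": e_tr})
        cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
        pvals[j] = cph.summary.loc["x", "p"]
    genes = np.nonzero(pvals < alpha)[0]
    # 2. first k principal components of the selected genes (training data only)
    mu = Xtr[:, genes].mean(axis=0)
    _, _, Vt = np.linalg.svd(Xtr[:, genes] - mu, full_matrices=False)
    pcs = (Xtr[:, genes] - mu) @ Vt[:k].T
    # 3. PH model with the k supervised PCs as predictors
    df = pd.DataFrame(pcs, columns=[f"pc{i+1}" for i in range(k)])
    df["time"], df["event"] = t_tr, e_tr
    b = CoxPHFitter().fit(df, duration_col="time", event_col="event").params_
    # 4. predictive index b1*y1 + ... + bk*yk for the held-out case
    y_new = (x_new[genes] - mu) @ Vt[:k].T
    return float(y_new @ b.values)
```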
Survival Risk Group Prediction
Plot Kaplan-Meier survival curves for cases with cross-validated risk percentiles above 50% and for cases with cross-validated risk percentiles below 50% (or for however many risk groups and thresholds are desired).
Compute the log-rank statistic comparing the cross-validated Kaplan-Meier curves.
Survival Risk Group Prediction
Repeat the entire procedure for all (or a large number of) permutations of the survival times and censoring indicators to generate the null distribution of the log-rank statistic. The usual chi-square null distribution is not valid, because the cross-validated risk percentiles are correlated among cases.
Evaluate the statistical significance of the association between survival and expression profiles by referring the log-rank statistic for the unpermuted data to the permutation null distribution.
Survival Risk Group Prediction Other approaches to survival risk group prediction have been published The supervised pc method is implemented in BRB-ArrayTools BRB-ArrayTools also provides for comparing the risk group classifier based on expression profiles to one based on standard covariates and one based on a combination of both types of variables
Sample Size Planning References K Dobbin, R Simon. Sample size determination in microarray experiments for class comparison and prognostic classification. Biostatistics 6:27-38, 2005 K Dobbin, R Simon. Sample size planning for developing classifiers using high dimensional DNA microarray data. Biostatistics (In Press)
Sample Size Planning
GOAL: identify genes differentially expressed in a comparison of two pre-defined classes of specimens, on dual-label arrays using a reference design or on single-label arrays.
Compare classes separately by gene, with adjustment for multiple comparisons.
Approximate expression levels (log ratio or log signal) as normally distributed.
Determine the number of samples, n/2 per class, to give power 1 − β for detecting a mean difference δ at significance level α.
Comparing 2 equal-size classes:
n = 4σ²(z_{α/2} + z_β)² / δ²
where δ = mean log-ratio difference between classes, σ = within-class standard deviation among biological replicates, and z_{α/2}, z_β are standard normal percentiles.
Choose α small, e.g. α = 0.001.
Use percentiles of the t distribution for improved accuracy with small n.
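In code (scipy assumed; this is the normal-percentile version of the formula above, reported per class — the t-percentile refinement the slide recommends would give somewhat larger counts for very small n):

```python
import math
from scipy.stats import norm

def samples_per_class(delta, sigma, alpha=0.001, beta=0.05):
    """n/2 from n = 4*sigma^2*(z_{alpha/2} + z_beta)^2 / delta^2."""
    z = norm.ppf
    n_total = 4 * sigma**2 * (z(1 - alpha / 2) + z(1 - beta))**2 / delta**2
    return math.ceil(n_total / 2)

# delta = 1 (2-fold), sigma = 0.5 (human tissue) -> 13 per class
```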
Total Number of Samples for Two-Class Comparison (α = 0.001, β = 0.05)

δ            σ                        Samples per class
1 (2-fold)   0.5 (human tissue)       13
1 (2-fold)   0.25 (transgenic mice)   6 (t approximation)
Dual Label Arrays With Reference Design Pools of k Biological Samples
m = number of technical reps per sample; k = number of samples per pool; n = total number of arrays
δ = mean difference between classes in log signal
τ² = biological variance within class; σ² = technical variance
α = significance level (e.g. 0.001); 1 − β = power
z = normal percentiles (use t percentiles for better accuracy)
α = 0.001, β = 0.05, δ = 1, τ² + 2σ² = 0.25, τ²/σ² = 4

Samples pooled per array (k)   Arrays required   Samples required
1                              25                25
2                              17                34
3                              14                42
4                              13                52
α = 0.001, β = 0.05, δ = 1, τ² + 2σ² = 0.25, τ²/σ² = 4

Technical reps per sample (m)   Arrays required (n)   Samples required
1                               25                    25
2                               42                    21
3                               60                    20
4                               76                    19
Number of Events Needed to Detect Gene-Specific Effects on Survival
σ = standard deviation of the log₂ ratios for each gene
δ = hazard ratio (>1) corresponding to a 2-fold change in gene expression
α = 1/N, for 1 expected false-positive gene per N genes examined
β = 0.05, for a 5% false-negative rate
Number of Events Required to Detect Gene-Specific Effects on Survival (α = 0.001, β = 0.05, σ = 0.5)

Hazard ratio   Events required
2              26
1.5            76
Sample Size Planning for Classifier Development
The expected value (over training sets) of the probability of correct classification, PCC(n), should be within γ of the maximum achievable, PCC(∞).
Probability Model
Two classes; log expression or log ratio MVN in each class with a common covariance matrix.
m differentially expressed genes; p − m noise genes; expression of the differentially expressed genes is independent of expression of the noise genes.
All differentially expressed genes have the same inter-class mean difference 2δ.
Common variance σ² for differentially expressed genes and for noise genes.
Classifier
Feature selection based on univariate t-tests for differential expression at significance level α.
Simple linear classifier with equal weights (except for sign) for all selected genes.
The power for selecting each informative gene (differentially expressed with mean difference 2δ) is 1 − β(n).
For 2 classes of equal prevalence, let λ1 denote the largest eigenvalue of the covariance matrix of the informative genes.
Figure: sample size as a function of effect size (log-base-2 fold change between classes divided by the standard deviation), shown for two different tolerances γ. Each class is equally represented in the population; 22,000 genes on an array.
Optimal significance level cutoffs for gene selection (50 differentially expressed genes out of 22,000 genes on the microarrays)

2δ/σ   n=10    n=30      n=50
1      0.167   0.003     0.00068
1.25   0.085   0.0011    0.00035
1.5    0.045   0.00063   0.00016
1.75   0.026   0.00036   0.00006
2      0.015   0.0002    0.00002
γ = 0.05, p = 22,000 genes, gene standard deviation σ = 0.75

Distance 2δ   Fold change   Sample size n   PCC(n), m=1   PCCworst(n), m=1, pmin=0.25   m=5    m=5, pmin=0.25   Adjusted sample size
0.75          1.68          106             0.64          0.34                          0.82   0.70             212
1.125         2.18          58              0.72          0.45                          0.93   0.87             116
1.5           2.83          38              0.80          0.52                          0.98   0.95             76
1.875         3.67          28              0.85          0.63                          0.99   0.96             56
a) Power example with 60 samples, 2δ/σ = 1/0.71 effect size for differentially expressed genes, α = 0.001 cutoff for gene selection. As the proportion in the under-represented class gets smaller, the power to identify differentially expressed genes decreases.
b) PCC(60) as a function of the proportion in the under-represented class. Parameter settings are the same as in a), with 10 differentially expressed genes among 22,000 total genes. If the proportion in the under-represented class is small (e.g., <20%), PCC(60) can decline substantially.
BRB-ArrayTools
Integrated software package with an Excel-based user interface, but state-of-the-art analysis methods programmed in R, Java & Fortran.
Publicly available for non-commercial use: http://linus.nci.nih.gov/brb
Selected Features of BRB-ArrayTools
Multivariate permutation tests for class comparison, controlling the number and proportion of false discoveries with a specified confidence level; permits blocking by another variable, pairing of data, and averaging of technical replicates.
SAM (Fortran implementation, 7X faster than the R versions).
Extensive annotation for identified genes: internal annotation from NetAffx, SOURCE, Gene Ontology, and pathway information; links to annotations in genomic databases.
Find genes correlated with a quantitative factor while controlling the number or proportion of false discoveries.
Find genes correlated with censored survival while controlling the number or proportion of false discoveries.
Analysis of variance: time course analysis; log intensities for non-reference designs; mixed models.
Selected Features of BRB-ArrayTools
Gene set analysis: find Gene Ontology groups and signaling pathways that are differentially expressed among classes.
Class prediction: DLDA, CCP, Nearest Neighbor, Nearest Centroid, Shrunken Centroids, SVM, Random Forests.
Complete LOOCV, k-fold CV, repeated k-fold, .632+ bootstrap; permutation significance of the cross-validated error rate.
Selected Features of BRB-ArrayTools
Clustering tools for class discovery, with reproducibility statistics on clusters; internal access to Eisen's Cluster and TreeView.
Visualization tools, including a rotating 3D principal components plot exportable to PowerPoint with rotation controls.
Extensible via the R plug-in feature.
Tutorials and datasets.
Acknowledgements Kevin Dobbin Michael Radmacher Sudhir Varma Yingdong Zhao BRB-ArrayTools Development Team Amy Lam Ming-Chung Li Supriya Menezes