Gene Expression Profiling Illustrated Using BRB-ArrayTools.

Good Microarray Studies Have Clear Objectives
Class Comparison (gene finding)
–Find genes whose expression differs among predetermined classes
Class Prediction
–Prediction of a predetermined class using information from the gene expression profile (e.g. response vs no response)
Class Discovery
–Discover clusters of specimens having similar expression profiles
–Discover clusters of genes having similar expression profiles

Class Comparison and Class Prediction
–Not clustering problems
–Supervised methods

Levels of Replication
Technical replicates
–RNA sample divided into multiple aliquots
Biological replicates
–Multiple subjects
–Multiple animals
–Replication of the tissue culture experiment

Comparing classes or developing classifiers requires independent biological replicates. The power of statistical methods for microarray data depends on the number of biological replicates.

Microarray Platforms
Single-label arrays
–Affymetrix GeneChips
Dual-label arrays
–Common reference design
–Other designs

Common Reference Design

         Array 1   Array 2   Array 3   Array 4
RED      A1        A2        B1        B2
GREEN    R         R         R         R

A_i = ith specimen from class A
R = aliquot from reference pool
B_i = ith specimen from class B

The reference generally serves to control variation in the size of corresponding spots on different arrays and variation in sample distribution over the slide. The reference provides a relative measure of expression for a given gene in a given sample that is less variable than an absolute measure. The reference is not the object of comparison.

The relative measure of expression (e.g. log-ratio) is compared among biologically independent samples from different classes for class comparison, and serves as the predictor variable for class prediction. Dye-swap technical replicates of the same two RNA samples are not necessary with the common reference design.

For two-label direct comparison designs comparing two classes, dye bias is a concern. It is more efficient to balance the dye-class assignments across independent biological specimens than to perform dye-swap technical replicates.

Balanced Block Design

         Array 1   Array 2   Array 3   Array 4
RED      A1        B2        A3        B4
GREEN    B1        A2        B3        A4

A_i = ith specimen from class A
B_i = ith specimen from class B

Class Comparison (Gene Finding)

t-test Comparisons of Gene Expression for gene j
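The slide's formula did not survive transcription; as a minimal sketch, the per-gene two-sample t-statistic with pooled within-class variance can be computed as below (the toy log-ratios are hypothetical, not from the study data):

```python
import math

def t_statistic(x, y):
    """Two-sample t-statistic for one gene, using the pooled
    within-class variance estimate, as in gene-by-gene class comparison."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    # pooled within-class sum of squares and variance
    ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    s2 = ss / (nx + ny - 2)
    return (mx - my) / math.sqrt(s2 * (1 / nx + 1 / ny))

# toy log-ratios for one gene in two classes
class_a = [1.2, 0.9, 1.1, 1.4]
class_b = [0.2, 0.4, 0.1, 0.3]
t = t_statistic(class_a, class_b)
```

In practice this statistic is computed for every gene j and referred to a t distribution with n_A + n_B − 2 degrees of freedom.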

Estimation of Within-Class Variance
Estimate separately for each gene
–Limited degrees of freedom (precision) unless the number of samples is large
–Gene list dominated by genes with small fold changes and small variances
Assume all genes have the same variance
–Poor assumption
Random (hierarchical) variance model
–Wright G.W. and Simon R., Bioinformatics 19, 2003
–Variances are independent samples from a common distribution; an inverse gamma distribution is used
–Results in an exact F (or t) distribution of the test statistics, with increased degrees of freedom for the error variance
–Applies to any normal linear model

Randomized Variance t-test
The inverse variance (precision) is assumed to follow a gamma distribution:
Pr(\sigma^{-2} = x) = x^{a-1} \exp(-x/b) / (\Gamma(a) b^a)

Controlling for Multiple Comparisons
Bonferroni-type procedures control the probability of making any false-positive errors, but are overly conservative in the context of DNA microarray studies.

Simple Procedures
Control the expected number of false discoveries at u
–Conduct each of k tests at level u/k
–e.g. to limit the expected number of false discoveries to 10 in 10,000 comparisons, conduct each test at the p < 0.001 level
Control the FDR
–Benjamini-Hochberg procedure

False Discovery Rate (FDR) FDR = Expected proportion of false discoveries among the genes declared differentially expressed

If you analyze n probe sets and select as "significant" the k genes whose p ≤ p*, then
FDR = (number of false positives) / (number of positives) ~ n p* / k

Multivariate Permutation Procedures (Korn et al., Stat Med 26:4428, 2007)
Allow statements like:
–FD procedure: We are 95% confident that the (actual) number of false discoveries is no greater than 5.
–FDP procedure: We are 95% confident that the (actual) proportion of false discoveries does not exceed 0.10.

Multivariate Permutation Tests
–Distribution-free, even if they use t statistics
–Preserve/exploit the correlation among tests by permuting each profile as a unit
–More effective than univariate permutation tests, especially with a limited number of samples

Additional Procedures “SAM” - Significance Analysis of Microarrays

Class Prediction

Components of Class Prediction
Feature (gene) selection
–Which genes will be included in the model
Select model type
–E.g. diagonal linear discriminant analysis, nearest-neighbor, …
Fitting parameters (regression coefficients) for the model
–Selecting values of tuning parameters

Feature Selection
Genes that are differentially expressed among the classes at a significance level α (e.g. 0.01)
–The α level is selected only to control the number of genes in the model
–For class comparison, the false discovery rate is important
–For class prediction, predictive accuracy is important

Complex Gene Selection Small subset of genes which together give most accurate predictions –Genetic algorithms Little evidence that complex feature selection is useful in microarray problems

Linear Classifiers for Two Classes

Fisher linear discriminant analysis –Requires estimating correlations among all genes selected for model Diagonal linear discriminant analysis (DLDA) assumes features are uncorrelated Compound covariate predictor (Radmacher) and Golub’s method are similar to DLDA in that they can be viewed as weighted voting of univariate classifiers

Linear Classifiers for Two Classes
Compound covariate predictor
–Uses the per-gene t-statistic as the weight for gene j, instead of the DLDA weight (difference in class means divided by the within-class variance)
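A minimal sketch of the weighted-voting idea behind the compound covariate predictor: each selected gene votes with a weight equal to its training-set t-statistic, and the summed score is compared to a threshold. The weights and threshold below are hypothetical stand-ins for values that would be estimated from a training set:

```python
def compound_covariate_score(profile, weights):
    """Weighted sum of expression values; the weight for gene j is its
    two-sample t-statistic computed on the training set."""
    return sum(w * x for w, x in zip(weights, profile))

def classify(profile, weights, threshold):
    """Assign class A when the score exceeds the threshold, typically the
    midpoint between the mean scores of the two training classes."""
    return "A" if compound_covariate_score(profile, weights) > threshold else "B"

# hypothetical t-statistic weights for two selected genes,
# and a hypothetical midpoint threshold
weights = [2.0, -1.5]
threshold = 0.25
```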

Linear Classifiers for Two Classes Support vector machines with inner product kernel are linear classifiers with weights determined to separate the classes with a hyperplane that minimizes the length of the weight vector

Support Vector Machine

Other Linear Methods
–Perceptrons
–Principal component regression
–Supervised principal component regression
–Partial least squares
–Stepwise logistic regression

Other Simple Methods
–Nearest neighbor classification
–Nearest k-neighbors
–Nearest centroid classification
–Shrunken centroid classification

Nearest Neighbor Classifier
To classify a sample in the validation set, determine its nearest neighbor in the training set, i.e. the training sample whose gene expression profile is most similar to it
–The similarity measure is based on genes selected as being univariately differentially expressed between the classes
–Correlation similarity or Euclidean distance is generally used
Classify the sample as being in the same class as its nearest neighbor in the training set

Nearest Centroid Classifier
For a training set of data, select the genes that are informative for distinguishing the classes
Compute the average expression profile (centroid) of the informative genes in each class
Classify a sample in the validation set based on which training-set centroid its gene expression profile is most similar to
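The three steps above can be sketched directly; the two-gene toy profiles are hypothetical, and Euclidean distance is used as the similarity measure:

```python
def centroid(profiles):
    """Average expression profile of the training samples in one class."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def nearest_centroid(sample, centroids):
    """Return the label of the class centroid closest in Euclidean distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(sample, centroids[label]))

# centroids computed from toy training profiles of informative genes
centroids = {"A": centroid([[1.0, 1.0], [1.2, 0.8]]),
             "B": centroid([[0.0, 0.0], [0.2, -0.2]])}
```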

When p>>n The Linear Model is Too Complex It is always possible to find a set of features and a weight vector for which the classification error on the training set is zero. Why consider more complex models?

Other Methods
Top-scoring pairs
–Claimed to give accurate prediction with few pairs because pairs of genes are selected to work well together
Random Forest
–Very popular in the machine learning community
–Complex classifier

Comparative studies indicate that linear methods and nearest neighbor type methods often work as well or better than more complex methods for microarray problems because they avoid over-fitting the data.

When There Are More Than 2 Classes Nearest neighbor type methods Decision tree of binary classifiers

Decision Tree of Binary Classifiers
–Partition the set of classes {1,2,…,K} into two disjoint subsets S1 and S2
–Develop a binary classifier for distinguishing the composite classes S1 and S2
–Compute the cross-validated classification error for distinguishing S1 and S2
–Repeat the above steps for all possible partitions in order to find the partition S1 and S2 for which the cross-validated classification error is minimized
–If S1 and S2 are not singleton sets, then repeat all of the above steps separately for the classes in S1 and S2 to optimally partition each of them
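The partition-search step above can be sketched as follows. Enumerating the splits of {1,…,K} into two non-empty disjoint subsets is straightforward; the cross-validated error function is a caller-supplied stand-in here, since a full classifier and CV loop would depend on the data:

```python
from itertools import combinations

def partitions_into_two(classes):
    """All ways to split a set of class labels into two non-empty disjoint
    subsets S1 and S2 (unordered, so each split is generated once: one
    fixed 'anchor' class always stays in S1 to avoid mirror duplicates)."""
    classes = sorted(classes)
    anchor, rest = classes[0], classes[1:]
    for r in range(len(rest) + 1):
        for extra in combinations(rest, r):
            s1 = {anchor, *extra}
            s2 = set(classes) - s1
            if s2:
                yield s1, s2

def best_partition(classes, cv_error):
    """Pick the partition minimizing a caller-supplied CV-error function
    cv_error(S1, S2)."""
    return min(partitions_into_two(classes), key=lambda s: cv_error(*s))
```

For K classes there are 2^(K-1) − 1 distinct partitions, so exhaustive search is feasible only for small K.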

Evaluating a Classifier Fit of a model to the same data used to develop it is no evidence of prediction accuracy for independent data –Goodness of fit vs prediction accuracy

Class Prediction A set of genes is not a classifier Testing whether analysis of independent data results in selection of the same set of genes is not an appropriate test of predictive accuracy of a classifier

Hazard ratios and statistical significance levels are not appropriate measures of prediction accuracy
A hazard ratio is a measure of association
–Large values of HR may correspond to small improvements in prediction accuracy
Kaplan-Meier curves for predicted risk groups within strata defined by standard prognostic variables provide more information about improvement in prediction accuracy
Time-dependent ROC curves within strata defined by standard prognostic factors can also be useful

Time-Dependent ROC Curve
Sensitivity vs 1-Specificity
Sensitivity = prob{M=1 | S ≤ T}
Specificity = prob{M=0 | S > T}
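Using the conventional definitions (marker-positive among subjects with survival S ≤ T, marker-negative among those with S > T), a minimal sketch with binary marker values and no censoring (real time-dependent ROC estimation must handle censoring, which is omitted here) is:

```python
def time_dependent_sens_spec(markers, times, horizon):
    """Sensitivity: fraction marker-positive among subjects with S <= T.
    Specificity: fraction marker-negative among subjects with S > T.
    markers are 0/1 classifier outputs; censoring is ignored in this sketch."""
    cases = [m for m, s in zip(markers, times) if s <= horizon]
    controls = [m for m, s in zip(markers, times) if s > horizon]
    sens = sum(cases) / len(cases)
    spec = sum(1 - m for m in controls) / len(controls)
    return sens, spec

# toy marker calls and survival times, evaluated at horizon T = 3
sens, spec = time_dependent_sens_spec([1, 0, 1, 0], [1, 2, 5, 6], 3)
```

Sweeping the horizon T (or a continuous marker's threshold) traces out the time-dependent ROC curve.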

Validation of a Predictor
Internal validation
–Re-substitution estimate (very biased)
–Split-sample validation
–Cross-validation
Independent data validation

Split-Sample Evaluation
Split your data into a training set and a test set
–Randomly (e.g. 2:1)
–By center
Training set
–Used to select features, select the model type, and determine parameters and cut-off thresholds
Test set
–Withheld until a single model is fully specified using the training set
–The fully specified model is applied to the expression profiles in the test set to predict class labels
–The number of errors is counted

Leave-one-out Cross Validation Leave-one-out cross-validation simulates the process of separately developing a model on one set of data and predicting for a test set of data not used in developing the model

Leave-one-out Cross Validation Omit sample 1 –Develop multivariate classifier from scratch on training set with sample 1 omitted –Predict class for sample 1 and record whether prediction is correct

Leave-one-out Cross Validation Repeat analysis for training sets with each single sample omitted one at a time e = number of misclassifications determined by cross-validation Subdivide e for estimation of sensitivity and specificity

Cross-validation is only valid if the test set is not used in any way in the development of the model. Using the complete set of samples to select genes violates this assumption and invalidates cross-validation. With proper cross-validation, the model must be developed from scratch for each leave-one-out training set; this means that feature selection must be repeated for each leave-one-out training set. The cross-validated estimate of misclassification error is an estimate of the prediction error for the model obtained by applying the specified algorithm to the full dataset.
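A minimal sketch of LOOCV done properly, with gene selection repeated inside every loop. The classifier (nearest centroid on the top-k genes, ranked by absolute mean difference) and the toy data are illustrative stand-ins, not the deck's exact algorithm:

```python
def select_genes(X, y, k):
    """Rank genes by absolute mean difference between classes,
    using training data only, and return the top k indices."""
    idx0 = [i for i, lab in enumerate(y) if lab == 0]
    idx1 = [i for i, lab in enumerate(y) if lab == 1]
    def score(g):
        m0 = sum(X[i][g] for i in idx0) / len(idx0)
        m1 = sum(X[i][g] for i in idx1) / len(idx1)
        return abs(m0 - m1)
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def loocv_error(X, y, k):
    """Leave-one-out CV in which feature selection is redone from scratch
    for every training set, then a nearest-centroid rule is applied."""
    errors = 0
    for left_out in range(len(X)):
        tr = [i for i in range(len(X)) if i != left_out]
        Xtr = [X[i] for i in tr]
        ytr = [y[i] for i in tr]
        genes = select_genes(Xtr, ytr, k)   # selection inside the loop
        def cmean(lab):
            rows = [X[i] for i in tr if y[i] == lab]
            return [sum(r[g] for r in rows) / len(rows) for g in genes]
        c0, c1 = cmean(0), cmean(1)
        x = [X[left_out][g] for g in genes]
        d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
        d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
        pred = 0 if d0 < d1 else 1
        errors += pred != y[left_out]
    return errors / len(X)

# toy data: gene 0 separates the classes, gene 1 is noise
X = [[1.0, 0.1], [1.1, -0.1], [0.9, 0.0],
     [0.0, 0.05], [0.1, -0.05], [-0.1, 0.0]]
y = [0, 0, 0, 1, 1, 1]
err = loocv_error(X, y, 1)
```

Moving the select_genes call outside the loop would reproduce the invalid "pre-selection on the full dataset" error described above.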

Prediction on Simulated Null Data
Generation of gene expression profiles:
–14 specimens (P_i is the expression profile for specimen i)
–Log-ratio measurements on 6000 genes
–P_i ~ MVN(0, I_6000)
–Can we distinguish between the first 7 specimens (Class 1) and the last 7 (Class 2)?
Prediction method:
–Compound covariate prediction (discussed earlier)
–Compound covariate built from the log-ratios of the 10 most differentially expressed genes

Permutation Distribution of Cross-Validated Misclassification Rate of a Multivariate Classifier
–Randomly permute the class labels and repeat the entire cross-validation
–Re-do for all (or 1000) random permutations of the class labels
–The permutation p value is the fraction of random permutations that gave as few misclassifications as e in the real data
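The permutation p value above can be sketched as follows. The cv_error_fn argument is a stand-in for re-running the entire cross-validation on permuted labels (e.g. a function like the LOOCV procedure discussed earlier, with the permuted labels substituted); the constant function used in the test is purely illustrative:

```python
import random

def permutation_p_value(labels, cv_error_fn, observed_error,
                        n_perm=1000, seed=0):
    """Fraction of random label permutations whose cross-validated
    misclassification rate is as small as the observed rate e."""
    rng = random.Random(seed)
    count = 0
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)                  # permute class labels
        if cv_error_fn(perm) <= observed_error:
            count += 1
    return count / n_perm
```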

Gene-Expression Profiles in Hereditary Breast Cancer
Breast tumors studied: 7 BRCA1+ tumors, 8 BRCA2+ tumors, 7 sporadic tumors
Log-ratio measurements of 3226 genes for each tumor after initial data filtering (cDNA microarrays, parallel gene expression analysis)
RESEARCH QUESTION: Can we distinguish BRCA1+ from BRCA1– cancers and BRCA2+ from BRCA2– cancers based solely on their gene expression profiles?

BRCA1

BRCA2

Classification of BRCA2 Germline Mutations

Classification Method                    LOOCV Prediction Error
Compound Covariate Predictor             14%
Fisher LDA                               36%
Diagonal LDA                             14%
1-Nearest Neighbor                        9%
3-Nearest Neighbor                       23%
Support Vector Machine (linear kernel)   18%
Classification Tree                      45%

Simulated Data: 40 cases, 10 genes selected from 5000
[Table comparing error-estimation methods (true error .078) against resubstitution, LOOCV, 10-fold CV, 5-fold CV, split-sample, and bootstrap estimates with their standard deviations; the numeric columns did not survive transcription.]

Simulated Data: 40 cases
[Table comparing the true error against 10-fold, repeated 10-fold, 5-fold, repeated 5-fold, split-sample, and repeated split-sample estimates with their standard deviations; the numeric columns did not survive transcription.]

Common Problems With Internal Classifier Validation
–Pre-selection of genes using the entire dataset
–Failure to treat optimization of tuning parameters as part of the classification algorithm (Varma & Simon, BMC Bioinformatics 2006)
–Erroneous use of the predicted class in a regression model

If you use cross-validation estimates of prediction error for a set of algorithms indexed by a tuning parameter and select the algorithm with the smallest cv error estimate, you do not have an unbiased estimate of the prediction error for the selected model

Does an Expression Profile Classifier Predict More Accurately Than Standard Prognostic Variables?
Not an issue of which variables are significant after adjusting for which others, or which are independent predictors
–Predictive accuracy and inference are different
The two classifiers can be compared by ROC analysis as functions of the threshold for classification
The predictiveness of the expression profile classifier can be evaluated within levels of the classifier based on standard prognostic variables

Does an Expression Profile Classifier Predict More Accurately Than Standard Prognostic Variables? Some publications fit logistic model to standard covariates and the cross-validated predictions of expression profile classifiers This is valid only with split-sample analysis because the cross-validated predictions are not independent

Survival Risk Group Prediction
–Evaluate individual genes by fitting single-variable proportional hazards regression models to the log signal or log ratio for each gene
–Select genes based on a p-value threshold for the single-gene PH regressions
–Compute the first k principal components of the selected genes
–Fit a PH regression model with the k PCs as predictors; let b_1, …, b_k denote the estimated regression coefficients
–To predict for a case with expression profile vector x, compute the k supervised PCs y_1, …, y_k and the predictive index b_1 y_1 + … + b_k y_k
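A minimal sketch of the last two steps with k = 1, assuming gene selection is already done: the first principal component is found by power iteration (a stand-in for a full eigendecomposition), and the coefficient b would come from the PH regression, which is not implemented here:

```python
def first_principal_component(X, n_iter=200):
    """First PC direction of the column-centered data matrix X
    (samples in rows, genes in columns), via power iteration on X^T X."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    v = [1.0] * p
    for _ in range(n_iter):
        # w = X^T (X v), then renormalize
        s = [sum(Xc[i][j] * v[j] for j in range(p)) for i in range(n)]
        w = [sum(Xc[i][j] * s[i] for i in range(n)) for j in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return means, v

def predictive_index(x, means, v, b):
    """Supervised-PC predictive index with k = 1: b * y1, where y1 is the
    projection of the centered profile x on the first PC direction."""
    y1 = sum((xi - m) * vi for xi, m, vi in zip(x, means, v))
    return b * y1

# toy expression matrix: gene 0 carries most of the variance
X = [[1, 0], [-1, 0], [2, 1], [-2, -1]]
means, v = first_principal_component(X)
```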

Survival Risk Group Prediction
LOOCV loop:
–Create the training set by omitting the ith case
–Develop the supervised PC PH model for the training set
–Compute the cross-validated predictive index for the ith case using the PH model developed on the training set
–Compute the risk percentile of the predictive index for the ith case among the predictive indices for cases in the training set

Survival Risk Group Prediction
–Plot Kaplan-Meier survival curves for cases with cross-validated risk percentiles above 50% and for cases with cross-validated risk percentiles below 50% (or for however many risk groups and thresholds are desired)
–Compute the log-rank statistic comparing the cross-validated Kaplan-Meier curves

Survival Risk Group Prediction
–Repeat the entire procedure for all (or a large number of) permutations of the survival times and censoring indicators to generate the null distribution of the log-rank statistic
–The usual chi-square null distribution is not valid because the cross-validated risk percentiles are correlated among cases
–Evaluate the statistical significance of the association of survival and expression profiles by referring the log-rank statistic for the unpermuted data to the permutation null distribution

Survival Risk Group Prediction Other approaches to survival risk group prediction have been published The supervised pc method is implemented in BRB-ArrayTools BRB-ArrayTools also provides for comparing the risk group classifier based on expression profiles to one based on standard covariates and one based on a combination of both types of variables