Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics.

Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics National Cancer Institute

All cells of a multi-cellular organism contain essentially the same DNA. Cells differ in function based on which genes are expressed and the level of expression. Proteins do the work of cells, and gene expression determines the intra-cellular concentration of proteins. mRNA is an intermediate product of gene expression: a gene is transcribed into an mRNA molecule, which is then translated into a protein molecule.

Types of DNA Microarrays mRNA transcript quantification Genomic DNA sequence determination –SNP identification –Genotyping Detecting gene deletions or gene duplications

Types of Microarrays DNA microarrays Tissue microarrays Protein microarrays

[Affymetrix] Hybridization Array

Biology in Transition Biotechnology –Restriction enzymes –Ligases –Polymerases –PCR Instruments, Tools, Reagents and Information Resources of Major Impact –DNA sequencing –Functional whole genomic assays

How to Deal With the Plethora of Data Development of software tools Training of biologists to use tools Collaboration with mathematical & computational scientists Training of mathematical & computational scientists

Bioinformatics An ambiguous term that helps further confuse people who are sometimes already confused Refers to a range of activities all of which involve multi-disciplinary collaboration among biological, mathematical, computational scientists and software engineers Organizations searching for structures that will support quality inter-disciplinary research in bioinformatics

Organizing for Bioinformatics Collaborative, not service oriented Enable extensive interaction and education Enable scientists to be stimulated by important problems and to accomplish organizational and personal goals in solving them

Molecular Statistics & Bioinformatics Section Utilize mathematical and computational sciences in conjunction with data from genomics & high-throughput technologies to elucidate the biological basis of cancer –translating this to effective means of eradicating cancer Train statisticians, mathematicians, physical and biological scientists in cancer computational biology

Microarray Research Collaborative data analysis Methodology development Software development

Microarray Myths That the greatest challenge is managing the mass of micro-array data That pattern-recognition or data mining are the most appropriate paradigm for the analysis of micro-array data That pre-packaged analysis tools are a substitute for collaboration with statistical scientists in complex problems That statistical collaboration can be a service function That statisticians can be effective collaborators without substantial knowledge of biology and microarray technology

Applications of DNA Microarrays to Cancer Research Identify genes and pathways involved in oncogenesis –Transgenic mouse models –Profiling pre-cancerous lesions Identifying molecular targets for –therapeutics –early detection

Applications of DNA Microarrays to Cancer Research Diagnostic classification –For identifying disease subsets with distinctive pathogenesis –For selecting therapy Large cell lymphoma Stage I breast cancer

DNA Microarray Analytics Design issues –Arrays –Specimens Labeling Replication Image analysis –Pixels to feature Feature analysis –Background adjustment –Normalization –Features to genes –Normalization Analysis of biological objectives

Method of Analysis Should Be Tailored to Objectives Class discovery –Identifying expression profiles characteristic of non-predefined subsets of tumors Class/phenotype prediction –Identifying expression profiles that distinguish predefined subsets of tumors

Components of Class Prediction Establish that expression “profiles” differ to a statistically significant degree and that differences observed are not due to examination of thousands of genes Identify genes that account for the differences between classes Develop multi-gene classifier to predict the class for a new sample and estimate the misclassification rates

Do Expression Profiles Differ for Two Defined Classes of Arrays? Not a clustering problem –Global similarity measures generally used for clustering arrays may not distinguish classes Supervised vs unsupervised methods Requires multiple biological samples from each class

Do Expression Profiles Differ for Two Defined Classes of Arrays? Global test –Number of genes significantly differentially expressed among classes at specified nominal significance level –Cross-validated mis-classification rate Multiple comparison adjustment for finding differentially expressed genes –Experiment-wise error –Univariate screening with p<0.001 threshold –False discovery rate
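As a rough illustration of the multiple-comparison options listed on this slide, the sketch below computes the expected number of chance findings under univariate screening at a fixed threshold and applies a false discovery rate procedure. The slide does not say which FDR procedure is meant; the Benjamini-Hochberg step-up rule is assumed here, and the function names are hypothetical.

```python
def expected_false_positives(n_genes, alpha):
    """Expected number of genes passing a per-gene threshold alpha
    when no gene is truly differentially expressed."""
    return n_genes * alpha


def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg step-up
    rule at FDR level q (an assumed choice of FDR procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        # Step-up: keep the largest rank whose p-value is under q * rank / m.
        if p_values[i] <= q * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


if __name__ == "__main__":
    print(expected_false_positives(6000, 0.001))
    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.2], q=0.05))
```

With 6000 genes screened at p < 0.001, about 6 genes are expected to pass by chance alone, which is why the number of significant genes is itself compared against what the null hypothesis would produce.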

Non-cross-validated vs. cross-validated prediction [Figure: a full data set of specimens by log-expression ratios, split into a training set and a test set]

Non-cross-validated prediction: 1. Prediction rule is built using the full data set. 2. Rule is applied to each specimen for class prediction.

Cross-validated prediction (leave-one-out method): 1. Full data set is divided into training and test sets (test set contains 1 specimen). 2. Prediction rule is built using the training set. 3. Rule is applied to the specimen in the test set for class prediction. 4. Process is repeated until each specimen has appeared once in the test set.

Prediction on Simulated Null Data Generation of gene expression profiles: 14 specimens ($P_i$ is the expression profile for specimen $i$); log-ratio measurements on 6000 genes; $P_i \sim \mathrm{MVN}(0, I_{6000})$. Can we distinguish between the first 7 specimens (Class 1) and the last 7 (Class 2)? Prediction method: compound covariate prediction (discussed later), with the compound covariate built from the log-ratios of the 10 most differentially expressed genes.
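The null-data experiment can be sketched in a few lines. This is a minimal illustration, not the authors' code: it draws independent standard normal log-ratios, selects the top genes by absolute t-statistic, and compares the error count when the rule is built on all specimens against leave-one-out cross-validation with gene selection repeated inside every fold. All function names and the seed are assumptions made for the example.

```python
import math
import random


def t_stat(a, b):
    """Pooled-variance two-sample t-statistic."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    pooled = ss / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))


def score(x, weights):
    """Compound covariate: t-weighted sum over the selected genes."""
    return sum(t * x[j] for j, t in weights.items())


def fit(X, y, n_top):
    """Select the n_top genes with largest |t|; return weights and threshold."""
    genes = range(len(X[0]))
    t = [t_stat([x[j] for x, c in zip(X, y) if c == 0],
                [x[j] for x, c in zip(X, y) if c == 1]) for j in genes]
    top = sorted(genes, key=lambda j: -abs(t[j]))[:n_top]
    weights = {j: t[j] for j in top}
    s0 = [score(x, weights) for x, c in zip(X, y) if c == 0]
    s1 = [score(x, weights) for x, c in zip(X, y) if c == 1]
    return weights, (sum(s0) / len(s0) + sum(s1) / len(s1)) / 2


def predict(x, weights, threshold):
    # t is (class 0 mean - class 1 mean), so class 0 has the higher scores.
    return 0 if score(x, weights) > threshold else 1


def resubstitution_errors(X, y, n_top):
    """Non-cross-validated: build the rule on all specimens, then reuse them."""
    weights, threshold = fit(X, y, n_top)
    return sum(predict(x, weights, threshold) != c for x, c in zip(X, y))


def loocv_errors(X, y, n_top):
    """Leave-one-out: gene selection and fitting are redone without specimen i."""
    errors = 0
    for i in range(len(X)):
        weights, threshold = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_top)
        errors += int(predict(X[i], weights, threshold) != y[i])
    return errors


def simulate(n_genes=6000, n_per_class=7, n_top=10, seed=0):
    """Null data: labels carry no information about the profiles."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_genes)]
         for _ in range(2 * n_per_class)]
    y = [0] * n_per_class + [1] * n_per_class
    return resubstitution_errors(X, y, n_top), loocv_errors(X, y, n_top)


if __name__ == "__main__":
    resub, cv = simulate()
    print("non-cross-validated errors:", resub, "of 14")
    print("leave-one-out errors:", cv, "of 14")
```

On data like this, the non-cross-validated count tends to look far better than chance even though the classes are indistinguishable by construction, while the leave-one-out count does not. The key design point is that gene selection happens inside `fit`, so it is repeated for every training set.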

Exact Permutation Test Premise: Under the null hypothesis of no systematic difference in expression profiles between the two classes, it can be assumed that assignment of class labels to expression profiles is purely coincidental. Performing the test 1. Consider every possible permutation of the class labels among the gene expression profiles. 2. Determine the proportion of the permutations that result in a misclassification error rate less than or equal to the observed error rate. 3. This proportion is the achieved significance level in a test of the null hypothesis.

Monte Carlo Permutation Test Examining all permutations is computationally burdensome. Instead, a Monte Carlo method is used: $n_{\mathrm{perm}}$ permutations of the labels are randomly generated. The proportion of these permutations that have $m$ or fewer misclassifications is an estimate of the achieved significance level in a test of the null hypothesis. $n_{\mathrm{perm}}$ is chosen such that the variability in the estimate is less than an acceptable level. If the true proportion of permutations with $m \le 2$ is 0.05, $n_{\mathrm{perm}} = 2000$ ensures the coefficient of variation of the estimate of the achieved significance level is less than 0.1.
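A minimal sketch of the Monte Carlo permutation test follows. It assumes the misclassification count comes from some user-supplied function `error_fn(X, y)`; the one-dimensional nearest-class-mean classifier below is only a toy stand-in for a cross-validated predictor, and all names are hypothetical. The coefficient of variation uses the binomial result for an estimated proportion, sqrt((1 - p) / (n_perm * p)).

```python
import math
import random


def monte_carlo_p_value(X, y, error_fn, n_perm=2000, seed=0):
    """Proportion of random label permutations whose error count is
    less than or equal to the observed error count."""
    rng = random.Random(seed)
    observed = error_fn(X, y)
    hits = 0
    for _ in range(n_perm):
        permuted = rng.sample(y, len(y))  # random permutation of the labels
        if error_fn(X, permuted) <= observed:
            hits += 1
    return hits / n_perm


def coefficient_of_variation(p, n_perm):
    """CV of a proportion estimated from n_perm independent permutations."""
    return math.sqrt((1 - p) / (n_perm * p))


def loo_nearest_mean_errors(X, y):
    """Toy stand-in for a cross-validated classifier: leave-one-out errors
    of a nearest-class-mean rule on one-dimensional data."""
    errors = 0
    for i, (xi, yi) in enumerate(zip(X, y)):
        rest = [(x, c) for k, (x, c) in enumerate(zip(X, y)) if k != i]
        means = {}
        for label in (0, 1):
            vals = [x for x, c in rest if c == label]
            means[label] = sum(vals) / len(vals)
        pred = min(means, key=lambda label: abs(xi - means[label]))
        errors += int(pred != yi)
    return errors


if __name__ == "__main__":
    X = [0, 1, 2, 3, 10, 11, 12, 13]
    y = [0, 0, 0, 0, 1, 1, 1, 1]
    print(monte_carlo_p_value(X, y, loo_nearest_mean_errors, n_perm=500))
    print(coefficient_of_variation(0.05, 2000))
```

For p = 0.05 and 2000 permutations the coefficient of variation is sqrt(0.95 / 100), just under 0.1, which matches the figure quoted on the slide.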

Gene-Expression Profiles in Hereditary Breast Cancer (cDNA microarrays, parallel gene expression analysis) Breast tumors studied: 7 BRCA1+ tumors 8 BRCA2+ tumors 7 sporadic tumors Log-ratio measurements of 3226 genes for each tumor after initial data filtering RESEARCH QUESTION Can we distinguish BRCA1+ from BRCA1– cancers and BRCA2+ from BRCA2– cancers based solely on their gene expression profiles?

The Compound Covariate Predictor (CCP) We consider only genes that are differentially expressed between the two groups (using a two-sample t-test with small α). The CCP –Motivated by J. Tukey, Controlled Clinical Trials, 1993 –Simple approach that may serve better than complex multivariate analysis –A compound covariate is built from the basic covariates (log-ratios): $\mathrm{CCP}_i = \sum_j t_j x_{ij}$, where $t_j$ is the two-sample t-statistic for gene $j$, $x_{ij}$ is the log-ratio measure of sample $i$ for gene $j$, and the sum is over all differentially expressed genes. Threshold of classification: midpoint of the CCP means for the two classes.
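The compound covariate predictor described on this slide can be illustrated with a small worked example. This is a sketch under stated assumptions: genes are selected by taking the `n_genes` largest absolute t-statistics rather than by a significance threshold, a pooled-variance t-statistic is used, and the function names are hypothetical.

```python
import math


def t_statistic(a, b):
    """Pooled-variance two-sample t-statistic (class 0 minus class 1)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    pooled = ss / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))


def ccp_score(x, weights):
    """Compound covariate for one specimen: sum of t_j * x_j over selected genes."""
    return sum(t * x[j] for j, t in weights.items())


def fit_ccp(X, y, n_genes=10):
    """X: list of specimens (lists of log-ratios); y: class labels 0/1.
    Returns ({gene index: t-statistic}, classification threshold)."""
    n_features = len(X[0])
    t = []
    for j in range(n_features):
        a = [x[j] for x, c in zip(X, y) if c == 0]
        b = [x[j] for x, c in zip(X, y) if c == 1]
        t.append(t_statistic(a, b))
    # Assumption: keep the n_genes largest |t| instead of testing at level alpha.
    top = sorted(range(n_features), key=lambda j: -abs(t[j]))[:n_genes]
    weights = {j: t[j] for j in top}
    s0 = [ccp_score(x, weights) for x, c in zip(X, y) if c == 0]
    s1 = [ccp_score(x, weights) for x, c in zip(X, y) if c == 1]
    # Threshold: midpoint of the class means of the compound covariate.
    threshold = (sum(s0) / len(s0) + sum(s1) / len(s1)) / 2
    return weights, threshold


def predict_ccp(x, weights, threshold):
    # With t defined as class 0 minus class 1, class 0 has the higher mean score.
    return 0 if ccp_score(x, weights) > threshold else 1


if __name__ == "__main__":
    X = [[5.0, 1.0, 0.2], [6.0, 2.0, -0.1], [7.0, 3.0, 0.3],
         [0.0, 1.0, 0.1], [1.0, 2.5, -0.2], [2.0, 3.0, 0.4]]
    y = [0, 0, 0, 1, 1, 1]
    weights, threshold = fit_ccp(X, y, n_genes=1)
    print(weights, threshold)
    print(predict_ccp([4.0, 2.0, 0.0], weights, threshold))
```

Because each gene's weight is its own t-statistic, the class-0 mean of the compound covariate is never below the class-1 mean, so a single midpoint threshold is enough to classify a new specimen.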

Accuracy of class prediction as selection stringency increases

Advantages of Compound Covariate Classifier Good feature selection Does not over-fit data –Incorporates influence of multiple predictive variables without attempting to select the best small subset of variables –Does not attempt to model the multivariate interactions among the predictors and outcome

Extensions Adjustment for covariates Paired samples Survival data Other classification methods More than 2 classes

Class Discovery For determining whether a set of tumors is homogeneous with regard to expression profile

Class Discovery Methods Cluster analysis Multi-dimensional Scaling

Melanoma Gene Expression Data (M. Bittner et al., Nature Genetics Aug 2000) Q: Can gene expression profiles of melanoma be used to distinguish sub-classes of disease? [Figure: clustering of tumors, distance measured as 1 − correlation, with a 19-tumor cluster of interest]

Validation of Clusters Clustering algorithms find clusters, even when they are spurious Clusters found may change with re-assaying tumors or selection of new tumors

Clustering Arrays Cluster significance Cluster reproducibility

Add perturbation noise to original data Re-cluster perturbed data to assess stability of original clusters D: Proportion of pairs of samples in a specified cluster of the original data that are in separate clusters after perturbation R: Average number of specimens lost or gained in a specified cluster, $\lVert C \cup P(C) - C \cap P(C) \rVert$
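The two reproducibility measures can be sketched as follows. This is an illustration under several assumptions not stated on the slide: the perturbation is independent Gaussian noise with a user-chosen standard deviation, the clustering algorithm is supplied by the caller as `cluster_fn`, and the perturbed cluster matched to an original cluster is taken to be the one with the largest overlap. All names are hypothetical.

```python
import random
from itertools import combinations


def pair_discrepancy(original, perturbed, cluster):
    """D: share of sample pairs from one original cluster that end up
    in different clusters after perturbation."""
    members = [i for i, c in enumerate(original) if c == cluster]
    pairs = list(combinations(members, 2))
    if not pairs:
        return 0.0
    split = sum(perturbed[i] != perturbed[j] for i, j in pairs)
    return split / len(pairs)


def lost_or_gained(original, perturbed, cluster):
    """R for one perturbation: specimens lost from or gained by the cluster,
    matching it to the perturbed cluster with the largest overlap (assumed)."""
    C = {i for i, c in enumerate(original) if c == cluster}
    groups = {}
    for i, c in enumerate(perturbed):
        groups.setdefault(c, set()).add(i)
    best = max(groups.values(), key=lambda s: len(s & C))
    return len(C | best) - len(C & best)


def stability(X, cluster_fn, cluster, sigma, n_rep=100, seed=0):
    """Average D and R over n_rep noisy copies of the data."""
    rng = random.Random(seed)
    original = cluster_fn(X)
    d_total = r_total = 0.0
    for _ in range(n_rep):
        noisy = [[v + rng.gauss(0, sigma) for v in x] for x in X]
        perturbed = cluster_fn(noisy)
        d_total += pair_discrepancy(original, perturbed, cluster)
        r_total += lost_or_gained(original, perturbed, cluster)
    return d_total / n_rep, r_total / n_rep


if __name__ == "__main__":
    # Toy clustering rule standing in for a real algorithm.
    def split_at_five(data):
        return [0 if x[0] < 5 else 1 for x in data]

    X = [[0.0], [1.0], [2.0], [10.0], [11.0]]
    print(stability(X, split_at_five, cluster=0, sigma=1.5))
```

In practice `cluster_fn` would wrap whatever clustering method produced the original clusters, and the noise level would be chosen to reflect the measurement variability of the arrays.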

Melanoma Data: mn-error Method - Individual Clusters

Test of Cluster Significance Multivariate Gaussian null hypothesis Project to subspace determined by first three principal components Compute the empirical distribution function (EDF) of nearest-neighbor (NN) Euclidean distances between samples Compare the NN EDF observed to that expected under the null distribution using a squared difference discrepancy metric Estimate null distribution by sampling from 3D Gaussian distribution with mean and covariance matrix corresponding to observed data

BRB ArrayTools: An integrated package for the analysis of DNA microarray data

BRB ArrayTools Design Objectives Easy user interface –Excel front-end Ease of data loading –integrated Drill-down linkage to genomic databases Educating biologists in microarray data analysis Powerful analytic & visualization tools Easily extensible –R backend Portable –Non-proprietary Ease of development –R back-end

Collaborators Molecular Statistics & Bioinformatics –Kevin Dobbin –Lisa McShane –Amy Peng –Michael Radmacher –Joanna Shih –George Wright –Yingdong Zhao