Alexander Statnikov, Discovery Systems Laboratory, Department of Biomedical Informatics, Vanderbilt University, 10/3/2007

Project history

- Joint project with Chun Li and Constantin Aliferis.
- Cancer Research 2005 paper by Hu et al.: "Genome-Wide Association Study in Esophageal Cancer Using GeneChip Mapping 10K Array".
- The paper reported near-perfect classification of cancer patients and healthy controls on the basis of only SNP data from a case-control GWA study. This finding suggests that esophageal cancer is a solely genetic disease…
- Initial idea of Chun Li.
- At DSL we had independently obtained the GWA dataset before Chun and Constantin initiated this project.

Background

SNPs make up >90% of all human genetic variation and have been extensively studied for functional relationships between phenotype and genotype. Modern high-throughput genotyping technologies allow fast evaluation of SNPs on a genome-wide scale at relatively low cost. During the last 2 years, several studies have reported success in using SNP genotyping assays in GWA studies in cancer. Probably the strongest result is reported in the study by Hu et al.

Claims of Hu et al.

- "Using the generalized linear model (GLM) with adjustment for potential confounders and multiple comparisons, we identified 37 SNPs associated with disease."
- "When the 37 SNPs identified from the GLM recessive mode were used in a principal components analysis, the first principal component correctly predicted 46 of 50 cases and 47 of 50 controls." […]
- "The permutation tests indicated that our PCA classification can be generalized."


Study dataset & its preparation

Study dataset:
- 50 esophageal squamous cell carcinoma patients
- 50 healthy controls (matched by age, sex, place of residence)
- 10K Affymetrix SNP arrays with 11,555 SNPs
- Additional variables: age, tobacco use, alcohol consumption, family history, consumption of pickled vegetables

Preparation:
- Removed ~1.5K SNPs to minimize genotyping errors
- Implemented recessive A encoding
- Imputed missing genotypes
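The recessive encoding and imputation steps can be sketched as follows. The talk does not spell out the exact procedure, so this is a minimal illustration under assumed conventions: a genotype is coded 1 only if it is homozygous for the minor allele (here `a`), and missing calls are filled with the most frequent observed value. The function names and the mode-imputation choice are hypothetical.

```python
import numpy as np

def recessive_encode(genotypes, minor="a"):
    """Recessive model: 1 if homozygous for the minor allele, else 0.
    Missing calls (None) become np.nan."""
    out = []
    for g in genotypes:
        if g is None:
            out.append(np.nan)                      # missing genotype
        else:
            out.append(1.0 if g == minor + minor else 0.0)
    return np.array(out)

def impute_mode(x):
    """Fill missing values with the most frequent observed value (0 or 1)."""
    observed = x[~np.isnan(x)]
    fill = 1.0 if observed.mean() > 0.5 else 0.0    # majority value
    return np.where(np.isnan(x), fill, x)

# Hypothetical genotype calls for one SNP across five subjects
calls = ["AA", "Aa", "aa", None, "aa"]
encoded = impute_mode(recessive_encode(calls))
```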

SNP selection: Original method of Hu et al. (denoted as GLM1)

Fit a GLM model using data for all 100 subjects:

Probability(Cancer) = 1 / (1 + exp(−f)), where f = a + b·SNP + c·family history + d·alcohol consumption

Obtain deviances:
- D1: deviance of the above fitted model
- D0: deviance of the null model (without predictor variables)

From the χ² distribution, compute a p-value for the test statistic D0 − D1 with 3 degrees of freedom. Perform Bonferroni correction at the 0.05 alpha level.

SNP selection: Unbiased GLM-based method (denoted as GLM2)

Fit a GLM model using data for all 100 subjects:

Probability(Cancer) = 1 / (1 + exp(−f)), where f = a + b·SNP + c·family history + d·alcohol consumption

Obtain deviances:
- D1: deviance of the above fitted model
- D0′: deviance of the model with family history and alcohol consumption only

From the χ² distribution, compute a p-value for the test statistic D0′ − D1 with 1 degree of freedom. Perform Bonferroni correction at the 0.05 alpha level.

Recap of SNP selection methods

                     GLM1 (Hu et al.)                           GLM2 (Current study)
D1 model             SNP, family history, alcohol consumption   SNP, family history, alcohol consumption
D0 model             Null                                       Family history, alcohol consumption
Degrees of freedom   3                                          1

Classification: Original method of Hu et al.

- Perform principal component analysis (PCA) on the selected SNPs using all 100 subjects in the dataset.
- Extract the first principal component (PC1).
- Use the following rule to classify each of the same 100 subjects as used for the PCA: if PC1 > 0, classify as control; otherwise classify as case.
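The PC1 rule can be sketched as follows on toy data. Note that the sign of a principal component is arbitrary, so a real implementation must orient PC1 against the labels; this sketch simply uses the raw SVD sign:

```python
import numpy as np

def pc1_classify(X):
    """Classify by the sign of the first principal component, as in Hu et al.
    Caveat: the sign of PC1 is arbitrary, so only the *separation* into two
    groups is meaningful here, not which group gets which label."""
    Xc = X - X.mean(axis=0)                          # center each SNP column
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    pc1 = Xc @ Vt[0]                                 # projection onto PC1
    return np.where(pc1 > 0, "control", "case")      # PC1 > 0 -> control

# Toy data: two well-separated groups of 5 subjects each
X = np.vstack([np.zeros((5, 4)), np.ones((5, 4))])
labels = pc1_classify(X)
```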

Evaluation of classification performance

- Hu et al. used the proportion of correct classifications; their classifier is trained and tested on the same dataset.
- We employ the area under the ROC curve (AUC) as the performance metric and a repeated 10-fold cross-validation scheme on the SNP dataset (100 subjects).
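Our evaluation scheme can be sketched with scikit-learn. The features and the classifier below are hypothetical stand-ins (pure noise, so the expected AUC is about 0.5); the point is the repeated 10-fold cross-validation with AUC scoring:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)                  # 50 controls, 50 cases
X = rng.normal(size=(100, 20))             # synthetic features, unrelated to y

# 10-fold cross-validation repeated 10 times, scored by area under the ROC curve
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                       scoring="roc_auc", cv=cv)
mean_auc = aucs.mean()                     # close to 0.5, since features are noise
```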

Reproducing findings of Hu et al.

Using the GLM1 method, Hu et al. reported 37 significant SNPs; we found 226! Apparently, they used an extra filtering step that was not reported in the paper (personal communication with their PI). Nevertheless, applying the PCA-based classifier (as in Hu et al.) to the GLM1-significant SNPs resulted in a 0.93 proportion of correct classifications and 0.98 AUC.

⇒ Major findings are reproduced using the methods of Hu et al.

Bias in SNP selection method GLM1 of Hu et al.

The calculation of p-values in GLM1 does not reflect the significance of the SNP, but the significance of 3 variables combined (SNP, family history, and alcohol consumption). Family history and alcohol consumption are strong risk factors ⇒ the p-value is biased towards 0.

Bias in SNP selection method GLM1 by Hu et al.

The distribution of SNP p-values for method GLM1 is not uniform: most p-values are < 10⁻³, below the Bonferroni-adjusted α-level. On the contrary, GLM2 reflects the significance of SNPs and does not suffer from the above bias: its distribution of SNP p-values is uniform, and it returns no SNPs significant at the Bonferroni-adjusted α-level.

Empirical demonstration of bias in SNP selection method

Main idea: create a null distribution where SNPs are completely unrelated to the response variable and see how frequently methods GLM1 and GLM2 find statistically significant SNPs.

1. Permute all subjects in the SNP data while leaving the response variable, family history of esophageal cancer, and alcohol consumption intact.
2. Apply GLM1 and GLM2 to the permuted SNP data.

Repeat 1,000 times.
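The permutation protocol can be sketched as follows on synthetic data. For speed, a per-SNP 2×2 Fisher exact test stands in for the GLM deviance tests; being an unbiased test of the SNP alone, it behaves like GLM2 (almost no Bonferroni-significant SNPs under the null). The SNP counts and permutation counts are scaled down from the talk's 11,555 SNPs and 1,000 permutations:

```python
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
n_subjects, n_snps, n_perms = 100, 50, 50
y = np.repeat([0, 1], 50)
snps = rng.integers(0, 2, size=(n_subjects, n_snps))   # null SNPs: unrelated to y

alpha = 0.05 / n_snps                                  # Bonferroni over tested SNPs

def n_significant(G, labels):
    """Count Bonferroni-significant SNPs, using a 2x2 Fisher exact test per SNP
    as a fast, unbiased stand-in for the GLM-based deviance test."""
    hits = 0
    for j in range(G.shape[1]):
        table = [[int(np.sum((G[:, j] == a) & (labels == b))) for b in (0, 1)]
                 for a in (0, 1)]
        if fisher_exact(table)[1] < alpha:
            hits += 1
    return hits

# Permute subjects in the SNP data (labels stay fixed), re-run selection,
# and record how many "significant" SNPs appear purely by chance
perm_hits = [n_significant(snps[rng.permutation(n_subjects)], y)
             for _ in range(n_perms)]
```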

Results of permutation experiments

- GLM1 found significant SNPs in all 1,000 permutations! The number of significant SNPs found in a permuted dataset ranges from 185 to 1,938 (357 on average).
- GLM2 found significant SNPs in only 48 of 1,000 permutations. The number of significant SNPs found in a permuted dataset ranges from 1 to 3.

⇒ GLM1 is biased, while GLM2 is not.

Bias in the classification performance estimate of Hu et al.

All data-analysis methods of Hu et al. use data for all subjects. Neither cross-validation nor independent-sample validation was performed. We repeated their data analysis (GLM1+PCA) embedded in the repeated 10-fold cross-validation design. The resulting performance is only 0.68 AUC (versus 0.98 AUC).

⇒ 0.30 AUC bias (overestimation) in the reported results.

Empirical demonstration of performance estimation bias

Main idea: create a null distribution where SNPs are completely unrelated to the response variable (i.e., AUC = 0.5), apply the GLM1+PCA methodology, and record the resulting performance estimates.

1. Permute all subjects in the SNP data while leaving the response variable, family history of esophageal cancer, and alcohol consumption intact.
2. Apply GLM1 to the permuted SNP data.
3. Build and apply a classifier using PCA.
4. Estimate classification performance (AUC).

Repeat 1,000 times.

Results of permutation experiments

- GLM1+PCA, both applied as in Hu et al. to all data (no cross-validation): 0.99 AUC.
- GLM1 applied to all data, PCA applied by cross-validation (incomplete cross-validation): 0.98 AUC.
- GLM1+PCA, both applied by cross-validation: 0.50 AUC.

⇒ ~0.48 AUC bias (overestimation) under the null.


Classification: Support Vector Machines (SVMs)

- A supervised baseline technique for many types of high-throughput data (microarray, proteomics, etc.).
- Trained and applied by cross-validation.

SNP selection for fitting SVMs: Recursive Feature Elimination

- Among the best-performing techniques for the analysis of microarray gene expression data.
- Applied only to the training set during cross-validation.

[Diagram: at each step an SVM model is fit and its performance estimated; the SNPs not important for classification are discarded, halving the set (10,000 SNPs → 5,000 SNPs → 2,500 SNPs → …).]
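The halving procedure above can be sketched with scikit-learn's `RFE`, which at each step refits a linear SVM and discards the features with the smallest absolute weights. The planted-signal data below are synthetic, and the feature counts are scaled down for illustration:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)                 # 50 controls, 50 cases
X = rng.normal(size=(100, 200))           # 200 synthetic "SNP" features
X[:, :5] += 2.0 * y[:, None]              # plant 5 informative features

# Remove 50% of the remaining features at each elimination step,
# ranked by |SVM weight|, until 5 features are left
rfe = RFE(SVC(kernel="linear"), n_features_to_select=5, step=0.5)
rfe.fit(X, y)
selected = np.flatnonzero(rfe.support_)   # indices of the surviving features
```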

Classification results: repeated 10-fold cross-validation estimates

[Results chart; "+" denotes building of the classifier by an ensembling technique.]


Feedback on our analysis from Hu et al.

Concerning bias in SNP selection: "If we use p-values to rank the SNPs, the two methods [GLM1 and GLM2] will give the same order."

Our comment: the ranking of SNPs is irrelevant, because the method of Hu et al. (GLM1), as described and used in their paper, is a method for selection (not ranking) of SNPs.

Feedback on our analysis from Hu et al.

Concerning bias in estimation of classifier performance: "It was not our purpose to develop a classifier in this initial pilot effort." "…we made these calculations as a frame of reference only."

The authors presented results of their "cross-validation effort": SNPs were selected by GLM1 on all 100 subjects, and the classifier was trained and tested by cross-validation (2/3 of the data used for training and 1/3 for testing). This cross-validation procedure was repeated 1,000 times with different splits into training and testing sets.

Feedback on our analysis from Hu et al.

The authors obtained the following histogram of classification performance estimates (proportion of correct classifications).

Our comment: these results are expected, because their SNP selection procedure utilizes both training and testing data. This is "incomplete cross-validation" and is shown to cause biased performance estimation of the classifier.
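The effect of incomplete cross-validation is easy to reproduce on pure noise, where the true AUC is 0.5: selecting features on all subjects before cross-validating inflates the estimate dramatically, while putting selection inside each training fold (via a pipeline) does not. A sketch with scikit-learn; the univariate F-test selector, classifier, and feature counts are illustrative stand-ins, not the methods of either paper:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 5000))          # pure noise: the true AUC is 0.5

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Incomplete CV: features are selected on ALL subjects, then only the
# classifier is cross-validated -> test folds leak into selection
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
auc_incomplete = cross_val_score(clf, X_sel, y, scoring="roc_auc", cv=cv).mean()

# Proper CV: selection is refit inside each training fold via a pipeline
pipe = make_pipeline(SelectKBest(f_classif, k=20), clf)
auc_proper = cross_val_score(pipe, X, y, scoring="roc_auc", cv=cv).mean()
```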

Publications

- Statnikov A, Li C, Aliferis CF (2007) "Effects of Environment, Genetics and Data Analysis Pitfalls in an Esophageal Cancer Genome-Wide Association Study." PLoS ONE 2(9): e958.
- Statnikov A, Li C, Aliferis CF (2007) "A statistical reappraisal of the findings of an esophageal cancer genome-wide association study." Cancer Research (accepted).

Conclusions

- Data-analysis pitfalls in Hu et al. led the researchers to (1) identify non-statistically-significant SNPs and (2) derive biased estimates of classification performance.
- Environmental factors and family history have a modest association with the disease, while SNPs do not appear to be associated.
- It is crucially important to have sound statistical analysis in genome-wide association studies.
- The amount of work involved in demonstrating errors (even obvious ones), correcting the analysis, communicating with the authors, and publishing the rebuttal is significantly greater than publishing the original paper!