Biomarker and Classifier Selection in Diverse Genetic Datasets
James Lindsay, Ed Hemphill, Chih Lee, Ion Mandoiu, Craig Nelson

1 Biomarker and Classifier Selection in Diverse Genetic Datasets
James Lindsay (1), Ed Hemphill (2), Chih Lee (1), Ion Mandoiu (1), Craig Nelson (2)
University of Connecticut
(1) Department of Computer Science and Engineering
(2) Department of Molecular and Cell Biology

2 Motivation 1: Cell-type Identification
The Question: What is the smallest number of genes needed to identify each cluster?
  B: Bone
  C: Myeloid
  D: Endothelial
Available Data: Literature-annotated present/absent calls for 50 cell types and 600 genes in the mesoderm lineage.
In collaboration with: Dr. Hector Leonardo Aguila, UCHC

3 Motivation 2: Clinical Diagnostics
Source: Validation Study of Existing Gene Expression Signatures for Anti-TNF Treatment in Patients with Rheumatoid Arthritis, PLoS One 2012

Study         # genes   Sensitivity (%)   Specificity (%)
Lequerre      20        71                61
Stuhlmuller   11        79                56
Stuhlmuller   82        67                56
Lequerre      8         71                28
Sekiguchi     18        71                28
Julia         8         92                17
Stuhlmuller   3         71                17
Tanio         8         67                33
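
For readers comparing the signatures above, the two reported metrics can be computed from predicted and true response labels as in the following minimal sketch. The function name and the toy labels are hypothetical and do not come from the cited study.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) in percent for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sensitivity = 100.0 * tp / (tp + fn)  # responders correctly identified
    specificity = 100.0 * tn / (tn + fp)  # non-responders correctly identified
    return sensitivity, specificity


# Hypothetical toy labels: 1 = responder, 0 = non-responder.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

A signature with high sensitivity but low specificity, like several rows in the table, labels most patients as responders and so rarely rules anyone out.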

4 Multi-class Classification Problem
Multi-class Classification
  - There are 2 or more classes
  - Supervised learning
Key Problems:
  1. Feature Selection: What are the most predictive biomarkers?
  2. Classification: What is the best classification algorithm?

5 Challenges
Different types of data
  - Gene expression
  - Epigenetic data (methylation, histone modification)
  - Proteomics
  - Metabolomics
  - Phenotypes
Different Platforms
  - Microarray
  - Sequencing
  - In-situ hybridization
Different Resolutions
  - Discrete vs Continuous
  - Sparse vs Complete

6 Minimal Unique Marker Panel Selection (Mumps) Pipeline
Input: # of biomarkers
Parameterize each combination of feature selection and classification algorithms
Nested Cross Validation
  - Inner Cross-validation
  - Rank Models by AUC
  - Outer Cross-validation
Output: the best features and classifier
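
The nested cross-validation idea in this pipeline can be sketched as follows. This is a dependency-free illustration, not the Mumps implementation: the feature filter (spread of class means), the two nearest-mean classifiers, the toy data, and all function names are assumptions, and models are ranked by accuracy here rather than AUC to keep the sketch short.

```python
import math
import random
from collections import defaultdict


def stratified_folds(labels, k, rng):
    """Split positions 0..len(labels)-1 into k folds with similar class balance."""
    by_class = defaultdict(list)
    for pos, label in enumerate(labels):
        by_class[label].append(pos)
    folds = [[] for _ in range(k)]
    for positions in by_class.values():
        rng.shuffle(positions)
        for j, pos in enumerate(positions):
            folds[j % k].append(pos)
    return folds


def class_means(X, y, rows, feats):
    """Mean profile of each class over the chosen rows and features."""
    grouped = defaultdict(list)
    for i in rows:
        grouped[y[i]].append([X[i][f] for f in feats])
    return {c: [sum(col) / len(col) for col in zip(*vecs)]
            for c, vecs in grouped.items()}


def select_features(X, y, rows, n):
    """Toy filter: keep the n features whose class means differ the most."""
    feats = list(range(len(X[0])))
    means = class_means(X, y, rows, feats)

    def spread(f):
        values = [m[f] for m in means.values()]
        return max(values) - min(values)

    return sorted(feats, key=spread, reverse=True)[:n]


def neg_sq_distance(a, b):
    return -sum((p - q) ** 2 for p, q in zip(a, b))


def cosine(a, b):
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return sum(p * q for p, q in zip(a, b)) / norm if norm else 0.0


def fit_and_score(X, y, train, test, n_feats, similarity):
    """Select features and fit on `train` only, then return accuracy on `test`."""
    feats = select_features(X, y, train, n_feats)
    means = class_means(X, y, train, feats)
    hits = 0
    for i in test:
        sample = [X[i][f] for f in feats]
        hits += max(means, key=lambda c: similarity(sample, means[c])) == y[i]
    return hits / len(test)


def nested_cv(X, y, models, n_feats, k=3, seed=0):
    """Return one (chosen model name, outer-fold accuracy) pair per outer fold."""
    rng = random.Random(seed)
    results = []
    outer = stratified_folds(y, k, rng)
    for o, test in enumerate(outer):
        train = [i for j, fold in enumerate(outer) if j != o for i in fold]
        # Inner folds are built from the outer-training rows only.
        inner = [[train[p] for p in fold]
                 for fold in stratified_folds([y[i] for i in train], k, rng)]

        def inner_score(name):
            scores = []
            for a, val in enumerate(inner):
                fit = [i for j, fold in enumerate(inner) if j != a for i in fold]
                scores.append(fit_and_score(X, y, fit, val, n_feats, models[name]))
            return sum(scores) / len(scores)

        best = max(models, key=inner_score)  # rank candidates on inner folds
        results.append((best, fit_and_score(X, y, train, test, n_feats, models[best])))
    return results


# Hypothetical toy data: 3 classes, 6 samples each, 10 features;
# feature c is elevated in class c, everything else is noise.
data_rng = random.Random(1)
y = [c for c in range(3) for _ in range(6)]
X = [[(5.0 if f == c else 0.0) + data_rng.gauss(0, 0.5) for f in range(10)] for c in y]
models = {"euclidean": neg_sq_distance, "cosine": cosine}
results = nested_cv(X, y, models, n_feats=3)
```

The inner loop only ever sees the outer-training samples, so the outer-fold score is not used to pick the model it evaluates.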

7 Algorithms
Feature Selection
  - SVM recursive feature elimination (SVM-RFE)
  - ANOVA F-value
  - Random Forests
  - Extra Trees
Classification
  - Correlation
  - Cosine
  - K-Nearest Neighbors (KNN)
  - Support Vector Machine (SVM)
  - Decision Tree
  - Random Forests
  - Extra Trees
  - Gradient Boosting
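
The slides do not define the "Correlation" and "Cosine" classifiers. One plausible reading, shown here purely as an assumption, is a nearest-profile rule: a sample is assigned to the cell type whose reference profile it most resembles under the chosen similarity. The reference profiles and sample values below are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return sum(p * q for p, q in zip(a, b)) / norm if norm else 0.0


def pearson(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([p - mean_a for p in a], [q - mean_b for q in b])


def classify(sample, references, similarity):
    """Return the name of the reference profile most similar to the sample."""
    return max(references, key=lambda name: similarity(sample, references[name]))


# Hypothetical reference profiles over four marker genes.
references = {
    "bone": [9.0, 1.0, 1.0, 2.0],
    "myeloid": [1.0, 8.0, 2.0, 1.0],
    "endothelial": [1.0, 2.0, 9.0, 8.0],
}
sample = [7.5, 2.0, 1.5, 2.5]
by_cosine = classify(sample, references, cosine)
by_correlation = classify(sample, references, pearson)
```

The two measures differ only in mean-centering, which makes correlation insensitive to a constant offset between platforms while cosine is not.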

8 Datasets
From Broad Institute
  - Affymetrix gene expression microarray
  - 15 hematopoietic cell types
  - 82 samples
  - 4-7 samples per cell type
Multiple Sources
  - 70 samples
  - Approximately 3-7 samples per cell type
  - Affymetrix & Illumina Bead Array
  - Different labs

9 Experiments
Complete
  - Complete gene expression profile from microarray datasets
Simulated Sparse
  - 70% and 50% missing data
  - Coverage of a marker (the fraction of cell types having known expression statuses for that marker) followed a Beta distribution
  - Fifteen simulations
Cross-validation
  - 3-fold, stratified
  - # features: 2, 8, 16, 32, 64, 96, 128, 256, and 384
  - Best set of features and classifier for each # features
External validation
  - Use Broad data as training
  - Test against external datasets
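
The sparse-data simulation can be sketched as below: each marker draws a coverage value from a Beta distribution, and only that fraction of cell types keeps a known status. The slides do not give the Beta parameters, so the `concentration` argument, the parameterization by mean coverage, the function names, and the toy matrix are all assumptions.

```python
import random


def simulate_sparse(matrix, mean_coverage, concentration=10.0, seed=0):
    """Hide entries of matrix[cell_type][marker] marker by marker.

    Each marker's coverage (fraction of cell types with a known status) is
    drawn from Beta(alpha, beta) with mean `mean_coverage`. Hidden entries
    become None. The input matrix is left unchanged.
    """
    rng = random.Random(seed)
    n_types = len(matrix)
    n_markers = len(matrix[0])
    alpha = mean_coverage * concentration
    beta = (1.0 - mean_coverage) * concentration
    sparse = [list(row) for row in matrix]
    for m in range(n_markers):
        coverage = rng.betavariate(alpha, beta)
        n_known = round(coverage * n_types)
        known = set(rng.sample(range(n_types), n_known))
        for t in range(n_types):
            if t not in known:
                sparse[t][m] = None
    return sparse


def missing_fraction(matrix):
    """Fraction of entries that are None."""
    cells = [value for row in matrix for value in row]
    return sum(value is None for value in cells) / len(cells)


# Hypothetical present/absent matrix: 15 cell types by 200 markers.
toy_rng = random.Random(42)
toy = [[toy_rng.randint(0, 1) for _ in range(200)] for _ in range(15)]
sparse = simulate_sparse(toy, mean_coverage=0.3)  # about 70% missing
fraction = missing_fraction(sparse)
```

Repeating the call with different seeds would give independent simulations, in the spirit of the fifteen runs mentioned on the slide.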

10 Performance: Complete Data

11 By Algorithm: Complete Data

12 Performance: 70% Missing

13 Summary: Best Algorithms
(FS: feature selection, CL: classification)

               Complete               70% missing
# of markers   FS    CL               FS    CL
2              RFE   KNN              RFE   Extra Trees
8              RFE   Cosine           RFE   Cosine
16             RFE   Cosine           RFE   Cosine
32             RFE   Cosine           RFE   Cosine
64             RFE   Cosine           RFE   Cosine
96             RFE   Cosine           RFE   Correlation
128            RFE   Cosine           RFE   Correlation
256            RFE   Cosine           RFE   Correlation
384            RFE   Correlation      RFE   Correlation

14 Why the Big Gap?
  - Cross-platform normalization
  - Similarities in cell types
  - Over-fitting
Correlation: Broad vs External

15 Motivation Results
Mesoderm Cell-type Identification

# genes   AUC
8         73 %
16        74 %
32        76 %
64        78 %
96        87 %
128       91 %
256       91 %
384       92 %

Anti-TNF Responsiveness

Study         # genes   Sensitivity (%)   Specificity (%)
Lequerre      20        71                61
Stuhlmuller   11        79                56
Stuhlmuller   82        67                56
Lequerre      8         71                28
Sekiguchi     18        71                28
Julia         8         92                17
Stuhlmuller   3         71                17
Tanio         8         67                33
UCONN883
UCONN20489496

16 Future Work
Broader Data-types
  - NCI-60: microarray mRNA, microarray microRNA, copy number variation, protein array, SNPs, …
Minimizing over-fitting
  - Cross-platform normalization
Different Data types
  - Integrate multiple data types simultaneously

17 Conclusion and Thanks
Thanks to: Ed Hemphill, Chih Lee, Ion Mandoiu, Craig Nelson
Smpl Bio: A commercial service coming in late 2013

18 Extra Slides
Don't Go Beyond, Tis a Silly Place

19 Experiment Overview
Input: # of biomarkers
Parameterize each combination of feature selection and classification algorithms
Nested Cross Validation (Broad Data)
  - Inner Cross-validation
  - Rank Models by AUC
  - Outer Cross-validation
Output the best features and classifier
External Testing
  - Test Best Model
  - Output: AUC of best features / classifier
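
The AUC reported for the best model can be illustrated with a rank-based calculation. The slides do not say how AUC is extended to more than two classes, so the one-vs-rest macro average below is an assumption, as are the function names and the toy scores.

```python
def binary_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(score_rows, labels, classes):
    """Average the one-vs-rest AUC over all classes (assumed multi-class rule)."""
    per_class = []
    for k, cls in enumerate(classes):
        class_scores = [row[k] for row in score_rows]
        class_labels = [1 if label == cls else 0 for label in labels]
        per_class.append(binary_auc(class_scores, class_labels))
    return sum(per_class) / len(per_class)


# Hypothetical scores from a model trained on one dataset and applied to
# six external samples; columns follow the order of `classes`.
classes = ["bone", "myeloid", "endothelial"]
labels = ["bone", "bone", "myeloid", "myeloid", "endothelial", "endothelial"]
score_rows = [
    [0.80, 0.10, 0.10],
    [0.25, 0.65, 0.10],
    [0.30, 0.60, 0.10],
    [0.20, 0.70, 0.10],
    [0.10, 0.20, 0.70],
    [0.30, 0.30, 0.40],
]
external_auc = macro_ovr_auc(score_rows, labels, classes)
```

Because AUC depends only on how samples are ranked, it is unaffected by monotone rescaling of the scores, which is convenient when training and external data come from different platforms.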

20 Performance: 50% Missing

