Slide 1: Biomarker and Classifier Selection in Diverse Genetic Datasets
James Lindsay¹, Ed Hemphill², Chih Lee¹, Ion Mandoiu¹, Craig Nelson²
University of Connecticut
¹Department of Computer Science and Engineering
²Department of Molecular and Cell Biology
Slide 2: Motivation 1: Cell-type Identification
The Question: What is the smallest number of genes needed to identify each cluster (B: Bone, C: Myeloid, D: Endothelial)?
Available Data: Literature-annotated present/absent calls for 50 cell types and 600 genes in the mesoderm lineage.
In collaboration with Dr. Hector Leonardo Aguila, UCHC.
Slide 3: Motivation 2: Clinical Diagnostics
Source: "Validation Study of Existing Gene Expression Signatures for Anti-TNF Treatment in Patients with Rheumatoid Arthritis", PLoS One 2012.
Study       | # genes | Sensitivity (%) | Specificity (%)
Lequerre    | 20      | 71              | 61
Stuhlmuller | 11      | 79              | 56
Stuhlmuller | 8       | 67              | 56
Lequerre    | 8       | 71              | 28
Sekiguchi   | 18      | 71              | 28
Julia       | 8       | 92              | 17
Stuhlmuller | 3       | 71              | 17
Tanio       | 8       | 67              | 33
Slide 4: Multi-class Classification Problem
Multi-class classification: two or more classes, supervised learning.
Key problems:
1. Feature selection: What are the most predictive biomarkers?
2. Classification: What is the best classification algorithm?
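To make the two sub-problems concrete, here is a minimal sketch in Python with scikit-learn that chains a feature-selection step with a multi-class classifier on toy data; the selector, classifier, and data generator are illustrative assumptions, not the combination studied in these slides.

```python
# A minimal sketch of the two sub-problems: pick predictive features, then
# classify. The toy data, ANOVA selector, and random-forest classifier are
# illustrative choices, not the pairing recommended by this work.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy data: 3 classes, many candidate biomarkers, few truly informative ones.
X, y = make_classification(n_samples=120, n_features=300, n_informative=20,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

model = Pipeline([
    ("select", SelectKBest(f_classif, k=16)),          # 1. feature selection
    ("clf", RandomForestClassifier(n_estimators=200)),  # 2. classification
])

scores = cross_val_score(model, X, y, cv=3, scoring="roc_auc_ovr")
print("mean multi-class AUC:", scores.mean())
```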
Slide 5: Challenges
Different types of data: gene expression, epigenetic data (methylation, histone modification), proteomics, metabolomics, phenotypes.
Different platforms: microarray, sequencing, in-situ hybridization.
Different resolutions: discrete vs. continuous, sparse vs. complete.
Slide 6: Minimal Unique Marker Panel Selection (Mumps) Pipeline
Input: the desired number of biomarkers.
Nested cross-validation: parameterize each combination of feature selection and classification algorithms, rank the models by AUC in the inner cross-validation, and evaluate them in the outer cross-validation (sketched below).
Output: the best features and classifier.
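A minimal sketch of the nested cross-validation scheme, assuming a scikit-learn implementation; the SVM-RFE + KNN pipeline, hyper-parameter grid, fold counts, and toy data are placeholders rather than the pipeline's actual configuration.

```python
# A sketch of nested cross-validation for ranking feature-selection /
# classifier combinations by AUC. Toy data and a single SVM-RFE + KNN
# pipeline stand in for the real microarray profiles and the full set of
# algorithm combinations.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy multi-class, high-dimensional data standing in for the expression profiles.
X, y = make_classification(n_samples=82, n_features=500, n_informative=40,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

pipe = Pipeline([
    ("select", RFE(SVC(kernel="linear"), n_features_to_select=32, step=0.2)),
    ("clf", KNeighborsClassifier()),
])
param_grid = {"clf__n_neighbors": [3, 5, 7]}  # hypothetical hyper-parameter grid

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# The inner loop ranks parameterizations by multi-class AUC; the outer loop
# estimates how well the whole selection procedure generalizes.
model = GridSearchCV(pipe, param_grid, scoring="roc_auc_ovr", cv=inner_cv)
outer_scores = cross_val_score(model, X, y, scoring="roc_auc_ovr", cv=outer_cv)
print("outer-fold AUCs:", outer_scores)
```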
Slide 7: Algorithms
Feature selection: (SVM-)recursive feature elimination (RFE), ANOVA F-value, Random Forests, Extra Trees.
Classification: Correlation, Cosine, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Decision Tree, Random Forests, Extra Trees, Gradient Boosting. (Enumerating these combinations is sketched below.)
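One way to enumerate the feature-selection / classifier combinations, using scikit-learn analogues of the listed algorithms; the correlation and cosine classifiers are approximated here by nearest-neighbor rules over those distance metrics, and the panel size k is a placeholder, so this is an illustrative sketch rather than the presenters' implementation.

```python
# A sketch of enumerating feature-selection / classifier combinations with
# scikit-learn analogues of the listed algorithms.
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

k = 32  # hypothetical panel size

selectors = {
    "svm_rfe":       RFE(SVC(kernel="linear"), n_features_to_select=k),
    "anova_f":       SelectKBest(f_classif, k=k),
    # Keep at most k features ranked by tree-based importance.
    "random_forest": SelectFromModel(RandomForestClassifier(n_estimators=200),
                                     max_features=k),
    "extra_trees":   SelectFromModel(ExtraTreesClassifier(n_estimators=200),
                                     max_features=k),
}

classifiers = {
    # Nearest-neighbor stand-ins for the correlation and cosine classifiers.
    "correlation": KNeighborsClassifier(n_neighbors=1, metric="correlation",
                                        algorithm="brute"),
    "cosine":      KNeighborsClassifier(n_neighbors=1, metric="cosine",
                                        algorithm="brute"),
    "knn":         KNeighborsClassifier(n_neighbors=5),
    "svm":         SVC(kernel="linear", probability=True),
    "decision_tree":     DecisionTreeClassifier(),
    "random_forest":     RandomForestClassifier(n_estimators=200),
    "extra_trees":       ExtraTreesClassifier(n_estimators=200),
    "gradient_boosting": GradientBoostingClassifier(),
}

# Every (selector, classifier) pair becomes one candidate pipeline to rank.
candidates = {(s, c): Pipeline([("select", sel), ("clf", clf)])
              for s, sel in selectors.items()
              for c, clf in classifiers.items()}
print(len(candidates), "candidate pipelines")
```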
Slide 8: Datasets
Broad Institute: Affymetrix gene expression microarray; 15 hematopoietic cell types; 82 samples; 4-7 samples per cell type.
Multiple sources: 70 samples; approximately 3-7 samples per cell type; Affymetrix & Illumina BeadArray; different labs.
Slide 9: Experiments
Complete: full gene expression profiles from the microarray datasets.
Simulated sparse: 70% and 50% missing data; the coverage of a marker (the fraction of cell types having a known expression status for that marker) followed a Beta distribution; fifteen simulations (sketched below).
Cross-validation: 3-fold, stratified; # features: 2, 8, 16, 32, 64, 96, 128, 256, and 384; the best set of features and classifier is chosen for each # of features.
External validation: use the Broad data for training and test against the external datasets.
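A minimal sketch of the sparse-data simulation, assuming present/absent calls and NumPy; the Beta parameters are illustrative values chosen to give roughly 70% missing data, since the slides do not state the exact parameters.

```python
# A sketch of simulating sparse marker data: each marker's coverage (the
# fraction of cell types with a known status for it) is drawn from a Beta
# distribution, and the remaining entries are masked as missing.
import numpy as np

rng = np.random.default_rng(0)
n_cell_types, n_markers = 50, 600
expression = rng.integers(0, 2, size=(n_cell_types, n_markers))  # present/absent calls

# Hypothetical Beta parameters tuned so mean coverage is roughly 30% (~70% missing).
coverage = rng.beta(a=2.0, b=4.7, size=n_markers)

sparse = expression.astype(float)
for j in range(n_markers):
    # Hide each cell type's status for marker j with probability 1 - coverage[j].
    hidden = rng.random(n_cell_types) > coverage[j]
    sparse[hidden, j] = np.nan

print("fraction missing:", np.isnan(sparse).mean())
```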
Slide 10: Performance: Complete Data
Slide 11: By Algorithm: Complete Data
Slide 12: Performance: 70% Missing
Slide 13: Summary: Best Algorithms
# of markers | Complete (FS / CL) | 70% missing (FS / CL)
2            | RFE / KNN          | RFE / Extra Trees
8            | RFE / Cosine       | RFE / Cosine
16           | RFE / Cosine       | RFE / Cosine
32           | RFE / Cosine       | RFE / Cosine
64           | RFE / Cosine       | RFE / Cosine
96           | RFE / Cosine       | RFE / Correlation
128          | RFE / Cosine       | RFE / Correlation
256          | RFE / Cosine       | RFE / Correlation
384          | RFE / Correlation  | RFE / Correlation
(FS = feature selection, CL = classifier)
Slide 14: Why the Big Gap?
Candidate explanations: cross-platform normalization, similarities between cell types, over-fitting.
Correlation: Broad vs. external data.
Slide 15: Results
Mesoderm Cell-type Identification:
# genes | AUC
8       | 73%
16      | 74%
32      | 76%
64      | 78%
96      | 87%
128     | 91%
256     | 91%
384     | 92%

Anti-TNF Responsiveness:
Study       | # genes | Sensitivity (%) | Specificity (%)
Lequerre    | 20      | 71              | 61
Stuhlmuller | 11      | 79              | 56
Stuhlmuller | 8       | 67              | 56
Lequerre    | 8       | 71              | 28
Sekiguchi   | 18      | 71              | 28
Julia       | 8       | 92              | 17
Stuhlmuller | 3       | 71              | 17
Tanio       | 8       | 67              | 33
UCONN       | 8       | 83              |
UCONN       | 2048    | 94              | 96
Slide 16: Future Work
Broader data types: NCI-60 (mRNA microarray, microRNA, copy number variation, protein array, SNPs, …).
Minimizing over-fitting: cross-platform normalization.
Different data types: integrate multiple data types simultaneously.
Slide 17: Conclusion and Thanks
Thanks to: Ed Hemphill, Chih Lee, Ion Mandoiu, Craig Nelson.
Smpl Bio: a commercial service coming in late 2013.
Slide 18: Don't Go Beyond, 'Tis a Silly Place (Extra Slides)
Slide 19: Experiment Overview
Nested cross-validation on the Broad data: parameterize each combination of feature selection and classification algorithms, rank the models by AUC in the inner cross-validation, and output the best features and classifier (input: # of biomarkers).
External testing: test the best model on the external data; output: AUC of the best features / classifier (sketched below).
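A minimal sketch of the external-testing step, assuming scikit-learn; X_broad, y_broad, X_ext, and y_ext are placeholder arrays for the Broad and external expression matrices, and the RFE + cosine pipeline with 96 markers is one of the combinations favored in the summary table, not necessarily the exact model that was shipped to external testing.

```python
# A sketch of external testing: fit the selected pipeline on the training
# (Broad) data and report multi-class AUC on an independent external set.
from sklearn.feature_selection import RFE
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

def external_auc(X_broad, y_broad, X_ext, y_ext, n_markers=96):
    """Train a winning selector/classifier pair and score it externally."""
    best_model = Pipeline([
        ("select", RFE(SVC(kernel="linear"), n_features_to_select=n_markers)),
        # Nearest-neighbor stand-in for the cosine classifier.
        ("clf", KNeighborsClassifier(n_neighbors=1, metric="cosine",
                                     algorithm="brute")),
    ])
    best_model.fit(X_broad, y_broad)
    proba = best_model.predict_proba(X_ext)
    # One-vs-rest multi-class AUC, matching the ranking criterion used above.
    return roc_auc_score(y_ext, proba, multi_class="ovr")
```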
Slide 20: Performance: 50% Missing