Slide 1: Biomarker and Classifier Selection in Diverse Genetic Datasets
James Lindsay¹, Ed Hemphill², Chih Lee¹, Ion Mandoiu¹, Craig Nelson²
University of Connecticut
¹Department of Computer Science and Engineering
²Department of Molecular and Cell Biology
Slide 2: Motivation 1: Cell-type Identification
The Question: What is the smallest number of genes needed to identify each cluster (B: Bone, C: Myeloid, D: Endothelial)?
Available Data: Literature-annotated present/absent calls for 50 cell types and 600 genes in the mesoderm lineage.
In collaboration with Dr. Hector Leonardo Aguila, UCHC.
Slide 3: Motivation 2: Clinical Diagnostics
Source: "Validation Study of Existing Gene Expression Signatures for Anti-TNF Treatment in Patients with Rheumatoid Arthritis", PLoS One 2012.
Study       | # genes | Sensitivity (%) | Specificity (%)
Lequerre    | 20      | 71              | 61
Stuhlmuller | 11      | 79              | 56
Stuhlmuller | 8       | 67              | 56
Lequerre    | 8       | 71              | 28
Sekiguchi   | 18      | 71              | 28
Julia       | 8       | 92              | 17
Stuhlmuller | 3       | 71              | 17
Tanio       | 8       | 67              | 33
Slide 4: Multi-class Classification Problem
Multi-class classification: two or more classes, supervised learning.
Key problems:
1. Feature selection: What are the most predictive biomarkers?
2. Classification: What is the best classification algorithm?
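To make the two sub-problems concrete, here is a minimal sketch in Python with scikit-learn that chains a feature-selection step with a multi-class classifier on toy data; the selector, classifier, and data generator are illustrative assumptions, not the combination studied in these slides.

```python
# A minimal sketch of the two sub-problems: pick predictive features, then
# classify. The toy data, ANOVA selector, and random-forest classifier are
# illustrative choices, not the pairing recommended by this work.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy data: 3 classes, many candidate biomarkers, few truly informative ones.
X, y = make_classification(n_samples=120, n_features=300, n_informative=20,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

model = Pipeline([
    ("select", SelectKBest(f_classif, k=16)),          # 1. feature selection
    ("clf", RandomForestClassifier(n_estimators=200)),  # 2. classification
])

scores = cross_val_score(model, X, y, cv=3, scoring="roc_auc_ovr")
print("mean multi-class AUC:", scores.mean())
```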
Slide 5: Challenges
Different types of data: gene expression, epigenetic data (methylation, histone modification), proteomics, metabolomics, phenotypes.
Different platforms: microarray, sequencing, in-situ hybridization.
Different resolutions: discrete vs. continuous, sparse vs. complete.
Slide 6: Minimal Unique Marker Panel Selection (Mumps) Pipeline
Input: the desired number of biomarkers.
Nested cross-validation: parameterize each combination of feature selection and classification algorithms, rank the models by AUC in the inner cross-validation, and evaluate them in the outer cross-validation (sketched below).
Output: the best features and classifier.
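A minimal sketch of the nested cross-validation scheme, assuming a scikit-learn implementation; the SVM-RFE + KNN pipeline, hyper-parameter grid, fold counts, and toy data are placeholders rather than the pipeline's actual configuration.

```python
# A sketch of nested cross-validation for ranking feature-selection /
# classifier combinations by AUC. Toy data and a single SVM-RFE + KNN
# pipeline stand in for the real microarray profiles and the full set of
# algorithm combinations.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy multi-class, high-dimensional data standing in for the expression profiles.
X, y = make_classification(n_samples=82, n_features=500, n_informative=40,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

pipe = Pipeline([
    ("select", RFE(SVC(kernel="linear"), n_features_to_select=32, step=0.2)),
    ("clf", KNeighborsClassifier()),
])
param_grid = {"clf__n_neighbors": [3, 5, 7]}  # hypothetical hyper-parameter grid

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# The inner loop ranks parameterizations by multi-class AUC; the outer loop
# estimates how well the whole selection procedure generalizes.
model = GridSearchCV(pipe, param_grid, scoring="roc_auc_ovr", cv=inner_cv)
outer_scores = cross_val_score(model, X, y, scoring="roc_auc_ovr", cv=outer_cv)
print("outer-fold AUCs:", outer_scores)
```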
Slide 7: Algorithms
Feature selection: (SVM-)recursive feature elimination (RFE), ANOVA F-value, Random Forests, Extra Trees.
Classification: Correlation, Cosine, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Decision Tree, Random Forests, Extra Trees, Gradient Boosting. (Enumerating these combinations is sketched below.)
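One way to enumerate the feature-selection / classifier combinations, using scikit-learn analogues of the listed algorithms; the correlation and cosine classifiers are approximated here by nearest-neighbor rules over those distance metrics, and the panel size k is a placeholder, so this is an illustrative sketch rather than the presenters' implementation.

```python
# A sketch of enumerating feature-selection / classifier combinations with
# scikit-learn analogues of the listed algorithms.
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

k = 32  # hypothetical panel size

selectors = {
    "svm_rfe":       RFE(SVC(kernel="linear"), n_features_to_select=k),
    "anova_f":       SelectKBest(f_classif, k=k),
    # Keep at most k features ranked by tree-based importance.
    "random_forest": SelectFromModel(RandomForestClassifier(n_estimators=200),
                                     max_features=k),
    "extra_trees":   SelectFromModel(ExtraTreesClassifier(n_estimators=200),
                                     max_features=k),
}

classifiers = {
    # Nearest-neighbor stand-ins for the correlation and cosine classifiers.
    "correlation": KNeighborsClassifier(n_neighbors=1, metric="correlation",
                                        algorithm="brute"),
    "cosine":      KNeighborsClassifier(n_neighbors=1, metric="cosine",
                                        algorithm="brute"),
    "knn":         KNeighborsClassifier(n_neighbors=5),
    "svm":         SVC(kernel="linear", probability=True),
    "decision_tree":     DecisionTreeClassifier(),
    "random_forest":     RandomForestClassifier(n_estimators=200),
    "extra_trees":       ExtraTreesClassifier(n_estimators=200),
    "gradient_boosting": GradientBoostingClassifier(),
}

# Every (selector, classifier) pair becomes one candidate pipeline to rank.
candidates = {(s, c): Pipeline([("select", sel), ("clf", clf)])
              for s, sel in selectors.items()
              for c, clf in classifiers.items()}
print(len(candidates), "candidate pipelines")
```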
Slide 8: Datasets
Broad Institute: Affymetrix gene expression microarray; 15 hematopoietic cell types; 82 samples; 4-7 samples per cell type.
Multiple sources: 70 samples; approximately 3-7 samples per cell type; Affymetrix & Illumina BeadArray; different labs.
Slide 9: Experiments
Complete: full gene expression profiles from the microarray datasets.
Simulated sparse: 70% and 50% missing data; the coverage of a marker (the fraction of cell types having a known expression status for that marker) followed a Beta distribution; fifteen simulations (sketched below).
Cross-validation: 3-fold, stratified; # features: 2, 8, 16, 32, 64, 96, 128, 256, and 384; the best set of features and classifier is chosen for each # of features.
External validation: use the Broad data for training and test against the external datasets.
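A minimal sketch of the sparse-data simulation, assuming present/absent calls and NumPy; the Beta parameters are illustrative values chosen to give roughly 70% missing data, since the slides do not state the exact parameters.

```python
# A sketch of simulating sparse marker data: each marker's coverage (the
# fraction of cell types with a known status for it) is drawn from a Beta
# distribution, and the remaining entries are masked as missing.
import numpy as np

rng = np.random.default_rng(0)
n_cell_types, n_markers = 50, 600
expression = rng.integers(0, 2, size=(n_cell_types, n_markers))  # present/absent calls

# Hypothetical Beta parameters tuned so mean coverage is roughly 30% (~70% missing).
coverage = rng.beta(a=2.0, b=4.7, size=n_markers)

sparse = expression.astype(float)
for j in range(n_markers):
    # Hide each cell type's status for marker j with probability 1 - coverage[j].
    hidden = rng.random(n_cell_types) > coverage[j]
    sparse[hidden, j] = np.nan

print("fraction missing:", np.isnan(sparse).mean())
```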
Slide 10: Performance: Complete Data
Slide 11: By Algorithm: Complete Data
Slide 12: Performance: 70% Missing
Slide 13: Summary: Best Algorithms
# of markers | Complete (FS / CL) | 70% missing (FS / CL)
2            | RFE / KNN          | RFE / Extra Trees
8            | RFE / Cosine       | RFE / Cosine
16           | RFE / Cosine       | RFE / Cosine
32           | RFE / Cosine       | RFE / Cosine
64           | RFE / Cosine       | RFE / Cosine
96           | RFE / Cosine       | RFE / Correlation
128          | RFE / Cosine       | RFE / Correlation
256          | RFE / Cosine       | RFE / Correlation
384          | RFE / Correlation  | RFE / Correlation
(FS = feature selection, CL = classifier)
Slide 14: Why the Big Gap?
Candidate explanations: cross-platform normalization, similarities between cell types, over-fitting.
Correlation: Broad vs. external data.
Slide 15: Results
Mesoderm Cell-type Identification:
# genes | AUC
8       | 73%
16      | 74%
32      | 76%
64      | 78%
96      | 87%
128     | 91%
256     | 91%
384     | 92%

Anti-TNF Responsiveness:
Study       | # genes | Sensitivity (%) | Specificity (%)
Lequerre    | 20      | 71              | 61
Stuhlmuller | 11      | 79              | 56
Stuhlmuller | 8       | 67              | 56
Lequerre    | 8       | 71              | 28
Sekiguchi   | 18      | 71              | 28
Julia       | 8       | 92              | 17
Stuhlmuller | 3       | 71              | 17
Tanio       | 8       | 67              | 33
UCONN       | 8       | 83              |
UCONN       | 2048    | 94              | 96
Slide 16: Future Work
Broader data types: NCI-60 (mRNA microarray, microRNA, copy number variation, protein array, SNPs, …).
Minimizing over-fitting: cross-platform normalization.
Different data types: integrate multiple data types simultaneously.
Slide 17: Conclusion and Thanks
Thanks to: Ed Hemphill, Chih Lee, Ion Mandoiu, Craig Nelson.
Smpl Bio: a commercial service coming in late 2013.
Slide 18: Don't Go Beyond, 'Tis a Silly Place (Extra Slides)
Slide 19: Experiment Overview
Nested cross-validation on the Broad data: parameterize each combination of feature selection and classification algorithms, rank the models by AUC in the inner cross-validation, and output the best features and classifier (input: # of biomarkers).
External testing: test the best model on the external data; output: AUC of the best features / classifier (sketched below).
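A minimal sketch of the external-testing step, assuming scikit-learn; X_broad, y_broad, X_ext, and y_ext are placeholder arrays for the Broad and external expression matrices, and the RFE + cosine pipeline with 96 markers is one of the combinations favored in the summary table, not necessarily the exact model that was shipped to external testing.

```python
# A sketch of external testing: fit the selected pipeline on the training
# (Broad) data and report multi-class AUC on an independent external set.
from sklearn.feature_selection import RFE
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

def external_auc(X_broad, y_broad, X_ext, y_ext, n_markers=96):
    """Train a winning selector/classifier pair and score it externally."""
    best_model = Pipeline([
        ("select", RFE(SVC(kernel="linear"), n_features_to_select=n_markers)),
        # Nearest-neighbor stand-in for the cosine classifier.
        ("clf", KNeighborsClassifier(n_neighbors=1, metric="cosine",
                                     algorithm="brute")),
    ])
    best_model.fit(X_broad, y_broad)
    proba = best_model.predict_proba(X_ext)
    # One-vs-rest multi-class AUC, matching the ranking criterion used above.
    return roc_auc_score(y_ext, proba, multi_class="ovr")
```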
Slide 20: Performance: 50% Missing