1
Introduction to translational and clinical bioinformatics
Connecting complex molecular information to clinically relevant decisions using molecular profiles
Constantin F. Aliferis M.D., Ph.D., FACMI: Director, NYU Center for Health Informatics and Bioinformatics; Informatics Director, NYU Clinical and Translational Science Institute; Director, Molecular Signatures Laboratory; Associate Professor, Department of Pathology; Adjunct Associate Professor in Biostatistics and Biomedical Informatics, Vanderbilt University
Alexander Statnikov Ph.D.: Director, Computational Causal Discovery Laboratory; Assistant Professor, NYU Center for Health Informatics and Bioinformatics, General Internal Medicine
2
Overview
Session #1: Basic Concepts
Session #2: High-throughput assay technologies
Session #3: Computational data analytics
Session #4: Case study / practical applications
Session #5: Hands-on computer lab exercise
3
Building Cancer Diagnostic Models from Microarray Gene Expression Data
[Workflow diagram: the clinical researcher sends samples to the microarray lab and poses a hypothesis; the microarray data go to a biostatistician, a bioinformatician / computer scientist, and a programmer for experimental design, planning and execution of experiments, analysis of results, and write-up of the report; the researcher receives the report and diagnostic model after weeks or months.]
4
Automated Diagnostic System GEMS: Gene Expression Model Selector
[Workflow diagram: with the automated system, the clinical researcher sends the microarray data and hypothesis directly to GEMS and receives the report and diagnostic model in minutes or hours instead of weeks or months.]
The automated system:
- outputs high-quality models for cancer diagnosis from gene expression data;
- produces reliable performance estimates;
- allows models to be applied to unseen patients;
- discovers candidate target biomarkers.
It implements the best known diagnostic methodologies and uses sound techniques for model selection and performance estimation.
5
Other Systems for Supervised Analysis of Microarray Data
There exist many good software packages for supervised analysis of microarray data, but:
- None of these systems provides a protocol for data analysis that precludes overfitting.
- A typical package offers either an overabundance of algorithms or algorithms with unknown performance, so it is not clear to the user how to choose an optimal algorithm for a given data analysis task.
- The packages address the needs of experienced analysts, yet there is also a need for software that lets users with little data-analysis background (e.g., biologists and clinicians) achieve good results.
We built a system that addresses all of the above problems.
6
What Does Prior Research Suggest About the Best Performing Methods?
193 primary studies and 2 meta-analyses were reviewed:
- Limited range of methods & datasets per study
- No description of parameter optimization of the learners
- Different experimental designs are employed
- Overfitting [Ntzani et al., Lancet 2003]: 74% had no validation, 13% used incomplete cross-validation, 13% implemented cross-validation correctly
- The available meta-analyses are not aimed at identifying the best-performing methodologies
Conclusion: prior research cannot specify a small set of best-performing diagnostic algorithms; the evaluation has to be performed de novo.
7
Algorithmic Evaluations to Inform Development of the System
8
1st Algorithmic Evaluation Study
Main Goal: Investigate which of the many powerful classifiers currently available for gene expression diagnosis perform best across many datasets and cancer types.
[Bar chart: accuracy (%) of multiple specialized classification methods from the original primary studies vs. multiclass SVMs from this study.]
Results:
- Multi-class SVMs are the best family among the tested algorithms, outperforming KNN, NN, PNN, DT, and WV.
- Gene selection in some cases improves the classification performance of all classifiers, especially of non-SVM algorithms.
- Ensemble classification does not improve performance.
- The obtained results compare favorably with the literature.
Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 2005, 21.
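As a rough illustration of this kind of comparison (not the GEMS or published evaluation code), the sketch below cross-validates a one-vs-rest multiclass SVM against KNN on a synthetic placeholder expression matrix:

```python
# Minimal sketch (not the GEMS code): comparing a one-vs-rest multiclass SVM
# against KNN by cross-validation on an expression matrix X (samples x genes)
# and class labels y. Data here are synthetic placeholders.
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))          # 60 samples, 500 genes (placeholder)
y = rng.integers(0, 3, size=60)         # 3 tumor classes (placeholder)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "OVR linear SVM": OneVsRestClassifier(SVC(kernel="linear", C=1.0)),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```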
9
2nd Algorithmic Evaluation Study
Main Goal: Determine feature selection algorithms (applicable to high-dimensional microarray gene expression or mass-spectrometry data) that significantly reduce the number of predictors while maintaining optimal classification performance.
Result: Markov blanket techniques (e.g., HITON) provide the smallest subsets of predictors that achieve optimal classification performance.
Very preliminary report: Aliferis CF, Tsamardinos I, Statnikov A. HITON: A novel Markov Blanket algorithm for optimal variable selection. AMIA Symposium, 2003.
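A minimal sketch of the kind of check reported here, with a simple univariate filter standing in for HITON (which is not available in standard Python libraries); data are synthetic placeholders:

```python
# Sketch: does a small selected gene subset preserve classification performance?
# SelectKBest stands in for HITON here, purely for illustration.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 1000))         # placeholder expression matrix
y = rng.integers(0, 2, size=80)         # placeholder binary diagnosis

full = SVC(kernel="linear")
reduced = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),   # keep only 20 genes per training fold
    ("svm", SVC(kernel="linear")),
])
for name, model in [("all genes", full), ("20 selected genes", reduced)]:
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: CV accuracy = {acc:.3f}")
```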
10
Algorithms Implemented in GEMS
Classifiers (MC-SVM): One-Versus-Rest, One-Versus-One, DAGSVM, method by Weston & Watkins (WW), method by Crammer & Singer (CS)
Cross-validation designs: N-fold cross-validation, LOOCV, nested N-fold cross-validation, nested LOOCV
Normalization techniques: rescaling to [a, b]; (x - MEAN(x)) / STD(x); x / STD(x); x / MEAN(x); x / MEDIAN(x); x / NORM(x); x - MEAN(x); x - MEDIAN(x); ABS(x); x + ABS(x)
Gene selection methods: S2N One-Versus-Rest, S2N One-Versus-One, non-parametric ANOVA, BW ratio, HITON_MB, HITON_PC
Performance metrics: Accuracy, RCI, AUC ROC
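For concreteness, the sketch below implements a few of the listed normalization techniques as per-gene (column-wise) transforms; it is illustrative only, not the GEMS code:

```python
# Illustrative implementations of several listed normalization techniques,
# applied per gene (column) of an expression matrix X.
import numpy as np

def rescale(x, a=0.0, b=1.0):
    """Linearly rescale each gene's values to the interval [a, b]."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return a + (x - lo) * (b - a) / (hi - lo)

def standardize(x):
    """(x - MEAN(x)) / STD(x)."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

def median_center(x):
    """x - MEDIAN(x)."""
    return x - np.median(x, axis=0)

def unit_norm(x):
    """x / NORM(x), using the Euclidean norm per gene."""
    return x / np.linalg.norm(x, axis=0)

X = np.random.default_rng(2).normal(loc=5.0, size=(10, 4))  # placeholder data
print(standardize(X).mean(axis=0).round(6))  # ~0 per gene after standardization
```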
11
System Description
12
GEMS Inputs & Outputs
Inputs:
1. Dataset & outcome or diagnostic labels
2. Optional: gene names and/or accession numbers
3. Various choices of parameters for the analysis (defaults provided)
4. In application mode: a previously saved model
Outputs:
1. Classification model
2. Performance estimate
3. In application mode: the model's diagnoses/predictions & overall performance
4. A reduced set of genes
5. Links from the genes to literature and other resources
13
Cross-Validation Design
Performance estimation is done by cross-validation: the data are repeatedly split into train and test sets.
What if the classifier has tunable parameters? Model selection / parameter optimization is then done by nested cross-validation: within each training set, a further split into train and validation parts is used to choose the parameters before the model is evaluated on the held-out test set.
[Diagram: an outer cross-validation loop splits the data into train/test folds; an inner loop splits each training fold into train/validation parts for parameter optimization.]
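The following is a minimal sketch of this nested design using scikit-learn stand-ins (not the GEMS implementation); the expression matrix, labels, and parameter grid are placeholders:

```python
# Nested cross-validation sketch: the inner loop (GridSearchCV) tunes the SVM
# cost parameter C, the outer loop estimates performance of the tuned procedure.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 300))         # placeholder expression matrix
y = rng.integers(0, 2, size=100)        # placeholder binary labels

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)   # model selection
outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # performance estimation

tuned_svm = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10]}, cv=inner)
nested_scores = cross_val_score(tuned_svm, X, y, cv=outer, scoring="roc_auc")
print(f"Nested CV AUC: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```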
14
Software Architecture
Client (wizard-like user interface) offering four tasks: generate a classification model; estimate classification performance; generate a classification model and estimate its performance; apply an existing model to a new set of patients.
Computational engine:
- Normalization
- Gene selection: S2N One-Versus-Rest, S2N One-Versus-One, non-parametric ANOVA, BW ratio, HITON_PC, HITON_MB
- Classification by MC-SVM: One-Versus-Rest, One-Versus-One, DAGSVM, method by WW, method by CS
- Cross-validation loops: an outer loop (I) for performance estimation and an inner loop (II) for model selection, each using N-fold CV or LOOCV
- Performance computation: Accuracy, RCI, AUC ROC
- Report generator
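A hedged sketch of this processing chain using scikit-learn stand-ins: normalization, gene selection, and the classifier are chained in one pipeline so that each step is re-fit inside every cross-validation training fold and the gene selector never sees test samples:

```python
# Pipeline sketch mirroring the engine's structure (not GEMS code):
# normalization -> gene selection -> MC-SVM, evaluated inside the CV loop.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(90, 800))          # placeholder expression matrix
y = rng.integers(0, 3, size=90)         # placeholder 3-class labels

engine = Pipeline([
    ("normalize", StandardScaler()),                  # normalization step
    ("genes", SelectKBest(f_classif, k=50)),          # gene selection step
    ("classify", SVC(kernel="linear",                 # multiclass SVM stand-in
                     decision_function_shape="ovr")),
])
print(cross_val_score(engine, X, y, cv=5, scoring="accuracy").mean())
```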
15
User Interface
16
Steps in User Interface
1. Task selection
2. Dataset specification
3. Cross-validation design
4. Normalization
5. Logging
6. Performance metric
7. Gene selection
8. Classification
9. Report generation
10. Analysis execution
17
An Evaluation of the System:
1. Apply GEMS to datasets not involved in the algorithmic evaluations and compare results with those obtained by human analysts and published in the literature.
2. Verify the generalizability of models produced by GEMS in cross-dataset applications.
Statnikov A, Tsamardinos I, Aliferis CF. GEMS: a system for decision support and discovery from array gene expression data. International Journal of Medical Informatics, 2005 (to appear).
18
Evaluation Using New Datasets
Comparison with the literature: analyses were completed within 10-30 minutes with GEMS.
19
Verify Generalizability of Models in Cross-Dataset Applications
20
Live Demonstration of GEMS
21
Scenario 1: Binary classification model development and evaluation using a lung cancer microarray gene expression dataset. (Presenter note: go from the simple to the more complex scenario; excite the audience.)
22
Live Demo of GEMS (Scenario 1) Binary classification model development and evaluation
[Diagram: the user supplies the dataset and various analysis parameters to GEMS; GEMS returns a list of genes, a cross-validation performance estimate (AUC), and a classification model.]
Lung cancer dataset from Bhattacharjee, 2001:
- Diagnostic task: lung cancer vs. normal tissues
- Microarray platform: Affymetrix U95A
- Number of oligonucleotides: 12,600*
- Number of patients: 203
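A minimal sketch of what this scenario computes, assuming a generic linear SVM in place of the GEMS model and synthetic placeholder data; the saved file name is hypothetical:

```python
# Cross-validated AUC for a binary cancer-vs-normal classifier, then a final
# model fit on all samples and saved for later application (see Scenario 4).
import numpy as np
from joblib import dump
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(203, 1000))        # placeholder for the 203-patient dataset
y = rng.integers(0, 2, size=203)        # 1 = lung cancer, 0 = normal (placeholder)

clf = SVC(kernel="linear", probability=True)
auc = cross_val_score(clf, X, y, cv=10, scoring="roc_auc").mean()
print(f"Cross-validation AUC estimate: {auc:.3f}")

clf.fit(X, y)                           # final model trained on all data
dump(clf, "scenario1_model.joblib")     # hypothetical file for application mode
```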
23
Scenario 2: Multicategory classification model development and evaluation using a small round blue cell tumor (SRBCT) microarray gene expression dataset.
24
Live Demo of GEMS (Scenario 2) Multicategory classification model development and evaluation
[Diagram: the user supplies the dataset and various analysis parameters to GEMS; GEMS returns a list of genes, a cross-validation performance estimate (AUC), and a classification model.]
Small round blue cell tumor (SRBCT) dataset from Khan, 2001:
- Diagnostic task: Ewing sarcoma vs. rhabdomyosarcoma vs. Burkitt lymphoma vs. neuroblastoma
- Microarray platform: cDNA
- Number of probes: 2,308
- Number of patients: 63
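A sketch of a multicategory AUC estimate of the kind reported here, using a one-vs-rest macro-averaged AUC over out-of-fold predicted probabilities; the classifier and data are placeholders, not the GEMS implementation:

```python
# Multicategory AUC sketch: pool out-of-fold class probabilities and score them
# with a one-vs-rest averaged AUC. Data are synthetic placeholders for SRBCT.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(6)
X = rng.normal(size=(63, 2308))         # 63 patients, 2,308 probes (placeholder)
y = rng.integers(0, 4, size=63)         # EWS / RMS / BL / NB coded 0..3 (placeholder)

clf = SVC(kernel="linear", probability=True)
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")
auc = roc_auc_score(y, proba, multi_class="ovr")
print(f"Multiclass AUC (one-vs-rest, macro-averaged): {auc:.3f}")
```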
25
Scenario 3: Validating the reproducibility of genes selected in Scenario 1 using another lung cancer microarray gene expression dataset.
26
Live Demo of GEMS (Scenario 3): Are the genes selected in Scenario 1 reproducible in another dataset?
[Diagram: the user supplies the list of genes from Scenario 1 (Bhattacharjee's data), the new dataset, and various analysis parameters to GEMS; GEMS returns a cross-validation performance estimate (AUC) and a classification model.]
Lung cancer dataset from Beer, 2002:
- Diagnostic task: lung cancer vs. normal tissues
- Microarray platform: Affymetrix HuGeneFL
- Number of oligonucleotides: 7,129*
- Number of patients: 96
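A sketch of this reproducibility check under stated assumptions: the gene names, the second dataset, and the classifier are illustrative placeholders, and real use would require mapping probes between the two array platforms:

```python
# Restrict a second dataset to the genes selected on the first dataset and
# re-estimate cross-validated performance using only those genes.
import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

selected_genes = ["TP53", "EGFR", "KRT5"]        # hypothetical list from Scenario 1

rng = np.random.default_rng(7)
all_genes = ["TP53", "EGFR", "KRT5", "GAPDH", "ACTB"]
beer_like = pd.DataFrame(rng.normal(size=(96, 5)), columns=all_genes)  # placeholder
y = rng.integers(0, 2, size=96)                  # placeholder diagnoses

common = [g for g in selected_genes if g in beer_like.columns]  # genes on both platforms
X_sub = beer_like[common].to_numpy()
auc = cross_val_score(SVC(kernel="linear"), X_sub, y, cv=5, scoring="roc_auc").mean()
print(f"AUC in second dataset using {len(common)} previously selected genes: {auc:.3f}")
```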
27
Scenario 4: Verifying generalizability of the classification model produced in Scenario 1 using another lung cancer microarray gene expression dataset.
28
Live Demo of GEMS (Scenario 4): Is the constructed classification model generalizable to another microarray dataset?
[Diagram: the user supplies the classification model from Scenario 1 (Bhattacharjee's data), the new dataset, and various analysis parameters to GEMS; GEMS returns a performance estimate (AUC).]
Lung cancer dataset from Beer, 2002:
- Diagnostic task: lung cancer vs. normal tissues
- Microarray platform: Affymetrix HuGeneFL
- Number of oligonucleotides: 7,129*
- Number of patients: 96
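A sketch of the application-mode workflow, assuming the model saved in the Scenario 1 sketch and a second dataset already mapped to the same feature space; the file name and data are placeholders:

```python
# Load the previously saved model and apply it, without re-training, to a
# second dataset to check cross-dataset generalizability.
import numpy as np
from joblib import load
from sklearn.metrics import roc_auc_score

model = load("scenario1_model.joblib")            # hypothetical model file from Scenario 1

rng = np.random.default_rng(8)
X_new = rng.normal(size=(96, 1000))               # second dataset, same feature space (placeholder)
y_new = rng.integers(0, 2, size=96)               # placeholder diagnoses

scores = model.decision_function(X_new)           # no re-fitting: pure application mode
print(f"Cross-dataset AUC: {roc_auc_score(y_new, scores):.3f}")
```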
29
GEMS in a Nutshell
- The system is fully automated, yet provides many optional features for the seasoned analyst.
- The system is based on a nested cross-validation design that avoids overfitting.
- GEMS's algorithms were chosen after two extensive algorithmic evaluations.
- After the system was built, it was validated in cross-dataset applications and on new datasets.
- GEMS has an intuitive wizard-like user interface that abstracts the data analysis process.
- GEMS has a convenient client-server architecture.
We recognize that there are many good software systems for the analysis of microarray gene expression data, and we use some of them ourselves. However, GEMS has several distinctive features that make it unique and very useful for the community.