David Amar, Tom Hait, and Ron Shamir


1 Pathways as robust biomarkers for cancer classification: the power of big expression data
David Amar, Tom Hait, and Ron Shamir Blavatnik School of Computer Science Tel Aviv University

2 Motivation and introduction

3 Comparative genomics
Standard expression experiments: cases vs. controls -> differential genes -> interpretation
Problems: small number of samples; non-specific signal; interpretation of a gene set / gene ranking
Goal: find specific changes for a tested disease, e.g., an up-regulated pathway
Crucial for clinical studies

4 Previous integrative classification studies
Huang et al. PNAS (9,160 samples); Schmid et al. PNAS 2012 (3,030); Lee et al. Bioinformatics 2013 (~14,000)
Multi-label classification on global expression patterns
Limitations: only 1-3 platforms; many datasets were removed from GEO; no "healthy" class (Huang); no diseases (Lee)
Pathprint (Altschuler et al. 2013): uses pathways; tissue classification (as in Lee et al.)

5 Integrating pathways and molecular profiles
Enrichment tests: improve interpretability
GSEA/GSA: rank-based, higher statistical power
Classification: extract pathway features; example: given a pathway, remove its non-differential genes
It is not clear whether prediction performance improves compared to using genes directly (Staiger et al. 2013)

6 Pathway-based gene expression database

7 Pipeline overview (diagram)
Inputs: pathway databases (KEGG, Reactome, Biocarta, NCI) and expression profiles (GEO GSE/GDS, TCGA) with platform data
Single-sample analysis: the genes g1, g2, ..., gk of sample j are ranked, assigned weighted ranks w1, w2, ..., wk, and standardized (low to high expression)
Single-sample, single-pathway analysis: for each pathway, compute the mean and SD of its gene scores
Outputs: XP, a samples x pathway-features matrix, and Y, the sample labels (disease, dataset/sample description)

8 Single sample analysis
Input: an expression profile of a sample, i.e., a vector of real values for each patient
Step 1: rank the genes
Step 2: calculate a score for each gene from r_gs, the rank of gene g in sample s, and N, the total number of ranked genes (Yang et al. 2012, 2013)
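The two steps above can be sketched in a few lines. The exact weighting function of Yang et al. is not given in the slides, so a plain rank/N score is assumed here purely for illustration, and the gene names are hypothetical:

```python
def rank_scores(expression):
    """Turn one sample's expression profile into rank-based gene scores.

    expression: dict mapping gene -> real-valued expression level.
    Step 1: rank the genes (lowest expression gets rank 1).
    Step 2: score each gene as rank / N, where N is the total number of
    ranked genes. NOTE: the precise weighting of Yang et al. 2012/2013
    is not shown on the slide; rank / N is an illustrative stand-in.
    """
    n = len(expression)
    ordered = sorted(expression, key=expression.get)  # ascending ranks
    return {g: (i + 1) / n for i, g in enumerate(ordered)}

# Hypothetical 4-gene profile: the top-expressed gene scores 1.0
profile = {"TP53": 5.2, "BRCA1": 1.1, "MYC": 8.7, "EGFR": 3.3}
scores = rank_scores(profile)
```

Because the score depends only on within-sample ranks, it is comparable across samples from different platforms, which is what the cross-platform integration requires.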

9 Pathway features
Pathway DBs: KEGG, Reactome, Biocarta, NCI
1723 pathways in total, covering 7842 genes; mean size: (median 15)
Score all genes that appear in the pathway databases
Pathway statistics: mean score, standard deviation, skewness, KS test
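The four per-pathway statistics can be sketched as follows in pure Python. Which exact KS variant the authors used is an assumption; a one-sample KS statistic of the pathway's scores against the empirical distribution of all gene scores is shown:

```python
import bisect
import math

def ecdf(sorted_xs, x):
    """Empirical CDF: fraction of sorted_xs that are <= x."""
    return bisect.bisect_right(sorted_xs, x) / len(sorted_xs)

def pathway_features(scores, pathway_genes):
    """Summarize one pathway's gene scores within one sample.

    scores: dict gene -> rank-based score (from the single-sample step).
    Returns (mean, sd, skewness, ks), the four pathway statistics listed
    on the slide. The KS value compares the pathway scores against the
    empirical distribution of all scores (assumed variant).
    """
    vals = sorted(scores[g] for g in pathway_genes if g in scores)
    n = len(vals)
    mean = sum(vals) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in vals) / n)
    skew = (sum((v - mean) ** 3 for v in vals) / n) / sd ** 3 if sd else 0.0
    background = sorted(scores.values())
    ks = max(abs(ecdf(vals, x) - ecdf(background, x))
             for x in vals + background)
    return mean, sd, skew, ks
```

A pathway whose genes cluster near the top of the ranking yields a high mean and a large KS deviation from the background, which is exactly the signal the classifier feeds on.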

10 Patient labels Unite ~180 datasets, >14,000 samples
Public databases contain 'free text' descriptions
Problem: automatic mapping fails; example: GDS4358, "lymph-node biopsies from classic Hodgkin's lymphoma HIV- patients before ABVD chemotherapy", for which MetaMap's top-scoring term is "HIV infections"
Solution: manual analysis, reading the descriptions and the papers

11 Current microarray data
Pathway features computed from GEO data: 13,314 samples, 17 platforms
Sample annotation: ignore disease terms with fewer than 100 samples or fewer than 5 datasets, leaving 48 disease terms
Result matrices: XP (samples x pathway features) and Y (samples x disease terms, entries in {0,1})

12 Analysis and results

13 Multi-label classification algorithms
Learn a single classifier for each disease (ignores class dependencies)
Adaptation, Bayesian correction: learn single classifiers, then correct errors using the Disease Ontology (DO) DAG
Transformation: use the label power sets and learn a multiclass model using RF (multi-label trees); this was better than most approaches in an experimental study (Madjarov et al. 2012)
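The "single classifier for each disease" scheme (binary relevance) can be sketched as below. To keep the sketch self-contained, the random forest base learner is replaced with a toy nearest-centroid learner, and all class names and data are illustrative, not from the study:

```python
class NearestCentroid:
    """Toy binary learner: probability-like score from the distances to
    the positive and negative class centroids (stand-in for RF)."""
    def fit(self, X, y):
        pos = [x for x, t in zip(X, y) if t == 1]
        neg = [x for x, t in zip(X, y) if t == 0]
        self.cp = [sum(col) / len(pos) for col in zip(*pos)]
        self.cn = [sum(col) / len(neg) for col in zip(*neg)]
        return self
    def predict_proba(self, X):
        def dist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
        # Closer to the positive centroid -> score above 0.5
        return [dist(x, self.cn) / (dist(x, self.cp) + dist(x, self.cn) + 1e-12)
                for x in X]

class BinaryRelevance:
    """One independent binary classifier per disease term; class
    dependencies between diseases are ignored."""
    def __init__(self, base_factory):
        self.base_factory = base_factory
        self.models = {}
    def fit(self, X, Y):
        # Y: dict disease term -> list of 0/1 labels, one per row of X
        for label, y in Y.items():
            self.models[label] = self.base_factory().fit(X, y)
        return self
    def predict_proba(self, X):
        # dict disease term -> list of scores, one per row of X
        return {label: m.predict_proba(X) for label, m in self.models.items()}

# Illustrative toy data: feature 1 marks "cancer", feature 2 "cardio"
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y = {"cancer": [0, 0, 1, 1], "cardio": [0, 1, 0, 1]}
br = BinaryRelevance(NearestCentroid).fit(X, Y)
probs = br.predict_proba([[1, 1]])
```

The DO-based Bayesian correction and the label power-set transformation then build on exactly this per-disease output matrix.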

14 How to validate a classifier?
Use leave-dataset-out cross-validation
The output of a multi-label learner on the test set is P, a samples x disease-terms matrix of probabilities in [0,1], compared against the {0,1} label matrix Y
Global AUC scores: each prediction Pij vs. the correct label Yij
Disease-based AUC scores: consider each column separately

15 A problem (!)
What is in the background? For a disease D, define:
Positives: disease samples
Negatives: direct controls
Background controls (BGCs): all remaining samples
Example: 500 positives, 500 negatives, 10,000 BGCs

16 Multistep validation
It is recommended to use several scores (Lee et al. 2013)
Measure global AUPR; in addition, for each disease calculate three scores:
Measure | Used (additional) information
AUPR: check separation between positives and all others | Sick vs. not sick
ROC: test for separation between positives and negatives | Direct use of negatives
Meta-analysis p-value: calculate the overall separation significance within the original datasets | Mapping of samples to datasets
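The first two scores have standard definitions that can be computed directly from a disease's score column (the meta-analysis p-value additionally needs the sample-to-dataset mapping and is omitted from this sketch):

```python
def roc_auc(y_true, scores):
    """ROC AUC via the Mann-Whitney formulation: the probability that a
    random positive sample is scored above a random negative one."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def aupr(y_true, scores):
    """Area under the precision-recall curve (step-wise summation)."""
    ranked = sorted(zip(scores, y_true), key=lambda sy: -sy[0])
    n_pos = sum(y_true)
    tp, area, prev_recall = 0, 0.0, 0.0
    for i, (_, t) in enumerate(ranked, start=1):
        if t == 1:
            tp += 1
            recall, precision = tp / n_pos, tp / i
            area += (recall - prev_recall) * precision
            prev_recall = recall
    return area

# A classifier that ranks both positives first gets perfect scores
print(roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.7, 0.1]))  # 1.0
print(aupr([1, 1, 0, 0], [0.9, 0.8, 0.7, 0.1]))     # 1.0
```

For the AUPR score the "negatives" are everyone who is not positive (negatives plus BGCs), whereas the ROC score is computed on positives vs. direct negatives only, matching the table above.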

17 Performance results
AUPR, positives-vs-negatives ROC, and meta-analysis q-value < 0.001 (filled boxes)

18 Performance results 8.5% improvement in recall, 12% in precision, compared to Huang et al.

19 Validation on RNA-Seq Data from TCGA: 1,699 samples

20 Pathway-Disease network
Steps (for each of the selected diseases):
Disease-pathway edges: use RF importance to select the top features, then test for disease relevance
Disease-disease edges: use the DO structure
Pathway-pathway edges: based on significant overlap in genes

21 Network overview (figure; colors mark down- vs. up-regulation)

22 Cancer network (figure; colors mark down- vs. up-regulation)

23 Cardiovascular disease
(figure; colors mark down- vs. up-regulation)

24 Gastric cancers

25 Summary
Large-scale integration; multi-label learning; careful validation
Pathway-based features as biomarkers; summary of the results in a network
Current work: adding genes to overcome missing values; shows improvement in validation

26 Acknowledgements Ron Shamir Tom Hait

