1
Pathways as robust biomarkers for cancer classification: the power of big expression data
David Amar, Tom Hait, and Ron Shamir
Blavatnik School of Computer Science, Tel Aviv University
2
Motivation and introduction
3
Comparative genomics
Standard expression experiments: cases vs. controls -> differential genes -> interpretation
Problems:
  Small number of samples
  Non-specific signal
  Interpretation of a gene set / gene ranking
Goal: find specific changes for a tested disease
  E.g., an up-regulated pathway
  Crucial for clinical studies
4
Previous integrative classification studies
Huang et al. PNAS (9,160 samples); Schmid et al. PNAS 2012 (3,030); Lee et al. Bioinformatics 2013 (~14,000)
  Multilabel classification
  Global expression patterns
  Only 1-3 platforms
  Many datasets were removed from GEO
  No “healthy” class (Huang); no diseases (Lee)
Pathprint (Altschuler et al. 2013)
  Uses pathways
  Tissue classification (as in Lee et al.)
5
Integrating pathways and molecular profiles
Enrichment tests
  Improve interpretability
  GSEA/GSA: rank-based, higher statistical power
Classification
  Extract pathway features
  Example: given a pathway, remove non-differential genes
  Not clear whether prediction performance improves compared to using genes (Staiger et al. 2013)
6
Pathway-based gene expression database
7
[Pipeline diagram]
Inputs: pathways (KEGG, Reactome, Biocarta, NCI); expression profiles (GSE, GDS, TCGA) with platform data; sample labels (disease, taken from dataset/sample descriptions)
Single sample analysis: for sample j, ranked genes/transcripts g1, g2, g3, …, gk -> weighted ranks w1, w2, w3, …, wk -> standardized profile (low to high expression)
Single sample, single pathway analysis: for each pathway, compute mean and SD
Outputs: XP (samples x pathway features) and Y (samples x labels)
8
Single sample analysis
Input: an expression profile of a sample
  A vector of real values for each patient
Step 1: rank the genes
Step 2: calculate a score for each gene from:
  the rank of gene g in sample s
  the total number of ranked genes
(Yang et al. 2012, 2013)
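The two steps can be sketched in a few lines. The scoring formula on the slide did not survive extraction, so this sketch uses the normalized rank (rank divided by the number of ranked genes) as a stand-in assumption; it is not the weighting of Yang et al. The function name and the toy gene values are hypothetical.

```python
def rank_scores(expression):
    """Map one sample's expression values to rank-based scores in (0, 1].

    ASSUMPTION: score = rank / N. The slide's actual weighting is not shown.
    """
    n = len(expression)
    # Step 1: order genes by expression, lowest first (ties broken by name).
    ordered = sorted(expression, key=lambda g: (expression[g], g))
    # Step 2: rank 1 is the lowest expression, rank n the highest.
    return {gene: (rank + 1) / n for rank, gene in enumerate(ordered)}


# Toy expression profile for one sample (made-up values).
sample = {"TP53": 7.2, "BRCA1": 3.1, "MYC": 9.8, "EGFR": 5.0}
scores = rank_scores(sample)
```

Because only ranks are used, the scores are comparable across samples measured on different platforms, which is the point of a single-sample analysis.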
9
Pathway features
Pathway DBs: KEGG, Reactome, Biocarta, NCI
  1723 pathways in total
  Covering 7842 genes
  Mean size: (median 15)
Score all genes that are in the pathway databases
Pathway statistics:
  Mean score
  Standard deviation
  Skewness
  KS test
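A minimal sketch of how the four pathway statistics could be computed for one pathway in one sample, using only the standard library. The slide does not say what the KS test compares; here it is assumed to compare the scores of pathway genes with the scores of all other genes, and only the KS statistic (not a p-value) is returned. All names are hypothetical.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    gap = 0.0
    for x in a + b:
        cdf_a = sum(v <= x for v in a) / len(a)
        cdf_b = sum(v <= x for v in b) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def pathway_features(scores, pathway_genes):
    """Summarize the per-gene scores of one pathway in one sample.

    Assumes at least one scored gene inside and one outside the pathway.
    """
    inside = [scores[g] for g in pathway_genes if g in scores]
    # ASSUMPTION: the KS reference group is every scored gene not in the pathway.
    outside = [s for g, s in scores.items() if g not in pathway_genes]
    n = len(inside)
    mean = sum(inside) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in inside) / n)  # population SD
    skew = 0.0 if sd == 0 else sum((x - mean) ** 3 for x in inside) / (n * sd ** 3)
    return {"mean": mean, "sd": sd, "skewness": skew,
            "ks": ks_statistic(inside, outside)}


# Toy per-gene scores and a two-gene pathway (made-up values).
gene_scores = {"g1": 0.2, "g2": 0.4, "g3": 0.6, "g4": 0.8, "g5": 1.0}
features = pathway_features(gene_scores, {"g4", "g5"})
```

Repeating this for every pathway and every sample fills one row of XP per sample.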
10
Patient labels
Unite ~180 datasets, >14,000 samples
Public databases contain ‘free text’
Problem: automatic mapping fails. Example:
  GDS4358: “lymph-node biopsies from classic Hodgkins lymphoma HIV- patients before ABVD chemotherapy”
  MetaMap top score: “HIV infections”
Solution: manual analysis
  Read descriptions and papers
11
Current microarray data
Pathway features
  Data from GEO
  13,314 samples
  17 platforms
Sample annotation
  Ignore terms with fewer than 100 samples
  5 datasets
  48 disease terms
Matrices: XP (samples x pathway features); Y (samples x disease terms, entries in {0,1})
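The label matrix Y and the minimum-sample filter can be sketched as follows. This is an illustration under assumptions: each sample is represented by a set of disease terms, the function name is hypothetical, and the toy call lowers the threshold from 100 to 2 so the effect is visible on four samples.

```python
from collections import Counter


def build_label_matrix(sample_terms, min_samples=100):
    """Return (terms, Y), where Y[i][j] is 1 if sample i carries term j.

    Terms annotated on fewer than min_samples samples are dropped.
    """
    counts = Counter(t for terms in sample_terms for t in set(terms))
    kept = sorted(t for t, c in counts.items() if c >= min_samples)
    Y = [[1 if t in terms else 0 for t in kept] for terms in sample_terms]
    return kept, Y


# Toy annotations (made up); an empty set stands for a sample with no disease term.
annotations = [{"cancer", "breast cancer"}, {"cancer"}, {"heart disease"}, set()]
terms, Y = build_label_matrix(annotations, min_samples=2)
```

A sample may carry several terms at once (for example a specific cancer and its parent term), which is what makes the task multi-label.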
12
Analysis and results
13
Multi-label classification algorithms
Learn a single classifier for each disease
  Ignores class dependencies
Adaptation: Bayesian Correction
  Learn single classifiers
  Correct errors using the DO DAG
Transformation: use the label power sets and learn a multiclass model
Using RF: multi-label trees
  Was better than most approaches in an experimental study (Madjarov et al. 2012)
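To make the difference between the first and third strategies concrete, the sketch below shows how the same label matrix Y is turned into training targets in each case. It covers only the target transformation, not the classifiers themselves, and the function names are hypothetical.

```python
def per_disease_targets(Y):
    """One 0/1 target vector per disease (one column of Y per classifier)."""
    return [list(col) for col in zip(*Y)]


def label_powerset_targets(Y):
    """Map each distinct label combination (row of Y) to one class id."""
    class_of = {}
    targets = []
    for row in Y:
        key = tuple(row)
        if key not in class_of:
            class_of[key] = len(class_of)  # next unused class id
        targets.append(class_of[key])
    return targets, class_of


# Four samples, two disease terms (toy data).
Y = [[1, 0], [1, 1], [0, 0], [1, 1]]
per_label = per_disease_targets(Y)
classes, mapping = label_powerset_targets(Y)
```

The first form trains one binary model per column and ignores dependencies between diseases; the second trains one multiclass model whose classes are the observed label combinations.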
14
How to validate a classifier?
Use leave-dataset-out cross-validation
Global AUC scores: each prediction Pij vs. the correct label Yij
Disease-based AUC scores: consider each column separately
Output of a multi-label learner on the test set: P (samples x disease terms, probabilities in [0,1]), compared with Y (samples x disease terms, entries in {0,1})
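A minimal sketch of the validation scheme described above: splits that hold out one whole dataset at a time, plus global and per-disease AUC computed from P and Y. The dataset identifiers and all function names are hypothetical, and the pairwise AUC is written for clarity, not speed.

```python
def leave_dataset_out(dataset_ids):
    """Yield (held_out, train_idx, test_idx), holding out one dataset at a time."""
    for held_out in sorted(set(dataset_ids)):
        test = [i for i, d in enumerate(dataset_ids) if d == held_out]
        train = [i for i, d in enumerate(dataset_ids) if d != held_out]
        yield held_out, train, test


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Assumes at least one positive and one negative.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def global_auc(Y, P):
    """Pool every entry Pij against its label Yij."""
    return roc_auc([y for row in Y for y in row], [p for row in P for p in row])


def disease_auc(Y, P, j):
    """Score column j (one disease term) on its own."""
    return roc_auc([row[j] for row in Y], [row[j] for row in P])


splits = list(leave_dataset_out(["GSE_A", "GSE_A", "GSE_B", "GSE_B"]))
toy_auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
```

Holding out whole datasets keeps samples from the same study out of both training and test sets, so batch effects cannot inflate the scores.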
15
A problem (!)
What is in the background? For a disease D define:
  Positives: disease samples
  Negatives: direct controls
  Background controls (BGCs)
Example: 500 positives, 500 negatives, 10000 BGCs
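The three groups can be assigned mechanically once each sample has a disease flag and a dataset identifier. The sketch below assumes that "direct controls" means non-disease samples from a dataset that also contains cases of D, and that every other sample is a background control; the slide does not spell this out, so treat both definitions as assumptions.

```python
def split_roles(disease_flags, dataset_ids):
    """Label each sample 'positive', 'negative', or 'background' for one disease."""
    # Datasets that contain at least one case of the disease.
    disease_datasets = {d for flag, d in zip(disease_flags, dataset_ids) if flag}
    roles = []
    for flag, d in zip(disease_flags, dataset_ids):
        if flag:
            roles.append("positive")
        elif d in disease_datasets:
            roles.append("negative")    # ASSUMPTION: control from a dataset with cases
        else:
            roles.append("background")  # ASSUMPTION: any sample from another dataset
    return roles


# Dataset "A" studies the disease; dataset "B" does not (toy data).
roles = split_roles([1, 0, 0, 0], ["A", "A", "B", "B"])
```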
16
Multistep validation
It is recommended to use several scores (Lee et al. 2013)
Measure global AUPR
For each disease we calculate three scores:

Measure | Used (additional) information
AUPR: check separation between positives and all others | Sick vs. not sick
ROC: test for separation between positives and negatives | Direct use of negatives
Meta-analysis p-value: calculate the overall separation significance within the original datasets | Mapping of samples to datasets
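A short sketch of the AUPR score, computed here as average precision (one common way to estimate the area under the precision-recall curve; the slides do not say which estimator was used). The last line uses the counts from the earlier example, 500 positives against 500 negatives and 10000 background controls, to show the precision a random ranking would reach.

```python
def average_precision(labels, scores):
    """Average of the precision values at the rank of each positive sample.

    Assumes at least one positive label.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k  # precision among the top k predictions
    return total / hits


ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])

# Expected precision of a random ranking = fraction of positives among all samples.
baseline = 500 / (500 + 500 + 10000)
```

With positives making up under 5% of the samples, a ROC curve can look good while precision stays low, which is one reason to report both.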
17
Performance results
[Figure: per-disease AUPR and positives vs. negatives ROC; filled boxes mark meta-analysis q-value < 0.001]
18
Performance results
8.5% improvement in recall and 12% in precision, compared to Huang et al.
19
Validation on RNA-Seq Data from TCGA: 1,699 samples
20
Pathway-Disease network
Steps (for each of the selected diseases):
  Disease-pathway edges
    RF importance: select the top features
    Test for disease relevance
  Add edges between diseases
    Use the DO structure
  Add edges between pathways
    Based on significant overlap in genes
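The pathway-pathway step depends on a test for "significant overlap in genes". The slides do not name the test; a hypergeometric tail probability is one standard choice and is used here as an assumption. The function name and the toy gene sets are hypothetical.

```python
from math import comb


def overlap_pvalue(genes_a, genes_b, universe_size):
    """P(overlap >= observed) if genes_b were drawn at random from the universe.

    ASSUMPTION: hypergeometric null; both pathways lie inside one gene universe.
    """
    k = len(genes_a & genes_b)  # observed number of shared genes
    a, b = len(genes_a), len(genes_b)
    tail = sum(comb(a, i) * comb(universe_size - a, b - i)
               for i in range(k, min(a, b) + 1))
    return tail / comb(universe_size, b)


# Two toy pathways sharing two of three genes, in a universe of 10 genes.
p = overlap_pvalue({"g1", "g2", "g3"}, {"g2", "g3", "g4"}, universe_size=10)
```

In the setting of this talk the universe could be the 7842 genes covered by the pathway databases, and an edge would be drawn when the p-value, after multiple-testing correction, falls below a chosen threshold.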
21
Network overview
[Figure: pathway-disease network; legend: Down, Up]
22
Cancer network
[Figure: cancer sub-network; legend: Down, Up]
23
Cardiovascular disease
[Figure: cardiovascular disease sub-network; legend: Down, Up]
24
Gastric cancers
25
Summary
Large scale integration
Multi-label learning
Careful validation
Pathway based features as biomarkers
Summary of the results in a network
Currently:
  Add genes: overcome missing values
  Shows improvement in validation
26
Acknowledgements
Ron Shamir, Tom Hait