David Amar, Tom Hait, and Ron Shamir


Pathways as robust biomarkers for cancer classification: the power of big expression data
David Amar, Tom Hait, and Ron Shamir
Blavatnik School of Computer Science, Tel Aviv University

Motivation and introduction

Comparative genomics
Standard expression experiments: cases vs. controls -> differential genes -> interpretation
Problems:
  Small number of samples
  Non-specific signal
  Interpretation of a gene set / gene ranking
Goal: find specific changes for a tested disease
  E.g., an up-regulated pathway
  Crucial for clinical studies

Previous integrative classification studies
Huang et al. 2010 PNAS (9,160 samples); Schmid et al. 2012 PNAS (3,030); Lee et al. 2013 Bioinformatics (~14,000)
  Multilabel classification
  Global expression patterns
  Only 1-3 platforms
  Many datasets were removed from GEO
  No "healthy" class (Huang); no diseases (Lee)
Pathprint (Altschuler et al. 2013)
  Uses pathways
  Tissue classification (as in Lee et al.)

Integrating pathways and molecular profiles
Enrichment tests
  Improve interpretability
  GSEA/GSA
    Rank-based
    Higher statistical power
Classification
  Extract pathway features
  Example: given a pathway, remove non-differential genes
  Not clear whether prediction performance improves compared to using genes (Staiger et al. 2013)

Pathway-based gene expression database

[Diagram: construction of the database]
Inputs:
  Expression profiles (GSE, GDS, TCGA) with platform data
  Pathways (KEGG, Reactome, Biocarta, NCI)
  Sample labels (disease, dataset/sample description)
Single sample analysis: for sample j, ranked genes/transcripts g1, g2, g3, …, gk -> weighted ranks w1, w2, w3, …, wk (standardized profile, from low to high expression)
Single sample - single pathway analysis: for each pathway, mean and SD
Outputs:
  XP: samples x pathway features
  Y: samples x sample labels

Single sample analysis
Input: an expression profile of a sample
  A vector of real values for each patient
Step 1: rank the genes
Step 2: calculate a score for each gene, based on:
  The rank of gene g in sample s
  The total number of ranked genes
(Yang et al. 2012, 2013)
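The two steps can be sketched in a few lines of Python. The scoring formula itself is not preserved in this transcript, so `rank_scores` below uses the plain normalized rank r/N as a stand-in. That choice, the function names, and the toy gene values are assumptions for illustration, not the formula of Yang et al.

```python
def rank_genes(profile):
    """Step 1: map each gene to its rank (1 = lowest expression)."""
    ordered = sorted(profile, key=lambda gene: profile[gene])
    return {gene: i + 1 for i, gene in enumerate(ordered)}


def rank_scores(profile):
    """Step 2: score each gene from its rank and the number of ranked genes.

    ASSUMPTION: the score is the normalized rank r / N. The slide's actual
    formula is not available here; this is only a stand-in.
    """
    ranks = rank_genes(profile)
    n = len(ranks)
    return {gene: r / n for gene, r in ranks.items()}


# Toy expression profile of one sample (hypothetical values).
sample = {"TP53": 7.2, "MYC": 9.8, "GAPDH": 12.1, "BRCA1": 5.4}
scores = rank_scores(sample)
```

Because the score depends only on within-sample ranks, it can be computed one sample at a time, independently of the platform's measurement scale.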

Pathway features
Pathway DBs: KEGG, Reactome, Biocarta, NCI
  1723 pathways in total
  Covering 7842 genes
  Mean size: 36.35 (median 15)
Score all genes that are in the pathway databases
Pathway statistics:
  Mean score
  Standard deviation
  Skewness
  KS test
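As a rough illustration of the four pathway statistics, the sketch below computes them for one pathway in one sample using only the standard library. The slide does not say what the KS test compares; here it is assumed to be the pathway's gene scores against the scores of all other scored genes. All names and toy values are hypothetical.

```python
import math
from bisect import bisect_right


def mean(xs):
    return sum(xs) / len(xs)


def std(xs):
    """Population standard deviation."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def skewness(xs):
    """Third standardized moment; 0.0 when all values are equal."""
    m, s = mean(xs), std(xs)
    if s == 0:
        return 0.0
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def pathway_features(scores, pathway_genes):
    """Summarize one pathway in one sample.

    ASSUMPTION: the KS statistic compares pathway genes with all other
    scored genes. Both groups must be non-empty.
    """
    inside = [scores[g] for g in sorted(pathway_genes) if g in scores]
    outside = [v for g, v in scores.items() if g not in pathway_genes]
    return {
        "mean": mean(inside),
        "sd": std(inside),
        "skewness": skewness(inside),
        "ks": ks_statistic(inside, outside),
    }


# Toy per-gene scores for one sample and one pathway (hypothetical).
scores = {"g1": 0.1, "g2": 0.2, "g3": 0.3, "g4": 0.8, "g5": 0.9, "g6": 1.0}
features = pathway_features(scores, {"g4", "g5", "g6"})
```

Repeating this for every pathway and every sample would give one row of pathway features per sample.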

Patient labels
Unite ~180 datasets, >14,000 samples
Public databases contain 'free text'
Problem: automatic mapping fails
  Example: GDS4358: "lymph-node biopsies from classic Hodgkins lymphoma HIV- patients before ABVD chemotherapy"
  MetaMap top score: "HIV infections"
Solution: manual analysis
  Read descriptions and papers

Current microarray data
Pathway features
  Data from GEO
  13,314 samples
  17 platforms
Sample annotation
  Ignore terms with fewer than 100 samples / 5 datasets
  48 disease terms
XP: samples x pathway features
Y: samples x disease terms, entries in {0,1}
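A minimal sketch of how a binary label matrix Y with term filtering might be built is shown below. Reading "5 datasets" as a second minimum (a term must appear in at least that many datasets) is an assumption, as are the function name, the input layout, and the toy identifiers.

```python
from collections import defaultdict


def build_label_matrix(annotations, min_samples=100, min_datasets=5):
    """Build a samples x disease-terms matrix with entries in {0, 1}.

    annotations: list of (sample_id, dataset_id, set_of_disease_terms).
    ASSUMPTION: a term is kept only if it has at least `min_samples`
    samples and appears in at least `min_datasets` datasets.
    """
    samples_per_term = defaultdict(set)
    datasets_per_term = defaultdict(set)
    for sample_id, dataset_id, terms in annotations:
        for term in terms:
            samples_per_term[term].add(sample_id)
            datasets_per_term[term].add(dataset_id)

    kept = sorted(
        term
        for term in samples_per_term
        if len(samples_per_term[term]) >= min_samples
        and len(datasets_per_term[term]) >= min_datasets
    )
    matrix = [
        [1 if term in terms else 0 for term in kept]
        for _, _, terms in annotations
    ]
    return kept, matrix


# Toy annotations with small thresholds so the filter is visible.
annotations = [
    ("s1", "d1", {"cancer", "lung cancer"}),
    ("s2", "d1", {"cancer"}),
    ("s3", "d2", {"cancer"}),
    ("s4", "d2", set()),
]
terms, Y = build_label_matrix(annotations, min_samples=2, min_datasets=2)
```

In the toy run, "lung cancer" is dropped because it has a single sample, and the control sample `s4` gets an all-zero row.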

Analysis and results

Multi-label classification algorithms
Learn a single classifier for each disease
  Ignores class dependencies
Adaptation: Bayesian correction
  Learn single classifiers
  Correct errors using the DO DAG
Transformation: use the label power sets and learn a multiclass model
Using RF: multi-label trees
  Performed better than most approaches in an experimental study (Madjarov et al. 2012)
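The difference between the per-disease approach and the label power set transformation comes down to how the target is prepared before training. The sketch below shows only that preparation step on a toy label matrix; the function names are illustrative, and no classifier is trained.

```python
def binary_relevance_targets(Y):
    """One binary target vector per disease (column of Y).

    Each vector would train its own classifier, which ignores any
    dependencies between diseases.
    """
    n_labels = len(Y[0])
    return [[row[j] for row in Y] for j in range(n_labels)]


def label_powerset_targets(Y):
    """Map every distinct label combination to a single class id.

    The result is a target for one ordinary multiclass model.
    """
    class_of = {}
    targets = []
    for row in Y:
        key = tuple(row)
        if key not in class_of:
            class_of[key] = len(class_of)
        targets.append(class_of[key])
    return targets, class_of


# Toy label matrix: 4 samples x 2 disease terms.
Y = [[1, 0], [1, 1], [0, 0], [1, 0]]
per_disease = binary_relevance_targets(Y)
powerset, classes = label_powerset_targets(Y)
```

The power set keeps label combinations intact, but the number of classes can grow quickly with the number of disease terms, which is one reason the per-disease route stays attractive.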

How to validate a classifier?
Use leave-dataset-out cross-validation
The output of a multi-label learner on the test set:
  P: samples x disease terms, probabilities in [0,1]
  Y: samples x disease terms, labels in {0,1}
Global AUC scores: each prediction Pij vs. the correct label Yij
Disease-based AUC scores: consider each column separately
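The validation scheme can be sketched as follows: folds are defined by dataset rather than by sample, and the AUC is computed either over all entries of P and Y at once or one disease column at a time. The helper names, the dataset identifiers, and the toy matrices are assumptions for illustration.

```python
def leave_dataset_out(dataset_ids):
    """Yield (held_out_dataset, train_indices, test_indices) per fold."""
    for held_out in sorted(set(dataset_ids)):
        test = [i for i, d in enumerate(dataset_ids) if d == held_out]
        train = [i for i, d in enumerate(dataset_ids) if d != held_out]
        yield held_out, train, test


def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as one half. Returns None if either class is missing.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def global_auc(P, Y):
    """Pool every prediction Pij against its label Yij."""
    return roc_auc([p for row in P for p in row], [y for row in Y for y in row])


def disease_aucs(P, Y):
    """One AUC per disease term, i.e. per column."""
    return [
        roc_auc([row[j] for row in P], [row[j] for row in Y])
        for j in range(len(Y[0]))
    ]


# Toy test-set output: 4 samples x 2 disease terms (hypothetical values).
P = [[0.9, 0.2], [0.8, 0.7], [0.3, 0.1], [0.4, 0.15]]
Y = [[1, 0], [1, 1], [0, 0], [0, 1]]
folds = list(leave_dataset_out(["GSE_A", "GSE_A", "GSE_B", "GSE_B"]))
overall = global_auc(P, Y)
per_disease = disease_aucs(P, Y)
```

Holding out whole datasets keeps samples from the same study out of both the training and test sets, so the score reflects transfer to an unseen study.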

A problem (!)
What is in the background?
For a disease D define:
  Positives: disease samples
  Negatives: direct controls
  Background controls (BGCs)
Example: 500 positives, 500 negatives, 10000 BGCs
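To see why the background matters, one can work through the example counts. The sketch below is only arithmetic on those numbers, with illustrative variable names. It relies on the standard fact that a classifier scoring at random has an expected precision equal to the fraction of positives in the evaluated set.

```python
# Counts from the example for one disease D.
positives = 500      # disease samples
negatives = 500      # direct controls
background = 10000   # background controls (BGCs)

# Fraction of positives when only direct controls are considered.
prevalence_direct = positives / (positives + negatives)

# Fraction of positives when the background controls are included.
prevalence_all = positives / (positives + negatives + background)

# How much harder the random-precision baseline becomes.
ratio = prevalence_direct / prevalence_all
```

With the background included, positives make up under 5% of the samples, so the same classifier can look very different depending on which samples count as "not D".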

Multistep validation
It is recommended to use several scores (Lee et al. 2013)
Measure global AUPR
For each disease we calculate three scores (measure, then the additional information it uses):
  AUPR: checks separation between positives and all others
    Uses: sick vs. not sick
  ROC: tests for separation between positives and negatives
    Uses: the negatives directly
  Meta-analysis p-value: calculates the overall separation significance within the original datasets
    Uses: the mapping of samples to datasets
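The first two per-disease scores differ only in which samples they include, which a short sketch can make concrete. The role labels `"pos"`, `"neg"`, and `"bgc"`, the function names, and the toy predictions are assumptions. AUPR is approximated by average precision, and the meta-analysis p-value is left out because the slide does not specify its method.

```python
def average_precision(scores, labels):
    """Average precision, a common estimate of the area under the PR curve."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else None


def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def disease_scores(pred, roles):
    """Return (AUPR, ROC AUC) for one disease.

    AUPR: positives vs. all others (negatives and background controls).
    ROC:  positives vs. direct negatives only; background is excluded.
    """
    is_pos = [1 if r == "pos" else 0 for r in roles]
    aupr = average_precision(pred, is_pos)
    direct = [(p, y) for p, y, r in zip(pred, is_pos, roles) if r != "bgc"]
    roc = roc_auc([p for p, _ in direct], [y for _, y in direct])
    return aupr, roc


# Toy predictions for one disease (hypothetical values).
pred = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
roles = ["pos", "bgc", "pos", "neg", "neg", "bgc"]
aupr, roc = disease_scores(pred, roles)
```

In the toy data a high-scoring background sample lowers the AUPR while leaving the ROC untouched, which is the kind of disagreement that motivates reporting several scores.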

Performance results
[Figure: per-disease AUPR and positives vs. negatives ROC; filled boxes mark a meta-analysis q-value < 0.001]

Performance results
8.5% improvement in recall and 12% in precision, compared to Huang et al.

Validation on RNA-Seq
Data from TCGA: 1,699 samples

Pathway-Disease network
Steps (for each of the selected diseases):
  Disease-pathway edges
    RF importance: select the top features
    Test for disease relevance
  Add edges between diseases
    Use the DO structure
  Add edges between pathways
    Based on significant overlap in genes
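For the pathway-pathway edges, one common way to test for "significant overlap in genes" is a one-sided hypergeometric test. The slide does not name the test it used, so the sketch below is an assumption, as are the function name and the toy gene sets.

```python
from math import comb


def overlap_pvalue(pathway_a, pathway_b, universe_size):
    """P(overlap >= observed) if pathway_b were drawn at random.

    ASSUMPTION: hypergeometric null over a gene universe of the given
    size; the slide only says "significant overlap in genes".
    """
    a, b = set(pathway_a), set(pathway_b)
    observed = len(a & b)
    total = comb(universe_size, len(b))
    tail = sum(
        comb(len(a), i) * comb(universe_size - len(a), len(b) - i)
        for i in range(observed, min(len(a), len(b)) + 1)
    )
    return tail / total


# Toy pathways sharing 3 of 5 genes in a 20-gene universe (hypothetical).
pathway_a = {"g1", "g2", "g3", "g4", "g5"}
pathway_b = {"g3", "g4", "g5", "g6", "g7"}
p_value = overlap_pvalue(pathway_a, pathway_b, universe_size=20)
```

An edge would then be drawn between two pathways when the p-value, after correction for the number of pathway pairs tested, falls below a chosen threshold.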

Network overview
[Figure; legend: Down, Up]

Cancer network
[Figure; legend: Down, Up]

Cardiovascular disease
[Figure; legend: Down, Up]

Gastric cancers

Summary
Large scale integration
Multi-label learning
Careful validation
Pathway-based features as biomarkers
Summary of the results in a network
Currently:
  Add genes: overcome missing values
  Shows improvement in validation

Acknowledgements
Ron Shamir
Tom Hait