Didi Amar and Tom Hait Group meeting October 2013

Integrating expression and pathway databases for revealing disease biomarkers

Motivation: Study expression profiles of many diseases and healthy controls. Unlike standard case-control studies, we have many negative controls. Use thousands of expression profiles; this requires integrating many technologies. Utilize pathway information for better interpretability and, possibly, better classification (?).

Background: classification. Example: credit scoring, i.e. differentiating between low-risk and high-risk customers from their income and savings. Output: class assignment or probabilities. Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk. Based on Lecture Notes for E. Alpaydın 2004, Introduction to Machine Learning © The MIT Press (V1.1).
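The discriminant rule on this slide can be written as a tiny function. This is only a minimal sketch: the function name and the default threshold values are illustrative assumptions, not values from the slides (in practice θ1 and θ2 would be learned from data).

```python
def credit_risk(income: float, savings: float,
                theta1: float = 30_000.0, theta2: float = 10_000.0) -> str:
    """Rule-based discriminant: low-risk only if both thresholds are exceeded.

    theta1 and theta2 are made-up defaults used purely for illustration.
    """
    if income > theta1 and savings > theta2:
        return "low-risk"
    return "high-risk"


# Two hypothetical customers: the second fails the savings threshold.
print(credit_risk(45_000, 12_000))
print(credit_risk(45_000, 5_000))
```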

Performance measures: Use cross-validation. Measure performance with accuracy, ROC score, precision, recall, F-measure, AUPR, …
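As a reminder of how the threshold-based measures relate to each other, here is a small sketch that computes accuracy, precision, recall and the F-measure from predicted labels. The function name and the toy labels are assumptions for illustration; ROC and AUPR need ranked scores rather than hard labels and are not covered here.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators (no predicted or no true positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example: 3 positives and 3 negatives.
print(binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0]))
```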

Weighted and non-symmetric ROC score. Idea: given two classes A and B, rank all samples according to the predicted probability P(A) of assignment to class A, and ask whether the samples of A tend to get high ranks. Multiclass: calculate the ROC score of each class and report the weighted average.
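The rank-based idea can be sketched as follows. The per-class ROC score is computed as the fraction of (class member, non-member) pairs in which the member gets the higher predicted probability. The slide does not say which weights are used in the multiclass average; weighting by class size is an assumption made here, and the function names and toy data are hypothetical.

```python
def roc_auc(scores, is_member):
    """Fraction of (member, non-member) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, m in zip(scores, is_member) if m]
    neg = [s for s, m in zip(scores, is_member) if not m]
    if not pos or not neg:
        raise ValueError("need at least one member and one non-member")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def weighted_multiclass_auc(prob_rows, labels):
    """Average the one-vs-rest ROC scores, weighted by class size (assumption).

    prob_rows[i][c] is the predicted probability of class c for sample i.
    """
    total = 0.0
    for c in sorted(set(labels)):
        scores = [row[c] for row in prob_rows]
        member = [lab == c for lab in labels]
        total += roc_auc(scores, member) * sum(member)
    return total / len(labels)


# Toy predictions for four samples and two classes.
probs = [{"A": 0.9, "B": 0.1}, {"A": 0.4, "B": 0.6},
         {"A": 0.6, "B": 0.4}, {"A": 0.2, "B": 0.8}]
print(weighted_multiclass_auc(probs, ["A", "A", "B", "B"]))
```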

Huang et al. PNAS 2010: Analyzed 110 datasets. Each dataset was automatically assigned to a disease term, and a profile was calculated for each dataset. Used two Affymetrix platforms (133A and 133A Plus 2) with ~8500 shared genes. Each sample was analyzed vs. the controls of its dataset, using the log of the rank ratio of each gene. Nice classification performance: 82% precision at 0.2 recall.

Huang et al. PNAS 2010, pitfalls: Only two platforms. No single-sample analysis. Several datasets were removed from GEO, probably due to statistical problems in the data. No “healthy” class. Interpretability.

Single sample pathway analysis (Yang et al. 2012, 2013): Limited integration of datasets. Used gene expression data for patient classification and recurrence survival. Improve interpretability of the classifiers by integrating pathway information. Keep classification performance high. Standardize across multiple datasets.

Sample analysis. Input: an expression profile of a sample (a vector of real values for each patient). Step 1: rank the genes. Step 2: calculate a score for each gene from the rank of gene g in sample s and the total number of ranked genes.

Gene set features: Gene sets are GO terms or KEGG pathways. Define the normalized centroid (NC) of a gene set as the average weight of its genes. Calculate the NC score for each gene set and its complement.

Gene set features: The score of a gene set is the difference between the two NC scores, which results in the FAIME profile for each sample.
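The single-sample steps (rank genes, weight the ranks, compare the gene set centroid with the centroid of its complement) can be sketched as below. The weighting formula is not preserved in this transcript, so the exponential form used here, w = r · exp(−r / |G|) with r the ascending expression rank, is an assumption for illustration only; the exact definition should be taken from Yang et al. Taking the difference as gene set minus complement is likewise an assumption. Function and gene names are hypothetical.

```python
import math


def weighted_ranks(expression):
    """Map gene -> weight for one sample.

    Genes are ranked by ascending expression (rank 1 = lowest), then weighted
    with an ASSUMED exponential form: rank * exp(-rank / number_of_genes).
    """
    genes = sorted(expression, key=expression.get)
    n = len(genes)
    return {gene: rank * math.exp(-rank / n)
            for rank, gene in enumerate(genes, start=1)}


def gene_set_score(weights, gene_set):
    """Normalized centroid of the gene set minus that of its complement."""
    inside = [w for g, w in weights.items() if g in gene_set]
    outside = [w for g, w in weights.items() if g not in gene_set]
    if not inside or not outside:
        raise ValueError("gene set and its complement must both be non-empty")
    return sum(inside) / len(inside) - sum(outside) / len(outside)


# Toy sample with four genes; the set {g2, g4} holds the two highest values.
sample = {"g1": 1.0, "g2": 5.0, "g3": 3.0, "g4": 9.0}
w = weighted_ranks(sample)
print(gene_set_score(w, {"g2", "g4"}))
```

Applying `gene_set_score` to every gene set gives one feature vector per sample, which is what the slides call the sample's profile.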

Analysis design (goal): [Diagram] Inputs: expression profiles (GSE, GDS, TCGA), pathways (KEGG, Reactome, Biocarta, NCI), platform data, and dataset/sample descriptions. Single-sample analysis: the genes/transcripts of sample i (g1, g2, g3, …, gk) are ranked and converted to weighted ranks (w1, w2, w3, …, wk), giving a standardized profile from low to high expression. Single-sample, single-pathway analysis: for each pathway compute the mean, SD, KS test and skewness. Outputs over the samples: sample labels Y (disease, tissue), transcript features XT and pathway features XP.

Analysis design (current): [Diagram] Inputs: expression profiles (GSE, GDS), pathways (KEGG, Reactome, Biocarta, NCI), platform data, and dataset/sample descriptions. Single-sample analysis: the genes/transcripts of sample i (g1, g2, g3, …, gk) are ranked and converted to weighted ranks (w1, w2, w3, …, wk), giving a standardized profile from low to high expression. Single-sample, single-pathway analysis: for each pathway compute the mean, SD, KS test and skewness. Outputs over the samples: sample labels Y (cancer, healthy, other disease), transcript features XT and pathway features XP.

Expression data: Use the manually curated GEO datasets (“GDS” objects): 250 human microarray datasets with a “disease state” attribute; 13114 samples; 59 platforms. Manually go over large GEO datasets that have no GDS: 40 datasets; 13118 samples; 14 platforms.

Platform analysis: Use platforms with > 20000 probes. Consider only platforms of Affymetrix, Illumina or Agilent. Ignore GPL97 (U133B) because of its low gene intersection with other platforms. Result: 20 platforms. GDS: 178 datasets; 9684 samples. GSE: 34 datasets; 11900 samples.

Removed platform, comment: GPL97 (Affy U133B). In many cases the same patients also have GPL96 arrays. Example: GDS1975 and GDS1976 contain 86 samples each, and both are from the same patients. Including this platform reduces the number of shared RefSeqs to 2194 and the number of shared Entrez genes to 3407.

Probes to Entrez, RefSeq: For each platform, get the probe-to-RefSeq and probe-to-Entrez mappings by analyzing the data of the platform GEO object. Sometimes external mapping is needed. Missing RefSeq column: it can usually be found in the GenBank column. Map RefSeq to Entrez when only RefSeq information appears. There were no cases with Entrez but without RefSeq.

Platform gene intersection: RefSeq: 6686; Entrez: 6067. How many probes are covered by the intersection?
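Both the intersection and the coverage question can be answered with plain set operations. The sketch below assumes each platform is available as a probe-to-gene mapping; the function names and the toy identifiers are hypothetical and only show the shape of the computation.

```python
def shared_genes(platform_to_genes):
    """Genes present on every platform (intersection over all platforms)."""
    gene_sets = [set(genes) for genes in platform_to_genes.values()]
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


def probe_coverage(probe_to_gene, shared):
    """Fraction of a platform's probes that map to a gene in the shared set."""
    if not probe_to_gene:
        return 0.0
    covered = sum(1 for gene in probe_to_gene.values() if gene in shared)
    return covered / len(probe_to_gene)


# Toy gene identifiers on three hypothetical platforms.
platforms = {
    "platform_1": {"2", "3", "7"},
    "platform_2": {"2", "3", "4"},
    "platform_3": {"9", "3", "2"},
}
shared = shared_genes(platforms)
print(sorted(shared))

# Probes of one platform; None marks a probe with no gene annotation.
probes = {"p1": "2", "p2": "4", "p3": None, "p4": "3"}
print(probe_coverage(probes, shared))
```

Adding a platform to the dictionary can only shrink the shared set, which is why a single platform with low overlap has such a large effect on the intersection size.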

Enrichment analysis of the intersection (Entrez): more than 300 GO terms at 0.05 FDR (!). Top enrichments:

GO term | # genes | p-value
small molecule metabolic process - GO:0044281 | 1034 | 2.00E-58
response to stress - GO:0006950 | 1149 | 6.42E-58
response to organic substance - GO:0010033 | 819 | 1.10E-57
negative regulation of biological process - GO:0048519 | 1195 | 2.57E-46
regulation of biological quality - GO:0065008 | 964 | 2.37E-42

Pathways: Use the R graphite package. Available pathway databases: KEGG, Reactome, Biocarta, NCI. 1723 pathways in total, covering 7842 genes.

All probes vs. RefSeq gene set vs. pathway Entrez genes (GDS4513)

Transform data to weighted ranks

Pathway analysis: The background is all pathway genes. The FAIME statistic is similar to using the mean ranks.

Pathway analysis, additional statistics: the standard deviation of the pathway scores and the skewness of the pathway scores.

Pathway analysis, additional statistics: Test whether the distribution of pathway gene scores differs from that of the non-pathway scores. We use a nonparametric test (Kolmogorov-Smirnov) and test both up- and down-representation.
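The per-pathway statistics (mean, SD, skewness and the two one-sided KS statistics) can be sketched in plain Python as below. This is an illustration under assumptions: population (divide-by-n) moments are used, only the KS statistics are computed (no p-values), and the function names are hypothetical. A library routine such as `scipy.stats.ks_2samp` would additionally provide p-values.

```python
import math


def moments(values):
    """Mean, population standard deviation and skewness of a list of scores."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    third = sum((v - mean) ** 3 for v in values) / n
    skew = third / sd ** 3 if sd > 0 else 0.0
    return mean, sd, skew


def ks_statistics(inside, outside):
    """One-sided two-sample KS statistics from the empirical CDFs.

    d_low  = max(F_inside - F_outside): large when pathway scores tend to be lower.
    d_high = max(F_outside - F_inside): large when pathway scores tend to be higher.
    """
    d_low = d_high = 0.0
    for x in sorted(set(inside) | set(outside)):
        f_in = sum(v <= x for v in inside) / len(inside)
        f_out = sum(v <= x for v in outside) / len(outside)
        d_low = max(d_low, f_in - f_out)
        d_high = max(d_high, f_out - f_in)
    return d_low, d_high


# Toy weighted-rank scores: the pathway genes all score above the background.
pathway_scores = [0.8, 0.9, 1.0]
background_scores = [0.1, 0.2, 0.3, 0.4]
print(moments(pathway_scores))
print(ks_statistics(pathway_scores, background_scores))
```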

Pathway features

Mapping samples to labels: Huang et al. used the MetaMap tool to (automatically) map natural-language descriptions to MeSH terms. MetaMap is not good enough. Example: GDS4358, “Analysis of diagnostic lymph-node biopsies from classic Hodgkins lymphoma HIV- patients before ABVD chemotherapy”; MetaMap top score: “HIV infections”. In most cases, no tissue was found.

Manual analysis: For each dataset, analyze the description of the GDS object, the GSE object, class names and tissues. Missing data? Go to the paper!

GDS data example

GSE data example – series matrix

So, where are we? Finished going over the GDS objects: ~600 records; 8168 samples; >180 disease names; ~110 tissues.

So, where are we? In progress: Analyze GSE objects. Found an error in GEO: the GDS4471 data do not fit the platform, so use the GSE data. Get cancer profiles from TCGA.

Cancer vs. healthy vs. others: Define simple classes: cancer (MeSH id starts with C04), control, other disease. Round 1: 105 GDS datasets; 81 datasets had at least two classes; 5661 samples. Classifier: SVM. Feature selection: t-test, top 500.
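The class definition and the feature-selection step can be sketched as follows. The slides do not say which t-test variant was used or how it was applied to three classes; the Welch statistic and the one-class-vs-rest comparison below are assumptions, and the function names and toy data are hypothetical. The SVM itself is not shown.

```python
import math
from statistics import mean, variance


def simple_class(mesh_tree_numbers, is_control):
    """Map a sample to 'control', 'cancer' (MeSH C04 prefix) or 'other disease'."""
    if is_control:
        return "control"
    if any(t.startswith("C04") for t in mesh_tree_numbers):
        return "cancer"
    return "other disease"


def welch_t(a, b):
    """Welch t statistic for two groups (each needs at least two values)."""
    denom = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return 0.0 if denom == 0 else (mean(a) - mean(b)) / denom


def top_features(rows, labels, positive, k=500):
    """Indices of the k features with the largest |t| for positive vs. rest."""
    scored = []
    for j in range(len(rows[0])):
        pos = [r[j] for r, y in zip(rows, labels) if y == positive]
        neg = [r[j] for r, y in zip(rows, labels) if y != positive]
        scored.append((abs(welch_t(pos, neg)), j))
    scored.sort(reverse=True)
    return [j for _, j in scored[:k]]


# Toy matrix: 4 samples x 3 features; feature 1 separates the classes best.
rows = [[1.0, 5.0, 0.0], [1.1, 5.2, 1.0], [0.9, 9.0, 0.0], [1.0, 9.1, 1.0]]
labels = ["cancer", "cancer", "control", "control"]
print(top_features(rows, labels, positive="cancer", k=2))
```

To avoid optimistic performance estimates, the selection should be recomputed inside each cross-validation fold using the training samples only.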

Test 1: LOOCV within datasets. For each dataset that has at least two classes: LOOCV using the raw data matrix and LOOCV using the RefSeq genes matrix. Goal: make sure that the RefSeq matrix produces similar results.

Results

Test 2: Leave dataset out. Use all datasets. Compare the RefSeq matrix and the pathway matrix.

Round 2: 145 datasets; leave 10 datasets out; 7883 samples (cancer: 3285, control: 2098, other: 2500).
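Leaving whole datasets out, rather than single samples, keeps all samples of a study on the same side of the split. A minimal sketch of such a splitter is given below. How the held-out datasets were chosen is not stated in the slides; taking consecutive chunks of the sorted dataset ids is an assumption, and the function name and dataset ids are hypothetical.

```python
def leave_datasets_out(sample_datasets, n_out=1):
    """Yield (train_indices, test_indices), holding out n_out datasets per fold.

    sample_datasets[i] is the id of the dataset that sample i came from.
    """
    datasets = sorted(set(sample_datasets))
    for start in range(0, len(datasets), n_out):
        held_out = set(datasets[start:start + n_out])
        test = [i for i, d in enumerate(sample_datasets) if d in held_out]
        train = [i for i, d in enumerate(sample_datasets) if d not in held_out]
        yield train, test


# Six samples drawn from three hypothetical datasets.
origin = ["ds_a", "ds_a", "ds_b", "ds_c", "ds_c", "ds_c"]
for train, test in leave_datasets_out(origin, n_out=1):
    print(train, test)
```

With `n_out=10` this mirrors the "leave 10 datasets out" setting; scikit-learn's group-based splitters offer the same behaviour if that library is available.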

Random forest vs. SVM

Future work: Add GSE and TCGA samples. RNA-seq analysis. Test other labels: disease (use the UMLS or Disease Ontology structures); tissues/anatomy (use MeSH trees).

Multitask vs. multiclass learning: Use the DO structure to assign datasets to ancestors, giving multiple datasets per class. Multiclass: each sample is mapped to a single class, and the number of classes is > 2. Multitask: samples can be mapped to more than one class.
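Assigning samples to ontology ancestors is what turns the problem into a multitask one: a sample annotated with a specific disease is also a positive example for every more general term. A minimal sketch, using a made-up three-term hierarchy rather than real Disease Ontology identifiers (function names are hypothetical as well):

```python
def ancestors(term, parents):
    """All ancestors of a term in a DAG given as term -> list of parent terms."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def multitask_labels(sample_terms, parents):
    """Map each sample to its own term plus all ancestor terms."""
    return {sample: {term} | ancestors(term, parents)
            for sample, term in sample_terms.items()}


# Toy hierarchy: "breast cancer" -> "cancer" -> "disease".
toy_parents = {"breast cancer": ["cancer"], "cancer": ["disease"]}
samples = {"s1": "breast cancer", "s2": "cancer"}
print(multitask_labels(samples, toy_parents))
```

In the multiclass setting each sample would instead keep exactly one of these terms.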

Multitask computational approaches: Learn all classification models using a global loss function: generalized SVMs (iLSVM, Zhu et al. 2012). Classification chains (Groves et al. 2012): learn a classifier for each task; create a new matrix M of samples vs. class probabilities; learn a new model for each class from M and the original features.

Bayesian correction for hierarchies (DeCoro et al. 2012): Learn a classifier for each task. For each sample and task we have the correct classification y and the prediction y’. According to Bayes' theorem we want to maximize:

Bayesian correction for hierarchies (DeCoro et al. 2012): Enforcing the Bayesian network using the original base classifications. Technicalities: use internal CV; learn the global distribution.
