Use of Microarray Data via Model-Based Classification in the Study and Prediction of Survival from Lung Cancer Liat Jones *, Angus Ng *, Chris Ambroise.


Use of Microarray Data via Model-Based Classification in the Study and Prediction of Survival from Lung Cancer
Liat Jones*, Angus Ng*, Chris Ambroise**, Katrina Monico* and Geoff McLachlan*
* Institute for Molecular Bioscience, University of Queensland
** Laboratoire Heudiasyc

AIM: To link gene-expression data with survival from lung cancer.
A. CLUSTER ANALYSIS. We apply a model-based clustering approach to classify tumor tissues on the basis of microarray gene expression. The impact of this classification on cancer biology and clinical outcome is studied.
B. SURVIVAL ANALYSIS. The association between the clusters so formed and patient survival (recurrence) times is examined.
C. DISCRIMINANT ANALYSIS. We show that the prognosis clustering is a more powerful predictor of the outcome of disease than current systems based on histopathology criteria and extent of disease at presentation.

STANFORD and ONTARIO DATASETS: Both datasets include tissues of various tumor types, as shown in Table 1. The major differences between the samples are that the Stanford Dataset contains relatively more adenocarcinoma (AC) samples, and the Ontario Dataset contains only non-small cell carcinomas (NSCLC). In both studies, cDNA microarrays were used to obtain gene expression profiles for the tissue (tumour) samples.
STANFORD: 918 genes
ONTARIO: 2880 genes

Initial Gene Selection
We start with the reduced datasets: the Stanford subset of 918 genes (those with the most similar expression within tumor pairs, but which differed among the other tumor samples) and the Ontario subset of 2880 genes (transcripts that contained data points in at least 80 percent of the samples and had at least two samples with an absolute value of two in log2 space). The Stanford Dataset had a total of 73 samples (including matched samples), and the Ontario Dataset had a total of 39 samples.

Table 1. Comparison of Tumor Types for Stanford and Ontario Datasets

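MIXTURE OF g NORMAL COMPONENTS
$f(x) = \pi_1 \phi(x; \mu_1, \Sigma_1) + \cdots + \pi_g \phi(x; \mu_g, \Sigma_g)$
EUCLIDEAN DISTANCE: $-2 \log \phi(x; \mu, \sigma^2 I) = (x - \mu)^T (x - \mu) / \sigma^2 + \text{constant}$
MAHALANOBIS DISTANCE: $-2 \log \phi(x; \mu, \Sigma) = (x - \mu)^T \Sigma^{-1} (x - \mu) + \text{constant}$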

Figure panels: SPHERICAL CLUSTERS (k-means); MIXTURE OF g NORMAL COMPONENTS (k-means).

With a mixture model-based approach to clustering, an observation is assigned outright to the ith cluster if its density in the ith component of the mixture distribution (weighted by the prior probability of that component) is greater than in the other (g-1) components.
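The assignment rule can be illustrated with a minimal sketch. It assumes univariate normal components for brevity, and the priors, means and variances below are hypothetical values chosen for illustration, not estimates from the lung cancer data.

```python
import math

def normal_density(x, mean, var):
    """Univariate normal density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def assign_cluster(x, priors, means, variances):
    """Return (index of the assigned component, posterior probabilities).

    The observation goes to the component whose prior-weighted density
    is largest, which is also the component with the largest posterior.
    """
    weighted = [p * normal_density(x, m, v)
                for p, m, v in zip(priors, means, variances)]
    total = sum(weighted)
    posteriors = [w / total for w in weighted]
    best = max(range(len(weighted)), key=lambda i: weighted[i])
    return best, posteriors

# Hypothetical two-component mixture (illustrative values only)
priors = [0.7, 0.3]
means = [0.0, 3.0]
variances = [1.0, 1.0]

cluster, post = assign_cluster(0.5, priors, means, variances)
print(cluster, post)
```

The posterior probabilities are kept alongside the outright assignment because they indicate how clear-cut the allocation of each observation is.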

Mixtures of Factor Analyzers
A normal mixture model without restrictions on the component-covariance matrices may be viewed as too general for many situations in practice, in particular with high-dimensional data. One approach for reducing the number of parameters is to work in a lower-dimensional space by adopting mixtures of factor analyzers.

$\Sigma_i = B_i B_i^T + D_i \quad (i = 1, \ldots, g)$, where $B_i$ is a $p \times q$ matrix and $D_i$ is a diagonal matrix.
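To see why the factor-analytic form reduces the number of parameters, the rough count below may help. It assumes the usual form of the component-covariance matrix, B_i B_i^T + D_i, and counts the elements of B_i and D_i directly; the value of q is hypothetical.

```python
def full_covariance_params(p):
    """Distinct elements of an unrestricted p x p covariance matrix."""
    return p * (p + 1) // 2

def factor_analyzer_params(p, q):
    """Elements of B (p x q) plus the p diagonal elements of D.

    This simple count ignores the q(q-1)/2 rotational constraints on B,
    so it slightly overstates the number of free parameters.
    """
    return p * q + p

p = 20  # e.g. twenty group means used as features
q = 2   # hypothetical number of factors

print(full_covariance_params(p), factor_analyzer_params(p, q))
```

Per component, the unrestricted covariance matrix needs 210 parameters in this setting, against 60 for the factor-analytic form.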


CLUSTERING OF ONTARIO TUMOURS Using EMMIX-GENE
Steps used in the application of EMMIX-GENE:
1. Select the most relevant genes from this filtered set of 2,880 genes. The set of retained genes is thus reduced to 766.
2. Cluster these 766 genes into twenty groups. The majority of gene groups produced were reasonably cohesive and distinct.
3. Using these twenty group means, cluster the tissue samples into two groups using a mixture of normal components/factor analyzers.
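The group means used in the last step (the metagenes) can be sketched as follows. This is only an illustration of averaging gene profiles within groups; it does not reproduce EMMIX-GENE itself, and the small expression matrix and group labels are made up.

```python
def group_means(expression, gene_groups, n_groups):
    """Average the expression profiles of the genes in each group.

    expression:  list of gene rows, one value per tissue
    gene_groups: group index (0 .. n_groups-1) for each gene
    Returns n_groups rows, one mean profile (metagene) per group.
    """
    n_tissues = len(expression[0])
    sums = [[0.0] * n_tissues for _ in range(n_groups)]
    counts = [0] * n_groups
    for row, g in zip(expression, gene_groups):
        counts[g] += 1
        for j, value in enumerate(row):
            sums[g][j] += value
    if 0 in counts:
        raise ValueError("every group needs at least one gene")
    return [[s / counts[g] for s in sums[g]] for g in range(n_groups)]

# Hypothetical data: 3 genes x 3 tissues, two gene groups
expression = [[1.0, 2.0, 3.0],
              [3.0, 4.0, 5.0],
              [0.0, 0.0, 6.0]]
gene_groups = [0, 0, 1]

metagenes = group_means(expression, gene_groups, 2)
print(metagenes)
```

Clustering the tissues on twenty such mean profiles rather than on hundreds of individual genes keeps the dimension of the feature space small relative to the 39 tissues.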

Heat Maps for the 20 Ontario Gene Groups (axes: tissues, genes)
Tissues are ordered as: Recurrence (1-24) and Censored (25-39)

Expression Profiles for Useful Metagenes (Ontario 39 Tissues)
Figure panels: Gene Group 1, Gene Group 2, Gene Group 19, Gene Group 20. Axes: log expression value against tissues, ordered as Recurrence (1-24) and Censored (25-39). Legend: Our Tissue Cluster 1, Our Tissue Cluster 2.

Selection of Relevant Genes
We retain only 1 of the 16 genes of interest mentioned in the Ontario study (ZNF136). Why are the others rejected by us yet retained in the Ontario study?

Expression Profiles of some Genes Identified in Ontario
Genes shown: PNUTL1, FUS, Wee1, ATM, HIF1A, RABIF. Panel labels: Cluster A (down in recurrence, up in censored); Clusters B and C (up in recurrence, down in censored). Axes: log expression value against tissues, grouped as Recurrence (1-24) and Censored (25-39).

Only ZNF136 is retained by us and also identified in Ontario. It is found in our Group 19 (up-regulated in recurrence).
Figure axes: log expression value against tissues, grouped as Recurrence (1-24) and Censored (25-39).

Tumours 1-24 belong to the RECURRENCE group; tumours 25-39 are censored.
CLUSTER ANALYSIS via EMMIX-GENE of 20 METAGENES yields TWO CLUSTERS:
CLUSTER 1: 1-14, 16-24 (recurrence) plus 25-29, 33, 36, 38 (censored)
CLUSTER 2: 15 (recurrence) plus 30-32, 34, 35, 37, 39 (censored)

SURVIVAL ANALYSIS: LONG-TERM SURVIVAL (LTS) MODEL
$S(t) = \mathrm{pr}\{T > t\} = \pi_1 S_1(t) + \pi_2$, where T is time to recurrence and $\pi_1 = 1 - \pi_2$ is the prior probability of recurrence. Adopt a Weibull model for the survival function for recurrence, $S_1(t)$.
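A minimal sketch of the LTS survival function follows, assuming the usual mixture form S(t) = pi_1 S_1(t) + pi_2 with pi_2 the proportion of long-term survivors. The Weibull parameterisation S_1(t) = exp(-(t/scale)^shape) and all numerical values are assumptions made for illustration.

```python
import math

def weibull_survival(t, shape, scale):
    """Weibull survival function, assumed here as exp(-(t/scale)**shape)."""
    return math.exp(-((t / scale) ** shape))

def lts_survival(t, pi_recurrence, shape, scale):
    """Long-term survivor model: S(t) = pi_1 * S_1(t) + pi_2.

    pi_recurrence is pi_1, the prior probability of recurrence;
    pi_2 = 1 - pi_1 is the proportion that never recurs.
    """
    pi_lts = 1.0 - pi_recurrence
    return pi_recurrence * weibull_survival(t, shape, scale) + pi_lts

# Hypothetical parameter values (not estimates from the slides)
for t in (0.0, 12.0, 60.0, 1e6):
    print(t, lts_survival(t, pi_recurrence=0.6, shape=1.5, scale=20.0))
```

The curve starts at 1 and levels off at pi_2 rather than at 0, which is what distinguishes this model from an ordinary Weibull survival model.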

Figure: PCA of Tissues Based on Metagenes (x-axis: first PC; y-axis: second PC)

Figure: PCA of Tissues Based on All Genes (via SVD) (x-axis: first PC; y-axis: second PC)

Survival Analysis for Ontario Dataset
Table columns: Cluster; No. of Tissues; No. of Censored; Mean time to Failure (± SE).
Kaplan-Meier estimation: A significant difference in recurrence-free survival between clusters (P=0.027).
Cox's proportional hazards analysis, hazard ratio (95% CI):
Cluster 1 vs. Cluster 2: 6.78 (0.9 – 51.5)
Tumor stage: 1.07 (0.57 – 2.0)
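For readers unfamiliar with the Kaplan-Meier estimate used in this comparison, a minimal product-limit sketch is given below. The follow-up times and censoring indicators are invented for illustration and are not the Ontario data.

```python
def kaplan_meier(times, events):
    """Product-limit estimate of the survival function.

    times:  follow-up time for each tissue
    events: 1 if recurrence was observed, 0 if the time is censored
    Returns a list of (time, survival estimate) at each distinct event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        observed = 0   # recurrences at time t
        removed = 0    # everyone leaving the risk set at time t
        while i < len(data) and data[i][0] == t:
            observed += data[i][1]
            removed += 1
            i += 1
        if observed > 0:
            survival *= 1.0 - observed / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve

# Hypothetical follow-up data (months)
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]

curve = kaplan_meier(times, events)
print(curve)
```

Censored tissues contribute to the risk set up to their censoring time but never trigger a drop in the curve, which is how the estimate uses the partial information they carry.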

Supervised Classification Method
Based on a supervised clustering approach, a prognosis classifier was developed to predict the class of origin of a tumor tissue with a small error rate after correction for the selection bias. A support vector machine (SVM) was adopted to identify important genes that play a key role in predicting the clinical outcome, using all the genes and the metagenes. A cross-validation (CV) procedure was then performed to calculate the test error, after correcting for the selection bias.
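Correcting for the selection bias means that gene selection is repeated inside every cross-validation fold, using the training tissues only. The sketch below shows that structure under simplifying assumptions: a mean-difference gene ranking and a nearest-centroid rule stand in for the SVM with recursive feature elimination, and the data are made up.

```python
def class_means(X, y, genes):
    """Mean of each selected gene within each class."""
    means = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        means[label] = [sum(r[g] for r in rows) / len(rows) for g in genes]
    return means

def select_genes(X, y, k):
    """Rank genes by absolute difference between the two class means."""
    all_genes = list(range(len(X[0])))
    m = class_means(X, y, all_genes)
    a, b = sorted(m)  # assumes exactly two classes are present
    scores = [abs(m[a][g] - m[b][g]) for g in all_genes]
    return sorted(all_genes, key=lambda g: -scores[g])[:k]

def predict(x, means, genes):
    """Nearest-centroid rule on the selected genes (stand-in for the SVM)."""
    def dist(label):
        return sum((x[g] - c) ** 2 for g, c in zip(genes, means[label]))
    return min(sorted(means), key=dist)

def cv_error(X, y, n_folds, k):
    """Cross-validated error with gene selection redone in every fold."""
    n = len(X)
    errors = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        genes = select_genes(X_train, y_train, k)  # training tissues only
        means = class_means(X_train, y_train, genes)
        errors += sum(predict(X[i], means, genes) != y[i] for i in test_idx)
    return errors / n

# Hypothetical data: 8 tissues x 3 genes, only gene 0 is informative
X = [[0.1, 1.0, 2.0], [5.2, 1.1, 2.1], [-0.2, 0.9, 1.9], [4.9, 1.0, 2.0],
     [0.0, 1.2, 2.2], [5.1, 0.8, 1.8], [0.3, 1.0, 2.1], [4.8, 1.1, 1.9]]
y = [0, 1, 0, 1, 0, 1, 0, 1]

print(cv_error(X, y, n_folds=4, k=1))
```

Selecting genes once on all tissues and then cross-validating only the classifier would let the test tissues influence the gene list, and the resulting error rate would be optimistically biased.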

ONTARIO DATA (39 tissues): Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)
Figure: ten-fold cross-validation error rate (CV10E) of the SVM against log2 (number of genes), applied to g=2 clusters (G1: 1-14, 16-29, 33, 36, 38; G2: 15, 30-32, 34, 35, 37, 39).

Heat Maps for the 20 Stanford Gene Clusters (73 Tissues) (axes: tissues, genes)
Tissues are ordered by their histological classification: Adenocarcinoma (1-41), Fetal Lung (42), Large cell (43-47), Normal (48-52), Squamous cell (53-68), Small cell (69-73)

Heat Maps for the 15 Stanford Gene Clusters (35 Tissues) (axes: tissues, genes)
Tissues are ordered by the Stanford classification into AC groups: AC group 1 (1-19), AC group 2 (20-26), AC group 3 (27-35)

Expression Profiles for Top Metagenes (Stanford 35 AC Tissues)
Figure panels: Gene Group 1, Gene Group 2, Gene Group 3, Gene Group 4 (log expression value against tissues). Legend: Stanford AC group 1, Stanford AC group 2, Stanford AC group 3, Misallocated.

Some other interesting Metagenes
Group 7 (19 genes) includes: citron; surfactant A1.
Marker Genes for Group 1 (Supervised): high in group 1, low in 2 (1/9 genes)
Surfactant Proteins (Unsupervised): high in groups 1 and 2, low in 3
Group 9 (22 genes) includes: ICAM-1 (CD54); collagen, type IX; hepsin; thyroid transcription factor.
Marker Genes for Group 1 (Supervised): high in group 1, low in 2 (4/9 genes)
Figure panels: Gene Group 7, Gene Group 9 (log expression value against tissues).

Which Genes make up the top 4 Metagenes?
Group 1 (22 genes) includes: ESTs Hs; ataxia-telangiectasia group D-associated protein; solute carrier family 7, member 5 (CD98); vascular endothelial growth factor C.
Marker Genes for Group 3 (Supervised): high in group 3, low in 1 and 2 (4/10 genes)
Group 2 (12 genes) includes: ornithine decarboxylase; carbonyl reductase (metabolic enzyme).
Marker Genes for Group 2 (Supervised): high in group 2, low in 3 (1/8 genes)
Group 3 (16 genes) includes: aldo-keto reductase family 1; glutathione peroxidase; thioredoxin reductase.
Metabolic Enzymes (Unsupervised): high in group 3, also SCC (3/6 genes)
Group 4 (14 genes) includes: cartilage paired-class homeoprotein; tumor suppressor deleted in oral cancer-related 1.
Marker Genes for Group 2 (Supervised): high in group 2, low in 3 (2/8 genes)

STANFORD DATA: Cluster 1: 1-19 (good prognosis) Cluster 2: (long-term survivors) Cluster 3: (poor prognosis)

STANFORD DATA: TWO-COMPONENT WEIBULL MIXTURE MODEL where

Survival Analysis for Stanford Dataset
Table columns: Cluster; No. of Tissues; No. of Censored; Mean time to Failure (± SE).
Kaplan-Meier estimation: A significant difference in survival between clusters (P<0.001).
Cox's proportional hazards analysis, hazard ratio (95% CI):
Cluster 2 vs. Cluster 1: 13.2 (2.1 – 81.1)
Grade 3 vs. grades 1 or 2: 1.94 (0.5 – 8.5)
Tumor size: 0.96 (0.3 – 2.8)
No. of tumors in lymph nodes: 1.65 (0.7 – 3.9)
Presence of metastases: 4.41 (1.0 – 19.8)

Survival Analysis for Stanford Dataset MetageneCoefficient (SE)P-value (0.44) (0.31) 0.14 (0.34) (0.56) 0.66 (0.65) (0.50) (0.57) 0.75 (0.46) (0.50) 0.73 (0.39) (0.50) (0.41) (0.48) 0.22 (0.36) 1.70 (0.92) Univariate Cox’s proportional hazards analysis (metagenes):

Survival Analysis for Stanford Dataset MetageneCoefficient (SE)P-value (0.95) (0.62) (0.73) 1.16 (0.54) Multivariate Cox’s proportional hazards analysis (metagenes): The final model consists of four metagenes.

STANFORD DATA: Support Vector Machine (SVM) with Recursive Feature Elimination (RFE)
Figure: ten-fold cross-validation error rate (CV10E) of the SVM against log2 (number of genes), applied to g=2 clusters.

MICHIGAN DATA 4965 genes on 86 AC tumours (oligonucleotide arrays) We imposed a floor of –100; ceiling of 26,000; applied generalized log transformation, and then row normalized but not column normalized
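The preprocessing described here can be sketched roughly as below. The floor and ceiling come from the slide; the particular generalized-log form, its constant c, and the reading of "row normalized" as zero mean and unit standard deviation per gene are all assumptions made for illustration.

```python
import math

def clamp(x, floor, ceiling):
    """Apply the floor and ceiling to a raw intensity."""
    return max(floor, min(ceiling, x))

def glog(x, c):
    """One common generalized-log form (assumed); defined for negative x."""
    return math.log2((x + math.sqrt(x * x + c * c)) / 2.0)

def normalize_row(row):
    """Centre a gene (row) to mean 0 and scale to unit standard deviation."""
    n = len(row)
    mean = sum(row) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (n - 1))
    if sd == 0.0:
        return [0.0] * n  # constant gene: nothing to scale
    return [(v - mean) / sd for v in row]

def preprocess(matrix, floor=-100.0, ceiling=26000.0, c=100.0):
    """Clamp, transform and row-normalize a genes x tissues matrix."""
    out = []
    for row in matrix:
        transformed = [glog(clamp(v, floor, ceiling), c) for v in row]
        out.append(normalize_row(transformed))
    return out

# Hypothetical raw intensities: 2 genes x 3 tissues
raw = [[-500.0, 120.0, 40000.0],
       [10.0, 200.0, 3000.0]]

processed = preprocess(raw)
print(processed)
```

A generalized log is used in the sketch because the floor of -100 leaves negative intensities, for which an ordinary logarithm is undefined.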

Heat Maps for the 40 Michigan Gene Groups (axes: tissues, genes)
Tissues are in order of our clusters: Cluster 1 (1-34), Cluster 2 (35-69), Cluster 3 (70-86)

MICHIGAN DATA: LTS MODEL where CONCLUDE:

Survival Analysis for Michigan Dataset
Table columns: Cluster; No. of Tissues; No. of Censored; Mean time to Failure (± SE).
Kaplan-Meier estimation: No significant difference in survival between clusters.
Long-term survivor model, estimates (SE) for S_1(t) (Weibull distribution) and π_2 (logistic function); the surviving standard errors are (0.007), (0.215), (0.218), (0.169), (0.702).
A significant difference in π_2 between Clusters 1 & 2.
Proportion of long-term survivors: Cluster 1: 67.1%; Cluster 2: 51.5%; Cluster 3: 34.0%

MICHIGAN DATA: Support Vector Machine (SVM) with Recursive Feature Elimination (RFE) applied to g=3 clusters
Figure: ten-fold cross-validation error rate (CV10E) of the SVM against log2 (number of genes), applied to 3 clusters.

HARVARD DATA: genes on 203 AC tumours (oligonucleotide arrays) We impose a floor of –1; ceiling of 3,000; logged data and then column and row normalized

Heat Maps for the 20 Harvard Gene Groups (126 tissues) (axes: tissues, genes)
Tissues are ordered as our clusters: Cluster 1 (1-53), Cluster 2 (54-126)

Heat Maps for the 20 Harvard Gene Groups (126 tissues) (axes: tissues, genes)
Tissues are ordered as our clusters: Cluster 1 (1-55), Cluster 2 (56-110), Cluster 3 (111-126)

Survival Analysis for Harvard Dataset
Table columns: Cluster; No. of Tissues; No. of Censored; Mean time to Failure (± SE).
Kaplan-Meier estimation: A significant difference in survival between clusters (P=0.044).
Cox's proportional hazards analysis, hazard ratio (95% CI):
Cluster 2 vs. Cluster 1: 0.74 (0.4 – 1.4)
Cluster 3 vs. Cluster 2: 3.08 (1.4 – 6.8)
Age: 1.02 (1.0 – 1.1)
Female vs. Male: 0.56 (0.3 – 1.0)
Smoking frequency: 1.24 (0.6 – 2.5)
Tumor size: 1.68 (0.9 – 3.3)
Presence of tumor in lymph nodes: 2.50 (1.3 – 4.8)
Stage: 1.43 (0.7 – 2.9)

HARVARD DATA: Support Vector Machine (SVM) with Recursive Feature Elimination (RFE) applied to g=2 clusters
Figure: ten-fold cross-validation error rate (CV10E) of the SVM against log2 (number of genes), applied to 2 clusters.

There are some obstacles to integrating information from the Stanford and Ontario Datasets. In the Stanford Dataset, we have clinical data only for the adenocarcinoma (plus two other) samples that are classified into their AC groupings. In the Ontario Dataset, on the other hand, we have clinical data for various tumor types. The Ontario study attempts to relate non-adenocarcinoma samples, as well as adenocarcinoma samples, to clinical outcome. To do so, they simply cluster the tumors into two groups, recurrence vs non-recurrence, with no evidence for adenocarcinoma subclasses as found in the Stanford study. Additionally, within our gene clusters, we could not find genes common to both datasets.

CONCLUSIONS
We developed a model-based clustering approach to classify tumor tissues using microarray gene expression profiles. The clustering performed best as a predictor of the clinical outcome based on the overall survival or recurrence times. The results obtained from the analysis of both the Stanford and Ontario Datasets indicate that classification of patients into good-prognosis and poor-prognosis subgroups on the basis of microarrays could be a useful tool for understanding lung cancer biology and for guiding treatment and patient care for lung cancer patients.

ACKNOWLEDGMENTS
We thank our colleagues Nazim Khan, Abdollah Khodkar and Justin Zhu.