Breast Cancer Subtype Identification Using RNA-Seq Data

Breast Cancer Subtype Identification Using RNA-Seq Data. Dvir Netanely, Shamir’s Group Meeting, Tel-Aviv University, June 2013

Breast Cancer: A common disease: 1,300,000 cases and 450,000 deaths each year worldwide. Heterogeneous with respect to: molecular alterations (many different cellular scenarios can lead to cancer); cellular composition (tumors are composed of several types of cells); clinical outcome (outcome prediction is challenging).

Breast Cancer Clinical Subtypes: Breast cancer tumors are clinically categorized into three basic groups, each with its own therapeutic approach: ER+ (Luminal), HER2, and Basal-like (Triple-negative). Classification of breast cancer tumors into distinct subtypes is critical for planning treatment and for developing new therapies.

Project Goals: Improve the classification of cancer samples into therapeutically relevant subtypes, utilizing the flood of new high-throughput data sources. Identify biomarkers that are specific to certain subtypes or outcomes.

Breast cancer subtype classification based on gene expression: In 2000, Perou et al. (with D. Botstein as senior author) identified 4 classes of breast cancer based on gene expression analysis. 65 samples and 496 genes were used for clustering. Identified 4 molecular subtypes: ER+/luminal-like, Basal-like, Her2 (Erb-B2+), Normal-like. Reference: Molecular portraits of human breast tumours, Perou et al., Nature 406, 747-752 (17 August 2000)

Breast cancer subtype classification based on gene expression: “Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications”, Sørlie et al. 2001 (same group, with D. Botstein as a senior author). 85 cDNA microarrays, newer chip, same genes as in the previous study. Identified five robust subtypes: Luminal A, Luminal B, HER2-enriched, Basal-like, Normal-like (Luminal was divided into two subgroups). Survival analyses on a subcohort of patients showed significantly different outcomes for the patients belonging to the various groups.

PAM50 Classifier: In 2009, Parker et al. derived a set of 50 genes that robustly classifies the five subtypes of breast cancer listed above. The analysis was based on 189 tumors and 29 normal samples. The PAM50 gene set agrees closely in classification with the larger “intrinsic” gene sets previously used for subtyping, and is now commonly employed. Reference: Supervised Risk Predictor of Breast Cancer Based on Intrinsic Subtypes, Parker et al. 2009

PAM50

Comprehensive molecular portraits of human breast tumours. The Cancer Genome Atlas Network, Nature 490, 61–70 (October 2012). Identification of breast cancer subtypes based on data coming from five different technologies (not including RNA-Seq).

Paper’s Essence: 466 breast cancer samples, out of a cohort of 825 patients, were clustered based on data coming from 5 different technologies. The resulting clusters correlated with PAM50 labels. The authors additionally performed “consensus clustering”, which is a clustering of the clusters produced independently by the various technologies.

Technologies used: mRNA microarrays (measure mRNA levels for thousands of genes based on probe hybridization); DNA methylation arrays (detect methylation levels); genomic DNA copy number arrays (identify amplifications and deletions at the DNA level); exome sequencing (identifies genetic variation in the form of somatic mutations within the coding regions of the genome, about 1% of the genome); microRNA sequencing (measures miRNA levels using deep sequencing); reverse-phase protein arrays (measure protein abundance using antibodies).

mRNA microarray analysis: Unsupervised hierarchical clustering analysis of 525 tumours and 22 tumour-adjacent normal tissues using the top 3,662 variably expressed genes. Result: high correlation to PAM50 subtypes and to clinical data.

Coordinated analysis of breast cancer subtypes defined from five different genomic/proteomic platforms. a, Consensus clustering analysis of the subtypes identifies four major groups (samples, n = 348). The blue and white heat map displays sample consensus. b, Heat-map display of the subtypes defined independently by miRNAs, DNA methylation, copy number (CN), PAM50 mRNA expression, and RPPA expression. The red bar indicates membership of a cluster type. c, Associations with molecular and clinical features. P values were calculated using a chi-squared test. DC Koboldt et al. Nature (2012)

Another paper about detecting BRCA subtypes: 10 subtypes with distinct outcomes were detected. Two types of data: expression (Illumina HumanHT-12 v3 Expression BeadChip) and copy number (Affymetrix SNP 6.0 platform). External labels used: PAM50 and GENIUS (http://genomebiology.com/content/11/2/R18). Integrative clustering was done using iCluster (2009), a joint latent variable framework for integrative clustering.

RNA-Seq Based Identification of Breast Cancer Subtypes

RNA-Seq: RNA-Seq is a recently developed approach to transcriptome profiling that uses deep-sequencing technologies. Like microarrays, it can provide mRNA abundance quantification, and it is considered to be more precise and less noisy, with a higher dynamic range. It also provides the RNA sequence itself.

The TCGA RNA-Seq Breast Cancer Dataset: RNA-Seq data and clinical data for 956 samples were downloaded from the TCGA web site: 107 normals, 849 tumors, 20531 genes. Sample annotations include PAM50 labels.
PAM50 label distribution:
#Samples  Type
298       NA
107       Normal
39        ?
95        Basal-Like
56        Her2
229       Luminal A
125       Luminal B
7         Normal Like

Data Preprocessing: (1) Removed samples for which a PAM50 label isn’t available. (2) Starting with 20531 genes: flooring to 4; log2 transform; keeping only the top 10% most variable genes (n=2053); standardizing the rows.
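The preprocessing steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not the code used for the talk: the function name `preprocess`, the dict-of-lists input layout, the reading of "Flooring to 4 Log2" as "floor raw values at 4, then take log2", and the use of variance as the variability criterion are all assumptions.

```python
import math
import statistics

def preprocess(expr, floor=4.0, top_fraction=0.10):
    """expr maps gene name -> list of raw expression values (one per sample).

    Returns gene -> standardized log2 profile for the most variable genes.
    """
    # Assumed order: floor on the raw scale, then log2-transform.
    logged = {g: [math.log2(max(v, floor)) for v in vals]
              for g, vals in expr.items()}
    # Rank genes by variance across samples and keep the top fraction.
    variances = {g: statistics.pvariance(vals) for g, vals in logged.items()}
    n_keep = max(1, int(len(logged) * top_fraction))
    kept = sorted(variances, key=variances.get, reverse=True)[:n_keep]
    # Standardize each kept row to zero mean and unit standard deviation.
    result = {}
    for g in kept:
        vals = logged[g]
        mu = statistics.fmean(vals)
        sd = statistics.pstdev(vals)
        result[g] = [(v - mu) / sd if sd > 0 else 0.0 for v in vals]
    return result
```

With 20531 input genes and `top_fraction=0.10`, `n_keep` evaluates to 2053, which matches the gene count quoted on the slide.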

Let the Clustering begin… Goal: Partition the samples into groups exhibiting high correlation to PAM50 labels. Started with Hierarchical clustering. Difficult to visualize (hundreds of samples) A measure is needed to determine correlation to PAM50 labels.

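Applying hierarchical clustering on samples 1:3:956 (MATLAB-style range, i.e. every third sample). [Figure: clustered heat map with annotation bars for PAM50 label (Normal, Her2, LuminalA, LuminalB), ER status, PR status and HER2 status (Negative, Positive), and sample type (Normal, Tumor).]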

Applying hierarchical clustering on different subsets of samples (All samples)

Applying hierarchical clustering on different subsets of samples (Tumors only, without Normals)

Applying hierarchical clustering on different subsets of samples (Luminal A, Luminal B, normal-like)

Insights: In PAM50 subtype detection, some labels are easier to detect than others. From easiest to hardest: Normal, Basal-Like, Her2, Luminal B, Luminal A. Is a decision-tree approach in order, i.e. a multi-step algorithm that uses a different gene subset for every step? A measure is needed for evaluating clustering solutions and their correspondence to external labels. Should it be a global score or a per-cluster score?

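Finding measures for evaluating clustering solutions I: Global Jaccard
\[
S_{ij} = \begin{cases} 1, & i, j \text{ in the same cluster} \\ 0, & \text{otherwise} \end{cases}
\qquad
T_{ij} = \begin{cases} 1, & i, j \text{ bear the same label} \\ 0, & \text{otherwise} \end{cases}
\]
\[
\mathrm{GlobalJaccardIndex} = \frac{|S \cap T|}{|S \cup T|}
\]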
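A minimal sketch of the Global Jaccard computation, assuming that S and T are read as the sets of unordered sample pairs (i < j) for which S_ij = 1 and T_ij = 1 respectively. The function name `global_jaccard` and the choice to exclude the trivial pairs i = j are assumptions, not details given on the slide.

```python
from itertools import combinations

def global_jaccard(clusters, labels):
    """clusters[i] and labels[i] give sample i's cluster id and external label."""
    both = either = 0
    for i, j in combinations(range(len(clusters)), 2):
        s = clusters[i] == clusters[j]  # S_ij: same cluster
        t = labels[i] == labels[j]      # T_ij: same label
        both += s and t                 # pair belongs to the intersection
        either += s or t                # pair belongs to the union
    return both / either if either else 0.0

# Three samples share a cluster, but only two of them share a label.
print(global_jaccard([0, 0, 0, 1], ["A", "A", "B", "B"]))  # 0.25
```

The loop is quadratic in the number of samples, which is unproblematic for a few hundred samples as in this dataset.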

Intersection of Clustering sol. and PAM50 (Samples ordered by PAM50)

Intersection of Clustering sol. and PAM50

Finding measures for evaluating clustering solutions II. Insight: Label homogeneity in a cluster should be prioritized over the number of clusters. Break into small homogeneous pieces and unite. WSMR (Weighted Sum Of Maximal Ratios):
\[
\mathrm{WSMR} = \sum_{\substack{i = 1..C \\ c_i > p}} \frac{c_i}{n} \cdot \max_{l = 1..L} R_{il}
\]
where $n$ is the total number of samples, $C$ the number of clusters, $L$ the number of labels, $c_i$ the size of cluster $i$, $R_{il}$ the ratio of label $l$ in cluster $i$, and $p$ the minimal valid cluster size.
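The WSMR definition translates directly into code. The sketch below is an illustration only: the function name `wsmr` and the parameter name `min_size` (standing in for p) are assumptions, and clusters are skipped when their size is not strictly greater than p, following the condition as written on the slide.

```python
from collections import Counter, defaultdict

def wsmr(clusters, labels, min_size=0):
    """Weighted sum of maximal label ratios over clusters larger than min_size."""
    n = len(clusters)  # total number of samples, including those in skipped clusters
    members = defaultdict(list)
    for c, lab in zip(clusters, labels):
        members[c].append(lab)
    score = 0.0
    for labs in members.values():
        size = len(labs)
        if size <= min_size:
            continue  # cluster is not larger than the minimal valid size p
        # Maximal ratio: share of the most frequent label in this cluster.
        max_ratio = Counter(labs).most_common(1)[0][1] / size
        score += (size / n) * max_ratio
    return score

print(wsmr([0, 0, 0, 1], ["A", "A", "B", "B"]))  # 3/4 * 2/3 + 1/4 * 1
```

With p = 0 every cluster is counted and the measure coincides with what is usually called cluster purity, which is consistent with the slide's remark that homogeneity is rewarded regardless of the number of clusters.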

Applying the two measures on various clustering solutions (Hierarchical Clustering, Euclidean distance, Complete linkage)

Intersection block diagram (Hierarchical Clustering, Euclidean distance, Complete linkage)

Which parameter set will yield a clustering solution that best corresponds to PAM50 labels? [Table residue; the columns appear to be distance metric, linkage, number of clusters, and score. First row: euclidean, complete, 15, 0.798061. Remaining cells, in extraction order: [seuclidean] [complete] [minkowski] [complete] [euclidean] [ward] 20 0.79483 30 40 0.793215 [cosine] [complete] 0.789984 [spearman] [complete] 0.785137 50 0.783522 0.781906 [correlation] [complete] 0.778675 [cityblock] [complete] 0.775444 0.772213]

KMeans - Correlation

KMeans 15 - 0.83

Click – Correlation distance metric

Click – Dot Product distance metric

Insights: Is there a better measure out there, one that penalizes a high number of clusters? The Global Jaccard Index favors solutions with the same number of clusters as there are labels. KMeans results vary from run to run because of random initialization, which makes it hard to identify the best parameters.

Applying our measures on TCGA’s clusters

Building a clustering-based classifier: (1) Find a clustering solution composed of clusters that are highly homogeneous with respect to the external label. (2) Assign a single label to each cluster based on calculation of the maximal ratio. (3) Map each new sample to a cluster based on profile correlation. (4) Assign the cluster’s label to the new sample.
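The classifier steps above could look roughly like the following sketch. Several details are assumptions rather than facts from the talk: "profile correlation" is taken to mean Pearson correlation between the new sample and each cluster's mean profile (centroid), and the names `pearson`, `fit` and `predict` are hypothetical. The clustering itself is taken as given input.

```python
from collections import Counter
import statistics

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def fit(profiles, clusters, labels):
    """Return {cluster id: (centroid, majority label, maximal ratio)}."""
    model = {}
    for c in set(clusters):
        idx = [i for i, ci in enumerate(clusters) if ci == c]
        # Centroid: per-gene mean over the cluster's samples.
        centroid = [statistics.fmean(col) for col in zip(*(profiles[i] for i in idx))]
        label, count = Counter(labels[i] for i in idx).most_common(1)[0]
        model[c] = (centroid, label, count / len(idx))
    return model

def predict(model, profile):
    """Return (label, correlation, maximal ratio) of the best-correlated cluster."""
    best = max(model, key=lambda c: pearson(profile, model[c][0]))
    centroid, label, ratio = model[best]
    return label, pearson(profile, centroid), ratio

# Toy data: four training samples over three genes, two clusters.
profiles = [[1, 2, 3], [2, 4, 6.5], [3, 2, 1], [6, 4, 2.5]]
model = fit(profiles, [0, 0, 1, 1], ["LumA", "LumA", "Basal", "Basal"])
print(predict(model, [0, 1, 2])[0])  # LumA
```

Returning the correlation and the cluster's maximal ratio alongside the label is a guess at the two confidence values (R and MaxRatio) mentioned under Next Steps; how they would be combined is left open there.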

Accuracy distribution over 40 executions. KMeans 15, correlation: mean 0.7512, min 0.669, max 0.830

Kmeans 15, Correlation distance – Example for high score

Accuracy distribution over 40 executions. KMeans 15, Euclidean: mean 0.734, min 0.64, max 0.798

KMeans 15, Euclidean distance: example of a low score (high WSMR but low accuracy)

Summary: Global Jaccard, WSMR and the classifier’s accuracy all indicate how well our algorithm can cluster the samples into meaningful groups. Additional measures are needed. These measures will be used to fine-tune the clustering algorithm.

Next Steps: Complete development of classifier evaluation measures. Look for a better clustering evaluation measure, one that correlates with classifier accuracy. Improve the classifier by: smarter gene selection and better preprocessing; identifying the best clustering algorithm, distance metric and parameters; using the two confidence values (R and MaxRatio) to estimate overall prediction confidence; using a different set of genes for each step (decision tree); focusing on LuminalA-LuminalB separation; using additional data sources; trying other classification methods. Look for significant survival differences between patients mapped to different clusters.