Published by Brian Hood. Modified over 6 years ago.
1
A graph-based integration of multiple layers of cancer genomics data (Progress Report)
Do Kyoon Kim 1
2
Outline
Introduction
Databasing TCGA Data
Graph-based Semi-Supervised Learning with Gene Expression Data
3
Introduction
4
Introduction
Although microarray technology allows investigation of the transcriptomic make-up of a tumor, the transcriptome does not completely reflect the underlying biology, owing to alternative splicing and post-translational modification.
This increases the importance of integrating more than one source of genome-wide data, such as the genome, transcriptome, proteome, and epigenome.
The current increase in the amount of available omics data emphasizes the need for a methodological integration framework.
5
Introduction
Data integration: different points of view
Heterogeneous data from different sources are analyzed sequentially.
The term "data integration" has also been used as a synonym for data merging, in which different data sets are concatenated at the database level by cross-referencing the identifiers.
Multiple layers of experimental data are integrated into one mathematical model for the development of more homogeneous classifiers in clinical decision support.
Daemen et al., 2009, Genome Medicine
6
The Cancer Genome Atlas (TCGA)
Mission: The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale sequencing.
Goal: To improve our ability to diagnose, treat, and prevent cancer.
A pilot project developed and tested the research framework needed to systematically explore the entire spectrum of genomic changes involved in human cancer.
Focus on three selected cancer types:
Serous cystadenocarcinoma (ovarian)
Squamous carcinoma (lung)
Glioblastoma multiforme (brain)
500 samples per tumor type
7
TCGA data
How to integrate?
TCGA Research Network (2008), Nature
8
The second page of TCGA project
9
Specific Goal
Problem: Prediction of recurrence in GBM patients using multiple types of genomic data
Figure labels: PHENOTYPE, SNP, SEQUENCE, EXPRESSION, METHYLATION, COPY NUMBER, miRNA
10
Biological Organization
[Figure: biological organization from genome to phenotype. Labels, in order: TF binding, SNP, methylation, CNV/LOH/Del, TFbs, TRANSCRIPTION, Gene, alternative splicing, EXPRESSION, microRNA, mRNA, TRANSLATION, post modification (glucosylation, phosphorylation), Protein, TF, FUNCTION, Phenotype]
TF: transcription factor
TFbs: transcription factor binding site
11
Graph-based Learning
Recently, a semidefinite programming (SDP) based SVM method was introduced to integrate multiple data sources.
In SDP/SVM, multiple kernel matrices, one for each data source, are combined.
However, when SDP/SVM is applied to large problems, the computational cost can become prohibitive, since both converting the data to a kernel matrix for the SVM and solving the SDP are time- and memory-demanding.
12
Graph-based Learning
Graph-based semi-supervised learning methods have made significant progress in the machine learning community.
One important problem in graph-based learning that has not yet been addressed is the combination of multiple graphs.
Each vectorial data set can be incorporated after conversion into a network.
Due to the sparsity of network edges, the computation time is nearly linear in the number of edges of the combined network.
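One simple way to picture the combination of multiple graphs is a fixed-weight sum of sparse edge lists. The sketch below is only an illustration under that assumption: the slides do not say how the graphs are combined, and the function name `combine_graphs`, the toy edges, and the equal weights are all hypothetical.

```python
def combine_graphs(graphs, weights):
    """Weighted sum of sparse graphs stored as {(i, j): edge_weight} dicts."""
    if len(graphs) != len(weights):
        raise ValueError("need exactly one weight per graph")
    combined = {}
    for graph, beta in zip(graphs, weights):
        # Only existing edges are visited, so the cost grows with the
        # total number of edges, not with the number of sample pairs.
        for edge, w in graph.items():
            combined[edge] = combined.get(edge, 0.0) + beta * w
    return combined


# Toy graphs over three samples (0, 1, 2); each edge is stored once with i < j.
expression_graph = {(0, 1): 1.0, (1, 2): 0.5}
methylation_graph = {(0, 1): 0.2, (0, 2): 0.8}

combined = combine_graphs([expression_graph, methylation_graph], [0.5, 0.5])
```

Because the loop touches each stored edge once, the work is proportional to the number of edges, which is the property the slide points to for sparse networks.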
13
Graph-based integration
[Figure labels: expression, miRNA, Methylation, CNV]
14
Databasing TCGA data
15
Data release
Data Levels I and II correspond to raw and processed data, respectively, for each sample.
Level III data are the output of basic analyses of Level I/II data, such as mutation calls for sequenced genes, copy number and LOH calls for aberrant genomic regions, and the expression level of a gene for each sample.
Level IV data represent interpretations of the data, such as which genes are significantly mutated or altered in copy number, DNA methylation, or expression across multiple samples and data types.
To protect patient privacy, access to Level I and/or II data for certain platforms (e.g. SNP genotyping) or data types (e.g. germ-line mutations) is restricted to qualified researchers and requires approval by a TCGA Data Access Committee.
16
17
Download directory structure and URL construction
18
Retrieving available TCGA Data: Done
Cancer type: GBM
Time: about 10 days
Size: about 230 GB
19
Databasing TCGA data
20
Python scripts for inserting data into database
21
Databasing annotation files from multiple types of platforms
Multiple types of annotation files -> Database
Annotation (each platform)
ADF files (same genome build)
22
Insert new platforms and Experiment data
Column-wise queries
23
Row-wise queries
Theoretically, queries such as these are possible:
Select all data with level 3 where gene symbol is ‘ERBB2’
Select expression, CGH, methylation with level 2 where chromosome position is ‘17q36.1’
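As a rough illustration of the first row-wise query, here is a minimal sketch using Python's built-in `sqlite3` module. The slides do not show the database engine or schema, so the `measurement` table, its columns, the sample identifiers, and the values are all hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical long-format table: one row per
# (sample, data type, level, gene) measurement.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE measurement (
           sample_id   TEXT,
           data_type   TEXT,
           level       INTEGER,
           gene_symbol TEXT,
           value       REAL)"""
)

# Made-up toy rows, inserted with a parameterized statement.
rows = [
    ("sample_A", "expression", 3, "ERBB2", 2.1),
    ("sample_A", "methylation", 3, "ERBB2", 0.35),
    ("sample_A", "expression", 2, "ERBB2", 7.9),
    ("sample_B", "expression", 3, "EGFR", 4.4),
]
conn.executemany("INSERT INTO measurement VALUES (?, ?, ?, ?, ?)", rows)


def select_by_gene(connection, gene_symbol, level):
    """Row-wise query: every data type at one level for one gene symbol."""
    cursor = connection.execute(
        "SELECT sample_id, data_type, value FROM measurement "
        "WHERE gene_symbol = ? AND level = ? ORDER BY data_type",
        (gene_symbol, level),
    )
    return cursor.fetchall()


erbb2_level3 = select_by_gene(conn, "ERBB2", 3)
```

Passing the gene symbol and level as bound parameters, rather than formatting them into the SQL string, keeps the same query reusable for any gene.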
24
Statistics of data
Overlap samples with tumor type = ‘solid tumor’
Data type      P            S
expression     (missing)    273
miRNA          1510         276
Methylation    1498         238
CNV            (missing)    441
25
Graph-based Semi-Supervised Learning with Gene Expression Data
26
Gene expression data
Data reduction
Class (Procedure_Type from Phenotype)
1: Surgical Resection
-1: Secondary Surgery for tumor recurrence/progression: locoregional procedure
[Figure: expression (P: missing, S: 273) -> with output variable -> expression (P: missing, S: 258) -> gene summarization -> expression (P: missing, S: 258)]
27
Data plot
28
Graph-based SSL
Without feature selection (258 x 12043)
W matrix: K-NN + exp-weighted graphs, K = 5
SSL: Mu = 1
5-fold cross validation
ROC score:
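The settings above can be illustrated with a small pure-Python sketch. It assumes a common graph-based SSL formulation in which the scores solve (I + mu L) f = y, with L = D - W the graph Laplacian and y set to 0 for unlabeled samples; the slides do not state the exact objective, so this formula, the `sigma` bandwidth, and all function names are assumptions.

```python
import math


def knn_exp_graph(points, k, sigma=1.0):
    """Symmetric k-NN graph with weights exp(-d^2 / sigma^2)."""
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            # Keep the edge if either endpoint selects the other.
            W[i][j] = W[j][i] = math.exp(-(d * d) / (sigma * sigma))
    return W


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def graph_ssl(W, y, mu):
    """Scores f solving (I + mu * L) f = y, where L = D - W."""
    n = len(y)
    A = [[-mu * W[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        A[i][i] += 1.0 + mu * sum(W[i])
    return solve(A, y)


# Toy data: two well-separated clusters, one labeled sample in each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (3.0, 3.0), (3.1, 3.0), (3.0, 3.1)]
y = [1.0, 0.0, 0.0, -1.0, 0.0, 0.0]  # 0 marks an unlabeled sample
W = knn_exp_graph(points, k=2)
scores = graph_ssl(W, y, mu=1.0)
predicted = [1 if s > 0 else -1 for s in scores]
```

The dense solver is only for readability on a toy problem; with 258 samples and a sparse k-NN graph, a sparse linear solver would be the natural choice.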
29
Feature Selection
Identify differentially expressed genes between the two phenotypes
T-test using mattest in MATLAB:
p_value < 0.05: 768 genes
p_value < 0.01: 181 genes
p_value < 0.001: 23 genes
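For readers without MATLAB, the per-gene filtering step can be sketched in pure Python. This is not the mattest implementation: it assumes the unequal-variance (Welch) form of the two-sample t-test, obtains the p-value by numerically integrating the Student t density, and uses made-up gene names and values.

```python
import math


def two_sided_p(t, df, steps=2000):
    """P(|T| >= |t|) for Student's t, by Simpson integration of the density."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central = total * h / 3  # integral of the density from 0 to |t|
    return max(0.0, 1.0 - 2.0 * central)


def welch_t_test(a, b):
    """Return (t, degrees_of_freedom, two_sided_p_value) for two groups.

    Assumes at least one group has non-zero variance.
    """
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    se2 = var_a / na + var_b / nb
    t = (mean_a - mean_b) / math.sqrt(se2)
    df = se2 ** 2 / ((var_a / na) ** 2 / (na - 1)
                     + (var_b / nb) ** 2 / (nb - 1))
    return t, df, two_sided_p(t, df)


# Toy expression values: (class 1 samples, class -1 samples) per gene.
expression = {
    "gene_1": ([5.1, 4.9, 5.3, 5.0, 5.2], [7.0, 7.2, 6.8, 7.1, 6.9]),
    "gene_2": ([5.0, 6.1, 4.2, 5.5, 5.9], [5.2, 5.8, 4.6, 5.4, 6.0]),
}
selected = [gene for gene, (group_1, group_2) in expression.items()
            if welch_t_test(group_1, group_2)[2] < 0.05]
```

With roughly 12,000 genes tested at p < 0.05, several hundred genes would be expected to pass by chance alone, so a multiple-testing correction may be worth considering alongside the raw thresholds.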
30
Graph-based SSL
With feature selection (p_value < 0.05): 258 x 768
W matrix: K-NN + exp-weighted graphs, K = 20
SSL: Mu = 10
5-fold cross validation
ROC score:
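The evaluation protocol, 5-fold cross validation scored by the area under the ROC curve, can be sketched as follows. The fold assignment by striding, the convention of masking held-out labels as 0, and the stand-in nearest-centroid scorer are assumptions made for illustration; in the setting of the slides the `predict` callable would be the graph-based SSL step.

```python
def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == -1]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute the ROC score")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k):
    """Split range(n) into k folds by striding."""
    return [list(range(start, n, k)) for start in range(k)]


def cross_validated_auc(labels, k, predict):
    """Mean ROC score over k folds.

    predict(masked_labels) must return one score per sample; held-out
    samples appear in masked_labels with label 0 (treated as unlabeled).
    """
    aucs = []
    for fold in k_fold_indices(len(labels), k):
        held_out = set(fold)
        masked = [0 if i in held_out else y for i, y in enumerate(labels)]
        scores = predict(masked)
        aucs.append(roc_auc([labels[i] for i in fold],
                            [scores[i] for i in fold]))
    return sum(aucs) / len(aucs)


# Toy one-dimensional feature and labels (made-up values).
features = [2.0, 2.2, 1.9, 2.4, 2.1, 1.0, 1.2, 0.8, 1.1, 0.9]
labels = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]


def nearest_centroid_scores(masked):
    """Stand-in scorer: closer to the positive centroid gives a higher score."""
    pos = [x for x, y in zip(features, masked) if y == 1]
    neg = [x for x, y in zip(features, masked) if y == -1]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    return [abs(x - mean_neg) - abs(x - mean_pos) for x in features]


mean_auc = cross_validated_auc(labels, 5, nearest_centroid_scores)
```

Striding happens to put one sample of each class in every fold here; with real, unevenly ordered labels a stratified split would avoid folds that contain a single class.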
31
Future work
Systematically control parameters
Other methods for making the W matrix: correlation, tanh-weighted graphs
Any good method for large-scale features?
Experiment with other data types: miRNA, methylation, CNV
Combine multiple types of genomics data: is the ROC score improved?