Curate patient-centric multi-omics data for precision medicine

Jun Zhu, Ph.D.
Genomics and Genetic Sciences, Icahn Institute of Genomics and Multi-scale Biology
The Tisch Cancer Institute, Icahn Medical School at Mount Sinai, New York, NY
http://research.mssm.edu/integrative-network-biology/
Email: jun.zhu@mssm.edu
@IcahnInstitute

Why is it critical to curate patient-centric multi-omics data?

Why are errors common in large data sets? Clustering analysis and differential analysis do not require a large number of samples, and a few errors in the samples don’t affect the results much. Data sets are over-powered for these types of analyses (low-hanging fruit), or the data sets are under-used.

Sample swaps. Figure: gene expression, swap across genders.

Gender errors in TCGA (The Cancer Genome Atlas)
Platforms: gene expression based gender (Microarray, RNAseq); DNA methylation based gender (HM27, HM450)
Values per platform: # of tumor samples, male vs female, inconsistent samples
BLCA: -; 408, 292:116, 9 (2.2%); 153, 112:41, 3 (2%)
BRCA: 534, 6:528, 2 (0.4%); 1059, 12:1047, 0 (0%); 318, 3:315; 748, 9:739
COAD: 166, 84:82, 1 (0.6%); 258, 141:117, 14 (5.4%)
GBM: 561, 336:218, 9 (1.6%); 169, 109:60; 225, 138:87; 125, 71:54
KIRC: 219, 142:77, 2 (0.9%); 283, 183:100, 8 (2.9%)
LUAD: 33, 15:18, 7 (21.2%); 517, 243:274, 3 (0.6%); 127, 63:64, 1 (0.8%); 306, 145:161
LUSC: 134, 99:35; 227, 163:64, 5 (2.2%)
READ: 68, 37:31; 96, 54:42
STAD: 384, 242:142, 11 (2.9%); 82, 42:40; 210, 134:76
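A gender consistency check of the kind summarized in this table can be sketched as follows. The slides do not say how gender was called from expression. This sketch assumes log-scale expression values for XIST (expressed in females) and one Y-linked gene such as RPS4Y1, and the simple comparison rule, the function names, and the toy values are all illustrative assumptions.

```python
import numpy as np

def call_gender(xist, y_gene):
    """Call 'female' when XIST expression exceeds the Y-linked gene's, else 'male'.

    Assumed rule for illustration only; real data may need a tuned threshold.
    """
    return np.where(np.asarray(xist) > np.asarray(y_gene), "female", "male")

def inconsistent_samples(sample_ids, annotated, xist, y_gene):
    """Return the samples whose annotated gender disagrees with the expression call."""
    predicted = call_gender(xist, y_gene)
    return [s for s, a, p in zip(sample_ids, annotated, predicted) if a != p]

# Toy cohort (hypothetical values): S3 is annotated female but looks male.
ids = ["S1", "S2", "S3", "S4"]
annotated = ["male", "female", "female", "male"]
xist = [0.5, 9.1, 0.3, 0.4]
y_gene = [8.2, 0.2, 7.9, 8.8]

flagged = inconsistent_samples(ids, annotated, xist, y_gene)
print(flagged)
```

A flagged sample only shows that the annotation and the profile disagree; it does not say which of the two is wrong.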

A gender consistency check is not sufficient! Only ~50% of sample labeling errors can be detected by gender QC. The goal is not only to identify problems/errors, but also to unambiguously correct them.
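One way to see why gender QC catches only about half of labeling errors: if an error is a swap between two randomly chosen samples, a gender check can see it only when the two samples differ in gender. The sketch below computes that fraction from male:female counts. The random pairwise swap model and the function name are assumptions, not something the slides specify; the counts are taken from the TCGA gender table.

```python
def detectable_swap_fraction(n_male, n_female):
    """Fraction of all possible pairwise sample swaps that cross genders."""
    n = n_male + n_female
    total_pairs = n * (n - 1) / 2
    return n_male * n_female / total_pairs

# Male:female counts from the TCGA gender table.
cohorts = {"COAD": (84, 82), "LUSC": (99, 35), "BRCA": (12, 1047)}
fractions = {name: detectable_swap_fraction(m, f) for name, (m, f) in cohorts.items()}

for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of random swaps cross genders")
```

Under this model the fraction is close to 50% for a gender-balanced cohort and drops sharply for a cohort dominated by one gender, such as BRCA, where gender QC alone would miss most swaps.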

Iteratively aggregating patient-centric omics data: Multi-Omics Data Matcher (MODMatcher). Yoo et al., PLoS Comput Biol, 2014.

Aggregating different types of profiles based on intrinsic barcodes. Yoo et al., PLoS Comput Biol, 2014.
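The idea of matching profiles through intrinsic barcodes can be illustrated with a simplified sketch. It is not the MODMatcher implementation: it assumes the barcode is a set of paired features that correlate across two data types, scores every pair of samples by correlation, and keeps reciprocal best matches in a single pass rather than iterating. All function names and the simulated data are hypothetical.

```python
import numpy as np

def standardize(x, axis):
    """Center and scale along the given axis; constant slices stay at zero."""
    mu = x.mean(axis=axis, keepdims=True)
    sd = x.std(axis=axis, keepdims=True)
    return (x - mu) / np.where(sd == 0, 1.0, sd)

def sample_similarity(a, b):
    """Correlation between every sample (column) of a and every sample of b.

    a and b are features x samples matrices whose rows are paired barcode
    features (hypothetically, a gene's expression and its cis-linked value).
    """
    # Standardize each feature across samples, then each sample across features.
    a = standardize(standardize(a, axis=1), axis=0)
    b = standardize(standardize(b, axis=1), axis=0)
    return a.T @ b / a.shape[0]

def reciprocal_best_matches(sim):
    """Keep pairs (i, j) where i's best match is j and j's best match is i."""
    best_for_a = sim.argmax(axis=1)
    best_for_b = sim.argmax(axis=0)
    return {i: int(j) for i, j in enumerate(best_for_a) if best_for_b[j] == i}

# Toy data: 200 barcode features, 30 patients, with two labels swapped in b.
rng = np.random.default_rng(0)
profile_a = rng.normal(size=(200, 30))
profile_b = profile_a + rng.normal(scale=0.5, size=(200, 30))
profile_b[:, [3, 7]] = profile_b[:, [7, 3]]  # simulate a sample swap

matches = reciprocal_best_matches(sample_similarity(profile_a, profile_b))
swapped = {i: j for i, j in matches.items() if i != j}
print(swapped)
```

Because a match names the profile that a mislabeled sample actually belongs to, this kind of check can correct a swap, not just flag it.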

Simulation results

TCGA (The Cancer Genome Atlas)
Cancer: BRCA, GBM
Data type: RNASeq, Microarray, HM27, HM450, CNV
Original tumors: 1059, 534, 318, 747, 1065, 170, 288, 142, 565
Good: 1056, 518, 314, 329, 1043, 520, 284, 141, 559
With other: 3, 16, 14, 4, 1
Poor: 18, 31
Error rate: 0.3%, 3.0%, 1.2%, 2.4%, 3%, 0%, 2.6%, 1.4%, 0.7%, 1.1%
Cancer: LUAD, PRAD, STAD
Data type: RNASeq, Microarray, HM27, HM450, CNV
Original tumors: 517, 33, 127, 460, 421, 498, 493, 384, 476, 82, 395
Good: 515, 23, 110, 453, 420, 374, 369, 376, 446, 77, 371
With other: 9, 17, 1, 5
Poor: 2, 7, 124, 6, 29
Error rate: 0.4%, 28%, 13%, 1.5%, 0.2%, 24.9%, 25.2%, 1.6%, 6.3%, 6.1%

Probabilistic MODMatcher when the intrinsic barcode size is small

Sample cross-contamination. Figure: gene expression, sample contamination.

Clean matching patterns

Possible sample contamination

Mis-annotated clinical information. Figure: gene expression.

Mis-annotated clinical information

Software: will be available as a Docker container soon. Curation results will be shared through our web page.

Acknowledgements
Zhu lab, Mount Sinai: Seungyeul Yoo, Eunjee Lee, Li Wang, Wenhui Wang, Yongjae Woo, Yi Zhang
Mount Sinai Genomics Institute: Matthew Galsky, Yixuan Gong, Spiros Hiotos, Qin Wang
Supported by: Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai; NIH R01-AG046170; BD2K U01-HG008451