Curate patient-centric multi-omics data for precision medicine

Jun Zhu, Ph.D.
Genomics and Genetic Sciences
Icahn Institute of Genomics and Multi-scale Biology
The Tisch Cancer Institute
Icahn Medical School at Mount Sinai
New York, NY
@IcahnInstitute

Why is it critical to curate patient-centric multi-omics data?

Why are errors common in large data sets?
Clustering analysis and differential analysis do not require a large number of samples, and a few errors in the samples don't affect the results much. Data sets are either over-powered for these types of analyses (low-hanging fruit) or under-used.

Sample swaps
Gene expression: swap across genders

Gene expression based gender vs. DNA methylation based gender

Gender errors in TCGA (The Cancer Genome Atlas)
Columns: Tumor Type; Gene expression based gender (Microarray, RNAseq); DNA methylation based gender (HM27, HM450)
Per platform: # of tumor samples; Male vs Female; Inconsistent samples
BLCA: - 408 292:116 9 (2.2%) 153 112:41 3 (2%)
BRCA: 534 6:528 2 (0.4%) 1059 12:1047 0 (0%) 318 3:315 748 9:739
COAD: 166 84:82 1 (0.6%) 258 141:117 14 (5.4%)
GBM: 561 336:218 9 (1.6%) 169 109:60 225 138:87 125 71:54
KIRC: 219 142:77 2 (0.9%) 283 183:100 8 (2.9%)
LUAD: 33 15:18 7 (21.2%) 517 243:274 3 (0.6%) 127 63:64 1 (0.8%) 306 145:161
LUSC: 134 99:35 227 163:64 5 (2.2%)
READ: 68 37:31 96 54:42
STAD: 384 242:142 11 (2.9%) 82 42:40 210 134:76
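
The gender check behind a table like this can be sketched in a few lines. This is a minimal illustration, not the procedure used in the talk: the marker genes (XIST, expressed mainly in females, and the Y-linked RPS4Y1), the simple "whichever is higher" rule, and the sample values are all assumptions made for the example.

```python
def call_gender(xist, rps4y1):
    """Call gender from two marker-gene expression values.

    Hypothetical rule: XIST higher than RPS4Y1 means female, otherwise male.
    """
    return "female" if xist > rps4y1 else "male"


def inconsistent_samples(samples):
    """Return sorted IDs whose expression-based gender disagrees with the annotation.

    samples: dict of sample_id -> (annotated_gender, xist_expr, rps4y1_expr)
    """
    return sorted(
        sid
        for sid, (annotated, xist, rps4y1) in samples.items()
        if call_gender(xist, rps4y1) != annotated
    )


# Made-up values for illustration only.
samples = {
    "S1": ("female", 9.2, 0.3),
    "S2": ("male", 0.4, 8.7),
    "S3": ("male", 8.9, 0.2),  # annotated male, expression looks female
    "S4": ("female", 9.5, 0.1),
}

flagged = inconsistent_samples(samples)
rate = len(flagged) / len(samples)
print(flagged, f"{rate:.1%}")
```

The "Inconsistent samples" percentages in the table are the same kind of ratio: flagged samples divided by the number of tumor samples on that platform.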

Gender consistency check is not sufficient!
Only ~50% of sample labeling errors can be detected by gender QC. The goal is not only to identify problems/errors, but also to unambiguously correct them.
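
One way to see why gender QC misses many errors: a swap between two samples of the same gender leaves every gender label consistent. The sketch below assumes, purely for illustration, that labeling errors are pairwise swaps between two samples drawn uniformly at random; the function name is hypothetical, and the talk does not state how the ~50% figure was derived.

```python
def cross_gender_swap_fraction(n_male, n_female):
    """Fraction of random pairwise swaps that cross genders.

    Assumption: every unordered pair of distinct samples is equally likely
    to be swapped. Only cross-gender swaps are visible to a gender check.
    """
    n = n_male + n_female
    if n < 2:
        raise ValueError("need at least two samples to form a swap")
    return 2 * n_male * n_female / (n * (n - 1))


# A balanced cohort sits near one half; skewed cohorts do worse.
print(f"balanced 50:50  -> {cross_gender_swap_fraction(50, 50):.3f}")
print(f"BLCA 292:116    -> {cross_gender_swap_fraction(292, 116):.3f}")
print(f"BRCA 12:1047    -> {cross_gender_swap_fraction(12, 1047):.3f}")
```

Under this assumption, the detectable fraction is close to one half only when males and females are evenly represented, and it drops sharply for cohorts such as BRCA that are almost entirely one gender.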

Iteratively aggregating patient-centric omics data
Multi-omics data Matcher (MODMatcher)
Yoo et al., PLoS Comput. Biol., 2014

Aggregating different types of profiles based on intrinsic barcodes
Yoo et al., PLoS Comput. Biol., 2014
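
The idea of matching profiles through an intrinsic barcode can be illustrated with a simplified sketch. This is not the MODMatcher implementation: it assumes each sample has two profiles measured over the same ordered set of feature pairs that track each other within a patient, scores every cross-type pair of samples by Pearson correlation, and keeps only reciprocal best matches. All function names and data values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def reciprocal_best_matches(profiles_a, profiles_b):
    """Pair sample IDs across two data types when each is the other's best match.

    profiles_a, profiles_b: dict of sample_id -> list of values over the
    same ordered feature pairs (the "barcode" in this sketch).
    """
    score = {
        (a, b): pearson(va, vb)
        for a, va in profiles_a.items()
        for b, vb in profiles_b.items()
    }
    best_for_a = {a: max(profiles_b, key=lambda b: score[(a, b)]) for a in profiles_a}
    best_for_b = {b: max(profiles_a, key=lambda a: score[(a, b)]) for b in profiles_b}
    return {a: b for a, b in best_for_a.items() if best_for_b[b] == a}


# Made-up data: the second data type has the P2 and P3 labels swapped.
expr = {"P1": [1, 5, 2, 8], "P2": [7, 1, 6, 2], "P3": [3, 3, 9, 1]}
other = {
    "P1": [1.1, 4.8, 2.2, 7.9],
    "P2": [3.2, 2.9, 9.1, 1.0],
    "P3": [6.8, 1.2, 6.1, 2.1],
}

matches = reciprocal_best_matches(expr, other)
mismatched = {a: b for a, b in matches.items() if a != b}
print(matches)
print(mismatched)
```

Because the match points at a specific partner rather than just raising a flag, a mismatch such as P2 pairing with P3 suggests how to correct the label, which a gender check alone cannot do.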

Simulation results

TCGA (The Cancer Genome Atlas)
Cancer: BRCA GBM
Data type: RNASeq Microarray HM27 HM450 CNV
Original tumors: 1059 534 318 747 1065 170 288 142 565
Good: 1056 518 314 329 1043 520 284 141 559
With other: 3 16 14 4 1
Poor: 18 31
Error rate: 0.3% 3.0% 1.2% 2.4% 3% 0% 2.6% 1.4% 0.7% 1.1%

Cancer: LUAD PRAD STAD
Data type: RNASeq Microarray HM27 HM450 CNV
Original tumors: 517 33 127 460 421 498 493 384 476 82 395
Good: 515 23 110 453 420 374 369 376 446 77 371
With other: 9 17 1 5
Poor: 2 7 124 6 29
Error rate: 0.4% 28% 13% 1.5% 0.2% 24.9% 25.2% 1.6% 6.3% 6.1%

Probabilistic MODMatcher
For cases where the intrinsic barcode size is small

Sample cross-contamination
Gene expression: sample contamination

Clean matching patterns

Possible sample contamination

Mis-annotated clinical information
Gene expression

Mis-annotated clinical information

Software
Will be available as a Docker container soon.
Curation results will be shared through our web page.

Acknowledgements
Zhu lab, Mount Sinai: Seungyeul Yoo, Eunjee Lee, Li Wang, Wenhui Wang, Yongjae Woo, Yi Zhang
Mount Sinai Genomics Institute: Matthew Galsky, Yixuan Gong, Spiros Hiotos, Qin Wang
Supported by: Icahn Institute of Genomics and Multiscale Biology, Icahn School of Medicine at Mount Sinai; NIH R01-AG046170; BD2K U01-HG008451
