Big Data & the CPTAC Data Portal
  Nathan Edwards, Peter McGarvey, Mauricio Oberti, Ratna Thangudu, Shuang Cai, Karen Ketchum
  Georgetown University & ESAC
  Nathan Edwards, Georgetown University Medical Center
NCI: CPTAC
  Clinical Proteomic Tumor Analysis Consortium (CPTAC)
  Comprehensive study of genomically characterized (TCGA) cancer biospecimens by bottom-up mass-spectrometry-based proteomics workflows
  Follows Clinical Proteomics Technology Assessment Consortium (CPTAC Phase I)
CPTAC Data Portal
  All data is publicly released, subject to responsible use guidelines
  Consortium has 15 months to publish first global analysis
  Data available in the meantime
Proteomics Workflows
  Modern Instrumentation: Orbitrap, Q-Exactive, AB 5600
  Protein Enrichment: Phosphoproteins, Glycoproteins
  Quantitation: Label-free (precursor area or spectral count) or iTRAQ
  Peptide Fractionation: Deep sampling of less abundant peptides
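Label-free spectral counting, one of the quantitation options listed above, amounts to tallying confident PSMs per gene in each sample. A minimal sketch, assuming PSMs have already been filtered and mapped to genes; the sample and gene names are made up for illustration:

```python
from collections import Counter


def spectral_counts(psms):
    """Count confident PSMs per (sample, gene) pair.

    `psms` is any iterable of (sample, gene) tuples, one per confident PSM.
    """
    counts = Counter()
    for sample, gene in psms:
        counts[(sample, gene)] += 1
    return counts


# Hypothetical toy input: these names do not come from the CPTAC data.
toy_psms = [
    ("sample_A", "GENE1"),
    ("sample_A", "GENE1"),
    ("sample_A", "GENE2"),
    ("sample_B", "GENE1"),
]
counts = spectral_counts(toy_psms)
print(counts[("sample_A", "GENE1")])
```

Because the input is consumed one PSM at a time, the same function works on a generator that streams PSMs from many files.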
Available Data
  Mass Spectrometry Data
    Raw and mzML formats
  Experimental Design Meta-Data
    Link to TCGA, clinical context
  Analytical Protocol Documents
    Sample prep, chromatography, MS
  Peptide-Spectrum-Match Data
    CPTAC Common analysis pipeline (NIST)
    MS-GF+ based, TSV and mzIdentML formats
    Gene inference and quantitation
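Since the PSM data is distributed as TSV, a few lines of scripting are enough to apply a q-value cutoff. A minimal sketch, not the Common analysis pipeline itself; the column names (`SpectrumID`, `Peptide`, `Gene`, `QValue`) are assumptions for illustration and may not match the portal's actual files:

```python
import csv
import io


def filter_psms(tsv_text, qvalue_column="QValue", threshold=0.01):
    """Return the PSM rows whose q-value is at or below the threshold."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if float(row[qvalue_column]) <= threshold]


# Hypothetical three-row table standing in for a real PSM file.
example = (
    "SpectrumID\tPeptide\tGene\tQValue\n"
    "scan=1\tPEPTIDEK\tGENE1\t0.001\n"
    "scan=2\tSAMPLER\tGENE2\t0.2\n"
    "scan=3\tANOTHERK\tGENE1\t0.01\n"
)
kept = filter_psms(example)
print([row["SpectrumID"] for row in kept])
```

For a file on disk, the same `csv.DictReader` call can wrap an open file handle instead of a string buffer.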
CPTAC/TCGA Colorectal Cancer (Proteome)
  Vanderbilt PCC (PI: Liebler), Embargo: 12/ TCGA samples, 15 fractions / sample
  Label-free spectral count / precursor XIC quant.
  Orbitrap Velos; high-accuracy precursor
  1,425 spectra files
  ~600 GB / ~129 GB (mzML.gz)
  Spectra: ~18M; ~13M MS/MS
  4,644,354 PSMs at 1% MS-GF+ q-value
  10,258 genes at 0.01% gene FDR, 9,047 groups
CPTAC/TCGA Breast Cancer (Proteome)
  Broad PCC (PI: Carr), Embargo: 5/ TCGA samples, 25 fractions / sample-mixture
  Proteome; iTRAQ quantitation; 3 samples vs POOL
  Q-Exactive; high-accuracy precursor
  900 spectra files
  ~1 TB / ~280 GB (mzML.gz)
  Spectra: ~41M; ~32M MS/MS
  13,764,193 PSMs at 1% MS-GF+ q-value
  13,716 genes at 0.01% gene FDR, 10,007 groups
CPTAC/TCGA Breast Cancer (Phosphoproteome)
  Broad PCC (PI: Carr), Embargo: 5/ TCGA samples, 13 fractions / sample-mixture
  IMAC enriched; iTRAQ quantitation; 3 samples vs POOL
  Q-Exactive; high-accuracy precursor
  468 spectra files
  ~600 GB / ~130 GB (mzML.gz)
  Spectra: ~16M; ~10M MS/MS
  3,355,721 PSMs at 1% MS-GF+ q-value
  10,352 genes at 0.01% gene FDR, 8,875 groups
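The three dataset summaries give enough numbers for a quick back-of-the-envelope comparison. The sketch below derives an approximate identification rate and the average number of PSMs per gene. It assumes at most one reported PSM per MS/MS spectrum, and the MS/MS totals are the rounded "~" figures from the slides, so the results are rough:

```python
# Figures copied from the dataset summaries; "msms" values are the rounded
# "~" totals, so every derived number is approximate.
datasets = {
    "Colorectal (Proteome)": {"msms": 13_000_000, "psms": 4_644_354, "genes": 10_258},
    "Breast (Proteome)": {"msms": 32_000_000, "psms": 13_764_193, "genes": 13_716},
    "Breast (Phosphoproteome)": {"msms": 10_000_000, "psms": 3_355_721, "genes": 10_352},
}


def summarize(stats):
    """Return the fraction of MS/MS spectra with a PSM and the PSMs per gene."""
    return {
        "id_rate": stats["psms"] / stats["msms"],
        "psms_per_gene": stats["psms"] / stats["genes"],
    }


summary = {name: summarize(stats) for name, stats in datasets.items()}
for name, values in summary.items():
    print(f"{name}: {values['id_rate']:.1%} of MS/MS identified, "
          f"{values['psms_per_gene']:.0f} PSMs per gene")
```

Under these assumptions, roughly a third to a bit over 40% of MS/MS spectra yield a PSM at the 1% q-value, and each gene is supported by hundreds to about a thousand PSMs on average.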
CPTAC Data Center Lessons
  Files on disk are "easy"
  Meta-data, experimental design, semantics HARD
  File naming conventions seem trivial but do it
  Backup, access, redundancy is IT and costs $$
  Advanced network transfer tools really work!
    Aspera provides order of magnitude improvement
    Scriptable upload/download/navigation matters!
  (Spectra) file integrity is really important
    Platform agnostic chain of custody from lab
    mzML conversion verifies RAW file semantics
    mzML embeds checksums, platform agnostic
    mzML semantic compression (peaks only)
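The file-integrity lesson above can be scripted with nothing beyond the standard library. A minimal sketch, assuming each spectra file ships with a SHA-1 digest in some manifest; it does not parse the checksum embedded in mzML, and the file name is hypothetical:

```python
import hashlib
import os
import tempfile


def file_sha1(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large spectra files never sit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_sha1):
    """Return True when the file's SHA-1 matches the manifest entry."""
    return file_sha1(path) == expected_sha1.strip().lower()


# Demo on a throwaway file: the SHA-1 of the bytes b"abc" is a standard test vector.
ABC_SHA1 = "a9993e364706816aba3e25717850c26c9cd0d89d"
with tempfile.TemporaryDirectory() as workdir:
    demo_path = os.path.join(workdir, "demo.mzML.gz")  # hypothetical file name
    with open(demo_path, "wb") as handle:
        handle.write(b"abc")
    intact = verify_file(demo_path, ABC_SHA1)
    with open(demo_path, "ab") as handle:
        handle.write(b"!")  # simulate corruption in transit
    corrupted = verify_file(demo_path, ABC_SHA1)
print(intact, corrupted)
```

Running the same check at the lab, at the data center, and after download gives a platform-agnostic chain of custody without depending on any one vendor's tools.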
CPTAC TCGA Data Lessons
  Monolithic computation no longer sufficient!
    Many datafiles, distributed computation, out-of-core
  PSMs are the new RAW data? (~ NGS reads)
    Many PSMs / gene; # Spectra >> # Sequences!
  "Poor" acquisitions are not uncommon
    Need fast, easy QC to permit re-analysis
  Other issues:
    Is identifiability information leaking (germline mutations)?
    Protein inference for human/mouse xenograft spectra?
    How to really handle isoforms?
    Proteome coverage: how to estimate?
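One way to get the fast, easy QC called for above is a single streaming pass over the PSMs that flags spectra files with an unusually low identification rate. A minimal sketch; the 10% cutoff, the file names, and the per-file MS/MS totals are illustrative assumptions, not CPTAC practice:

```python
from collections import Counter


def flag_poor_acquisitions(psm_file_names, msms_per_file, min_rate=0.10):
    """Flag spectra files whose fraction of identified MS/MS spectra is low.

    `psm_file_names` yields the source file name of each confident PSM and is
    consumed one item at a time, so it can stream from disk (out-of-core).
    `msms_per_file` maps each file name to its MS/MS spectrum count.
    """
    identified = Counter()
    for file_name in psm_file_names:
        identified[file_name] += 1
    flagged = {}
    for file_name, total in msms_per_file.items():
        rate = identified[file_name] / total if total else 0.0
        if rate < min_rate:
            flagged[file_name] = rate
    return flagged


def toy_psm_stream():
    """Hypothetical PSM stream: 40 PSMs from one file, 2 from another."""
    for _ in range(40):
        yield "fraction_01.mzML.gz"
    for _ in range(2):
        yield "fraction_02.mzML.gz"


flagged = flag_poor_acquisitions(
    toy_psm_stream(),
    {"fraction_01.mzML.gz": 100, "fraction_02.mzML.gz": 100},
)
print(flagged)
```

Only per-file counters are held in memory, so the pass scales with the number of files rather than the number of spectra.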
Heresy: PSMs as NGS reads
  Need O(n) spectra → good PSMs
    We work too hard to identify all spectra, too stringent?
    Progressive, Pareto, PTAS identification?
    Output as genome alignments, BAM files?
  Volume dominates noise and loss of detail:
    e.g. Twitter; indirect observation of splicing, PTMs?
  Models of distributed computation
    Distributed data and/or computation
    Failure, interruption tolerant computing
    Heterogeneous computing resources
    PSM search engine API for mining (social, reward?)
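Failure and interruption tolerance can start with something as simple as a per-file checkpoint, so a restarted job only repeats unfinished work. A minimal single-machine sketch of that idea, not a distributed framework; `flaky_search` is a made-up stand-in for a real search engine call, and results are assumed to be JSON-serializable:

```python
import json
import os
import tempfile


def run_with_checkpoint(tasks, work, checkpoint_path):
    """Run work(task) for each task, skipping tasks a previous run completed."""
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            done = json.load(handle)
    for task in tasks:
        if task in done:
            continue
        try:
            done[task] = work(task)
        except Exception:
            continue  # leave the task unfinished; a later run retries it
        # Write to a temporary file, then rename, so an interruption
        # mid-write cannot leave a truncated checkpoint behind.
        temp_path = checkpoint_path + ".tmp"
        with open(temp_path, "w") as handle:
            json.dump(done, handle)
        os.replace(temp_path, checkpoint_path)
    return done


calls = []


def flaky_search(file_name):
    """Hypothetical search step that fails once on the second file."""
    calls.append(file_name)
    if file_name == "f2.mzML.gz" and calls.count(file_name) == 1:
        raise RuntimeError("simulated node failure")
    return len(file_name)


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "done.json")
    files = ["f1.mzML.gz", "f2.mzML.gz", "f3.mzML.gz"]
    first = run_with_checkpoint(files, flaky_search, checkpoint)
    second = run_with_checkpoint(files, flaky_search, checkpoint)
print(sorted(first), sorted(second))
```

The second run repeats only the file that failed, which is the property that matters when jobs run on heterogeneous or unreliable resources.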