Big Data & the CPTAC Data Portal. Nathan Edwards, Peter McGarvey, Mauricio Oberti, Ratna Thangudu, Shuang Cai, Karen Ketchum. Georgetown University & ESAC.

Presentation transcript:

Big Data & the CPTAC Data Portal
Nathan Edwards, Peter McGarvey, Mauricio Oberti, Ratna Thangudu, Shuang Cai, Karen Ketchum
Georgetown University & ESAC
Nathan Edwards, Georgetown University Medical Center

NCI: CPTAC
Clinical Proteomic Tumor Analysis Consortium (CPTAC)
  Comprehensive study of genomically characterized (TCGA) cancer biospecimens by bottom-up mass-spectrometry-based proteomics workflows
  Follows Clinical Proteomics Technology Assessment Consortium (CPTAC Phase I)

NCI: CPTAC

CPTAC Data Portal
All data is publicly released…
  …subject to responsible use guidelines
Consortium has 15 months to publish first global analysis
  Data available in the meantime.

Proteomics Workflows
Modern Instrumentation: Orbitrap, Q-Exactive, AB 5600
Protein Enrichment: Phosphoproteins, Glycoproteins
Quantitation: Label-free, precursor area or spectral count; or iTRAQ
Peptide Fractionation: Deep sampling of less abundant peptides
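
Label-free quantitation by spectral count, as listed above, amounts to counting the peptide-spectrum matches (PSMs) assigned to each gene or protein in each sample. A minimal sketch, assuming PSMs arrive as (sample, gene, peptide) tuples; that tuple layout and the toy identifiers are hypothetical and not taken from the CPTAC files:

```python
from collections import Counter


def spectral_counts(psms):
    """Count PSMs per (sample, gene) pair.

    `psms` is an iterable of (sample, gene, peptide) tuples. The tuple
    layout is a hypothetical stand-in for a real PSM table.
    """
    counts = Counter()
    for sample, gene, _peptide in psms:
        # Every PSM counts, including repeat identifications of a peptide.
        counts[(sample, gene)] += 1
    return counts


# Toy, made-up PSMs for illustration only.
toy_psms = [
    ("S1", "GENE_A", "PEPTIDEK"),
    ("S1", "GENE_A", "PEPTIDEK"),  # same peptide identified again
    ("S1", "GENE_B", "ANOTHERR"),
    ("S2", "GENE_A", "PEPTIDEK"),
]

counts = spectral_counts(toy_psms)
for (sample, gene), n in sorted(counts.items()):
    print(sample, gene, n)
```

Precursor-area and iTRAQ quantitation need intensity values rather than counts, so they are outside what this sketch covers.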

Available Data
Mass Spectrometry Data
  Raw and mzML formats
Experimental Design Meta-Data
  Link to TCGA, clinical context
Analytical Protocol Documents
  Sample prep, chromatography, MS
Peptide-Spectrum-Match Data
  CPTAC Common analysis pipeline (NIST)
  MS-GF+ based, TSV and mzIdentML formats
  Gene inference and quantitation
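
Since PSMs are offered as TSV, a common first step is to stream a file and keep only confident matches. The sketch below assumes a tab-separated table with a header row and a q-value column named `QValue`; the column names and the toy rows are assumptions for illustration, so check the header of the actual portal files before relying on them:

```python
import csv
import io


def confident_psms(tsv_handle, qvalue_column="QValue", max_qvalue=0.01):
    """Yield PSM rows whose q-value is at or below `max_qvalue`.

    Rows are read one at a time, so the whole file is never held in memory.
    The column name is an assumption, not a documented CPTAC field.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    for row in reader:
        if float(row[qvalue_column]) <= max_qvalue:
            yield row


# Toy, made-up table standing in for a real PSM file.
toy_tsv = io.StringIO(
    "SpecID\tPeptide\tQValue\n"
    "scan=1\tPEPTIDEK\t0.001\n"
    "scan=2\tANOTHERR\t0.05\n"
    "scan=3\tTHIRDPEPK\t0.01\n"
)

kept = [row["SpecID"] for row in confident_psms(toy_tsv)]
print(kept)
```

For a file on disk, pass `open(path, newline="")` in place of the `io.StringIO` object.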

CPTAC/TCGA Colorectal Cancer (Proteome)
Vanderbilt PCC (PI: Liebler), Embargo: 12/
TCGA samples, 15 fractions / sample
Label-free spectral count / precursor XIC quant.
Orbitrap Velos; high-accuracy precursor
1425 spectra files
  ~ 600 Gb / ~ 129 Gb (mzML.gz)
  Spectra: ~ 18M; ~ 13M MS/MS
4,644,354 PSMs at 1% MSGF+ q-value
10,258 genes at 0.01% gene FDR, 9047 groups

CPTAC/TCGA Breast Cancer (Proteome)
Broad PCC (PI: Carr), Embargo: 5/
TCGA samples, 25 fractions / sample-mixture
Proteome; iTRAQ quantitation; 3 samples vs POOL
Q-Exactive; high-accuracy precursor
900 spectra files
  ~ 1Tb / ~ 280 Gb (mzML.gz)
  Spectra: ~ 41M; ~ 32M MS/MS
13,764,193 PSMs at 1% MSGF+ q-value
13,716 genes at 0.01% gene FDR, 10,007 groups

CPTAC/TCGA Breast Cancer (Phosphoproteome)
Broad PCC (PI: Carr), Embargo: 5/
TCGA samples, 13 fractions / sample-mixture
IMAC enriched; iTRAQ quant.; 3 samp. vs POOL
Q-Exactive; high-accuracy precursor
468 spectra files
  ~ 600 Gb / ~ 130 Gb (mzML.gz)
  Spectra: ~ 16M; ~ 10M MS/MS
3,355,721 PSMs at 1% MSGF+ q-value
10,352 genes at 0.01% gene FDR, 8875 groups
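
The figures on the three dataset slides can be turned into two rough summary numbers: the fraction of MS/MS spectra that yield a PSM at the 1% q-value, and the ratio between the two listed file sizes. The sketch below rests on several assumptions: the first size is read as the raw data and the second as the compressed mzML, "1Tb" is taken as 1000 of the "Gb" units, each PSM is taken to correspond to one MS/MS spectrum, and the spectrum counts are the approximate values from the slides.

```python
# Figures copied from the slides; sizes are in the slides' "Gb" units.
datasets = {
    "Colorectal (Proteome)": {
        "first_size": 600, "mzml_gz_size": 129, "msms": 13e6, "psms": 4_644_354,
    },
    "Breast (Proteome)": {
        "first_size": 1000, "mzml_gz_size": 280, "msms": 32e6, "psms": 13_764_193,
    },
    "Breast (Phosphoproteome)": {
        "first_size": 600, "mzml_gz_size": 130, "msms": 10e6, "psms": 3_355_721,
    },
}


def summarize(d):
    """Return the identification rate and the size ratio for one dataset."""
    return {
        "id_rate": d["psms"] / d["msms"],
        "size_ratio": d["first_size"] / d["mzml_gz_size"],
    }


summary = {name: summarize(d) for name, d in datasets.items()}
for name, s in summary.items():
    print(f"{name}: {s['id_rate']:.0%} of MS/MS identified, "
          f"{s['size_ratio']:.1f}x size ratio")
```

Under these assumptions the identification rates fall between about 34% and 43%, and the compressed mzML is between roughly 3.6 and 4.7 times smaller than the first listed size.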

CPTAC Data Center Lessons
Files on disk are "easy"
  Meta-data, experimental design, semantics HARD
  File naming conventions seem trivial but do it
  Backup, access, redundancy is IT and costs $$
Advanced network transfer tools really work!
  Aspera provides order of magnitude improvement
  Scriptable upload/download/navigation matters!
(Spectra) file integrity is really important
  Platform agnostic chain of custody from lab
  mzML conversion verifies RAW file semantics
  mzML embeds checksums, platform agnostic
  mzML semantic compression (peaks only)
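
The file-integrity point can be illustrated with a platform-agnostic checksum check: compute a digest of each spectra file after transfer and compare it with the value recorded at the source. This is a minimal sketch, not the portal's actual procedure. The choice of SHA-1 and the function names are assumptions, and the demo file is a made-up stand-in for a real spectra file.

```python
import hashlib
import os
import tempfile


def file_sha1(path, chunk_size=1 << 20):
    """Return the SHA-1 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """True if the file's digest matches the recorded value."""
    return file_sha1(path) == expected_hex.strip().lower()


# Demo with a temporary file standing in for a downloaded spectra file.
payload = b"not a real spectrum file\n" * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    demo_path = tmp.name

recorded = hashlib.sha1(payload).hexdigest()  # value recorded "at the lab"
ok = verify(demo_path, recorded)
tampered = verify(demo_path, "0" * 40)        # deliberately wrong digest
os.remove(demo_path)
print(ok, tampered)
```

Reading in chunks keeps memory use flat regardless of file size, which matters for spectra files in the hundreds of megabytes.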

CPTAC TCGA Data Lessons
Monolithic computation no longer sufficient!
  Many datafiles, distributed computation, out-of-core
PSMs are the new RAW data? (~ NGS reads)
  Many PSMs / gene; # Spectra >> # Sequences!
"Poor" acquisitions are not uncommon
  Need fast, easy QC to permit re-analysis
Other issues:
  Is identifiability information leaking (germline mutations)?
  Protein inference for human/mouse xenograft spectra?
  How to really handle isoforms?
  Proteome coverage – how to estimate?
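
One way to make the "fast, easy QC" point concrete is to flag spectra files whose identification rate (PSMs per MS/MS spectrum) is unusually low compared with the rest of the study. The sketch below uses a median and median-absolute-deviation cutoff. The rule, the factor `k`, the file names, and the counts are all hypothetical choices for illustration, not the consortium's QC method.

```python
from statistics import median


def flag_poor_acquisitions(per_file, k=3.0):
    """Return names of files with an unusually low identification rate.

    `per_file` maps a file name to (n_msms_spectra, n_confident_psms).
    A file is flagged if its rate is below median - k * MAD, or if it
    contains no MS/MS spectra at all.
    """
    empty = [name for name, (msms, _psms) in per_file.items() if msms == 0]
    rates = {
        name: psms / msms
        for name, (msms, psms) in per_file.items()
        if msms > 0
    }
    if not rates:
        return sorted(empty)
    center = median(rates.values())
    mad = median(abs(rate - center) for rate in rates.values())
    cutoff = center - k * mad
    low = [name for name, rate in rates.items() if rate < cutoff]
    return sorted(empty + low)


# Toy, made-up per-file counts.
toy_counts = {
    "f01.mzML": (10000, 4000),
    "f02.mzML": (10000, 3800),
    "f03.mzML": (10000, 4200),
    "f04.mzML": (10000, 3900),
    "f05.mzML": (10000, 4100),
    "f06.mzML": (10000, 1000),  # simulated poor acquisition
}

flagged = flag_poor_acquisitions(toy_counts)
print(flagged)
```

Because it needs only two counts per file, a check like this can run as soon as search results arrive, leaving time to re-acquire or re-analyze the flagged files.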

Heresy: PSMs as NGS reads
Need O(n) spectra → good PSMs
  We work too hard to identify all spectra, too stringent?
  Progressive, Pareto, PTAS identification?
  Output as genome alignments, BAM files?
Volume dominates noise and loss of detail:
  e.g. Twitter; indirect observation of splicing, PTMs?
Models of distributed computation
  Distributed data and/or computation
  Failure, interruption tolerant computing
  Heterogeneous computing resources
  PSM search engine API for mining (social, reward?)