Algorithms and Computation: Bottom-Up Data Analysis Workflows
Nathan Edwards, Georgetown University Medical Center
Outline
- Changing landscape
  - Experimental landscape: spectra, sensitivity, resolution, samples
  - Computational landscape: data-size, cloud, resources, reliability
- Data-size and false positive identifications: controlling for false proteins/genes
- Improving peptide identification sensitivity: machine learning, multiple search engines
- Filtered PSMs as a primary data-type
Changing Experimental Landscape
- Instruments are faster… more spectra, better precursor sampling
- Sensitivity improvements… more fractionation (automation), deeper precursor sampling, ion optics
- Resolution continues to get better… accurate precursors (and fragments) make a big difference
- More analytical samples per study… fractionation, chromatography, automation improvements
Clinical Proteomic Tumor Analysis Consortium (NCI)
- Comprehensive study of genomically characterized (TCGA) cancer biospecimens by mass-spectrometry-based proteomics workflows
- ~100 clinical tumor samples per study: colorectal, breast, ovarian cancer
- CPTAC Data Portal provides raw & mzML spectra; TSV and mzIdentML PSMs; protein reports; experimental metadata
CPTAC Data Portal
…from Edwards et al., Journal of Proteome Research, 2015
CPTAC/TCGA Colorectal Cancer (Proteome)
- Vanderbilt PCC (Liebler): 95 TCGA samples, 15 fractions/sample
- Label-free spectral count / precursor XIC quantitation
- Orbitrap Velos; high-accuracy precursors
- 1425 spectra files: ~600 GB raw, ~129 GB mzML.gz
- Spectra: ~18M total, ~13M MS/MS
- ~4.6M PSMs at 1% MSGF+ q-value
Changing Computational Landscape
- A single computer operating on a single spectral data-file is no longer feasible; MS/MS search is the computational bottleneck
- Private computing clusters are quickly obsolete: need $$ to upgrade every 3-4 years, plus personnel costs for cluster administration and management
- Cloud computing gets faster and cheaper over time… but requires rethinking the computing model
PepArML Meta-Search Engine
- Simple, unified peptide identification search parameterization and execution: Mascot, MSGF+, X!Tandem, K-Score, S-Score, OMSSA, MyriMatch
- Cluster, grid, and cloud scheduler: reliable batch spectra conversion and upload; automated distribution of spectra and sequence; job-failure tolerant with result-file validation
- Machine-learning-based result combining: model-free, heterogeneous features; adapts to the characteristics of each dataset
PepArML Meta-Search Engine
- Heterogeneous compute resources: Edwards Lab scheduler & 48+ CPUs, Georgetown & Maryland HPC, Amazon Web Services
- Secure communication; single, simple search request
- Under the hood, the user interacts with a scheduler, uploading spectra and specifying a meta-search. The compute clients, local or remote, contact the scheduler to get spectra and jobs to compute.
Run all of the search engines!
Search Engine Running Time
Which (combination of) search engine(s) should I use?
Fault-Tolerant Computing
- Spot instances can be preempted by those willing to pay more
- Spot prices are cheaper (7¢/hour vs 46¢/hour on-demand)
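Running on preemptible spot instances only works if every work unit can survive being killed mid-job. A minimal sketch of the retry-until-validated pattern the slide implies — the function names and the use of an exception to model preemption are illustrative assumptions, not PepArML's actual scheduler API:

```python
def run_with_retries(job, max_attempts=5):
    """Job-failure-tolerant execution: each work unit is re-run until it
    returns a validated result, so a spot preemption costs only the lost
    attempt, not the whole analysis.  `job` is any callable that may
    raise (here, RuntimeError stands in for a preemption) or return
    None (standing in for a result file that fails validation)."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = job()
            if result is not None:   # result-file validation stand-in
                return result, attempt
        except RuntimeError:         # preempted: try again on a new node
            continue
    raise RuntimeError("job failed after %d attempts" % max_attempts)
```

The key design point is that work units must be idempotent: re-running a search on the same spectra batch must be safe, which is why batches are distributed with their own sequence databases and validated on return.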
Identifications per $$
- How long will a specific job take? How much memory / data-transfer is needed?
- What is a good decomposition size? What cloud instance to use?
- Wall-clock time can be significantly reduced… but management overhead costs too, and the cost of total compute may even increase
- Failed analyses cost too!
Data-Scale and False Positives
- Big datasets have more false positive proteins and genes!
- CPTAC Colorectal Cancer (CDAP): 4.6M MSGF+ 1% FDR PSMs + 2 peptides/gene
- ~10,000 genes identified… but ~40% gene FDR
Simple Decoy Protein Model
- Decoy peptides hit decoy proteins uniformly; each decoy peptide represents an independent trial
- Binomial distribution, with number of trials = number of decoy peptides and hit probability = 1 / (protein database size)
- Big datasets have more decoy peptides!
Example
- Large: 10,000 proteins, 100,000 peptides
- Small: 1,000 proteins, 10,000 peptides
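The binomial model above can be sketched in a few lines. This assumes decoy peptides land on decoy proteins uniformly and independently, as the model states; the 1% decoy-peptide rates used in the example calls are an illustrative assumption (roughly what a 1% PSM FDR would leave), not figures from the talk:

```python
def expected_decoy_proteins(n_proteins, n_decoy_peptides):
    """Expected number of decoy proteins hit by at least one decoy
    peptide under the simple binomial model: each decoy peptide is an
    independent trial hitting a given protein with prob. 1/n_proteins."""
    p_missed = (1.0 - 1.0 / n_proteins) ** n_decoy_peptides
    return n_proteins * (1.0 - p_missed)

# Hypothetical 1% of peptides surviving the PSM filter as decoys:
large = expected_decoy_proteins(10_000, 1_000)   # ≈ 952 decoy proteins
small = expected_decoy_proteins(1_000, 100)      # ≈ 95 decoy proteins
```

Even at the same decoy-peptide *rate*, the large dataset accumulates an order of magnitude more decoy proteins in absolute terms, which is why protein- and gene-level FDR degrades as datasets grow.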
Data-Size and False Positives
- CPTAC Colorectal Cancer: 1% FDR PSMs, but ~25% peptide FDR
- ~25,000 decoy peptides hit ~20,000 genes
- Control of gene FDR requires even more stringent filtering of PSMs
- If we require strong evidence in all 95 samples: no decoy genes, but fewer than 1,000 genes identified
- Bad scenario: PDHA1 and PDHA2 in CPTAC Breast Cancer — shared and unique peptides; PDHA2 is testis-specific!
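Gene-level FDR can be estimated by carrying decoys all the way through the gene-inference step and counting decoy genes that survive the same filter as targets. A sketch of that bookkeeping, assuming a simple 2-peptides-per-gene rule — this is illustrative, not the CDAP pipeline's exact logic:

```python
from collections import defaultdict

def gene_level_fdr(psms, min_peptides=2):
    """Estimate gene-level FDR by decoy counting.

    `psms` is an iterable of (gene, peptide, is_decoy) tuples from an
    already PSM-level-filtered search.  A gene is 'identified' when it
    has at least `min_peptides` distinct peptides; gene FDR is then
    estimated as (decoy genes passing) / (target genes passing)."""
    peptides = defaultdict(set)
    is_decoy = {}
    for gene, peptide, decoy in psms:
        peptides[gene].add(peptide)
        is_decoy[gene] = decoy
    passing = [g for g, peps in peptides.items() if len(peps) >= min_peptides]
    n_decoy = sum(1 for g in passing if is_decoy[g])
    n_target = len(passing) - n_decoy
    return n_decoy / max(n_target, 1), n_target, n_decoy
```

The slide's point falls out of this arithmetic: 1% of 4.6M PSMs is tens of thousands of decoy PSMs, and once those concentrate onto ~20,000 genes, even a 2-peptide rule leaves a large decoy-gene count relative to the ~10,000 target genes.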
Improved Sensitivity
- Machine-learning models: use additional metrics for good identifications
- Combining multiple search engines: agreement indicates good identifications
- Both approaches are successful at boosting identifications, particularly when adaptable to each dataset
- Watch for the use of decoys in training the model
- Both have scaling issues and lack transparency… and may add noise to comparisons
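The intuition behind combining search engines can be shown with simple agreement voting — a deliberately crude stand-in for PepArML's machine-learning combiner, included only to make concrete why cross-engine agreement signals a good identification:

```python
from collections import Counter

def consensus_psms(engine_results, min_agree=2):
    """Model-free combining by engine agreement: keep (spectrum, peptide)
    pairs reported by at least `min_agree` engines.

    `engine_results` maps an engine name to its list of (spectrum,
    peptide) identifications.  Random (incorrect) matches rarely agree
    across engines with different scoring models, so agreement enriches
    for correct identifications."""
    votes = Counter()
    for engine, psms in engine_results.items():
        for pair in set(psms):        # one vote per engine per pair
            votes[pair] += 1
    return {pair for pair, n in votes.items() if n >= min_agree}
```

A real combiner (as in PepArML) replaces the hard vote with a classifier over heterogeneous per-engine features, retrained per dataset — which is also where the slide's warnings apply: decoys used in training can no longer give an unbiased FDR estimate, and the learned model is harder to inspect.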
PepArML Performance
[Figure: Standard Protein Mix Database — 18 standard proteins, Mix 1; LCQ, QSTAR, and LTQ-FT instruments]
Search Engine Info. Gain
Precursor & Digest Info. Gain
Filtered PSMs as Primary Data
- For large enough spectral datasets, we might choose best-effort peptide identification
- Filtered PSMs become primary data; spectral counts become more quantitative
- Need a linear-time spectra → PSM algorithm; we work less hard to identify all spectra?
- Output as genome alignments, BAM files?
- How should PSMs be represented to maximize their utility? What about decoy peptide identifications?
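Treating filtered PSMs as the primary data-type makes downstream quantitation a simple aggregation. A minimal sketch of deriving spectral counts from a filtered PSM table — the `protein` and `qvalue` field names are hypothetical stand-ins, not the actual CDAP TSV column names:

```python
from collections import Counter

def spectral_counts(psm_rows, qvalue_cutoff=0.01):
    """Turn a filtered PSM table into per-protein spectral counts.

    `psm_rows` is an iterable of dicts, one per PSM, with (assumed)
    'protein' and 'qvalue' keys.  Each PSM passing the q-value cutoff
    adds one count to its protein; the resulting counts serve as a
    label-free quantitative measure."""
    counts = Counter()
    for row in psm_rows:
        if float(row["qvalue"]) <= qvalue_cutoff:
            counts[row["protein"]] += 1
    return counts
```

This single pass over PSMs is linear in the number of spectra, which is the property the slide asks of the whole identification pipeline once filtered PSMs, rather than raw spectra, are the unit of exchange.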
[Figure: Nascent polypeptide-associated complex subunit alpha]
[Figure: Pyruvate kinase isozymes M1/M2 — 2.5 × 10⁻⁵]
Questions?