False-Discovery-Rate Aware Protein Inference by Generalized Protein Parsimony Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology.

Presentation transcript:

False-Discovery-Rate Aware Protein Inference by Generalized Protein Parsimony Nathan Edwards Department of Biochemistry and Molecular & Cellular Biology Georgetown University Medical Center

Peptide-Spectrum Matches
Sigma49: 32,691 LTQ MS/MS spectra of 49 human protein standards; IPI Human.
Yeast: 162,420 LTQ MS/MS spectra from a yeast cell lysate; SGD.
X!Tandem E-value (no refinement), 1% FDR.
Spectra used in: Zhang, B.; Chambers, M. C.; Tabb, D. L

Traditional Protein Parsimony
Select the smallest set of proteins that explains all identified peptides.
A sensible principle, which implies:
  Eliminate equivalent/subset proteins.
Equivalent proteins are problematic: which one to choose?
Unique-protein peptides force the inclusion of proteins into the solution.
  True for most tools, even probability-based ones.
  Bad consequences for FDR-filtered identifications.
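
Traditional parsimony as stated above is a minimum set cover over the peptide-protein relation. The following is a minimal sketch, not the talk's implementation: it searches exhaustively, so it only suits small components, and the function name and toy accessions (P1, P2, P3, pepA...) are hypothetical.

```python
from itertools import combinations

def minimum_protein_cover(protein_peptides):
    """Return a smallest set of proteins that explains every identified peptide.

    protein_peptides: dict mapping protein accession -> set of identified peptides.
    Exhaustive search by increasing subset size; illustrative only.
    """
    all_peptides = set().union(*protein_peptides.values())
    proteins = sorted(protein_peptides)
    for size in range(1, len(proteins) + 1):
        for subset in combinations(proteins, size):
            covered = set().union(*(protein_peptides[p] for p in subset))
            if covered == all_peptides:
                return set(subset)
    return set()

# Toy data (made up for illustration).
example = {
    "P1": {"pepA", "pepB", "pepC"},
    "P2": {"pepB", "pepC"},   # contained in P1, never needed
    "P3": {"pepC", "pepD"},   # pepD is unique to P3, so P3 is forced in
}
cover = minimum_protein_cover(example)
print(sorted(cover))
```

Note how the single unique peptide pepD is enough to force P3 into the solution, which is the behavior the later slides argue against.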

Many proteins are easy
Eliminate equivalent / dominated proteins:
  Sigma49: 277 → 60 proteins
  Yeast: 1226 → 1085 proteins
Many components have a single protein:
  Sigma49: 52 (3 multi-protein)
  Yeast: 994 (43 multi-protein)
"Unique" peptides force protein inclusion:
  Sigma49: 16 single-peptide proteins
  Yeast: 476 single-peptide proteins
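
The reduction step described above can be sketched as follows, assuming "equivalent" means identical peptide sets, "dominated" means a strict subset of another protein's peptides, and a "component" is a group of proteins connected through shared peptides. All names and the toy data are hypothetical.

```python
from collections import defaultdict

def reduce_proteins(protein_peptides):
    """Collapse equivalent proteins and drop dominated (strict-subset) ones.

    Returns a dict: representative accession -> (peptide set, equivalent accessions).
    """
    groups = defaultdict(list)
    for protein, peptides in protein_peptides.items():
        groups[frozenset(peptides)].append(protein)
    peptide_sets = list(groups)
    reduced = {}
    for peptides in peptide_sets:
        # Dominated: a strict subset of some other group's peptide set.
        if any(peptides < other for other in peptide_sets):
            continue
        members = sorted(groups[peptides])
        reduced[members[0]] = (set(peptides), members)
    return reduced

def components(protein_peptides):
    """Group proteins into connected components linked by shared peptides."""
    peptide_to_proteins = defaultdict(set)
    for protein, peptides in protein_peptides.items():
        for peptide in peptides:
            peptide_to_proteins[peptide].add(protein)
    seen, result = set(), []
    for start in sorted(protein_peptides):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            protein = stack.pop()
            component.add(protein)
            for peptide in protein_peptides[protein]:
                for neighbor in peptide_to_proteins[peptide]:
                    if neighbor not in seen:
                        seen.add(neighbor)
                        stack.append(neighbor)
        result.append(component)
    return result

# Toy data (made up for illustration).
toy = {
    "P1": {"pepA", "pepB", "pepC"},
    "P1b": {"pepA", "pepB", "pepC"},  # equivalent to P1
    "P2": {"pepB", "pepC"},           # dominated by P1
    "P3": {"pepC", "pepD"},
    "P4": {"pepE"},                   # single-protein component
}
reduced = reduce_proteins(toy)
parts = components(toy)
```

After this reduction, a component holding a single protein needs no further decision; only the multi-protein components require a parsimony choice.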

Must eliminate redundancy
Contained proteins should not be selected.
[Figure: 37 distinct peptides]

Must eliminate redundancy
Contained proteins should not be selected.
  Even if they have some probability mass.
The number of sibling peptides matters less if they are shared.
[Figure: single amino-acid difference]

Must ignore some PSMs
A single additional peptide should not force a protein into the solution.
[Figure: single amino-acid difference]

Example from Yeast
"Inosine monophosphate dehydrogenase": 4-gene family.
Contained proteins should not be selected.
Single-peptide evidence for YML056C.

Must ignore some PSMs
Improving peptide identification sensitivity makes things worse!
False PSMs don't cluster.
[Figure labels: 10%, 2x, Proteins, PSMs]

Must ignore some PSMs
Improving peptide identification sensitivity makes things worse!
False PSMs don't cluster.
[Figure labels: Select Proteins to Explain True PSM%, PSMs, 90%]

Must ignore some PSMs
How do we choose?
  Maximize # peptides?
  Minimize FDR (naïve model)?
  Maximize # PSMs?

Generalized Protein Parsimony
Weight peptides by number of PSMs.
Constrain unique peptides per protein.
Maximize explained peptides (PSMs).
Match PSM filtering FDR to % uncovered PSMs.
Readily solved by branch-and-bound.
Permits complex protein/peptide constraints.
Reduces to traditional protein parsimony.
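
One way to read the formulation above is sketched below. This is an interpretation, not the talk's software. It assumes that "unique" means explained by no other selected protein, that each selected protein must have at least min_unique such peptides, that the objective is the total PSM count of explained peptides, and that ties go to the smaller protein set. The function name, parameters, and toy data are hypothetical.

```python
def generalized_parsimony(protein_peptides, psm_counts, min_unique=2):
    """Select proteins maximizing explained PSMs under a unique-peptide constraint.

    protein_peptides: dict accession -> set of peptides.
    psm_counts: dict peptide -> number of PSMs (the peptide weight).
    Depth-first branch-and-bound over include/exclude decisions.
    """
    proteins = sorted(protein_peptides)
    best = {"weight": -1, "proteins": ()}

    def weight(peptides):
        return sum(psm_counts[p] for p in peptides)

    def feasible(chosen):
        # Unique counts only shrink as proteins are added, so an
        # infeasible partial selection can be pruned safely.
        for protein in chosen:
            others = set().union(
                *(protein_peptides[q] for q in chosen if q != protein))
            if len(protein_peptides[protein] - others) < min_unique:
                return False
        return True

    def search(index, chosen, covered):
        if not feasible(chosen):
            return
        # Upper bound: everything the undecided proteins could still explain.
        reachable = covered.union(*(protein_peptides[p] for p in proteins[index:]))
        if weight(reachable) < best["weight"]:
            return
        if index == len(proteins):
            w = weight(covered)
            if (w, -len(chosen)) > (best["weight"], -len(best["proteins"])):
                best["weight"], best["proteins"] = w, tuple(chosen)
            return
        protein = proteins[index]
        search(index + 1, chosen + [protein], covered | protein_peptides[protein])
        search(index + 1, chosen, covered)

    search(0, [], set())
    return set(best["proteins"]), best["weight"]

# Toy data (made up for illustration).
toy_proteins = {"P1": {"A", "B", "C"}, "P2": {"C", "D"}, "P3": {"E"}}
toy_psms = {"A": 5, "B": 3, "C": 4, "D": 1, "E": 1}
strict = generalized_parsimony(toy_proteins, toy_psms, min_unique=2)
loose = generalized_parsimony(toy_proteins, toy_psms, min_unique=1)
```

In this toy case, requiring two unique peptides selects only P1 and leaves 2 of 14 PSMs unexplained, while min_unique=1 explains every PSM and returns the same three proteins traditional parsimony would.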

Match FDR to uncovered PSMs
Traditional Parsimony at 1% FDR: 1085 ( Unique) Proteins

Software
Filter multi-acquisition identifications by:
  FDR, E-value, probability
Rewrite PSMs to reflect parsimony analysis:
  PepXML, CSV, Excel
Component-wise Peptide-Protein matrix:
  Selected, Dominant, Equivalent, Contained
Selected protein accessions:
  …plus equivalents

Conclusions
Many components are clear.
  Doesn't matter what technique is used.
Traditional techniques do not handle the second protein in a component well.
  A single additional peptide should not force a protein into the solution.
Explain only the true PSM %:
  Determine protein criteria first.
  Adjust PSM filter until explained peptides match.
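
The last two steps can be sketched as a simple search over candidate PSM filters. This assumes the goal is to find the PSM-level FDR at which the fraction of PSMs left unexplained by the selected proteins is closest to that FDR. The function name, the callable, and the numbers in the table are made up for illustration and are not results from the talk.

```python
def match_fdr_to_uncovered(candidate_fdrs, uncovered_fraction):
    """Pick the PSM filtering FDR closest to the resulting uncovered PSM fraction.

    candidate_fdrs: iterable of PSM-level FDR thresholds to try.
    uncovered_fraction: callable taking an FDR and returning the fraction of
        filtered PSMs not explained by the proteins selected at that FDR
        (protein criteria are fixed beforehand).
    """
    return min(candidate_fdrs, key=lambda fdr: abs(uncovered_fraction(fdr) - fdr))

# Hypothetical uncovered fractions per FDR threshold (illustrative only).
toy_uncovered = {0.01: 0.002, 0.05: 0.03, 0.10: 0.10, 0.20: 0.25}
chosen_fdr = match_fdr_to_uncovered(sorted(toy_uncovered), toy_uncovered.get)
print(chosen_fdr)
```

In practice the callable would rerun the PSM filter and the protein selection at each threshold, so the protein criteria have to be decided before this search starts.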