Cedar: A Multi-Tiered Protein Identification Scheme for Shotgun Proteomics Terry Farrah (1); Eric Deutsch (1); Gilbert Omenn (2,1); Ruedi Aebersold (3),


Cedar: A Multi-Tiered Protein Identification Scheme for Shotgun Proteomics

Terry Farrah (1); Eric Deutsch (1); Gilbert Omenn (2,1); Ruedi Aebersold (3); Robert Moritz (1)

1 Institute for Systems Biology, Seattle, WA; 2 University of Michigan, Ann Arbor, MI; 3 Swiss Federal Institute of Technology, Zurich, Switzerland

Background

Producing a definitive set of protein identifications from a set of peptide identifications is a continuing problem in interpreting shotgun proteomics data, and to date no standard method has been agreed upon. Further, for some purposes, such as estimating the number of distinct proteins revealed by the data, a highly non-redundant protein identification set is desired, whereas for other purposes, such as comparison with a non-redundant list for another proteome, or selection of peptides for selected reaction monitoring (SRM) experiment design, redundancy is desirable.

Motivation

Peptides identified in the Human Plasma PeptideAtlas (ref. 1) were mapped to IPI+Ensembl, yielding ~16,000 protein identifications, many identical to others or with few or no distinguishing peptides. How should the number of proteins observed be reported?

Goal

Given a list of peptides (peptide sequences stripped of modifications + number of tryptic termini) identified in a shotgun proteomics experiment, create:

1. A conservative list of protein identifications
   - each entry highly distinguishable from the others by the data
   - each entry possibly representing a set of indistinguishable or poorly distinguishable protein sequences
   - the length of the list representing the number of proteins we believe we have observed
   - low FDR
2. A comprehensive list of protein identifications, including any protein sequence from a reference database that includes even one observed peptide.
3. A protein identification for each observed peptide, minimizing the number of distinct identifiers used.

ProteinProphet provides foundation

- Groups protein sequences sharing peptides
- Notes indistinguishable (listed together on a single line) and subsumed (unique peps = 0) protein sequences

Nesvizhskii, et al., A Statistical Model for Identifying Proteins by Tandem Mass Spectrometry, Anal Chem 2003
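The grouping ideas on this slide can be sketched in a few lines of Python. This is only an illustration, not ProteinProphet's implementation: it ignores probabilities entirely, treats a protein group as a connected component of proteins linked by shared peptides, and assumes "subsumed" means that a protein's observed peptides form a proper subset of one other protein's. The function names and the toy accessions are hypothetical.

```python
from collections import defaultdict


def protein_groups(protein_peptides):
    """Group proteins that are linked, directly or indirectly, by shared peptides."""
    pep_to_prots = defaultdict(set)
    for acc, peps in protein_peptides.items():
        for pep in peps:
            pep_to_prots[pep].add(acc)
    seen, groups = set(), []
    for start in sorted(protein_peptides):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first walk over the protein/peptide graph
            acc = stack.pop()
            if acc in group:
                continue
            group.add(acc)
            for pep in protein_peptides[acc]:
                stack.extend(pep_to_prots[pep] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


def classify(protein_peptides):
    """Return (indistinguishable sets, subsumed accessions)."""
    # Proteins with exactly the same observed peptides are indistinguishable.
    by_peptides = defaultdict(list)
    for acc, peps in protein_peptides.items():
        by_peptides[frozenset(peps)].append(acc)
    indistinguishable = [sorted(accs) for accs in by_peptides.values()]
    # Assumption: subsumed = peptide set is a proper subset of another set.
    subsumed = set()
    for peps, accs in by_peptides.items():
        if any(peps < other for other in by_peptides):
            subsumed.update(accs)
    return indistinguishable, subsumed


toy = {
    "A": {"p1", "p2", "p3"},
    "B": {"p1", "p2", "p3"},  # same peptides as A
    "C": {"p1"},              # all peptides also seen in A and B
    "D": {"p4"},              # shares nothing with the others
}
groups = protein_groups(toy)
indistinguishable, subsumed = classify(toy)
```

In the toy data, A and B land on one line as indistinguishable, C is subsumed, and D forms its own group.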

ProteinProphet lacks some desired features

- Does not provide the desired conservative protein list. Including only the top-probability protein in each group is too conservative, because often other proteins in the group have many distinguishing peptides. Including all non-subsumed proteins, however, is overly inclusive.
- Some protein sequences that are not notated "subsumed" actually are subsumed, but differ in whether the N-terminus is tryptic for one or more peptides.
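The second point can be made concrete with a small Python sketch. It assumes each observed peptide is represented as a hypothetical (sequence, ntt) pair, where ntt is the number of tryptic termini, and it flags a protein as ntt-subsumed when it is not subsumed on the full pairs but would be if only the sequences were compared. The function name and data are illustrative, not part of the TPP.

```python
def ntt_subsumed(protein_peptides):
    """protein_peptides: dict accession -> set of (sequence, ntt) tuples."""
    found = set()
    for acc, peps in protein_peptides.items():
        others = [v for k, v in protein_peptides.items() if k != acc]
        if any(peps < other for other in others):
            continue  # already subsumed in the ordinary sense
        seqs = {seq for seq, _ntt in peps}
        # Subsumed only once the tryptic-termini count is ignored.
        if any(seqs < {seq for seq, _ntt in other} for other in others):
            found.add(acc)
    return found


toy = {
    "P1": {("AAAK", 2), ("CCCK", 2), ("DDDR", 2)},
    "P2": {("AAAK", 1), ("CCCK", 2)},  # AAAK has only one tryptic terminus here
    "P3": {("CCCK", 2)},               # plainly subsumed by P1
}
flagged = ntt_subsumed(toy)
```

Here only P2 is flagged: its peptide AAAK differs from P1's copy solely in the number of tryptic termini.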

Cedar Architecture

Trans-Proteomic Pipeline (TPP):

1. Spectra: search, ideally using decoys. Search DB: IPI or NIST.
2. PeptideProphet: assign probabilities.
3. iProphet (ref. 2): refine probabilities.
4. Apply PSM FDR threshold (0.001 – ) to each expt., yielding the peptide list.
5. RefreshParser: map peptides to reference DB. Reference DB: Swiss-Prot + varsplic, IPI, Ensembl.
6. ProteinProphet (combining multiple expts if desired).

Protein identification classification steps:

7. Preliminary protein identification list (exhaustive set), organized into sets of indistinguishable protein sequences (share exact same observed peps) and further separated into protein groups, with some labeled subsumed.
8. Within each set of indistinguishable protein sequences, select one with preferred accession for the distinguishable set. From each cluster of identical sequences, select one with preferred accession for the sequence-unique set.
9. Among non-subsumed distinguishables, find ntt-subsumed.
10. Among remaining non-subsumed, determine one or more canonicals for each protein group using an 80% peptide sharing threshold. Label the others possibly-distinguished.
11. Among canonicals + possibly distinguished, iteratively find the covering set.
12. Protein FDR ~ 1% for canonical set? If no, adjust the PSM FDR threshold.

References:

1. Farrah, et al., A High Confidence Human Plasma Proteome Reference Set with Estimated Concentrations in the PeptideAtlas, MCP
2. Shteynberg, et al., iProphet: Improved Statistical Validation of Peptide Identifications in Shotgun Proteomics, submitted to MCP.
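The slide names an 80% peptide sharing threshold but does not spell out how it is applied, so the following Python sketch fills that in with stated assumptions: non-subsumed proteins in a group are ranked by probability, the top one is canonical, and each later protein becomes canonical only if it shares less than 80% of its own observed peptides with every canonical chosen so far. The function name, the ranking rule, and the definition of the sharing fraction are all hypothetical.

```python
def pick_canonicals(group, threshold=0.8):
    """group: list of (accession, probability, peptide set), non-subsumed only.

    Every protein is assumed to have at least one observed peptide.
    """
    # Assumed ordering: probability, then peptide count, then accession.
    ranked = sorted(group, key=lambda g: (-g[1], -len(g[2]), g[0]))
    canonical, possibly_distinguished = [], []
    for acc, _prob, peps in ranked:
        shares_too_much = any(
            len(peps & c_peps) / len(peps) >= threshold
            for _c_acc, c_peps in canonical
        )
        if shares_too_much:
            possibly_distinguished.append(acc)
        else:
            canonical.append((acc, peps))
    return [acc for acc, _peps in canonical], possibly_distinguished


shared = {f"p{i}" for i in range(1, 10)}          # p1..p9
toy_group = [
    ("A", 1.00, shared | {"p10"}),               # 10 peptides
    ("B", 0.99, shared | {"q1"}),                # shares 9 of its 10 with A
    ("C", 0.95, {"p1", "r1", "r2", "r3"}),       # shares 1 of its 4 with A
]
canonicals, possibly = pick_canonicals(toy_group)
```

With this toy group, A and C come out canonical and B is labeled possibly distinguished.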

Multi-tiered protein identifications with tallies from 2010 Human Plasma PeptideAtlas (ref. 1)

- Canonical set is the conservative list
- Exhaustive set is the comprehensive list
- Covering set provides a protein identification for each peptide
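One common way to build a covering set of this kind is greedy set cover: repeatedly pick the protein that explains the most still-unexplained peptides. The Python sketch below takes that approach as an assumption; the slides only say the covering set is found iteratively, and a greedy cover is not guaranteed to be the smallest possible one. The function name and toy data are hypothetical.

```python
def covering_set(protein_peptides):
    """Greedily choose proteins until every observed peptide is explained."""
    uncovered = set().union(*protein_peptides.values())
    chosen = []
    while uncovered:
        # Ties go to the alphabetically first accession, because max()
        # returns the first maximal item of the sorted candidates.
        best = max(
            sorted(protein_peptides),
            key=lambda acc: len(protein_peptides[acc] & uncovered),
        )
        chosen.append(best)
        uncovered -= protein_peptides[best]
    return chosen


toy = {
    "A": {"p1", "p2", "p3"},
    "B": {"p3", "p4"},
    "C": {"p4"},
    "D": {"p1", "p2"},
}
cover = covering_set(toy)
```

For the toy data, A is chosen first (three peptides), then B for the remaining p4, so two identifiers cover all four peptides.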

[Figure: Illustration of terminology at the peptide level, using a hypothetical ProteinProphet protein group. Figure labels: protein, peptide.]

Application: Comparing protein lists for human plasma and urine

Urine: Exhaustive: 12,621; Canonical: 1243
Plasma: Exhaustive: 19,460; Canonical: 1929

To find the overlap between two atlases, take the intersection of the canonical list for one atlas and the exhaustive list for the other. This comparison is not symmetric.

Urine canonical vs. plasma exhaustive: 808
Plasma canonical vs. urine exhaustive: 871
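The asymmetry is easy to see with plain set intersection. The Python sketch below uses made-up accessions, not the actual atlas contents, and the function name is hypothetical.

```python
def overlap(canonical_a, exhaustive_b):
    """Canonical proteins of atlas A that appear anywhere in atlas B."""
    return set(canonical_a) & set(exhaustive_b)


# Toy data only; the real lists come from the two PeptideAtlas builds.
urine_canonical = {"P1", "P2"}
urine_exhaustive = {"P1", "P2", "P3", "P4"}
plasma_canonical = {"P2", "P3", "P5"}
plasma_exhaustive = {"P2", "P3", "P5", "P6"}

urine_vs_plasma = overlap(urine_canonical, plasma_exhaustive)
plasma_vs_urine = overlap(plasma_canonical, urine_exhaustive)
```

The two directions give different answers ({"P2"} versus {"P2", "P3"}) because a protein can be canonical in one atlas while being only a non-canonical member of the exhaustive list in the other.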

Example: Complement C3 protein group in Human Plasma PeptideAtlas

[Figure: schematic sequence alignment of the Complement C3 group, marking regions of sequence identity and regions of non-identity.]

Complement C3 group:

Sequences shown: P01024 Complement C3; ENSP; IPI; IPI; IPI; A8K2U0 Alpha-2-macroglobulin-like protein 1; O95568 UPF0558 protein C1orf156

Classification labels shown: canonical; possibly distinguished (two sequences); subsumed (three sequences); ntt-subsumed

Annotations:

- Distinguished by its N-terminal peptide, which is not seen in P01024 because it is not tryptic there.
- Distinguished by peptides encompassing this single residue difference.
- Shares one 7-residue peptide with P01024; unrelated.
- Shares one 8-residue peptide with P01024; unrelated.
- These two sequences, which differ from each other in only 3 positions, appear to be splice variants of P

Future work

- Recognize protein sequences that are subsumable by multiple protein sequences, and omit these from the canonical set.
- As a threshold for separating canonical from possibly distinguished, consider using the fraction of matching residues among observed peptides rather than the fraction of matching peptides.
- Symmetric comparison between protein lists
- Full automation of this scheme within the TPP
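The difference between the two candidate thresholds in the second item can be illustrated with a short Python sketch. It assumes a simple definition of the residue-based measure: summed lengths of shared peptides divided by summed lengths of all of the protein's observed peptides, with no correction for overlapping peptides. Both function names and the peptide sequences are hypothetical.

```python
def peptide_fraction(peps, ref):
    """Fraction of a protein's observed peptides also seen in ref."""
    return len(peps & ref) / len(peps)


def residue_fraction(peps, ref):
    """Same comparison, weighted by peptide length (overlaps ignored)."""
    total = sum(len(p) for p in peps)
    shared = sum(len(p) for p in peps & ref)
    return shared / total


candidate = {"ACDEFGHIKLMNPQRSTVWK", "AAK", "CCK"}  # one long unique peptide
reference = {"AAK", "CCK", "GGK"}

by_peptide = peptide_fraction(candidate, reference)
by_residue = residue_fraction(candidate, reference)
```

Here two of three peptides are shared (about 67%), but they account for only 6 of 26 residues (about 23%), so a residue-based threshold would treat the candidate as far more distinct than a peptide-based one.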