
Cedar: A Multi-Tiered Protein Identification Scheme for Shotgun Proteomics
Terry Farrah (1); Eric Deutsch (1); Gilbert Omenn (2,1); Ruedi Aebersold (3); Robert Moritz (1)
(1) Institute for Systems Biology, Seattle, WA; (2) University of Michigan, Ann Arbor, MI; (3) Swiss Federal Institute of Technology, Zurich, Switzerland

Background
Producing a definitive set of protein identifications from a set of peptide identifications is a continuing problem in interpreting shotgun proteomics data, and to date no standard method has been agreed upon. Further, for some purposes, such as estimating the number of distinct proteins revealed by the data, a highly non-redundant protein identification set is desired, whereas for other purposes, such as comparison with a non-redundant list for another proteome or selection of peptides for selected reaction monitoring (SRM) experiment design, redundancy is desirable.

Motivation
Peptides identified in the Human Plasma PeptideAtlas [1] were mapped to IPI + Ensembl, yielding ~16,000 protein identifications, many of them identical to others or having few or no distinguishing peptides. How should the number of proteins observed be reported?

Goal
Given a list of peptides identified in a shotgun proteomics experiment (peptide sequences stripped of modifications, plus number of tryptic termini), create:
1. A conservative list of protein identifications
   - each entry highly distinguishable from the others by the data
   - each entry possibly representing a set of indistinguishable or poorly distinguishable protein sequences
   - the length of the list representing the number of proteins we believe we have observed
   - low FDR
2. A comprehensive list of protein identifications, including any protein sequence from a reference database that includes even one observed peptide.
3. A protein identification for each observed peptide, minimizing the number of distinct identifiers used.

ProteinProphet provides foundation
- Groups protein sequences sharing peptides
- Notes indistinguishable (listed together on a single line) and subsumed (unique peptides = 0) protein sequences
Nesvizhskii, et al., A Statistical Model for Identifying Proteins by Tandem Mass Spectrometry, Anal Chem 2003
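The two notions on this slide can be illustrated with plain set operations. The following is a minimal sketch, not ProteinProphet's actual algorithm: it assumes each protein is represented only by its set of observed peptide sequences, and it treats a protein as subsumed when its peptide set is a proper subset of a single other protein's set. The function names and the toy accessions `P1` to `P4` are hypothetical.

```python
from collections import defaultdict


def group_indistinguishable(protein_peptides):
    """Group accessions whose observed peptide sets are exactly identical."""
    by_peptides = defaultdict(list)
    for accession, peptides in protein_peptides.items():
        by_peptides[frozenset(peptides)].append(accession)
    return {peps: sorted(accs) for peps, accs in by_peptides.items()}


def find_subsumed(groups):
    """Return peptide sets that are proper subsets of another group's set.

    Simplifying assumption: subsumption by one single larger protein only.
    """
    return {a for a in groups for b in groups if a < b}


# Hypothetical toy data: accession -> observed peptide sequences.
protein_peptides = {
    "P1": {"AAK", "CCR", "DDK"},
    "P2": {"AAK", "CCR", "DDK"},  # indistinguishable from P1
    "P3": {"AAK", "CCR"},         # subsumed by P1/P2
    "P4": {"EEK"},
}

groups = group_indistinguishable(protein_peptides)
subsumed = find_subsumed(groups)
```

Here `P1` and `P2` end up in one indistinguishable set, and the peptide set of `P3` is flagged as subsumed because it contributes no peptide that `P1`/`P2` lack.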

ProteinProphet lacks some desired features
- Does not provide the desired conservative protein list. Including only the top-probability protein in each group is too conservative, because other proteins in the group often have many distinguishing peptides. Including all non-subsumed proteins, however, is overly inclusive.
- Some protein sequences that are not annotated "subsumed" actually are subsumed, but differ in whether the N-terminus is tryptic for one or more peptides.
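One way to read the second point is that a protein escapes the "subsumed" label only because the same peptide sequence carries a different number of tryptic termini (ntt) in different proteins. The sketch below assumes each observed peptide is a hypothetical `(sequence, ntt)` pair and flags a protein when it is subsumed after the ntt value is ignored but not before. This is an interpretation of the slide, not the authors' code.

```python
def strip_ntt(peptides):
    """Drop the number-of-tryptic-termini value, keeping bare sequences."""
    return {seq for seq, _ntt in peptides}


def is_subsumed(target, others):
    """True if target is a proper subset of at least one other peptide set."""
    return any(target < other for other in others)


def find_ntt_subsumed(protein_peptides):
    """Accessions that are subsumed only once ntt is ignored."""
    flagged = set()
    for acc, peps in protein_peptides.items():
        others = [p for a, p in protein_peptides.items() if a != acc]
        if is_subsumed(peps, others):
            continue  # already subsumed in the ordinary sense
        if is_subsumed(strip_ntt(peps), [strip_ntt(p) for p in others]):
            flagged.add(acc)
    return flagged


# Hypothetical toy data: accession -> {(peptide sequence, ntt)}.
observed = {
    "A": {("LVEK", 2), ("GGSR", 2), ("TTPK", 2)},
    "B": {("LVEK", 1), ("GGSR", 2)},  # LVEK is only semi-tryptic in B
}

ntt_subsumed = find_ntt_subsumed(observed)
```

In this toy case `B` is not a subset of `A` when ntt is compared, but its bare sequences are, so `B` is flagged.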

Cedar Architecture

Trans-Proteomic Pipeline (TPP)
1. Spectra: search against the search DB (IPI or NIST), ideally using decoys.
2. PeptideProphet: assign probabilities.
3. iProphet [2]: refine probabilities.
4. Apply PSM FDR threshold (0.001 to 0.00001) to each experiment, giving the peptide list.
5. RefreshParser: map peptides to the reference DB (Swiss-Prot + varsplic, IPI, Ensembl).
6. ProteinProphet (combining multiple experiments if desired).
7. Result: preliminary protein identification list (exhaustive set), organized into sets of indistinguishable protein sequences (sharing the exact same observed peptides) and further separated into protein groups, with some labeled subsumed.

Protein identification classification steps
1. Within each set of indistinguishable protein sequences, select one with preferred accession for the distinguishable set. From each cluster of identical sequences, select one with preferred accession for the sequence-unique set.
2. Among non-subsumed distinguishables, find ntt-subsumed.
3. Among remaining non-subsumed, determine one or more canonicals for each protein group using an 80% peptide sharing threshold. Label the others possibly distinguished.
4. Among canonicals + possibly distinguished, iteratively find the covering set.
5. Protein FDR ~1% for canonical set? (yes: done; no: adjust PSM FDR threshold)

References
[1] Farrah, et al., A High Confidence Human Plasma Proteome Reference Set with Estimated Concentrations in the PeptideAtlas, MCP 2011
[2] Shteynberg, et al., iProphet: Improved Statistical Validation of Peptide Identifications in Shotgun Proteomics, submitted to MCP.
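The canonical/possibly-distinguished split and the covering set can be sketched as follows. Several details are assumptions rather than statements from the slide: proteins are ranked by number of observed peptides (the real pipeline may rank by probability), a protein is demoted to possibly distinguished when it shares 80% or more of its own peptides with an already chosen canonical, and the covering set is found with a plain greedy set-cover heuristic. All names and the toy group are hypothetical.

```python
def classify_group(group, threshold=0.8):
    """Split one protein group into canonical and possibly distinguished.

    group: accession -> non-empty set of observed peptides.
    Assumption: rank by peptide count, ties broken by accession.
    """
    ranked = sorted(group, key=lambda acc: (-len(group[acc]), acc))
    canonical, possibly_distinguished = [], []
    for acc in ranked:
        peps = group[acc]
        shares_too_much = any(
            len(peps & group[c]) / len(peps) >= threshold for c in canonical
        )
        if shares_too_much:
            possibly_distinguished.append(acc)
        else:
            canonical.append(acc)
    return canonical, possibly_distinguished


def greedy_covering_set(protein_peptides):
    """Repeatedly pick the protein that explains the most uncovered peptides."""
    uncovered = set().union(*protein_peptides.values())
    chosen = []
    while uncovered:
        best = max(
            sorted(protein_peptides),
            key=lambda acc: len(protein_peptides[acc] & uncovered),
        )
        chosen.append(best)
        uncovered -= protein_peptides[best]
    return chosen


# Hypothetical protein group; single letters stand in for peptide sequences.
group = {
    "P1": {"a", "b", "c", "d", "e", "f"},
    "P2": {"a", "b", "c", "d", "e", "x"},  # shares 5 of 6 peptides with P1
    "P3": {"a", "g", "h"},                 # shares 1 of 3 peptides with P1
}

canonical, possibly_distinguished = classify_group(group)
covering = greedy_covering_set(group)
```

With this toy group, `P1` and `P3` come out canonical and `P2` possibly distinguished, while the covering set still needs all three proteins because `P2` is the only one explaining peptide `x`. Greedy set cover is not guaranteed to be minimal; it is used here only because the slide describes the search as iterative.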

Multi-tiered protein identifications with tallies from 2010 Human Plasma PeptideAtlas [1]
Tallies shown in the figure: 19460 (exhaustive), 9419, 4671, 2175, 1929 (canonical), 246, 2022
- Canonical set is the conservative list
- Exhaustive set is the comprehensive list
- Covering set provides a protein identification for each peptide

Illustration of Terminology at Peptide Level using a hypothetical ProteinProphet protein group (figure; legend: protein, peptide)

Application: Comparing protein lists for human plasma and urine
Urine: Exhaustive: 12,621; Canonical: 1243
Plasma: Exhaustive: 19,460; Canonical: 1929
To find the overlap between two atlases, take the intersection of the canonical list for one atlas and the exhaustive list for the other. This comparison is not symmetric.
- Urine canonical vs. plasma exhaustive: 808
- Plasma canonical vs. urine exhaustive: 871
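The asymmetry of this comparison is easy to see with a small example. The sketch below uses made-up identifiers and a hypothetical helper `atlas_overlap`; it only illustrates the set intersection described above and does not reproduce the real atlas counts.

```python
def atlas_overlap(canonical_a, exhaustive_b):
    """Proteins in atlas A's canonical list that also appear in atlas B's exhaustive list."""
    return set(canonical_a) & set(exhaustive_b)


# Hypothetical identifiers; each canonical list is a subset of its exhaustive list.
urine_canonical = {"A", "B"}
urine_exhaustive = {"A", "B", "B2", "C"}
plasma_canonical = {"A", "C", "D"}
plasma_exhaustive = {"A", "A2", "C", "D"}

urine_vs_plasma = atlas_overlap(urine_canonical, plasma_exhaustive)
plasma_vs_urine = atlas_overlap(plasma_canonical, urine_exhaustive)
```

Here the urine-canonical comparison finds one shared protein and the plasma-canonical comparison finds two, because a protein can be canonical in one atlas while appearing only in the redundant exhaustive list of the other.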

Example: Complement C3 protein group in Human Plasma PeptideAtlas

Complement C3 group (schematic sequence alignment; legend: region of sequence identity, region of non-identity)
Sequences shown: P01024 Complement C3; ENSP00000406291; IPI00739237; IPI00887739; IPI00942927; A8K2U0 Alpha-2-macroglobulin-like protein 1; O95568 UPF0558 protein C1orf156
Classification labels shown: canonical, possibly distinguished (2), subsumed (3), ntt-subsumed
Annotations:
- Distinguished by its N-terminal peptide, which is not seen in P01024 because it is not tryptic there.
- Distinguished by peptides encompassing this single residue difference.
- Shares one 7-residue peptide with P01024; unrelated.
- Shares one 8-residue peptide with P01024; unrelated.
- These two sequences, which differ from each other in only 3 positions, appear to be splice variants of P01024.

Future work
- Recognize protein sequences that are subsumable by multiple protein sequences, and omit these from the canonical set.
- As a threshold for separating canonical from possibly distinguished, consider using the fraction of matching residues among observed peptides rather than the fraction of matching peptides.
- Symmetric comparison between protein lists
- Full automation of this scheme within the TPP
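The difference between the two candidate thresholds in the second item can be shown with a short sketch. It assumes that "fraction of matching residues" means the summed length of shared peptides divided by the summed length of all of a protein's observed peptides, which is one possible reading; the function names and peptides are hypothetical.

```python
def peptide_fraction(peptides, other):
    """Fraction of this protein's observed peptides also seen for the other protein."""
    return len(peptides & other) / len(peptides)


def residue_fraction(peptides, other):
    """Same comparison, but weighting each peptide by its length in residues."""
    total = sum(len(p) for p in peptides)
    shared = sum(len(p) for p in peptides & other)
    return shared / total


# Hypothetical peptides: one short shared peptide, one long distinguishing peptide.
long_peptide = "L" * 15 + "R"        # 16 residues
protein_x = {"AAAK", long_peptide}   # 4 + 16 = 20 residues observed
protein_y = {"AAAK"}

by_peptides = peptide_fraction(protein_x, protein_y)
by_residues = residue_fraction(protein_x, protein_y)
```

By peptide count the two proteins share half of `protein_x`'s evidence, but by residues they share only a fifth, so a residue-based threshold would give more weight to the long distinguishing peptide.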
