MS/MS Libraries of Identified Peptides and Recurring Spectra in Protein Digests Lisa Kilpatrick, Jeri Roth, Paul Rudnick, Xiaoyu Yang, Steve Stein Mass Spectrometry Data Center
Library searching is not new Organize for Reuse
MS Library Searching Hertz, Hites and Biemann, Anal. Chem. (1971). PBM: McLafferty, Hertel, Villwock, Org. Mass Spectrom. (1974). SISCOM: Damen, Henneberg, Weimann, Anal. Chim. Acta (1978). INCOS: Sokolow, Karnofsky, Gustafson, Finnigan Application Report 2 (March 1978). Stein, Scott, J. Amer. Soc. Mass Spectrom. (1994).
‘Dot Product’ (cosine of ‘angle’ between a pair of spectra) Measured = f(m/z abundance) Reference = f(m/z abundance) f(abundance) : Weight as you like Sum over all peaks in common Normalize
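The dot-product score on the slide above can be sketched as follows. This is a minimal illustration, not the scoring function of any particular search program: spectra are assumed to be dictionaries keyed by nominal (integer) m/z, and the weighting exponents `mz_power` and `abundance_power` are hypothetical defaults standing in for "weight as you like".

```python
import math


def weighted_dot_product(measured, reference, mz_power=0.0, abundance_power=0.5):
    """Cosine of the 'angle' between two spectra given as {m/z: abundance} dicts.

    The weight f = (m/z ** mz_power) * (abundance ** abundance_power) is an
    illustrative assumption, not a value taken from the slides.
    """
    def weight(mz, abundance):
        return (mz ** mz_power) * (abundance ** abundance_power)

    w_meas = {mz: weight(mz, a) for mz, a in measured.items()}
    w_ref = {mz: weight(mz, a) for mz, a in reference.items()}

    # Sum over all peaks in common.
    common = w_meas.keys() & w_ref.keys()
    numerator = sum(w_meas[mz] * w_ref[mz] for mz in common)

    # Normalize by the lengths of both weighted vectors.
    norm = math.sqrt(sum(v * v for v in w_meas.values())) * \
        math.sqrt(sum(v * v for v in w_ref.values()))
    return numerator / norm if norm else 0.0
```

Identical spectra score 1.0 and spectra with no peaks in common score 0.0; everything else falls in between.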
Traditional GC/MS Library Search
Variability Depends on S/N ~7,000 Radiodurans Peptides, LCQ (PNNL/NCRR) Medians
Library Searching for Peptides
LIBQUEST (Yates) –Yates et al., Anal. Chem., 1998, 70, 3557
X!Hunter (Beavis) –Craig et al., J. Proteome Res., 2006, 5, 1843
BiblioSpec (MacCoss) –Frewen et al., Anal. Chem., 2006, 78, 5678
Spectral Comparison (Kearney) –Liu et al., Proteome Science, 2007, 5:3
SpectraST (Aebersold) –Lam et al., Proteomics ,
NIST Peptide Ion Fragmentation Library –June 2006 release (US-HUPO – March 2004)
Why Spectrum Libraries? More sensitive Better scoring Faster Annotation Unrestricted precursor ion
Identification by Spectrum Matching is More Sensitive than by Spectrum/Sequence Matching Simple Protein Mix
Spectrum/Spectrum Scores are More Robust than Sequence/Spectrum Scores Sequence score 99% Confidence
0.005/s vs. 6.2/s per query spectrum Matching Spectra is Faster than Matching Sequence
Reference Library Building
Extract identified spectra from sequence search
–Multiple search engines
–Instrument-class specific
Create ‘consensus’ spectra
–Two or more matching spectra, also save best
Assign probability of being correct
–Refine confidence starting from decoy FDR
–Classify peptides – tryptic, missed cleavage, semi, mods
Create searchable spectral library
–Resolve conflicts, add annotation
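The consensus step above might look roughly like this. It is a sketch under stated assumptions: replicate spectra are dictionaries keyed by nominal m/z, a peak is kept when it appears in at least `min_fraction` of the replicates, and abundances are simply averaged. The slides do not specify the actual peak-retention rule or averaging scheme, so `consensus_spectrum` and `min_fraction` are hypothetical.

```python
def consensus_spectrum(replicates, min_fraction=0.5):
    """Build a consensus from two or more matching spectra ({m/z: abundance}).

    Keeps peaks seen in at least min_fraction of the replicates and averages
    their abundances over the replicates that contain them (assumed rule).
    """
    if len(replicates) < 2:
        raise ValueError("a consensus needs two or more matching spectra")

    counts = {}
    totals = {}
    for spectrum in replicates:
        for mz, abundance in spectrum.items():
            counts[mz] = counts.get(mz, 0) + 1
            totals[mz] = totals.get(mz, 0.0) + abundance

    needed = min_fraction * len(replicates)
    return {
        mz: totals[mz] / counts[mz]
        for mz in sorted(counts)
        if counts[mz] >= needed
    }
```

A peak that shows up in only one of three replicates is dropped as likely noise, while peaks shared by most replicates survive with their mean abundance.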
Three Classes of Libraries I. Conventional Target Identification –Peptides (Proteins) II. Identifiable –By unconventional searching III. Not Identifiable –Account for all recurring spectra –QA/QC
I. OMSSA overlap with MS/MS Library Search K 6/ K 6/07 Identified spectra (1% FDR) for 1-D Yeast NCI/CPTAC – Vanderbilt
II. Identify What we Can
Derive Class-specific FDR
Tryptic
–Simple
–Expected missed cleavages
–Unexpected missed cleavages
Semitryptic (cleaved tryptic)
–No missed cleavage
    In source (with parent at same retention)
    In sample
–Missed cleavage
    In source (with parent)
    In sample (obey rules)
Uncommon – reject
Others …
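A class-specific FDR of the kind listed above can be estimated separately for each peptide class from decoy hits. The sketch below uses the common decoys/targets ratio above a score threshold; the slides do not say which estimator was used, so treat both the estimator and the `(peptide_class, score, is_decoy)` tuple layout as assumptions.

```python
from collections import defaultdict


def class_specific_fdr(hits, threshold):
    """Estimate FDR per peptide class as decoys / targets at or above threshold.

    hits: iterable of (peptide_class, score, is_decoy) tuples (assumed layout).
    Classes with no target hits above the threshold are omitted.
    """
    targets = defaultdict(int)
    decoys = defaultdict(int)
    for peptide_class, score, is_decoy in hits:
        if score >= threshold:
            if is_decoy:
                decoys[peptide_class] += 1
            else:
                targets[peptide_class] += 1
    return {c: decoys[c] / targets[c] for c in targets}
```

Computing the estimate per class matters because rarer classes such as semitryptic peptides tend to carry a higher proportion of false hits at the same score, which is why a single global threshold can be too loose for them.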
Atypical Peptide Ions use Sequence Search Method Tryptic only with many mods Less common: Methylation, Phosphorylation, … Artifacts: Na, K, Carbamyl InsPecT/Pevzner (Unidentified, +70) High charge states, >2 missed cleavages Use class specific score thresholds
HSA/Fibrinogen/Transferrin Mix 6124 Consensus Peptide Spectra, IT, Qtof, TofTof Ion Trap Peptide Ions: 1300 HSA, 1100 Fibrinogen, 700 Transferrin
contiguous = tryptic, exploded = semitryptic
III. Library of Recurring, Unidentified Spectra Create consensus spectra –From similar spectra from an experiment Combine from multiple experiments Identify spectra in other experiments –QA/QC: Artifacts, in standards, … –Apply other sequencing methods
Assign all Spectra
Identified Spectrum
–Matches library peptide or unidentified spectrum
–Subset of peaks match library spectrum (impure)
–Similar to a matched spectrum (cluster)
Not a Peptide
–Low S/N: Maximum/Median < 15
–High charge state (many large peaks): Proteins, large fragments, …
–One dominant peak: Stable ion, not peptide
–Singly charged (high/low abund < 1.2): Probable artifact, lower probability of identification
–Narrow m/z range: Peptide?
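The low-S/N rule above (Maximum/Median < 15) is simple enough to sketch directly. Only the threshold of 15 comes from the slide; the function name and the handling of empty spectra or a zero median are assumptions made for illustration.

```python
import statistics


def is_low_signal_to_noise(abundances, min_ratio=15.0):
    """Flag a spectrum whose maximum/median abundance ratio is below min_ratio.

    Empty spectra are treated as low S/N; a zero (or negative) median leaves
    the ratio undefined, so the spectrum is not flagged (both are assumptions).
    """
    if not abundances:
        return True
    median = statistics.median(abundances)
    if median <= 0:
        return False
    return max(abundances) / median < min_ratio
```

A spectrum with abundances `[1, 1, 1, 1, 10]` has a ratio of 10 and would be flagged, while `[1, 1, 1, 1, 100]` would pass this particular check.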
exploded = identified, contiguous = unidentified
Library Pipeline of the Future [Flow diagram. Box labels: Mass spectrometer; Garbage filter; Pep. Lib; Unass. Lib; Sequence Search, De Novo, Theoretical Spec, Similarity, ... Branch labels: assigned, unassigned, No ID]
NCI/NIH - CPTAC: Clinical Proteomic Technology Assessment for Cancer Technology assessment; develop standard protocols and clinical reference sets; and evaluate methods to ensure data reproducibility. Broad Institute of MIT and Harvard, Memorial Sloan-Kettering Cancer Center, Purdue University, University of California, San Francisco, and Vanderbilt University School of Medicine. NCI grants (U24CA , U24CA , U24CA , U24CA , and U24CA ).
Run-to-Run Chromatographic Reproducibility
Lab-to-Lab Chromatography [Figure. Panels: Broad Orbitrap, Vandy Orbitrap, NYU Orbitrap, INCAPS LTQ, NIST LTQ, Vandy LTQ, Purdue LTQ. Peptide label: YICENQDSISSK]
HSA_CAM_SigmaA9511_5H_8MS2_m2_10de_040406_05
Measures of Reproducibility Identified ions –Unique peptides, Ions, Spectrum counts Unidentified components –Classify by type, link to origin Ion cluster analysis –MS1 linked to MS2 Chromatography –Time evolution of ion clusters
Ion Component Analysis
Ion Component Analysis (Yeast)
Components in Replicate Runs [Plot legend: total, sampled, identified; ▲▼ = run 1, 2; ■ = in both]