Building and Using Libraries of Peptide Ion Fragmentation Spectra S.E. Stein, L.E. Kilpatrick, M. Mautner, P. Neta, J. Roth National Institute of Standards.

Presentation transcript:

Building and Using Libraries of Peptide Ion Fragmentation Spectra
S.E. Stein, L.E. Kilpatrick, M. Mautner, P. Neta, J. Roth
National Institute of Standards and Technology, Gaithersburg, MD/Charleston, SC

Overview: MS/MS spectra that serve to identify peptides by sequence/spectrum matching can be of value for more reliably identifying those peptides in later studies. Our work involves the development of methods for refining and reusing information contained in these spectra to create mass spectral reference libraries, not dissimilar from those routinely employed in other fields of mass spectrometry.

Method: Spectra that identify peptides by current sequence/spectrum matching methods are first added to an archive along with associated information. For each identified peptide ion that has been identified more than once, an annotated ‘consensus spectrum’ is derived from the spectra used to make those identifications. Consensus spectra are then subjected to spectrum/sequence consistency tests to assign a measure of identification reliability and remove false positive identifications. This employs sequence/spectrum correlations reported in the literature and found in our work as well as other quantities found to be correlated with identification reliability. Examples are the similarity of original spectra, unassigned abundance in the consensus spectrum and scores of the original sequence/spectrum match. Related consensus spectra are intercompared to find further errors and finally combined to form an annotated, searchable reference library.

Building The Library
1) Extract spectra and analysis information for reliable peptide identifications from spectrum/sequence matches.
2) Create a ‘consensus spectrum’ from all spectra assigned to a single peptide ion.
3) For each consensus spectrum, perform spectrum/sequence consistency check. Also examine reliable, single hit identifications.
4) Assign reliability measures to each spectrum and build library.
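
The first two steps of the list above imply an archive grouped by peptide ion, with ions identified more than once routed to consensus building and single-hit ions held for separate checks. A minimal sketch of that grouping follows. The `Identification` record, its field names, the choice of (sequence, charge) as the peptide-ion key (modifications are ignored here) and the toy peak values are all illustrative assumptions, not the authors' data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Identification:
    """One archived sequence/spectrum match (hypothetical field names)."""
    sequence: str   # assigned peptide sequence
    charge: int     # precursor charge state
    score: float    # score reported by the sequence search program
    peaks: tuple    # ((m/z, abundance), ...) of the source spectrum


def group_by_peptide_ion(archive):
    """Group archived identifications by peptide ion, keyed by (sequence, charge)."""
    groups = defaultdict(list)
    for ident in archive:
        groups[(ident.sequence, ident.charge)].append(ident)
    return dict(groups)


def split_for_processing(groups):
    """Separate ions identified more than once from single-hit ions."""
    multi = {ion: ids for ion, ids in groups.items() if len(ids) > 1}
    single = {ion: ids for ion, ids in groups.items() if len(ids) == 1}
    return multi, single


# Toy archive with invented values, for illustration only.
archive = [
    Identification("PEPTIDEK", 2, 3.1, ((175.1, 40.0), (304.2, 100.0))),
    Identification("PEPTIDEK", 2, 2.8, ((175.1, 35.0), (304.2, 100.0))),
    Identification("SAMPLER", 2, 2.5, ((175.1, 100.0),)),
]
multi, single = split_for_processing(group_by_peptide_ion(archive))
```

Here `multi` would feed consensus building (step 2), while `single` holds the single hit identifications that step 3 says are examined separately.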
Ways of Using the Library
- Identification Confirmation (post-processing)
    - Confirm/reject peptides identified by sequence search programs
- Find peptides (pre-processing)
    - Search all spectra against library
    - Then search unmatched peptides using sequence search and other methods
- Create library of ‘unidentified spectra’ from their consensus spectra
    - Compare to identified peptides, use de novo methods, find biomarkers, …
- Target identification
    - Internal standards, biomarkers, target proteins
- Transmit peptide analysis information
    - Difficult-to-identify, unusual, manually identified, special meaning

Spectrum Variability: Effective peptide identification by matching spectra in a reference library requires that spectra are reproducible. The degree of variability has been measured using sets of spectra identifying the same precursor ion. Variations typical of ion trap spectra are shown below. Most spectra identified by sequence search methods have dot products above 0.7.

Next Steps
- Enhance QA/QC
    - Goal: Distinguish unexpected fragmentations from misidentifications
- Add spectra from reliable singly identified peptides to Library
- Build comprehensive Libraries
- Test and optimize spectrum matching algorithms
- Provide annotated libraries for mass spectrometry data systems

[Flow diagram labels: Spectrum/Sequence Consistency; All Data for an Identified Peptide Ion; Annotated Reference Spectrum; Reference Archive; Reference Library; Spectra; Sequence; Confirmatory Information; Consensus Spectrum; Extract Common Features; Fragmentation Rules Analysis; LC-MS/MS Results]

1) Source Spectra
a) Online repositories
- PeptideAtlas
- the Global Proteome Machine
- Open Proteomics Database
- NCRR Proteomics Resource
b) Collaborators/Contributors
- NIH/LNT (Markey/Geer/Kowalik…)
- Institute for Systems Biology (Nesvizhskii/King/Aebersold/…)
- theGPM (Beavis)
- Blueprint Initiative (Hogue)
c) NIST Measurements
- Single Protein Digests/IT, 3Q

3) Spectrum/Sequence Consistency
Find likelihood that a consensus spectrum originated from assigned sequence.
Each factor serves to refine probability. Factors identified as discriminating (a and b are illustrated below):
a) Match with theoretical spectrum based on individual amino acid fragmentation behavior
b) Fraction of unassigned abundance: abundance for peaks not originating from a known fragmentation path
c) Y/B ion correlations and ratio of Y/B ion abundance sums
d) Unexpected major peaks formally consistent with rules: include tests for reasonableness of neutral losses
e) Amino acid specific rules (Proline, Glycine, Aspartic Acid, …)
f) Predicted/observed fragment ion charge state abundance ratios

4) Overall Identification Confidence [under development]
Influential factors employed:
- Spectrum/Sequence Consistency (described above)
- Original Score (degree of sequence match)
- Peptide sequence re-identifications
    - Number of different spectra, experiments, modifications, charge states
- Number of peptides per protein

Occurrence Distribution: Most peptide identifications are made multiple times.

2) Build Consensus Spectra
Reject noise and spurious signals/Measure and report variability
- Identify matching m/z peaks in input spectra.
- Find and exclude outlier spectra (use dot product of each pair)
    - when multiple sources are available, limit the number of spectra from a single source
- Omit peaks occurring in ½ or fewer of spectra
    - consider only spectra with sufficiently high S/N to have generated the peak of interest
- Compute average abundance, report variance
- Create annotated spectrum
    - include information such as spectrum origin, retention, median dot product using all peaks and only consensus peaks, fraction of abundance not at consensus m/z, …

Sample Application: Mycobacterium smegmatis
[data from the Open Proteomics Database/R. Wang, 27 sets of experiments]
- Created library of 2739 Peptide Ion Consensus Spectra
- 95% of identified spectra matched peptides identified by other spectra

In one series of LC-MS/MS experiments:
1551  Spectra were identified by sequence search engine
1527  Of the above spectra were re-matched by consensus library searching
1067  Spectra not identified by sequence search were identified by library search
 948  Different peptides were originally identified by sequence search engine
 924  Peptides were re-matched
 332  Peptides not identified by sequence search in this series were identified
  24  Peptides not re-matched were rich in false positives
 983  Consensus spectra of spectra unmatched by library search were derived

Please Recycle Your Spectra!
- Your MS/MS Data Files contain valuable, reusable information that may be very helpful to others (and maybe even you) after recycling.
- Please let us know if we may recycle them. Raw data files are fine, with or without identifications.
- Sources of original spectra are cited as part of consensus spectrum annotation (if you wish).
- We also encourage submission to on-line spectral repositories (bioinformatics.icmb.utexas.edu/OPD/)

[Figure: Library Construction Flow Diagram]

Spectrum Matching Algorithms
Algorithms: Measures of spectrum matching have been adapted from algorithms used for electron ionization spectra.
Peaks are weighted by their significance:
- Reduce significance of common impurity ions (e.g., neutral loss from parent ion)
- Adjust Y/B weighting for instrument and sequence
- Reduce weight for uncertain and isotopic peaks
- Use confidence of library spectrum

Speed: Straightforward indexing leads to very fast identification (<< 1 sec) even for very large libraries.

[Figure panels: Spectrum 1, Spectrum 2, Consensus Spectrum]

A ‘consensus spectrum’ is composed of peaks present in the majority of spectra in which the peak could have been generated. Replicate spectra at moderate to low S/N commonly show spurious peaks from neutral losses by impurity ions of uncertain origin as well as seemingly random noise spikes.

Similarity of pairs of spectra originating from the same peptide ion is measured by their ‘dot product’, where spectra are expressed as normalized vectors. Here, S/N is measured as the ratio of maximum to median abundance in a typical spectrum (assumes most peaks are noise). Above S/N of 40, spectra are quite reproducible (0.7 or better median dot product).

Distribution of Number of Identifications Per Identified Peptide for D. radiodurans (spectra from NCRR) and M. smegmatis (Open Proteomics Database)

Pairs of spectra identified as originating from a given peptide ion have been compared to each other. Variations in abundance have been found to depend primarily on the level of signal/noise in the measured spectra.

Organism         Different Ions   Source Spectra
Human                    14,698           84,288
D. radiodurans            6,888          115,972
Yeast                     5,879           43,344
M. smegmatis              2,739           30,718
E. coli                   1,259            5,861
Mixed (GPM)              66,882          587,199

These plots use D. radiodurans spectra from NCRR; false ids were obtained by searching Human sequences. True and false identifications had the same average sequence ‘score’. Correct identifications have lower fractions of unassigned abundance than incorrect identifications. True positive spectra match theoretical spectra better than false positive spectra.
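
The two measurements defined earlier in this section, the ‘dot product’ of spectra expressed as normalized vectors and S/N as maximum to median abundance, can be sketched as follows. This is an unweighted illustration under stated assumptions: peaks are matched by rounding m/z into bins of a hypothetical `bin_width`, and none of the peak weighting described under Spectrum Matching Algorithms is applied. The function names are invented for this example.

```python
import math
from statistics import median


def to_vector(peaks, bin_width=1.0):
    """Bin (m/z, abundance) pairs into a sparse vector keyed by bin index."""
    vector = {}
    for mz, abundance in peaks:
        index = round(mz / bin_width)
        vector[index] = vector.get(index, 0.0) + abundance
    return vector


def dot_product(peaks_a, peaks_b, bin_width=1.0):
    """Dot product of two spectra after scaling each to unit length."""
    a = to_vector(peaks_a, bin_width)
    b = to_vector(peaks_b, bin_width)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum matches nothing
    shared = sum(v * b[k] for k, v in a.items() if k in b)
    return shared / (norm_a * norm_b)


def signal_to_noise(peaks):
    """Maximum to median abundance; assumes most peaks are noise."""
    abundances = [abundance for _, abundance in peaks]
    return max(abundances) / median(abundances)


# Toy spectra with invented values, for illustration only.
replicate_1 = [(175.1, 40.0), (304.2, 100.0), (433.2, 60.0)]
replicate_2 = [(175.1, 35.0), (304.2, 100.0), (433.2, 70.0)]
similarity = dot_product(replicate_1, replicate_2)
```

With this definition identical spectra score 1 and spectra sharing no peaks score 0, so the 0.7 threshold quoted in the text sits between those two extremes.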
These plots illustrate quantitative relations for factors a) and b) that can aid in the separation of true and false positive identifications.

In two well-studied cases, about 5% of identifications were made by only one spectrum. Roughly 10% of identifications were made more than 100 times.

Approximately 1/3 of peptides identified were identified only once (not shown). These will be separately processed, requiring higher scores for acceptance in library.

Consensus Spectra Derived
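
The consensus-building rule from step 2 (match m/z peaks across input spectra, omit peaks occurring in half or fewer of them, then report average abundance and variance) can be sketched as below. This is a simplified illustration, not the authors' procedure: peaks are matched by rounding m/z into bins of a hypothetical `bin_width`, every input spectrum is counted as able to produce every peak (the S/N eligibility test is skipped), outlier spectra are not excluded, and abundance is averaged only over the spectra in which the peak was seen.

```python
from collections import defaultdict
from statistics import mean, pvariance


def consensus_spectrum(spectra, bin_width=1.0):
    """Keep peaks seen in more than half of the replicate spectra.

    Each spectrum is a list of (m/z, abundance) pairs. Returns a list of
    (mean m/z, mean abundance, abundance variance) tuples ordered by m/z bin.
    """
    observations = defaultdict(list)  # bin index -> [(m/z, abundance), ...]
    for peaks in spectra:
        seen = {}
        for mz, abundance in peaks:
            index = round(mz / bin_width)
            # Keep only the most abundant peak per bin within one spectrum.
            if index not in seen or abundance > seen[index][1]:
                seen[index] = (mz, abundance)
        for index, peak in seen.items():
            observations[index].append(peak)

    consensus = []
    for index in sorted(observations):
        hits = observations[index]
        # Omit peaks occurring in half or fewer of the spectra.
        if 2 * len(hits) <= len(spectra):
            continue
        mzs = [mz for mz, _ in hits]
        abundances = [abundance for _, abundance in hits]
        consensus.append((mean(mzs), mean(abundances), pvariance(abundances)))
    return consensus


# Toy replicates with invented values: two peaks recur, two appear only once.
replicates = [
    [(175.1, 40.0), (250.0, 5.0), (304.2, 100.0)],
    [(175.1, 50.0), (304.2, 100.0)],
    [(175.1, 45.0), (304.2, 100.0), (390.7, 8.0)],
]
library_entry = consensus_spectrum(replicates)
```

In this toy input the peaks near m/z 250 and 391 each occur in one of three spectra, so they are dropped as spurious, while the two recurring peaks are kept with their mean abundance and variance.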