Proteomics Informatics – Protein identification II: search engines and protein sequence databases (Week 5)

Presentation transcript:


General Criteria for a Good Protein Identification Algorithm
● The response to random input data should be random.
● Maximum number of correct identifications and minimum number of incorrect identifications for any data set.
● Maximal separation between scores for correct identifications and the distribution of scores for random matching proteins for any data set.
● The statistical significance of the results should be calculated.
● The searches should be fast.

Search Parameters
Parent tolerance: +/- daltons/ppm
Fragment tolerance: +/- daltons/ppm
Complete mods: Cys alkylation
Potential mods (artifacts): Met/Trp oxidation, Gln/Asn deamidation
Potential mods (PTMs): Phosphoryl, sulfonyl, acetyl, methyl, glycosyl, GPI
Cleavage: Trypsin ([KR]|{P})
Scoring method: Scores or statistics
Sequences: FASTA files
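The cleavage rule listed above, Trypsin ([KR]|{P}), means cutting after K or R unless the next residue is P. A minimal Python sketch of an in silico digest under that rule is given here; the function name `digest` and the `missed_cleavages` parameter are illustrative assumptions, not part of any search engine's API.

```python
import re

def digest(sequence, missed_cleavages=0):
    """Hypothetical helper: cleave after K or R unless followed by P."""
    # Positions just after each K/R that is not followed by P.
    sites = [m.end() for m in re.finditer(r"[KR](?!P)", sequence)]
    # Peptide boundaries: sequence start, internal cleavage sites, sequence end.
    bounds = [0] + [s for s in sites if s < len(sequence)] + [len(sequence)]
    peptides = []
    for i in range(len(bounds) - 1):
        # Join up to `missed_cleavages` additional adjacent fragments.
        last = min(i + 2 + missed_cleavages, len(bounds))
        for j in range(i + 1, last):
            peptides.append(sequence[bounds[i]:bounds[j]])
    return peptides

print(digest("AKRPGRCK"))
print(digest("AKRPGRCK", missed_cleavages=1))
```

Allowing more missed cleavages enlarges the list of candidate peptides, which is why the setting affects both search time and the chance of random matches.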

MS Identification – Peptide Mass Fingerprinting
Workflow (diagram): Digestion, then MS, giving all peptide masses. For each protein in the sequence DB: pick protein; compare, score, test significance. Output: identified proteins.
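The "compare" step of peptide mass fingerprinting can be sketched as follows: compute the monoisotopic mass of each theoretical peptide of a candidate protein and count the observed masses that fall within a ppm tolerance. The names `peptide_mass` and `count_matches` are assumptions for illustration, and a raw match count is only a stand-in for a real score; it is not the scoring used by ProFound or any other engine.

```python
# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini

def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER

def count_matches(observed, theoretical, tol_ppm=50.0):
    """Number of observed masses within tol_ppm of any theoretical mass."""
    hits = 0
    for obs in observed:
        if any(abs(obs - t) / t * 1e6 <= tol_ppm for t in theoretical):
            hits += 1
    return hits

theoretical = [peptide_mass(p) for p in ["AK", "RPGR", "CK"]]
print(count_matches([217.1426, 500.0], theoretical))
```

In a full search this count would be computed for every protein in the database and then converted into a significance estimate.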

Response to Random Data (plot; axis label: Normalized Frequency)

ProFound – Search Parameters

ProFound – Protein Identification by Peptide Mapping W. Zhang & B.T. Chait, Analytical Chemistry 72 (2000)

ProFound Results

Peptide Mapping – Mass Accuracy

Peptide Mapping - Database Size
Expectation values for a peptide mapping example searched against three databases:
S. cerevisiae: 4.8e-7
Fungi: 8.4e-6
All Taxa: 2.9e-4

Missed Cleavage Sites
Expectation values for a peptide mapping example with u = 1, 2, and 4:
u = 1: 4.8e-7
u = 2: 1.1e-5
u = 4: 6.8e-4

Peptide Mapping - Partial Modifications
Comparison (plots for DARPP and CFTR): searched with and without possible modifications (phosphorylation of S, T, or Y).
Even if the protein is modified, it is usually better to search a protein sequence database without specifying possible modifications when using peptide mapping data.

Peptide Mapping - Ranking by Direct Calculation of the Significance

Tandem MS – Database Search
Workflow (diagram): Lysis, Fractionation, Digestion, LC-MS, MS/MS, giving all fragment masses. For each protein in the sequence DB, pick protein; for each of its peptides, pick peptide; compare, score, test significance. Repeat for all peptides and for all proteins.

Algorithms

Comparing and Optimizing Algorithms

MS/MS - Parent Mass Error and Enzyme Specificity
Expectation values for an MS/MS example:
Δm=2, Trypsin: 2.5e-5
Δm=100, Trypsin: 2.5e-5
Δm=2, non-specific: 7.9e-5
Δm=100, non-specific: 1.6e-4

Sequest Cross-correlation

X! Tandem - Search Parameters

X! Tandem - Search Parameters

Conventional, single stage searching
Diagram: sequences and spectra go into a generic search engine, which tests all cleavages, modifications, & mutations for all sequences.

Some hard problems in MS/MS analysis in proteomics
● Determining potential modifications - e.g., oxidation, phosphorylation, deamidation - calculation order 2^n - NP complete
● Allowing for unanticipated peptide cleavages - e.g., chymotryptic contamination in trypsin - calculation order ~ 200 × tryptic cleavage - “unfortunate” coefficient
● Detecting point mutations - e.g., sequence homology - calculation order 18^N - NP complete
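The exponential cost of potential modifications can be made concrete with a small sketch: a peptide with n modifiable residues has 2^n modified/unmodified combinations to test. The function name `modification_variants` and the lowercase notation for a modified residue are illustrative assumptions.

```python
from itertools import product

def modification_variants(peptide, modifiable="STY"):
    """Enumerate every on/off combination of a potential modification.

    A modified residue is written in lowercase (hypothetical notation).
    """
    sites = [i for i, aa in enumerate(peptide) if aa in modifiable]
    variants = []
    for states in product((False, True), repeat=len(sites)):
        modified = {i for i, on in zip(sites, states) if on}
        variants.append("".join(
            aa.lower() if i in modified else aa
            for i, aa in enumerate(peptide)
        ))
    return variants

for peptide in ["AK", "ASTYK", "SSTTYYSK"]:
    print(peptide, len(modification_variants(peptide)))
```

Each additional modifiable residue doubles the number of candidate peptides, which motivates restricting later search stages to a reduced set of sequences.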

Multi-stage searching
Diagram: sequences and spectra pass through X! Tandem in successive stages: tryptic cleavage, modifications #1, modifications #2, point mutation.

Search Results

Sequence Annotations

Search Results

Mascot

Identification – Spectrum Library Search
Workflow (diagram): Lysis, Fractionation, Digestion, LC-MS/MS. For each MS/MS spectrum, pick a spectrum from the spectrum library; compare, score, test significance. Repeat for all spectra. Output: identified proteins.

Steps in making an Annotated Spectrum Library (ASL):
1. Find the best 10 spectra for a particular sequence, with the same PTMs and charge.
2. Add the spectra together and normalize the intensity values.
3. Assign a “quality” value: the median expectation value of the 10 spectra used.
4. Record the 20 most intense peaks in the averaged spectrum, its parent ion z, m/z, sequence, protein accessions & quality.
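The ASL steps above can be sketched in Python as follows. Several details are assumptions made for illustration: spectra are represented as dictionaries from integer m/z bins to intensities, "best" is taken to mean the lowest expectation value, and normalization scales the most intense summed peak to 1. The function name `build_library_entry` is hypothetical.

```python
from collections import defaultdict
from statistics import median

def build_library_entry(spectra, n_best=10, n_peaks=20):
    """spectra: non-empty list of (expectation_value, {mz_bin: intensity})
    for one sequence with the same PTMs and charge."""
    # Step 1: keep the best spectra (assumed: lowest expectation values).
    best = sorted(spectra, key=lambda s: s[0])[:n_best]
    # Step 2: add the spectra together, then normalize the intensities.
    summed = defaultdict(float)
    for _, peaks in best:
        for mz_bin, intensity in peaks.items():
            summed[mz_bin] += intensity
    top = max(summed.values())
    normalized = {mz: inten / top for mz, inten in summed.items()}
    # Step 3: quality is the median expectation value of the spectra used.
    quality = median(e for e, _ in best)
    # Step 4: record only the most intense peaks, ordered by m/z bin.
    strongest = sorted(normalized.items(), key=lambda kv: kv[1], reverse=True)
    return {"quality": quality, "peaks": sorted(strongest[:n_peaks])}

entry = build_library_entry(
    [(1e-3, {100: 2.0, 200: 1.0}), (1e-5, {100: 2.0, 300: 4.0})],
    n_best=2, n_peaks=2,
)
print(entry)
```

A real library entry would also carry the parent ion charge, m/z, sequence, and protein accessions listed in step 4; they are omitted here to keep the sketch short.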

Spectrum Library Characteristics – Peptide Length

Spectrum Library Characteristics – Protein Coverage

Identification – Spectrum Library Search
Library spectrum vs. test spectrum (5:25). Results: 4 peaks selected, 1 peak missed.

Identification – Spectrum Library Search
Apply a hypergeometric probability model:
- 25 possible m/z values;
- 5 peaks in the library spectrum; and
- 4 selected by the test spectrum.
How likely is this? (Table: Matches, Probability.)
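The hypergeometric model can be worked through in a few lines of Python. The sketch assumes the test spectrum also contains 5 peaks (read from the "5:25" label), so the question becomes: drawing 5 of 25 possible m/z values at random, how likely is it that exactly 4 coincide with the 5 library peaks? The function names are illustrative.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(exactly k matches): N possible m/z values, K library peaks,
    n test peaks placed at random."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def p_at_least(k, N, K, n):
    """P(k or more matches) under the same model."""
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, min(K, n) + 1))

# Slide example: 25 possible m/z values, 5 library peaks, 5 test peaks, 4 matched.
print(hypergeom_pmf(4, 25, 5, 5))
print(p_at_least(4, 25, 5, 5))

# Larger case: 1000 possible m/z values, 20 peaks in both spectra.
for k in (1, 5, 10):
    print(k, hypergeom_pmf(k, 1000, 20, 20))
```

For the slide example the exact-match probability is 100/53130, about 1.9e-3. With 1000 possible m/z values and 20 peaks per spectrum, the probability of a given number of random matches drops steeply as that number grows.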

Identification – Spectrum Library Search
If you have 1000 possible m/z values and 20 peaks in both the test and the library spectrum? (Table of p for different numbers of matched peaks.)

Identification – Spectrum Library Search
Diagram: an experimental mass spectrum is compared with a library of assigned mass spectra (axis: M/Z) to give the best search result.

X! Hunter

X! Hunter algorithm:
1. Use dot product to find a library spectrum that best matches a test spectrum.
2. Calculate p-value with hypergeometric distribution.
3. Use p-value to calculate expectation value, given the identification parameters.
4. If expectation value is less than the median expectation value of the library spectrum, report the median value.
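The four steps can be sketched as below. This is not the X! Hunter implementation: spectra are assumed to be dictionaries from m/z bin to intensity, the p-value is taken as the probability of at least the observed number of shared peaks, and the expectation value is assumed to be the p-value multiplied by the number of library spectra considered. All function names are hypothetical.

```python
from math import comb, sqrt

def dot_product(a, b):
    """Normalized dot product of two spectra given as {mz_bin: intensity}."""
    shared = set(a) & set(b)
    num = sum(a[m] * b[m] for m in shared)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def match_p_value(test, lib, n_bins):
    """Hypergeometric probability of at least the observed shared peaks."""
    k = len(set(test) & set(lib))
    K, n = len(lib), len(test)
    tail = sum(comb(K, i) * comb(n_bins - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(n_bins, n)

def hunter_search(test, library, n_bins):
    """library: list of {'peaks': {...}, 'quality': median expectation value}."""
    # Step 1: best-matching library spectrum by dot product.
    best = max(library, key=lambda entry: dot_product(test, entry["peaks"]))
    # Step 2: p-value from the hypergeometric distribution.
    p = match_p_value(test, best["peaks"], n_bins)
    # Step 3: expectation value (assumed: p times number of candidates).
    expect = p * len(library)
    # Step 4: never report a value better than the library spectrum's quality.
    return best, max(expect, best["quality"])

library = [
    {"peaks": {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0}, "quality": 1e-9},
    {"peaks": {6: 1.0, 7: 1.0, 8: 1.0, 9: 1.0, 10: 1.0}, "quality": 1e-3},
]
best, expect = hunter_search({1: 2.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0}, library, n_bins=25)
print(best["quality"], expect)
```

Step 4 acts as a floor: a match cannot be reported as more significant than the identifications the library spectrum was built from.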

X! Hunter Result (panels: Query Spectrum, Library Spectrum)

Dynamic Range In Proteomics
The goal is to identify and characterize all components of a proteome.
There is a large discrepancy between the experimental dynamic range and the range of amounts of different proteins in a proteome.
Plot: Number of Proteins vs. Log (Protein Amount), showing the distribution of protein amounts, the experimental dynamic range, and the desired dynamic range.

Steps shown in the workflow diagram: Mass Separation, Detection, Mass Separation, Peptide Separation, Peptide Labeling, Protein Separation, Digestion, Protein Labeling, Sample Extraction, Ionization, Fragmentation

Experimental Designs Simulated

Parameters in Simulation
● Distribution of protein amounts in sample
● Loss of peptides before binding to the column
● Loss of peptides after elution off the column
● Distribution of mass spectrometric response for different peptides present at the same amount
● Total amount of peptides that are loaded on column (limited by column loading capacity)
● # of peptide fractions
● # of proteins in each fraction
● Dynamic range of mass spectrometer
● Detection limit of mass spectrometer

Simulation Results for 1D-LC-MS
Workflow: Complex Mixtures of Proteins, Digestion, RPC, MS Analysis. Panels: Tissue and Body Fluid, each with No Protein Separation and with Protein Separation: 10 fractions.

Success Rate of a Proteomics Experiment
DEFINITION: The success rate of a proteomics experiment is defined as the number of proteins detected divided by the total number of proteins in the proteome.
Plot: Number of Proteins vs. Log (Protein Amount), showing the distribution of protein amounts and the proteins detected.
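The success-rate definition is a simple ratio, illustrated here with a deliberately simplified model in which a protein counts as detected when its amount reaches a single detection limit. That detection rule, the function name `success_rate`, and the example amounts are assumptions; the simulations in these slides use many more parameters.

```python
def success_rate(protein_amounts, detection_limit):
    """Proteins detected divided by total proteins in the proteome.

    Assumed rule: a protein is detected if its amount >= detection_limit.
    """
    detected = sum(1 for amount in protein_amounts if amount >= detection_limit)
    return detected / len(protein_amounts)

# Hypothetical proteome spanning six orders of magnitude (arbitrary units).
amounts = [1e-3, 1e-2, 1e-1, 1.0, 1e1, 1e2, 1e3]
for limit in (1e-3, 1.0, 1e2):
    print(limit, success_rate(amounts, limit))
```

Lowering the detection limit, for example by loading more peptide on the column, raises the success rate in this toy model.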

Relative Dynamic Range of a Proteomics Experiment
DEFINITION: RELATIVE DYNAMIC RANGE, RDR x, where x is e.g. 10%, 50%, or 90%
Plots: Number of Proteins vs. Log (Protein Amount), showing the distribution of protein amounts and the proteins detected; Fraction of Proteins Detected vs. Log (Protein Amount), with RDR 90, RDR 50, and RDR 10 marked.

Number of Proteins in Mixture (plots: RDR 50 and Success Rate for Tissue and Body Fluid)

Amount of Peptides Loaded on the Column (plots: RDR 50 and Success Rate for Tissue and Body Fluid)

Peptide Separation (plots: RDR 50 and Success Rate for Tissue and Body Fluid)

Amount loaded and peptide separation
Order (Tissue panels):
1. Protein separation 2. Amount loaded 3. Peptide separation
1. Protein separation 2. Peptide separation 3. Amount loaded
Ranges:
Protein separation: – 3000 proteins in each fraction
Amount loaded: 0.1 ug – 10 ug
Peptide separation: 100 – 1000 fractions

Repeat Analysis (series of plots for 1 through 8 analyses)

Repeat Analysis: Simulations

Summary
The success rate of proteome analysis is influenced by the following factors (listed in order of importance):
1. Amount of peptides loaded on column or mass spectrometric detection limit
2. The degree of peptide separation or mass spectrometric dynamic range
3. The degree of protein separation
