Protein Identification David Fenyő


1 Protein Identification David Fenyő

2 Protein Identification and Quantitation
[Diagram: Samples → Peptides → Mass Spectrometry. Spectrum axes: intensity (Quantity) and m/z (Identity)]

3 Information Content in a Single Mass Measurement
[Figure: average number of matching peptides versus tryptic peptide mass [Da]; panels: Human and S. cerevisiae]

4 Identification – Peptide Mass Fingerprinting
[Workflow: Sequence DB → Pick Protein → Digestion → All Peptide Masses (repeat for each protein). MS → Compare, Score, Test Significance → Identified Proteins]
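The loop in this workflow can be sketched in a few lines. This is a minimal illustration, not ProFound's scoring: it assumes nominal (integer) residue masses, trypsin cleaving after K or R except before P, and a plain count of matched masses as the score. The function names and the two toy sequences are hypothetical.

```python
import re

# Nominal (integer) residue masses: an assumption made to keep the sketch short.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}
WATER = 18


def tryptic_digest(sequence):
    """Cleave after K or R, except when the next residue is P (Python 3.7+)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Neutral nominal mass: sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(measured, protein, tolerance=1):
    """Number of measured masses explained by some tryptic peptide of the protein."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(protein)]
    return sum(any(abs(m - t) <= tolerance for t in theoretical) for m in measured)


def rank_proteins(measured, database, tolerance=1):
    """Repeat for each protein, then sort by the (very crude) match-count score."""
    scores = {name: count_matches(measured, seq, tolerance)
              for name, seq in database.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical two-protein database and two measured masses.
toy_db = {"P1": "AKRPGKC", "P2": "GGGGK"}
print(rank_proteins([217, 456], toy_db))
```

A real search engine would replace the match count with a statistical score and test its significance, as the later slides describe.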

5 ProFound – Search Parameters

6 ProFound – Protein Identification by Peptide Mapping
W. Zhang & B.T. Chait, Analytical Chemistry 72 (2000)

7 ProFound Results

8 Peptide Mapping – Mass Accuracy

9 Peptide Mapping - Database Size
Peptide mapping example (S. cerevisiae). Expectation values by database searched:
S. cerevisiae: 4.8e-7
Fungi: 8.4e-6
All Taxa: 2.9e-4

10 Peptide Mapping - Database Size

11 Missed Cleavage Sites Expectation Values Peptide mapping example:

12 Peptide Mapping - Partial Modifications
[Figure: DARPP and CFTR searched with and without possible modifications (phosphorylation of S, T, or Y)]
Even if the protein is modified, it is usually better to search a protein sequence database without specifying possible modifications when using peptide mapping data.

13 Peptide Mapping - Ranking by Direct Calculation of the Significance

14 General Criteria for a Good Protein Identification Algorithm
The response to random input data should be random.
Maximum number of correct identifications and minimum number of incorrect identifications for any data set.
Maximal separation between scores for correct identifications and the distribution of scores for random matching proteins for any data set.
The statistical significance of the results should be calculated.
The searches should be fast.

15 Response to Random Data
Normalized Frequency

16 Peptide Fragmentation
[Diagram: Ion Source → Mass Analyzer 1 → Fragmentation → Mass Analyzer 2 → Detector; fragment ions labeled b and y]

17 Identification – Tandem MS

18 Interpretation of Mass Spectra
[Figure: MS/MS spectrum, % Relative Abundance (0 to 100) versus m/z (250 to 1000); residue labels K L E D F G S]

19 Interpretation of Mass Spectra
[Figure: same spectrum with the b-ion series added]
Residue  b ions
K        1166
L        1020
E        907
D        778
         663
         534
         405
F        292
G        145
S        88

20 Interpretation of Mass Spectra
[Figure: same spectrum with both ion series added]
Residue  y ions  b ions
K        147     1166
L        260     1020
E        389     907
D        504     778
         633     663
         762     534
         875     405
F        1022    292
G        1080    145
S                88

21 Interpretation of Mass Spectra
[Figure: ion table of slide 20 with the spectrum peaks labeled at m/z 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080, plus the [M+2H]2+ ion]

22 Interpretation of Mass Spectra
[Figure: same annotated spectrum as slide 21]

23 Interpretation of Mass Spectra
[Figure: same annotated spectrum as slide 21, with a mass difference of 113 marked]

24 Interpretation of Mass Spectra
[Figure: same annotated spectrum as slide 21, with a mass difference of 129 marked]

25 Interpretation of Mass Spectra
[Figure: same annotated spectrum as slide 21]

26 Interpretation of Mass Spectra
[Figure: same annotated spectrum as slide 21]

27 Interpretation of Mass Spectra
[Figure: same annotated spectrum as slide 21]
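The ion series on these slides can be recomputed from a sequence. The sketch below assumes the peptide is SGFLEEDELK, which is one of the sequences consistent with the de novo result SGF(I/L)EEDE(I/L)(K/Q) given later in the deck, and it assumes nominal (integer) masses: 18 for water and 1 for a proton. The function names are hypothetical.

```python
# Nominal residue masses for the residues in the example peptide (assumption).
NOMINAL = {"S": 87, "G": 57, "F": 147, "L": 113, "E": 129, "D": 115, "K": 128}
WATER, PROTON = 18, 1


def b_ions(peptide):
    """Singly charged b ions: N-terminal prefixes plus one proton."""
    ions, total = [], PROTON
    for aa in peptide[:-1]:
        total += NOMINAL[aa]
        ions.append(total)
    return ions


def y_ions(peptide):
    """Singly charged y ions: C-terminal suffixes plus water plus one proton."""
    ions, total = [], WATER + PROTON
    for aa in reversed(peptide[1:]):
        total += NOMINAL[aa]
        ions.append(total)
    return ions


def m_plus_h(peptide):
    """Singly protonated peptide mass."""
    return sum(NOMINAL[aa] for aa in peptide) + WATER + PROTON


peptide = "SGFLEEDELK"  # assumed sequence, see lead-in
print("b:", b_ions(peptide))
print("y:", y_ions(peptide))
print("M+H:", m_plus_h(peptide))
```

Under these assumptions the b series (88 through 1020) and M+H = 1166 agree with the slide labels, as do the y ions from 147 through 1022. The largest y ion comes out at 1079 with nominal masses, one unit below the 1080 label shown on the slide.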

28 De Novo Sequencing
[Diagram: Amino acid masses + Mass Differences read from the spectrum → Sequences consistent with spectrum. Spectrum axes: % Relative Abundance versus m/z; labeled peaks at m/z 260, 292, 389, 405, 504, 534, 633, 663, 762, 778, 875, 907, 1020, 1022, 1080, plus the [M+2H]2+ ion]

29 De Novo Sequencing

30 De Novo Sequencing

31 De Novo Sequencing
Candidate partial sequences (the slide marks some candidates with X):
…GF(I/L)EEDE(I/L)…
…(I/L)EDEE(I/L)FG…
Peptide M+H = 1166; residue mass 87 => S, giving SGF(I/L)EEDE(I/L)…
1166 – 1020 – 18 = 128 => K or Q, giving SGF(I/L)EEDE(I/L)(K/Q)
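The arithmetic on this slide is a walk along the b-ion ladder: each mass difference is looked up in a table of residue masses. The sketch below assumes a complete b series, nominal masses, and the b-ion values shown on the earlier spectrum slides; the function name and lookup table are hypothetical.

```python
# Nominal residue masses; I/L and K/Q cannot be told apart at this resolution.
RESIDUE_BY_MASS = {
    57: "G", 71: "A", 87: "S", 97: "P", 99: "V", 101: "T", 103: "C",
    113: "(I/L)", 114: "N", 115: "D", 128: "(K/Q)", 129: "E", 131: "M",
    137: "H", 147: "F", 156: "R", 163: "Y", 186: "W",
}
WATER, PROTON = 18, 1


def read_b_ladder(b_ions, m_plus_h):
    """Translate a complete b-ion ladder into a sequence with ambiguities."""
    ladder = [PROTON] + sorted(b_ions)  # start from the bare proton
    residues = [RESIDUE_BY_MASS.get(hi - lo, "?")
                for lo, hi in zip(ladder, ladder[1:])]
    # C-terminal residue: M+H minus the largest b ion minus water.
    residues.append(RESIDUE_BY_MASS.get(m_plus_h - ladder[-1] - WATER, "?"))
    return "".join(residues)


b_series = [88, 145, 292, 405, 534, 663, 778, 907, 1020]
print(read_b_ladder(b_series, 1166))  # SGF(I/L)EEDE(I/L)(K/Q)
```

Real spectra rarely give a complete ladder, which is why the challenges listed on the next slide (neutral losses, modifications, background peaks, incomplete information) matter.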

32 De Novo Sequencing
Challenges in de novo sequencing:
Neutral loss (-H2O, -NH3)
Modifications
Background peaks
Incomplete information

33 Tandem MS – Database Search
[Workflow: Lysis → Fractionation → Digestion → LC-MS → MS/MS. Sequence DB → Pick Protein → Pick Peptide → All Fragment Masses (repeat for all peptides, repeat for all proteins). MS/MS → Compare, Score, Test Significance]

34 Algorithms

35 Comparing and Optimizing Algorithms

36 MS/MS - Parent Mass Error and Enzyme Specificity
Expectation values, MS/MS example:
Δm=2, Trypsin: 2.5e-5
Δm=100, Trypsin: 2.5e-5
Δm=2, non-specific: 7.9e-5
Δm=100, non-specific: 1.6e-4

37 Sequest Cross-correlation

38 X! Tandem - Search Parameters

39 X! Tandem - Search Parameters

40 X! Tandem - Search Parameters

41 Conventional, single stage searching
[Diagram: spectra and sequences → Generic search engine]
Test all cleavages, modifications, & mutations for all sequences

42 Some hard problems in MS/MS analysis in proteomics
Allowing for unanticipated peptide cleavages
- e.g., chymotryptic contamination in trypsin
- calculation order ~ 200 × tryptic cleavage
- “unfortunate” coefficient
Determining potential modifications
- e.g., oxidation, phosphorylation, deamidation
- calculation order 2^n
- NP complete
Detecting point mutations
- e.g., sequence homology
- calculation order 18^N
- NP complete
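The exponential growth for potential modifications is easy to see by enumeration: each candidate site is either modified or not. The sketch below assumes phosphorylation on S, T, or Y as the only modification and uses a catalase peptide listed later in the deck; the function name is hypothetical.

```python
from itertools import product


def modification_states(peptide, sites="STY"):
    """Yield every combination of modified positions (0-based indices)."""
    positions = [i for i, aa in enumerate(peptide) if aa in sites]
    for flags in product((False, True), repeat=len(positions)):
        yield tuple(p for p, on in zip(positions, flags) if on)


peptide = "FSTVAGESGSADTVR"  # five S/T residues, so 2**5 states
states = list(modification_states(peptide))
print(len(states))  # 32
```

Every state is a separate theoretical spectrum to score, so allowing several modification types on every peptide in a database multiplies the search cost quickly.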

43 Multi-stage searching
[Diagram: X! Tandem multi-stage search. spectra and sequences → Tryptic cleavage → Modifications #1 → Modifications #2 → Point mutation]

44 Search Results

45 Search Results

46 Sequence Annotations

47 Search Results

48 Search Results

49 Identification – Spectrum Library Search
[Workflow: Lysis → Fractionation → Digestion → LC-MS/MS → MS/MS. Pick Spectrum (repeat for all spectra). Compare, Score, Test Significance → Identified Proteins]

50 Steps in making an Annotated Spectrum Library (ASL):
1. Find the best 10 spectra for a particular sequence, with the same PTMs and charge.
2. Add the spectra together and normalize the intensity values.
3. Assign a “quality” value: the median expectation value of the 10 spectra used.
4. Record the 20 most intense peaks in the averaged spectrum, its parent ion z, m/z, sequence, protein accessions & quality.
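The four steps can be sketched as one function. This is an illustration under stated assumptions, not the actual library-building code: spectra are represented as dictionaries from an integer m/z bin to intensity, "best" is taken to mean lowest expectation value, and normalization divides by the most intense summed peak. All names are hypothetical.

```python
from collections import defaultdict
from statistics import median


def build_library_entry(identified, n_spectra=10, n_peaks=20):
    """identified: list of (expectation_value, {mz_bin: intensity}) pairs
    for one sequence with the same PTMs and charge."""
    # Step 1: keep the best spectra (lowest expectation values).
    best = sorted(identified, key=lambda pair: pair[0])[:n_spectra]

    # Step 2: add the spectra together and normalize the intensities.
    summed = defaultdict(float)
    for _, spectrum in best:
        for mz, intensity in spectrum.items():
            summed[mz] += intensity
    top = max(summed.values())
    normalized = {mz: intensity / top for mz, intensity in summed.items()}

    # Step 3: quality is the median expectation value of the spectra used.
    quality = median(e for e, _ in best)

    # Step 4: record the most intense peaks, reported in m/z order.
    peaks = sorted(normalized.items(), key=lambda kv: kv[1], reverse=True)[:n_peaks]
    return {"quality": quality, "peaks": dict(sorted(peaks))}


entry = build_library_entry([
    (1e-3, {100: 1.0, 200: 3.0}),
    (1e-5, {100: 1.0, 300: 2.0}),
    (1e-4, {200: 1.0}),
])
print(entry)
```

A full entry would also carry the parent ion charge and m/z, the sequence, and the protein accessions, as step 4 describes.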

51 Spectrum Library Characteristics – Peptide Length

52 Spectrum Library Characteristics – Protein Coverage

53 Identification – Spectrum Library Search
Library spectrum (5:25) Test spectrum (5:25) Results: 4 peaks selected, 1 peak missed

54 Identification – Spectrum Library Search
How likely is this? Apply a hypergeometric probability model:
- 25 possible m/z values;
- 5 peaks in the library spectrum; and
- 4 selected by the test spectrum.
Matches  Probability
1        0.45
2        0.15
3        0.016
4
5
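The tabulated probabilities can be recomputed with the standard hypergeometric formula. The reading assumed here is that 4 peaks are drawn from 25 possible m/z values, of which 5 are library peaks, and that each row gives the probability of exactly that many matches; under that reading the first three rows of the table are reproduced. The function name is hypothetical.

```python
from math import comb


def p_exact_matches(k, possible=25, library_peaks=5, selected=4):
    """Hypergeometric probability of exactly k matches by chance."""
    if k < 0 or k > min(library_peaks, selected):
        return 0.0
    return (comb(library_peaks, k)
            * comb(possible - library_peaks, selected - k)
            / comb(possible, selected))


for k in (1, 2, 3):
    print(k, round(p_exact_matches(k), 3))
```

The same function can be called with other values of `possible`, `library_peaks`, and `selected` to explore how quickly chance matches become unlikely as the m/z space grows.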

55 Identification – Spectrum Library Search
What if there are 1000 possible m/z values and 20 peaks in both the test and the library spectrum?
1 matched: p = 0.6
5 matched: p =
10 matched: p =

56 X! Hunter

57 X! Hunter algorithm:
1. Use dot product to find a library spectrum that best matches a test spectrum.
2. Calculate p-value with hypergeometric distribution.
3. Use p-value to calculate expectation value, given the identification parameters.
4. If expectation value is less than the median expectation value of the library spectrum, report the median value.
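The four steps might look roughly like the sketch below. It is not X! Hunter's code. Several details are assumptions: spectra are dictionaries from an integer m/z bin to intensity, the p-value is the hypergeometric probability of at least the observed number of shared peaks, and the expectation value is the p-value times the number of library spectra considered. All names are hypothetical.

```python
from math import comb, sqrt


def dot_product(a, b):
    """Normalized dot product of two binned spectra."""
    num = sum(a[mz] * b[mz] for mz in set(a) & set(b))
    den = sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return num / den if den else 0.0


def p_value(matched, possible, library_peaks, test_peaks):
    """Hypergeometric probability of at least `matched` shared peaks."""
    upper = min(library_peaks, test_peaks)
    hits = sum(comb(library_peaks, k) * comb(possible - library_peaks, test_peaks - k)
               for k in range(matched, upper + 1))
    return hits / comb(possible, test_peaks)


def search(test, library, possible=1000):
    # Step 1: best library spectrum by dot product.
    best = max(library, key=lambda entry: dot_product(test, entry["peaks"]))
    # Step 2: p-value from the number of shared peaks.
    matched = len(set(test) & set(best["peaks"]))
    p = p_value(matched, possible, len(best["peaks"]), len(test))
    # Step 3: expectation value (assumed: p times number of candidates).
    e = p * len(library)
    # Step 4: never report better than the library spectrum's own quality.
    return best["sequence"], max(e, best["quality"])


library = [
    {"sequence": "AAA", "peaks": {100: 1.0, 200: 1.0}, "quality": 1e-3},
    {"sequence": "BBB", "peaks": {300: 1.0, 400: 1.0}, "quality": 1e-6},
]
print(search({100: 1.0, 200: 0.5}, library))
```

Step 4 acts as a floor: a match cannot be reported as more significant than the identifications that built the library spectrum.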

58 X! Hunter Result Query Spectrum Library Spectrum

59 Significance Testing
False protein identification is caused by random matching.
An objective criterion for testing the significance of protein identification results is necessary.
The significance of protein identifications can be tested once the distribution of scores for false results is known.

60 Significance Testing - Expectation Values
The majority of sequences in a collection will give a score due to random matching.

61 Significance Testing - Expectation Values
[Workflow: M/Z data → Database Search → List of Candidates → Distribution of Scores for Random and False Identifications → Extrapolate and Calculate Expectation Values → List of Candidates With Expectation Values]
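The extrapolation step can be illustrated with a small fit. The sketch assumes that the number of random candidates scoring at least s falls off exponentially with s, so that log10 of the survival count is linear in the score; it fits that line by ordinary least squares over all score levels and reads off the expected number of random matches at the top score. Real search engines typically fit only the high-score tail. The function name and the synthetic scores are hypothetical.

```python
from math import log10


def expectation_value(random_scores, top_score):
    """Fit log10(count of scores >= s) against s and extrapolate to top_score."""
    xs = sorted(set(random_scores))
    if len(xs) < 2:
        raise ValueError("need at least two distinct score levels to fit a line")
    ys = [log10(sum(1 for v in random_scores if v >= s)) for s in xs]

    # Ordinary least-squares line through (score, log10 survival count).
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return 10 ** (intercept + slope * top_score)


# Synthetic random scores whose survival count halves with each score unit.
scores = [k for k in range(10) for _ in range(2 ** (9 - k))] + [10]
print(expectation_value(scores, top_score=20))
```

A candidate whose score lies far beyond the random distribution gets an expectation value well below 1, meaning fewer than one random match is expected at that score.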

62 Homework Explore search parameter space for X! Tandem. Pick a subject for a short presentation next Tuesday from these:

63 Protein Sequence Databases

64 RefSeq (http://www.ncbi.nlm.nih.gov/books/NBK21091/)
Distinguishing features of the RefSeq collection include:
non-redundancy
explicitly linked nucleotide and protein sequences
updates to reflect current knowledge of sequence data and biology
data validation and format consistency
ongoing curation by NCBI staff and collaborators, with reviewed records indicated

65 Ensembl (http://www.ensembl.org/)
genome information for sequenced chordate genomes
evidence-based gene sets for all supported species
large-scale whole genome multiple species alignments across vertebrates
variation data resources for 17 species and regulation annotations based on ENCODE and other data sets

66 UniProt (http://www.uniprot.org/)
The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.

67 Species-Centric Consortia
For some organisms, there are consortia that provide high-quality databases: Yeast ( Fly ( Arabidopsis (

68 FASTA (http://en.wikipedia.org/wiki/FASTA_format)
RefSeq:
>gi| |ref|NP_ | zinc finger protein 683 [Homo sapiens]
MKEESAAQLGCCHRPMALGGTGGSLSPSLDFQLFRGDQVFSACRPLPDMVDAHGPSCASWLCPLPLAPGRSALLACLQDL
DLNLCTPQPAPLGTDLQGLQEDALSMKHEPPGLQASSTDDKKFTVKYPQNKDKLGKQPERAGEGAPCPAFSSHNSSSPPP
LQNRKSPSPLAFCPCPPVNSISKELPFLLHAFYPGYPLLLPPPHLFTYGALPSDQCPHLLMLPQDPSYPTMAMPSLLMMV
NELGHPSARWETLLPYPGAFQASGQALPSQARNPGAGAAPTDSPGLERGGMASPAKRVPLSSQTGTAALPYPLKKKNGKI
LYECNICGKSFGQLSNLKVHLRVHSGERPFQCALCQKSFTQLAHLQKHHLVHTGERPHKCSVCHKRFSSSSNLKTHLRLH
SGARPFQCSVCRSRFTQHIHLKLHHRLHAPQPCGLVHTQLPLASLACLAQWHQGALDLMAVASEKHMGYDIDEVKVSSTS
QGKARAVSLSSAGTPLVMGQDQNN
Ensembl:
>ENSMUSP pep:known supercontig:NCBIM37:NT_166407:104574:105272:-1 gene:ENSMUSG transcript:ENSMUST
MFSLMKKRRRKSSSNTLRNIVGCRISHCWKEGNEPVTQWKAIVLGQLPTNPSLYLVKYDGIDSIYGQELYSDDRILNLKVL
PPIVVFPQVRDAHLARALVGRAVQQKFERKDGSEVNWRGVVLAQVPIMKDLFYITYKKDPALYAYQLLDDYKEGNLHMIPD
TPPAEERSGGDSDVLIGNWVQYTRKDGSKKFGKVVYQVLDNPSVFFIKFHGDIHIYVYTMVPKILEVEKS
UniProt:
>sp|Q16695|H31T_HUMAN Histone H3.1t OS=Homo sapiens GN=HIST3H3 PE=1 SV=3
MARTKQTARKSTGGKAPRKQLATKVARKSAPATGGVKKPHRYRPGTVALREIRRYQKSTELLIRKLPFQRLMREIAQDFK
TDLRFQSSAVMALQEACESYLVGLFEDTNLCVIHAKRVTIMPKDIQLARRIRGERA
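Reading records like these takes only a few lines. The sketch below is a minimal FASTA reader that assumes every record starts with a ">" header line followed by one or more sequence lines; the helper that pulls the accession out of a UniProt-style `db|accession|name` header is a hypothetical convenience, and the function names are not from any library.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def uniprot_accession(header):
    """Accession from a 'db|accession|name ...' header, or None if absent."""
    parts = header.split("|")
    return parts[1] if len(parts) >= 3 else None


example = """>sp|Q16695|H31T_HUMAN Histone H3.1t OS=Homo sapiens GN=HIST3H3 PE=1 SV=3
MARTKQTARKSTGGKAPRKQLATKVARKSAPATGGVKKPHRYRPGTVALREIRRYQKSTELLIRKLPFQRLMREIAQDFK
TDLRFQSSAVMALQEACESYLVGLFEDTNLCVIHAKRVTIMPKDIQLARRIRGERA
"""
for header, sequence in parse_fasta(example):
    print(uniprot_accession(header), len(sequence))
```

The header conventions differ between RefSeq, Ensembl, and UniProt, so any field extraction beyond the raw header has to be written per source.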

69 PEFF - PSI Extended Fasta Format
>sp:P06748 \ID=NPM_HUMAN \Pname=(Nucleophosmin) (NPM) (Nucleolar phosphoprotein B23) (Numatrin) (Nucleolar protein NO38) \NcbiTaxId=9606 \ModRes=(125|MOD:00046)(199|MOD:00047) \Length=294
>sp:P00761 \ID=TRYP_PIG \Pname=(Trypsin precursor) (EC ) \NcbiTaxId=9823 \Variant=(20|20|V) \Processed=(1|8|PROPEP)(9|231|CHAIN) \Length=231
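Headers in the form shown on this slide can be split into key/value pairs with a regular expression. The sketch assumes exactly the layout above: an accession after ">", then fields written as backslash, key, "=", value, separated by whitespace. It is not a validator for the PEFF specification, and the function name is hypothetical.

```python
import re

# A field is "\Key=value"; the value runs until the next "\Key=" or end of line.
FIELD = re.compile(r"\\(\w+)=(.*?)(?=\s+\\\w+=|$)")


def parse_peff_header(header):
    """Return (accession, {key: value}) for a slide-style PEFF header line."""
    accession, _, rest = header.strip().lstrip(">").partition(" ")
    return accession, dict(FIELD.findall(rest))


header = (r">sp:P06748 \ID=NPM_HUMAN \Pname=(Nucleophosmin) (NPM) "
          r"\NcbiTaxId=9606 \ModRes=(125|MOD:00046)(199|MOD:00047) \Length=294")
accession, fields = parse_peff_header(header)
print(accession, fields["NcbiTaxId"], fields["ModRes"])
```

Values such as `ModRes` are left as raw strings here; splitting them into (position, term) pairs would be a further step.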

70 Sample-specific protein sequence databases
[Workflow: Peptides → MS, searched against a Protein DB → Identified and quantified peptides and proteins]

71 Sample-specific protein sequence databases
[Workflow: Next-generation sequencing of the genome and transcriptome → Sample-specific Protein DB. Samples → Peptides → MS. Both feed into: Identified and quantified peptides and proteins]

72 Data Repositories

73 ProteomeExchange

74 PRIDE

75 PeptideAtlas

76 Chorus
Key Aspects:
Upload and share raw data with collaborators
Analyze data with available tools and workflows
Create projects and experiments
Select from public files and (re-)analyze/visualize
Download selected files

77 MassIVE
Key Aspects:
Upload files (Spectra and Spectrum libraries, Analysis Results, Sequence Databases, Methods and Protocol)
Perform analysis using available tools
Browse public datasets
Download data

78 The Global Proteome Machine Databases (GPMDB)

79 Comparison with GPMDB Most proteins show very reproducible peptide patterns

80 Comparison with GPMDB Query Spectrum Best match In GPMDB Second

81 GPMDB Data Crowdsourcing
Any lab performs experiments
Raw data sent to public repository (TRANCHE, PRIDE)
Data imported by GPMDB
Data analyzed & accepted/rejected
Accepted information loaded into public collection
General community uses information and inspects data

82 Information for including a data set in GPMDB
MS/MS data (required)
- MS raw data files
- ASCII files: mzXML, mzML, MGF, DTA, etc.
- Analysis files: DAT, MSF, BIOML
Sample Information (supply if possible)
- Species: human, yeast
- Cell/tissue type & subcellular localization
- Reagents: urea, formic acid, etc.
- Quantitation: SILAC, iTRAQ
- Proteolysis agent: trypsin, Lys-C
Project information (suggested)
- Project name
- Contact information

83 How to characterize the evidence in GPMDB for a protein?
High confidence Medium confidence Low confidence No observation

84 Statistical model for 212 observations of TP53
Start End N -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 Skew Kurt 214 248 539 0.15 0.18 0.22 0.17 0.07 0.03 0.01 0.00 -0.01 -2.01 249 267 1010 0.04 0.09 0.13 0.16 0.14 0.06 0.05 -0.08 -1.89 182 196 832 0.20 0.19 -0.12 -1.84 250 4 0.25 0.48 -2.28 1 24 269 0.10 0.12 -0.33 -0.88 65 51 0.08 0.02 0.47 -1.62 66 101 334 0.11 -1.21 273 60 0.45 -1.36 242 10 0.30 0.54 -1.39 239 32 -0.99 111 120 117 0.26 0.29 0.62 251 16 0.24 -0.60 241 14 0.21 0.87 -0.97 159 174 100 0.31 0.99 -1.07 68 0.86 -0.91 235 30 0.23 0.81 -0.82

85 Statistical model for observations of DNAH2

86 Statistical model for observations of GRAP2

87 DNA Repair

88 DNA Repair

89 TP53BP1:p, tumor protein p53 binding protein 1

90 TP53BP1:p, tumor protein p53 binding protein 1

91 Sequence Annotations

92 TP53BP1:p, tumor protein p53 binding protein 1

93 TP53BP1:p, tumor protein p53 binding protein 1

94 Peptide observations, catalase
Peptide Sequence         Observations
FSTVAGESGSADTVR          2633
FNTANDDNVTQVR            2432
AFYVNVLNEEQR             1722
LVNANGEAVYCK             1701
GPLLVQDVVFTDEMAHFDR      1637
LSQEDPDYGIR              1560
LFAYPDTHR                1499
NLSVEDAAR                1400
FYTEDGNWDLVGNNTPIFFIR    1386
ADVLTTGAGNPVGDK          1338

95 Peptide frequency (ω), catalase
Peptide Sequence         ω
FSTVAGESGSADTVR          0.08
FNTANDDNVTQVR            0.07
AFYVNVLNEEQR             0.05
LVNANGEAVYCK
GPLLVQDVVFTDEMAHFDR
LSQEDPDYGIR              0.04
LFAYPDTHR
NLSVEDAAR
FYTEDGNWDLVGNNTPIFFIR
ADVLTTGAGNPVGDK

96 Global frequency of observation (ω), catalase
Peptide sequences

97 Omega (Ω) value for a protein identification
For any set of peptides observed in an experiment and assigned to a particular protein (1 to j):

98 Protein Ω’s for a set of identifications
Protein ID Ω (z=2) Ω (z=3) SERPINB1 0.88 0.82 SNRPD1 0.59 CFL1 0.81 0.87 SNRPE 0.8 PPIA 0.79 0.64 CSTA 0.36 PFN1 0.76 0.61 CAT 0.71 0.78 GLRX 0.66 CALM1 0.62 FABP5 0.57 0.17

99 Retention Time Distribution

100 Mass Accuracy

101 GO Cellular Processes

102 KEGG Pathways

103 Open-Source Resources

104 ProteoWizard

105 Protein Prospector

106 UCSC Genome Browser

107 Slice - Scalable Data Sharing for Remote Mass Informatics
Developed by Manor Askenazi
openslice.fenyolab.org
Most mass spectrometry data is acquired in discovery mode, meaning that the data is amenable to open-ended analysis as our understanding of the target biochemistry increases. In this sense, mass spectrometry based discovery work is more akin to an astronomical survey, where the full list of object-types being imaged has not yet been fully elucidated, as opposed to e.g. micro-array work, where the list of probes spotted onto the slide is finite and well understood.

108 Standardization

109 Standardization - MIAPE

110 Standardization – MIAPE-MSI

111 Standardization – XML Formats
mzML - experimental results obtained by mass spectrometric analysis of biomolecular compounds
mzIdentML - describes the outputs of proteomics search engines
TraML - exchange and transmission of transition lists for selected reaction monitoring (SRM) experiments
mzQuantML - describes the outputs of quantitation software for proteomics
mzTab - defines a tab-delimited text file format to report proteomics and metabolomics results
MIF - describes the molecular interaction data exchange format
GelML - describes the processing and separations of proteins in samples using gel electrophoresis, within a proteomics experiment

