Protein Identification via Database searching


1 Protein Identification via Database searching
Attila Kertész-Farkas Protein Structure and Bioinformatics Group, ICGEB, Trieste

2 Mass Spectra analysis Biological sample Results report


4 Computational analysis of MS/MS
Three approaches: de novo sequencing, database searching, and a hybrid of the two.

5 De novo sequencing

6 De novo sequencing
- Can identify new peptides and proteins
- Able to discover (new) PTMs
- Independent of protein databases
- Requires MS/MS data of good quality
- No statistics-based validation

7 Database searching-based MS/MS tandem mass spectra identification
Pipeline Input data Peptide assignment Validation Protein inference Interpretation Quantitation


9 Database searching-based MS/MS tandem mass spectra identification
Pipeline:
- Input data: data formats
- Peptide identification: database searching
- Validation: statistical methods for validation
- Protein inference: protein assembling
- Quantitation
- Interpretation

10 Input data
Mass spectrum: a histogram of the mass-to-charge ratio (m/z) of the observed fragment ions.
Spectrum normalization: the intensity is usually scaled to the [0, 100] interval.
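The normalization step can be sketched as follows. This is a minimal illustration that assumes a spectrum is stored as a list of (m/z, intensity) pairs and that scaling is relative to the most intense peak; the function name and the peak values are made up for the example.

```python
def normalize_intensities(peaks, top=100.0):
    """Scale intensities so that the most intense peak equals `top`."""
    if not peaks:
        return []
    max_intensity = max(intensity for _, intensity in peaks)
    if max_intensity <= 0:
        # Degenerate spectrum: nothing to scale against.
        return [(mz, 0.0) for mz, _ in peaks]
    return [(mz, top * intensity / max_intensity) for mz, intensity in peaks]


# Hypothetical peaks as (m/z, raw intensity).
spectrum = [(175.119, 1200.0), (262.151, 300.0), (375.235, 2400.0)]
normalized = normalize_intensities(spectrum)
```

After the call, the strongest peak has intensity 100 and the others keep their relative heights.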

11 Input data: the most common formats are mzXML, MGF and DAT.

12 MGF file format
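The slide's example file is not preserved in the transcript. As a rough sketch, an MGF file is plain text in which each spectrum sits between BEGIN IONS and END IONS, with KEY=value parameter lines (such as TITLE, PEPMASS, CHARGE) followed by "m/z intensity" peak lines. The reader below is a simplified illustration, not a complete implementation of the format, and the spectrum values are made up.

```python
def parse_mgf(text):
    """Parse MGF text into a list of {'params': dict, 'peaks': list} records."""
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                key, value = line.split("=", 1)
                current["params"][key.strip().upper()] = value.strip()
            else:
                fields = line.split()
                if len(fields) >= 2:
                    current["peaks"].append((float(fields[0]), float(fields[1])))
    return spectra


# Made-up spectrum for illustration only.
EXAMPLE_MGF = """BEGIN IONS
TITLE=example spectrum (made-up values)
PEPMASS=500.25
CHARGE=2+
175.119 1200.0
262.151 300.0
END IONS
"""
spectra = parse_mgf(EXAMPLE_MGF)
```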

13 .mzXML

14 Peptide assignment: spectra comparison
The experimental spectra are compared with the entries of the protein sequence DB, and each comparison adds an entry to the score list.
Protein sequence DB entry:
>IPI:IPI00000044.1|SWISS-PROT:P01127
MNRCWALFLSLCCYLRLVSAEGDPIPEELYEMLSDHSIRSFDDLQRLLHGDPGEEDKAELDLNMTRSHSGGELESLARGRRSLGSLTIAEPAMIAECKTRTEVFEISRRLIDRTNANFLVWPPCVEVQRCSGCCNNRNVQCRPTQVQLRPVQVRKIEIVRKKPIFKKATVTLEDHLACKCETVAAARPVTRSPGGSQEQRAKTPQTRVTIRTVRVRRPPKGKHRKFKHTHDKTALKETLGA
Scores after the first comparisons: 3. 4 1. 2 2. 2 2. 1 4. 1 5. 1

20 Peptide assignment: spectra comparison (continued)
Protein sequence DB entry:
>IPI:IPI |SWISS-PROT:P
MEICRGLRSHLITLLLFLFHSETICRPSGRKSSKMQAFRIWDVNQKTFYLRNNQLVAGYLQGPNVNLEEKIDVVPIEPHALFLGIHGGKMCLSCVKSGDETRLQLEAVNITDLSENRKQDKRFAFIRSDSGPTTSFESAACPGWFLCTAMEADQPVSLTNMPDEGVMVTKFYFQEDE
Scores: 15. 32 3. 4 14. 3 1. 2 2 7. 2 2. 1 4. 1 9. 1 12. 1

22 Peptide assignment: the top-scoring peptide is assigned to each experimental spectrum
- Scores: 15. 32 3. 4 14. 3 1. 2 2 7. 2 2. 1 4. 1 9. 1 12. 1. Top hit: Score: 32 Peptide: SHLITLLLFLFHSETICR
- Scores: 13. 4 6. 4 1. 4 9. 3 4. 3 3. 2 7. 2 11. 2 8. 1 10. 1 2. 1 5. 1 12. 1. Top hit: Score: 4 Peptide: AELDLNMTR
- Scores: 11. 3 6. 3 9. 3 3. 3 1. 3 4. 2 7. 2 13. 2 1. 1 10. 1. Top hit: Score: 3 Peptide: MEICRGLR
- Scores: 13. 15 6. 4 1. 4 9. 3 4. 3 3. 2 7. 2 11. 2 8. 1 10. 1 2. 1 5. 1 12. 1. Top hit: Score: 15 Peptide: LLHGDPGEEDK

27 Peptide assignment has two components:
1. Spectra comparison
2. Protein sequence DB

30 Shared Peak Count (SPC)
The SPC is the number of peaks in the theoretical spectrum that are matched to peaks in the experimental spectrum.
In the slide's example, SPC = 7.
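A shared peak count can be sketched as below. The matching tolerance (0.5 m/z units) and the peak lists are assumptions made for the illustration; the slides do not state how close two peaks must be to count as matched.

```python
import bisect


def shared_peak_count(theoretical_mz, experimental_mz, tol=0.5):
    """Count theoretical peaks that have an experimental peak within `tol`."""
    experimental = sorted(experimental_mz)
    count = 0
    for mz in theoretical_mz:
        # First experimental peak that is not below the tolerance window.
        i = bisect.bisect_left(experimental, mz - tol)
        if i < len(experimental) and experimental[i] <= mz + tol:
            count += 1
    return count


# Hypothetical m/z values.
theoretical = [100.0, 200.0, 300.0, 400.0]
experimental = [100.2, 250.0, 299.8, 500.0]
spc = shared_peak_count(theoretical, experimental)
```

Here the theoretical peaks at 100 and 300 find a partner, so the count is 2.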

32 Inner product (I)
The inner product is the sum of the intensities of the peaks in the experimental spectrum that match peaks in the theoretical spectrum.
In the slide's example, I = 3.5.
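The inner product described here can be sketched as a sum over matched experimental peaks. The tolerance, the function name and the peak values are assumptions for illustration; the intensities are chosen so that the result reproduces the slide's value of 3.5.

```python
def matched_intensity(theoretical_mz, experimental_peaks, tol=0.5):
    """Sum the intensities of experimental peaks that match a theoretical peak."""
    total = 0.0
    for mz, intensity in experimental_peaks:
        if any(abs(mz - t) <= tol for t in theoretical_mz):
            total += intensity
    return total


# Hypothetical spectrum as (m/z, intensity scaled to [0, 1]).
theoretical = [100.0, 200.0, 300.0, 400.0]
experimental = [(100.2, 1.0), (150.0, 0.75), (199.9, 0.5), (300.1, 1.0), (400.0, 1.0)]
inner = matched_intensity(theoretical, experimental)
```

The unmatched peak at 150.0 contributes nothing, so the sum is 1.0 + 0.5 + 1.0 + 1.0.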

34 Hyperscore: H = I*Nb!*Ny!
- I is the sum of the intensities of the matched peaks
- Nb (resp. Ny) is the number of matched b (resp. y) peaks in the theoretical spectrum
- ! is the factorial function
Example: H = 3.2*3!*4! = 3.2*6*24 = 460.8
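The slide's formula and worked example can be checked with a few lines. This follows the formula exactly as written on the slide; the function and argument names are chosen for the illustration.

```python
from math import factorial


def hyperscore(matched_intensity, n_b, n_y):
    """H = I * Nb! * Ny!, as given on the slide."""
    return matched_intensity * factorial(n_b) * factorial(n_y)


# Values from the slide: I = 3.2, three matched b ions, four matched y ions.
h = hyperscore(3.2, 3, 4)
```

The factorial terms reward spectra in which many b and y ions match, so the score grows much faster than the matched intensity alone.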

36 Xcorr
q is the query spectrum, t is the theoretical spectrum, and t[k] is t shifted by k.
The inner product of q with shifted copies of t is computed: I(q,t[-75]), ..., I(q,t[-32]), ..., I(q,t[0]), ..., I(q,t[32]), and so on.
In the slide's example, I(q,t) = 3.2.
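The Xcorr formula itself is not preserved in the transcript. The sketch below uses a commonly cited formulation, which is an assumption here: the inner product at zero shift minus the mean inner product over shifts from -75 to 75. Implementations differ in details such as whether shift 0 is included in the mean. Spectra are assumed to be binned into equal-length intensity lists.

```python
def inner_product(q, t):
    """Inner product of two binned spectra of equal length."""
    return sum(a * b for a, b in zip(q, t))


def shift(t, k):
    """Return t shifted by k bins; bins pushed past either end are dropped."""
    shifted = [0.0] * len(t)
    for i, value in enumerate(t):
        j = i + k
        if 0 <= j < len(t):
            shifted[j] = value
    return shifted


def xcorr(q, t, max_shift=75):
    """Zero-shift inner product minus the mean over all shifts (assumed form)."""
    background = sum(
        inner_product(q, shift(t, k)) for k in range(-max_shift, max_shift + 1)
    )
    return inner_product(q, t) - background / (2 * max_shift + 1)


# Hypothetical binned spectra.
q = [0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]
t = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
score = xcorr(q, t, max_shift=2)
```

Subtracting the background penalizes spectra that match equally well at any offset, which is typical of noise.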

42 Protein Sequence Databases
- Completeness: complete, but longer searching time
- Redundancy: sequence variations can be found, but a redundant database can distort the statistics
- Quality of sequence annotation

43 Protein sequence databases
- Entrez Protein DB: most complete, redundant
- Reference Sequence (RefSeq) and UniProt (Swiss-Prot and TrEMBL): well annotated, non-redundant
- International Protein Index (IPI): represents a good balance between redundancy and completeness. Contains cross-references to Ensembl, UniProt and RefSeq.
- Sequences from a single genome: it is difficult to obtain good statistics on small datasets.

44 Taxonomy
A taxonomy filter allows searches to be limited to entries from particular species or groups of species. This speeds up a search and ensures that the hit list will only contain entries from the selected species.
For non-redundant databases, a single entry may represent identical sequences from multiple species. The accession string and title text from the FASTA entry, listed on the master results page, will usually describe just one of these entries. To see the equivalent entries, and to explore their taxonomy, follow the accession number link in the results list to the Protein View. If the hit is from a non-redundant database and represents multiple entries with identical sequences, the Protein View will include links to NCBI Entrez and the NCBI Taxonomy Browser for all equivalent entries.

45 Run time
A database search has to enumerate all peptides and compare them to all experimental spectra. This can be slow with large protein sequence databases, especially when a slow scoring function such as Xcorr is applied.

46 Speedup techniques
- Fast database indexing: fast implementation of sequence indexing in the database
- Parent mass check: PTMs can be lost
- Sequest's preliminary score
- Tag-based filtering (de novo hybrid): increases the specificity (or sensitivity)

47 Advanced database indexing
- Better implementation of the sequence indexing
- Better representation of protein sequences

48 Parent mass check
The parent mass check sits between the protein sequence DB and the spectra comparison: only candidate peptides that pass it are scored against the experimental spectra. The slide illustrates this with the protein sequence DB entry SWISS-PROT:P01127.
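A parent mass check can be sketched as a simple mass filter applied before scoring. The sketch assumes unmodified peptides, standard monoisotopic residue masses, and a hypothetical tolerance of 1.0 Da; the observed parent mass is made up. The candidate peptides are taken from the slides.

```python
WATER = 18.01056  # monoisotopic mass of H2O

# Monoisotopic residue masses of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def parent_mass_filter(candidates, parent_mass, tol=1.0):
    """Keep only candidates whose mass is within `tol` of the parent mass."""
    return [p for p in candidates if abs(peptide_mass(p) - parent_mass) <= tol]


candidates = ["AELDLNMTR", "MEICRGLR", "SAEDLEADK", "LLHGDPGEEDK"]
observed_parent_mass = 1061.5  # hypothetical measurement
kept = parent_mass_filter(candidates, observed_parent_mass)
```

Because a modified peptide has a shifted mass, a filter like this would discard it, which is one way to read the earlier remark that PTMs can be lost.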

50 Fast prescoring (used in SEQUEST)
The so-called Sp score: R(q,t) is the maximum number of consecutive matched b-y ions.
Example: Sp = 3.2*7*(1+0.0075*4)/10 = 2.3072
SEQUEST selects the top 500 peptides ranked by Sp and rescores them using Xcorr.
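The general Sp formula is not preserved in the transcript, so the sketch below is reconstructed from the slide's arithmetic only. It assumes that 3.2 is the matched intensity, 7 the shared peak count, 4 the value of R(q,t), 10 the number of theoretical peaks, and 0.0075 the bonus per consecutive match (the constant that reproduces 2.3072; other descriptions of SEQUEST may use a different value). All function names are hypothetical.

```python
import heapq


def sp_score(matched_intensity, matched_peaks, consecutive, n_theoretical,
             bonus=0.0075):
    """Preliminary score in the form implied by the slide's worked example."""
    return (matched_intensity * matched_peaks
            * (1 + bonus * consecutive) / n_theoretical)


def top_candidates(scored_peptides, limit=500):
    """Keep the `limit` best (peptide, Sp) pairs for rescoring with Xcorr."""
    return heapq.nlargest(limit, scored_peptides, key=lambda item: item[1])


sp = sp_score(3.2, 7, 4, 10)

# Hypothetical Sp values for three candidate peptides.
ranked = top_candidates([("AELDLNMTR", 0.4), ("MEICRGLR", 2.3), ("SAEDLEADK", 1.1)],
                        limit=2)
```

The point of the two-stage design is that the cheap score prunes the candidate list so that the slower Xcorr is computed for at most 500 peptides per spectrum.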

51 Sequence tag based filtering
Extract short amino acid tags from the experimental spectra using a spectrum graph: the nodes are the peaks, and two peaks whose masses differ by the mass of an amino acid are linked by an edge.
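The spectrum graph can be sketched as follows, assuming monoisotopic residue masses and a hypothetical tolerance of 0.01 Da. The peak masses are constructed so that consecutive peaks differ by the masses of P, E and T, which yields the tag PET.

```python
# Monoisotopic residue masses of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def spectrum_graph(peak_masses, tol=0.01):
    """Return edges (low, high, amino_acid) between peaks one residue apart."""
    masses = sorted(peak_masses)
    edges = []
    for i, low in enumerate(masses):
        for high in masses[i + 1:]:
            diff = high - low
            for aa, residue_mass in RESIDUE_MASS.items():
                if abs(diff - residue_mass) <= tol:
                    edges.append((low, high, aa))
    return edges


# Hypothetical peaks: 100.0, then +P, +E, +T.
peaks = [100.0, 197.05276, 326.09535, 427.14303]
edges = spectrum_graph(peaks)
tag = "".join(aa for _, _, aa in edges)
```

In real spectra several edges can leave one node (for example L and I have the same mass), so tags are read off as paths through the graph rather than as a single chain.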

52 [Figure: spectrum graph with edges labelled by the amino acids W, R, A, V, L, G, T, E, P, Q, C, K, W, D, T]

53 Tag-based filtering
Generates short peptide sequence tags from the spectrum and uses these tags to filter the protein sequence database. Tags make the database search much faster, analogous to the way that BLAST's filter speeds up sequence search.
[Figure: spectrum graph with the tags AVG, WTD and PET, each with its prefix mass]

54 Tag-based filtering
[Figure: the candidate peptide list below appears twice on the slide]
- MDHPEDESHSEK
- QDDEEALARLEEIK
- SIEAKLTLR
- QNNLNPERPDSAYLR
- LKQINEEQREGLR
- FVSEAVTAICEAK
- SSDIQAAVQICSLLHQR
- EFSASLTQGLLK
- SAEDLEADK

55 Summary
Experimental spectra are compared to a protein sequence database. The main components are:
- Scoring function
- Protein database
- Speedup techniques

56 Validation

57 Peptide assignments to be validated
Each experimental spectrum has been assigned its top-scoring peptide from the protein sequence DB:
- Score: 32 Peptide: SHLITLLLFLFHSETICR
- Score: 4 Peptide: AELDLNMTR
- Score: 3 Peptide: MEICRGLR
- Score: 15 Peptide: LLHGDPGEEDK
- Score: 4 Peptide: MDHPEDESHSEK
- Score: 5 Peptide: SAEDLEADK
- Score: 3 Peptide: SIEAKLTLR

62 How can peptide assignments be approved or rejected automatically? Why is it necessary?

63 Why is it necessary to do it automatically?
- Human judgment is biased and can be unreliable
- Millions of spectra are produced per day
- It is very difficult to judge an assignment by looking at the spectrum visually

64 Two computational approaches
- Relative score
- Probability based scoring

65 Relative score
SEQUEST: delta score

66 Relative score: Cn (delta score)
For each spectrum, Cn is the difference between the top score and the second-best score, divided by the top score:
- Score: 32 Peptide: SHLITLLLFLFHSETICR Cn=(32-4)/32=0.875
- Score: 4 Peptide: AELDLNMTR Cn=(4-4)/4=0
- Score: 3 Peptide: MEICRGLR Cn=(3-3)/3=0
- Score: 15 Peptide: LLHGDPGEEDK Cn=(15-4)/15=0.733
- Score: 4 Peptide: MDHPEDESHSEK
- Score: 5 Peptide: SAEDLEADK
- Score: 3 Peptide: SIEAKLTLR
Keep the peptide assignments whose Cn exceeds a chosen threshold.
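The Cn values on these slides follow from the top two candidate scores of each spectrum. The sketch below reproduces them; the threshold of 0.1 is a hypothetical cut-off chosen only for the illustration.

```python
def delta_cn(scores):
    """(best - second best) / best for the candidate scores of one spectrum."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2 or ranked[0] <= 0:
        return 0.0
    return (ranked[0] - ranked[1]) / ranked[0]


# Leading candidate scores per spectrum, as listed on the slides.
candidate_scores = {
    "SHLITLLLFLFHSETICR": [32, 4, 3, 2],
    "AELDLNMTR": [4, 4, 4, 3],
    "MEICRGLR": [3, 3, 3, 3],
    "LLHGDPGEEDK": [15, 4, 4, 3],
}

THRESHOLD = 0.1  # hypothetical cut-off
accepted = [
    peptide for peptide, scores in candidate_scores.items()
    if delta_cn(scores) > THRESHOLD
]
```

Only the two assignments whose best score clearly stands out from the runner-up survive the filter.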

76 Probability based peptide assignment validation
- Compute the statistical significance of the score. The statistical significance of a score s is the probability of observing a random score x that is higher than or equal to s, formally P(x >= s). This probability is called the p-value. There are 3 approaches: 1. using analytical functions, 2. fitting a distribution to a sample of random scores, 3. a non-parametric approach.
- Compute the probability that the peptide assignment with the corresponding score is correct.

77 Probability based peptide assignment validation
Very loosely speaking, the probability based approach measures how far the score is from a random score.

78 Probability based peptide assignment validation: random scores
A random score is a score obtained by comparing a randomly selected experimental spectrum with a randomly selected theoretical spectrum. The random score has a probability density distribution, which depends on the scoring function and serves as the null hypothesis.

79 Probability based peptide assignment validation: random matches
Random matches are caused by matches with noise. The distribution of the random scores depends on the scoring function.

80 Probability based peptide assignment validation: 1. Analytical function
The function depends on the scoring function, and its parameters are calculated from the spectra to be compared.
- For the SPC scoring function, the distribution of the random scores can be modeled with a hypergeometric distribution.
- For the inner product scoring function, the random scores can be modeled with a normal distribution.
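For the SPC case, a hypergeometric p-value can be sketched as below. The parameterization is an assumption, since the slides do not spell it out: the m/z range is divided into n_bins bins, n_occupied of them contain an experimental peak, and the n_theoretical theoretical peaks are treated as a random draw of bins without replacement.

```python
from math import comb


def spc_p_value(k, n_theoretical, n_occupied, n_bins):
    """P(X >= k) for a hypergeometric number of random peak matches X."""
    total = comb(n_bins, n_theoretical)
    upper = min(n_theoretical, n_occupied)
    tail = sum(
        comb(n_occupied, i) * comb(n_bins - n_occupied, n_theoretical - i)
        for i in range(k, upper + 1)
    )
    return tail / total


# Small hypothetical case: 10 bins, 3 occupied, 2 theoretical peaks.
p_at_least_one = spc_p_value(1, 2, 3, 10)

# SPC = 7 as on the SPC slide; the other three numbers are hypothetical.
p_spc_7 = spc_p_value(7, 10, 50, 2000)
```

A small p-value means that a shared peak count this high would rarely arise from random matches alone.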

81 Probability based approach: 2. Fitting a distribution
Build a histogram of the scores that were obtained during the comparison. Fit a known distribution function to it, and use the fitted function to calculate the p-value of the top score.

82 Probability based approach: decoy approach
Make a dummy (decoy) dataset, big enough to obtain solid statistics. A decoy dataset can be made by:
- random shuffling,
- Markov-chain generated amino acid sequences,
- more typically, simply reversing the sequences of the proteins in the database (sometimes called a reverse database).
No correct matches are expected from the decoy dataset, so the scores obtained on it are used as a good estimate of the random score distribution.
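Building a reverse database can be sketched in a few lines. The dictionary layout and the "Decoy_" prefix are assumptions made for the illustration; the example peptide MEICRGLR is taken from the slides, and the entry name is hypothetical.

```python
def make_decoys(proteins, prefix="Decoy_"):
    """Return a decoy database in which every protein sequence is reversed."""
    return {prefix + name: sequence[::-1] for name, sequence in proteins.items()}


# Tiny hypothetical database: entry name -> amino acid sequence.
target_db = {"protein_1": "MEICRGLR"}
decoy_db = make_decoys(target_db)
```

Reversal keeps the amino acid composition and the length of every protein, so the decoy peptides have a realistic mass distribution while, in general, not matching real sequences.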

83 Decoy Protein sequence DB
Scores: 13. 15 6. 4 1. 4 9. 3 4. 3 3. 2 7. 2 11. 2 8. 1 10. 1 2. 1 5. 1 12. 1 Input data Peptide assignment Validation Protein inference Quantitation Interpretation Input data Experimental Spectra Spectra comparison: Protein sequence DB Decoy Protein sequence DB >IPI:IPI |SWISS-PROT:P MEICRGLRSHLITLLLFLFHSETICRPSGRKSSKMQAFRIWDVNQKTFYLRNNQLVAGYLQGPNVNLEEKIDVVPIEPHALFLGIHGGKMCLSCVKSGDETRLQLEAVNITDLSENRKQDKRFAFIRSDSGPTTSFESAACPGWFLCTAMEADQPVSLTNMPDEGVMVTKFYFQEDE

84 Decoy Protein sequence DB
Scores: 13. 15 6. 4 1. 4 9. 3 4. 3 3. 2 7. 2 11. 2 8. 1 10. 1 2. 1 5. 1 12. 1 Decoy Scores: 5. 4 3. 4 4. 4 10. 3 8. 3 7. 3 2. 2 6. 2 1. 2 12. 1 9. 1 11. 1 Input data Peptide assignment Validation Protein inference Quantitation Interpretation Input data Experimental Spectra Spectra comparison: Protein sequence DB Decoy Protein sequence DB >Decoy_protein_sequence_1 EDEQFYFKTVMVGEDPMNTRLSVPQDAEMATCLFWGPCAASEFSTTPGSDSRIFAFRKDQKRNESLDTINVAELQLRTEDGSKVCSLCMKGGHIGLFLAHPEIPVVDIKEELNVNPGQLYGAVLQNNRLYFTKQNVDWIRFAQMKSSKRGSPRCITESHFLFLLLTILHSRLGRCIEM

85 Can provide a more accurate random distribution model.
Decoy approach. Pro: can provide a more accurate random distribution model. Con: doubles the execution time. Frequently applied approach!

86 Non-parametric approach.
Non-parametric approach. Instead of fitting a probability density function to the histogram, calculate the percentage of the scores on the decoy dataset that are equal to or higher than the actual top score.
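The non-parametric estimate is a simple count. The sketch below uses the decoy scores listed on the slides; the function name `empirical_p_value` is an illustrative choice.

```python
# Decoy scores listed on the slides, best first.
decoy_scores = [4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1]
top_score = 15  # best score obtained against the real database

def empirical_p_value(score, decoy_scores):
    """Fraction of decoy scores that are equal to or higher than `score`."""
    hits = sum(1 for d in decoy_scores if d >= score)
    return hits / len(decoy_scores)

print(empirical_p_value(top_score, decoy_scores))
```

With only twelve decoy scores the estimate is coarse: no decoy reaches the top score, so the fraction is zero, which only shows that the true p-value is probably below 1/12. This is why the decoy dataset has to be big enough to give solid statistics.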

87 Decoy Protein sequence DB
False Positive Rate (FPR): the probability of labelling a random score significant (area B in the figure). An FPR of 0.01 means that 1% of the random scores are labelled significant. E-value: the E-value of a query hit with score h is the expected number of database elements with a random score greater than or equal to h in a database of n entries. For instance, an E-value of 10^-2 means that a score of h or higher is expected to occur by chance only once in 100 independent similarity searches over the database. If the E-value is 10, then ten random hits with a score greater than or equal to h are expected within a single similarity search.
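Both quantities can be estimated from a set of random scores. The sketch below assumes that the decoy scores from the slides stand in for the random scores and that the n database comparisons are independent, so that the E-value is n times the probability of a random score reaching h. The function names and the database size of 100 are hypothetical.

```python
# Decoy scores listed on the slides, used here as the random scores.
random_scores = [4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1]

def false_positive_rate(threshold, random_scores):
    """Fraction of random scores labelled significant at this threshold."""
    significant = sum(1 for s in random_scores if s >= threshold)
    return significant / len(random_scores)

def e_value(h, random_scores, n):
    """Expected number of random hits with score >= h among n database entries."""
    p = false_positive_rate(h, random_scores)  # P(random score >= h)
    return n * p

print(false_positive_rate(4, random_scores))
print(e_value(4, random_scores, n=100))
```

The E-value grows with the database size: the same score h is less convincing when it is the best of many random comparisons.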

88 Decoy dataset is used to calculate the FDR.
False Discovery Rate (FDR): the ratio of random scores among the scores labelled significant, formally FDR = A/(A+B), with the areas as labelled in the figure. An FDR of 0.01 means that 1% of the scores labelled significant are actually observed by chance. The FDR is often used to control the proportion of false positives. The threshold T can be set to keep the FDR under a certain level; typical levels are 0.01 or 0.05, i.e. experimenters set thresholds so that 1% or 5% of the accepted hits are expected to be false positives. The lower the FDR, the more true (non-random) similarity hits are lost. The decoy dataset is used to calculate the FDR.
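A minimal sketch of a decoy-based FDR estimate follows. The slide does not state which estimator is used; the one below, the number of decoy scores above the threshold divided by the number of target scores above it, is a common choice and is an assumption here, as are the function names.

```python
# Target and decoy scores listed on the slides, best first.
target_scores = [15, 4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1]
decoy_scores = [4, 4, 4, 3, 3, 3, 2, 2, 2, 1, 1, 1]

def estimate_fdr(threshold, target_scores, decoy_scores):
    """Estimated share of random hits among target hits scoring >= threshold."""
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        return 0.0                      # nothing accepted, nothing false
    return min(1.0, n_decoy / n_target)  # a rate cannot exceed 1

def threshold_for_fdr(level, target_scores, decoy_scores):
    """Lowest score threshold T whose estimated FDR stays at or below `level`."""
    for t in sorted(set(target_scores)):
        if estimate_fdr(t, target_scores, decoy_scores) <= level:
            return t
    return None  # no threshold reaches the requested level

T = threshold_for_fdr(0.01, target_scores, decoy_scores)
print(T, estimate_fdr(T, target_scores, decoy_scores))
```

On this toy data only the top-scoring hit survives a 1% FDR, which illustrates the trade-off named on the slide: a stricter FDR level discards more hits, including some true ones.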

89 Peptide assignment has to be validated.
Summary: Peptide assignment has to be validated. Relative scoring or probability-based scoring can be applied. False positives (false assignments) can be kept under a certain level.

90 Protein Inference

91 Take the peptides that passed the validation. The goal of this step is to infer the proteins that could have produced these peptides. The task is not trivial. Score: 32, Peptide: SHLITLLLFLFHSETICR. Score: 15, Peptide: LLHGDPGEEDK.

92 Proteins: Peptides:
Proteins: (figure not reproduced). Peptides: MDHPEDESHSEK, QDDEEALARLEEIK, SIETLR, QNNLNPERPDSAYLR, LKQINEEQREGLR, FVSEAVTAICEAK, SSDIQAAVQICSLLHQR, EFSASLTQGLLK, SAEDLEADK.


95 By Occam’s razor, Protein A should be preferred. Proteins A, B and C can be homologous proteins.
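The Occam's razor argument amounts to explaining all validated peptides with as few proteins as possible. The sketch below uses a greedy set-cover heuristic, which is one simple way to approximate this and is not claimed to be the method used in the lecture. The peptide-to-protein mapping is hypothetical, since the figure is not reproduced in the transcript.

```python
def parsimonious_proteins(protein_to_peptides):
    """Greedily pick proteins until every observed peptide is explained."""
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # Protein explaining the most still-unexplained peptides;
        # iterating in sorted order makes ties resolve alphabetically.
        best = max(sorted(protein_to_peptides),
                   key=lambda p: len(protein_to_peptides[p] & uncovered))
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen

# Hypothetical mapping: Protein A contains all three peptides,
# while B and C share only some of them (e.g. because they are homologous).
protein_to_peptides = {
    "Protein A": {"MDHPEDESHSEK", "QDDEEALARLEEIK", "SIETLR"},
    "Protein B": {"MDHPEDESHSEK", "QDDEEALARLEEIK"},
    "Protein C": {"SIETLR"},
}
print(parsimonious_proteins(protein_to_peptides))
```

Under this mapping Protein A alone explains every peptide, so B and C are not needed. The greedy choice is a heuristic and does not guarantee the smallest possible protein set in general.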

96 Many models have been developed to cope with this problem.
Many models have been developed to cope with this problem: statistical, graph-theory based and spectral-network based. A well-known method is ProteinProphet.

97 Summary
Pipeline: Input data (data formats), Peptide identification (database searching), Validation (statistical methods for validation), Protein inference (protein assembling), Quantitation, Interpretation.

98 Database Searching
Database searching. Pros: simple and straightforward; has a limited search space; completeness; statistical analysis can be carried out. Cons: has a limited search space, i.e. it is limited to the database; enumerating all candidates is too slow, particularly when modifications and non-tryptic peptides must be considered (a modern instrument produces on the order of a million spectra per day).

