Minimize Database-Dependence in Proteome Informatics Apr. 28, 2009 Kyung-Hoon Kwon Korea Basic Science Institute
A story from our lab
1. Sequest search for biomarker discovery after the 1st experiment on a sample
2. Decoy approach within a 5% error rate
3. Preliminary biological interpretation of the 1st dataset
4. We installed a high-performance Mascot server.
5. Mascot search of the 1st dataset
6. Very different protein list from the previous Sequest result; the presumed markers disappeared.
7. Need to change the biological interpretation for the Mascot result
8. Two more experiments for confirmation
9. We had to select which search engine to use for the analysis of the 2nd and 3rd datasets.
Which one is the correct interpretation?
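The decoy step in the story can be illustrated with a minimal sketch. It assumes a concatenated target-decoy search in which the error rate at a score cutoff is estimated as decoy hits divided by target hits; the function name, the input layout, and this particular estimator are assumptions for illustration, not the lab's actual procedure.

```python
def score_threshold_for_fdr(psms, max_fdr=0.05):
    """Return the lowest score cutoff whose estimated error rate stays within max_fdr.

    psms: list of (score, is_decoy) pairs, where a higher score is a better match.
    The error rate at a cutoff is estimated as decoy hits / target hits
    among all matches scoring at or above that cutoff (an assumed estimator).
    Returns None when no cutoff satisfies the bound.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Keep lowering the cutoff while the estimate stays acceptable.
        if targets and decoys / targets <= max_fdr:
            best = score
    return best
```

Because the cutoff depends on each engine's own score distribution, the same 5% bound can accept different spectra in Sequest and in Mascot.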
Factors that affect proteome analysis
Experimental dependence
▫Instrumentation : ESI/MS, MALDI/MS, …
▫Reagent : enzyme for proteolysis, isotope tag for quantitation, affinity tag for enrichment, …
▫Protocol : MudPIT, IPAS, …
Informatics dependence
▫Software : Mascot, Sequest, PEAKS, X!Tandem, OMSSA, Lutefisk, …
▫Data analysis protocol : decoy, peptideProphet, …
▫Sequence database : SwissProt, IPI, NCBI nr, OWL, …
Different methods give different results.
Different experiment P.A. Kirkland et al., J. Proteome Res. 2008, 7(11),
Search engines
[Venn diagram: spectra identified by SEQUEST, X!Tandem, and Mascot; region percentages 9%, 19%, 7%, 34%, 5%, 4%, 22%]
Each search engine identifies about the same number of spectra, but the overlap is surprisingly small. Different search engines match different spectra.
B.C. Searle, Improving Sensitivity by Combining Results from Multiple Search Methodologies, Proteome Software Inc.
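A comparison of this kind can be reproduced from each engine's list of identified spectra. The sketch below is only an illustration: the function name and the input layout (engine name mapped to a set of spectrum identifiers) are assumptions, and the data in the usage note is made up.

```python
def venn_percentages(ids_by_engine):
    """Return the share of spectra identified by exactly each combination of engines.

    ids_by_engine: dict mapping an engine name to the set of spectrum ids it identified.
    The result maps a frozenset of engine names to a percentage of all identified spectra.
    """
    all_spectra = set().union(*ids_by_engine.values())
    counts = {}
    for spectrum in all_spectra:
        # The key is the exact set of engines that identified this spectrum.
        key = frozenset(e for e, ids in ids_by_engine.items() if spectrum in ids)
        counts[key] = counts.get(key, 0) + 1
    total = len(all_spectra)
    return {key: 100.0 * n / total for key, n in counts.items()}
```

For example, with `{"SEQUEST": {1, 2, 3}, "Mascot": {2, 3, 4}, "X!Tandem": {3, 5}}` each of the five spectra falls in a different region, so only one region (spectrum 3) is shared by all three engines.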
For each spectrum:
Get Mascot IDs, SEQUEST IDs, and X!Tandem IDs
Calculate SEQUEST, Mascot, and X!Tandem probabilities (Peptide Prophet*)
Calculate combined peptide probability (Scaffold Merger)
Calculate protein probabilities (Protein Prophet*) …
*Nesvizhskii, A. I. et al, Anal. Chem. 2003, 75,
Scaffold uses Nesvizhskii’s algorithm to convert SEQUEST and Mascot scores to peptide probabilities. Scaffold uses another algorithm by Nesvizhskii to combine peptide probabilities.
B.C. Searle, Improving Sensitivity by Combining Results from Multiple Search Methodologies, Proteome Software Inc.
Factors that affect proteome analysis
Experimental dependence
▫Instrumentation : ESI/MS, MALDI/MS, …
▫Reagent : enzyme for proteolysis, isotope tag for quantitation, affinity tag for filtering, …
▫Protocol : MudPIT, IPAS, …
Data analysis method dependence
▫Software : Mascot, Sequest, PEAKS, X!Tandem, OMSSA, Lutefisk, …
▫Data analysis protocol : decoy, peptideProphet, …
▫Sequence database : SwissProt, IPI, NCBI nr, OWL, …
Solution : integration, which is expensive for many small laboratories
Suggestion : group proteins
As usual : grouping the identified proteins
[Diagram: Peptides 1 to 7 linked to Proteins A to F]
X. Li, et al., ‘Comparison of alternative analytical techniques for the characterisation of the human serum proteome in HUPO Plasma Proteome Project’, Proteomics 2005, 5,
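Grouping identified proteins by shared peptides can be sketched as below. The rule used here (proteins linked directly or transitively by at least one shared peptide form one group) is one common convention and an assumption; the slide does not state the exact rule, and the names are hypothetical.

```python
def group_proteins(peptides_by_protein):
    """Group proteins linked, directly or transitively, by shared peptides.

    peptides_by_protein: dict mapping a protein id to the set of its identified peptides.
    Returns a sorted list of groups, each a sorted list of protein ids.
    """
    parent = {p: p for p in peptides_by_protein}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    first_owner = {}
    for protein, peptides in peptides_by_protein.items():
        for pep in peptides:
            if pep in first_owner:
                # Shared peptide: merge this protein with the peptide's first owner.
                parent[find(protein)] = find(first_owner[pep])
            else:
                first_owner[pep] = protein

    groups = {}
    for protein in peptides_by_protein:
        groups.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(g) for g in groups.values())
```

With a made-up mapping such as A: {p1, p2}, B: {p2, p3}, C: {p4}, D: {p5, p6}, E: {p6}, F: {p7}, the groups are [A, B], [C], [D, E], and [F].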
Idea : optimize the database size (a tightrope walk)
A smaller database is fast to search but may miss many true sequences, giving fewer true positives.
A larger database includes more true sequences, but it is slower to search and may also produce many false positives.
Sequence databases
▫IPI human, IPI mouse (EBI) : HUPO recommendation
▫SwissProt (EBI)
▫nr (NCBI)
▫EST database
Grouping the proteins of the sequence database by similar sequences before the database search
[Diagram: groups of similar sequences; 1st search, 2nd search, 3rd search]
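One way to picture the grouping step is a greedy clustering that keeps the longest sequence of each group as its representative. The slides do not name the similarity measure or threshold, so the `difflib` ratio and the 0.9 cutoff below are stand-in assumptions; a real pipeline would use a dedicated sequence clustering tool.

```python
from difflib import SequenceMatcher


def cluster_similar(sequences, min_ratio=0.9):
    """Greedily cluster sequences around representatives.

    sequences: dict mapping a protein id to its amino-acid sequence.
    Returns a dict mapping each representative id to the list of member ids,
    with the representative listed first.
    """
    clusters = {}
    # Visit the longest sequences first so each representative is the longest member.
    for pid in sorted(sequences, key=lambda p: (-len(sequences[p]), p)):
        seq = sequences[pid]
        for rep in clusters:
            ratio = SequenceMatcher(None, sequences[rep], seq, autojunk=False).ratio()
            if ratio >= min_ratio:
                clusters[rep].append(pid)
                break
        else:
            # No existing representative is similar enough: start a new group.
            clusters[pid] = [pid]
    return clusters
```

The keys of the result form the representative database; the full member lists are kept so that identified groups can be expanded again in a later search.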
1st search with the IPI human representative database
▫IPI kDa protein
▫IPI Gamma-globin
▫IPI Guanine deaminase
▫IPI Guanine deaminase
▫IPI kDa protein
▫IPI Malate dehydrogenase
2nd search with groups of the IPI human database selected from the IPI representative DB search
▫IPI Beta-globin; IPI Hemoglobin; IPI Gamma-globin; IPI Hemoglobin; IPI kDa protein; IPI Hemoglobin; IPI Hemoglobin -1; IPI Hemoglobin -2
▫IPI Malate dehydrogenase
▫IPI Guanine deaminase; IPI Guanine deaminase; IPI Guanine deaminase; IPI Guanine deaminase
▫IPI Hemoglobin; IPI kDa protein; IPI Beta-globin gene; IPI Hemoglobin; IPI Hemoglobin Lepore-Baltimore; IPI kDa protein; IPI Delta-hemoglobin; IPI Gamma-G globin
3rd search with groups of the NCBI nr human database selected from the IPI representative DB search
[Diagram: IPI representative entries (kDa protein, Gamma-globin, Guanine deaminase, Malate dehydrogenase) linked to groups of NCBI nr gi| entries]
1st search with the representative DB, 2nd search with the group DB of the identified representative proteins
▫Keratin : type I cytoskeletal, epidermal type I, type I cuticular
▫Cell division protein kinase, tyrosine-protein kinase, serine/threonine-protein kinase, fibroblast growth factor receptor
▫Guanine nucleotide binding protein
▫Septin
▫Ras-related proteins
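The hand-off between the 1st and 2nd search can be sketched as follows: take the representatives identified in the 1st search, expand each to its full group, and write those sequences out as the database for the 2nd search. The function names, the identifiers, and the in-memory data layout are hypothetical; the slides do not describe the actual implementation.

```python
def build_group_database(identified_reps, groups, sequences):
    """Collect every member of each group whose representative was identified.

    identified_reps: representative ids found in the 1st search.
    groups: dict mapping a representative id to the list of its group member ids.
    sequences: dict mapping a protein id to its sequence (the full database).
    Returns a dict of the selected ids and sequences for the 2nd search.
    """
    selected = {}
    for rep in identified_reps:
        # A representative without a recorded group stands for itself.
        for member in groups.get(rep, [rep]):
            if member in sequences:
                selected[member] = sequences[member]
    return selected


def to_fasta(db):
    """Render a dict of id -> sequence as FASTA text."""
    return "".join(f">{pid}\n{seq}\n" for pid, seq in db.items())
```

The same expansion applies to the 3rd search if `groups` maps IPI representatives to their similar NCBI nr entries instead of IPI entries.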
Result of the iterative MS/MS ion search
database | IPI human database | IPI human representative | IPI human selected groups | NCBI nr human selected groups
Number of proteins | 48,193 | 24,120 | 6,860 | 32,916
Redundant proteins identified | 5,584 | 2,336 | 5,288 | 22,895
Non-redundant proteins identified | 2,944 | 2,136 | 2,934 | 4,090
Redundant peptides identified | 10,486 | 5,585 | 11,066 | 17,500
Non-redundant peptides identified | 6,124 | 5,177 | 6,569 | 6,580
Material : membranous fraction of human brain temporal lobe tissue
Experimental methods : multidimensional separation / LTQ-MS/MS (ThermoFinnigan)
Database analysis : TurboSEQUEST (ThermoFinnigan), DTASelect (Scripps Institute)
Database : IPI.HUMAN.v , NCBI nr human (283,548 proteins)
Mascot vs. Sequest
IPI, Sprot, nr, IPI-representative, Sprot-representative, IPI-IDedGroup, Sprot-IDedGroup
[Figure panels: Mascot = Sequest; Mascot only]
The representative DB approach works differently. [Figure panel: Sequest only]
Comparison of Mascot and Sequest with the IPI, Sprot, nr, and representative DBs (results from MudPIT analysis of one 1D-gel band of a human cell line) [Figure annotation: lower Xcorr]
[Figure annotation: lower Xcorr]
Advantages of the representative DB approach
▫It identifies more peptide sequences without requiring more time or additional search engines.
▫It can connect different databases.
▫Additionally, we expect it to give more reliable PTM information by selecting the more probable proteins before the PTM search.
Thank you. Why have I given this presentation at HUPO PSI?