1 Protein Subcellular Localization Prediction of Eukaryotes using a Knowledge-based Approach
Hsin-Nan Lin, Ching-Tai Chen, Ting-Yi Sung, Shinn-Ying Ho, Wen-Lian Hsu
Bioinformatics Program, TIGP (Taiwan International Graduate Program), Academia Sinica, Taiwan

2 Outline
Introduction
Methods
  Construction of the knowledge base SPKB
  KnowPredsite: a localization prediction method using SPKB
Dataset
Prediction Results
Conclusions

3 Protein Subcellular Localization (PSL)
Given a protein, determine its subcellular compartment
  e.g., mitochondria, cytoplasm, nucleus
It is important because it helps to
  Elucidate protein functions
  Identify potential diagnostic and drug targets
Wet-lab experiments are
  Time consuming
  Labor intensive
Computational prediction is needed.

4 Existing PSL Predictors
Using various features
  Features extracted from literature or public databases
  Phylogenetic profiling
  Compartment-specific features
Main problems of many predictors
  They only predict a limited number of locations
  Limited to subsets of proteomes, e.g., those containing signal peptide sequences
  Designed for specific species
  Designed for single-localized protein sequences
    Up to 35% of proteins move between different cellular compartments

5 Outline
Introduction
Methods
  Concept behind the method
  Construction of the knowledge base SPKB
  KnowPredsite: a localization prediction method using SPKB
Dataset
Prediction Results
Conclusions

6 Basic Concept Behind the Method
Transitivity Relations

7 A New Protein Feature – Similar Peptide
High Scoring Pair (HSP)
  A significant local pairwise alignment of two proteins by PSI-BLAST
Interchangeable Amino Acid Pair
  A pair of aligned amino acids with a positive score in BLOSUM62
Similarity Level
  The number of interchangeable amino acid pairs within a sliding window
Similar Peptide
  An n-gram peptide fragment from a similar protein
  Represents possible sequence variation

8 An Example of High Scoring Pairs (HSP)
MYKKILY
MY KIL
MYSKILL
Window size = 7
Pairwise similarity = 5
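The count on this slide can be reproduced directly from the middle line of the alignment. The sketch below assumes the middle line follows the usual BLAST convention: the residue letter for an identity, `+` for a non-identical pair with a positive substitution score, and a space otherwise. The function name `pairwise_similarity` is illustrative, not taken from the presentation.

```python
def pairwise_similarity(midline: str) -> int:
    """Count interchangeable pairs in one window of a BLAST midline.

    Every non-space character marks a pair with a positive score
    (identities and '+' positions alike).
    """
    return sum(1 for ch in midline if ch != " ")


query = "MYKKILY"
midline = "MY KIL "  # positions 3 (K/S) and 7 (Y/L) are not interchangeable
subject = "MYSKILL"

# All three rows span the same window of size 7.
assert len(query) == len(midline) == len(subject) == 7
print(pairwise_similarity(midline))  # 5
```

Counting from the midline avoids embedding a substitution matrix in the example; a real pipeline would more likely score each aligned pair against BLOSUM62 itself.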

9 Construction of Similar Peptide Knowledge Base (SPKB)
A protein with annotated localization site (CYT):  MENIKKE
HSP from PSI-BLAST:                                ME +KK
A protein from NCBI NR database:                   MEAVKKS
Similarity level = 4
Pairwise similarity = 5
If pairwise similarity ≥ similarity level, MEAVKKS is a similar peptide and inherits CYT
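A minimal sketch of this construction step, assuming an ungapped HSP given as a subject sequence plus its BLAST-style midline, and a knowledge base modeled as a plain dictionary from peptide to inherited sites. The names `similar_peptides`, `add_hsp`, and `spkb` are hypothetical; the presentation does not describe its actual storage format.

```python
from collections import defaultdict


def similar_peptides(subject, midline, window, level):
    """Yield each subject window whose midline holds at least `level`
    non-space characters (identities or '+' positions)."""
    for i in range(len(subject) - window + 1):
        positives = sum(ch != " " for ch in midline[i:i + window])
        if positives >= level:
            yield subject[i:i + window]


def add_hsp(spkb, subject, midline, sites, window=7, level=4):
    """Store every similar peptide of one HSP and let it inherit the
    localization sites of the annotated protein."""
    for peptide in similar_peptides(subject, midline, window, level):
        spkb[peptide].update(sites)


spkb = defaultdict(set)  # peptide -> set of inherited localization sites
# The annotated protein (MENIKKE, CYT) aligned to the NR protein (MEAVKKS).
add_hsp(spkb, "MEAVKKS", "ME +KK ", {"CYT"})
print(dict(spkb))  # {'MEAVKKS': {'CYT'}}
```

With `level=6` the same HSP would add nothing, since only five pairs in the window are interchangeable.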

10 A similar peptide entry for protein subcellular localization
Peptide: MYSKILL
We store the similar peptides in a knowledge base called SPKB, which holds millions of similar peptide records. For example, this is a similar peptide record whose representative sequence is MYSKILL. It has two protein members, 1ark and 1ata, and it inherits two different secondary structures. The similar peptide record is the basic unit we use for predicting protein secondary structures.

11 KnowPredsite: a localization prediction method using SPKB

12 Blast-hit Method Serves as a baseline approach
Use Blast to find the most similar sequence
Inherit the localization annotation
E-value = 10^-3
If there is no hit, no annotation will be inherited
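The baseline can be sketched as follows. Two assumptions are made here that the slide does not state: the hit with the lowest E-value is treated as "the most similar sequence", and 10^-3 is applied as an inclusive upper bound. The `Hit` record and its field names are hypothetical stand-ins for parsed Blast output.

```python
from typing import FrozenSet, List, NamedTuple, Optional


class Hit(NamedTuple):
    subject_id: str        # identifier of the matched protein
    evalue: float          # Blast E-value of the hit
    sites: FrozenSet[str]  # annotated localization sites of that protein


def blast_hit_predict(hits: List[Hit], max_evalue: float = 1e-3) -> Optional[FrozenSet[str]]:
    """Inherit the annotation of the best significant hit, or None if
    no hit passes the E-value cutoff."""
    significant = [h for h in hits if h.evalue <= max_evalue]
    if not significant:
        return None  # no hit, so no annotation is inherited
    return min(significant, key=lambda h: h.evalue).sites


# Made-up hits for illustration only.
hits = [
    Hit("protA", 2e-40, frozenset({"NUC"})),
    Hit("protB", 5e-12, frozenset({"CYT"})),
    Hit("protC", 0.5, frozenset({"MIT"})),
]
print(blast_hit_predict(hits))      # frozenset({'NUC'})
print(blast_hit_predict(hits[2:]))  # None
```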

13 Evaluation

14 Outline
Introduction
Methods
  Concept behind the method
  Construction of the knowledge base SPKB
  KnowPredsite: a localization prediction method using SPKB
Dataset
Prediction Results
Conclusions

15 Dataset
From ngLOC method
25,887 single-localized proteins
2,169 multi-localized proteins
1,923 different species
10 different subcellular locations
  CYT (cytoplasm)
  CSK (cytoskeleton)
  END (endoplasmic reticulum)
  EXC (extracellular)
  GOL (Golgi apparatus)
  LYS (lysosome)
  MIT (mitochondria)
  NUC (nucleus)
  PLA (plasma membrane)
  POX (peroxisome)

16 Outline
Introduction
Methods
  Concept behind the method
  Construction of the knowledge base SPKB
  KnowPredsite: a localization prediction method using SPKB
Dataset
Prediction Results
Conclusions

17 Finding the Best Similarity Level and Window Size
Leave-one-out cross validation is performed.
Overall Accuracy (%) by Similarity Level (1 2 3 4 5 6 7 8)
w=7  91.2  91.3  91.4  91.5  91.8  92.0  91.6
w=8  91.7  90.9

18 Prediction Performance for Single-localized Proteins
*KnowPredsite: leave-one-out cross validation
#KnowPredsite: ten-fold cross validation

Overall Accuracy (%), single-localized proteins
Methods        Top 1  Top 2  Top 3  Top 4
*KnowPredsite  92.0   95.7   96.8   98.1
#KnowPredsite  91.7   95.4   96.6   97.9
ngLOC          88.8   92.2   94.5   96.3
Blast-hit      86.0

19 Prediction Performance for Multi-localized Proteins
Overall Accuracy (%), multi-localized proteins
Methods  Top 1  Top 2  Top 3  Top 4

At least 1 correct
  *KnowPredsite  90.8  96.4  98.2  98.9
  #KnowPredsite  90.1  96.1  98.1
  ngLOC  81.9  92.0  97.4
  Blast-hit  78.8

Both correct
  *KnowPredsite  74.3  83.3  88.7
  #KnowPredsite  72.1  82.2  87.5
  ngLOC  59.7  73.8  83.2
  Blast-hit  45.7
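The two criteria for multi-localized proteins can be read as follows: a Top-k prediction counts as "at least 1 correct" when any annotated site is among the k best-scoring sites, and as "both correct" when all annotated sites are. The sketch below encodes that reading; it is an interpretation of the table headings, not the authors' evaluation code. The scores are those listed for MCA3_MOUSE in its case study, and treating CYT and NUC (the starred columns there) as the annotated sites is an assumption.

```python
def top_k_hit(scores, true_sites, k, require_all=False):
    """Check the k best-scoring sites against the annotated sites."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    found = set(ranked) & set(true_sites)
    if require_all:
        return found == set(true_sites)  # "both correct"
    return bool(found)                   # "at least 1 correct"


def overall_accuracy(proteins, k, require_all=False):
    """Percentage of (scores, true_sites) pairs that pass the Top-k check."""
    hits = sum(top_k_hit(s, t, k, require_all) for s, t in proteins)
    return 100.0 * hits / len(proteins)


# Confidence scores listed for MCA3_MOUSE in the case study.
mca3 = {"CYT": 95.46, "CSK": 0.3, "END": 0.27, "EXC": 0.36, "GOL": 0.2,
        "LYS": 0.01, "MIT": 1.13, "NUC": 93.59, "PLA": 1.82, "POX": 0.22}
truth = {"CYT", "NUC"}  # assumed annotation (starred sites)

print(top_k_hit(mca3, truth, k=1))                    # True
print(top_k_hit(mca3, truth, k=1, require_all=True))  # False
print(top_k_hit(mca3, truth, k=2, require_all=True))  # True
```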

20 Site-Specific Prediction Performance
Site i  Occurrence in the dataset (%)  Precision (%)  Accuracy_i (%)  MCC_i
CYT  11.1  75.7  84.4  0.774
CSK  1.0  81.1  52.0  0.645
END  3.6  92.9  84.1  0.88
EXC  29.1  98.5  93.9  0.946
GOL  1.1  79.1  70.9  0.746
LYS  0.6  87.2  81.9  0.844
MIT  9.4  96.7  86.9  0.907
NUC  18.0  87.3  93.8  0.884
PLA  25.2  94.4  96.4  0.938
POX  0.8  85.1  0.861

21 Multi-localized Confidence Score (MLCS)
We follow King and Guda's method. For a protein t:
  MLCS(t) = (CS_1 + CS_2) - (CS_1^2 - CS_2^2)/100
  CS_1: highest confidence score among all the sites
  CS_2: 2nd highest confidence score among all the sites
Best MLCS threshold of KnowPredsite = 20
  86.3% of multi-localized proteins have MLCS > 20
  82.8% of single-localized proteins have MLCS < 20
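A short sketch of the score, taking the subtracted term as the difference of the squared scores divided by 100. That reading reproduces the MLCS of 7.40 reported for EF1A2_RABIT in the case study from the scores listed there. The function names are illustrative, and the threshold of 20 is the value quoted on this slide.

```python
def mlcs(scores):
    """Multi-localized confidence score from per-site confidence scores."""
    cs1, cs2 = sorted(scores, reverse=True)[:2]  # two highest scores
    return (cs1 + cs2) - (cs1 ** 2 - cs2 ** 2) / 100


def looks_multi_localized(scores, threshold=20.0):
    """Flag a protein as multi-localized when its MLCS exceeds the threshold."""
    return mlcs(scores) > threshold


# Confidence scores listed for EF1A2_RABIT in the case study.
ef1a2 = [95.45, 1.45, 0.04, 2.97, 0.05]
print(round(mlcs(ef1a2), 2))         # 7.4
print(looks_multi_localized(ef1a2))  # False
```

When the top two scores are close, the squared term is small and the MLCS approaches their sum; when one site dominates, the squared term cancels most of it.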

22 Case Study: EF1A2_RABIT
Swiss-Prot: NUC
Gene Ontology: CYT & NUC

Query  CYT  CSK  END  EXC  GOL  LYS  MIT  NUC*  PLA  POX  MLCS
EF1A2_RABIT  95.45  1.45  0.04  2.97  0.05  7.40

Template  NUC  SI
EF1A2_RAT  2.94  99.78
EF1A_CHICK  2.77  92.22
EF1A1_HUMAN  2.75
EF1A1_RAT
EF1A0_XENLA  2.69  90.06
EF1A_BRARE  2.64
EF1A2_XENLA  88.79
EF1A3_XENLA  2.60  88.55

23 Case Study: MCA3_MOUSE

Query  CYT*  CSK  END  EXC  GOL  LYS  MIT  NUC*  PLA  POX  MLCS
MCA3_MOUSE  95.46  0.3  0.27  0.36  0.2  0.01  1.13  93.59  1.82  0.22  100

Template  CYT  NUC  SI
MCA3_HUMAN  89.16  88.51
EF1G1_YEAST  2.74  2.47  8.67
EF1G2_YEAST  0.49  8.50
GSTA_PLEPL  0.35  15.86
SYEC_YEAST  0.16  3.86
CCNA1_MOUSE  0.15  7.36
NU155_RAT  0.14  3.17
GCYB2_HUMAN  4.86

24 Outline
Introduction
Methods
  Concept behind the method
  Construction of the knowledge base SPKB
  KnowPredsite: a localization prediction method using SPKB
Dataset
Prediction Results
Conclusions

25 Conclusions
We proposed KnowPredsite, a sequence-based prediction method which
  Predicts single-localized proteins with 92% accuracy
  Predicts multi-localized proteins with 74.3% accuracy
  Is suitable for proteome-wide prediction
    25,887 single-localized proteins
    2,169 multi-localized proteins
  Provides interpretable prediction results through template proteins
KnowPredsite can be easily improved by incrementally expanding the SPKB

26 Thank You.

27 Multi-localized Confidence Score
TP: a multi-localized protein with MLCS > 20
TN: a single-localized protein with MLCS < 20

28 Site-Specific Comparison
ngLOC performs better in CYT & LYS

