Hsin-Nan Lin, Ching-Tai Chen, Ting-Yi Sung,

Protein Subcellular Localization Prediction of Eukaryotes using a Knowledge-based Approach
Hsin-Nan Lin, Ching-Tai Chen, Ting-Yi Sung, Shinn-Ying Ho, Wen-Lian Hsu
Bioinformatics Program, TIGP (Taiwan International Graduate Program), Academia Sinica, Taiwan

Outline
- Introduction
- Methods
  - Construction of the knowledge base SPKB
  - KnowPredsite: a localization prediction method using SPKB
- Dataset
- Prediction Results
- Conclusions

Protein Subcellular Localization (PSL)
- Task: given a protein, determine its subcellular compartment (mitochondria, cytoplasm, nucleus, etc.).
- Why it is important: it helps elucidate protein functions and identify potential diagnostic and drug targets.
- Wet-lab experiments are time-consuming and labor-intensive, so computational prediction is needed.

Existing PSL Predictors
Existing predictors use various features:
- Features extracted from literature or public databases
- Phylogenetic profiling
- Compartment-specific features
Main problems of many predictors:
- They predict only a limited number of locations
- They are limited to subsets of proteomes, e.g., those containing signal peptide sequences
- They are designed for specific species
- They are designed for single-localized protein sequences, although up to 35% of proteins move between different cellular compartments


Basic Concept Behind the Method
Transitivity Relations

A New Protein Feature – Similar Peptide
- High Scoring Pair (HSP): a significant local pairwise alignment of two proteins found by PSI-BLAST.
- Interchangeable amino acid pair: an aligned pair of amino acids with a positive score in the BLOSUM62 matrix.
- Similarity level: the number of interchangeable amino acid pairs within a sliding window.
- Similar peptide: an n-gram peptide fragment from a similar protein; it represents possible sequence variation.

An Example of High Scoring Pairs (HSP)
    MYKKILY
    MY KIL
    MYSKILL
Window size = 7; pairwise similarity = 5 (the middle line marks the five interchangeable pairs).
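The count in this example can be reproduced with a short sketch. It assumes that "pairwise similarity" means the number of aligned positions whose BLOSUM62 score is positive, as the definitions on the previous slide suggest. Only the BLOSUM62 entries needed for this one alignment are typed in by hand; the function names are hypothetical and not taken from the authors' software.

```python
# Hand-copied subset of BLOSUM62, covering only the residue pairs that
# occur in the slide example (assumption: a full matrix would be used in practice).
BLOSUM62_SUBSET = {
    ("M", "M"): 5, ("Y", "Y"): 7, ("K", "K"): 5,
    ("I", "I"): 4, ("L", "L"): 4,
    ("K", "S"): 0, ("L", "Y"): -1,
}


def score(a, b, table=BLOSUM62_SUBSET):
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in table:
        return table[(a, b)]
    return table[(b, a)]


def pairwise_similarity(p, q, table=BLOSUM62_SUBSET):
    """Count aligned positions whose substitution score is positive."""
    if len(p) != len(q):
        raise ValueError("peptides must have the same length")
    return sum(1 for a, b in zip(p, q) if score(a, b, table) > 0)


print(pairwise_similarity("MYKKILY", "MYSKILL"))  # prints 5
```

K/S scores 0 and Y/L scores -1 in BLOSUM62, so those two positions do not count, which leaves the five pairs marked in the alignment.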

Construction of Similar Peptide Knowledge Base (SPKB)
- Start from a protein with an annotated localization site (here CYT).
- PSI-BLAST returns an HSP between this protein and a protein from the NCBI NR database:
    MENIKKE
    ME +KK
    MEAVKKS
- Similarity level = 4; pairwise similarity = 5.
- If pairwise similarity ≥ similarity level, MEAVKKS is a similar peptide and inherits CYT.
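A minimal sketch of this construction step follows, under several assumptions: the HSP is ungapped, the BLAST midline is available (a letter marks an identity, `+` marks a positive-scoring substitution, a blank marks everything else), and the knowledge base is modelled as a plain dictionary from peptide to inherited sites. The names `similar_peptides` and `add_hsp` are hypothetical.

```python
from collections import defaultdict


def similar_peptides(midline, subject, window, level):
    """Yield (subject_peptide, similarity) for each window that meets the level."""
    for start in range(len(subject) - window + 1):
        mid = midline[start:start + window]
        # Every non-blank midline character is an interchangeable pair.
        similarity = sum(1 for c in mid if c != " ")
        if similarity >= level:
            yield subject[start:start + window], similarity


def add_hsp(spkb, midline, subject, sites, window=7, level=4):
    """Store each similar peptide and let it inherit the query's sites."""
    midline = midline.ljust(len(subject))  # restore stripped trailing blanks
    for peptide, _ in similar_peptides(midline, subject, window, level):
        spkb[peptide].update(sites)


spkb = defaultdict(set)
add_hsp(spkb, "ME +KK", "MEAVKKS", {"CYT"})
print(dict(spkb))  # {'MEAVKKS': {'CYT'}}
```

With a longer HSP the same loop slides the window one residue at a time, so a single alignment can contribute many similar peptides.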

A similar peptide entry for protein subcellular localization
Peptide: MYSKILL
SPKB
We store the similar peptides in a knowledge base called SPKB, which holds millions of similar peptide records. For example, this is a similar peptide record whose representative sequence is MYSKILL. It has two protein members, 1ark and 1ata, and it inherits two different secondary structures. The similar peptide record is the basic unit we use for predicting protein secondary structures.

KnowPredsite: a localization prediction method using SPKB

Blast-hit Method
- Serves as a baseline approach
- Use Blast to find the most similar sequence and inherit its localization annotation
- E-value = 10^-3
- If there is no hit, no annotation is inherited
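The baseline can be sketched as below. The sketch does not call BLAST itself; it assumes the hits have already been parsed into `(subject_id, e_value)` pairs, that "most similar" means the lowest E-value, and that 10^-3 is an inclusive cutoff. The function name and the protein identifiers are made up for illustration.

```python
def blast_hit_predict(hits, annotations, max_evalue=1e-3):
    """Inherit the annotation of the best hit, or return None if there is no hit.

    hits:        list of (subject_id, e_value) pairs from a BLAST search
    annotations: dict mapping subject_id to its set of localization sites
    """
    eligible = [(evalue, sid) for sid, evalue in hits if evalue <= max_evalue]
    if not eligible:
        return None  # no hit, so no annotation is inherited
    _, best_id = min(eligible)  # lowest E-value wins
    return annotations[best_id]


# Hypothetical example data.
annotations = {"P1": {"CYT"}, "P2": {"NUC"}}
print(blast_hit_predict([("P1", 2e-5), ("P2", 1e-30)], annotations))  # {'NUC'}
print(blast_hit_predict([("P1", 0.5)], annotations))                  # None
```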

Evaluation

Dataset
- Taken from the ngLOC method
- 25,887 single-localized proteins
- 2,169 multi-localized proteins
- 1,923 different species
- 10 different subcellular locations: CYT (cytoplasm), CSK (cytoskeleton), END (endoplasmic reticulum), EXC (extracellular), GOL (Golgi apparatus), LYS (lysosome), MIT (mitochondria), NUC (nuclear), PLA (plasma membrane), POX (peroxisome)

Finding the Best Similarity Level and Window Size
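Leave-one-out cross validation is performed.

Overall accuracy (%) by similarity level and window size w
Similarity level     1     2     3     4     5     6     7     8
w=7               91.2  91.3  91.4  91.5  91.8  92.0  91.6     -
w=8: 91.7, 90.9 (only two values survive in the transcript; their similarity levels are not recoverable)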

Prediction Performance for Single-localized Proteins
*KnowPredsite: leave-one-out cross validation
#KnowPredsite: ten-fold cross validation

Overall accuracy (%), single-localized proteins
Method          Top 1  Top 2  Top 3  Top 4
*KnowPredsite    92.0   95.7   96.8   98.1
#KnowPredsite    91.7   95.4   96.6   97.9
ngLOC            88.8   92.2   94.5   96.3
Blast-hit        86.0      -      -      -

Prediction Performance for Multi-localized Proteins
Overall accuracy (%), multi-localized proteins

At least 1 correct
Method          Top 1  Top 2  Top 3  Top 4
*KnowPredsite    90.8   96.4   98.2   98.9
#KnowPredsite    90.1   96.1   98.1
ngLOC            81.9   92.0   97.4
Blast-hit        78.8      -

Both correct (values in slide order; the Top-k column positions are not recoverable)
*KnowPredsite    74.3   83.3   88.7
#KnowPredsite    72.1   82.2   87.5
ngLOC            59.7   73.8   83.2
Blast-hit        45.7
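As a rough illustration of how the Top-k figures for multi-localized proteins can be read, the sketch below counts a protein as correct when at least one (or, with `require_all=True`, every) true site appears among the k highest-ranked predicted sites. This reading of "at least 1 correct" and "both correct" is an assumption, and the toy predictions are made up; they are not taken from the dataset.

```python
def top_k_accuracy(predictions, k, require_all=False):
    """Percentage of proteins judged correct within the top k ranked sites.

    predictions: list of (ranked_sites, true_sites) pairs, where ranked_sites
                 is ordered from most to least confident.
    """
    correct = 0
    for ranked_sites, true_sites in predictions:
        hits = len(set(ranked_sites[:k]) & set(true_sites))
        needed = len(set(true_sites)) if require_all else 1
        if hits >= needed:
            correct += 1
    return 100.0 * correct / len(predictions)


# Made-up toy data: two proteins, both annotated CYT and NUC.
toy = [
    (["CYT", "NUC", "MIT"], {"CYT", "NUC"}),
    (["PLA", "CYT", "NUC"], {"CYT", "NUC"}),
]
print(top_k_accuracy(toy, k=1))                    # 50.0
print(top_k_accuracy(toy, k=2))                    # 100.0
print(top_k_accuracy(toy, k=2, require_all=True))  # 50.0
```

For single-localized proteins the two modes coincide, since there is only one true site to find.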

Site-Specific Prediction Performance
Site i  Occurrence in the dataset (%)  Precision (%)  Accuracy_i (%)  MCC_i
CYT     11.1                           75.7           84.4            0.774
CSK      1.0                           81.1           52.0            0.645
END      3.6                           92.9           84.1            0.88
EXC     29.1                           98.5           93.9            0.946
GOL      1.1                           79.1           70.9            0.746
LYS      0.6                           87.2           81.9            0.844
MIT      9.4                           96.7           86.9            0.907
NUC     18.0                           87.3           93.8            0.884
PLA     25.2                           94.4           96.4            0.938
POX      0.8    85.1 (precision or accuracy; the other value is missing)    0.861
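The three per-site measures can be computed from a confusion matrix as sketched below. The slide does not define them, so this is an assumption: precision is taken as TP/(TP+FP), Accuracy_i is read as the fraction of proteins at site i that are predicted correctly, TP/(TP+FN), and MCC_i is the standard Matthews correlation coefficient. The counts in the example are invented.

```python
import math


def site_metrics(tp, fp, fn, tn):
    """Return (precision %, per-site accuracy %, MCC) for one localization site."""
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    accuracy_i = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, accuracy_i, mcc


# Invented counts for a single site out of 1,000 proteins.
precision, accuracy_i, mcc = site_metrics(tp=80, fp=20, fn=20, tn=880)
print(round(precision, 1), round(accuracy_i, 1), round(mcc, 3))  # 80.0 80.0 0.778
```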

Multi-localized Confidence Score (MLCS)
We follow King and Guda's method. For a protein t:
MLCS(t) = (CS_1 + CS_2) - (CS_1^2 - CS_2^2) / 100
- CS_1: highest confidence score among all the sites
- CS_2: second-highest confidence score among all the sites
- Best MLCS threshold for KnowPredsite = 20
- 86.3% of multi-localized proteins have MLCS > 20
- 82.8% of single-localized proteins have MLCS < 20
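The score can be checked numerically against the two case studies in this deck. The sketch below applies the formula to the per-site confidence scores that survive in the transcript. The cap at 100 is an assumption: the MCA3_MOUSE case study reports an MLCS of 100, whereas the uncapped formula gives a larger number for its scores. The function names are hypothetical.

```python
def mlcs(scores, cap=100.0):
    """Multi-localized confidence score from per-site confidence scores."""
    ranked = sorted(scores, reverse=True)
    cs1, cs2 = ranked[0], ranked[1]  # two highest confidence scores
    raw = (cs1 + cs2) - (cs1 ** 2 - cs2 ** 2) / 100.0
    return min(raw, cap)  # capping at 100 is an assumption, see above


def looks_multi_localized(scores, threshold=20.0):
    """Apply the MLCS threshold of 20 reported for KnowPredsite."""
    return mlcs(scores) > threshold


ef1a2_rabit = [95.45, 1.45, 0.04, 2.97, 0.05]
mca3_mouse = [95.46, 0.3, 0.27, 0.36, 0.2, 0.01, 1.13, 93.59, 1.82, 0.22]
print(round(mlcs(ef1a2_rabit), 2))        # 7.4
print(mlcs(mca3_mouse))                   # 100.0
print(looks_multi_localized(mca3_mouse))  # True
```

When one site dominates, the squared term cancels almost all of the sum and the score stays low; when the top two scores are close, the squared term shrinks and the score rises.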

Case Study: EF1A2_RABIT
Swiss-Prot: NUC
Gene Ontology: CYT & NUC

Query columns: CYT CSK END EXC GOL LYS MIT NUC* PLA POX MLCS
EF1A2_RABIT 95.45 1.45 0.04 2.97 0.05 7.40

Template NUC SI
EF1A2_RAT 2.94 99.78
EF1A_CHICK 2.77 92.22
EF1A1_HUMAN 2.75
EF1A1_RAT
EF1A0_XENLA 2.69 90.06
EF1A_BRARE 2.64
EF1A2_XENLA 88.79
EF1A3_XENLA 2.60 88.55

Case Study: MCA3_MOUSE

Query        CYT*   CSK   END   EXC   GOL   LYS   MIT   NUC*   PLA   POX   MLCS
MCA3_MOUSE   95.46  0.3   0.27  0.36  0.2   0.01  1.13  93.59  1.82  0.22  100

Template CYT NUC SI
MCA3_HUMAN 89.16 88.51
EF1G1_YEAST 2.74 2.47 8.67
EF1G2_YEAST 0.49 8.50
GSTA_PLEPL 0.35 15.86
SYEC_YEAST 0.16 3.86
CCNA1_MOUSE 0.15 7.36
NU155_RAT 0.14 3.17
GCYB2_HUMAN 4.86

Conclusions
We proposed KnowPredsite, a sequence-based prediction method that:
- Predicts single-localized proteins with 92% accuracy
- Predicts multi-localized proteins with 74.3% accuracy
- Is suitable for proteome-wide prediction (25,887 single-localized proteins; 2,169 multi-localized proteins)
- Provides interpretable prediction results through template proteins
KnowPredsite can be easily improved by incrementally expanding the SPKB.

Thank You.

Multi-localized Confidence Score
TP: a multi-localized protein with MLCS > 20
TN: a single-localized protein with MLCS < 20

Site-Specific Comparison
ngLOC performs better in CYT & LYS
