Hsin-Nan Lin, Ching-Tai Chen, Ting-Yi Sung,


Protein Subcellular Localization Prediction of Eukaryotes using a Knowledge-based Approach Hsin-Nan Lin, Ching-Tai Chen, Ting-Yi Sung, Shinn-Ying Ho, Wen-Lian Hsu Bioinformatics Program, TIGP (Taiwan International Graduate Program), Academia Sinica, Taiwan

Outline
Introduction
Methods
  Construction of the knowledge base SPKB
  KnowPredsite: a localization prediction method using SPKB
Dataset
Prediction Results
Conclusions

Protein Subcellular Localization (PSL)
Task: given a protein, determine its subcellular compartment (mitochondria, cytoplasm, nucleus, etc.).
Why it matters: it helps elucidate protein functions and identify potential diagnostic and drug targets.
Wet-lab experiments are time consuming and labor intensive, so computational prediction is needed.

Existing PSL Predictors
They use various features:
  features extracted from literature or public databases
  phylogenetic profiling
  compartment-specific features
Main problems of many predictors:
  they predict only a limited number of locations
  they are limited to subsets of proteomes, e.g., those containing signal peptide sequences
  they are designed for specific species
  they are designed for single-localized protein sequences, although up to 35% of proteins move between different cellular compartments

Outline
Introduction
Methods
  Concept behind the method
  Construction of the knowledge base SPKB
  KnowPredsite: a localization prediction method using SPKB
Dataset
Prediction Results
Conclusions

Basic Concept Behind the Method: Transitivity Relations

A New Protein Feature: Similar Peptide
High Scoring Pair (HSP): a significant local pairwise alignment of two proteins found by PSI-BLAST.
Interchangeable amino acid pair: a pair of aligned amino acids with a positive score in the BLOSUM62 matrix.
Similarity level: the number of interchangeable amino acid pairs required within a sliding window.
Similar peptide: an n-gram peptide fragment from a similar protein; it represents possible sequence variation.

An Example of a High Scoring Pair (HSP)
MYKKILY
MY KIL
MYSKILL
Window size = 7, pairwise similarity = 5
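The pairwise similarity in this example can be read straight off the alignment midline. The sketch below assumes a BLAST-style midline in which a letter marks an identity, `+` marks a positive-scoring substitution, and a space marks any other pair. Identities count because every amino acid scores positively against itself in BLOSUM62. The function name is illustrative and not taken from the presentation.

```python
def pairwise_similarity(midline: str) -> int:
    """Count aligned positions that hold an interchangeable amino acid pair.

    Assumption: the midline follows the BLAST convention (letter = identity,
    '+' = positive substitution score, space = zero or negative score).
    """
    return sum(1 for ch in midline if ch != " ")


# Midline of MYKKILY aligned against MYSKILL, padded to the window size of 7.
midline = "MY KIL "
print(pairwise_similarity(midline))  # 5
```

Counting from the midline avoids embedding the full BLOSUM62 matrix; a version that looks up each pair in the matrix should give the same count.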

Construction of the Similar Peptide Knowledge Base (SPKB)
MENIKKE   a protein with an annotated localization site (CYT)
ME +KK    HSP from PSI-BLAST
MEAVKKS   a protein from the NCBI NR database
Similarity level = 4, pairwise similarity = 5
If pairwise similarity ≥ similarity level, MEAVKKS is a similar peptide and inherits CYT.
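The construction step can be sketched as a sliding window over one HSP. This is a minimal illustration, assuming an ungapped HSP given as three equal-length strings (query, BLAST-style midline, subject) and a knowledge base held as a plain dictionary from peptide to inherited sites. The function names and the dictionary layout are assumptions made for the example, not the authors' implementation.

```python
from collections import defaultdict


def similar_peptides(query, midline, subject, window=7, level=4):
    """Yield subject n-grams whose window holds at least `level` positive pairs."""
    if not (len(query) == len(midline) == len(subject)):
        raise ValueError("aligned strings must have equal length")
    for start in range(len(query) - window + 1):
        end = start + window
        # Letters and '+' in the midline mark interchangeable pairs.
        positives = sum(1 for ch in midline[start:end] if ch != " ")
        if positives >= level:
            yield subject[start:end]


def add_hsp(spkb, sites, query, midline, subject, window=7, level=4):
    """Store each similar peptide together with the annotated protein's sites."""
    for peptide in similar_peptides(query, midline, subject, window, level):
        spkb[peptide].update(sites)


spkb = defaultdict(set)  # hypothetical layout: peptide -> inherited sites
add_hsp(spkb, {"CYT"}, "MENIKKE", "ME +KK ", "MEAVKKS")
print(dict(spkb))  # {'MEAVKKS': {'CYT'}}
```

With similarity level 4 the window passes (5 positive pairs), so MEAVKKS is stored and inherits CYT, as on the slide.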

A Similar Peptide Entry for Protein Subcellular Localization
Peptide: MYSKILL
We store the similar peptides in a knowledge base called SPKB, which holds millions of similar peptide records. For example, this is a similar peptide record whose representative sequence is MYSKILL. It has two protein members, 1ark and 1ata, and it inherits two different secondary structures. The similar peptide record is the basic unit we use for predicting protein secondary structures.

KnowPredsite: a localization prediction method using SPKB

Blast-hit Method
Serves as a baseline approach.
Use BLAST to find the most similar sequence and inherit its localization annotation.
E-value cutoff = $10^{-3}$
If there is no hit, no annotation is inherited.
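The baseline can be sketched in a few lines. The example assumes the BLAST results are already parsed into `(subject_id, e_value)` pairs and that "most similar" means the lowest E-value among hits at or below the cutoff. The function name, the input format, and the protein identifiers are hypothetical.

```python
from typing import Optional


def blast_hit_predict(hits, annotations, max_evalue=1e-3) -> Optional[set]:
    """Inherit the sites of the most significant hit, or None if no hit passes.

    hits: iterable of (subject_id, e_value) pairs (assumed pre-parsed).
    annotations: dict mapping subject_id -> set of localization sites.
    """
    passing = [(e, sid) for sid, e in hits if e <= max_evalue and sid in annotations]
    if not passing:
        return None  # no hit, so no annotation is inherited
    _, best_id = min(passing)  # lowest E-value wins
    return annotations[best_id]


annotations = {"P1": {"NUC"}, "P2": {"CYT"}}  # hypothetical annotated proteins
print(blast_hit_predict([("P1", 1e-20), ("P2", 1e-5)], annotations))  # {'NUC'}
print(blast_hit_predict([("P1", 0.5)], annotations))  # None
```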

Evaluation


Dataset
From the ngLOC method
25,887 single-localized proteins
2,169 multi-localized proteins
1,923 different species
10 different subcellular locations: CYT (cytoplasm), CSK (cytoskeleton), END (endoplasmic reticulum), EXC (extracellular), GOL (Golgi apparatus), LYS (lysosome), MIT (mitochondria), NUC (nuclear), PLA (plasma membrane), POX (peroxisome)


Finding the Best Similarity Level and Window Size
Leave-one-out cross validation is performed.
Overall accuracy (%) by similarity level:
Similarity level   1      2      3      4      5      6      7      8
w=7                91.2   91.3   91.4   91.5   91.8   92.0   91.6   －
w=8: 91.7 90.9

Prediction Performance for Single-localized Proteins
*KnowPredsite: leave-one-out cross validation
#KnowPredsite: ten-fold cross validation
Overall accuracy (%)
Method          Top 1   Top 2   Top 3   Top 4
*KnowPredsite   92.0    95.7    96.8    98.1
#KnowPredsite   91.7    95.4    96.6    97.9
ngLOC           88.8    92.2    94.5    96.3
Blast-hit       86.0    －

Prediction Performance for Multi-localized Proteins
Overall accuracy (%); columns: Top 1, Top 2, Top 3, Top 4
Multi-localized (at least 1 correct):
*KnowPredsite 90.8 96.4 98.2 98.9
#KnowPredsite 90.1 96.1 98.1
ngLOC 81.9 92.0 97.4
Blast-hit 78.8 －
Multi-localized (both correct):
74.3 83.3 88.7 72.1 82.2 87.5 59.7 73.8 83.2 45.7
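The two criteria on this slide ("at least 1 correct" and "both correct") are both top-k accuracies over ranked site predictions. The sketch below assumes each protein comes with a list of sites ranked by confidence and a set of annotated sites. The function name, the `require_all` flag, and the toy data are illustrative assumptions, not the evaluation code behind the reported numbers.

```python
def top_k_accuracy(ranked_predictions, true_sites, k, require_all=False):
    """Percentage of proteins whose true sites appear among the top k predictions.

    ranked_predictions: list of site lists, best-scoring site first.
    true_sites: list of sets of annotated sites, one set per protein.
    require_all=False: a hit if at least one true site is in the top k.
    require_all=True: a hit only if every true site is in the top k.
    """
    hits = 0
    for ranked, truth in zip(ranked_predictions, true_sites):
        top = set(ranked[:k])
        if require_all:
            hits += truth <= top  # subset test: all true sites recovered
        else:
            hits += bool(truth & top)  # any overlap counts
    return 100.0 * hits / len(true_sites)


# Toy data: two proteins, both annotated CYT and NUC.
ranked = [["CYT", "NUC", "MIT"], ["NUC", "PLA", "CYT"]]
truth = [{"CYT", "NUC"}, {"CYT", "NUC"}]
print(top_k_accuracy(ranked, truth, k=2))                    # 100.0
print(top_k_accuracy(ranked, truth, k=2, require_all=True))  # 50.0
```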

Site-Specific Prediction Performance
Columns: site i, occurrence in the dataset (%), precision (%), accuracy_i (%), MCC_i
CYT 11.1 75.7 84.4 0.774
CSK 1.0 81.1 52.0 0.645
END 3.6 92.9 84.1 0.88
EXC 29.1 98.5 93.9 0.946
GOL 1.1 79.1 70.9 0.746
LYS 0.6 87.2 81.9 0.844
MIT 9.4 96.7 86.9 0.907
NUC 18.0 87.3 93.8 0.884
PLA 25.2 94.4 96.4 0.938
POX 0.8 85.1 0.861
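The slide does not define its per-site measures, so the sketch below uses the standard formulas and should be read as an assumption: precision as TP/(TP+FP), accuracy_i as the fraction of proteins at site i that are predicted correctly, TP/(TP+FN), and MCC_i as the Matthews correlation coefficient computed from one-versus-rest counts. The counts in the example are invented for illustration.

```python
from math import sqrt


def site_metrics(tp, fp, fn, tn):
    """Return (precision %, accuracy_i %, MCC) from one-vs-rest counts for a site."""
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    # Assumed definition: share of proteins at this site that are recovered.
    accuracy_i = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, accuracy_i, mcc


# Hypothetical counts for one site in a 1,000-protein test set.
precision, accuracy_i, mcc = site_metrics(tp=80, fp=20, fn=20, tn=880)
print(precision, accuracy_i, round(mcc, 3))
```

MCC is the most informative of the three columns for rare sites such as CSK or POX, because it accounts for all four cells of the confusion table.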

Multi-localized Confidence Score (MLCS)
We follow King and Guda's method. For a protein t:
$\mathrm{MLCS}(t) = (CS_1 + CS_2) - (CS_1^2 - CS_2^2)/100$
$CS_1$: highest confidence score among all the sites
$CS_2$: second-highest confidence score among all the sites
Best MLCS threshold for KnowPredsite = 20
86.3% of multi-localized proteins have MLCS > 20
82.8% of single-localized proteins have MLCS < 20
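The score is easy to check numerically. The sketch below implements the formula as stated on this slide and applies the threshold of 20; the function names are illustrative. The input scores are the ones that survive in this transcript for EF1A2_RABIT in the case study, for which the slide reports MLCS = 7.40.

```python
def mlcs(confidence_scores):
    """MLCS(t) = (CS1 + CS2) - (CS1**2 - CS2**2) / 100, using the two top scores."""
    cs1, cs2 = sorted(confidence_scores, reverse=True)[:2]
    return (cs1 + cs2) - (cs1 ** 2 - cs2 ** 2) / 100


def is_multi_localized(confidence_scores, threshold=20):
    """Flag a protein as multi-localized when its MLCS exceeds the threshold."""
    return mlcs(confidence_scores) > threshold


scores = [95.45, 2.97, 1.45, 0.05, 0.04]  # EF1A2_RABIT scores from the case study
print(round(mlcs(scores), 2))      # 7.4
print(is_multi_localized(scores))  # False
```

For MCA3_MOUSE's two top scores (95.46 and 93.59) the same expression gives about 185.5, while the slide reports 100, so the reported value is presumably capped; the sketch does not apply a cap.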

Case Study: EF1A2_RABIT
Swiss-Prot: NUC; Gene Ontology: CYT & NUC
Query CYT CSK END EXC GOL LYS MIT NUC* PLA POX MLCS
EF1A2_RABIT 95.45 1.45 0.04 2.97 0.05 7.40
Template NUC SI
EF1A2_RAT 2.94 99.78
EF1A_CHICK 2.77 92.22
EF1A1_HUMAN 2.75
EF1A1_RAT
EF1A0_XENLA 2.69 90.06
EF1A_BRARE 2.64
EF1A2_XENLA 88.79
EF1A3_XENLA 2.60 88.55

Case Study: MCA3_MOUSE
Query CYT* CSK END EXC GOL LYS MIT NUC* PLA POX MLCS
MCA3_MOUSE 95.46 0.3 0.27 0.36 0.2 0.01 1.13 93.59 1.82 0.22 100
Template CYT NUC SI
MCA3_HUMAN 89.16 88.51
EF1G1_YEAST 2.74 2.47 8.67
EF1G2_YEAST 0.49 8.50
GSTA_PLEPL 0.35 15.86
SYEC_YEAST 0.16 3.86
CCNA1_MOUSE 0.15 7.36
NU155_RAT 0.14 3.17
GCYB2_HUMAN 4.86


Conclusions
We proposed KnowPredsite, a sequence-based prediction method that:
  predicts single-localized proteins with 92% accuracy
  predicts multi-localized proteins with 74.3% accuracy
  is suitable for proteome-wide prediction (25,887 single-localized proteins, 2,169 multi-localized proteins)
  provides interpretable prediction results through template proteins
KnowPredsite can be easily improved by incrementally expanding the SPKB.

Thank You.

Multi-localized Confidence Score
TP: a multi-localized protein with MLCS > 20
TN: a single-localized protein with MLCS < 20

Site-Specific Comparison: ngLOC performs better in CYT & LYS