Semantic Similarity over Gene Ontology for Multi-label Protein Subcellular Localization Shibiao WAN and Man-Wai MAK The Hong Kong Polytechnic University.


1 Semantic Similarity over Gene Ontology for Multi-label Protein Subcellular Localization. Shibiao WAN and Man-Wai MAK (The Hong Kong Polytechnic University); Sun-Yuan KUNG (Princeton University)

2 Outline
1. Introduction and Motivation
2. Retrieval of GO Terms
3. Semantic Similarity Measures
4. Multi-label Multi-class Classification
5. Results
6. Conclusions

3 Proteins and Their Subcellular Locations

4 Subcellular Localization Prediction The subcellular locations of proteins help biologists elucidate protein functions. Identifying subcellular locations by purely experimental means is time-consuming and costly, so computational prediction methods are needed. Previous research has found that gene ontology (GO) based methods outperform methods based on other protein features (e.g., amino acid composition).

5 Multi-label Problem Some proteins can simultaneously reside at, or move between, two or more subcellular locations. Multi-label (multi-location) proteins play important roles in metabolic processes that take place in multiple subcellular locations. State-of-the-art multi-label predictors, such as Plant-mPLoc, iLoc-Plant, and mGOASVM, use frequency counts of GO terms as features. In this work, we propose using the semantic similarity of GO terms as features for multi-label subcellular localization prediction.

6 Method’s Flowchart [Figure: query sequence S → BLAST against the Swiss-Prot database → homolog AC → GO extraction by searching the GOA database → semantic similarity (SS) measure against the GO terms of the training proteins → semantic similarity vector → multi-label SVM (M SVMs) → subcellular location(s).]

7 Gene Ontology
• Gene ontology is a set of standardized vocabularies annotating the functions of genes and gene products.
• GO terms are identified by IDs, e.g., GO:0000187.
• A protein sequence may correspond to 0, 1 or many GO terms.

8 Gene Ontology: Example Search for GO:0000187 at http://www.geneontology.org/

9 GOA Database The Gene Ontology Annotation (GOA) database: – Provides structured annotations for proteins in the UniProt Knowledgebase (UniProtKB) and other protein databases using standardized GO vocabularies. – Includes a series of cross-references to other databases. Given an accession number (AC), the GOA database returns the set of GO terms associated with it.

10 GOA Database: Accession Number (AC) → GO term(s) Search for A0M8T9 at http://www.ebi.ac.uk/GOA/. One AC maps to many GO terms!
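The accession-to-GO lookup can be sketched in Python. This is a minimal sketch, assuming the annotations are available as tab-separated lines in a GAF-like layout (accession in column 2, GO ID in column 5). The accessions, GO IDs, and the `load_goa` helper are made-up placeholders for illustration, not real GOA records or an official API.

```python
from collections import defaultdict

# Toy annotation rows (made-up data). Assumed layout, tab-separated:
# column 2 = accession number (AC), column 5 = GO term ID.
TOY_GAF = "\n".join([
    "!toy header line, skipped by the parser",
    "UniProtKB\tTOY001\tGENE1\t\tGO:9000001",
    "UniProtKB\tTOY001\tGENE1\t\tGO:9000002",
    "UniProtKB\tTOY002\tGENE2\t\tGO:9000002",
])

def load_goa(text):
    """Build a mapping: accession number -> set of GO term IDs."""
    ac_to_go = defaultdict(set)
    for line in text.splitlines():
        if not line or line.startswith("!"):  # skip blank and comment lines
            continue
        cols = line.split("\t")
        ac_to_go[cols[1]].add(cols[4])
    return dict(ac_to_go)

goa = load_goa(TOY_GAF)
print(sorted(goa["TOY001"]))      # one AC can map to several GO terms
print(goa.get("TOY999", set()))   # unknown AC: empty set
```

Storing the terms as a set per accession reflects the one-to-many mapping and makes later set operations on GO terms straightforward.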

11 Finding GO Terms without an Accession Number [Figure: query sequence S → BLAST against the Swiss-Prot database → homolog AC → GO extraction by searching the GOA database → GO terms of Q_i.]

12 Semantic Similarity Measure [Figure: GO term x and GO term y → find common ancestors A(x,y) via an SQL query for ancestors against the GO database → compute semantic similarity sim(x,y).]

13 Finding Common Ancestors, A(x,y)

14 Finding Common Ancestors, A(x,y) [Figure: GO:0000187 and its ancestors in the GO graph, linked by is_a and part_of relationships.]
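Finding A(x,y) amounts to collecting the ancestors of each term and intersecting the two sets. The following is a minimal sketch on a made-up toy hierarchy. The term IDs are placeholders, and treating a term as its own ancestor is a convention assumed here, not something the slides state.

```python
# Toy is_a hierarchy: child term -> list of direct parents (made-up IDs).
PARENTS = {
    "T:root": [],
    "T:a": ["T:root"],
    "T:b": ["T:root"],
    "T:c": ["T:a"],
    "T:d": ["T:a", "T:b"],
}

def ancestors(term, parents=PARENTS):
    """Return the term itself plus every term reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def common_ancestors(x, y, parents=PARENTS):
    """A(x, y): the terms that are ancestors of both x and y."""
    return ancestors(x, parents) & ancestors(y, parents)

print(sorted(common_ancestors("T:c", "T:d")))
```

Because a GO term can have several parents, the hierarchy is a directed acyclic graph rather than a tree, which is why the traversal tracks visited terms.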

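15 Semantic Similarity Measure We use Lin’s measure to estimate the semantic similarity between two GO terms x and y: $\mathrm{sim}(x, y) = \max_{c \in A(x,y)} \frac{2 \ln p(c)}{\ln p(x) + \ln p(y)}$, where A(x,y) is the set of common ancestors of x and y, and p(c) is the probability of occurrence of term c among the annotated gene products.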
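Lin's measure scores a pair of terms by their most informative common ancestor c, taking the largest value of 2 ln p(c) / (ln p(x) + ln p(y)) over all common ancestors. A minimal Python sketch follows. The probabilities and term IDs are made up, the common-ancestor set is passed in by the caller, and the return values for the degenerate cases are conventions assumed for this sketch.

```python
import math

# Made-up occurrence probabilities p(c) for a toy hierarchy.
P = {"T:root": 1.0, "T:a": 0.5, "T:c": 0.1, "T:d": 0.2}

def lin_similarity(x, y, common, p):
    """Lin's similarity: max over common ancestors c of
    2 ln p(c) / (ln p(x) + ln p(y))."""
    if not common:
        return 0.0  # assumed convention: no shared ancestor, no similarity
    denom = math.log(p[x]) + math.log(p[y])
    if denom == 0.0:
        return 0.0  # 0/0 when p(x) = p(y) = 1; assumed convention
    return max(2.0 * math.log(p[c]) / denom for c in common)

sim_cd = lin_similarity("T:c", "T:d", {"T:a", "T:root"}, P)
sim_cc = lin_similarity("T:c", "T:c", {"T:c", "T:a", "T:root"}, P)
print(sim_cd, sim_cc)
```

The root term has p = 1 and therefore contributes zero, so terms that share only the root are scored as unrelated, while a term compared with itself scores 1.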

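16 Semantic Similarity between 2 Proteins The semantic similarity between two proteins is computed from their GO term sets (G_i, G_j). The semantic similarity vector of a protein has one element per training protein, so its dimension equals the number of training proteins. [Equations not preserved in this transcript.]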
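The slide's equations are missing from the transcript, so the following Python sketch shows only one plausible way to lift a term-level similarity to the protein level. It assumes a best-match aggregation, S(G_i, G_j) = sum over x in G_i of max over y in G_j of sim(x, y), normalised as (S(G_i, G_j) + S(G_j, G_i)) / (S(G_i, G_i) + S(G_j, G_j)). This form, the function names, and the toy term similarities are all assumptions for illustration.

```python
def set_score(gi, gj, sim):
    """Sum, over the terms in gi, of the best match found in gj."""
    return sum(max(sim(x, y) for y in gj) for x in gi)

def protein_similarity(gi, gj, sim):
    """Symmetric, normalised similarity between two GO term sets
    (assumed form, see the lead-in)."""
    if not gi or not gj:
        return 0.0
    denom = set_score(gi, gi, sim) + set_score(gj, gj, sim)
    if denom == 0.0:
        return 0.0
    return (set_score(gi, gj, sim) + set_score(gj, gi, sim)) / denom

def similarity_vector(query_terms, training_sets, sim):
    """One element per training protein."""
    return [protein_similarity(query_terms, g, sim) for g in training_sets]

# Made-up term-level similarities for a toy example.
TOY_SIM = {frozenset(["T:c", "T:d"]): 0.4}

def toy_sim(x, y):
    if x == y:
        return 1.0
    return TOY_SIM.get(frozenset([x, y]), 0.0)

training = [{"T:c"}, {"T:d"}, {"T:c", "T:d"}]
vec = similarity_vector({"T:c"}, training, toy_sim)
print(vec)
```

The resulting vector has as many elements as there are training proteins, which matches the dimension noted on the slide.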

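17 Multi-label SVM Scoring [Figure: the GO terms of the test protein Q_t are compared with the GO terms of the training proteins to form the input to the multi-label SVM; the scoring equations are not preserved in this transcript.]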
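Since the scoring equations are not in the transcript, the sketch below illustrates a decision rule often used with one SVM per location: predict every location whose score is positive and, if none is positive, fall back to the single highest-scoring location. Both the rule and the linear score function are assumptions here, not necessarily the authors' exact formulation.

```python
def linear_svm_score(weights, bias, features):
    """Score of one linear SVM: w . q + b (kernel SVMs would differ)."""
    return sum(w * f for w, f in zip(weights, features)) + bias

def predict_locations(scores):
    """scores: dict mapping location name -> real-valued SVM score.

    Returns all locations with a positive score; if there is none,
    returns the single best-scoring location."""
    positive = {loc for loc, s in scores.items() if s > 0}
    if positive:
        return positive
    return {max(scores, key=scores.get)}

# Made-up scores for three hypothetical locations.
multi = predict_locations({"loc_a": 0.7, "loc_b": -0.2, "loc_c": 0.1})
single = predict_locations({"loc_a": -0.9, "loc_b": -0.2, "loc_c": -0.5})
print(sorted(multi), sorted(single))
```

The fallback guarantees that every query protein receives at least one location, while the positive-score rule lets the number of predicted locations vary per protein.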

18 Benchmark Datasets: The Plant dataset

19 Performance Metrics Two metrics are used: overall locative accuracy and overall actual accuracy (formulas not preserved in this transcript). Actual accuracy is the more objective and stricter of the two.
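The two metrics can be sketched in Python under definitions commonly used in the multi-label localization literature, which are assumed here because the slide's formulas are missing. Locative accuracy is taken as the fraction of true (protein, location) pairs that are predicted correctly, and actual accuracy as the fraction of proteins whose predicted label set matches the true set exactly. The data below are made up.

```python
def locative_accuracy(predicted, true):
    """Correctly predicted locations divided by the total number of
    true (protein, location) pairs."""
    hits = sum(len(p & t) for p, t in zip(predicted, true))
    total = sum(len(t) for t in true)
    return hits / total

def actual_accuracy(predicted, true):
    """Fraction of proteins whose predicted set equals the true set."""
    exact = sum(1 for p, t in zip(predicted, true) if p == t)
    return exact / len(true)

# Made-up labels for three proteins.
true_sets = [{"A"}, {"A", "B"}, {"C"}]
pred_sets = [{"A"}, {"A"}, {"B", "C"}]

loc_acc = locative_accuracy(pred_sets, true_sets)
act_acc = actual_accuracy(pred_sets, true_sets)
print(loc_acc, act_acc)
```

In this toy case three of the four true locations are recovered, but only one protein is predicted exactly right, which illustrates why actual accuracy is the stricter metric: both under-prediction and over-prediction count as errors.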

20 Performance Comparison: The Plant dataset

21 Conclusions Our proposed predictor performs significantly better than Plant-mPLoc and iLoc-Plant, and also better than mGOASVM, in terms of locative and actual accuracies. The individual locative accuracies of the proposed predictor are significantly higher than those of the three predictors for all 12 locations. In terms of GO information extraction, Plant-mPLoc, iLoc-Plant and mGOASVM use the occurrences of GO terms as features, whereas the proposed predictor exploits the semantic relationships between GO terms, from which the semantic similarity between proteins can be obtained.

22 Web Servers

23 Thank you!

24 Multi-label SVM Classifier Transformed labels for the M-class problem: [equation not preserved in this transcript]
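A common way to transform multi-label targets for an M-class problem is to train one binary SVM per class, labelling a protein +1 for class m if m is among its locations and -1 otherwise. The sketch below assumes this transformation, since the slide's equation is not preserved. The location names are made up.

```python
def transform_labels(label_sets, locations):
    """Return one +1/-1 label vector per protein, with one entry per
    location in the order given by `locations`."""
    return [
        [1 if loc in labels else -1 for loc in locations]
        for labels in label_sets
    ]

LOCATIONS = ["loc_a", "loc_b", "loc_c"]             # M = 3 toy classes
proteins = [{"loc_a"}, {"loc_a", "loc_c"}, {"loc_b"}]

targets = transform_labels(proteins, LOCATIONS)
print(targets)
```

Column m of the resulting matrix is the training target for the m-th binary SVM, so a multi-location protein is a positive example for several classifiers at once.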

25 Retrieving GO Terms with/without AC [Flowchart with Y/N decision nodes; recoverable steps: AC known?; retrieve homologs by BLAST; retrieve a set of GO terms; using the homolog; using back-up methods; multi-label SVM classification.]
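One plausible reading of this flowchart is sketched below in Python. The branch order is an assumption because the transcript does not preserve the arrows, `blast_homologs` is a hypothetical stand-in for a real BLAST search that returns homolog accessions in ranked order, and all accessions and GO IDs are made up.

```python
def retrieve_go_terms(sequence, accession, goa, blast_homologs):
    """Return (go_terms, source) for a query protein.

    goa: dict mapping accession -> set of GO terms.
    blast_homologs: callable(sequence) -> ranked list of homolog accessions
                    (hypothetical stand-in for a BLAST search).
    """
    # 1. If the AC is known, try its own annotations first.
    if accession is not None:
        terms = goa.get(accession, set())
        if terms:
            return terms, "own accession"
    # 2. Otherwise use the best-ranked homolog that has GO terms.
    for homolog_ac in blast_homologs(sequence):
        terms = goa.get(homolog_ac, set())
        if terms:
            return terms, "homolog " + homolog_ac
    # 3. Nothing found: the caller must use a back-up method.
    return set(), "back-up method needed"

# Toy data and a stub in place of BLAST (all made up).
toy_goa = {"TOY001": {"GO:9000001"}, "TOY003": {"GO:9000002"}}

def stub_blast(sequence):
    return ["TOY002", "TOY003"]

print(retrieve_go_terms("MKV", "TOY001", toy_goa, stub_blast))
print(retrieve_go_terms("MKV", None, toy_goa, stub_blast))
print(retrieve_go_terms("MKV", None, toy_goa, lambda s: []))
```

Passing the search function as an argument keeps the control flow testable without running BLAST.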

26 Finding Common Ancestors The relationships between GO terms in the GO hierarchy can be obtained from the SQL database through the link: http://archive.geneontology.org/latest-termdb/go_daily-termdb-tables.tar.gz. We only considered the ‘is-a’ relationship.
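An ancestor query over 'is-a' links can be sketched with an in-memory SQLite database and a recursive query. The single `is_a(child, parent)` table is a simplified, hypothetical schema for illustration only and is not the layout of the real termdb tables. The term IDs are made up, and in this sketch a term is not counted among its own ancestors.

```python
import sqlite3

# Simplified toy schema (assumption): one row per direct is-a link.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE is_a (child TEXT, parent TEXT)")
conn.executemany(
    "INSERT INTO is_a VALUES (?, ?)",
    [("T:a", "T:root"), ("T:b", "T:root"),
     ("T:c", "T:a"), ("T:d", "T:a"), ("T:d", "T:b")],
)

# Recursive query: start from the direct parents, then keep following links.
ANCESTOR_SQL = """
WITH RECURSIVE anc(term) AS (
    SELECT parent FROM is_a WHERE child = ?
    UNION
    SELECT is_a.parent FROM is_a JOIN anc ON is_a.child = anc.term
)
SELECT term FROM anc
"""

def sql_ancestors(connection, term):
    """All terms reachable from `term` by following is-a links upward."""
    return {row[0] for row in connection.execute(ANCESTOR_SQL, (term,))}

common = sql_ancestors(conn, "T:c") & sql_ancestors(conn, "T:d")
print(sorted(common))
```

Using `UNION` rather than `UNION ALL` removes duplicates, which matters when a term reaches the same ancestor through more than one parent.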

