Semantic Similarity over Gene Ontology for Multi-label Protein Subcellular Localization Shibiao WAN and Man-Wai MAK The Hong Kong Polytechnic University.

Presentation transcript:

Semantic Similarity over Gene Ontology for Multi-label Protein Subcellular Localization Shibiao WAN and Man-Wai MAK The Hong Kong Polytechnic University Sun-Yuan KUNG Princeton University

Outline
1. Introduction and Motivation
2. Retrieval of GO Terms
3. Semantic Similarity Measures
4. Multi-label Multi-class Classification
5. Results
6. Conclusions

Proteins and Their Subcellular Locations

Subcellular Localization Prediction
The subcellular locations of proteins help biologists elucidate the functions of proteins.
Identifying subcellular locations by purely experimental means is time-consuming and costly.
Computational methods are therefore needed for subcellular localization prediction.
Previous research has found that Gene Ontology (GO) based methods outperform methods based on other protein features (e.g., amino acid (AA) composition).

Multi-label Problem
Some proteins can simultaneously reside at, or move between, two or more subcellular locations.
Multi-label (multi-location) proteins play important roles in metabolic processes that take place in multiple subcellular locations.
State-of-the-art multi-label predictors, such as Plant-mPLoc, iLoc-Plant, and mGOASVM, use frequency counts of GO terms as features.
In this work, we propose using the semantic similarity of GO terms as features for multi-label subcellular localization prediction.

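Method’s Flowchart
[Figure: flowchart. The query's accession number (AC), or the AC of a homolog found by BLAST against the Swiss-Prot database, is used for GO extraction by searching the GOA database. A semantic similarity (SS) measure compares the extracted GO terms with the GO terms of the training proteins to form a semantic similarity vector, which the multi-label SVM maps to the subcellular location(s).]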

Gene Ontology
• Gene Ontology (GO) is a set of standardized vocabularies for annotating the functions of genes and gene products.
• GO terms, e.g., GO:
• A protein sequence may correspond to 0, 1 or many GO terms.

Gene Ontology: Example
[Figure: screenshot of a search for a GO term; the term ID and the site searched are not recoverable.]

GOA Database
Gene Ontology Annotation (GOA) database.
– Provides structured annotations to proteins in the UniProt Knowledgebase (UniProtKB) and other protein databases using standardized GO vocabularies.
– Includes a series of cross-references to other databases.
Given an accession number (AC), the GOA database returns the set of GO terms associated with that accession number.
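The AC-to-GO lookup described on this slide can be sketched as a simple index. This is a minimal illustration, assuming the annotations are available locally as GAF-style tab-separated lines (accession in column 2, GO ID in column 5); the slides only describe the lookup itself, and the function name `build_ac_to_go`, the accessions and the GO IDs below are made up for the example.

```python
from collections import defaultdict


def build_ac_to_go(gaf_lines):
    """Map each accession number (AC) to the set of GO term IDs annotated to it."""
    ac_to_go = defaultdict(set)
    for line in gaf_lines:
        # Skip blank lines and GAF header/comment lines, which start with "!".
        if not line.strip() or line.startswith("!"):
            continue
        fields = line.rstrip("\n").split("\t")
        ac, go_id = fields[1], fields[4]  # GAF columns 2 and 5
        ac_to_go[ac].add(go_id)
    return ac_to_go


# Toy records: accessions and GO IDs are invented, not real annotations.
toy_gaf = [
    "!gaf-version: 2.2",
    "UniProtKB\tTOY001\tgeneA\t\tGO:9000001\tREF:1\tIEA",
    "UniProtKB\tTOY001\tgeneA\t\tGO:9000002\tREF:2\tIEA",
    "UniProtKB\tTOY002\tgeneB\t\tGO:9000001\tREF:3\tIEA",
]

ac_to_go = build_ac_to_go(toy_gaf)
print(sorted(ac_to_go["TOY001"]))
```

One accession maps to a set of GO terms, which matches the one-to-many relationship the deck points out.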

GOA Database
[Figure: screenshot of a search for accession number (AC) A0M8T9 and the GO term(s) it returns.]
One AC maps to many GO terms!

Finding GO Terms without an Accession Number
[Figure: the query sequence is submitted to BLAST against the Swiss-Prot database; the AC of the homolog is then used for GO extraction by searching the GOA database, which yields the GO terms of Q_i.]

Semantic Similarity Measure
[Figure: for GO terms x and y, an SQL query to the GO database retrieves their ancestors; the common ancestors A(x,y) are found and used to compute the semantic similarity sim(x,y).]

Finding Common Ancestors, A(x,y)

Finding Common Ancestors, A(x,y)
[Figure: fragment of the GO hierarchy around a GO term, with is_a and part_of edges.]
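Finding A(x,y) amounts to intersecting the ancestor sets of the two terms in the GO graph. The sketch below assumes the hierarchy has already been loaded into a child-to-parents dictionary containing only is_a edges; the term names `T1` to `T5` are placeholders rather than real GO IDs, and counting a term as its own ancestor is a choice made here, not something the slides state.

```python
def ancestors(term, parents):
    """Return the term itself plus every term reachable by following is_a edges upward."""
    seen = set()
    stack = [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, ()))
    return seen


def common_ancestors(x, y, parents):
    """A(x, y): terms that are ancestors of both x and y."""
    return ancestors(x, parents) & ancestors(y, parents)


# Toy hierarchy (placeholder names): T1 is the root, T4 has two parents.
toy_parents = {
    "T4": ["T2", "T3"],
    "T5": ["T3"],
    "T2": ["T1"],
    "T3": ["T1"],
    "T1": [],
}

print(sorted(common_ancestors("T4", "T5", toy_parents)))
```

Because GO is a directed acyclic graph rather than a tree, a term can have several parents, which is why the traversal keeps a `seen` set.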

Semantic Similarity Measure
We use Lin’s measure to estimate the semantic similarity between two GO terms (x and y):
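The formula on this slide is not preserved in the transcript. For reference, Lin's measure is usually written in the following form, where A(x,y) is the set of common ancestors of x and y and p(c) is the relative frequency with which term c (or any of its descendants) occurs in the annotation corpus. Whether the slide used exactly this notation is an assumption.

```latex
\mathrm{sim}(x, y) \;=\; \max_{c \in A(x, y)} \frac{2 \log p(c)}{\log p(x) + \log p(y)}
```

Since every log-probability is non-positive, the maximum is attained at the least frequent (most informative) common ancestor, and the value lies between 0 and 1.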

Semantic Similarity between 2 Proteins
Semantic similarity between 2 proteins with GO term sets (G_i, G_j):
Semantic similarity vector (one element per training protein):
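The formulas on this slide are not preserved in the transcript, so the sketch below uses one plausible construction and should be read as an assumption rather than the authors' exact definition: each GO term in one set is matched to its most similar term in the other set, the two directions are summed, and the result is normalised by the self-similarities. The names `set_score`, `protein_similarity` and `similarity_vector`, and the toy similarity values, are all illustrative.

```python
def set_score(gi, gj, sim):
    """For each term in gi, take its best match in gj and sum the scores."""
    return sum(max(sim(x, y) for y in gj) for x in gi)


def protein_similarity(gi, gj, sim):
    """Symmetric, normalised similarity between two non-empty GO term sets (assumed form)."""
    num = set_score(gi, gj, sim) + set_score(gj, gi, sim)
    den = set_score(gi, gi, sim) + set_score(gj, gj, sim)
    return num / den


def similarity_vector(query_terms, training_sets, sim):
    """One element per training protein: similarity of the query to that protein."""
    return [protein_similarity(query_terms, g, sim) for g in training_sets]


# Toy term-level similarities (made-up values standing in for sim(x, y)).
toy_table = {
    frozenset(["a", "b"]): 0.5,
    frozenset(["a", "c"]): 0.0,
    frozenset(["b", "c"]): 0.25,
}


def toy_sim(x, y):
    return 1.0 if x == y else toy_table[frozenset([x, y])]


training = [{"a"}, {"b", "c"}]
vec = similarity_vector({"a", "b"}, training, toy_sim)
print(vec)
```

The vector has as many elements as there are training proteins, so it can be fed directly to a classifier regardless of how many GO terms each protein carries.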

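Multi-label SVM Scoring
[Figure: the GO terms of the query protein Q_t are compared with the GO terms of the training proteins, and the result is scored by the multi-label SVM.]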

Benchmark Datasets
The Plant dataset

Performance Metrics
Overall locative accuracy:
Overall actual accuracy:
Actual accuracy is more objective and stricter!
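The formulas for the two metrics are not preserved in the transcript. The sketch below follows the definitions commonly used for multi-label localization: locative accuracy counts correctly predicted locations over all true locations, while actual accuracy counts proteins whose predicted set matches the true set exactly. Treat these definitions, and the toy labels, as assumptions.

```python
def locative_accuracy(predicted, actual):
    """Fraction of true locations (summed over proteins) that were predicted."""
    correct = sum(len(p & a) for p, a in zip(predicted, actual))
    total = sum(len(a) for a in actual)
    return correct / total


def actual_accuracy(predicted, actual):
    """Fraction of proteins whose predicted location set equals the true set exactly."""
    exact = sum(1 for p, a in zip(predicted, actual) if p == a)
    return exact / len(actual)


# Toy example: one exact hit, one under-prediction, one over-prediction.
actual = [{"chloroplast"}, {"nucleus", "cytoplasm"}, {"mitochondrion"}]
predicted = [{"chloroplast"}, {"nucleus"}, {"mitochondrion", "nucleus"}]

print(locative_accuracy(predicted, actual))
print(actual_accuracy(predicted, actual))
```

In the toy data the over-predicted third protein still earns full credit under locative accuracy but none under actual accuracy, which illustrates why the deck calls actual accuracy the stricter metric.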

Performance Comparison
The Plant dataset

Conclusions
Our proposed predictor performs significantly better than Plant-mPLoc and iLoc-Plant, and also better than mGOASVM, in terms of locative and actual accuracies.
The individual locative accuracies of our proposed predictor are significantly higher than those of the three predictors for all of the 12 locations.
In terms of GO information extraction, Plant-mPLoc, iLoc-Plant and mGOASVM use the occurrences of GO terms as features, whereas the proposed predictor exploits the semantic relationships between GO terms, from which the semantic similarity between proteins can be obtained.

Web Servers

Thank you!

Multi-label SVM Classifier
Transformed labels for M-class problem:
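The label formula on this slide is not preserved. A common way to handle an M-class multi-label problem with SVMs is to train one binary SVM per class on labels transformed to +1/-1, then predict every class whose score is positive. The sketch below shows only that label transformation and decision rule; both are assumptions about the intended scheme, and the function names are illustrative.

```python
def transform_labels(label_sets, num_classes):
    """Row i, column m is +1 if training protein i resides in location m, else -1."""
    return [
        [1 if m in labels else -1 for m in range(num_classes)]
        for labels in label_sets
    ]


def predict_locations(scores):
    """Assumed decision rule: every class with a positive SVM score.

    If no score is positive, fall back to the single highest-scoring class
    so that each protein receives at least one location.
    """
    positive = {m for m, s in enumerate(scores) if s > 0}
    if positive:
        return positive
    return {max(range(len(scores)), key=lambda m: scores[m])}


# Two training proteins, three locations (indices 0..2).
targets = transform_labels([{0, 2}, {1}], num_classes=3)
print(targets)
print(predict_locations([0.3, -1.2, 0.8]))
print(predict_locations([-0.5, -0.1, -2.0]))
```

Column m of `targets` is the training target for the m-th binary SVM, so a protein with several locations is a positive example for several classifiers at once.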

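Retrieving GO Terms with/without AC
[Figure: flowchart. If the AC is known, a set of GO terms is retrieved with it; if not, homologs are retrieved by BLAST and the homolog is used to retrieve the GO terms. When no GO terms can be obtained, back-up methods are used. The retrieved GO terms are passed to multi-label SVM classification.]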
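The retrieval flow on this slide can be sketched as a chain of fallbacks. The exact decision nodes are not recoverable from the transcript, so the order below (own AC first, then BLAST homologs in ranked order, then a back-up signal) is an assumption, and `goa_lookup` and `blast_homolog_acs` are hypothetical callables standing in for a GOA query and a BLAST search.

```python
def retrieve_go_terms(ac, sequence, goa_lookup, blast_homolog_acs):
    """Return (go_terms, source) for a query protein.

    goa_lookup(ac) -> set of GO terms (possibly empty); hypothetical stand-in.
    blast_homolog_acs(sequence) -> list of homolog ACs, best first; hypothetical stand-in.
    """
    if ac is not None:
        terms = goa_lookup(ac)
        if terms:
            return terms, "own AC"
    # AC unknown, or known but without annotations: try homologs in turn.
    for homolog_ac in blast_homolog_acs(sequence):
        terms = goa_lookup(homolog_ac)
        if terms:
            return terms, "homolog " + homolog_ac
    # Nothing found: the caller has to switch to a back-up method.
    return set(), "back-up method needed"


# Toy stand-ins (made-up accessions and GO IDs).
toy_goa = {"TOY001": {"GO:9000001"}, "TOY003": {"GO:9000002"}}


def toy_lookup(ac):
    return toy_goa.get(ac, set())


def toy_blast(sequence):
    return ["TOY002", "TOY003"]


print(retrieve_go_terms("TOY001", "MKV", toy_lookup, toy_blast))
print(retrieve_go_terms(None, "MKV", toy_lookup, toy_blast))
```

Passing the lookup and search as callables keeps the control flow testable without a database or a BLAST installation.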

Finding Common Ancestors
The relationships between GO terms in the GO hierarchy can be obtained from the SQL database through the link: termdb/go_daily-termdb-tables.tar.gz.
We only considered the ‘is-a’ relationship.