Clustering Related Terms with Definitions. Scott Piao, John McNaught and Sophia Ananiadou. LREC 2008, Marrakech.

Presentation transcript:

Slide 1: Clustering Related Terms with Definitions
Scott Piao, John McNaught and Sophia Ananiadou
National Centre for Text Mining, School of Computer Science, The University of Manchester
LREC 2008, Marrakech

Slide 2: Outline of talk
–Task: match related terms of an ontology.
–Approach: detect and cluster related terms based on their definitions.
–Implementation: definition matching and term clustering, user interface.
–Evaluation on GO terms.
–Conclusion.

Slide 3: Task: matching terms for ontology enrichment
–Matching similar or related terms/expressions is an important task in NLP and text mining applications.
–Ontology term matching is also closely related to ontology enrichment.
–In the EU BOOTStrep Project, some techniques have been tested for ontology entity matching and alignment.
–Our work focuses on testing and evaluating a text matching tool that identifies related ontology terms from their definitions.

Slide 4: Definitions of term definitions
Ontology terms, such as GO (Gene Ontology) terms, often contain detailed definitions:
–id: GO:
–name: SAGA complex
–def: "A large multiprotein complex that possesses histone acetyltransferase and is involved in regulation of transcription. The budding yeast complex includes Gcn5p, several proteins of the Spt and Ada families, and several TBP-associated proteins (TAFs); analogous complexes in other species have analogous compositions, and usually contain homologs of the yeast proteins."
–id: GO:
–name: Ada2/Gcn5/Ada3 transcription activator complex
–def: "A multiprotein complex that possesses histone acetyltransferase and is involved in regulation of transcription. The budding yeast complex includes Gcn5p, two proteins of the Ada family, and two TBP-associated proteins (TAFs); analogous complexes in other species have analogous compositions, and usually contain homologs of the yeast proteins."

Slide 5: Our approach to the issue
–The definitions can provide a fundamental information source for detecting relations between terms.
–Lexicon definitions have previously been used for analyzing relations between words/terms (Castillo et al., 2003).
–We assume that text matching tools can be used to detect related terms based on their definitions.

Slide 6: A tool for clustering related texts
–Align similar sentences between texts.
–Measure the distances between texts based on the aligned sentences.
–Cluster similar texts based on a distance matrix.
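The three steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' tool: the Dice word-overlap measure, the naive sentence splitter, the 0.5 alignment threshold and all function names are assumptions made for the example.

```python
# Hypothetical sketch: align sentences, score text pairs, build a distance matrix.
# The Dice overlap and the 0.5 threshold are stand-ins, not the metric from the paper.

def tokens(sentence):
    """Lower-cased word set with surrounding punctuation stripped."""
    return {w.strip('.,;:()"').lower() for w in sentence.split()} - {""}

def sentence_similarity(a, b):
    """Dice coefficient over the word sets of two sentences."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return 2 * len(ta & tb) / (len(ta) + len(tb))

def split_sentences(text):
    """Very naive sentence splitter (assumption: sentences end with '. ')."""
    return [s.strip() for s in text.split(". ") if s.strip()]

def directed_similarity(text_a, text_b, threshold=0.5):
    """Share of sentences in text_a aligned to some sentence in text_b."""
    sents_a, sents_b = split_sentences(text_a), split_sentences(text_b)
    if not sents_a or not sents_b:
        return 0.0
    total = 0.0
    for s in sents_a:
        best = max(sentence_similarity(s, t) for t in sents_b)
        if best >= threshold:  # count the sentence as aligned
            total += best
    return total / len(sents_a)

def text_similarity(text_a, text_b):
    """Symmetric score in [0, 1]: average of both alignment directions."""
    return (directed_similarity(text_a, text_b) +
            directed_similarity(text_b, text_a)) / 2

def distance_matrix(texts):
    """Pairwise distances, taken as 1 - similarity."""
    return [[1.0 - text_similarity(a, b) for b in texts] for a in texts]
```

The resulting matrix is what a clustering step would consume; any real system would replace the overlap measure with the weighted metric described in the paper.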

Slide 7: Metrics for pairwise text comparison
[Formula not preserved in the transcript.]
Weights: δ1 = 0.85, δ2 = 0.05, δ3 = 0.1; the score d satisfies 0 <= d <= 1.
For further details, see the paper.

Slide 8: An effective algorithm for text comparison
[Algorithm figure not preserved in the transcript.]
Cited from Clough et al. (2002).

Slide 9: Clustering texts
–Using the text comparison tool, produce a distance matrix.
–Matrix elements: e_ij = 1 – d_ij (0 <= e_ij <= 1).
–Error Sum of Squares (ESS) hierarchical clustering.
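As an illustration of this step, the sketch below converts pairwise scores d_ij into distances e_ij = 1 - d_ij and then merges clusters agglomeratively with the Lance-Williams update for Ward's (error sum of squares) criterion. The function names and the stopping rule (a fixed number of clusters) are assumptions for the example; the slides do not say how the authors' clusterer is implemented, and applying Ward's criterion to non-Euclidean text distances is itself an approximation.

```python
# Hypothetical sketch of ESS (Ward) agglomerative clustering on a distance matrix.

def scores_to_distances(scores):
    """e_ij = 1 - d_ij for a square matrix of pairwise scores in [0, 1]."""
    return [[1.0 - s for s in row] for row in scores]

def ward_cluster(dist, n_clusters):
    """Merge items until n_clusters remain; returns sorted lists of item indices."""
    n = len(dist)
    n_clusters = max(1, n_clusters)
    clusters = {i: [i] for i in range(n)}
    # Squared distances between currently active clusters, keyed by (low id, high id).
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > n_clusters:
        # Pick the pair whose merge gives the smallest increase in ESS.
        (a, b), dab = min(d2.items(), key=lambda kv: kv[1])
        na, nb = len(clusters[a]), len(clusters[b])
        merged = clusters.pop(a) + clusters.pop(b)
        updated = {}
        for k, members in clusters.items():
            nk = len(members)
            dak = d2[(min(a, k), max(a, k))]
            dbk = d2[(min(b, k), max(b, k))]
            # Lance-Williams update for Ward's method.
            updated[k] = ((na + nk) * dak + (nb + nk) * dbk - nk * dab) / (na + nb + nk)
        d2 = {p: v for p, v in d2.items() if a not in p and b not in p}
        for k, v in updated.items():
            d2[(k, next_id)] = v  # next_id is always the largest id so far
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(c) for c in clusters.values())
```

Recording the order of merges instead of stopping at a fixed count would give a layered cluster tree rather than a flat partition.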

Slide 10: Sample of cluster tree
{layer=9 {layer=10 {layer=11 {layer=12 GO: GO: } {layer=12 GO: } } {layer=11 {layer=12 GO: } {layer=12 GO: } } {layer=10 {layer=11 {layer=12 GO: GO: } {layer=12 GO: } }

Slide 11: A package for definition comparison and term clustering
[Architecture diagram not preserved. Component labels: pairwise definition comparison; term clusterer; user interface (check, update); synonym lexicon; extended Porter's stemmer; distance matrix; clusters; term database.]

Slide 12: User interface for checking and updating terms
[Screenshot not preserved.]

Slide 13: Evaluation
The text comparison and clustering components are evaluated on a set of GO terms as test data.
In the evaluation, we consider GO terms to be related if they:
–share a parent term within three layers of ancestor trees via the IS_A relation, or
–have direct parent/child relations (e.g. X is_a Y), or
–have direct part-of relations (e.g. X is part of Y).
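The relatedness test above can be sketched as a small predicate. This is one possible reading, not the authors' code: the ontology is assumed to be given as two dictionaries mapping a term to its direct IS_A parents and to the terms it is part of, and "within three layers" is interpreted as a common IS_A ancestor reachable in at most three steps from each term. All names are hypothetical.

```python
# Hypothetical sketch of the relatedness criteria used in the evaluation.

def ancestors_within(term, is_a, depth):
    """IS_A ancestors of `term` reachable in at most `depth` steps."""
    found, frontier = set(), {term}
    for _ in range(depth):
        frontier = {p for t in frontier for p in is_a.get(t, ())}
        found |= frontier
    return found

def related(x, y, is_a, part_of, depth=3):
    """True if x and y satisfy any of the three criteria on the slide."""
    # Direct parent/child relation (X is_a Y or Y is_a X).
    if y in is_a.get(x, ()) or x in is_a.get(y, ()):
        return True
    # Direct part-of relation in either direction.
    if y in part_of.get(x, ()) or x in part_of.get(y, ()):
        return True
    # Shared IS_A ancestor within `depth` layers.
    return bool(ancestors_within(x, is_a, depth) & ancestors_within(y, is_a, depth))
```

Varying `depth` from 1 to 3 corresponds to the "depths of parent nodes considered" reported in the evaluation slides.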

Slide 14: Evaluation
Test data
–GO terms under the namespace cellular_component.
–2,027 terms found, of which 2,010 have definitions; these form the actual test data.
–Each of the 2,010 test terms is related, as defined previously, to one or more other test terms.
Our evaluation strategy is to examine:
–how many clustered terms have the relations defined previously, and
–how many of the related terms are covered by the clusters.
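One way to turn the two questions above into numbers is sketched below. The exact formulas are not given in the slides, so this is an assumption: a clustered term counts as correct if it is related to at least one other term in its own cluster, precision is the share of clustered terms that are correct, and coverage is the share of test terms with any related partner that end up correctly clustered. The `related` argument is a hypothetical two-argument predicate.

```python
# Hypothetical sketch of precision and coverage over term clusters.

def evaluate(clusters, test_terms, related):
    """Return (precision, coverage) for a list of clusters of term ids."""
    clustered = {t for c in clusters for t in c}
    # Terms that share a cluster with at least one related term.
    correct = {
        t
        for c in clusters
        for t in c
        if any(related(t, u) for u in c if u != t)
    }
    # Test terms that have a related partner anywhere in the test set.
    relatable = {
        t
        for t in test_terms
        if any(related(t, u) for u in test_terms if u != t)
    }
    precision = len(correct) / len(clustered) if clustered else 0.0
    coverage = len(correct) / len(relatable) if relatable else 0.0
    return precision, coverage
```

When every test term is clustered and every test term has a related partner, the two values coincide, which is consistent with the later slides reporting a single precision/coverage figure.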

Slide 15: Evaluation of bottom-layer clusters
Total_clustered_terms=1,076
[Table columns: depth of parent nodes considered; clustered true pairs; precision (%); coverage (%). The numeric cells were not preserved in the transcript; only the term counts (834 terms), (978 terms) and (1,062 terms) survive.]

Slide 16: Distribution of relation types IS_A and PART_OF in the clustered terms
[Table columns: 1 parent node, 2 parent nodes, 3 parent nodes, each split into is-a and part-of; rows: number, percent. The numeric cells were not preserved in the transcript.]

Slide 17: Evaluation of the second layer clusters
Total_clustered_terms=2,010
depth of parent nodes considered    correctly clustered terms    precision/coverage (%)
1                                   1,…                          …
2                                   1,474                        73.33
3                                   1,685                        83.83

Slide 18: Evaluation of the third layer clusters
Total_clustered_terms=2,010
[Table columns: depth of parent nodes considered; correctly clustered terms; precision/coverage (%). The numeric cells were not preserved in the transcript.]

Slide 19: Application of this package
This package can be used as an assistant tool for modifying and enriching ontologies and terminologies.
(Brief demo of interface)

Slide 20: Conclusion
–Ontology term definitions provide an important information source for term matching.
–A text comparison and clustering tool can be a useful aid for matching the terms.
–For better performance, the tool needs domain knowledge resources.

Slide 21: Acknowledgements
This research was supported by the EC BOOTStrep Project (ref. FP ).
The UK National Centre for Text Mining is sponsored by the JISC/BBSRC/EPSRC.

Slide 22: References
–BOOTStrep Project website:
–Castillo, Gabriel, Gerardo Sierra, John McNaught (2003). An improved Algorithm for Semantic Clustering. Proceedings of the 1st international symposium on Information and communication technologies, Dublin.
–Clough, Paul, Robert Gaizauskas, Scott Piao, Yorick Wilks (2002). METER: MEasuring TExt Reuse. In Proceedings of the ACL-2002, University of Pennsylvania, Philadelphia, USA, pp
–Gene Ontology
–Piao, Scott and Tony McEnery (2003). A tool for text comparison. Proceedings of the Corpus Linguistics