Semi-automatic compound nouns annotation for data integration systems Tuesday, 23 June 2009 SEBD 2009 Sonia Bergamaschi Serena Sorrentino www.dbgroup.unimo.it.

Slides:



Advertisements
Similar presentations
Entity Relationship (E-R) Modeling
Advertisements

ISDSI 2009 Francesco Guerra– Università di Modena e Reggio Emilia 1 DB unimo Searching for data and services F. Guerra 1, A. Maurino 2, M. Palmonari.
ISDSI 2009 Francesco Guerra– Università di Modena e Reggio Emilia 1 DB unimo Semantic Analysis for an Advanced ETL framework S.Bergamaschi 1, F.
Università di Modena e Reggio Emilia ;-)WINK Maurizio Vincini UniMORE Researcher Università di Modena e Reggio Emilia WINK System: Intelligent Integration.
IST SEWASIE 16 May 2002 Sonia Bergamaschi Università di Modena e Reggio Emilia.
Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.
Why Do We Need a (Plan) Portability API? Gerd Breiter Frank Leymann Thomas Spatzier.
07 - Special Session on Agricultural Metadata & Semantics Antonio Sala - Università di Modena e Reggio Emilia 1 Creating and Querying.
the Entity-Relationship (ER) Model
1 Modeling and Simulation: Exploring Dynamic System Behaviour Chapter9 Optimization.
Lecture plan Outline of DB design process Entity-relationship model
Chapter 2 Entity-Relationship Data Modeling: Tools and Techniques
CILC2011 A framework for structured knowledge extraction and representation from natural language via deep sentence analysis Stefania Costantini Niva Florio.
Chapter 10: The Traditional Approach to Design
Systems Analysis and Design in a Changing World, Fifth Edition
Semantic Access to Data from the Web Raquel Trillo *, Laura Po +, Sergio Ilarri *, Sonia Bergamaschi + and E. Mena * 1st International Workshop on Interoperability.
1 Extended Gloss Overlaps as a Measure of Semantic Relatedness Satanjeev Banerjee Ted Pedersen Carnegie Mellon University University of Minnesota Duluth.
Heterogeneous Data Warehouse Analysis and Dimensional Integration Marius Octavian Olaru XXVI Cycle Computer Engineering and Science Advisor: Prof. Maurizio.
Creating a Similarity Graph from WordNet
D ETERMINING THE S ENTIMENT OF O PINIONS Presentation by Md Mustafizur Rahman (mr4xb) 1.
A Framework for Ontology-Based Knowledge Management System
April 22, Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Doerre, Peter Gerstl, Roland Seiffert IBM Germany, August 1999 Presenter:
Università degli Studi di Modena e Reggio Emilia The MOMIS project - Sonia Bergamaschi, Alberto Corni, Francesco Guerra,
Planning a measurement program What is a metrics plan? A metrics plan must describe the who, what, where, when, how, and why of metrics. It begins with.
Sentiment Lexicon Creation from Lexical Resources BIS 2011 Bas Heerschop Erasmus School of Economics Erasmus University Rotterdam
1 Noun Homograph Disambiguation Using Local Context in Large Text Corpora Marti A. Hearst Presented by: Heng Ji Mar. 29, 2004.
Creating a Bilingual Ontology: A Corpus-Based Approach for Aligning WordNet and HowNet Marine Carpuat Grace Ngai Pascale Fung Kenneth W.Church.
Queensland University of Technology An Ontology-based Mining Approach for User Search Intent Discovery Yan Shen, Yuefeng Li, Yue Xu, Renato Iannella, Abdulmohsen.
Designing clustering methods for ontology building: The Mo’K workbench Authors: Gilles Bisson, Claire Nédellec and Dolores Cañamero Presenter: Ovidiu Fortu.
INTEGRATION INTEGRATION Ramon Lawrence University of Iowa
Ontology-based Access Ontology-based Access to Digital Libraries Sonia Bergamaschi University of Modena and Reggio Emilia Modena Italy Fausto Rabitti.
Mining and Summarizing Customer Reviews
Semantic Matching Pavel Shvaiko Stanford University, October 31, 2003 Paper with Fausto Giunchiglia Research group (alphabetically ordered): Fausto Giunchiglia,
WORDNET Approach on word sense techniques - AKILAN VELMURUGAN.
Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval Paul Clough 1 and Mark Stevenson 2 Department.
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
 Copyright 2005 Digital Enterprise Research Institute. All rights reserved. Towards Translating between XML and WSML based on mappings between.
Logics for Data and Knowledge Representation Semantic Matching.
Automatic Lexical Annotation Applied to the SCARLET Ontology Matcher Laura Po and Sonia Bergamaschi DII, University of Modena and Reggio Emilia, Italy.
Jiuling Zhang  Why perform query expansion?  WordNet based Word Sense Disambiguation WordNet Word Sense Disambiguation  Conceptual Query.
Exploiting Ontologies for Automatic Image Annotation M. Srikanth, J. Varner, M. Bowden, D. Moldovan Language Computer Corporation
Università degli Studi di Modena and Reggio Emilia Dipartimento di Ingegneria dell’Informazione Prototypes selection with.
Spoken dialog for e-learning supported by domain ontologies Dario Bianchi, Monica Mordonini and Agostino Poggi Dipartimento di Ingegneria dell’Informazione.
Serena SorrentinoLabel Normalization and Lexical Annotation for Schema and Ontology Matching 1 Label Normalization and Lexical Annotation for Schema and.
Semantic Matching Fausto Giunchiglia work in collaboration with Pavel Shvaiko The Italian-Israeli Forum on Computer Science, Haifa, June 17-18, 2003.
Jennie Ning Zheng Linda Melchor Ferhat Omur. Contents Introduction WordNet Application – WordNet Data Structure - WordNet FrameNet Application – FrameNet.
Annotating Words using WordNet Semantic Glosses Julian Szymański Department of Computer Systems Architecture, Faculty of Electronics, Telecommunications.
Semantic Enrichment of Ontology Mappings: A Linguistic-based Approach Patrick Arnold, Erhard Rahm University of Leipzig, Germany 17th East-European Conference.
WORD SENSE DISAMBIGUATION STUDY ON WORD NET ONTOLOGY Akilan Velmurugan Computer Networks – CS 790G.
SYMPOSIUM ON SEMANTICS IN SYSTEMS FOR TEXT PROCESSING September 22-24, Venice, Italy Combining Knowledge-based Methods and Supervised Learning for.
Dimitrios Skoutas Alkis Simitsis
Efficiently Computed Lexical Chains As an Intermediate Representation for Automatic Text Summarization H.G. Silber and K.F. McCoy University of Delaware.
Dr. Francisco Perlas Dumanig
Wordnet - A lexical database for the English Language.
Assessment. Levels of Learning Bloom Argue Anderson and Krathwohl (2001)
Exploiting Ontologies for Automatic Image Annotation Munirathnam Srikanth, Joshua Varner, Mitchell Bowden, Dan Moldovan Language Computer Corporation SIGIR.
From Words to Senses: A Case Study of Subjectivity Recognition Author: Fangzhong Su & Katja Markert (University of Leeds, UK) Source: COLING 2008 Reporter:
Semantic Grounding of Tag Relatedness in Social Bookmarking Systems Ciro Cattuto, Dominik Benz, Andreas Hotho, Gerd Stumme ISWC 2008 Hyewon Lim January.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Word Sense and Subjectivity (Coling/ACL 2006) Janyce Wiebe Rada Mihalcea University of Pittsburgh University of North Texas Acknowledgements: This slide.
Presented by Kyumars Sheykh Esmaili Description Logics for Data Bases (DLHB,Chapter 16) Semantic Web Seminar.
SERVICE ANNOTATION WITH LEXICON-BASED ALIGNMENT Service Ontology Construction Ontology of a given web service, service ontology, is constructed from service.
2016/9/301 Exploiting Wikipedia as External Knowledge for Document Clustering Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou Proceeding.
Logical Database Design and the Rational Model
Assessment.
Assessment.
Cross-Ontological Relationships
Kiril Simov1, Alexander Popov1, Iliana Simova2, Petya Osenova1
Bulgarian WordNet Svetla Koeva Institute for Bulgarian Language
Block Matching for Ontologies
Presentation transcript:

Semi-automatic compound nouns annotation for data integration systems Tuesday, 23 June 2009 SEBD 2009 Sonia Bergamaschi Serena Sorrentino www.dbgroup.unimo.it Dipartimento di Ingegneria dell’Informazione Università di Modena e Reggio Emilia, via Vignolese 905, 41100 Modena

The Problem Data integration systems: produce a comprehensive global schema successfully integrating data from heterogeneous structured and semi-structured data sources Starting from the “meanings” associated to schema elements it is possible to discover mappings among the elements of different schemata Lexical Annotation : the explicit inclusion of the “meaning“ of a data source element (i.e. class/attribute name) w.r.t. a thesaurus (WordNet (WN) in our case) Automatic Lexical Annotation becomes crucial as a starting point for mapping discovery Problem : many schemata names are non-dictionary words (compound nouns, acronyms, abbreviations etc.) i.e. not be present in the lexical resource in this work, we will concentrate only on non-dictionary Compound Nouns (CNs) the result of lexical annotation is strongly affected by the presence of these non-dictionary CNs in the schema The focus of data integration systems is on producing a comprehensive global schema successfully integrating data from heterogeneous structured and semi-structured data sources. Therefore, it is important to deal with labels of schemata , i.e. to understand the “meaning" behind the names denoting schemata elements. Lexical Annotation becomes, thus, crucial to understand the meaning of schemata. Lexical annotation is the explicit inclusion of the “meaning“ of a data source element (i.e. class/attribute name) w.r.t. a thesaurus (WordNet (WN) in our case). Problem : a lexical resource does not cover with the same detail different domains of knowledge and many domain dependent terms, say non-dictionary words, may not be present in it; non-dictionary words include compound nouns, acronyms etc. In this work, we will concentrate only on non-dictionary Compound Nouns (CNs).

Proposed Solution & Motivation In some approaches the constituents of a CN are treated as single words. E.g. the CN “teacher_judgment" is split into two tokens (“teacher" and “judgment") and its relatedness to other sources element is calculated as an average relatedness between each token and the other elements A large set of relationships among different schemata is discovered, including a great amount of false positive relationships We propose a semi-automatic method for the lexical annotation of non-dictionary CNs Starting from our previous works on lexical annotation of structured and semi-structured data sources, we propose a semi-automatic method for the lexical annotation of non-dictionary CNs by creating a new WN meaning. Our method is implemented in the MOMIS (Mediator envirOnment for Multiple Information Sources) system. However, it may be applied in general in the context of schema mapping discovery, ontology merging and data integration system.

Compound Noun annotation Compound Noun (CN): a word composed of more than one words called CN constituents In order to perform semi-automatic CNs annotation a method for their interpretation has to be devised The interpretation of a CN is the task of determining the semantic relationships among the constituents of a CN CNs can be divided in four categories: endocentric, exocentric, copulative and appositional and to consider only endocentric CNs Endocentric CN: consists of a head (i.e. the part that contains the basic meaning of the whole CN) and modifiers, which restricts this meaning. A CN exhibits a modifier-head structure with a sequence of nouns composed of a head noun and one or more modifiers where the head noun occurs always after the modifiers Endocentric CN: consists of a head (i.e. the categorical part that contains the basic meaning of the whole CN) and modifiers, which restrict this meaning. A CN exhibits a modifier-head structure with a sequence of nouns composed of a head noun and one or more modifiers where the head noun occurs always after the modifiers.

Compound Noun annotation Our restriction is motivated by different elements: the the vast majority of schemata CNs fall in the endocentric category endocentric CNs are the most common type of CNs in English exocentric and copulative CNs, which are represented by a unique word, are often present in a dictionary (e.g. “loudmouth”, “sleepwalk”, etc.) appositional compound are not very common in English and less likely used as element of a schema (e.g.“sweet-sour”) Our method can be summed up into four main steps: CN constituents disambiguation redundant constituents identification and pruning CN interpretation via semantic relationships creation of a new WN meaning for a CN

CN constituents disambiguation & pruning Compound Noun syntactic analysis: syntactic analysis of CN constituents, performed by a parser Disambiguating head and modifier: by applying our CWSD (Combined Word Sense Disambiguation) algorithm, each word is automatically mapped into its corresponding WordNet 2.0 synsets Redundant constituents identification and pruning Redundant words: words that do not contribute new information, i.e. derived from the schema or from the lexical resource E.g. the attribute “company_address” of the class “company”: “company” is not considered as the relationship holding among a class and its attributes is implicit in the schema the syntactic analysis is performed by a parser (in our case the Charniak parser [6]), to identify the part of speech of each constituent. If the CN does not fall under the endocentric syntactic structure (noun-noun or adjective-noun), it is ignored.

CN interpretation via semantic relationships Our goal is to select, among a set of predefined semantic relationships, the one that best capture the relation between the head and the modifier 9 possible semantic relationship: CAUSE, HAVE, MAKE, IN, FOR, ABOUT, USE, BE, FROM (Levi’s semantic relationships set) the semantic relationship between the head and the modifier of a CN is the same holding between their top level WN nouns in the WN hierarchy The top level concepts of the WN hierarchy are the 25 unique beginners for WN English nouns defined by Miller

CN interpretation via semantic relationships To each couple of unique beginners we associate the relationship from the Levi's set that best describes their combined meaning For example, we interpret the CN “teacher judgment“ by the MAKE relationship as “teacher" is an hyponym of “person" and “judgment" is an hyponym of “act“ and for the couple (person, act) of unique beginners we choose the relationship MAKE MAKE Person#1 Act#2 … hyponym hyponym Educator#1 … hyponym MAKE Teacher#1 Judgment#2

Creation of a new WN meaning for a CN (a) Gloss definition: we create the gloss to be associated to a CN, starting from the relationship associated to a CN and exploiting the glosses of the CN constituents Teacher #1 Gloss A person whose occupation is teaching. judgment #2 Gloss The act to judging or assessing a person or situation or event. + + Modifier MAKE Head ew WN meaning for a CN; it can be divided in two prevalent steps: Gloss definition: during this step we create the gloss to be associated to a CN, starting from the relationship associated to a CN and exploiting the glosses of the CN constituents. Inclusion of the new CN meaning in WN: the insertion of a new CN meaning in the WN hierarchy implies the definition of its relationships with the other WN meanings. As the concepts denoted by a CN are a subset of the concepts denoted by the head we assume that a CN inherits most of its semantic from its head. Starting from this consideration, we can infer that the CN is related, in the WN hierarchy, with its head by an hyponym relationship. Moreover, we represent the CN semantics related to its modifier by inserting a generic relationship RT (Related term), corresponding to WN relationships as member meronym, part meronym etc. Moreover, we use the WNEditor tool to create/manage the new meaning and to set relationships between it and WN ones Teacher_judgment Gloss: A person whose occupation is teaching make the act to judging or assessing a person or situation or event.

Creation of a new WN meaning for a CN (b) Inclusion of the new CN meaning in WN: as the concept denoted by a CN is a subset of the concept denoted by the head we create  an hyponym relationship between the new CN meaning and its head meaning  a generic relationship RT (Related term), corresponding to WN relationships as member meronym, part meronym etc. between the CN meaning and its modifier  we use the WNEditor tool to create/manage the new meaning and to set new relationships between it and WN meanings Teacher_judgment#1 judgment#2 Teacher#1 SYNSETβ WNEditor hypernym/hyponym Related To Teacher_judgment#1 SYNSETµ

Example Related To hypernym Teacher_judgment#1

Evaluation: Experimental Result CNs annotation extends the automatic annotation tool within the MOMIS system Evaluation over a real data sources environment: three sources of an application scenario of the NeP4B project (491 schema elements) which contain a lot of CNs (about 50%). Without CNs annotation, CWSD obtains a very low recall value. Our method increases the recall without significantly worsening precision. However, the recall value is not very high: presence of a lot of acronym terms. A CN has been considered correctly annotated if the Levi's relationship selected manually by the user is the same returned by our method

Conclusion The experimental results showed the effectiveness of our method which significantly improves the result of the lexical annotation process Our method may be applied in general in the context of mapping discovery, ontology merging and data integration system Future work will be devoted to investigate on the role of the set of semantic relationships chosen for the CNs interpretation process We will extend the tool with a component which deals with acronyms and abbreviations expansion (to appear at 28th International Conference on Conceptual Modeling, ER 2009)

Thanks for your attention!