Comparing Two Thesaurus Representations for Russian


Natalia Loukachevitch, German Lashevich, Boris Dobrov
Lomonosov Moscow State University
louk_nat@mail.ru

Russian Thesauri for NLP
- There have been more than four attempts to create a Russian wordnet
- The large RuThes thesaurus already exists and can be used for NLP
  - Its structure is different, but most techniques developed for WordNet can be applied to it
- But people want to have a wordnet for their own language
- This talk: semi-automatic conversion of data from the RuThes thesaurus into a WordNet-like structure -> RuWordNet
- The conversion process gives a better understanding of the differences between the resources

Outline
- Wordnets for Russian
- Thesaurus of the Russian language RuThes
  - Differences from WordNet
- Generation of the RuWordNet basic structure
- Additional relationships in RuWordNet

Projects of Russian Wordnets
- Automatically generated
  - Balkova et al., 2008: state of the project is unknown
  - http://wordnet.ru/ (Gelfenbeyn et al., 2003): direct translation without any manual revision
- Developed from scratch
  - RussNet (Azarowa, 2008)
  - YARN – Yet Another RussNet (2012)
    - Crowdsourcing, use of Wiktionary
    - https://russianword.net/
    - Many naïve decisions
    - Only synsets without relations
- New project: RussNet+YARN (2016)

RuThes Linguistic Ontology
- Linguistic ontology: most concepts are based on senses of real language expressions
- Developed for more than 20 years
- Corporate-owned, now partially published (RuThes-lite)
- Unified representation: a single net of concepts
  - For different parts of speech
  - For lexical units and domain terms
  - For words and multiword expressions
- Current size
  - 55 thousand concepts, 4.1 relations per concept
  - 168 thousand unique Russian words and multiword expressions
  - 190 thousand senses

RuThes-Based Projects
- Information-retrieval applications
  - Conceptual indexing
  - Knowledge-based text categorization
  - Semantic search and query expansion
  - Visualization of search results
  - Document clustering
  - Single-document and multidocument summarization
  - Sentiment analysis
- Projects with state bodies
  - Central Bank of the Russian Federation (2006 – ..)
  - Central Election Committee of the RF (1999 – 2011)
  - ...
- Commercial organizations
  - Rambler Media company (2007 – 2012)
  - Garant Legal Information Company (2002 – 2013..)
  - Yandex (2014)
  - …

Units of RuThes
- Main principles
  - Distinguishable concepts: each concept is distinct from its neighbor concepts at the denotational level
  - A concept should have an unambiguous and concise name
  - Text entries should be equivalent with respect to concept relations
- A concept unites the following language expressions (ontological synonyms):
  - words that belong to different parts of speech: red, redness, red color, red colour
  - linguistic expressions relating to different linguistic styles and genres
  - single words, idioms, and free multiword expressions whose senses correspond to the concept

Examples of ontological synonyms
- ДУШЕВНОЕ СТРАДАНИЕ (wound in the soul): боль, боль в душе, в душе наболело, душа болит, душа саднит, душевная пытка, душевная рана, душевный недуг, наболеть, рана в душе, рана в сердце, рана души, саднить
- English ontological synonyms could look like: emotional hurt, emotional pain, emotional wound, heartache, pain, pain in the soul, wound, wound in the heart, wound in the soul
- But WN 3.0: pain, painfulness (emotional distress; a fundamental feeling that people try to avoid) "the pain of loneliness"

RuThes Conceptual Relations
- Small set of relations, motivated by information-retrieval thesauri and formal ontologies
- Class – subclass
  - Transitivity, inheritance
- Part – whole
  - Transitivity of part-whole relations
- External ontological dependence (Gangemi et al., 2001; Guarino, 2009)
  - The existence of a car plant depends on the existence of cars
- Main principle for establishing relations: only reliable relations
  - Concepts at lower levels of the hierarchy should be rigidly related to upper concepts
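The transitivity of class-subclass and part-whole relations can be illustrated with a small sketch. It is not the RuThes implementation: the function name `transitive_closure` and the toy hierarchy are assumptions made for illustration only.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Return every (lower, upper) pair implied by transitivity.

    edges: iterable of (lower_concept, upper_concept) pairs, for example
    direct subclass-class or part-whole links.
    """
    uppers = defaultdict(set)
    for lower, upper in edges:
        uppers[lower].add(upper)

    closure = set()
    for start in list(uppers):
        stack = list(uppers[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            # Keep climbing: everything above `node` is also above `start`.
            stack.extend(uppers.get(node, ()))
    return closure


# Hypothetical toy hierarchy, not taken from RuThes data.
subclass_of = [("sports car", "car"), ("car", "vehicle")]
ancestors = transitive_closure(subclass_of)
```

Here `ancestors` contains the two direct links plus the inferred pair ("sports car", "vehicle").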

Part-Whole Relations in RuThes
- Parts described in RuThes should be "attached" to their wholes
  - Existential or generic dependence of the part on the whole (Gangemi et al., 2001; Guizzardi, 2011)
  - Inseparable parts, mandatory wholes
- Different semantic types
  - Physical entities, elements, processes
  - Roles in processes (investor – investing)
  - Processes in spheres of activities
  - Properties of entities
- Such a part-whole relation is close to Guarino's internal relations (Guarino, 2009)
- The part-whole relation is assumed to be transitive

External dependence
- An external dependence relation of concept C2 on concept C1, written asc1(C2, C1), can be established if:
  - neither taxonomic nor part-whole relations can be established between C1 and C2 in the RuThes linguistic ontology
  - the following assertion is true: "C2 exists" means "C1 exists"
- asc1 relations are inherited by subclasses and parts
- Examples:
  - asc1(automotive industry, car (vehicle))
  - asc1(forest, tree)
  - asc1(forest fire, forest)
  - asc1(forestry, forest)
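The inheritance of asc1 relations by subclasses and parts can be sketched as follows. This is a minimal illustration under stated assumptions: the function `inherit_dependence`, the `below` mapping, and the concepts "pine forest" and "forest edge" are hypothetical; only asc1(forest, tree) comes from the slide.

```python
def inherit_dependence(asc1, below):
    """Propagate asc1(C2, C1) to every concept located below C2.

    asc1:  set of (dependent, base) pairs, read as "dependent exists
           means base exists".
    below: dict mapping a concept to its direct subclasses and parts.
    """
    result = set(asc1)
    for dependent, base in asc1:
        stack = list(below.get(dependent, ()))
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            result.add((node, base))  # the lower concept inherits the dependence
            stack.extend(below.get(node, ()))
    return result


asc1 = {("forest", "tree")}                          # example from the slide
below = {"forest": ["pine forest", "forest edge"]}   # hypothetical subclass and part
inherited = inherit_dependence(asc1, below)
```

With these inputs, `inherited` holds the original pair plus asc1(pine forest, tree) and asc1(forest edge, tree).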

RuThes-like Linguistic Ontologies
- General Lexicon
- Domain-specific lexicons
  - Sociopolitical Thesaurus: 41.4 K concepts, 121 K terms
  - Ontology on Natural Sciences and Technologies: 94 K concepts, 262 K terms
  - Security Thesaurus: 66.8 K concepts, 236 K terms
  - Banking Thesaurus
  - Avia*Ontology

Generating RuWordNet
- Source: RuThes-lite 2.0
  - 115 thousand words and expressions
- Division into part-of-speech nets
  - Uses the morpho-syntactic representation of RuThes text entries
  - Division into three synset nets
  - Cross-category synonymy between the text entries of divided concepts
- Providing WordNet-like (lexical) relations
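The division of one concept into part-of-speech synsets with cross-category synonymy links can be sketched roughly as below. The function `split_concept`, the concept id 1, and the synset naming scheme are assumptions; the words come from the adjective example later in the talk, and treating them as entries of a single RuThes concept is also an assumption. Entries are taken as already tagged with a part of speech, whereas the real procedure derives it from the morpho-syntactic representation.

```python
from collections import defaultdict
from itertools import combinations


def split_concept(concept_id, entries):
    """Split one concept into per-POS synsets plus cross-category links.

    entries: list of (text, pos) pairs with pos in {"N", "V", "A"}.
    Returns (synsets, links): synsets maps a synset id such as "N1" to its
    words; links lists pairs of synset ids that came from the same concept.
    """
    by_pos = defaultdict(list)
    for text, pos in entries:
        by_pos[pos].append(text)

    synsets = {f"{pos}{concept_id}": sorted(words) for pos, words in by_pos.items()}
    # Every pair of synsets cut from the same concept is linked as POS-synonyms.
    links = list(combinations(sorted(synsets), 2))
    return synsets, links


entries = [
    ("стройка", "N"), ("постройка", "N"), ("возведение", "N"),
    ("строить", "V"), ("построить", "V"), ("возводить", "V"),
    ("строительный", "A"),
]
synsets, links = split_concept(1, entries)
```

The result is three synsets (A1, N1, V1) and three POS-synonymy links between them.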

Transfer of Relations: RuThes -> RuWordNet
- Class-subclass relations => hyponym-hypernym relations + closure relations
  - RuThes: C1 (verb) -> C2 (no verb) -> C3 (verb)
- Geographical synsets to their types => instance-hypernym + H
- Part-whole relations => part-whole, domain relations + H
- Associations => antonyms + H
- Ontological dependence relations => cause, entailment, phrase-component relations + H
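The closure relations for the chain C1 (verb) -> C2 (no verb) -> C3 (verb) can be sketched as a search for the nearest upper concept that has entries of the required part of speech. This is an illustrative reading of the slide, not the authors' code: the arrows are assumed to point from subclass to class, and `nearest_pos_hypernyms` is a hypothetical helper.

```python
def nearest_pos_hypernyms(concept, parents, has_pos):
    """Return the closest upper concepts that have entries of the target POS.

    parents: dict mapping a concept to its direct classes.
    has_pos: callable telling whether a concept has entries of the target POS.
    Upper concepts without such entries are skipped, not treated as dead ends.
    """
    found, seen = set(), set()
    stack = list(parents.get(concept, ()))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if has_pos(node):
            found.add(node)  # stop climbing on this branch
        else:
            stack.extend(parents.get(node, ()))  # bridge over the gap
    return found


# The chain from the slide: C2 has no verb entries.
parents = {"C1": ["C2"], "C2": ["C3"]}
verb_concepts = {"C1", "C3"}
verb_hypernyms = nearest_pos_hypernyms("C1", parents, verb_concepts.__contains__)
```

In the verb net, the synset built from C1 is therefore attached directly to the synset built from C3.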

RuWordNet Statistics

Part of speech   Number of synsets   Number of unique entries   Number of senses
Noun             29,296              68,695                     77,153
Verb             7,634               26,356                     35,067
Adjective        12,864              15,191                     18,195
Total                                                           130,415

Part of speech   Hypernyms   Instance-class   Wholes   POS-synonymy   Antonyms
Noun             39,155      1,863            10,010   18,179         455
Verb             10,440      -                -        7,143          20
Adjective        16,423      -                -        13,794         457

RuWordNet: Noun Relations
- Hyponym-hypernym
- Instance-hypernym (geographical locations)
- Antonyms (properties and states)
- POS-synonymy
- Part-whole relations
  - functional parts (nostrils -> nose)
  - ingredients (additives -> substance)
  - geographic parts (Seville -> Andalusia)
  - members (monk -> monastery)
  - dwellers (Moscow citizen -> Moscow)
  - temporal parts (gambit -> chess game)

RuWordNet: Adjective Relations
- Hyponym-hypernym relations
  - Hierarchies as in GermaNet and the Polish wordnet
- Antonyms
- Cross-category synonymy links to noun and verb synsets
  - The word строительный has POS links:
    - to the noun synset {стройка, постройка, возведение, сооружение..}
    - to the verb synset {строить, построить, возводить ...}

Enrichment of Relation Set in RuWordNet
- Cause and entailment relations
- Domain relations
- Relations between a phrase and its components
- Derivational relations

Cause and Entailment Relations for Verb Synsets
- Cause: "A causes B"; no coincidence in time
- Entailment: "Someone V1" logically entails "Someone V2"; coincidence in time
- RuThes concepts with verb text entries
  - Relations of ontological dependence (directed associations) were reviewed by experts
- 610 cause relations: сажать – сесть (cause to sit – sit)
- 943 entailment relations: сниться (dream) – спать, поспать, почивать.. (sleep)

Domain Relations
- In RuThes, domain relations are considered a kind of part-whole relation: industrial plant – industry
  - Thematically related concepts are grouped together
- WordNet: most relations are taxonomic => the "tennis problem": related synsets belong to different hierarchies
  - Therefore a system of domains was introduced
- WordNet's domain system (Magnini, Pianta, 2000) was adapted for RuWordNet
  - Some domains were added (World religions)
  - Some domains were removed
- A domain is treated as a category in a knowledge-based categorization system and described in a special interface
  - Relations from synsets to domains are inferred using RuThes relation properties (transitivity and inheritance)
  - Post-editing

Relations between phrases and their components in RuWordNet
- Phrases as text entries in RuThes
  - There are many phrases, including compositional or semi-compositional ones; they are now in RuWordNet
  - For compositional phrases, ontological dependence relations (= directed associations) are often used: car plant – car
  - Such relations are not present in RuWordNet, so they could be lost
- A special file describes relations between a phrase and its components (synsets)
  - The relations are inferred using the relation properties of RuThes (transitivity and inheritance)
- Cargo vehicle:

  <sense name="ГРУЗОВОЕ СРЕДСТВО ТРАНСПОРТА" id="101933" synset_id="N26202">
    <composed_of>
      <sense name="СРЕДСТВО" id="28238" synset_id="N28331"/>
      <sense name="ГРУЗОВОЙ" id="38045" synset_id="A9059"/>
      <sense name="ТРАНСПОРТ" id="41294" synset_id="N21760"/>
    </composed_of>
  </sense>
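A phrase-component entry of this shape can be read with Python's standard `xml.etree.ElementTree` module. The sketch below parses only the fragment shown on the slide; the layout of the complete file (root element, encoding declaration) is not shown in the talk, so the helper `components` and the constant `SNIPPET` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# The fragment from the slide, reproduced as a standalone document.
SNIPPET = """<sense name="ГРУЗОВОЕ СРЕДСТВО ТРАНСПОРТА" id="101933" synset_id="N26202">
  <composed_of>
    <sense name="СРЕДСТВО" id="28238" synset_id="N28331"/>
    <sense name="ГРУЗОВОЙ" id="38045" synset_id="A9059"/>
    <sense name="ТРАНСПОРТ" id="41294" synset_id="N21760"/>
  </composed_of>
</sense>"""


def components(xml_text):
    """Return (name, synset_id) for each component sense of a phrase entry."""
    root = ET.fromstring(xml_text)
    return [
        (sense.get("name"), sense.get("synset_id"))
        for sense in root.findall("./composed_of/sense")
    ]


parts = components(SNIPPET)
# The first letter of a synset id indicates the net: N (noun), V (verb), A (adjective).
nets = [synset_id[0] for _, synset_id in parts]
```

For the cargo vehicle entry, `parts` lists three component senses: two from the noun net and one from the adjective net.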

Derivation Relations in RuWordNet
- Derivation relations are also inferred using the properties of relations
  - Аренда: арендатор, арендаторский, арендаторша, арендно-хозяйственный, арендный, арендование, арендователь, арендовать, арендодатель. (Lease, leaseholder, lessee, etc.)
- Ambiguous words are connected correctly:

  <sense name="ДОНОСИТЬ" id="70038" synset_id="V44416">
    <derived_from>
      <sense name="ДОНОСИТЕЛЬСТВО" id="47412" synset_id="N24310"/>
      <sense name="ДОНОСИТЬСЯ" id="73759" synset_id="V46525"/>
      <sense name="ДОНОСНЫЙ" id="24104" synset_id="A9883"/>
      <sense name="ДОНОСЧИК" id="55658" synset_id="N35980"/>
      <sense name="ДОНОСЧИЦА" id="55660" synset_id="N35980"/>
      <sense name="ДОНОСИТЕЛЬСКИЙ" id="47411" synset_id="A4423"/>
      …
    </derived_from>
  </sense>

Ruwordnet.ru: посадить
[Screenshot of the ruwordnet.ru entry. Labeled elements: synset "to plant.1", Botany domain, hypernym, hyponyms]

Accessibility of RuThes and RuWordNet
- RuThes web-site: http://www.labinform.ru/pub/ruthes/index.htm
- RuWordNet web-sites: http://www.labinform.ru/pub/ruwordnet/index.htm, ruwordnet.ru
- XML files can be obtained for non-commercial use: louk_nat@mail.ru

Conclusion
- We have described the semi-automatic process of transforming the Russian-language thesaurus RuThes (version RuThes-lite 2.0) into a WordNet-like thesaurus called RuWordNet (130 thousand senses)
- In this procedure we attempted to achieve two main characteristic features of wordnet-like resources:
  - division of data into part-of-speech-oriented structures with cross-references between them
  - a set of relations similar to wordnet relations
- Both thesauri, RuThes-lite 2.0 and RuWordNet, are currently published
  - Researchers can obtain both types of thesauri and compare them in applications
- We would like to develop both resources because their relations are different and can be useful in different applications