Ewa Rudnicka, Wojciech Witkowski, Maciej Piasecki G4.19 Research Group Institute of Informatics, Wrocław University of Technology nlp.pwr.wroc.pl plwordnet.pwr.wroc.pl.

Large Polish-English Lexico-Semantic Resource Based on plWordNet - Princeton WordNet Mapping

Outline
- What is a wordnet?
- Mapping plWordNet on Princeton WordNet
- Extending Princeton WordNet
- Applications
- Conclusions

What is a wordnet? (1)
A huge electronic lexico-semantic database (a kind of thesaurus).
Basic building blocks:
- lemma: the base form representing different inflectional forms and different meanings, e.g. czwórka ('good')
- lexical unit: a lemma plus sense pair (in wordnets the sense is marked with a number), e.g. czwórka 3 (por: 'communication')
- synset: a set of synonymous lexical units, e.g. {czwórka 3 (por), czwóra 1 (por)}

What is a wordnet? (2)
Both lexical units and synsets are linked via different lexico-semantic relations, such as synonymy, near-synonymy, hypernymy/hyponymy, meronymy/holonymy and fuzzynymy.
Examples:
- Lexical relations: czwórka 3 (por) has a derivativity relation to czwórka 4 (por); czwórka 3 (por) has an expressiveness relation to czwóra 1 (por)
- Synset relations: {czwórka 3 (por), czwóra 1 (por)} is a hyponym of {stopień 3 (il), ocena 1 (il), nota 3 (il)}
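
The building blocks and relations above can be sketched as a small data model. This is only an illustration: the class names, the `add_relation` helper and the relation label `hyponym_of` are assumptions made for the example, not plWordNet's actual storage format.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LexicalUnit:
    """A lemma paired with a sense number and an optional domain label."""
    lemma: str
    sense: int
    domain: str = ""


@dataclass
class Synset:
    """A set of synonymous lexical units plus its outgoing synset relations."""
    units: list
    relations: dict = field(default_factory=dict)  # relation name -> list of Synset

    def add_relation(self, name, target):
        self.relations.setdefault(name, []).append(target)


# Lexical units taken from the slide examples.
czworka_3 = LexicalUnit("czwórka", 3, "por")
czworka_4 = LexicalUnit("czwórka", 4, "por")
czwora_1 = LexicalUnit("czwóra", 1, "por")

# Relations between lexical units, stored as (source, relation, target) triples.
lexical_relations = [
    (czworka_3, "derivativity", czworka_4),
    (czworka_3, "expressiveness", czwora_1),
]

# Relations between synsets (hypothetical label "hyponym_of").
grade_synset = Synset([
    LexicalUnit("stopień", 3, "il"),
    LexicalUnit("ocena", 1, "il"),
    LexicalUnit("nota", 3, "il"),
])
four_synset = Synset([czworka_3, czwora_1])
four_synset.add_relation("hyponym_of", grade_synset)
```

Keeping lexical relations and synset relations in separate structures mirrors the slide's distinction between the two levels.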

Princeton WordNet
- Princeton WordNet (Fellbaum 1998): the first wordnet ever, built on psycholinguistic principles: mapping the structure of human lexical memory (cf. Miller 1998)
- taxonomic hierarchies for nouns, entailment relations for verbs, antonym relations for adjectives
- synsets represent 'lexicalised concepts' (cf. Miller 1998)
- synsets are built of lexical units linked by the synonymy relation, understood as a conceptual relation established on the basis of linguists' intuitions and dictionary definitions
- no major changes since 2006; last version 2012

plWordNet - Słowosieć
- plWordNet (plWN) was developed fairly independently of Princeton WordNet (PWN) by applying a unique corpus-based method
- one of the biggest existing wordnets

Number of        plWN      PWN       enWN
lemmas           156,402   155,593   157,541
lexical units    220,129   206,978   209,147
synsets          162,629   117,659   119,290

plWordNet vs. Princeton WordNet
- the emphasis is on relations between lexical units; new relations were specially designed to cover the peculiarities of the morphosyntactic structure of Polish (cf. Piasecki et al. 2009, Maziarz et al.)
- synsets are built of lexical units sharing the same set of constitutive relations, such as hyponymy, hypernymy, meronymy and holonymy
- partly linked to Princeton WordNet (cf. Rudnicka et al.)

Mapping plWordNet on Princeton WordNet
Goal: linking plWordNet synsets with Princeton WordNet synsets
Steps:
- defining a set of inter-lingual relations and setting their hierarchy
- designing mapping procedures for nouns and adjectives
Mapping direction: plWordNet > Princeton WordNet
Bottom-up approach: starting from the lowest levels in the hierarchy
Currently mapped lexical categories: nouns (most of them), adjectives (about a half)

Automatic prompts
Two systems, based on:
1) a relaxation labeling algorithm (nouns)
2) rules relying on the network of the existing intra- and inter-lingual relations (adjectives)
Resource: cascade dictionary
Generated prompts:
- visible in the form of special links in the WordNetLoom editing system
- verified by lexicographers

A set of inter-lingual relations and current statistics
The set of inter-lingual relations between plWN and PWN is inspired by:
- inter-lingual relations from EuroWordNet (Vossen 2002)
- intra-lingual relations from plWordNet (Maziarz et al. 2011)
Statistics of the established inter-lingual links (nouns, adjectives), by relation:
- Synonymy
- Partial synonymy
- Inter-register synonymy
- Hyponymy
- Hypernymy
- Meronymy
- Holonymy
- Cross-categorial synonymy 3 891

Motivation for the extension of Princeton WordNet
- A high percentage of the inter-lingual links between plWordNet and Princeton WordNet synsets are hyponymy links.
- They were established because of a number of lexical coverage gaps in Princeton WordNet, which made it impossible to establish the much more informative and useful inter-lingual synonymy links.
- These hyponymy links can be used as 'pointers' to specific Princeton WordNet gaps ('missing' lexical units) and to whole 'empty nests' (several missing co-hyponyms of one hypernym synset) in the network.
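
A minimal sketch of how inter-lingual hyponymy links could act as such pointers: group the links by their Princeton WordNet target and report targets that attract several plWordNet synsets. The function name, the `(plwn_id, pwn_id)` pair format, the identifiers and the threshold of two are all assumptions made for illustration.

```python
from collections import defaultdict


def find_empty_nests(hyponymy_links, min_size=2):
    """Group inter-lingual hyponymy links by their PWN hypernym synset.

    hyponymy_links: iterable of (plwn_synset_id, pwn_synset_id) pairs, where the
    plWN synset is an inter-lingual hyponym of the PWN synset.
    Returns the PWN synsets with at least `min_size` incoming links, i.e.
    candidate 'empty nests' with several missing co-hyponyms.
    """
    nests = defaultdict(list)
    for plwn_id, pwn_id in hyponymy_links:
        nests[pwn_id].append(plwn_id)
    return {pwn_id: pl_ids for pwn_id, pl_ids in nests.items() if len(pl_ids) >= min_size}


# Hypothetical identifiers, for illustration only.
links = [("pl:1", "pwn:A"), ("pl:2", "pwn:A"), ("pl:3", "pwn:B")]
candidates = find_empty_nests(links)
```

A PWN synset with a single incoming link points to one missing lexical unit rather than a nest, so it is filtered out here.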

Inter-lingual hyponymy links

General extension procedure
- The starting point: existing inter-lingual hyponymy links
- Lemmas of plWordNet synsets are translated by a cascade dictionary, which combines several traditional dictionaries; the data are ordered in a hierarchy of importance, with the topmost dictionary given the highest priority
- The results are filtered by the lemmas of Princeton WordNet, to obtain:
  - a list of plWN lemmas with 'equivalent' cascade dictionary lemmas absent from PWN
  - a list of plWN lemmas without 'equivalent' cascade dictionary lemmas
  - a list of plWN lemmas with 'equivalent' cascade dictionary lemmas present in PWN
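
The translation and filtering steps above can be sketched as follows. This is a simplified reading, not the authors' implementation: it assumes that the first dictionary containing a lemma wins outright, and that a lemma counts as 'present in PWN' if any of its equivalents is a PWN lemma. All names and the toy data are hypothetical.

```python
def cascade_translate(lemma, dictionaries):
    """Return equivalents from the highest-priority dictionary that lists the lemma.

    dictionaries: list of dicts (lemma -> list of equivalents), most important first.
    """
    for dictionary in dictionaries:
        equivalents = dictionary.get(lemma)
        if equivalents:
            return list(equivalents)
    return []


def partition_lemmas(plwn_lemmas, dictionaries, pwn_lemmas):
    """Split plWN lemmas into the three lists described in the procedure."""
    absent_from_pwn, untranslated, present_in_pwn = [], [], []
    for lemma in plwn_lemmas:
        equivalents = cascade_translate(lemma, dictionaries)
        if not equivalents:
            untranslated.append(lemma)
        elif any(e in pwn_lemmas for e in equivalents):
            present_in_pwn.append(lemma)
        else:
            absent_from_pwn.append(lemma)
    return absent_from_pwn, untranslated, present_in_pwn


# Toy data, for illustration only.
dictionaries = [{"kot": ["cat"]}, {"kot": ["tomcat"], "bigos": ["bigos"]}]
pwn_lemmas = {"cat", "dog"}
absent, untranslated, present = partition_lemmas(
    ["kot", "bigos", "żurek"], dictionaries, pwn_lemmas
)
```

The first list (equivalents found but absent from PWN) is the one that yields candidate new lexical units for the extension.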

Extension procedure
- Start at the lowest level of the hierarchy, in order not to change the structure of the original Princeton WordNet
- Verification of the suggested English equivalent(s) in corpora and other reliable sources, on the basis of:
  - the researcher's knowledge
  - dictionaries
  - frequency lists from corpora
- Creation of the new Princeton WordNet synset
- The synset is linked:
  - via the intra-lingual hyponymy relation to a proper PWN hypernym synset
  - via the inter-lingual synonymy relation to its direct counterpart in plWordNet

Extension results
Each added synset is provided with:
- a definition (major source: English Wikipedia)
- a usage example from a corpus or another reliable English source
Total number of selected plWN synsets
Domains selected for the first stage: shape (156), substance (1181), quantity (547), food (885), property (1492)

Extension via plWN. Pros and cons
Pros:
- there is a definite vocabulary basis for the extension
- new synsets can be easily and safely located in the structure of the original PWN
Cons:
- Polish orientation of the extension
- addition of lexical units related to strictly Polish domains

Extension via corpora data. An alternative strategy
This extension procedure uses frequency lists derived from:
- British National Corpus
- WaCky corpus
- Corpus of Contemporary American English
- American National Corpus
- English Wikipedia
It is independent of plWordNet.
Criterion for inclusion of a new lexical unit: its appearance in five different texts
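
The inclusion criterion can be illustrated with a document-frequency count: a lemma qualifies once it occurs in at least five different texts. The sketch below assumes each text is already tokenised and lemmatised, and that lemmas already in PWN are excluded; the function names and the sample data are hypothetical.

```python
from collections import Counter


def document_frequency(texts):
    """Count in how many different texts each lemma occurs (not how often)."""
    counts = Counter()
    for tokens in texts:
        counts.update(set(tokens))  # a lemma counts once per text
    return counts


def inclusion_candidates(texts, known_lemmas, min_texts=5):
    """Lemmas absent from `known_lemmas` that appear in at least `min_texts` texts."""
    frequencies = document_frequency(texts)
    return sorted(
        lemma
        for lemma, n_texts in frequencies.items()
        if n_texts >= min_texts and lemma not in known_lemmas
    )


# Toy corpus: "selfie" occurs in five texts, "blog" in only four.
texts = [["selfie", "cat", "selfie"]] * 5 + [["blog"]] * 4
new_lemmas = inclusion_candidates(texts, known_lemmas={"cat"})
```

Counting texts rather than raw occurrences keeps a word that is repeated many times in a single document from passing the threshold.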

Pros and cons
Pros:
- English oriented
- no Polish bias
Cons:
- new synsets have to be introduced at different levels of the PWN hierarchy
- there is a risk of changing the structure of the original PWN

Applications
Cross-lingual:
- semantic searching
- semantic indexing of texts
- text classification
- statistical semantic analysis of corpora in different languages
- Information Extraction, Machine Translation
Multi-lingual:
- Princeton WordNet 3.1 is linked to more than 60 languages

Conclusions
- The created bilingual resource will become a gateway to CLARIN bilingual resources
- It has a number of practical applications
- Princeton WordNet can be enriched and updated
- The extension of Princeton WordNet allows one to replace the existing inter-lingual hyponymy links between plWN and PWN synsets with more precise and useful inter-lingual synonymy links

References
Fellbaum, Ch. (ed.) (1998). WordNet: An Electronic Lexical Database. MIT Press: Cambridge, Massachusetts.
Maziarz, M., Piasecki, M. and Szpakowicz, S. Approaching plWordNet Proceedings of the 6th Global Wordnet Conference, Matsue.
Piasecki, M., Maziarz, M., Szpakowicz, S. & Rudnicka, E. (2014). plWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources. In Proc. 7th International Global Wordnet Conference.
Princeton WordNet
Rudnicka, E., Maziarz, M., Piasecki, M. & Szpakowicz, S. (2012). A Strategy of Mapping Polish WordNet onto Princeton WordNet. In Proceedings of COLING ACL.
Słowosieć http://plwordnet.pwr.wroc.pl/wordnet/
Vossen, P. (ed.) (2002). EuroWordNet. General Document. Amsterdam.