Published by Archibald Blankenship. Modified over 9 years ago.
1
Large Polish-English Lexico-Semantic Resource Based on plWordNet - Princeton WordNet Mapping
Ewa Rudnicka, Wojciech Witkowski, Maciej Piasecki
G4.19 Research Group, Institute of Informatics, Wrocław University of Technology
nlp.pwr.wroc.pl, plwordnet.pwr.wroc.pl
2
Outline
- What is a wordnet?
- Mapping plWordNet on Princeton WordNet
- Extending Princeton WordNet
- Applications
- Conclusions
3
What is a wordnet? (1)
A huge electronic lexico-semantic database (a kind of thesaurus).
Basic building blocks:
- lemma – the base form that stands for the different inflectional forms and different meanings of a word, e.g. czwórka – 'good' (as a school grade)
- lexical unit – a lemma plus sense pair (in wordnets marked with a number), e.g. czwórka 3 (por – 'communication')
- synset – a set of synonymous lexical units, e.g. {czwórka 3 (por), czwóra 1 (por)}
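The three building blocks can be pictured as simple data types. The sketch below is only an illustration: the class and field names (`LexicalUnit`, `Synset`, `domain`) are assumptions made for this example and are not taken from the plWordNet or Princeton WordNet software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalUnit:
    """A lemma paired with a sense number, plus an optional domain tag."""
    lemma: str
    sense: int
    domain: str = ""

    def __str__(self) -> str:
        tag = f" ({self.domain})" if self.domain else ""
        return f"{self.lemma} {self.sense}{tag}"


@dataclass(frozen=True)
class Synset:
    """A set of lexical units treated as synonymous."""
    units: frozenset

    def __str__(self) -> str:
        return "{" + ", ".join(sorted(str(u) for u in self.units)) + "}"


# The example from the slide: two lexical units forming one synset.
czworka_3 = LexicalUnit("czwórka", 3, "por")
czwora_1 = LexicalUnit("czwóra", 1, "por")
synset = Synset(frozenset({czworka_3, czwora_1}))
print(synset)
```

One lemma (czwórka) can yield several lexical units, one per sense, and each of them may end up in a different synset.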
4
What is a wordnet? (2)
Both lexical units and synsets are linked via different lexico-semantic relations such as synonymy, near-synonymy, hypernymy/hyponymy, meronymy/holonymy and fuzzynymy.
Examples:
Lexical relations:
- czwórka 3 (por) has a derivativity relation to czwórka 4 (por)
- czwórka 3 (por) has an expressiveness relation to czwóra 1 (por)
Synset relations:
- {czwórka 3 (por), czwóra 1 (por)} is a hyponym of {stopień 3 (il), ocena 1 (il), nota 3 (il)}
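Relations at both levels can be stored as typed, directed edges. The following is a minimal sketch under that assumption; the store layout and the helper names `add_relation` and `targets` are hypothetical, and lexical units and synsets are written as plain strings for brevity.

```python
from collections import defaultdict

# Relation store: (source, relation name) -> set of targets.
relations = defaultdict(set)


def add_relation(source: str, name: str, target: str) -> None:
    """Record a directed, named relation from source to target."""
    relations[(source, name)].add(target)


def targets(source: str, name: str) -> set:
    """Return everything that source points to via the named relation."""
    return set(relations.get((source, name), set()))


# Lexical relations from the slide (between lexical units).
add_relation("czwórka 3 (por)", "derivativity", "czwórka 4 (por)")
add_relation("czwórka 3 (por)", "expressiveness", "czwóra 1 (por)")

# Synset relation from the slide (between synsets).
add_relation(
    "{czwórka 3 (por), czwóra 1 (por)}",
    "hyponym_of",
    "{stopień 3 (il), ocena 1 (il), nota 3 (il)}",
)

print(targets("czwórka 3 (por)", "expressiveness"))
```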
5
Princeton WordNet
Princeton WordNet (Fellbaum 1998):
- the first wordnet ever built on psycholinguistic principles – mapping the structure of human lexical memory (cf. Miller 1998)
- taxonomic hierarchies for nouns, entailment relations for verbs, antonym relations for adjectives
- synsets represent 'lexicalised concepts' (cf. Miller 1998)
- synsets are built of lexical units linked by the synonymy relation, understood as a conceptual relation established on the basis of linguists' intuitions and dictionary definitions
- no major changes since 2006; last version: 2012
6
plWordNet - Słowosieć
plWordNet (plWN):
- developed fairly independently of Princeton WordNet (PWN) by applying a unique corpus-based method
- one of the biggest existing wordnets

Number of        plWN      PWN       enWN
lemmas           156,402   155,593   157,541
lexical units    220,129   206,978   209,147
synsets          162,629   117,659   119,290
7
plWordNet vs. Princeton WordNet
- the emphasis on relations between lexical units
- new relations specially designed to cover the peculiarities of the morphosyntactic structure of Polish (cf. Piasecki et al. 2009, Maziarz et al. 2012)
- synsets built of lexical units sharing the same set of constitutive relations, such as hyponymy, hypernymy, meronymy, holonymy
- partly linked to Princeton WordNet (cf. Rudnicka et al. 2012)
8
Mapping plWordNet on Princeton WordNet
Goal: linking plWordNet synsets with Princeton WordNet synsets
Steps:
- Defining a set of inter-lingual relations and setting their hierarchy
- Designing mapping procedures for nouns and adjectives
Mapping direction: plWordNet > Princeton WordNet
Bottom-up approach – starting from the lowest levels in the hierarchy
Currently mapped lexical categories: nouns (most of them), adjectives (about half)
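One way to read the bottom-up approach is as an ordering of synsets by depth, deepest first. The sketch below assumes, for simplicity, that every synset has at most one direct hypernym (real wordnets allow several); the function name `bottom_up_order` and the toy English hierarchy are hypothetical and not part of the described procedure.

```python
def bottom_up_order(hypernym_of: dict) -> list:
    """Return synsets ordered deepest-first.

    hypernym_of maps each synset to its direct hypernym (None for a root).
    """
    def depth(synset):
        d = 0
        while hypernym_of.get(synset) is not None:
            synset = hypernym_of[synset]
            d += 1
        return d

    # Deeper synsets come first; ties are broken alphabetically.
    return sorted(hypernym_of, key=lambda s: (-depth(s), s))


# Toy hierarchy, invented for this example.
toy = {
    "entity": None,
    "food": "entity",
    "soup": "food",
    "bread": "food",
    "rye bread": "bread",
}
print(bottom_up_order(toy))
```

Processing the synsets in this order means a synset is mapped before its hypernyms are considered.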
9
Automatic prompts
Two systems, based on:
1) a relaxation labeling algorithm (nouns)
2) rules relying on the network of the existing intra- and inter-lingual relations (adjectives)
Resource: cascade dictionary
Generated prompts:
- visible in the form of special links in the WordNetLoom editing system
- verified by lexicographers
10
A set of inter-lingual relations and current statistics
A set of inter-lingual relations between plWN and PWN, inspired by:
- inter-lingual relations from EuroWordNet (Vossen 2002)
- intra-lingual relations from plWordNet (Maziarz et al. 2011)
Statistics of the established inter-lingual links:

Relation                        Nouns    Adjectives
1. Synonymy                     28 736   3 199
2. Partial synonymy              2 580   1 003
3. Inter-register synonymy       1 510      35
4. Hyponymy                     57 029   6 561
5. Hypernymy                     3 744      34

Relations with a single reported figure:
6. Meronymy                      6 034
7. Holonymy                      1 204
8. Cross-categorial synonymy     3 891
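To see how large the share of hyponymy links is, the figures can be turned into proportions. The calculation below is a rough sketch that uses only relations 1 to 5, for which both a noun and an adjective figure are given; relations 6 to 8 are left out because each has only a single figure, so the shares are slightly overstated relative to the full set of links.

```python
# Link counts copied from the statistics above (relations 1-5 only).
links = {
    "synonymy": {"nouns": 28736, "adjectives": 3199},
    "partial synonymy": {"nouns": 2580, "adjectives": 1003},
    "inter-register synonymy": {"nouns": 1510, "adjectives": 35},
    "hyponymy": {"nouns": 57029, "adjectives": 6561},
    "hypernymy": {"nouns": 3744, "adjectives": 34},
}


def share(relation: str, category: str) -> float:
    """Fraction of the category's links (relations 1-5) that use this relation."""
    total = sum(row[category] for row in links.values())
    return links[relation][category] / total


for category in ("nouns", "adjectives"):
    print(f"{category}: hyponymy share = {share('hyponymy', category):.1%}")
```

Under this restriction, hyponymy accounts for about 61% of the links in both categories, against roughly 30% for synonymy.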
11
Motivation for the extension of Princeton WordNet
The high percentage of inter-lingual hyponymy links between plWordNet and Princeton WordNet synsets:
- is due to a number of lexical coverage gaps in Princeton WordNet and the resulting impossibility of establishing much more informative and useful inter-lingual synonymy links
- can be used as ‘pointers’ to specific Princeton WordNet gaps (‘missing’ lexical units) and to whole ‘empty nests’ (several missing co-hyponyms of one hypernym synset) in the network
12
Inter-lingual hyponymy links
13
General extension procedure
The starting point: existing inter-lingual hyponymy links
Lemmas of plWordNet synsets are translated by a cascade dictionary, which combines several traditional dictionaries, with the data ordered in a hierarchy of importance (the topmost dictionary has the highest priority)
The results are filtered by the lemmas of Princeton WordNet to obtain:
- a list of plWN lemmas with the ‘equivalent’ cascade dictionary lemmas absent from PWN
- a list of plWN lemmas without the ‘equivalent’ cascade dictionary lemmas
- a list of plWN lemmas with the ‘equivalent’ cascade dictionary lemmas present in PWN
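The lookup and filtering steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the cascade dictionary is modelled as a priority-ordered list of plain dicts whose translations are merged in priority order, a lemma counts as "present in PWN" if any of its equivalents is a PWN lemma, and all function names and toy entries are hypothetical.

```python
def cascade_lookup(lemma, dictionaries):
    """Merge translations from all dictionaries, highest priority first."""
    seen, merged = set(), []
    for dictionary in dictionaries:  # ordered from most to least important
        for translation in dictionary.get(lemma, []):
            if translation not in seen:
                seen.add(translation)
                merged.append(translation)
    return merged


def split_by_coverage(plwn_lemmas, dictionaries, pwn_lemmas):
    """Sort plWN lemmas into the three lists named in the procedure."""
    absent_from_pwn, no_equivalent, present_in_pwn = [], [], []
    for lemma in plwn_lemmas:
        equivalents = cascade_lookup(lemma, dictionaries)
        if not equivalents:
            no_equivalent.append(lemma)
        elif any(e in pwn_lemmas for e in equivalents):
            present_in_pwn.append(lemma)
        else:
            absent_from_pwn.append(lemma)
    return absent_from_pwn, no_equivalent, present_in_pwn


# Toy data, invented for this example (not real dictionary or PWN content).
primary = {"chleb": ["bread"], "żurek": ["sour rye soup"]}
secondary = {"chleb": ["loaf", "bread"]}
toy_pwn_lemmas = {"bread", "loaf", "soup"}

absent, missing, present = split_by_coverage(
    ["chleb", "żurek", "kogel-mogel"], [primary, secondary], toy_pwn_lemmas
)
print(absent, missing, present)
```

The first list is the interesting one for the extension: its lemmas have an English equivalent that the toy PWN does not yet contain.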
14
Extension procedure
- Start at the lowest level of the hierarchy in order not to change the structure of the original Princeton WordNet
- Verification of the suggested English equivalent(s) in corpora and other reliable sources, on the basis of:
  - the researcher’s knowledge
  - dictionaries
  - frequency lists from corpora
- Creation of the new Princeton WordNet synset
- The synset is linked:
  - via an intra-lingual hyponymy relation to a proper PWN hypernym synset
  - via an inter-lingual synonymy relation to its direct counterpart in plWordNet
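The linking step can be sketched with links stored as (source, relation, target) triples. The relation labels, the function name `add_pwn_synset` and the toy synsets are assumptions made for this example. Dropping the old inter-lingual hyponymy link follows the goal stated in the conclusions, namely replacing such links with inter-lingual synonymy.

```python
def add_pwn_synset(links: set, new_synset: str, pwn_hypernym: str, plwn_synset: str) -> None:
    """Add a new English synset and re-link its Polish counterpart."""
    # Intra-lingual link: the new synset goes under an existing PWN hypernym.
    links.add((new_synset, "hyponymy", pwn_hypernym))
    # Inter-lingual link: the plWN synset now has a direct counterpart.
    links.add((plwn_synset, "i-synonymy", new_synset))
    # The coarser inter-lingual hyponymy link is replaced by the one above.
    links.discard((plwn_synset, "i-hyponymy", pwn_hypernym))


# Toy state before the extension: the Polish synset can only be linked
# to a more general English synset (invented example).
links = {("{żurek 1}", "i-hyponymy", "{soup 1}")}

add_pwn_synset(links, "{sour rye soup 1}", "{soup 1}", "{żurek 1}")
print(sorted(links))
```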
15
Extension results
Each added synset is provided with:
- a definition (major source: English Wikipedia)
- a usage example from a corpus or other reliable English source
Total number of selected plWN synsets: 42785
Domains selected for the first stage:
- shape (156)
- substance (1181)
- quantity (547)
- food (885)
- property (1492)
16
Extension via plWN. Pros and cons
Pros:
- There is a definite vocabulary basis for the extension
- New synsets can be easily and safely located in the structure of the original PWN
Cons:
- Polish orientation of the extension
- Addition of lexical units related to strictly Polish domains
17
Extension via corpora data. An alternative strategy
This extension procedure uses frequency lists derived from:
- British National Corpus
- WaCky corpus
- Corpus of Contemporary American English
- American National Corpus
- English Wikipedia
Independent of plWordNet
Criterion for inclusion of a new lexical unit: its appearance in five different texts
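The inclusion criterion amounts to a document-frequency threshold. The sketch below assumes whitespace tokenisation with lowercasing and no lemmatisation, which is much cruder than what real frequency lists would use; the function name `candidates`, the `min_texts` parameter and the sample texts are hypothetical.

```python
from collections import Counter


def candidates(texts, known_lemmas, min_texts=5):
    """Return words absent from known_lemmas that occur in at least min_texts texts."""
    doc_freq = Counter()
    for text in texts:
        # Count each word once per text, however often it repeats there.
        doc_freq.update(set(text.lower().split()))
    return sorted(
        word for word, n in doc_freq.items()
        if n >= min_texts and word not in known_lemmas
    )


# Toy data, invented for this example.
texts = [
    "selfie stick",
    "a selfie",
    "selfie time selfie",
    "my selfie",
    "selfie again",
    "blog post",
]
known = {"a", "my", "time", "stick", "post", "again"}
print(candidates(texts, known))
```

Counting texts rather than raw occurrences keeps a word that is repeated many times in a single document from qualifying on its own.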
18
Pros and cons
Pros:
- English-oriented
- no Polish bias
Cons:
- new synsets have to be introduced at different levels of the PWN hierarchy
- there is a risk of changing the structure of the original PWN
19
Cross-lingual Applications
Cross-lingual:
- Semantic searching
- Semantic indexing of texts
- Text classification
- Statistical semantic analysis of corpora in different languages
- Information Extraction
- Machine Translation
Multi-lingual:
- Princeton WordNet 3.1 is linked to more than 60 languages
20
Conclusions
- The created bilingual resource will become a gateway to CLARIN bilingual resources
- It has a number of practical applications
- Princeton WordNet can be enriched and updated
- Extension of Princeton WordNet allows one to replace the existing inter-lingual hyponymy links between plWN and PWN synsets with more precise and useful inter-lingual synonymy links
21
References
- Fellbaum, Ch. (ed.). 1998. WordNet: An Electronic Lexical Database. MIT Press: Cambridge, Massachusetts.
- Maziarz, M., Piasecki, M. and Szpakowicz, S. 2012. Approaching plWordNet 2.0. Proceedings of the 6th Global Wordnet Conference, Matsue.
- Piasecki, M., Maziarz, M., Szpakowicz, S. and Rudnicka, E. 2014. plWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources. In Proceedings of the 7th International Global Wordnet Conference.
- Princeton WordNet. http://wordnet.princeton.edu/wordnet/
- Rudnicka, E., Maziarz, M., Piasecki, M. and Szpakowicz, S. 2012. A Strategy of Mapping Polish WordNet onto Princeton WordNet. In Proceedings of COLING 2012. ACL.
- Słowosieć. http://plwordnet.pwr.wroc.pl/wordnet/
- Vossen, P. (ed.). 2002. EuroWordNet. General Document. Amsterdam.