plWordNet as the Cornerstone of a Toolkit of Lexico-semantic Resources Marek Maziarz, Maciej Piasecki, Ewa Rudnicka, Stanisław Szpakowicz* G4.19 Research Group, Institute of Informatics, Wrocław University of Technology * School of Electrical Engineering and Computer Science, University of Ottawa
Wordnet as a Lexical Resource Princeton WordNet defines a de facto standard –large size and coverage –open access –thousands of applications Applications: dictionary vs knowledge representation Range of description Ideal size and natural development limits
plWordNet model: linguistic resource Wordnet vs ontology –O: a strict knowledge representation –W: concepts expressed entirely in a natural language –W: synonymy is a matter of degree –O: certainty and a rigorous construction –W: shaped by the lexico-semantic dependencies Alternative to formalisation –Corpus analysis and substitution tests –Minimal commitment: defining lexico-semantic relations without committing to any particular theory of lexical semantics or human cognition
plWordNet model: corpus-based development Main source of lexical knowledge: a very large monolingual corpus –tools for corpus browsing –semi-automatic knowledge extraction Additional sources: dictionaries and encyclopedias Lexical unit –lemma-sense pair –a linguistically motivated primitive
plWordNet model: synset definition Synsets –groups of lexical units sharing certain relations {afekt 1 'passion', uczucie 2 'feeling'} hypernym {miłość 1 'love', umiłowanie 1 'affection', kochanie 1 ~'loving'} Constitutive relations –fairly frequent (to describe many LUs) –shared among LUs (to define groups) –grounded in the linguistic tradition (to facilitate their consistent understanding) –used in other wordnets (to improve compatibility)
plWordNet model: non-relational aspects Constitutive features –stylistic registers –verb aspect –semantic verb classes Referred to in the relation definitions –e.g. relations limited to verbs of the same aspect and semantic class Glosses help wordnet editors Usage examples: direct links to the corpus
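The idea that a synset is a group of lexical units sharing the same constitutive relations can be sketched in a few lines of Python. This is only an illustration, not the plWordNet implementation: the `LexicalUnit` type, the `group_into_synsets` function and the string labels used as relation targets are all hypothetical, and the toy data simply restates the example from the slide.

```python
from collections import defaultdict
from typing import NamedTuple


class LexicalUnit(NamedTuple):
    """A lemma-sense pair, the primitive of the model."""
    lemma: str
    sense: int


def group_into_synsets(relations):
    """Group LUs whose sets of constitutive relations are identical.

    relations: dict mapping LexicalUnit -> set of (relation_name, target_label).
    Returns a list of synsets, each a sorted list of LexicalUnit.
    """
    groups = defaultdict(list)
    for lu, rels in relations.items():
        groups[frozenset(rels)].append(lu)
    return [sorted(lus) for lus in groups.values()]


# Toy data restating the slide; the target labels are placeholders (assumption).
UPPER = "afekt/uczucie"
LOWER = "miłość/umiłowanie/kochanie"
EXAMPLE = {
    LexicalUnit("afekt", 1): {("hyponym", LOWER)},
    LexicalUnit("uczucie", 2): {("hyponym", LOWER)},
    LexicalUnit("miłość", 1): {("hypernym", UPPER)},
    LexicalUnit("umiłowanie", 1): {("hypernym", UPPER)},
    LexicalUnit("kochanie", 1): {("hypernym", UPPER)},
}

if __name__ == "__main__":
    for synset in group_into_synsets(EXAMPLE):
        print("{" + ", ".join(f"{lu.lemma} {lu.sense}" for lu in synset) + "}")
```

In the real resource the grouping is an editorial decision supported by corpus evidence and substitution tests; the sketch only shows why shared relations are enough to delimit a group.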
Relation density Synset relation density in PWN 3.1 and in plWordNet 2.0
Size matters: lexical coverage Coverage of PWN/plWN for lemmas of different frequency in two similar 1.2G words corpora (Wikipedia)
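A coverage figure of this kind could be computed roughly as follows. This is a minimal sketch under assumptions: the slide does not say how the frequency bands were defined, so the function below uses simple "frequency at least t" thresholds, and the function name, the thresholds and the toy counts are all invented for illustration.

```python
from collections import Counter


def coverage_by_frequency(freqs, wordnet_lemmas, thresholds):
    """For each threshold t, return the share of corpus lemmas with
    frequency >= t that are also present in the wordnet.

    freqs: mapping lemma -> corpus frequency.
    wordnet_lemmas: set of lemmas described in the wordnet.
    thresholds: iterable of minimum frequencies (assumed banding scheme).
    """
    result = {}
    for t in thresholds:
        frequent = [lemma for lemma, n in freqs.items() if n >= t]
        if not frequent:
            result[t] = None  # no lemma reaches this frequency
            continue
        covered = sum(1 for lemma in frequent if lemma in wordnet_lemmas)
        result[t] = covered / len(frequent)
    return result


# Invented toy counts, not taken from any real corpus.
TOY_FREQS = Counter({"dom": 1200, "kot": 300, "miłość": 40, "afekt": 3})
TOY_WORDNET = {"dom", "kot", "afekt"}

if __name__ == "__main__":
    print(coverage_by_frequency(TOY_FREQS, TOY_WORDNET, [1, 10, 100, 1000]))
```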
Size matters: plWordNet 2.2 [Table; columns: POS, Synsets, Lemmas, LUs, Average synset; rows: Nouns, Verbs, Adjectives, All]
plWordNet: ongoing work
Size matters: comparison of wordnets
How many words are there? - existing dictionaries ● Woordenboek der Nederlandsche Taal 430k lemmas ● dictionary of Grimm brothers 330k lemmas ● Oxford English Dictionary 300k lemmas ● `Warsaw’ Polish Dictionary 280k lemmas ● contemporary Polish dictionaries 130k lemmas unabridged dictionaries
How many words are there? - approximation COBUILD data: ~174k (10+ lemmas)
How many words are there? # entries: Polish dictionaries k; plWordNet corpus (10+ lemmas) [K] 174k; doubled plWordNet corpus (0+ lemmas) [GT] +200k; plWordNet k lemmas. K - Krishnamurthy’s data (2002), GT - Good & Toulmin approximation (1956)
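The Good & Toulmin approximation estimates how many previously unseen types would appear if the sample were enlarged. With Φ_k the number of types seen exactly k times, the expected number of new types after enlarging the sample by a factor of (1 + t) is Φ_1·t − Φ_2·t² + Φ_3·t³ − …; doubling the corpus corresponds to t = 1. The sketch below applies that formula to a lemma frequency list. The slide does not show how the authors applied it, so the function name and the toy counts are assumptions made for illustration.

```python
from collections import Counter


def good_toulmin_new_types(freqs, t=1.0):
    """Good-Toulmin estimate of the number of new types expected when the
    sample grows from size N to size (1 + t) * N.

    freqs: mapping type -> observed frequency (all frequencies >= 1).
    t: relative size of the additional sample; t = 1.0 means doubling.
    The alternating series is known to be unstable for t > 1.
    """
    # Frequency spectrum: k -> number of types observed exactly k times.
    spectrum = Counter(freqs.values())
    return sum((-1) ** (k + 1) * (t ** k) * n_k for k, n_k in spectrum.items())


# Invented toy counts: three hapaxes, one type seen twice, one seen three times.
TOY_COUNTS = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 3}

if __name__ == "__main__":
    # Doubling the toy sample: 3 - 1 + 1 = 3 expected new types.
    print(good_toulmin_new_types(TOY_COUNTS, t=1.0))
```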
Toolkit of Lexico-semantic Resources Lexicon of lexico-syntactic structures of multi-word expressions plWordNet 3.0 (Słowosieć 3.0) plWordNet 3.0 to WordNet 3.1 mapping Semantic lexicon of proper names Mapping to an ontology And a valency lexicon linked to plWordNet
Lexicon of multi-word expressions Non-trivial morphology of Polish MWEs –more than 100 nominal structural patterns Description of the lexico-syntactic structures of MWEs Multi-word LUs as semantic atoms –no internal semantic relations Dynamic lexicon –a tool for automatic MWE extraction – described in the lexicon and plWordNet
Lexicon of Proper Names PNs are not a part of the lexicon PN is an instance of a type –characterised by referents –not by their semantic properties Linking PNs via a wordnet –some lexico-syntactic contexts signal the instance-of relation –PNs are represented in wordnets PNs as derivational bases for Common Nouns Dynamic lexicon with 2.5 million PNs verified manually
plWordNet to WordNet 3.1 mapping plWordNet: built independently to obtain a faithful description Manual mapping –bottom-up order –comparison of the relation structures –a cascading list of interlingual relations plWordNet verification as an important side effect Present state: N and Adj synsets mapped Target: complete plWordNet 3.0 mapped
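The "cascading list" can be read as an ordered preference: the editor tries the strongest interlingual relation first and falls back to weaker ones only when it does not hold. The sketch below shows that control flow only. The relation names in `CASCADE` and their order are assumptions (the slide does not list them), and the `holds` callback stands in for the human editor's judgment, since the mapping itself is manual.

```python
def first_applicable_relation(cascade, holds):
    """Return the first relation in the cascade for which holds(relation)
    is true, or None if no relation applies.

    cascade: relations ordered from most to least preferred.
    holds: callable deciding whether a relation links the two synsets;
           in practice this is an editor's decision, not a program.
    """
    for relation in cascade:
        if holds(relation):
            return relation
    return None


# Hypothetical cascade, ordered from strongest to weakest (assumption).
CASCADE = ["I-synonymy", "I-partial-synonymy", "I-hyponymy", "I-meronymy"]

if __name__ == "__main__":
    # Suppose the editor judged that only these two relations could apply.
    judged_applicable = {"I-hyponymy", "I-meronymy"}
    print(first_applicable_relation(CASCADE, judged_applicable.__contains__))
```

Keeping the cascade as plain data makes the preference order easy to inspect and change without touching the selection logic.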
Wordnet editor: WordnetLoom
WordnetLoom: editing the mapping
Mapping to ontology Ontology: unambiguous concepts defined formally Lexical meanings –imprecisely delimited –constrained by usage, stylistic register and sentiment Mapping to ontology –precise, formal description for meanings –association: concepts – their lexical embodiment SUMO selected –Princeton WordNet mapping –Semi-automated mapping of plWordNet
Expectations [Diagram: plWordNet 3.0; Valence lexicon; MWE lexicon; WordNet extension; Proper Names; Ontology: SUMO + intermediate level; link label: describes]
Applications Strong universal basis –a comprehensive wordnet > lemmas resulting in ~ LUs and ~ synsets –one of the largest ever Polish dictionaries Modularly constructed toolkit –a layered architecture of large software systems –separate but linked layers –each layer exchangeable and based on a limited set of notions and principles The core of the CLARIN-PL language technology infrastructure
Thank you!