Experiences of (Lexicographers and) Computer Scientists in Validating Estonian Wordnet with Test Patterns
Ahti Lohk | Kadri Vare | Heili Orav | Leo Võhandu
The 8th Meeting of the Global Wordnet Conference, Bucharest, January 27-30, 2016
Motivation: why validate?
Every expandable and developing human-machine system needs a feedback mechanism.
The quality of a wordnet has a strong impact on the quality of the NLP tasks that use it.
Multiple inheritance cases in the semantic hierarchies of a wordnet are prone to various semantic errors.
Main aim
To prove that the semantic hierarchies of wordnet-type dictionaries contain as yet undiscovered substructures that correspond to certain descriptions (test patterns), and that using these patterns to validate semantic hierarchies may improve wordnet structure significantly.
Previous work
Cycles (Smrž, 2004; Kubis, 2012)
Shortcuts (Fischer, 1997)
Rings (Liu et al., 2004; Richens, 2008)
Dangling uplinks (Koeva et al., 2004; Smrž, 2004)
Orphan nodes (null graphs) (Čapek, 2012)
An artificial hierarchy
[Figure: an artificial hierarchy]
An artificial hierarchy and specific substructures
1 Short cut
2 Heart-shaped substructure
3 Ring
4 Closed subset
5 Dense component
6 Connected roots
+ 4 substructures
Specific substructures = test patterns
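As an illustration of how one of the listed test patterns could be extracted automatically, here is a minimal Python sketch for the short cut. It assumes that a short cut means a direct hypernymy link that duplicates a longer hypernymy path to the same synset; the slides do not define the pattern, so this definition, the function name `find_shortcuts`, and the edge-list input format are all assumptions rather than the authors' implementation.

```python
from collections import defaultdict

def find_shortcuts(hypernym_edges):
    """Return direct (synset, hypernym) links that are also implied by a longer path.

    `hypernym_edges` is a hypothetical input format: (child, parent) pairs.
    """
    parents = defaultdict(set)
    for child, parent in hypernym_edges:
        parents[child].add(parent)

    def ancestors(start):
        # All synsets reachable from `start` by following hypernym links.
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            for p in parents.get(node, ()):
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen

    shortcuts = []
    for child, direct in list(parents.items()):
        for target in direct:
            # The link child -> target is redundant if target is also
            # reachable through one of the child's other direct hypernyms.
            if any(target in ancestors(other) for other in direct if other != target):
                shortcuts.append((child, target))
    return sorted(shortcuts)

# Toy data, not taken from any wordnet.
edges = [("poodle", "dog"), ("dog", "animal"), ("poodle", "animal")]
print(find_shortcuts(edges))
```

Each reported instance would still need to be judged by a lexicographer, since the sketch only finds candidates.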
Example 1: synset with many roots
[Figure]
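A synset with many roots can be found by walking up the hypernymy links from each synset and collecting the top-level synsets that are reached. The Python sketch below assumes that a root is a synset without any hypernym and that "many" means at least two; the function names and the (child, parent) edge format are hypothetical.

```python
from collections import defaultdict

def roots_per_synset(hypernym_edges):
    """Map every synset to the set of roots reachable via hypernymy."""
    parents = defaultdict(set)
    nodes = set()
    for child, parent in hypernym_edges:
        parents[child].add(parent)
        nodes.update((child, parent))

    def roots_of(start):
        found, seen, stack = set(), {start}, [start]
        while stack:
            node = stack.pop()
            ups = parents.get(node, ())
            if not ups:
                # No hypernym: this synset is a root (assumed definition).
                found.add(node)
            for p in ups:
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return found

    return {n: roots_of(n) for n in nodes}

def synsets_with_many_roots(hypernym_edges, min_roots=2):
    """Return the synsets that reach at least `min_roots` different roots."""
    return {
        synset: sorted(roots)
        for synset, roots in roots_per_synset(hypernym_edges).items()
        if len(roots) >= min_roots
    }

# Toy hierarchy: "x" reaches two roots, "y" reaches one.
edges = [("x", "a"), ("x", "b"), ("a", "r1"), ("b", "r2"), ("y", "a")]
print(synsets_with_many_roots(edges))
```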
Example 2: dense component
[Figure]
Example 3: "Compound" pattern
[Figure]
Example 4: connected roots
[Figure: side view and top view]
Estonian Wordnet iterative evolution
[Table: one row per EstWN version. Columns: Version | Noun roots | Verb roots | Multiple inheritance cases | Short cuts | Rings | Synset with many roots | Heart-shaped substructure | Dense component | "Compound" pattern | The largest closed subset]
Statistics of the correction operations
Over ten versions of EstWN (during 4 years):
21,911: hypernymy and hyponymy relations removed
5,344: lexical units in synsets changed
4,122: hypernymy and hyponymy relations replaced by another semantic relation, mainly near synonymy and fuzzynymy
Wordnets in comparison
[Table: one row each for Princeton WordNet, Finnish Wordnet, Cornetto, Polish Wordnet, and Estonian Wordnet. Columns: Wordnet | Version | Noun roots | Verb roots | Multiple inheritance cases | Short cuts | Rings | Synset with many roots | Heart-shaped substructure | Dense component | "Compound" pattern | The largest closed subsets]
Summary
In this presentation we studied how to validate the semantic hierarchies of a wordnet, and we proposed using test patterns, which are descriptions of substructures of a specific nature.
To demonstrate the efficiency of test patterns, we partially applied them over 10 versions of EstWN.
Instances of the different test patterns were extracted by our programs and validated by lexicographers.
We found that the number of multiple inheritance cases decreased by about 97 percent over the last five versions.
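The decrease in multiple inheritance cases can be measured by counting, for each version, the synsets that have more than one hypernym and comparing the counts. The following Python sketch shows the arithmetic on invented toy data; the function names and the edge-list format are assumptions, and the numbers are not EstWN figures.

```python
from collections import Counter

def multiple_inheritance_cases(hypernym_edges):
    """Return synsets with more than one direct hypernym."""
    # set() drops duplicate (child, parent) pairs before counting.
    counts = Counter(child for child, _ in set(hypernym_edges))
    return sorted(s for s, n in counts.items() if n > 1)

def percent_decrease(before, after):
    """Relative decrease from `before` to `after`, in percent (before must be > 0)."""
    return 100.0 * (before - after) / before

# Two hypothetical versions of the same small hierarchy.
old_version = [("x", "a"), ("x", "b"), ("y", "a"), ("y", "c"), ("z", "a")]
new_version = [("x", "a"), ("y", "a"), ("y", "c"), ("z", "a")]

old_cases = multiple_inheritance_cases(old_version)
new_cases = multiple_inheritance_cases(new_version)
print(len(old_cases), len(new_cases), percent_decrease(len(old_cases), len(new_cases)))
```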
Future work
Applying test patterns to:
other semantic relations
other wordnets