Published by Whitney Henry. Modified over 9 years ago.
1
Semantic distance & WordNet
Serge B. Potemkin
Moscow State University, Philological faculty
2
Distance and metrics
- Fundamental concept: the distance between the entities under consideration
- Here: the semantic distance between words or concepts
- Open question: does this distance satisfy the metric space axioms?
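For reference, the metric space axioms the slide asks about can be written as follows. These are the standard textbook definitions, not taken from the slides; ρ is used for the distance to match the notation of the later slides, and X stands for the set of words or concepts.

```latex
\begin{align*}
&\rho(x, y) \ge 0, \qquad \rho(x, y) = 0 \iff x = y
  && \text{(non-negativity, identity)} \\
&\rho(x, y) = \rho(y, x)
  && \text{(symmetry)} \\
&\rho(x, z) \le \rho(x, y) + \rho(y, z)
  && \text{(triangle inequality)}
\end{align*}
\qquad \text{for all } x, y, z \in X
```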
3
Distance is needed for:
- word sense disambiguation
- determining the structure of texts
- text summarization and annotation
- information extraction and retrieval
- automatic indexing
- lexical selection
- automatic correction of word errors in text
- …
4
Approaches to distance measuring:
- Corpora-based
- Dictionary-based
- Roget-structured thesauri
- WordNet and other semantic networks
5
WordNet
- Synonym sets (synsets)
- Subsumption hierarchy (hyponymy / hypernymy)
- 3 meronymic (PART-OF) relations: COMPONENT-OF, MEMBER-OF, SUBSTANCE-OF, and their inverses
- Antonymy, COMPLEMENT-OF
6
WordNet shortcomings:
- 150,000 synsets: inadequate coverage
- Non-English versions cover 20–70% of the English one (100,000 synsets for Russian)
- Extension is hard
- Distance measuring is controversial
7
Corpora-based approach
Two words wa and wb are the closer, the more often their neighbors (within ±5 words) coincide.
Example (distributional profile of the word star): space 0.28, movie 0.2, famous 0.13, light 0.09, rich 0.04, ...
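A minimal sketch of the corpus-based idea: build a distributional profile from the words within a ±5 window of each occurrence, then compare two profiles. The slide does not say how profiles are compared; cosine similarity is an assumption made here for illustration, and the function names `profile` and `cosine` are hypothetical.

```python
from collections import Counter
from math import sqrt


def profile(tokens, target, window=5):
    """Relative frequencies of words seen within `window` positions of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            counts.update(tokens[max(0, i - window):i])      # left neighbors
            counts.update(tokens[i + 1:i + 1 + window])      # right neighbors
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def cosine(p, q):
    """Cosine similarity of two sparse profiles (assumed comparison, not from the slides)."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


# Toy corpus, invented for illustration only.
tokens = "the star shines in space the sun shines in space".split()
star = profile(tokens, "star", window=2)
sun = profile(tokens, "sun", window=2)
print(round(cosine(star, sun), 3))
```

On a real corpus the profile would be dominated by function words unless they are filtered or down-weighted, which this sketch does not attempt.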
8
Dictionary-based approach
Two words wa and wb are the closer, the more words their definitions share.
Example: wa = linguistics, wb = stylistics
linguistics: {the, study, of, language, in, general, and, of, particular, languages, and, their, structure, and, grammar, and, history}
stylistics: {the, study, of, style, in, written, or, spoken, language}
2 words coincide in the definitions (study, language), not counting function words such as the, of, in.
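The definition-overlap count can be sketched as a set intersection. The stop-word list below is an assumption introduced to reproduce the slide's count of 2; the slide itself does not say which words are ignored, and the names `content_words` and `shared_words` are hypothetical.

```python
# Assumed stop-word list: function words occurring in the two definitions.
STOP_WORDS = frozenset({"the", "of", "in", "and", "or", "their"})

LINGUISTICS = ("the study of language in general and of particular "
               "languages and their structure and grammar and history")
STYLISTICS = "the study of style in written or spoken language"


def content_words(definition):
    """Set of definition words with the assumed function words removed."""
    return {w for w in definition.lower().split() if w not in STOP_WORDS}


def shared_words(def_a, def_b):
    """Content words that occur in both definitions."""
    return content_words(def_a) & content_words(def_b)


print(sorted(shared_words(LINGUISTICS, STYLISTICS)))  # ['language', 'study']
```

No stemming is applied, so "languages" and "language" count as different words here.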
9
Bilingual dictionary approach
Two words Wa and Wb are the closer, the more of their equivalents coincide.
ρ(Wa, Wb) = 1 / Σ n_i,
where the sum runs over all coinciding Russian equivalents and n_i is the number of dictionaries in which equivalent i occurs,
or
ρ(Wa, Wb) = Σ n_ai n_bi / (||aR|| ||bR||)
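Both formulas can be sketched in a few lines. Several points are assumptions rather than anything the slide states: each word is represented as a mapping from Russian equivalent to the number of dictionaries listing it, ||aR|| and ||bR|| are read as the Euclidean norms of those count vectors, and for the first formula the per-equivalent counts n_i are passed in directly because the slide does not say how they are combined across the two words. The function names and the sample counts are hypothetical.

```python
from math import inf, sqrt


def rho_inverse(shared_counts):
    """1 / sum(n_i) over coinciding equivalents; infinite if nothing coincides."""
    total = sum(shared_counts.values())
    return 1.0 / total if total else inf


def rho_normalized(counts_a, counts_b):
    """sum(n_ai * n_bi) / (||a|| * ||b||), reading ||.|| as the Euclidean norm."""
    dot = sum(n * counts_b.get(eq, 0) for eq, n in counts_a.items())
    norm_a = sqrt(sum(n * n for n in counts_a.values()))
    norm_b = sqrt(sum(n * n for n in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Invented counts: equivalent -> number of dictionaries listing it.
word_a = {"похвала": 3, "одобрение": 1}
word_b = {"похвала": 2, "награда": 2}

print(rho_inverse({"похвала": 2}))
print(round(rho_normalized(word_a, word_b), 3))
```

Note that, as written, the first expression shrinks as more equivalents coincide, while the second grows, so the second behaves like a similarity rather than a distance.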
10
Multidimensional scaling
- The semantic network is a graph: nodes are words, edges are links between words via the bilingual lexicon
- Edge length: || edge || = ρ(Wa, Wb)
- The graph can be embedded in an N-dimensional space, where N is the number of words in the lexicon (> 100,000)
- Multidimensional scaling is used for visualization
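A minimal sketch of how a distance matrix can be projected to a few dimensions for display. The slides do not say which scaling variant was used; classical (Torgerson) MDS with NumPy is assumed here, and the 3-word distance matrix is invented for illustration.

```python
import numpy as np


def classical_mds(dist, dims=2):
    """Classical MDS: coordinates whose Euclidean distances approximate `dist`."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering   # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dims]         # keep the `dims` largest
    top = np.clip(eigvals[order], 0.0, None)         # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top)


# Hypothetical pairwise distances between three words.
distances = [[0.0, 1.0, 3.0],
             [1.0, 0.0, 2.0],
             [3.0, 2.0, 0.0]]
coords = classical_mds(distances, dims=2)
print(coords.shape)  # (3, 2)
```

For a lexicon of more than 100,000 words the full distance matrix would be too large for this dense approach, so in practice the scaling would presumably be applied to a small neighborhood of the word under consideration.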
11
New synonyms
12
1-neighborhood of accolade
- Links between synonyms (black)
- Links between synonyms from the dictionary (green)
- 2 isolated clusters
13
Dominant in the acerbity neighborhood
- acerbity (терпкость, "astringency") is excluded
- The cluster (bold lines) is derived by a Markovian process
- asperity (резкость, "harshness") is the centre of the cluster
14
2 dominants for bicycling (wheel+crook)
15
Adjustable parameters
- space dimension
- minimal number of dictionaries linking synonyms
- maximal distance from the word under consideration
- maximal number of displayed words
- word excluded from clustering
- …
16
Compare LDB with WordNet (accolade)
Synset | WordNet # of syn. | LDB # of syn. | Synonyms in LDB
award | 3n+2v | 80 |
accolade | 1n | 8 | commendation, praise, approbation, applause, + honorable mention, mention, positive mention
honor = honour | 4n+3v | >100 |
laurels | 2n | 15 |
n = noun, v = verb
17
Controversy 1
- The immediate hypernym of the accolade synset in WordNet is symbol (an arbitrary sign (written or printed) that has acquired a conventional significance).
- The immediate hypernym of commendation (which is more frequent than accolade) is the accolade synset.
- Actually, accolade is a hyponym of commendation.
- It is impossible to disambiguate accolade (bracket) from accolade (praise).
18
Controversy 2
- WordNet: dog 1 («domestic dog») has the hypernym canine, canid; further up: mammal, …, entity.
- Neither animal nor pet is linked with dog as a hypernym.
- A tree structure is inadequate for semantic coding.
19
Conclusion
- Each meaning of a polysemous word can be coded as a pair (wE, wR), in contrast to synset coding.
- A metric superimposed on the LDB enables homograph disambiguation and the extraction of dominants.
- A network has particular advantages over a hierarchical representation of semantic relations.