WordNet WordNet, WSD
WordNet What is WordNet? Miller 95: “WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms, each representing a lexicalized concept. Semantic relations link the synonym sets.”
WordNet Go to the main WordNet site: http://wordnet.princeton.edu/ Open the wordnet folder on pongo: ~/dropbox/570/wordnet/dict
WordNet Vocabulary See glossary at: http://wordnet.princeton.edu/gloss
  synset: a synonym set; a set of words that are interchangeable in some context
  lemma: lower case ASCII text of a word as found in the WordNet database index files
  lexical pointer: indicates a relation between words in synsets
Navigating WordNet files
  data.* files: the actual network files (synsets)
  index.* files: contain lower case instances of all words in WordNet, with pointers to the synset entries in the network
WordNet data file See: wndb
  00045430 04 n 01 performance 3 003 @ 00033580 n 0000 ~ 00045680 n 0000 ~ 00045874 n 0000 | any recognized accomplishment; "they admired his performance under stress"
  00045680 04 n 01 overachievement 0 003 @ 00045430 n 0000 + 02537922 v 0101 ! 00045874 n 0101 | better than expected performance (better than might have been predicted from intelligence tests)
  Fields of the first entry, left to right:
  00045430: synset file offset
  04: file number
  n: synset type
  01: # words in synset
  performance 3: word, followed by its lex_id (see wndb)
  003: # pointers to other synsets
  @ 00033580 n 0000: one pointer (type of pointer, synset file offset it points to, POS, source/target field)
  Text after "|": the gloss
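The field layout above can be read mechanically. The following is a minimal Python sketch, not an official WordNet API: the names `Pointer`, `Synset`, and `parse_data_line` are made up for illustration, and the parser assumes the wndb conventions that the word count is hexadecimal, the pointer count is decimal, and each word is followed by a lex_id. Verb frames, which can follow the pointers in data.verb, are ignored.

```python
from typing import List, NamedTuple


class Pointer(NamedTuple):
    symbol: str         # pointer type, e.g. "@" (hypernym) or "~" (hyponym)
    offset: str         # synset file offset of the target, kept as a string
    pos: str            # part of speech of the target synset
    source_target: str  # 4-digit source/target field ("0000" = whole synset)


class Synset(NamedTuple):
    offset: str
    lex_filenum: str
    ss_type: str
    words: List[str]
    pointers: List[Pointer]
    gloss: str


def parse_data_line(line: str) -> Synset:
    """Parse one line of a data.* file (hypothetical helper, not a WordNet API)."""
    head, _, gloss = line.partition("|")
    f = head.split()
    offset, lex_filenum, ss_type = f[0], f[1], f[2]
    w_cnt = int(f[3], 16)          # word count is a 2-digit hexadecimal number
    i = 4
    words = []
    for _ in range(w_cnt):
        words.append(f[i])         # f[i + 1] is the lex_id, skipped here
        i += 2
    p_cnt = int(f[i])              # pointer count is a 3-digit decimal number
    i += 1
    pointers = []
    for _ in range(p_cnt):
        pointers.append(Pointer(f[i], f[i + 1], f[i + 2], f[i + 3]))
        i += 4
    return Synset(offset, lex_filenum, ss_type, words, pointers, gloss.strip())


LINE = ('00045430 04 n 01 performance 3 003 @ 00033580 n 0000 '
        '~ 00045680 n 0000 ~ 00045874 n 0000 | any recognized accomplishment; '
        '"they admired his performance under stress"')

example = parse_data_line(LINE)
hyponyms = [p.offset for p in example.pointers if p.symbol == "~"]
print(example.words, hyponyms)
```

Offsets are kept as strings so that they can be compared directly with the 8-digit values that appear in the index.* files.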
Pointer symbols See: wninput
  For nouns:
  !   Antonym
  @   Hypernym
  ~   Hyponym
  #m  Member holonym
  #s  Substance holonym
  #p  Part holonym
  %m  Member meronym
  %s  Substance meronym
  %p  Part meronym
  =   Attribute
  +   Derivationally related form
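WordNet index file
  abomination n 3 2 @ + 3 0 09613960 07401317 00734041
  Fields, left to right:
  abomination: lemma (word)
  n: POS
  3: # synsets
  2: # pointers
  @ +: pointers (the pointer types the lemma has in its synsets)
  3 0: sense count (repeats # synsets) and tagged sense count (see wndb)
  09613960 07401317 00734041: synset file offsets, one per synset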
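Since each synset file offset in the index is a byte offset into the matching data.* file, a lookup is "parse the index line, then seek". The sketch below illustrates this under stated assumptions: `parse_index_line` and `read_synset_line` are hypothetical helper names, and the in-memory `fake_data` bytes are an invented stand-in for a real data.noun file, which would normally be opened in binary mode from the dict folder.

```python
import io


def parse_index_line(line: str) -> dict:
    """Split one index.* line into named fields (hypothetical helper)."""
    f = line.split()
    synset_cnt = int(f[2])
    p_cnt = int(f[3])
    rest = f[4 + p_cnt:]           # fields after the pointer symbols
    return {
        "lemma": f[0],
        "pos": f[1],
        "synset_cnt": synset_cnt,
        "ptr_symbols": f[4:4 + p_cnt],
        "sense_cnt": int(rest[0]),
        "tagsense_cnt": int(rest[1]),
        "offsets": rest[2:2 + synset_cnt],
    }


def read_synset_line(data_file, offset: str) -> str:
    """Seek to a synset's byte offset in a binary data.* file and read its line."""
    data_file.seek(int(offset))
    return data_file.readline().decode("ascii").rstrip("\n")


entry = parse_index_line("abomination n 3 2 @ + 3 0 09613960 07401317 00734041")

# Invented stand-in for data.noun: 11 filler bytes, then a record at offset 11.
fake_data = io.BytesIO(b"xxxxxxxxxx\n00000011 04 n 01 demo 0 000 | made-up gloss\n")
record = read_synset_line(fake_data, "00000011")
print(entry["offsets"], record)
```

With the real files, the same call would take an offset from `entry["offsets"]` and a handle on data.noun.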
WordNet tools Many, many tools
  General documentation: http://wordnet.princeton.edu/doc
  Online query and lookup: http://wordnet.princeton.edu/perl/webwn
  APIs and tools: http://wordnet.princeton.edu/links
  WordNet::Similarity: http://wn-similarity.sourceforge.net/
  WordNet::Similarity web interface: http://marimba.d.umn.edu/cgi-bin/similarity/similarity.cgi
WordNet and WSD Mihalcea 2002 describes a system for sense-tagging text using WordNet (and related tools and resources)
Mihalcea 2002 Some tools and resources described: Senseval http://www.senseval.org/
  Evaluation exercises for Word Sense Disambiguation
  Senseval-1 through Senseval-3 held over the last several years, with workshops at ACL
  Senseval-4 coming up
  Data and materials from Senseval-3 can be downloaded
  Some useful materials for multiple languages
  Materials and test data for English, Italian, Basque, Catalan, Chinese, Romanian, and Spanish
Mihalcea 2002 Some tools and resources described: SemCor
  Sense-tagged Brown corpus
  Created at Princeton
  Used for training WSD systems
  Can be downloaded from Mihalcea’s web site: http://www.cs.unt.edu/~rada/downloads.html
  We’re also planning on installing it on Pongo
McCarthy et al 2004
  Task: find the predominant word senses in untagged text
  Unlike Mihalcea 2002, did not rely on a supervised method using SemCor
  Built a thesaurus from raw text and WordNet
  Intuition: the predominant sense of a word is affected by genre, domain, or text type, so it is more likely to be determined from context in an untagged corpus than from SemCor’s 250,000 words, where the word senses are rather limited
McCarthy et al
  Thesaurus development relies on dependencies between “neighbors”
  Look at distributional similarities between a word and its neighbors
McCarthy et al
  Experimented with several similarity measures available in WordNet::Similarity
  First experiment used SemCor to see how well the unsupervised system worked
  2595 polysemous nouns in SemCor
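One way to picture how thesaurus neighbors and a WordNet similarity measure could be combined into a sense ranking is the sketch below. It is an illustration only, not the authors' code. It assumes each sense is scored by summing, over the neighbors, the neighbor's distributional similarity times that sense's normalized share of WordNet similarity to the neighbor. The words, sense labels, and all numbers are invented, and `toy_wn_sim` is a hypothetical stand-in for a WordNet::Similarity measure.

```python
def rank_senses(senses, neighbors, wn_sim):
    """Rank senses of a target word, highest score first.

    senses    -- list of sense labels for the target word
    neighbors -- dict mapping a neighbor word to its distributional similarity
    wn_sim    -- function (sense, neighbor) -> WordNet-based similarity
    """
    scores = {s: 0.0 for s in senses}
    for neighbor, dist_sim in neighbors.items():
        total = sum(wn_sim(s, neighbor) for s in senses)
        if total == 0:
            continue  # this neighbor says nothing about any sense
        for s in senses:
            # weight the neighbor's vote by its distributional similarity
            scores[s] += dist_sim * wn_sim(s, neighbor) / total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented toy data for the noun "star".
TOY_SIM = {
    ("star#celestial", "planet"): 0.75,
    ("star#celebrity", "planet"): 0.25,
    ("star#celestial", "actor"): 0.25,
    ("star#celebrity", "actor"): 0.75,
}


def toy_wn_sim(sense, neighbor):
    return TOY_SIM.get((sense, neighbor), 0.0)


ranking = rank_senses(
    ["star#celestial", "star#celebrity"],
    {"planet": 0.5, "actor": 0.25},
    toy_wn_sim,
)
print(ranking)
```

In this toy corpus "planet" is the closer neighbor, so the celestial sense ranks first. With neighbors drawn from a different domain, the same function could rank the other sense first, which is the point of estimating predominance from untagged text.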
McCarthy et al
  Experiment #2: against SENSEVAL-2 English All Words data
  Comparison between the precision and recall for SemCor vs. their automatic data (and the SENSEVAL ceiling)
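For reference, precision and recall in this kind of WSD comparison can be computed as in the sketch below. It assumes the usual Senseval-style definitions: precision is correct answers over attempted instances, and recall is correct answers over all gold instances. The function name, instance ids, and sense labels are invented for illustration.

```python
def precision_recall(gold, predicted):
    """Score sense predictions against gold tags.

    gold      -- dict: instance id -> correct sense
    predicted -- dict: instance id -> predicted sense (missing or None = not attempted)
    """
    attempted = [k for k in gold if predicted.get(k) is not None]
    correct = sum(1 for k in attempted if predicted[k] == gold[k])
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented toy data: 4 gold instances, 3 attempted, 2 correct.
gold = {"d1.w1": "bank%1", "d1.w2": "star%2", "d1.w3": "pipe%1", "d1.w4": "plant%2"}
predicted = {"d1.w1": "bank%1", "d1.w2": "star%1", "d1.w3": "pipe%1"}

p, r = precision_recall(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f}")
```

A system that answers every instance has equal precision and recall, while one that abstains on some instances trades recall for precision.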
McCarthy et al Some experiments with domain specific corpora gave these results: