
1 Lecture 24 Distributional Word Similarity II
CSCE Natural Language Processing
Topics: distributional word similarity example; PMI with context = syntactic dependencies
Readings: NLTK Book Chapter 2 (WordNet); Text Chapter 20
April 15, 2013

2 Overview
Last Time: finishing thesaurus-based similarity; distributional word similarity
Today: last lecture's slides, 21 onward; distributional word similarity II; syntax-based contexts
Readings: Text Chapters 19, 20; NLTK Book Chapter 10
Next Time: Computational Lexical Semantics II

3 Pointwise Mutual Information (PMI)
Mutual information (Church and Hanks 1989), eq 20.36:
    I(X; Y) = Σ_x Σ_y P(x, y) log2 [ P(x, y) / (P(x) P(y)) ]
Pointwise Mutual Information (PMI), Fano 1961, eq 20.37:
    PMI(w, c) = log2 [ P(w, c) / (P(w) P(c)) ]
assoc-PMI, eq 20.38:
    assoc-PMI(w, f) = log2 [ P(w, f) / (P(w) P(f)) ]
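To make eq 20.37 concrete, a minimal Python 3 sketch (not from the slides); the inputs are the probabilities worked out on slides 5-6:

import math

def pmi(p_wc, p_w, p_c):
    """Pointwise mutual information (eq 20.37): log2 of observed vs. chance co-occurrence."""
    return math.log(p_wc / (p_w * p_c), 2)

# p(w=information, c=data), p(w=information), p(c=data) from the PPMI example
print(pmi(0.32, 0.58, 0.37))   # ~0.58 bits: "information" and "data" co-occur more than chance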

4 Computing PPMI
Matrix F with W rows (words) and C columns (contexts); f_ij is the frequency of word w_i in context c_j.
    p_ij = f_ij / Σ_i Σ_j f_ij        p_i* = Σ_j p_ij        p_*j = Σ_i p_ij
    PPMI_ij = max( 0, log2 [ p_ij / (p_i* · p_*j) ] )

5 Example computing PPMI
              computer   data   pinch   result   salt
apricot           0        0      1       0        1
pineapple         0        0      1       0        1
digital           2        1      0       1        0
information       1        6      0       4        0

p(w=information, c=data) = ?    p(w=information) = ?    p(c=data) = ?
(Word Similarity: Distributional Similarity I -- NLP, Jurafsky & Manning)

6 Example computing PPMI
              computer   data   pinch   result   salt
apricot           0        0      1       0        1
pineapple         0        0      1       0        1
digital           2        1      0       1        0
information       1        6      0       4        0

p(w=information, c=data) = 6/19 = .32
p(w=information) = 11/19 = .58
p(c=data) = 7/19 = .37
(Word Similarity: Distributional Similarity I -- NLP, Jurafsky & Manning)
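A minimal NumPy sketch (assumed, not part of the slides) that reproduces these numbers from the count table above and then computes the PPMI matrix of slide 4:

import numpy as np

# Co-occurrence counts from the table above (rows: target words, cols: context words).
F = np.array([[0, 0, 1, 0, 1],    # apricot
              [0, 0, 1, 0, 1],    # pineapple
              [2, 1, 0, 1, 0],    # digital
              [1, 6, 0, 4, 0]],   # information
             dtype=float)
# columns: computer, data, pinch, result, salt

P = F / F.sum()                      # joint probabilities p_ij
p_w = P.sum(axis=1, keepdims=True)   # row marginals  p_i*
p_c = P.sum(axis=0, keepdims=True)   # column marginals  p_*j

with np.errstate(divide="ignore"):   # log2(0) = -inf; clipped to 0 below
    ppmi = np.maximum(np.log2(P / (p_w * p_c)), 0)

print(round(P[3, 1], 2), round(p_w[3, 0], 2), round(p_c[0, 1], 2))   # 0.32 0.58 0.37
print(round(ppmi[3, 1], 2))   # PPMI(information, data), about 0.57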

7 Associations

8 PMI: More data trumps smarter algorithms
From "More data trumps smarter algorithms: Comparing pointwise mutual information with latent semantic analysis" (Indiana University, 2009): "we demonstrate that this metric benefits from training on extremely large amounts of data and correlates more closely with human semantic similarity ratings than do publicly available implementations of several more complex models."

9 Figure 20.10 Co-occurrence vectors Based on syntactic dependencies
Dependency-based parsing (a special case of shallow parsing) identifies relations from "I discovered dried tangerines." (20.32):
    discover(subject I)
    I(subject-of discover)
    tangerine(obj-of discover)
    tangerine(adj-mod dried)

10 Defining Context using syntactic info
Dependency parsing vs. chunking:
    discover(subject I)         --  S → NP VP
    I(subject-of discover)
    tangerine(obj-of discover)  --  VP → Verb NP
    tangerine(adj-mod dried)    --  NP → det ? ADJ N
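A small Python 3 sketch of the idea: turn dependency triples into context features in both directions, as in the tangerine example. The triples are written by hand here; the slides assume they come from a dependency parser.

# Hand-written dependency triples for "I discovered dried tangerines."
parse = [("discovered", "subject", "I"),
         ("discovered", "object", "tangerines"),
         ("tangerines", "adj-mod", "dried")]

def syntactic_contexts(triples):
    """Yield (word, context_feature) pairs, one per direction of each relation."""
    for head, rel, dep in triples:
        yield head, "%s(%s)" % (rel, dep)        # e.g. discovered -> subject(I)
        yield dep, "%s-of(%s)" % (rel, head)     # e.g. I -> subject-of(discovered)

for word, feature in syntactic_contexts(parse):
    print(word, feature)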

11 Figure 20.11 Objects of the verb drink Hindle 1990 ACL
By raw frequency, it, much, and anything are more frequent objects than wine; by PMI-Assoc, wine ranks as more "drinkable". (Values missing from the transcript are left blank.)

Object          Count   PMI-Assoc
tea               4       11.75
Pepsi             2
champagne
liquid                    10.53
beer              5       10.20
wine                       9.34
water             7        7.65
anything          3        5.15
much                       2.54
it                         1.25
<some amount>              1.22
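A rough Python 3 sketch of why raw frequency and PMI association diverge; all counts below are made up for illustration, not Hindle's data:

import math

N = 1000000                                      # total (verb, object) pairs in a hypothetical corpus
count_drink = 500                                # occurrences of the verb "drink"
count_obj = {"it": 50000, "wine": 300}           # overall object frequencies
count_drink_obj = {"it": 20, "wine": 15}         # co-occurrences with "drink"

for obj in count_drink_obj:
    p_joint = count_drink_obj[obj] / N
    p_verb = count_drink / N
    p_obj = count_obj[obj] / N
    assoc = math.log(p_joint / (p_verb * p_obj), 2)
    print(obj, round(assoc, 2))    # "it" scores near 0; the rarer but selective "wine" scores high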

12 Vectors review: dot product, length, cosine similarity
    v · w = Σ_i v_i w_i
    |v| = sqrt( Σ_i v_i^2 )
    sim_cosine(v, w) = (v · w) / (|v| |w|)
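A minimal Python 3 rendering of these three definitions (assumed, not from the slides):

import math

def dot(v, w):
    return sum(vi * wi for vi, wi in zip(v, w))

def length(v):
    return math.sqrt(dot(v, v))

def sim_cosine(v, w):
    """Cosine of the angle between two context vectors."""
    return dot(v, w) / (length(v) * length(w))

# e.g. the "digital" and "information" count rows from slides 5-6
print(sim_cosine([2, 1, 0, 1, 0], [1, 6, 0, 4, 0]))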

13 Figure 20.10 Co-occurrence vectors
Dependency-based parsing (a special case of shallow parsing) identifies relations from "I discovered dried tangerines." (20.32):
    discover(subject I)
    I(subject-of discover)
    tangerine(obj-of discover)
    tangerine(adj-mod dried)

14 Figure 20.11 Objects of the verb drink Hindle 1990

15 Vectors review: dot product, length, cosine similarity

16 Figure 20.12 Similarity of Vectors

17 Fig 20.13 Vector Similarity Summary

18 Figure 20.14 Hand-built patterns for hypernyms Hearst 1992
Finding hypernyms (IS-A links)
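As a sketch of the idea behind Figure 20.14, one Hearst-style pattern ("X such as Y") written as a plain regular expression in Python 3; the actual hand-built patterns operate over noun-phrase-chunked text, so this is only an approximation:

import re

# "X such as Y1, Y2, and Y3" suggests each Yi IS-A X.
SUCH_AS = re.compile(r"(\w+) such as ([\w, ]+)")

text = "He reads European authors such as Goethe, Dante, and Cervantes."
for hypernym, hyponym_list in SUCH_AS.findall(text):
    for hyponym in hyponym_list.replace(",", " ").split():
        if hyponym not in ("and", "or"):
            print(hyponym, "IS-A", hypernym)    # e.g. Goethe IS-A authors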

19 Figure 20.15

20 Figure 20.16

21 Google: WordNet NLTK

22 wn01.py
# WordNet examples from nltk.googlecode.com
import nltk
from nltk.corpus import wordnet as wn

motorcar = wn.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms()   # hyponyms = more specific synsets (kinds of car)
types_of_motorcar[26]
print wn.synset('ambulance.n.01')
# names of all lemmas of all kinds of motorcar, sorted alphabetically
print sorted([lemma.name for synset in types_of_motorcar
              for lemma in synset.lemmas])

23 wn01.py continued
print "wn.synsets('dog', pos=wn.VERB)= ", wn.synsets('dog', pos=wn.VERB)
print wn.synset('dog.n.01')
### Synset('dog.n.01')
print wn.synset('dog.n.01').definition
### 'a member of the genus Canis (probably descended from the common wolf) that has
###  been domesticated by man since prehistoric times; occurs in many breeds'
print wn.synset('dog.n.01').examples
### ['the dog barked all night']

24 wn01.py continued
print wn.synset('dog.n.01').lemmas
### [Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'), Lemma('dog.n.01.Canis_familiaris')]
print [lemma.name for lemma in wn.synset('dog.n.01').lemmas]
### ['dog', 'domestic_dog', 'Canis_familiaris']
print wn.lemma('dog.n.01.dog').synset

25 Section 2 synsets, hypernyms, hyponyms
# Section 2: Synsets, hypernyms, hyponyms
import nltk
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
print "dog hypernyms=", dog.hypernyms()
### dog hypernyms= [Synset('domestic_animal.n.01'), Synset('canine.n.02')]
print "dog hyponyms=", dog.hyponyms()
print "dog holonyms=", dog.member_holonyms()
print "dog root_hypernyms=", dog.root_hypernyms()

good = wn.synset('good.a.01')
### print "good.antonyms()=", good.antonyms()   # synsets have no antonyms; lemmas do
print "good.lemmas[0].antonyms()=", good.lemmas[0].antonyms()

26 wn03-Lemmas.py
### Section 3: Lemmas
eat = wn.lemma('eat.v.03.eat')
print eat
print eat.key
print eat.count()
print wn.lemma_from_key(eat.key)
print wn.lemma_from_key(eat.key).synset
print wn.lemma_from_key('feebleminded%5:00:00:retarded:00')

for lemma in wn.synset('eat.v.03').lemmas:
    print lemma, lemma.count()
for lemma in wn.lemmas('eat', 'v'):
    print lemma, lemma.count()

vocal = wn.lemma('vocal.a.01.vocal')
print vocal.derivationally_related_forms()   # [Lemma('vocalize.v.02.vocalize')]
print vocal.pertainyms()                     # [Lemma('voice.n.02.voice')]
print vocal.antonyms()

27 wn04-VerbFrames.py
# Section 4: Verb Frames
print wn.synset('think.v.01').frame_ids
for lemma in wn.synset('think.v.01').lemmas:
    print lemma, lemma.frame_ids
    print lemma.frame_strings

print wn.synset('stretch.v.02').frame_ids
for lemma in wn.synset('stretch.v.02').lemmas:
    print lemma, lemma.frame_ids
    print lemma.frame_strings

28 wn05-Similarity.py
### Section 5: Similarity
import nltk
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print dog.path_similarity(cat)   # shortest-path similarity
print dog.lch_similarity(cat)    # Leacock-Chodorow
print dog.wup_similarity(cat)    # Wu-Palmer

from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')     # information content from the Brown corpus
semcor_ic = wordnet_ic.ic('ic-semcor.dat')   # information content from SemCor

29 wn05-Similarity.py continued
from nltk.corpus import genesis
genesis_ic = wn.ic(genesis, False, 0.0)      # build an IC dictionary from the Genesis corpus
print dog.res_similarity(cat, brown_ic)      # Resnik
print dog.res_similarity(cat, genesis_ic)
print dog.jcn_similarity(cat, brown_ic)      # Jiang-Conrath
print dog.jcn_similarity(cat, genesis_ic)
print dog.lin_similarity(cat, semcor_ic)     # Lin

30 wn06-AccessToAllSynsets.py
### Section 6: access to all synsets
import nltk
from nltk.corpus import wordnet as wn

for synset in list(wn.all_synsets('n'))[:10]:
    print synset

wn.synsets('dog')
wn.synsets('dog', pos='v')

from itertools import islice
for synset in islice(wn.all_synsets('n'), 5):
    print synset, synset.hypernyms()

31 wn07-Morphy.py
# WordNet in NLTK
import nltk
from nltk.corpus import wordnet as wn

### Section 7: Morphy (morphological analysis to a base form)
print wn.morphy('denied', wn.NOUN)
print wn.synsets('denied', wn.NOUN)
print wn.synsets('denied', wn.VERB)

32 Section 8: Regression Tests
Bug 85: morphy returns the base form of a word if its input is given as a base form for a POS for which that word is not defined:

>>> wn.synsets('book', wn.NOUN)
[Synset('book.n.01'), Synset('book.n.02'), Synset('record.n.05'), Synset('script.n.01'),
 Synset('ledger.n.01'), Synset('book.n.06'), Synset('book.n.07'), Synset('koran.n.01'),
 Synset('bible.n.01'), Synset('book.n.10'), Synset('book.n.11')]
>>> wn.synsets('book', wn.ADJ)
[]
>>> wn.morphy('book', wn.NOUN)
'book'
>>> wn.morphy('book', wn.ADJ)

33 nltk.corpus.reader.wordnet.
ic(self, corpus, weight_senses_equally=False, smoothing=1.0)
    Creates an information-content lookup dictionary from a corpus.

def demo():
    import nltk
    print('loading wordnet')
    wn = WordNetCorpusReader(nltk.data.find('corpora/wordnet'))
    print('done loading')
    S = wn.synset
    L = wn.lemma

34 root_hypernyms
def root_hypernyms(self):
    """Get the topmost hypernyms of this synset in WordNet."""
    result = []
    seen = set()
    todo = [self]
    while todo:
        next_synset = todo.pop()
        if next_synset not in seen:
            seen.add(next_synset)
            next_hypernyms = next_synset.hypernyms() + …
    return result

