Lecture 24: Distributional Word Similarity II
Topics: distributional word similarity example; PMI; context = syntactic dependencies
Readings: NLTK book Chapter 2 (WordNet); Text Chapter 20
April 15, 2013
CSCE 771 Natural Language Processing
– 2 – CSCE 771 Spring 2013 Overview
Last Time: finished thesaurus-based similarity; distributional word similarity
Today: last lecture's slides 21-; distributional word similarity II; syntax-based contexts
Readings: Text Chapters 19, 20; NLTK Book Chapter 10
Next Time: Computational Lexical Semantics II
– 3 – CSCE 771 Spring 2013 Pointwise Mutual Information (PMI)
Mutual information (Church and Hanks 1989):
  I(X; Y) = Σ_x Σ_y P(x,y) log2 [ P(x,y) / (P(x) P(y)) ]   (eq 20.36)
Pointwise mutual information (Fano 1961):
  PMI(x, y) = log2 [ P(x,y) / (P(x) P(y)) ]   (eq 20.37)
assoc-PMI:
  assoc-PMI(w, f) = log2 [ P(w,f) / (P(w) P(f)) ]   (eq 20.38)
– 4 – CSCE 771 Spring 2013 Computing PPMI
Matrix F with W rows (words) and C columns (contexts); f_ij is the frequency of word w_i in context c_j.
  p_ij = f_ij / N,   p_i* = Σ_j f_ij / N,   p_*j = Σ_i f_ij / N,   where N = Σ_i Σ_j f_ij
  PPMI_ij = max(0, log2 [ p_ij / (p_i* p_*j) ])
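A minimal sketch of this computation (added here; not part of the original slides), using numpy and the toy count matrix from the next two slides:

# Sketch: PPMI from a word-by-context count matrix (toy data from the example slides)
import numpy as np

F = np.array([[0., 0., 1., 0., 1.],    # apricot
              [0., 0., 1., 0., 1.],    # pineapple
              [2., 1., 0., 1., 0.],    # digital
              [1., 6., 0., 4., 0.]])   # information

N = F.sum()                        # total number of co-occurrence counts
p = F / N                          # joint probabilities p_ij
pw = p.sum(axis=1)                 # word marginals p_i*
pc = p.sum(axis=0)                 # context marginals p_*j

with np.errstate(divide='ignore'):           # log2(0) = -inf; clipped next
    pmi = np.log2(p / np.outer(pw, pc))
ppmi = np.maximum(pmi, 0.0)                  # PPMI = max(0, PMI)
print ppmi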
– 5 – CSCE 771 Spring 2013 Example computing PPMI

             computer  data  pinch  result  salt
apricot          0       0     1      0      1
pineapple        0       0     1      0      1
digital          2       1     0      1      0
information      1       6     0      4      0

(slide credit: Jurafsky & Manning, Word Similarity: Distributional Similarity I)
p(w = information, c = data) =
p(w = information) =
p(c = data) =
– 6 – CSCE 771 Spring 2013 Example computing PPMI

             computer  data  pinch  result  salt
apricot          0       0     1      0      1
pineapple        0       0     1      0      1
digital          2       1     0      1      0
information      1       6     0      4      0

(slide credit: Jurafsky & Manning, Word Similarity: Distributional Similarity I)
With N = 19 total counts:
p(w = information, c = data) = 6/19 = .32
p(w = information) = 11/19 = .58
p(c = data) = 7/19 = .37
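Worked through (the arithmetic follows directly from the table; the original slide leaves it to the reader): plugging these into eq. 20.38 gives PMI(information, data) = log2( (6/19) / ((11/19) × (7/19)) ) = log2(1.48) ≈ 0.57. The value is positive, so PPMI(information, data) ≈ 0.57 as well.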
– 7 – CSCE 771 Spring 2013 Associations
– 8 – CSCE 771 Spring 2013 PMI: More data trumps smarter algorithms
"More data trumps smarter algorithms: Comparing pointwise mutual information with latent semantic analysis" (Recchia & Jones, Indiana University, 2009)
http://www.indiana.edu/~clcl/Papers/BSC901.pdf
"we demonstrate that this metric benefits from training on extremely large amounts of data and correlates more closely with human semantic similarity ratings than do publicly available implementations of several more complex models."
– 9 – CSCE 771 Spring 2013 Figure 20.10 Co-occurrence vectors based on syntactic dependencies
A dependency-based parser (a special case of shallow parsing) identifies relations in "I discovered dried tangerines." (20.32):
discover (subject I)
I (subject-of discover)
tangerine (obj-of discover)
tangerine (adj-mod dried)
– 10 – CSCE 771 Spring 2013 Defining context using syntactic info
Two options: dependency parsing vs. chunking
discover (subject I)         --  S → NP VP
I (subject-of discover)
tangerine (obj-of discover)  --  VP → verb NP
tangerine (adj-mod dried)    --  NP → det ? ADJ N
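A small sketch (not from the slides) of how dependency triples become context features for distributional similarity; the triples below simply mirror the tangerine example above:

# Sketch: dependency triples -> context features for each word
from collections import defaultdict

# (head, relation, dependent) triples for "I discovered dried tangerines."
triples = [('discover', 'subject', 'I'),
           ('discover', 'obj', 'tangerine'),
           ('tangerine', 'adj-mod', 'dried')]

contexts = defaultdict(list)
for head, rel, dep in triples:
    contexts[head].append((rel, dep))           # e.g. discover: (subject, I)
    contexts[dep].append((rel + '-of', head))   # e.g. tangerine: (obj-of, discover)

for word in sorted(contexts):
    print word, contexts[word]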
– 11 – CSCE 771 Spring 2013 Figure 20.11 Objects of the verb drink (Hindle 1990 ACL)
By raw frequency, it, much, and anything are more frequent objects than wine; by PMI-Assoc, wine comes out as more "drinkable."

Object       Count   PMI-Assoc
tea            4      11.75
Pepsi          2      11.75
champagne      4      11.75
liquid         2      10.53
beer           5      10.20
wine           2       9.34
water          7       7.65
anything       3       5.15
much           3       2.54
it             3       1.25
?              2       1.22

http://acl.ldc.upenn.edu/P/P90/P90-1034.pdf
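A sketch of the Hindle-style ranking (all counts below are hypothetical, purely to illustrate the computation; assoc here is the PMI of eq. 20.38 over verb-object pairs):

# Sketch: rank objects of "drink" by PMI association (hypothetical counts)
import math

N = 10000000                                   # hypothetical corpus size
verb_count = 400                               # hypothetical count("drink")
obj_count = {'tea': 50, 'water': 4000, 'it': 500000}   # hypothetical object counts
pair_count = {'tea': 4, 'water': 7, 'it': 3}           # hypothetical pair counts

def assoc(obj):
    p_pair = pair_count[obj] / float(N)        # P(drink, obj)
    p_verb = verb_count / float(N)             # P(drink)
    p_obj = obj_count[obj] / float(N)          # P(obj)
    return math.log(p_pair / (p_verb * p_obj), 2)

for obj in sorted(pair_count, key=assoc, reverse=True):
    print obj, round(assoc(obj), 2)            # tea ranks above water above it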
– 12 – CSCE 771 Spring 2013 Vectors review
dot product:  v · w = Σ_i v_i w_i
length:  |v| = √(Σ_i v_i²)
cosine:  sim_cosine(v, w) = (v · w) / (|v| |w|)
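A direct transcription of these three formulas into Python (a sketch added here, not from the slides):

# Sketch: dot product, vector length, and cosine similarity
import math

def dot(v, w):
    return sum(vi * wi for vi, wi in zip(v, w))

def length(v):
    return math.sqrt(dot(v, v))

def sim_cosine(v, w):
    return dot(v, w) / (length(v) * length(w))

# e.g. the digital and information rows of the earlier count table
print sim_cosine([2, 1, 0, 1, 0], [1, 6, 0, 4, 0])   # about 0.67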
– 13 – CSCE 771 Spring 2013 Figure 20.12 Similarity of Vectors
– 14 – CSCE 771 Spring 2013 Figure 20.13 Vector Similarity Summary
– 15 – CSCE 771 Spring 2013 Figure 20.14 Hand-built patterns for hypernyms (Hearst 1992)
Finding hypernyms (IS-A links): (20.58) "One example of red algae is Gelidium."
Pattern: "one example of *** is a ***" (about 500,000 hits on Google)
Caveat: semantic drift in bootstrapping
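One pattern of this kind as a sketch (a toy regex, not Hearst's actual pattern grammar):

# Sketch: a single Hearst-style pattern as a regular expression
import re

pattern = re.compile(r'one example of (\w[\w ]*) is (\w[\w ]*)', re.IGNORECASE)
m = pattern.search('One example of red algae is Gelidium.')
if m:
    print m.group(2), 'IS-A', m.group(1)   # Gelidium IS-A red algae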
– 16 – CSCE 771 Spring 2013 Hyponym Learning Algorithm (Snow et al. 2005)
Rely on WordNet to learn large numbers of weak hyponym patterns.
Snow's algorithm:
1. Collect all pairs of WordNet noun concepts with a known hypernym/hyponym relation.
2. For each pair, collect all sentences containing the pair.
3. Parse the sentences and automatically extract every possible Hearst-style syntactic pattern from the parse trees.
4. Use the large set of patterns as features in a logistic regression classifier.
5. Given each pair, extract features and use the classifier to determine whether the pair is a hypernym/hyponym.
New patterns learned:
NP_H like NP
NP is a NP_H
NP_H called NP
NP, a NP_H (appositive)
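A toy sketch of steps 4-5 (the weights below are hypothetical stand-ins for a trained logistic regression model, and feature extraction is assumed already done):

# Sketch: score a noun pair from binary Hearst-pattern features
# (weights are hypothetical, standing in for a trained classifier)
import math

patterns = ['NP_H like NP', 'NP is a NP_H', 'NP_H called NP', 'NP , a NP_H']
weights = [1.2, 0.9, 1.5, 1.1]         # hypothetical learned weights
bias = -1.0                            # hypothetical intercept

def p_hypernym(features):              # logistic regression decision rule
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# pair seen with "NP_H like NP" and "NP is a NP_H" in the corpus
print p_hypernym([1, 1, 0, 0])         # probability the pair is hypernym/hyponym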
– 17 – CSCE 771 Spring 2013 Vector similarities from Lin 1998
hope (N): optimism 0.141, chance 0.137, expectation 0.137, prospect 0.126, dream 0.119, desire 0.118, fear 0.116, effort 0.111, confidence 0.109, promise 0.108
hope (V): would like 0.158, wish 0.140, …
brief (N): legal brief 0.256, affidavit 0.191, …
brief (A): lengthy 0.256, hour-long 0.191, short 0.174, extended 0.163, …
Full lists on page 667.
– 18 – CSCE 771 Spring 2013 Supersenses
The 26 broad-category "lexicographer class" WordNet labels.
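In NLTK the supersense is exposed as a synset's lexicographer-file name; a quick sketch (using the NLTK 2.x attribute style of the code slides that follow):

# Sketch: supersense = WordNet lexicographer file of a synset
from nltk.corpus import wordnet as wn

for name in ['dog.n.01', 'run.v.01', 'tree.n.01']:
    s = wn.synset(name)
    print s.name, s.lexname    # e.g. noun.animal, verb.motion, noun.plant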
– 19 – CSCE 771 Spring 2013 Figure 20.15 Semantic Role Labelling
– 20 – CSCE 771 Spring 2013 Figure 20.16
– 21 – CSCE 771 Spring 2013 Google "WordNet NLTK" (the NLTK WordNet howto, used in the following slides).
– 22 – CSCE 771 Spring 2013 wn01.py
# Wordnet examples from nltk.googlecode.com
import nltk
from nltk.corpus import wordnet as wn

motorcar = wn.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms()
print types_of_motorcar[26]          # one hyponym of car.n.01
print wn.synset('ambulance.n.01')
print sorted([lemma.name for synset in types_of_motorcar
              for lemma in synset.lemmas])

http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html
– 23 – CSCE 771 Spring 2013 wn01.py continued
print "wn.synsets('dog', pos=wn.VERB)= ", wn.synsets('dog', pos=wn.VERB)
print wn.synset('dog.n.01')
### Synset('dog.n.01')
print wn.synset('dog.n.01').definition
### 'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'
print wn.synset('dog.n.01').examples
### ['the dog barked all night']
– 24 – CSCE 771 Spring 2013 wn01.py continued
print wn.synset('dog.n.01').lemmas
### [Lemma('dog.n.01.dog'), Lemma('dog.n.01.domestic_dog'), Lemma('dog.n.01.Canis_familiaris')]
print [lemma.name for lemma in wn.synset('dog.n.01').lemmas]
### ['dog', 'domestic_dog', 'Canis_familiaris']
print wn.lemma('dog.n.01.dog').synset
– 25 – CSCE 771 Spring 2013 Section 2: synsets, hypernyms, hyponyms
# Section 2 Synsets, hypernyms, hyponyms
import nltk
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
print "dog hypernyms=", dog.hypernyms()
### dog hypernyms= [Synset('domestic_animal.n.01'), Synset('canine.n.02')]
print "dog hyponyms=", dog.hyponyms()
print "dog holonyms=", dog.member_holonyms()
print "dog root_hypernyms=", dog.root_hypernyms()

good = wn.synset('good.a.01')
### print "good.antonyms()=", good.antonyms()   # fails: antonyms are on lemmas, not synsets
print "good.lemmas[0].antonyms()=", good.lemmas[0].antonyms()
– 26 – CSCE 771 Spring 2013 wn03-Lemmas.py
### Section 3 Lemmas
eat = wn.lemma('eat.v.03.eat')
print eat
print eat.key
print eat.count()
print wn.lemma_from_key(eat.key)
print wn.lemma_from_key(eat.key).synset
print wn.lemma_from_key('feebleminded%5:00:00:retarded:00')

for lemma in wn.synset('eat.v.03').lemmas:
    print lemma, lemma.count()
for lemma in wn.lemmas('eat', 'v'):
    print lemma, lemma.count()

vocal = wn.lemma('vocal.a.01.vocal')
print vocal.derivationally_related_forms()   # [Lemma('vocalize.v.02.vocalize')]
print vocal.pertainyms()                     # [Lemma('voice.n.02.voice')]
print vocal.antonyms()
– 27 – CSCE 771 Spring 2013 wn04-VerbFrames.py
# Section 4 Verb Frames
print wn.synset('think.v.01').frame_ids
for lemma in wn.synset('think.v.01').lemmas:
    print lemma, lemma.frame_ids
    print lemma.frame_strings

print wn.synset('stretch.v.02').frame_ids
for lemma in wn.synset('stretch.v.02').lemmas:
    print lemma, lemma.frame_ids
    print lemma.frame_strings
– 28 – CSCE 771 Spring 2013 wn05-Similarity.py
### Section 5 Similarity
import nltk
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print dog.path_similarity(cat)
print dog.lch_similarity(cat)
print dog.wup_similarity(cat)

from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')
semcor_ic = wordnet_ic.ic('ic-semcor.dat')
– 29 – CSCE 771 Spring 2013 wn05-Similarity.py continued
from nltk.corpus import genesis
genesis_ic = wn.ic(genesis, False, 0.0)

print dog.res_similarity(cat, brown_ic)
print dog.res_similarity(cat, genesis_ic)
print dog.jcn_similarity(cat, brown_ic)
print dog.jcn_similarity(cat, genesis_ic)
print dog.lin_similarity(cat, semcor_ic)
– 30 – CSCE 771 Spring 2013 wn06-AccessToAllSynsets.py
### Section 6 access to all synsets
import nltk
from nltk.corpus import wordnet as wn

for synset in list(wn.all_synsets('n'))[:10]:
    print synset

print wn.synsets('dog')
print wn.synsets('dog', pos='v')

from itertools import islice
for synset in islice(wn.all_synsets('n'), 5):
    print synset, synset.hypernyms()
– 31 – CSCE 771 Spring 2013 wn07-Morphy.py
# Wordnet in NLTK
# http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html
import nltk
from nltk.corpus import wordnet as wn

### Section 7 Morphy
print wn.morphy('denied', wn.NOUN)
print wn.synsets('denied', wn.NOUN)
print wn.synsets('denied', wn.VERB)
– 32 – CSCE 771 Spring 2013 8 Regression Tests
Bug 85: morphy returns the base form of a word if its input is given as a base form for a POS for which that word is not defined:
>>> wn.synsets('book', wn.NOUN)
[Synset('book.n.01'), Synset('book.n.02'), Synset('record.n.05'), Synset('script.n.01'), Synset('ledger.n.01'), Synset('book.n.06'), Synset('book.n.07'), Synset('koran.n.01'), Synset('bible.n.01'), Synset('book.n.10'), Synset('book.n.11')]
>>> wn.synsets('book', wn.ADJ)
[]
>>> wn.morphy('book', wn.NOUN)
'book'
>>> wn.morphy('book', wn.ADJ)
– 33 – CSCE 771 Spring 2013 nltk.corpus.reader.wordnet.ic
ic(self, corpus, weight_senses_equally=False, smoothing=1.0)
Creates an information content lookup dictionary from a corpus.
http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.wordnet-pysrc.html#WordNetCorpusReader.ic

def demo():
    import nltk
    print('loading wordnet')
    wn = WordNetCorpusReader(nltk.data.find('corpora/wordnet'))
    print('done loading')
    S = wn.synset
    L = wn.lemma
– 34 – CSCE 771 Spring 2013 root_hypernyms
def root_hypernyms(self):
    """Get the topmost hypernyms of this synset in WordNet."""
    result = []
    seen = set()
    todo = [self]
    while todo:
        next_synset = todo.pop()
        if next_synset not in seen:
            seen.add(next_synset)
            next_hypernyms = (next_synset.hypernyms() +
                              next_synset.instance_hypernyms())
            # remainder restored from the NLTK source (elided on the slide):
            if not next_hypernyms:
                result.append(next_synset)   # no hypernyms: this is a root
            else:
                todo.extend(next_hypernyms)  # keep climbing the hierarchy
    return result