Lecture 22: Word Similarity
Topics: word similarity; thesaurus-based word similarity; introduction to distributional word similarity
Readings: NLTK book, Chapter 2 (WordNet); text, Chapter 20
April 8, 2013
CSCE 771 Natural Language Processing
– 2 – CSCE 771 Spring 2013 Overview
Last time (programming): features in NLTK; NL queries to SQL; NLTK support for interpretations and models; propositional and predicate logic support; Prover9
Today: last lecture's slides; features in NLTK; computational lexical semantics
Readings: text, Chapters 19-20; NLTK book, Chapter 10
Next time: Computational Lexical Semantics II
– 3 – ACL Anthology
– 4 – Figure 20.8: Summary of thesaurus-based similarity measures
– 5 – WordNet similarity functions
path_similarity(), lch_similarity(), wup_similarity(), res_similarity(), jcn_similarity(), lin_similarity()
– 6 – Examples, but first a pop quiz: how do you get hypernyms from WordNet?
– 7 – Example: P(c) values (tree diagram: entity at the root, splitting into physical thing, abstraction, and thing (not specified); physical thing into living thing and non-living thing; living thing into mammals, amphibians, and reptiles, with leaves such as snake, frog, cat, dog, and whale (right, minke); abstraction into idea and novel; the senses pacifier#1 and pacifier#2 attach at different points. Color code: blue = WordNet, red = inspired)
– 8 – Example: made-up counts over the same hierarchy
– 9 – Example: P(c) values over the same hierarchy
– 10 – Example: similarity over the same hierarchy
– 11 – sim_Lesk(cat, dog) = ???
WordNet entries whose glosses are compared:
(42) S: (n) dog#1, domestic dog#1, Canis familiaris#1 (a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds) "the dog barked all night"
(18) S: (n) cat#1, true cat#1 (feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats)
(1) S: (n) wolf#1 (any of various predatory carnivorous canine mammals of North America and Eurasia that usually hunt in packs)
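The Lesk family of measures scores a word pair by overlaps between dictionary glosses like the ones above. A minimal sketch of the simplest variant, counting shared content words between two glosses (the small stop-word list is an assumption for illustration; real Lesk variants such as Extended Lesk also match multi-word phrases and glosses of related senses):

```python
# Simplified Lesk-style gloss overlap: count word types shared by two
# glosses, ignoring a few function words.
STOP = {"a", "an", "the", "of", "that", "by", "and", "in", "no", "to"}

def tokens(gloss: str) -> set:
    return {w.strip('".,;:()').lower() for w in gloss.split()} - STOP - {""}

def gloss_overlap(gloss1: str, gloss2: str) -> int:
    return len(tokens(gloss1) & tokens(gloss2))

dog_gloss = ("a member of the genus Canis that has been domesticated "
             "by man since prehistoric times")
cat_gloss = "feline mammal usually having thick soft fur and no ability to roar"
print(gloss_overlap(dog_gloss, cat_gloss))
```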
– 12 – Problems with thesaurus-based similarity
We don't always have a thesaurus for the language or domain.
Even when we do, recall suffers: missing words and phrases, missing thesauri.
Thesauri work less well for verbs and adjectives, which have less hyponymy structure.
(Distributional Word Similarity, D. Jurafsky)
– 13 – Distributional models of meaning
Vector-space models of meaning offer higher recall than hand-built thesauri, though probably with less precision.
Intuition: words that occur in similar contexts tend to have similar meanings.
(Distributional Word Similarity, D. Jurafsky)
– 14 – Word similarity: distributional methods, the tezguino example (Nida)
A bottle of tezguino is on the table.
Everybody likes tezguino.
Tezguino makes you drunk.
We make tezguino out of corn.
What do you know about tezguino?
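The intuition can be made concrete by collecting, for each word, the contexts it appears in. A toy sketch over the four tezguino sentences (representing a context as the bag of co-occurring words in the same sentence is one simple choice among many):

```python
from collections import Counter

sentences = [
    "a bottle of tezguino is on the table",
    "everybody likes tezguino",
    "tezguino makes you drunk",
    "we make tezguino out of corn",
]

def context_vector(target, sents):
    """Count words co-occurring with `target` in the same sentence."""
    vec = Counter()
    for sent in sents:
        words = sent.split()
        if target in words:
            vec.update(w for w in words if w != target)
    return vec

v = context_vector("tezguino", sentences)
print(v.most_common(5))
```

A word with a similar vector (sharing contexts like "bottle of", "likes", "makes you drunk", "made out of corn") is plausibly a similar kind of thing, which is exactly what the example is designed to show.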
– 15 – Term-document matrix
Take a collection of documents; identify a collection of important, discriminatory terms (words).
Matrix: terms x documents, with entries the term frequencies tf_w,d.
Each document is then a vector in Z^|V| (Z = integers; N = natural numbers would be more accurate but perhaps misleading).
Example follows. (Distributional Word Similarity, D. Jurafsky)
– 16 – Example term-document matrix
Subset of terms = {battle, soldier, fool, clown} (Distributional Word Similarity, D. Jurafsky)

            As You Like It   Twelfth Night   Julius Caesar   Henry V
  battle          1                1               8            15
  soldier         2                2              12            36
  fool           37               58               1             5
  clown           6              117               0             0
– 17 – Figure 20.9: term-in-context matrix for word similarity (co-occurrence vectors)
Window of 20 words (10 before, 10 after) from the Brown corpus; counts which words occur together.
A non-Brown example: "The Graduate School requires that all PhD students be admitted to candidacy at least one year prior to graduation. Passing …"
(small table from Brown, 10 before / 10 after)
– 18 – Pointwise mutual information
tf-idf (term frequency / inverse document frequency) weighting instead of raw counts; recall the idf intuition.
Pointwise mutual information (PMI): do events x and y co-occur more often than they would if they were independent?
PMI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
PMI between words; positive PMI between two words (PPMI) keeps only the values above zero.
– 19 – Computing PPMI
Matrix F with W rows (words) and C columns (contexts); f_ij is the frequency of w_i in context c_j.
p_ij = f_ij / (sum_i sum_j f_ij)
p_i* = sum_j p_ij (row marginal); p_*j = sum_i p_ij (column marginal)
pmi_ij = log2 [ p_ij / (p_i* p_*j) ]
ppmi_ij = max(pmi_ij, 0)
– 20 – Example: computing PPMI
We need counts, so let's make some up (the slide's table needs to be filled in with counts).
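A sketch of the PPMI computation on a small made-up count matrix (the counts are invented, as the slide suggests):

```python
import math

# Made-up word-by-context counts: 2 words x 2 contexts.
F = [[2, 0],
     [1, 1]]

total = sum(sum(row) for row in F)
p = [[f / total for f in row] for row in F]   # joint probabilities p_ij
p_w = [sum(row) for row in p]                 # row marginals p_i*
p_c = [sum(col) for col in zip(*p)]           # column marginals p_*j

def ppmi(i, j):
    """ppmi_ij = max(log2(p_ij / (p_i* * p_*j)), 0); defined as 0 when p_ij = 0."""
    if p[i][j] == 0.0:
        return 0.0
    return max(math.log2(p[i][j] / (p_w[i] * p_c[j])), 0.0)

M = [[ppmi(i, j) for j in range(2)] for i in range(2)]
print(M)
```

Note how the negative raw PMI for cell (1, 0), log2(0.25 / 0.375), is clipped to 0; PPMI keeps only positive associations.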
– 21 – Associations
PMI association: assoc_PMI(w, f) = log2 [ P(w, f) / (P(w) P(f)) ]
Lin association, where a feature f is composed of a relation r and a word w':
assoc_LIN(w, f) = log2 [ P(w, f) / (P(r|w) P(w'|w)) ]
t-test association (eq. 20.41)
– 22 – Figure: co-occurrence vectors from dependencies
A dependency parser (a special case of shallow parsing) identifies, from "I discovered dried tangerines." (20.32), features such as:
discover (subject, I); I (subject-of, discover); tangerine (obj-of, discover); tangerine (adj-mod, dried)
– 23 – Figure: objects of the verb drink (Hindle, 1990)
– 24 – Vectors review
dot product: v . w = sum_i v_i w_i
length: |v| = sqrt(sum_i v_i^2)
cosine similarity: sim_cosine(v, w) = (v . w) / (|v| |w|)
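The three definitions, as a short sketch; the battle and clown vectors are the rows of the term-document matrix from the earlier Shakespeare example:

```python
import math

def dot(v, w):
    """Dot product: sum of elementwise products."""
    return sum(a * b for a, b in zip(v, w))

def length(v):
    """Euclidean length: sqrt of the dot product of v with itself."""
    return math.sqrt(dot(v, v))

def cosine(v, w):
    """Cosine similarity: dot product normalized by both lengths."""
    return dot(v, w) / (length(v) * length(w))

battle = [1, 1, 8, 15]   # counts in the four plays
clown = [6, 117, 0, 0]
print(cosine(battle, clown))
```

Cosine ranges from 0 (orthogonal count vectors, no shared documents) to 1 (identical direction), which is why it is the usual choice for comparing frequency vectors.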
– 25 – Figure: similarity of vectors
– 26 – Figure: vector similarity summary
– 27 – Figure: hand-built patterns for hypernyms (Hearst, 1992)
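Hearst-style patterns extract hypernym/hyponym pairs from raw text with lexical templates such as "such NP as NP, NP, and NP". A minimal sketch of that one pattern (the regular expression is a deliberate simplification; real implementations match over part-of-speech-tagged noun phrases, and the example sentence is illustrative):

```python
import re

def such_as(sentence):
    """Match 'such X as A, B, and C'; return (hypernym, [hyponyms]) or None."""
    m = re.search(r"such (\w+) as ([\w, ]+)", sentence)
    if not m:
        return None
    hypernym = m.group(1)
    # Split the conjoined list on commas and 'and'.
    hyponyms = [h for h in re.split(r",\s*(?:and\s+)?|\s+and\s+", m.group(2)) if h]
    return hypernym, hyponyms

print(such_as("works by such authors as Herrick, Goldsmith, and Shakespeare"))
```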
– 28 – Figure 20.15
– 29 – Figure 20.16
– 30 – How to do it in NLTK
NLTK 3.0a1, released February 2013. This version adds support for NLTK's graphical user interfaces.
"Which similarity function in nltk.corpus.wordnet is appropriate for finding the similarity of two words? I want to use such a function for word clustering and the Yarowsky algorithm, to find similar collocations in a large text."