Lecture 22 Word Similarity CSCE 771 Natural Language Processing Lecture 22 Word Similarity Topics word similarity Thesaurus based word similarity Intro. Distributional based word similarity Readings: NLTK book Chapter 2 (wordnet) Text Chapter 20 April 8, 2013
Overview Readings: Text 19,20 NLTK Book: Chapter 10 Last Time (Programming) Features in NLTK NL queries SQL NLTK support for Interpretations and Models Propositional and predicate logic support Prover9 Today Last Lectures slides 25-29 Computational Lexical Semantics Readings: Text 19,20 NLTK Book: Chapter 10 Next Time: Computational Lexical Semantics II
ACL Anthology - http://aclweb.org/anthology-new/
Figure 20.8 Summary of Thesaurus Similarity measures
Wordnet similarity functions path_similarity()? lch_similarity()? wup_similarity()? res_similarity()? jcn_similarity()? lin_similarity()?
Examples: but first a Pop Quiz How do you get hypernyms from wordnet?
Example: P(c) values entity thing (not specified) physical thing abstraction idea pacifier#2 living thing non-living thing mammals amphibians reptiles novel pacifier#1 cat dog whale frog snake right minke Color code Blue: wordnet Red: Inspired
Example: counts (made-up) entity thing (not specified) physical thing abstraction idea pacifier#2 living thing non-living thing mammals amphibians reptiles novel pacifier#1 cat dog whale frog snake right minke Color code Blue: wordnet Red: Inspired
Example: P(c) values entity thing (not specified) physical thing abstraction idea pacifier#2 living thing non-living thing mammals amphibians reptiles novel pacifier#1 cat dog whale frog snake right minke Color code Blue: wordnet Red: Inspired
Example: entity thing (not specified) physical thing abstraction idea pacifier#2 living thing non-living thing mammals amphibians reptiles novel pacifier#1 cat dog whale frog snake right minke Color code Blue: wordnet Red: Inspired
simLesk(cat, dog) ??? (42)S: (n) dog#1 (dog%1:05:00::), domestic dog#1 (domestic_dog%1:05:00::), Canis familiaris#1 (canis_familiaris%1:05:00::) (a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds) "the dog barked all night“ (18)S: (n) cat#1 (cat%1:05:00::), true cat#1 (true_cat%1:05:00::) (feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats) (1)S: (n) wolf#1 (wolf%1:05:00::) (any of various predatory carnivorous canine mammals of North America and Eurasia that usually hunt in packs)
Problems with thesaurus-based don’t always have a thesaurus Even so problems with recall missing words phrases missing thesauri work less well for verbs and adjectives less hyponymy structure Distributional Word Similarity D. Jurafsky
Distributional models of meaning vector-space models of meaning offer higher recall than hand-built thesauri less precision probably intuition Distributional Word Similarity D. Jurafsky
Word Similarity Distributional Methods 20.31 tezguino example (Nida) A bottle of tezguino is on the table. Everybody likes tezguino. tezguino makes you drunk. We make tezguino out of corn. What do you know about tezguino?
Distributional Word Similarity D. Jurafsky Term-document matrix Collection of documents Identify collection of important terms, discriminatory terms(words) Matrix: terms X documents – term frequency tfw,d = each document a vector in ZV: Z= integers; N=natural numbers more accurate but perhaps misleading Example Distributional Word Similarity D. Jurafsky
Example Term-document matrix Subset of terms = {battle, soldier, fool, clown} As you like it 12th Night Julius Caesar Henry V Battle 1 8 15 Soldier 2 12 36 fool 37 58 5 clown 6 117 Distributional Word Similarity D. Jurafsky
Figure 20.9 Term in context matrix for word similarity (Co-occurrence vectors) window of 20 words – 10 before 10 after from Brown corpus – words that occur together non Brown example The Graduate School requires that all PhD students to be admitted to candidacy at least one year prior to graduation. Passing … Small table from the Brown 10 before 10 after
Pointwise Mutual Information td-idf (inverse document frequency) rating instead of raw counts idf intuition again – pointwise mutual information (PMI) Do events x and y occur more than if they were independent? PMI(X,Y)= log2 P(X,Y) / P(X)P(Y) PMI between words Positive PMI between two words (PPMI)
Computing PPMI Matrix F with W (words) rows and C (contexts) columns fij is frequency of wi in cj,
Example computing PPMI Need counts so lets make up some we need to edit this table to have counts
Associations PMI-assoc assocPMI(w, f) = log2 P(w,f) / P(w) P(f) Lin- assoc - f composed of r (relation) and w’ assocLIN(w, f) = log2 P(w,f) / P(r|w) P(w’|w) t-test_assoc (20.41)
Figure 20.10 Co-occurrence vectors Dependency based parser – special case of shallow parsing identify from “I discovered dried tangerines.” (20.32) discover(subject I) I(subject-of discover) tangerine(obj-of discover) tangerine(adj-mod dried)
Figure 20.11 Objects of the verb drink Hindle 1990
vectors review dot-product length sim-cosine
Figure 20.12 Similarity of Vectors
Fig 20.13 Vector Similarity Summary
Figure 20.14 Hand-built patterns for hypernyms Hearst 1992
Figure 20.15
Figure 20.16
http://www. cs. ucf. edu/courses/cap5636/fall2011/nltk http://www.cs.ucf.edu/courses/cap5636/fall2011/nltk.pdf how to do in nltk NLTK 3.0a1 released : February 2013 This version adds support for NLTK’s graphical user interfaces. http://nltk.org/nltk3-alpha/ which similarity function in nltk.corpus.wordnet is Appropriate for find similarity of two words? I want use a function for word clustering and yarowsky algorightm for find similar collocation in a large text. http://en.wikipedia.org/wiki/Wikipedia:WikiProject_Linguistics http://en.wikipedia.org/wiki/Portal:Linguistics http://en.wikipedia.org/wiki/Yarowsky_algorithm http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html