1
Lecture 22 Word Similarity
CSCE Natural Language Processing
Lecture 22: Word Similarity
Topics: word similarity; thesaurus-based word similarity; introduction to distributional word similarity
Readings: NLTK book Chapter 2 (WordNet); text Chapter 20
April 8, 2013
2
Overview
Readings: Text Chapters 19, 20; NLTK Book Chapter 10
Last Time (Programming): features in NLTK; NL queries to SQL; NLTK support for interpretations and models; propositional and predicate logic support; Prover9
Today: last lecture's slides 25-29; computational lexical semantics
Next Time: Computational Lexical Semantics II
3
ACL Anthology - http://aclweb.org/anthology-new/
4
Figure 20.8 Summary of Thesaurus Similarity measures
5
WordNet similarity functions
path_similarity(), lch_similarity(), wup_similarity(), res_similarity(), jcn_similarity(), lin_similarity()
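As a quick illustration (not from the slides), here is how these six measures can be called in NLTK. The information-content measures (res, jcn, lin) need an IC dictionary, such as the one precomputed over the Brown corpus; this sketch assumes the wordnet and wordnet_ic data have been downloaded:

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

# Information-content file computed over the Brown corpus
brown_ic = wordnet_ic.ic('ic-brown.dat')

print(dog.path_similarity(cat))            # shortest-path measure
print(dog.lch_similarity(cat))             # Leacock-Chodorow
print(dog.wup_similarity(cat))             # Wu-Palmer
print(dog.res_similarity(cat, brown_ic))   # Resnik (needs IC)
print(dog.jcn_similarity(cat, brown_ic))   # Jiang-Conrath (needs IC)
print(dog.lin_similarity(cat, brown_ic))   # Lin (needs IC)
```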
6
Examples: but first, a pop quiz
How do you get hypernyms from WordNet?
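Answer sketch, using standard NLTK calls:

```python
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
print(dog.hypernyms())        # immediate hypernyms of the synset
print(dog.hypernym_paths())   # every path from the synset up to the root
# Transitive closure: all hypernyms, all the way up
print(list(dog.closure(lambda s: s.hypernyms())))
```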
7
Example: P(c) values
[Diagram: a WordNet-inspired concept hierarchy. The root entity splits into physical thing, abstraction, and thing (not specified); living thing leads to mammals, amphibians, and reptiles, with leaves cat, dog, whale (right, minke), frog, and snake; non-living thing covers novel and pacifier#1; abstraction covers idea and pacifier#2. Color code: blue = taken from WordNet, red = inspired additions.]
8
Example: counts (made-up)
[Same hierarchy diagram, with each node annotated with a made-up count. Color code: blue = WordNet, red = inspired.]
9
Example: P(c) values
[Same hierarchy diagram, with each node annotated with the P(c) value derived from the made-up counts. Color code: blue = WordNet, red = inspired.]
10
Example:
[Same WordNet-inspired hierarchy diagram as above. Color code: blue = WordNet, red = inspired.]
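To make the counts-to-P(c) progression concrete, here is a minimal sketch of the Resnik idea. The node names follow the diagram, but the counts are made up by me for illustration: P(c) is the probability that a random token falls under concept c, IC(c) = -log2 P(c), and the Resnik similarity of two concepts is the IC of their lowest common subsumer.

```python
import math

# Made-up counts(c): how many corpus tokens fall under concept c.
# A node's count includes everything below it; the root count is N.
counts = {'entity': 10000, 'physical thing': 8000, 'living thing': 5000,
          'mammals': 2000, 'cat': 500, 'dog': 800, 'whale': 200}
N = counts['entity']

def ic(c):
    # Information content: IC(c) = -log2 P(c), with P(c) = counts(c) / N
    return -math.log2(counts[c] / N)

# sim_Resnik(cat, dog) = IC(LCS(cat, dog)); in the diagram the LCS is 'mammals'
print(ic('mammals'))
```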
11
simLesk(cat, dog) = ???
(42) S: (n) dog#1 (dog%1:05:00::), domestic dog#1 (domestic_dog%1:05:00::), Canis familiaris#1 (canis_familiaris%1:05:00::) (a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds) "the dog barked all night"
(18) S: (n) cat#1 (cat%1:05:00::), true cat#1 (true_cat%1:05:00::) (feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats)
(1) S: (n) wolf#1 (wolf%1:05:00::) (any of various predatory carnivorous canine mammals of North America and Eurasia that usually hunt in packs)
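The Lesk idea is gloss overlap: count the content words shared by the two synsets' definitions. A minimal sketch, assuming NLTK's WordNet interface and stopword list are available. This is only the simple overlap, not the full Extended Lesk, which also weights multi-word phrasal overlaps and includes the glosses of related synsets:

```python
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords

def gloss_overlap(s1, s2):
    # Content words shared by the two definitions (glosses)
    stop = set(stopwords.words('english'))
    g1 = set(s1.definition().lower().split()) - stop
    g2 = set(s2.definition().lower().split()) - stop
    return g1 & g2

print(gloss_overlap(wn.synset('dog.n.01'), wn.synset('cat.n.01')))
# Note: near-matches like 'domesticated'/'domestic' would need stemming to count
```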
12
Problems with thesaurus-based similarity
We don't always have a thesaurus for the language or domain. Even when we do, there are recall problems: missing words and missing phrases. Thesauri also work less well for verbs and adjectives, which have less hyponymy structure.
(Distributional Word Similarity, D. Jurafsky)
13
Distributional models of meaning
Vector-space models of meaning offer higher recall than hand-built thesauri, though probably at lower precision.
(Distributional Word Similarity, D. Jurafsky)
14
Word Similarity: Distributional Methods
(20.31) The tezguino example (Nida):
A bottle of tezguino is on the table.
Everybody likes tezguino.
Tezguino makes you drunk.
We make tezguino out of corn.
What do you know about tezguino?
15
Term-document matrix
Start from a collection of documents and identify a collection of important, discriminatory terms (words). Build a terms X documents matrix whose cells are term frequencies tf_{w,d}. Each document is then a vector in Z^|V| (Z = the integers; N = the natural numbers would be more accurate, but perhaps misleading). Example on the next slide.
(Distributional Word Similarity, D. Jurafsky)
16
Example term-document matrix
Subset of terms = {battle, soldier, fool, clown}

Term      As You Like It   12th Night   Julius Caesar   Henry V
battle           1              1              8            15
soldier          2              2             12            36
fool            37             58              1             5
clown            6            117              0             0

(Distributional Word Similarity, D. Jurafsky)
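A sketch of the same table as a numpy matrix; the column order follows the table above:

```python
import numpy as np

terms = ['battle', 'soldier', 'fool', 'clown']
docs = ['As You Like It', '12th Night', 'Julius Caesar', 'Henry V']

# Rows are term vectors over documents; columns are document vectors over terms
M = np.array([[ 1,   1,   8,  15],
              [ 2,   2,  12,  36],
              [37,  58,   1,   5],
              [ 6, 117,   0,   0]])

fool, clown = M[2], M[3]   # two term (row) vectors to compare
```

Note the similar rows for fool and clown: the comedies pattern together, which is exactly the signal distributional methods exploit.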
17
Figure 20.9 Term-in-context matrix for word similarity (co-occurrence vectors)
Use a window of 20 words (10 before, 10 after) over the Brown corpus to count which words occur together. A non-Brown example: "The Graduate School requires all PhD students to be admitted to candidacy at least one year prior to graduation. Passing ..."
Small table from the Brown corpus, 10 words before and 10 after.
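A minimal sketch of counting co-occurrence within a +/-10-word window, in plain Python over Brown tokens (assumes the Brown corpus has been downloaded):

```python
from collections import Counter, defaultdict
from nltk.corpus import brown

def cooccurrence_counts(tokens, window=10):
    # counts[w][c] = how often c appears within `window` words of w
    counts = defaultdict(Counter)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

counts = cooccurrence_counts([w.lower() for w in brown.words()[:20000]])
print(counts['school'].most_common(5))
```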
18
Pointwise Mutual Information
Use a weighting instead of raw counts: tf-idf (term frequency times inverse document frequency) is one option; following the idf intuition again, another is pointwise mutual information (PMI). PMI asks: do events x and y co-occur more often than they would if they were independent?
PMI(X,Y) = log2 [ P(X,Y) / (P(X) P(Y)) ]
Apply PMI between words; keeping only the positive PMI values between two words gives PPMI.
19
Computing the PPMI matrix
Given a matrix F with W rows (words) and C columns (contexts), where f_ij is the frequency of word w_i in context c_j:
p_ij = f_ij / Σ_{i,j} f_ij
p_i* = Σ_j f_ij / Σ_{i,j} f_ij (row marginal)
p_*j = Σ_i f_ij / Σ_{i,j} f_ij (column marginal)
pmi_ij = log2 ( p_ij / (p_i* p_*j) )
ppmi_ij = max(pmi_ij, 0)
20
Example: computing PPMI
We need counts, so let's make some up (the slide's table still needs to be filled in with counts).
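Since the slide's table is missing its counts, here is a sketch of the PPMI computation on an arbitrary count matrix F, with made-up toy numbers of my own:

```python
import numpy as np

def ppmi(F):
    # F: word-by-context count matrix (rows = words, columns = contexts)
    P = F / F.sum()                       # joint probabilities p_ij
    pw = P.sum(axis=1, keepdims=True)     # row marginals p_i*
    pc = P.sum(axis=0, keepdims=True)     # column marginals p_*j
    with np.errstate(divide='ignore'):    # log2(0) -> -inf, clipped below
        pmi = np.log2(P / (pw * pc))
    return np.maximum(pmi, 0.0)           # PPMI: keep only positive PMI

F = np.array([[0., 1., 2.],               # made-up counts
              [3., 1., 0.]])
print(ppmi(F))
```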
21
Association measures
PMI association: assoc_PMI(w, f) = log2 [ P(w,f) / (P(w) P(f)) ]
Lin association, where a feature f is composed of r (a relation) and w': assoc_Lin(w, f) = log2 [ P(w,f) / (P(r|w) P(w'|w)) ]
t-test association: see (20.41).
22
Figure 20.10 Co-occurrence vectors from dependency relations
Use a dependency-based parser (treated here as a special case of shallow parsing) to identify relations. From "I discovered dried tangerines." (20.32) we extract:
discover (subject: I)
I (subject-of: discover)
tangerine (obj-of: discover)
tangerine (adj-mod: dried)
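A small sketch of turning such triples into co-occurrence features, where each word's context vector counts (relation, other-word) pairs; the triples are taken from the slide's example:

```python
from collections import Counter, defaultdict

# Dependency triples for "I discovered dried tangerines." (20.32)
triples = [('discover', 'subject', 'I'),
           ('I', 'subject-of', 'discover'),
           ('tangerine', 'obj-of', 'discover'),
           ('tangerine', 'adj-mod', 'dried')]

# features[w][(rel, other)] = count; these are the dimensions of w's vector
features = defaultdict(Counter)
for word, rel, other in triples:
    features[word][(rel, other)] += 1

print(features['tangerine'])
```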
23
Figure 20.11 Objects of the verb drink Hindle 1990
24
Vectors review: dot product, vector length, cosine similarity (sim_cosine)
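The three pieces in one short numpy sketch (the two vectors are arbitrary examples):

```python
import numpy as np

v = np.array([1., 8., 15.])
w = np.array([1., 12., 36.])

dot = np.dot(v, w)             # dot product: sum of elementwise products
length = np.linalg.norm(v)     # vector length: sqrt(v . v)

# sim_cosine(v, w) = (v . w) / (|v| |w|)
cosine = dot / (np.linalg.norm(v) * np.linalg.norm(w))
print(cosine)
```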
25
Figure 20.12 Similarity of Vectors
26
Figure 20.13 Vector Similarity Summary
27
Figure 20.14 Hand-built patterns for hypernyms Hearst 1992
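A rough sketch of one such pattern as a regular expression. This is my own simplification: Hearst's patterns are defined over NP chunks, not raw words:

```python
import re

# "NP_0 such as NP_1, NP_2, ... and/or NP_n"  =>  each NP_i is a hyponym of NP_0
SUCH_AS = re.compile(r'(\w+) such as ((?:\w+, )*\w+(?: (?:and|or) \w+)?)')

sent = 'He ate fruits such as apples, pears and oranges.'
m = SUCH_AS.search(sent)
if m:
    hypernym = m.group(1)
    hyponyms = re.split(r', | and | or ', m.group(2))
    print(hyponyms, 'are hyponyms of', hypernym)
```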
28
Figure 20.15
29
Figure 20.16
30
http://www.cs.ucf.edu/courses/cap5636/fall2011/nltk
How to do it in NLTK
NLTK 3.0a1 was released in February 2013; this version adds support for NLTK's graphical user interfaces.
Which similarity function in nltk.corpus.wordnet is appropriate for finding the similarity of two words? I want to use such a function for word clustering, and the Yarowsky algorithm for finding similar collocations in a large text.