Published by Roy Willis. Modified over 9 years ago.
1
Natural Language Processing Spring 2007 V. “Juggy” Jagannathan
2
Course Book: Foundations of Statistical Natural Language Processing, by Christopher Manning & Hinrich Schütze
3
Chapter 5 Collocations January 29, 2007
4
Collocations
Eg:
–Strong tea; weapons of mass destruction; to make up; rich and powerful
Compositionality
–Meaning is the sum of the parts
Not true for collocations, and in particular not for idioms
–Kick the bucket; heard it through the grapevine
Contextual theory of meaning
–Firth’s quote: “You shall know a word by the company it keeps”
5
Frequency
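The frequency approach can be sketched as counting adjacent word pairs and ranking them by count. The toy text and the helper name `bigram_counts` below are hypothetical illustrations, not taken from the slides; a real run would use a large corpus.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs (bigrams) in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))

# Hypothetical toy text, split on whitespace.
text = "strong tea is strong tea and new companies like strong tea"
counts = bigram_counts(text.split())

# The most frequent bigram is the first collocation candidate.
top = counts.most_common(1)
print(top)
```

On real text the most frequent bigrams are mostly function-word pairs such as "of the", which is why raw frequency is usually combined with a filter.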
6
Filtering by POS
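A part-of-speech filter keeps only those bigrams whose tag sequence looks like a phrase. The sketch below assumes the tokens are already tagged by hand with a made-up tag set (A = adjective, N = noun, D = determiner, P = preposition), and the two allowed patterns are only an assumed subset of what a real filter would use.

```python
# Tag patterns treated as likely phrases (adjective-noun, noun-noun).
# This set is an assumption for illustration, not the slides' full list.
ALLOWED_PATTERNS = {("A", "N"), ("N", "N")}

def filter_bigrams(tagged_tokens):
    """Keep adjacent word pairs whose tag pair is an allowed pattern."""
    kept = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in ALLOWED_PATTERNS:
            kept.append((w1, w2))
    return kept

# Hypothetical hand-tagged input.
tagged = [("the", "D"), ("strong", "A"), ("tea", "N"), ("of", "P"),
          ("the", "D"), ("tea", "N"), ("company", "N")]
candidates = filter_bigrams(tagged)
print(candidates)
```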
7
Collocation in New York Times News wire articles
9
Mean and Variance
Not all collocations are bi-grams
Eg:
–She knocked on his door
–They knocked at the door
–A man knocked on the metal front door
The terms “knocked” & “door” go together
You would not say “hit” or “beat” or “rap” instead of “knock”
10
Counting collocations with intervening words
11
Mean & Variance
–She knocked on his door
–They knocked at the door
–A man knocked on the metal front door
–100 women knocked on Donaldson’s door
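The mean and variance here are computed over the offsets (signed distances in tokens) between the two words. A minimal sketch using the four example sentences follows; it assumes plain whitespace tokenization, so “Donaldson’s” counts as one token. A tokenizer that splits the possessive would give a different offset for the last sentence.

```python
from statistics import mean, variance

def offsets(sentences, w1, w2):
    """Distance in tokens from the first w1 to the first w2 in each sentence."""
    result = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        if w1 in tokens and w2 in tokens:
            result.append(tokens.index(w2) - tokens.index(w1))
    return result

sentences = [
    "She knocked on his door",
    "They knocked at the door",
    "A man knocked on the metal front door",
    "100 women knocked on Donaldson’s door",
]
d = offsets(sentences, "knocked", "door")
d_mean = mean(d)       # average offset
d_var = variance(d)    # sample variance (divides by n - 1)
print(d, d_mean, d_var)
```

A low variance means the two words tend to occur at a fixed distance from each other, which is the signal this method looks for.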
12
Strong support
Strong leftist support
Strong business support
13
Analysis of variance
A larger variance indicates that “strong” & “for” do not form an interesting collocation.
15
Hypothesis Testing
The fact that two words co-occur does not necessarily make them a collocation. Words can appear together merely by chance.
Null Hypothesis H_0
–There is no association between the words beyond chance: P(w_1 w_2) = P(w_1) P(w_2)
16
The t test
Null Hypothesis: the mean height of a population is 158 cm.
Because the t value is large, we can reject the null hypothesis that the mean of the population is 158 cm. The confidence level for this is 99.5%.
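For reference, the t statistic compares the sample mean with the mean expected under the null hypothesis, scaled by the sample variance. The slide's own numbers did not survive in this transcript; the values in the second line are the ones from the course book's height example as best recalled (sample of 200 with mean 169 and variance 2600) and should be treated as an assumption.

```latex
% \bar{x}: sample mean, s^2: sample variance, N: sample size,
% \mu: mean of the distribution under the null hypothesis
t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}
% Assumed values from the course book example:
t = \frac{169 - 158}{\sqrt{2600 / 200}} \approx 3.05 > 2.576
```

The value 2.576 is the critical value for the 99.5% confidence level, the same threshold used for the bigram example later in the deck.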
17
Applying t test to corpora
Applying the t test to the collocation “new companies”
The goal is to find out whether the occurrence of “new companies” can be explained by chance.
The text corpus is viewed as a long sequence of bigrams, i.e. adjacent word pairs.
18
Applying t test to text corpora
The word “new” occurs 15,828 times
The word “companies” occurs 4,675 times
There are 14,307,668 tokens (words)
P(new) = 15,828/14,307,668
P(companies) = 4,675/14,307,668
Null Hypothesis: the occurrences of these two words are independent.
H_0: P(new companies) = P(new) × P(companies) ≈ 3.615 × 10^-7
19
Applying the t test to the bigrams
The null hypothesis cannot be rejected, since the t value is less than 2.576, the critical value at the 99.5% confidence level. The data give no evidence that “new companies” is a collocation.
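The calculation can be sketched as follows. The word counts and corpus size come from the slides; the bigram count of 8 is an assumption (the figure used in the course book's version of this example), since the slide's own number is missing from the transcript. Each bigram position is treated as a Bernoulli trial, so the sample variance is p(1 - p).

```python
import math

N = 14_307_668                      # tokens in the corpus (from the slides)
c_new, c_companies = 15_828, 4_675  # word counts (from the slides)
c_bigram = 8                        # ASSUMPTION: count of "new companies"

# Expected bigram probability under the null hypothesis of independence.
mu = (c_new / N) * (c_companies / N)

# Observed bigram probability (sample mean) and Bernoulli variance.
x_bar = c_bigram / N
s2 = x_bar * (1 - x_bar)

t = (x_bar - mu) / math.sqrt(s2 / N)
reject = t > 2.576                  # critical value at the 99.5% level
print(f"mu={mu:.4e} x_bar={x_bar:.4e} t={t:.4f} reject={reject}")
```

Under the assumed count, t comes out just below 1, well under 2.576, so independence is not rejected.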
20
Finding Collocations using t test
Note that a stop-word list was used to filter out most candidate bigrams.
21
Applying t tests to detect differences in semantics
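For comparing how strongly two near-synonyms v^1 and v^2 (for example “strong” and “powerful”) associate with the same word w, the t statistic is commonly approximated from the two bigram counts alone. The form below is the approximation given in the course book as best recalled, and the numeric line uses illustrative counts rather than figures from the slides.

```latex
% C(v w): number of times the bigram "v w" occurs in the corpus
t \approx \frac{C(v^1 w) - C(v^2 w)}{\sqrt{C(v^1 w) + C(v^2 w)}}
% Illustrative counts (assumed): C(v^1 w) = 10, C(v^2 w) = 0
t \approx \frac{10 - 0}{\sqrt{10 + 0}} \approx 3.16
```

A large positive value suggests w prefers v^1; a large negative value suggests it prefers v^2.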
23
Pearson’s chi-square test
The chi-square test does not assume a normal distribution.
It compares observed values with expected values and determines whether the observed values differ significantly from the expected ones.
Used in a variety of interesting ways:
- in machine translation
- in determining similarity between text sources
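For a bigram, the observed values form a 2×2 table (first word present or absent, second word present or absent). A minimal sketch for “new companies” follows, reusing the word counts from the t test slides and assuming a bigram count of 8, which is not given in this transcript. The closed form used is the standard shortcut for 2×2 tables.

```python
N = 14_307_668                      # corpus size (from the slides)
c_new, c_companies = 15_828, 4_675  # word counts (from the slides)
c_bigram = 8                        # ASSUMPTION: count of "new companies"

# Observed 2x2 contingency table.
o11 = c_bigram                      # new, companies
o12 = c_new - c_bigram              # new, not companies
o21 = c_companies - c_bigram        # not new, companies
o22 = N - o11 - o12 - o21           # neither word

# Chi-square statistic for a 2x2 table.
numerator = N * (o11 * o22 - o12 * o21) ** 2
denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
chi2 = numerator / denominator

significant = chi2 > 3.841          # critical value, alpha = 0.05, 1 d.o.f.
print(f"chi2={chi2:.3f} significant={significant}")
```

Under the assumed count the statistic stays below 3.841, so this test also fails to reject independence for “new companies”.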
24
Likelihood ratios
25
Calculating the likelihood ratio
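One common way to calculate the likelihood ratio for a bigram w1 w2 compares two binomial models: one where P(w2 | w1) equals P(w2 | not w1), and one where they differ. The sketch below assumes that formulation; the function names and the toy counts are hypothetical. It works in log space and requires all estimated probabilities to lie strictly between 0 and 1.

```python
import math

def log_l(k, n, x):
    """Log of x^k * (1 - x)^(n - k); the binomial coefficient cancels out."""
    return k * math.log(x) + (n - k) * math.log(1 - x)

def neg2_log_lambda(c1, c2, c12, n):
    """-2 log(lambda) for counts c1 = C(w1), c2 = C(w2), c12 = C(w1 w2)."""
    p = c2 / n                      # P(w2) if w2 is independent of w1
    p1 = c12 / c1                   # P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)      # P(w2 | not w1)
    log_lambda = (log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
                  - log_l(c12, c1, p1) - log_l(c2 - c12, n - c1, p2))
    return -2 * log_lambda

# Hypothetical toy counts: an exactly independent pair and an associated pair.
independent = neg2_log_lambda(c1=100, c2=100, c12=10, n=1000)
associated = neg2_log_lambda(c1=100, c2=100, c12=50, n=1000)
print(independent, associated)
```

The statistic is near zero when the pair behaves independently and grows as the association strengthens; it is asymptotically chi-square distributed, so the same critical values apply.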
27
Relative Frequency Ratios
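A relative frequency ratio compares how often a phrase occurs in one corpus with how often it occurs in another, each normalized by corpus size. A minimal sketch with made-up counts (none of the numbers below come from the slides):

```python
def relative_frequency_ratio(count_a, size_a, count_b, size_b):
    """Ratio of a phrase's relative frequency in corpus A to that in corpus B."""
    return (count_a / size_a) / (count_b / size_b)

# Hypothetical counts: the phrase occurs 2 times in a 1,000-token general
# corpus and 20 times in a 2,000-token subject-specific corpus.
r = relative_frequency_ratio(2, 1000, 20, 2000)
print(r)
```

A ratio far from 1 marks the phrase as characteristic of one of the two corpora.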
28
Mutual Information
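Pointwise mutual information measures how much more often two words co-occur than independence would predict, as a base-2 logarithm. The sketch below reuses the “new companies” counts from the t test slides and again assumes a bigram count of 8, which the transcript does not state.

```python
import math

def pmi(c1, c2, c12, n):
    """Pointwise mutual information log2( P(w1 w2) / (P(w1) P(w2)) )."""
    p12 = c12 / n
    p1 = c1 / n
    p2 = c2 / n
    return math.log2(p12 / (p1 * p2))

N = 14_307_668          # corpus size (from the slides)
score = pmi(c1=15_828, c2=4_675, c12=8, n=N)   # c12 = 8 is an ASSUMPTION
print(f"{score:.3f}")
```

Under the assumed count the score is well below 1 bit, consistent with the weak association the t test found. PMI is known to overrate rare pairs, so it is usually applied only above some minimum count.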
31
Notion of Collocation
Non-Compositionality
–Idioms: kick the bucket; white wine
Non-Substitutability
–Cannot say “yellow wine”
Non-Modifiability
–The phrase “poor as a church mouse” cannot be modified to “people as poor as church mice”
–The phrase “to get a frog in one’s throat” cannot be modified to “to get an ugly frog in one’s throat”