Natural Language Processing Spring 2007 V. “Juggy” Jagannathan.


1 Natural Language Processing Spring 2007 V. “Juggy” Jagannathan

2 Course Book: Foundations of Statistical Natural Language Processing, by Christopher Manning & Hinrich Schütze

3 Chapter 5: Collocations (January 29, 2007)

4 Collocations
E.g.:
– Strong tea; weapons of mass destruction; to make up; rich and powerful
Compositionality
– Meaning is the sum of the parts
Not true for collocations, and in particular idioms
– Kick the bucket; heard it through the grapevine
Contextual theory of meaning
– Firth’s quote: “You shall know a word by the company it keeps”

5 Frequency

6 Filtering by POS

7 Collocations in New York Times newswire articles

8

9 Mean and Variance
Not all collocations are bigrams. E.g.:
– She knocked on his door
– They knocked at the door
– A man knocked on the metal front door
The terms “knocked” and “door” go together.
You would not say “hit”, “beat”, or “rap” instead of “knock”.

10 Counting collocations with intervening words

11 Mean & Variance
– She knocked on his door
– They knocked at the door
– A man knocked on the metal front door
– 100 women knocked on Donaldson’s door
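The mean and spread of the distance between “knocked” and “door” in the four sentences above can be worked out with a short sketch. It assumes plain whitespace tokenization (so “Donaldson’s” counts as a single token); a different tokenizer would give different offsets. The helper name `offset` is hypothetical.

```python
from statistics import mean, stdev

sentences = [
    "She knocked on his door",
    "They knocked at the door",
    "A man knocked on the metal front door",
    "100 women knocked on Donaldson's door",
]

def offset(sentence, w1="knocked", w2="door"):
    """Signed distance in tokens from w1 to w2 (whitespace tokenization, an assumption)."""
    tokens = sentence.lower().split()
    return tokens.index(w2) - tokens.index(w1)

offsets = [offset(s) for s in sentences]
mean_offset = mean(offsets)   # average distance between the two words
std_offset = stdev(offsets)   # sample standard deviation (divides by n - 1)

print(offsets, mean_offset, std_offset)
```

A small deviation relative to the mean suggests that the two words tend to occur at a fairly fixed distance from each other, which is the pattern expected of a collocation.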

12 Strong support; strong leftist support; strong business support

13 Analysis of variance
A larger variance indicates that “strong” and “for” do not form an interesting collocation.

14

15 Hypothesis Testing
The fact that two words co-occur does not necessarily make them a collocation; words can appear together merely by chance.
Null hypothesis H_0:
– There is no association between the words beyond chance: P(w_1 w_2) = P(w_1) P(w_2)

16 The t test
Null hypothesis: the mean height of a population is 158 cm.
Because the t value is large, we can reject the null hypothesis that the population mean is 158 cm. The confidence level for this is 99.5%.
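As a rough illustration of the height example, the t statistic compares the sample mean with the hypothesized mean, scaled by the standard error. The slide text does not give the sample itself, so the sample size, sample mean, and sample variance below are assumptions chosen only for illustration; only the 158 cm null value and the 2.576 critical value come from the slides. The function name `t_statistic` is hypothetical.

```python
import math

def t_statistic(sample_mean, mu, sample_var, n):
    """One-sample t statistic: (x_bar - mu) / sqrt(s^2 / n)."""
    return (sample_mean - mu) / math.sqrt(sample_var / n)

mu_0 = 158.0          # null hypothesis: population mean height in cm (from the slide)
sample_mean = 169.0   # ASSUMED sample mean, not given in the slide text
sample_var = 2600.0   # ASSUMED sample variance, not given in the slide text
n = 200               # ASSUMED sample size, not given in the slide text

t = t_statistic(sample_mean, mu_0, sample_var, n)
critical = 2.576      # critical value used in these slides (99.5% confidence)
reject_null = t > critical

print(f"t = {t:.3f}, reject H_0: {reject_null}")
```

With these assumed numbers the statistic exceeds the critical value, so the null hypothesis would be rejected, in line with the conclusion on the slide.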

17 Applying the t test to corpora
Applying the t test to the collocation “new companies”: the goal is to find out whether the occurrence of “new companies” can be explained by chance.
The text corpus is viewed as a long sequence of bigrams (pairs of adjacent words).

18 Applying the t test to text corpora
The word “new” occurs 15,828 times.
The word “companies” occurs 4,675 times.
There are 14,307,668 tokens (words).
P(new) = 15,828 / 14,307,668
P(companies) = 4,675 / 14,307,668
Null hypothesis: the occurrences of these two words are independent.
H_0: P(new companies) = P(new) × P(companies) ≈ 3.615 × 10^-7
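The arithmetic on this slide can be checked directly. This small sketch uses only the counts quoted above; the variable names are illustrative.

```python
n_tokens = 14_307_668      # total tokens in the corpus
count_new = 15_828         # occurrences of "new"
count_companies = 4_675    # occurrences of "companies"

p_new = count_new / n_tokens
p_companies = count_companies / n_tokens

# Under the null hypothesis of independence, the bigram probability
# is the product of the two unigram probabilities.
p_h0 = p_new * p_companies

print(f"P(new) = {p_new:.4e}")
print(f"P(companies) = {p_companies:.4e}")
print(f"H_0: P(new companies) = {p_h0:.3e}")
```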

19 Applying the t test to the bigrams
The null hypothesis cannot be rejected because the t value is less than 2.576, so “new companies” is NOT considered a collocation.
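A minimal sketch of this test for “new companies” follows. The unigram counts and corpus size are taken from the previous slide, but the bigram count is not in the slide text: the value 8 below is an assumption used for illustration. Each bigram position is treated as a Bernoulli trial, so the sample variance is p(1 - p).

```python
import math

n_tokens = 14_307_668
count_new = 15_828
count_companies = 4_675
count_bigram = 8   # ASSUMED count of "new companies"; not given in the slide text

# Expected bigram probability under independence (the null hypothesis).
mu = (count_new / n_tokens) * (count_companies / n_tokens)

# Observed bigram probability (the sample mean of the Bernoulli trials).
x_bar = count_bigram / n_tokens

# Bernoulli variance p(1 - p); for tiny p this is approximately p.
s2 = x_bar * (1 - x_bar)

t = (x_bar - mu) / math.sqrt(s2 / n_tokens)
critical = 2.576   # critical value quoted on the slide

print(f"t = {t:.4f}, reject H_0: {t > critical}")
```

Under the assumed count, t stays well below 2.576, so independence cannot be rejected.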

20 Finding Collocations using the t test
It should be noted that a stop word list was used to eliminate most candidate collocations.

21 Applying t tests to detect differences in semantics

22

23 Pearson’s chi-square test
The chi-square test does not assume a normal distribution.
It compares observed values with expected values and determines whether the observed values differ significantly from the expected ones.
Used in a variety of interesting ways:
- in machine translation
- in determining similarity between text sources
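For a bigram, the observed-versus-expected comparison can be sketched with a 2x2 contingency table. The example below reuses the “new companies” counts from the t test slides; the bigram count of 8 is again an assumption that is not in the slide text, and the function name `chi_square_2x2` is hypothetical. The closed-form expression is the standard shortcut for Pearson's statistic on a 2x2 table.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square statistic for a 2x2 table of observed counts."""
    n = o11 + o12 + o21 + o22
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator

n_tokens = 14_307_668
count_new = 15_828
count_companies = 4_675
count_bigram = 8   # ASSUMED count of "new companies"; not given in the slide text

o11 = count_bigram                                           # "new" followed by "companies"
o12 = count_new - count_bigram                               # "new" followed by another word
o21 = count_companies - count_bigram                         # another word followed by "companies"
o22 = n_tokens - count_new - count_companies + count_bigram  # neither word

chi2 = chi_square_2x2(o11, o12, o21, o22)
critical = 3.841   # chi-square critical value for 1 degree of freedom at alpha = 0.05
significant = chi2 > critical

print(f"chi-square = {chi2:.2f}, significant: {significant}")
```

With the assumed count, the statistic falls below the critical value, so this test also fails to reject independence for “new companies”.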

24 Likelihood ratios

25 Calculating the likelihood ratio

26

27 Relative Frequency Ratios

28 Mutual Information

29

30

31 Notion of Collocation
Non-Compositionality
– Idioms: kick the bucket; white wine
Non-Substitutability
– Cannot say “yellow wine”
Non-Modifiability
– The phrase “poor as a church mouse” cannot be modified to “people as poor as church mice”
– The phrase “to get a frog in one’s throat” cannot be modified to “to get an ugly frog in one’s throat”

