Natural Language Processing Spring 2007 V. “Juggy” Jagannathan.


Course book: Foundations of Statistical Natural Language Processing, by Christopher Manning & Hinrich Schütze

Chapter 5 Collocations January 29, 2007

Collocations
Examples:
–Strong tea; weapons of mass destruction; to make up; rich and powerful
Compositionality
–Meaning is the sum of the parts
–Not true for collocations, and in particular idioms
–Kick the bucket; heard it through the grapevine
Contextual theory of meaning
–Firth’s quote: “You shall know a word by the company it keeps”

Frequency

Filtering by POS

Collocations in New York Times newswire articles

Mean and Variance
Not all collocations are bigrams. Examples:
–She knocked on his door
–They knocked at the door
–A man knocked on the metal front door
The terms “knocked” and “door” go together.
You would not say “hit”, “beat”, or “rap” instead of “knock”.

Counting collocations with intervening words

Mean & Variance
–She knocked on his door
–They knocked at the door
–A man knocked on the metal front door
–100 women knocked on Donaldson’s door
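The mean and variance here are taken over the offset (distance in tokens) between “knocked” and “door” in each sentence. The following is a minimal sketch of that computation, assuming simple whitespace tokenization; the helper name `offsets` is hypothetical and not from the slides.

```python
from statistics import mean, stdev


def offsets(sentences, w1, w2):
    """Return the signed token distance from the first w1 to the first w2
    in every sentence that contains both words."""
    result = []
    for sentence in sentences:
        # Assumption: lowercase + whitespace split is a good enough tokenizer.
        tokens = sentence.lower().split()
        if w1 in tokens and w2 in tokens:
            result.append(tokens.index(w2) - tokens.index(w1))
    return result


sentences = [
    "She knocked on his door",
    "They knocked at the door",
    "A man knocked on the metal front door",
    "100 women knocked on Donaldson's door",
]

d = offsets(sentences, "knocked", "door")
d_mean = mean(d)   # average offset between the two words
d_std = stdev(d)   # sample standard deviation (divides by n - 1)
print(d, d_mean, d_std)
```

With this tokenizer the offsets are 3, 3, 5 and 3, giving a mean of 3.5 and a sample standard deviation of 1.0. A tokenizer that splits the possessive in “Donaldson’s” into separate tokens would give a larger offset for the last sentence, so the exact figures depend on that choice. A low deviation suggests the two words tend to occur at a fixed distance from each other.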

Strong support; strong leftist support; strong business support

Analysis of variance
A larger variance indicates that “strong” and “for” do not form an interesting collocation.

Hypothesis Testing
The fact that two words co-occur does not necessarily make them a collocation; words can appear together merely by chance.
Null hypothesis $H_0$:
–There is no association between the words beyond chance: $P(w_1 w_2) = P(w_1)P(w_2)$

The t test
Null hypothesis: the mean height of a population is 158 cm.
Because the t value is large, we can reject the null hypothesis that the population mean is 158 cm.
The confidence level for this is 99.5%.
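The t statistic measures how far a sample mean lies from the hypothesized mean, in units of the standard error: $t = (\bar{x} - \mu) / \sqrt{s^2 / N}$. The sketch below works through this for the height example. The sample figures (200 people, mean 169 cm, variance 2600) are illustrative assumptions and do not appear in the slide text; 2.576 is the critical value quoted elsewhere in these slides.

```python
import math


def t_statistic(sample_mean, sample_var, n, mu):
    """One-sample t statistic: (x_bar - mu) / sqrt(s^2 / n)."""
    return (sample_mean - mu) / math.sqrt(sample_var / n)


# Hypothetical sample, chosen for illustration only.
t_height = t_statistic(sample_mean=169.0, sample_var=2600.0, n=200, mu=158.0)

CRITICAL_VALUE = 2.576  # threshold used in these slides
reject_null = t_height > CRITICAL_VALUE
print(round(t_height, 2), reject_null)
```

Under these assumed numbers t is about 3.05, which exceeds 2.576, so the null hypothesis of a 158 cm mean would be rejected.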

Applying the t test to corpora
Applying the t test to the candidate collocation “new companies”
The goal is to find out whether the occurrence of “new companies” is explained by chance.
The text corpus is viewed as a long sequence of bigrams (pairs of adjacent words).

Applying the t test to text corpora
The word “new” occurs 15,828 times
The word “companies” occurs 4,675 times
There are 14,307,668 tokens (words)
$P(\text{new}) = 15828 / 14307668$
$P(\text{companies}) = 4675 / 14307668$
Null hypothesis: the occurrences of these two words are independent.
$H_0: P(\text{new companies}) = P(\text{new}) \times P(\text{companies}) \approx 3.615 \times 10^{-7}$

Applying the t test to the bigrams
The null hypothesis cannot be rejected, since the t value is less than the critical value of 2.576, so “new companies” is NOT a collocation.
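The bigram test can be sketched as follows, treating each bigram position as a Bernoulli trial that is 1 when the bigram is “new companies”. The unigram counts and corpus size are the ones given in these slides; the bigram count of 8 is an assumption (the slide text does not state it), and the function name `bigram_t` is hypothetical.

```python
import math


def bigram_t(c1, c2, c12, n):
    """t statistic for a bigram under the independence null hypothesis.

    c1, c2: unigram counts; c12: bigram count; n: number of tokens.
    """
    mu = (c1 / n) * (c2 / n)        # expected bigram probability under H0
    x_bar = c12 / n                 # observed bigram probability
    s2 = x_bar * (1.0 - x_bar)      # Bernoulli variance, approximately x_bar
    return (x_bar - mu) / math.sqrt(s2 / n)


N_TOKENS = 14_307_668
# Assumption: "new companies" occurs 8 times in the corpus.
t_bigram = bigram_t(c1=15_828, c2=4_675, c12=8, n=N_TOKENS)
is_collocation = t_bigram > 2.576
print(round(t_bigram, 4), is_collocation)
```

With the assumed bigram count the t value comes out close to 1, well below 2.576, which matches the conclusion that independence cannot be rejected for this pair.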

Finding collocations using the t test
Note that a stop-word list was used to eliminate most candidate collocations.

Applying t tests to detect differences in semantics

Pearson’s chi-square test
The chi-square test does not assume a normal distribution.
It compares observed values with expected values and determines whether the observed values differ significantly from the expected ones.
Used in a variety of interesting ways:
- in machine translation
- in determining similarity between text sources
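For a bigram, the observed values form a 2x2 table (first word is or is not “new”, second word is or is not “companies”). The sketch below builds that table from counts and applies the standard closed form for a 2x2 chi-square statistic. It reuses the unigram counts from these slides; the bigram count of 8 is again an assumption, and 3.841 is the usual critical value for one degree of freedom at the 0.05 level.

```python
def chi_square_2x2(c1, c2, c12, n):
    """Pearson chi-square for the 2x2 contingency table of a bigram."""
    o11 = c12                    # w1 followed by w2
    o12 = c2 - c12               # w2 preceded by some other word
    o21 = c1 - c12               # w1 followed by some other word
    o22 = n - c1 - c2 + c12      # neither word in its slot
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator


# Assumption: "new companies" occurs 8 times in the corpus.
chi2 = chi_square_2x2(c1=15_828, c2=4_675, c12=8, n=14_307_668)
significant = chi2 > 3.841
print(round(chi2, 2), significant)
```

Under these assumptions the statistic is roughly 1.55, below 3.841, so the chi-square test agrees with the t test that “new companies” shows no significant association.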

Likelihood ratios

Calculating the likelihood ratio

Relative Frequency Ratios

Mutual Information
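Pointwise mutual information compares the observed bigram probability with the probability expected if the two words were independent: $I(w_1, w_2) = \log_2 \frac{P(w_1 w_2)}{P(w_1)P(w_2)}$. A minimal sketch, using the unigram counts from the “new companies” example and an assumed bigram count of 8 (not stated in the slide text):

```python
import math


def pointwise_mutual_information(c1, c2, c12, n):
    """log2 of observed bigram probability over the independence estimate."""
    p12 = c12 / n
    p1 = c1 / n
    p2 = c2 / n
    return math.log2(p12 / (p1 * p2))


# Assumption: "new companies" occurs 8 times in the corpus.
pmi = pointwise_mutual_information(c1=15_828, c2=4_675, c12=8, n=14_307_668)
print(round(pmi, 3))
```

A value near zero means the pair occurs about as often as chance predicts; here the score is well under 1 bit, consistent with the weak association found by the tests above.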

Notion of Collocation
Non-compositionality
–Examples: kick the bucket (idiom); white wine
Non-substitutability
–Cannot say “yellow wine”
Non-modifiability
–“Poor as a church mouse” cannot be modified to “people as poor as church mice”
–“To get a frog in one’s throat” cannot be modified to “to get an ugly frog in one’s throat”
