Using Corpora for Language Research
COGS 523, Lecture 8: Collocations
Bilge Say
Related Readings
Manning and Schutze (1999). Foundations of Statistical Natural Language Processing. Chapter 5, Collocations.
Optional: Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics: An International Handbook, article 58. Mouton de Gruyter, Berlin. (Extended manuscript and related materials available from the author's web site.)
Collocations
A collocation is an expression consisting of two or more words that corresponds to some conventional way of saying things.
Collocations are characterized by limited compositionality: they are not fully compositional, in that there is usually an element of meaning added to the combination.
ex. strong tea
Idioms are the most extreme examples of non-compositionality; ex. kick the bucket.
Most collocations exhibit milder forms of non-compositionality; ex. international best practice.
Collocations are important for a number of applications: natural language generation, computational lexicography, parsing, and corpus linguistic research; also sociolinguistics.
ex. strong tea, not powerful tea
Manning and Schutze: example corpus for the following analyses
New York Times (August to November 1990): 115 MB of text, 14 million words
Approaches to finding collocations:
Frequency
Mean and variance
Hypothesis testing
Likelihood ratios
Pointwise mutual information
Frequency
If two words occur together a lot, that is evidence that they have a special function that is not simply explained as the function that results from their combination.
Heuristic: pass the candidate phrases through a part-of-speech filter (see the sketch below).
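A minimal sketch of this heuristic, assuming the corpus is already tokenized and tagged as (word, tag) pairs; the simplified tag set (A = adjective, N = noun), the pattern list, and the toy input are illustrative rather than corpus data.

```python
# Frequency heuristic with a part-of-speech filter: count adjacent word pairs
# and keep only the frequent ones whose tag pattern looks like a phrase.
from collections import Counter

ALLOWED_PATTERNS = {("A", "N"), ("N", "N")}   # adjective-noun and noun-noun bigrams

def candidate_bigrams(tagged_tokens, min_count=2):
    """Return frequent adjacent word pairs whose POS pattern is allowed."""
    counts = Counter(
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:])
        if (t1, t2) in ALLOWED_PATTERNS
    )
    return [(pair, c) for pair, c in counts.most_common() if c >= min_count]

# Toy input; a real run would use a tagged corpus.
tagged = [("strong", "A"), ("tea", "N"), ("and", "C"),
          ("strong", "A"), ("tea", "N"), ("again", "Adv")]
print(candidate_bigrams(tagged))   # [(('strong', 'tea'), 2)]
```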
Most frequent bigrams in the corpus (excerpt):
C(w1 w2)   w1 w2
…          of the
58841      in the
26430      to the
21842      on the
21839      for the
13899      in a
13689      of a
8753       has been

Part-of-speech patterns for the filter:
Tag pattern   Example
A N           linear function
N N           regression coefficients
A A N         Gaussian random variable
A N N         cumulative distribution function
N A N         mean squared error
N N N         class probability function
N P N         degrees of freedom

Most frequent bigrams after applying the part-of-speech filter (excerpt):
C(w1 w2)   w1 w2
…          New York
7261       United States
5412       Los Angeles
3301       last year
3191       Saudi Arabia
2699       last week
2514       vice president
2378       Persian Gulf

(Manning and Schutze, 1999)
Collocates of strong and powerful:
w            C(strong, w)      w            C(powerful, w)
support      50                force        13
safety       22                computers    10
sales        21                position     8
opposition   19                men          8
showing      18                computers    8
sense        18                man          7
message      15                symbol       6
defense      14                military     6
(Manning and Schutze, 1999)
Mean and Variance
The frequency-based approach works well for fixed phrases, but many collocations consist of two words that stand in a more flexible relationship to one another:
she knocked on his door; they knocked at the door; 100 women knocked on Donaldson's door; a man knocked on the metal front door
The mean is simply the average offset. For the example sentences, the mean offset between knocked and door is 4.0.
Variance measures how much the individual offsets deviate from the mean; the sample standard deviation is the square root of the sample variance. For the example, the standard deviation of the offsets between knocked and door is 1.15.
We can use this information to discover collocations by looking for pairs with low deviation. A low deviation means that the two words usually occur at about the same distance. Zero deviation means that the two words always occur at exactly the same distance.
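A small sketch of the computation, using offsets (3, 3, 5, 5) consistent with the mean and deviation quoted above for knocked and door.

```python
# Mean and sample standard deviation of the offsets at which "door" occurs
# relative to "knocked" within a collocation window.
from statistics import mean, stdev   # stdev uses the n-1 (sample) denominator

offsets = [3, 3, 5, 5]
print(f"mean offset = {mean(offsets):.2f}")        # 4.00
print(f"sample deviation = {stdev(offsets):.2f}")  # 1.15
```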
Sample deviation, sample mean, and count of the offsets for selected word pairs (numeric values not shown): New York, previous games, minus points, hundreds dollars, editorial Atlanta, ring New, point hundredth, subscribers by, strong support, powerful organizations, Richard Nixon, Garrison said (Manning and Schutze, 1999)
Hypothesis testing
High frequency and low variance can be accidental: if the two constituent words of a frequent bigram like new companies are themselves regularly occurring words (as new and companies are), then we expect the two words to co-occur a lot just by chance.
What we really want to know is whether two words occur together more often than chance. Assessing whether or not something is a chance event is one of the classical problems of statistics.
How can we apply the methodology of hypothesis testing to the problem of finding collocations?
We first formulate a null hypothesis which states what should be true if two words do not form a collocation, i.e. independence: P(w1 w2) = P(w1) P(w2)
The t test
We need a statistical test that tells us how probable or improbable it is that a certain constellation of words will occur. A test that has been widely used for collocation discovery is the t test:
t = (x̄ − μ) / √(s² / N)
where x̄ is the sample mean, s² the sample variance, N the sample size, and μ the mean of the distribution under the null hypothesis.
Example: new companies
P(new) = 15828 / N and P(companies) = 4675 / N, where N is the number of tokens in the corpus.
Null hypothesis: P(new companies) = P(new) × P(companies)
Observed: C(new companies) = 8, so x̄ = 8 / N; for a Bernoulli variable with such a small probability, s² = p(1 − p) ≈ x̄.
t = (x̄ − μ) / √(s² / N) ≈ (x̄ − P(new) P(companies)) / √(x̄ / N) ≈ 1
t ≈ 1 is not larger than 2.576, the critical value for α = 0.005. We cannot reject the null hypothesis that new and companies occur independently and do not form a collocation.
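A worked sketch of the t value above; the exact corpus size is an assumption taken from the textbook example (the slides round it to 14 million words), and s² is approximated by x̄ as for a low-probability Bernoulli variable.

```python
# t test for "new companies" using C(new) = 15828, C(companies) = 4675,
# C(new companies) = 8. The corpus size N is assumed from the textbook example.
from math import sqrt

N = 14_307_668                       # corpus size in tokens (assumption from M&S ch. 5)
mu = (15828 / N) * (4675 / N)        # expected bigram probability under H0
x_bar = 8 / N                        # observed relative frequency of the bigram
s2 = x_bar * (1 - x_bar)             # sample variance, close to x_bar for small p
t = (x_bar - mu) / sqrt(s2 / N)
print(f"t = {t:.3f}")                # about 1.0, well below the 2.576 critical value
```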
Bigrams ranked by the t test, from high t to low t (values of t, C(w1), C(w2), and C(w1, w2) not shown): Ayatollah Ruhollah, Bette Midler, Agatha Christie, videocassette recorder, unsalted butter, first made, over many, into them, like people, time last (Manning and Schutze, 1999)
It turns out that most bigrams attested in a corpus co-occur significantly more often than chance would predict: language is very regular, so very few completely unpredictable events happen. The t test and other statistical tests are therefore most useful as a method for ranking candidate collocations.
Hypothesis testing of differences
The t test can also be used for a slightly different collocation discovery problem: finding words whose co-occurrence patterns best distinguish between two words, ex. words that best differentiate the meanings of strong and powerful. (A sketch of the simplified difference statistic follows the table below.)
Words whose co-occurrence patterns best distinguish powerful and strong, ranked by the difference t statistic (values of t, C(w), C(strong w), and C(powerful w) not shown): computers, computer, symbol, machines, Germany (more associated with powerful); support, enough, safety, sales, opposition (more associated with strong) (Manning and Schutze, 1999)
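A minimal sketch of the difference test: for two competing words v1 and v2, the t statistic reduces approximately to (C(v1, w) − C(v2, w)) / √(C(v1, w) + C(v2, w)) when the co-occurrence probabilities are small; the counts in the example call are illustrative, not the corpus figures behind the table.

```python
# Approximate t statistic for "w co-occurs more often with v1 than with v2".
from math import sqrt

def t_difference(count_v1_w, count_v2_w):
    """Difference test using the small-probability approximation s^2 ~ x_bar."""
    return (count_v1_w - count_v2_w) / sqrt(count_v1_w + count_v2_w)

# e.g. a word seen 50 times after v1 = "strong" and 13 times after v2 = "powerful"
print(f"{t_difference(50, 13):.2f}")   # positive: the word prefers v1
```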
Pearson's chi-square test
The t test assumes that probabilities are approximately normally distributed, which is not true in general. The essence of the X² test is to compare the observed frequencies in a contingency table with the frequencies expected under independence. If the difference between observed and expected frequencies is large, we can reject the null hypothesis of independence.
X² = Σᵢ,ⱼ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ

Contingency table for new companies:
                 w1 = new                      w1 ≠ new
w2 = companies   8 (new companies)             4667 (e.g. old companies)
w2 ≠ companies   15820 (e.g. new machines)     ≈14,287,000 (e.g. old machines)

Expected frequency for the top-left cell under independence: E₁₁ = ((8 + 4667) / N) × ((8 + 15820) / N) × N ≈ 5.2
X² ≈ 1.55, which is not larger than 3.841, the critical value for α = 0.05. We cannot reject the null hypothesis that new and companies occur independently and do not form a collocation. (Manning and Schutze, 1999)
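A hedged sketch of the X² computation on the table above; the bottom-right cell is the remainder of a corpus of roughly 14.3 million tokens and is an assumption, while the other cells follow from the counts given earlier.

```python
# Pearson's X^2: expected counts come from the row and column totals, then each
# cell contributes (O - E)^2 / E.

def chi_square(observed):
    """Pearson's X^2 for a 2D contingency table given as a list of rows."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    x2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count under independence
            x2 += (o - e) ** 2 / e
    return x2

# rows: w2 = companies / w2 != companies; columns: w1 = new / w1 != new
# (the last cell is an assumed remainder of the ~14.3M-token corpus)
table = [[8, 4667], [15820, 14_287_173]]
print(f"X^2 = {chi_square(table):.2f}")   # about 1.55, below the 3.84 critical value
```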
Likelihood ratios
Likelihood ratios are more appropriate for sparse data than the X² test, and a likelihood ratio is more interpretable than an X² value.
We consider two alternative explanations for the occurrence frequency of a bigram w1 w2:
Hypothesis 1: P(w2 | w1) = p = P(w2 | ¬w1)
Hypothesis 2: P(w2 | w1) = p1 ≠ p2 = P(w2 | ¬w1)
Hypothesis 1 is a formalization of independence; Hypothesis 2 is a formalization of dependence, which is good evidence for an interesting collocation.
Bigrams involving powerful ranked by −2 log λ (values of −2 log λ, C(w1), C(w2), and C(w1, w2) not shown): most powerful, politically powerful, powerful computers, powerful force, powerful symbol, powerful lobbies, economically powerful, powerful magnet, powerful cudgels (Manning and Schutze, 1999)
One advantage of likelihood ratios is that they have a clear intuitive interpretation. For example, the bigram powerful computers is e^(0.5 × 82.96) ≈ 1.3 × 10^18 times more likely under Hypothesis 2 (computers is more likely to follow powerful than its base rate of occurrence would suggest) than under Hypothesis 1 (independence).
If λ is a likelihood ratio of a particular form, then the quantity −2 log λ is asymptotically X² distributed, so we can use X² tables to test H1 against H2. E.g., the −2 log λ value for powerful cudgels is large enough to reject H1 for this bigram at a significance level of 0.005.
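A minimal sketch of the likelihood-ratio computation, following the binomial formulation of the two hypotheses above; the counts passed in the example call are placeholders, not corpus figures.

```python
# -2 log lambda for a bigram (w1, w2): under H1 a single probability p governs w2
# after both w1 and other words; under H2 there are two probabilities p1 and p2.
from math import log

def log_l(k, n, x):
    """Log binomial likelihood x^k (1 - x)^(n - k); the C(n, k) terms cancel in the ratio."""
    return k * log(x) + (n - k) * log(1 - x)

def minus_2_log_lambda(c1, c2, c12, n):
    """-2 log of L(H1)/L(H2); asymptotically X^2 distributed."""
    p = c2 / n                        # H1: P(w2 | w1) = P(w2 | not w1) = p
    p1 = c12 / c1                     # H2: P(w2 | w1) = p1
    p2 = (c2 - c12) / (n - c1)        # H2: P(w2 | not w1) = p2
    log_lambda = (log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
                  - log_l(c12, c1, p1) - log_l(c2 - c12, n - c1, p2))
    return -2 * log_lambda

# Placeholder counts: C(w1) = 2000, C(w2) = 3500, C(w1 w2) = 10
print(f"{minus_2_log_lambda(c1=2000, c2=3500, c12=10, n=14_307_668):.2f}")
```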
Relative Frequency Ratios
Ratios of relative frequencies between two or more different corpora can be used to discover collocations that are characteristic of a corpus when compared to other corpora.
e.g. Karim Obeid occurs 68 times in the 1989 corpus but only 2 times in the 1990 corpus, so the relative frequency ratio is r = (2 / N1990) / (68 / N1989), where N1990 and N1989 are the sizes of the two corpora.
Relative frequency ratios are useful for finding subject-specific collocations; the proposed application is to compare a general text with a subject-specific text.
Relative frequency ratios for selected word pairs (ratio values not shown): Karim Obeid, East Berliners, Miss Manners, earthquake …, HUD officials, EAST GERMANS, Muslim cleric, John Le …, Prague Spring, Among individual (Manning and Schutze, 1999)
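A minimal sketch of the ratio computation; the corpus sizes used below are placeholders, not the actual New York Times figures.

```python
# Relative frequency ratio r: the relative frequency of a phrase in corpus A
# divided by its relative frequency in corpus B.

def relative_frequency_ratio(count_a, size_a, count_b, size_b):
    """Small r means the phrase is much more characteristic of corpus B than of A."""
    return (count_a / size_a) / (count_b / size_b)

# e.g. a phrase seen 2 times in the 1990 corpus and 68 times in the 1989 corpus
# (placeholder corpus sizes)
r = relative_frequency_ratio(2, 14_000_000, 68, 11_000_000)
print(f"r = {r:.4f}")
```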
Mutual Information
Symbol     Definition                              Current use                     Fano
I(x, y)    log ( p(x, y) / (p(x) p(y)) )           pointwise mutual information    mutual information
I(X; Y)    E [ log ( p(X, Y) / (p(X) p(Y)) ) ]     mutual information              average MI / expectation of MI
Pointwise mutual information for selected bigrams (values of I and the counts C(w1), C(w2), C(w1, w2) not shown): Schwartz eschews, fewest visits, FIND GARDEN, Indonesian pieces, Peds survived, marijuana growing, doubt whether, new converts, like offensive, must think (Manning and Schutze, 1999)
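A minimal sketch of pointwise mutual information as defined above, computed in base 2 so the score is in bits; the counts in the example call are placeholders.

```python
# Pointwise mutual information: I(w1, w2) = log2( P(w1 w2) / (P(w1) P(w2)) ).
from math import log2

def pmi(c1, c2, c12, n):
    """PMI (in bits) of the bigram w1 w2 in a corpus of n tokens."""
    return log2((c12 / n) / ((c1 / n) * (c2 / n)))

# e.g. two rare words that always occur together get a very high PMI score
print(f"{pmi(c1=20, c2=20, c12=20, n=14_307_668):.2f} bits")
```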
Next Week
Biber et al., chapter on Register and Discourse Variation.