
1 COLLOCATIONS He Zhongjun 2007-04-13

2 Outline Introduction; Approaches to find collocations: Frequency, Mean and Variance, Hypothesis testing, Mutual information; Applications

3 Outline Introduction; Approaches to find collocations: Frequency, Mean and Variance, Hypothesis testing, Mutual information; Applications

4 What are collocations? A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. (the textbook) A collocation is a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. (Choueka 1988)

5 Examples Noun phrases: strong tea vs. powerful tea. Verbs: make a decision vs. take a decision; knock on the door vs. hit the door; make up. Idioms: kick the bucket (to die). Subtle, unexplainable native-speaker usage: broad daylight vs. bright daylight; Chinese time expressions such as 昨天 (yesterday), 去年 (last year), 上个月 (last month).

6 Introduction – Characteristics/Criteria Non-compositionality, e.g. kick the bucket; white wine, white hair, white woman. Non-substitutability, e.g. white wine -> yellow wine? Non-modifiability, e.g. as poor as a church mouse -> as poor as church mice? Collocations cannot be translated word for word.

7 Outline Introduction; Approaches to find collocations: Frequency, Mean and Variance, Hypothesis testing, Mutual information; Applications

8 Frequency (2-1) Counting, e.g. counting the bigrams in a corpus. Raw counts alone are not effective: most of the most frequent pairs are pairs of function words!

9 Frequency (2-2) Filter the bigrams by part-of-speech pattern (Justeson and Katz 1995), or use a stop list of function words: a simple quantitative technique plus simple linguistic knowledge.
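
A minimal sketch of the POS-filtered counting in Python (an illustration, not code from the slides; the simplified "A"/"N" tags and the function name are assumptions):

```python
from collections import Counter

# Tag patterns kept by the filter, in the spirit of Justeson and Katz (1995):
# adjective-noun and noun-noun bigrams ("A"/"N" are simplified placeholder tags).
GOOD_PATTERNS = {("A", "N"), ("N", "N")}

def candidate_collocations(tagged_tokens, top=10):
    """tagged_tokens: list of (word, pos) pairs, e.g. ("strong", "A")."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in GOOD_PATTERNS:
            counts[(w1, w2)] += 1
    return counts.most_common(top)

tokens = [("the", "DET"), ("strong", "A"), ("tea", "N"), ("is", "V"),
          ("in", "P"), ("the", "DET"), ("tea", "N"), ("pot", "N")]
print(candidate_collocations(tokens))  # keeps (strong, tea) and (tea, pot)
```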

10 Mean and Variance (4-1) From fixed bigrams to bigrams at a distance: she knocked on his door; they knocked at the door; 100 women knocked on Donaldson's door; a man knocked on the metal front door. The offsets of door from knocked are 3, 3, 5, 5, so the mean offset is (3+3+5+5)/4 = 4.0 and the sample deviation is about 1.15.

11 Mean and Variance (4-2) Mean: $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$. Sample deviation: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(d_i - \bar{d})^2}$. A low variance means the two words usually occur at about the same distance.
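
A small Python sketch (illustrative, not from the slides) that reproduces the knocked/door numbers of the previous slide:

```python
from math import sqrt

# Offsets of "door" relative to "knocked" in the four example sentences.
offsets = [3, 3, 5, 5]

n = len(offsets)
mean = sum(offsets) / n                                    # (3+3+5+5)/4 = 4.0
s = sqrt(sum((d - mean) ** 2 for d in offsets) / (n - 1))  # sample deviation

print(mean, round(s, 2))  # 4.0 1.15
```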

12 Mean and Variance (4-3) A mean of -1.15 indicates that strong usually occurs about one word to the left of the other word, e.g. strong business support. strong and for do not form a collocation: their offset distribution is spread out (high deviation).

13 Mean and Variance (4-4) If the mean offset is close to 1.0 and the deviation is low, the method finds the same fixed collocations as the frequency-based method. Unlike that method, it can also find looser phrases whose words occur at variable distances.

14 Hypothesis Testing What if a high frequency and low variance are accidental? E.g. new companies: new and companies are both frequent words, so their co-occurrence may be chance rather than a collocation. Hypothesis testing assesses whether or not something is a chance event. Null hypothesis H0: there is no association between the words beyond chance occurrences. Compute the probability p that the event would occur if H0 were true. If p is below a significance level (e.g. p < 0.05), reject H0; otherwise, retain H0.

15 t-test (5-1) The t statistic: $t = \frac{\bar{x} - \mu}{\sqrt{s^2/N}}$, where $\bar{x}$ is the sample mean, $\mu$ the mean of the distribution under H0, $s^2$ the sample variance, and $N$ the sample size. Think of the corpus as a long sequence of N bigrams: each position has value 1 if the bigram of interest occurs there and 0 otherwise (a binomial distribution).

16 t-test (5-2) N(new) = 15828, N(companies) = 4675, N(tokens) = 14307668, N(new companies) = 8. P(new) = 15828/14307668, P(companies) = 4675/14307668. Sample mean: $\bar{x}$ = P(new companies) = 8/14307668 = $5.591 \times 10^{-7}$. Under H0: $\mu$ = P(new)P(companies) = $3.615 \times 10^{-7}$. Variance (assuming a Bernoulli trial): $s^2 = p(1-p) \approx \bar{x}$, since p is small. This gives t ≈ 0.99993 < 2.576 (the critical value at significance level 0.005), so we cannot reject H0.
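
The same computation as a Python sketch (illustrative; it uses the counts above and the Bernoulli approximation $s^2 \approx \bar{x}$ from the slide):

```python
from math import sqrt

N = 14_307_668                        # corpus size in tokens
c_new, c_companies, c_bigram = 15_828, 4_675, 8

x_bar = c_bigram / N                  # sample mean: observed P(new companies)
mu = (c_new / N) * (c_companies / N)  # mean under H0 (independence)
s2 = x_bar * (1 - x_bar)              # Bernoulli variance; ~x_bar for small p

t = (x_bar - mu) / sqrt(s2 / N)
print(round(t, 5))                    # ~0.99993 < 2.576 -> cannot reject H0
```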

17 t-test (5-3) The t-test can rank candidate bigrams that all have the same raw frequency, which a frequency-based method cannot do.

18 t-test (5-4) The t-test can also find words whose co-occurrence patterns best distinguish between two near-synonymous words (e.g. strong vs. powerful), which is useful in lexicography (Church et al. 1989).

19 t-test (5-5)

20 Pearson's chi-square test (4-1) The t-test assumes that probabilities are approximately normally distributed; the χ² test does not assume normality. It compares the observed frequencies with the frequencies expected under independence: $X^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$. If the difference is large, reject H0.

21 Pearson's chi-square test (4-2) For new companies, $X^2 \approx 1.55$, below the critical value 3.841 (α = 0.05, one degree of freedom). Accept H0: new and companies occur independently!
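
A Python sketch of this χ² computation (illustrative, not from the slides): build the 2×2 contingency table from the counts used earlier and apply the standard 2×2 shortcut formula, which avoids computing expected counts explicitly.

```python
N = 14_307_668
c1, c2, c12 = 15_828, 4_675, 8   # counts of new, companies, new companies

o11 = c12                        # "new companies"
o12 = c2 - c12                   # "companies" not preceded by "new"
o21 = c1 - c12                   # "new" not followed by "companies"
o22 = N - c1 - c2 + c12          # all remaining bigrams

# 2x2 shortcut form of Pearson's chi-square statistic.
chi2 = (N * (o11 * o22 - o12 * o21) ** 2
        / ((o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)))
print(round(chi2, 2))            # ~1.55 < 3.841 -> accept H0
```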

22 Pearson's chi-square test (4-3) Identification of translation pairs in aligned corpora (Church et al. 1991): 59 is the number of aligned sentence pairs that contain cow in English and vache in French. The χ² value is very large, so we reject H0: (cow, vache) is a translation pair.

23 Pearson's chi-square test (4-4) A metric for corpus similarity (Kilgarriff and Rose 1998). H0: the two corpora are drawn from the same source.

24 Likelihood ratios (3-1) More appropriate for sparse data. Two alternative explanations for the occurrence frequency of a bigram w1 w2 (Dunning 1993): H1: $P(w_2|w_1) = p = P(w_2|\neg w_1)$ (independence); H2: $P(w_2|w_1) = p_1 \neq p_2 = P(w_2|\neg w_1)$ (dependence). $\log \lambda = \log\left(\frac{L(H_1)}{L(H_2)}\right)$, where L(H) is the likelihood of the observed counts under H.

25 Likelihood ratios (3-2) Let c1, c2, c12 be the occurrence counts of w1, w2, and w1 w2, and put $p = c_2/N$, $p_1 = c_{12}/c_1$, $p_2 = (c_2 - c_{12})/(N - c_1)$. Assuming a binomial distribution $b(k; n, x) = \binom{n}{k}x^k(1-x)^{n-k}$, the likelihoods are $L(H_1) = b(c_{12}; c_1, p)\,b(c_2 - c_{12}; N - c_1, p)$ and $L(H_2) = b(c_{12}; c_1, p_1)\,b(c_2 - c_{12}; N - c_1, p_2)$.
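
A Python sketch of the ratio under these definitions (illustrative; since the binomial coefficients are identical in numerator and denominator, they cancel and only the $x^k(1-x)^{n-k}$ part is kept):

```python
from math import log

def log_l(k, n, x):
    """log of x**k * (1-x)**(n-k), guarding the log(0) edge cases."""
    out = 0.0
    if k > 0:
        out += k * log(x)
    if n - k > 0:
        out += (n - k) * log(1 - x)
    return out

def minus_2_log_lambda(c1, c2, c12, N):
    p = c2 / N
    p1 = c12 / c1
    p2 = (c2 - c12) / (N - c1)
    log_h1 = log_l(c12, c1, p) + log_l(c2 - c12, N - c1, p)
    log_h2 = log_l(c12, c1, p1) + log_l(c2 - c12, N - c1, p2)
    return -2 * (log_h1 - log_h2)   # asymptotically chi-square distributed

print(minus_2_log_lambda(15_828, 4_675, 8, 14_307_668))  # new companies
```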

26 Likelihood ratios (3-3) If λ is a likelihood ratio of a particular form, then $-2\log\lambda$ is asymptotically $\chi^2$ distributed (Mood et al. 1974). The likelihood ratio test is more appropriate for sparse data than the χ² test.

27 Mutual Information (7-1) Pointwise mutual information (Church et al. 1991; Church and Hanks 1989): $I(x', y') = \log_2 \frac{P(x'y')}{P(x')P(y')}$, the information you gain about x' when knowing y'.

28 Mutual Information (7-2) The amount of information about the occurrence of Ayatollah at position i in the corpus increases by 18.38 bits if we are told that Ruhollah occurs at position i+1.
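
A Python sketch of the calculation (illustrative; the counts are assumed from the textbook example this slide is based on: Ayatollah occurs 42 times, Ruhollah 20 times, and the bigram Ayatollah Ruhollah 20 times in the 14,307,668-token corpus):

```python
from math import log2

def pmi(c1, c2, c12, N):
    """Pointwise mutual information: log2( P(x,y) / (P(x) * P(y)) )."""
    return log2((c12 / N) / ((c1 / N) * (c2 / N)))

# Assumed counts: c(Ayatollah) = 42, c(Ruhollah) = 20, c(bigram) = 20.
print(round(pmi(42, 20, 20, 14_307_668), 2))   # ~18.38 bits
```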

29 Mutual Information (7-3) English: house of commons; French: chambre de communes. Problem 1: information gain ≠ direct dependence.

30 Mutual Information (7-4) χ² takes the whole contingency table into account; MI considers only the single co-occurrence cell for (house, communes).

31 Mutual Information (7-5) Problem 2: data sparseness.

32 Mutual Information (7-6) For perfect dependence, $I(x,y) = \log_2 \frac{P(xy)}{P(x)P(y)} = \log_2 \frac{P(x)}{P(x)P(y)} = \log_2 \frac{1}{P(y)}$: the score grows as the individual words get rarer. For perfect independence, $I(x,y) = \log_2 1 = 0$. MI is therefore not a good measure of dependence, since the score depends on the frequency of the individual words.

33 Mutual Information (7-7) Pointwise MI, e.g. MI(new, companies): the reduction in uncertainty in predicting companies when we know that the previous word is new. On small samples it is not a good measure (unreliable when counts are low), but MI ≈ 0 is a good indication of independence. Mutual information proper, MI(w_{i-1}, w_i): how much information (entropy) is gained in moving from a unigram model P(w) to a bigram model P(w_i | w_{i-1}); it is estimated over a large sample.

34 Outline Introduction; Approaches to find collocations: Frequency, Mean and Variance, Hypothesis testing, Mutual information; Applications

35 Applications Computational lexicography. Information retrieval: retrieval accuracy can be improved if the similarity between a user query and a document is determined on the basis of common collocations instead of common words (Fagan 1989). Natural language generation (Smadja 1993). Cross-language information retrieval (Hull and Grefenstette 1998).

36 Collocations and Word Sense Disambiguation Association or co-occurrence: doctor and nurse; plane and airport. Both are important for word sense disambiguation. Collocations provide local context (one sense per collocation): drop me a line (a letter) vs. ... on the line ... (a phone line). Co-occurrences provide topical or global context: subject-based disambiguation.

37 References
Choueka, Yaacov. 1988. Looking for needles in a haystack or locating interesting collocational expressions in large textual databases. In Proceedings of the RIAO, pp. 43–38.
Justeson, John S., and Slava M. Katz. 1995. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1:9–27.
Church, Kenneth Ward, and Patrick Hanks. 1989. Word association norms, mutual information and lexicography. In ACL 27, pp. 76–83.
Church, Kenneth, William Gale, Patrick Hanks, and Donald Hindle. 1991. Using statistics in lexical analysis. In Uri Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, pp. 115–164. Hillsdale, NJ: Lawrence Erlbaum.
Kilgarriff, Adam, and Tony Rose. 1998. Metrics for corpus similarity and homogeneity. Manuscript, ITRI, University of Brighton.
Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19:61–74.
Mood, Alexander M., Franklin A. Graybill, and Duane C. Boes. 1974. Introduction to the Theory of Statistics. 3rd edition. New York: McGraw-Hill.
Fagan, Joel L. 1989. The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science 40:115–132.
Smadja, Frank. 1993. Retrieving collocations from text: Xtract. Computational Linguistics 19:143–177.
Hull, David A., and Gregory Grefenstette. 1998. Querying across languages: A dictionary-based approach to multilingual information retrieval. In Karen Sparck Jones and Peter Willett (eds.), Readings in Information Retrieval. San Francisco: Morgan Kaufmann.

38 Thanks!

