COLLOCATIONS
He Zhongjun
2007-04-13

Outline: Introduction; Approaches to finding collocations (Frequency, Mean and Variance, Hypothesis testing, Mutual information); Applications

What are collocations?
A collocation is an expression consisting of two or more words that corresponds to some conventional way of saying things. (Manning & Schütze)
A collocation is a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. (Choueka, 1988)

Examples
Noun phrases: strong tea vs. powerful tea
Verbs: make a decision vs. take a decision; knock ... door vs. hit ... door; make up
Idioms: kick the bucket ('to die')
Subtle, unexplainable native-speaker usage: broad daylight vs. bright daylight; Chinese time expressions 昨天 'yesterday', 去年 'last year', 上个月 'last month', ...

Introduction: Characteristics/Criteria
Non-compositionality, e.g. kick the bucket; white wine, white hair, white woman
Non-substitutability, e.g. white wine -> yellow wine?
Non-modifiability, e.g. as poor as a church mouse -> as poor as church mice?
Collocations cannot be translated word by word.

Outline Introduction Approaches to find collocations Frequency Mean and Variance Hypothesis test Mutual information Applications

Frequency (2-1)
The simplest approach: count bigram frequencies in the corpus and take the most frequent pairs as collocation candidates.
Not effective on its own: most of the most frequent pairs are pairs of function words (of the, in the, ...).
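To make the weakness concrete, here is a minimal Python sketch of the pure counting approach; the `tokens` list is a hypothetical stand-in for a real tokenized corpus:

```python
from collections import Counter

tokens = "the new companies said that the new deal is good".split()

# Count all adjacent word pairs (bigrams).
bigram_counts = Counter(zip(tokens, tokens[1:]))

# On real corpora the top of this list is dominated by function-word
# pairs such as (of, the) and (in, the), not by collocations.
for pair, count in bigram_counts.most_common(5):
    print(pair, count)
```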

Frequency (2-2)
Filter candidate bigrams by part-of-speech patterns such as adjective-noun or noun-noun (Justeson and Katz 1995), or use a stop list of function words.
A simple quantitative technique plus simple linguistic knowledge works surprisingly well, as the sketch below suggests.
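A sketch of the part-of-speech filter, assuming the corpus has already been tagged; the tiny tag set and the `GOOD_PATTERNS` set here are simplified stand-ins for the Justeson-and-Katz patterns:

```python
# Keep only bigrams whose tag sequence matches an allowed pattern.
GOOD_PATTERNS = {("A", "N"), ("N", "N")}

# Hypothetical tagger output: (word, tag) pairs.
tagged = [("strong", "A"), ("tea", "N"), ("of", "P"), ("the", "DET"),
          ("stock", "N"), ("market", "N")]

candidates = [
    (w1, w2)
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
    if (t1, t2) in GOOD_PATTERNS
]
print(candidates)  # [('strong', 'tea'), ('stock', 'market')]
```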

Mean and Variance (4-1)
From fixed bigrams to bigrams at a distance:
she knocked on his door
they knocked at the door
100 women knocked on Donaldson's door
a man knocked on the metal front door
Mean offset: (3 + 3 + 5 + 5) / 4 = 4.0; sample deviation ≈ 1.15.

Mean and Variance (4-2)
Mean: $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$
Sample variance: $s^2 = \frac{\sum_{i=1}^{n}(d_i - \bar{d})^2}{n - 1}$
Low variance means the two words usually occur at about the same distance.
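A minimal sketch of these two formulas in Python, applied to the four knocked ... door offsets from the previous slide:

```python
from math import sqrt

# Signed offsets at which door occurs relative to knocked in the four
# example sentences (3, 3, 5, 5 per the previous slide).
offsets = [3, 3, 5, 5]

n = len(offsets)
mean = sum(offsets) / n                                      # 4.0
dev = sqrt(sum((d - mean) ** 2 for d in offsets) / (n - 1))  # ~1.15

print(mean, round(dev, 2))
```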

Mean and Variance (4-3)
The mean offset indicates that strong usually occurs to the left, e.g. strong business support (strong two words before support).
strong and for do not form a collocation: their co-occurrence offsets are spread out, giving a high deviation.

Mean and Variance (4-4)
If the mean offset is close to 1.0 and the deviation is low, the method finds the same adjacent collocations as the frequency-based method.
Unlike the frequency-based method, it can also find looser phrases whose components occur at variable distances.

Hypothesis Testing
What if a high-frequency, low-variance pattern is accidental? E.g. new companies: new and companies are both frequent words, so they often co-occur by chance, yet new companies is not a collocation.
Hypothesis testing assesses whether or not an event is plausibly due to chance.
Null hypothesis H0: there is no association between the words beyond chance occurrences.
Compute the probability p of the observed data if H0 were true. If p is below the significance level (e.g. alpha = 0.05), reject H0; otherwise, retain H0.

t-test (5-1)
t statistic: $t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}$, where $\bar{x}$ is the sample mean, $\mu$ the distribution mean under H0, $s^2$ the sample variance, and $N$ the sample size.
Think of the corpus as a long sequence of N bigrams; the value is 1 where the bigram of interest occurs and 0 otherwise (a Bernoulli/binomial distribution).

t-test (5-2)
N(new) = 15,828, N(companies) = 4,675, N(new companies) = 8, in a corpus of N = 14,307,668 tokens.
P(new) = 15828/14307668, P(companies) = 4675/14307668
H0: P(new companies) = P(new) P(companies) ≈ 3.615 × 10^-7
Sample mean (assuming a Bernoulli trial): x̄ = 8/14307668 ≈ 5.591 × 10^-7; variance s² = x̄(1 - x̄) ≈ x̄
t ≈ 0.9999 < 2.576 (critical value at alpha = 0.005), so H0 cannot be rejected.
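The same computation as a small Python sketch; the corpus size of 14,307,668 tokens is the figure used in the Manning & Schütze example this deck follows:

```python
from math import sqrt

N = 14_307_668
c_new, c_companies, c_bigram = 15_828, 4_675, 8

mu = (c_new / N) * (c_companies / N)  # mean under H0 (independence)
x_bar = c_bigram / N                  # observed sample mean
s2 = x_bar * (1 - x_bar)              # Bernoulli variance, ~= x_bar

t = (x_bar - mu) / sqrt(s2 / N)
print(round(t, 4))  # ~0.9999, well below the 2.576 critical value
```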

t-test (5-3)
The t statistic can also rank candidate bigrams that have the same raw frequency, which a purely frequency-based method cannot do.

t-test (5-4)
The t-test can also be used to find words whose co-occurrence patterns best distinguish between two near-synonyms, e.g. strong vs. powerful, which is useful in lexicography (Church et al., 1989).

t-test (5-5)
(table of t scores for words co-occurring with strong vs. powerful; shown as a figure on the original slide)

Pearson's chi-square test (4-1)
The t-test assumes probabilities are approximately normally distributed; the $\chi^2$ test does not assume normality.
Compare the observed frequencies with the frequencies expected under independence. If the difference is large, reject H0.

Pearson's chi-square test (4-2)
For new companies, $X^2 \approx 1.55 < 3.841$ (critical value at alpha = 0.05, 1 degree of freedom): H0 is retained, new and companies occur independently!
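A sketch of the same test in Python, using the shortcut formula for 2x2 tables, $X^2 = \frac{N(O_{11}O_{22} - O_{12}O_{21})^2}{(O_{11}+O_{12})(O_{11}+O_{21})(O_{12}+O_{22})(O_{21}+O_{22})}$, and the new/companies counts from the t-test slides:

```python
def chi_square_2x2(o11, o12, o21, o22):
    n = o11 + o12 + o21 + o22
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

N, c_new, c_comp, c_both = 14_307_668, 15_828, 4_675, 8
x2 = chi_square_2x2(
    c_both,                       # w1 = new and w2 = companies
    c_new - c_both,               # new followed by a word other than companies
    c_comp - c_both,              # companies preceded by a word other than new
    N - c_new - c_comp + c_both,  # neither word
)
print(round(x2, 2))  # ~1.55 < 3.841 (alpha = 0.05, 1 df): H0 stands
```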

Pearson's chi-square test (4-3)
Identification of translation pairs in aligned corpora (Church et al., 1991): build a 2x2 table over aligned sentence pairs; e.g. 59 is the number of sentence pairs that contain cow in English and vache in French.
$X^2$ is very large, so H0 is rejected: (cow, vache) is a translation pair.

Pearson's chi-square test (4-4)
Metric for corpus similarity (Kilgarriff and Rose, 1998): H0 = the two corpora are drawn from the same source; compare word frequencies across the two corpora with a $\chi^2$ statistic.

Likelihood ratios (3-1)
More appropriate for sparse data than the chi-square test.
Two alternative explanations for the occurrence frequency of a bigram w1 w2 (Dunning 1993):
H1 (independence): P(w2|w1) = p = P(w2|¬w1)
H2 (dependence): P(w2|w1) = p1 ≠ p2 = P(w2|¬w1)
log λ = log( L(H1) / L(H2) ), where L(H) is the likelihood of observing the data under H.

Likelihood ratios (3-2)
Let c1, c2, c12 be the counts of w1, w2, and w1 w2. Assuming a binomial distribution $b(k; n, x) = \binom{n}{k} x^k (1-x)^{n-k}$, with maximum-likelihood estimates p = c2/N, p1 = c12/c1, p2 = (c2 - c12)/(N - c1):
L(H1) = b(c12; c1, p) b(c2 - c12; N - c1, p)
L(H2) = b(c12; c1, p1) b(c2 - c12; N - c1, p2)
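A sketch of the resulting computation. The binomial coefficients cancel when the two likelihoods are divided, so only the $x^k (1-x)^{n-k}$ factors are needed:

```python
from math import log

def log_l(k, n, x):
    # log of x^k * (1 - x)^(n - k); assumes 0 < x < 1
    return k * log(x) + (n - k) * log(1 - x)

def log_lambda(c1, c2, c12, N):
    p = c2 / N                   # H1: P(w2|w1) = P(w2|~w1) = p
    p1 = c12 / c1                # H2: P(w2|w1) = p1
    p2 = (c2 - c12) / (N - c1)   # H2: P(w2|~w1) = p2
    return (log_l(c12, c1, p) + log_l(c2 - c12, N - c1, p)
            - log_l(c12, c1, p1) - log_l(c2 - c12, N - c1, p2))

# -2 log lambda is asymptotically chi-square distributed (next slide),
# shown here for the new/companies counts used earlier.
print(-2 * log_lambda(c1=15_828, c2=4_675, c12=8, N=14_307_668))
```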

Likelihood ratios (3-3)
If λ is a likelihood ratio of a particular form, then -2 log λ is asymptotically $\chi^2$-distributed (Mood et al., 1974), so the ratio can be used in a hypothesis test.
The likelihood-ratio test is more appropriate for sparse data than the $\chi^2$ test.

Mutual Information (7-1)
Pointwise mutual information (Church et al. 1991; Church and Hanks 1989):
$I(x', y') = \log_2 \frac{P(x', y')}{P(x')P(y')}$
The information you gain about x' when you are told y'.
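A minimal sketch of pointwise MI computed from corpus counts, reusing the new/companies figures from the earlier slides:

```python
from math import log2

def pmi(c1, c2, c12, N):
    # log2( P(x,y) / (P(x) P(y)) ) with maximum-likelihood estimates
    return log2((c12 / N) / ((c1 / N) * (c2 / N)))

print(round(pmi(15_828, 4_675, 8, 14_307_668), 2))  # ~0.63 bits, near 0
```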

Mutual Information (7-2)
The amount of information about the occurrence of Ayatollah at position i in the corpus increases by 18.38 bits if we are told that Ruhollah occurs at position i+1.

Mutual Information (7-3)
English: house of commons; French: chambre des communes.
Problem 1: information gain ≠ direct dependence; a high MI score need not indicate a direct association between the two words.

Mutual Information (7-4)
$\chi^2$ takes into account more than just the joint co-occurrence cell of (house, communes), i.e. how often each word occurs without the other; MI considers only the co-occurrence count.

Mutual Information (7-5)
Problem 2: data sparseness. Bigrams of low-frequency words receive inflated MI scores, since a single chance co-occurrence of two rare words already looks highly dependent.

Mutual Information (7-6)
For perfect dependence (the words always occur together): $I(x,y) = \log_2 \frac{P(x,y)}{P(x)P(y)} = \log_2 \frac{P(x)}{P(x)P(y)} = \log_2 \frac{1}{P(y)}$, which grows as the words get rarer.
For perfect independence: $I(x,y) = \log_2 1 = 0$.
So MI is not a good measure of dependence, since the score depends on the frequency of the individual words.
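A quick numeric illustration with hypothetical counts: three word pairs that are all perfectly dependent (the words never occur apart) but differ in frequency receive very different PMI scores:

```python
from math import log2

N = 1_000_000
for c in (10_000, 100, 2):   # each word occurs c times, always together
    p = c / N
    # Perfect dependence: P(x,y) = P(x) = P(y) = p, so PMI = log2(1/p).
    print(c, round(log2(p / (p * p)), 2))  # 6.64, 13.29, 18.93 bits
```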

Mutual Information (7-7)
Pointwise MI, e.g. MI(new, companies): the reduction in uncertainty when predicting companies, given that the previous word is new. Computed from a small sample; not a good measure when counts are low. MI ≈ 0 is a good indication of independence.
(Average) mutual information MI(w_{i-1}, w_i): how much information (entropy reduction) a bigram model P(w_i | w_{i-1}) gains over a unigram model P(w); estimated over a large sample.

Outline: Introduction; Approaches to finding collocations (Frequency, Mean and Variance, Hypothesis testing, Mutual information); Applications

Applications
Computational lexicography.
Information retrieval: retrieval accuracy can be improved if the similarity between a user query and a document is computed over shared collocations instead of shared words (Fagan 1989).
Natural language generation (Smadja 1993).
Cross-language information retrieval (Hull and Grefenstette 1998).

Collocations and Word Sense Disambiguation
Association or co-occurrence: doctor and nurse; plane and airport.
Both collocations and co-occurrences are important for word sense disambiguation.
Collocation: local context ("one sense per collocation"), e.g. drop me a line (a letter) vs. .. on the line .. (a phone line).
Co-occurrence: topical or global context; subject-based disambiguation.

References
Choueka, Yaacov. 1988. Looking for needles in a haystack or locating interesting collocational expressions in large textual databases. In Proceedings of the RIAO, pp. 43-38.
Justeson, John S., and Slava M. Katz. 1995. Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1:9-27.
Church, Kenneth Ward, and Patrick Hanks. 1989. Word association norms, mutual information and lexicography. In ACL 27, pp. 76-83.
Church, Kenneth, William Gale, Patrick Hanks, and Donald Hindle. 1991. Using statistics in lexical analysis. In Uri Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, pp. 115-164. Hillsdale, NJ: Lawrence Erlbaum.
Kilgarriff, Adam, and Tony Rose. 1998. Metrics for corpus similarity and homogeneity. Manuscript, ITRI, University of Brighton.
Dunning, Ted. 1993. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19:61-74.
Mood, Alexander M., Franklin A. Graybill, and Duane C. Boes. 1974. Introduction to the theory of statistics. 3rd edition. New York: McGraw-Hill.
Fagan, Joel L. 1989. The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval. Journal of the American Society for Information Science 40:115-132.
Smadja, Frank. 1993. Retrieving collocations from text: Xtract. Computational Linguistics 19:143-177.
Hull, David A., and Gregory Grefenstette. 1998. Querying across languages: A dictionary-based approach to multilingual information retrieval. In Karen Sparck Jones and Peter Willett (eds.), Readings in Information Retrieval. San Francisco: Morgan Kaufmann.

Thanks!