1 Statistical NLP: Lecture 7 Collocations. 2 Introduction 4 Collocations are characterized by limited compositionality. 4 Large overlap between the concepts.

Slides:



Advertisements
Similar presentations
Tests of Hypotheses Based on a Single Sample
Advertisements

Chapter 7 Hypothesis Testing
Statistics Review – Part II Topics: – Hypothesis Testing – Paired Tests – Tests of variability 1.
Chi square.  Non-parametric test that’s useful when your sample violates the assumptions about normality required by other tests ◦ All other tests we’ve.
CHAPTER 21 Inferential Statistical Analysis. Understanding probability The idea of probability is central to inferential statistics. It means the chance.
Natural Language Processing COLLOCATIONS Updated 16/11/2005.
Outline What is a collocation?
Statistics II: An Overview of Statistics. Outline for Statistics II Lecture: SPSS Syntax – Some examples. Normal Distribution Curve. Sampling Distribution.
Fall 2001 EE669: Natural Language Processing 1 Lecture 5: Collocations (Chapter 5 of Manning and Schutze) Wen-Hsiang Lu ( 盧文祥 ) Department of Computer.
Bivariate Statistics GTECH 201 Lecture 17. Overview of Today’s Topic Two-Sample Difference of Means Test Matched Pairs (Dependent Sample) Tests Chi-Square.
Hypothesis Testing. G/RG/R Null Hypothesis: The means of the populations from which the samples were drawn are the same. The samples come from the same.
BCOR 1020 Business Statistics Lecture 21 – April 8, 2008.
Chapter 2 Simple Comparative Experiments
Copyright © 2010 Pearson Education, Inc. Publishing as Prentice Hall Statistics for Business and Economics 7 th Edition Chapter 9 Hypothesis Testing: Single.
Inferences About Process Quality
Statistical Comparison of Two Learning Algorithms Presented by: Payam Refaeilzadeh.
Today Concepts underlying inferential statistics
Collocations 09/23/2004 Reading: Chap 5, Manning & Schutze (note: this chapter is available online from the book’s page
Chapter 14 Inferential Data Analysis
Copyright (c) 2004 Brooks/Cole, a division of Thomson Learning, Inc. Chapter 8 Tests of Hypotheses Based on a Single Sample.
Chapter 12 Inferential Statistics Gay, Mills, and Airasian
Hypothesis Testing and T-Tests. Hypothesis Tests Related to Differences Copyright © 2009 Pearson Education, Inc. Chapter Tests of Differences One.
Outline What is a collocation? Automatic approaches 1: frequency-based methods Automatic approaches 2: ruling out the null hypothesis, t-test Automatic.
Chapter 9 Title and Outline 1 9 Tests of Hypotheses for a Single Sample 9-1 Hypothesis Testing Statistical Hypotheses Tests of Statistical.
Choosing Statistical Procedures
AM Recitation 2/10/11.
Albert Gatt Corpora and Statistical Methods – Part 2.
Hypothesis Testing:.
Confidence Intervals and Hypothesis Testing - II
Statistical Natural Language Processing Diana Trandabăț
Natural Language Processing Spring 2007 V. “Juggy” Jagannathan.
Week 10 Chapter 10 - Hypothesis Testing III : The Analysis of Variance
1 COMP791A: Statistical Language Processing Collocations Chap. 5.
Statistics 11 Correlations Definitions: A correlation is measure of association between two quantitative variables with respect to a single individual.
1 Natural Language Processing (3b) Zhao Hai 赵海 Department of Computer Science and Engineering Shanghai Jiao Tong University
1 Natural Language Processing (5) Zhao Hai 赵海 Department of Computer Science and Engineering Shanghai Jiao Tong University
1 Statistical NLP: Lecture 7 Collocations (Ch 5).
FOUNDATIONS OF NURSING RESEARCH Sixth Edition CHAPTER Copyright ©2012 by Pearson Education, Inc. All rights reserved. Foundations of Nursing Research,
Biostatistics Class 6 Hypothesis Testing: One-Sample Inference 2/29/2000.
No criminal on the run The concept of test of significance FETP India.
Collocation 발표자 : 이도관. Contents 1.Introduction 2.Frequency 3.Mean & Variance 4.Hypothesis Testing 5.Mutual Information.
Educational Research Chapter 13 Inferential Statistics Gay, Mills, and Airasian 10 th Edition.
Learning Objectives Copyright © 2002 South-Western/Thomson Learning Statistical Testing of Differences CHAPTER fifteen.
Chapter 14 – 1 Chapter 14: Analysis of Variance Understanding Analysis of Variance The Structure of Hypothesis Testing with ANOVA Decomposition of SST.
Statistical Inference for the Mean Objectives: (Chapter 9, DeCoursey) -To understand the terms: Null Hypothesis, Rejection Region, and Type I and II errors.
N318b Winter 2002 Nursing Statistics Specific statistical tests Chi-square (  2 ) Lecture 7.
Collocations and Terminology Vasileios Hatzivassiloglou University of Texas at Dallas.
SOCW 671 #2 Overview of SPSS Steps in Designing Research Hypotheses Research Questions.
Chapter Eight: Using Statistics to Answer Questions.
COLLOCATIONS He Zhongjun Outline Introduction Approaches to find collocations Frequency Mean and Variance Hypothesis test Mutual information.
© Copyright McGraw-Hill 2004
Chi-Square Analyses.
Outline of Today’s Discussion 1.The Chi-Square Test of Independence 2.The Chi-Square Test of Goodness of Fit.
Example The strength of concrete depends, to some extent on the method used for drying it. Two different drying methods were tested independently on specimens.
Chapter 7: Hypothesis Testing. Learning Objectives Describe the process of hypothesis testing Correctly state hypotheses Distinguish between one-tailed.
Educational Research Inferential Statistics Chapter th Chapter 12- 8th Gay and Airasian.
COGS Bilge Say1 Using Corpora for Language Research COGS 523-Lecture 8 Collocations.
CHI SQUARE DISTRIBUTION. The Chi-Square (  2 ) Distribution The chi-square distribution is the probability distribution of the sum of several independent,
Collocations David Guy Brizan Speech and Language Processing Seminar 26 th October, 2006.
15 Inferential Statistics.
Statistical NLP: Lecture 7
Research Methodology Lecture No :25 (Hypothesis Testing – Difference in Groups)
Understanding Results
Chapter 2 Simple Comparative Experiments
Many slides from Rada Mihalcea (Michigan), Paul Tarau (U.North Texas)
9 Tests of Hypotheses for a Single Sample CHAPTER OUTLINE
Discrete Event Simulation - 4
Introduction: Statistics meets corpus linguistics
What are their purposes? What kinds?
Chapter 9 Hypothesis Testing: Single Population
Presentation transcript:

1 Statistical NLP: Lecture 7 Collocations

2 Introduction 4 Collocations are characterized by limited compositionality. 4 Large overlap between the concepts of collocations and terms, technical term and terminological phrase. 4 Collocations sometimes reflect interesting attitudes (in English) towards different types of substances: strong cigarettes, tea, coffee versus powerful drug (e.g., heroin)

3 Definition (w.r.t Computational and Statistical Literature) 4 [A collocation is defined as] a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. [Chouekra, 1988]

4 Other Definitions/Notions (w.r.t. Linguistic Literature) I 4 Collocations are not necessarily adjacent 4 Typical criteria for collocations: non- compositionality, non-substitutability, non- modifiability. 4 Collocations cannot be translated into other languages. 4 Generalization to weaker cases (strong association of words, but not necessarily fixed occurrence.

5 Linguistic Subclasses of Collocations 4 Light verbs: verbs with little semantic content 4 Verb particle constructions or Phrasal Verbs 4 Proper Nouns/Names 4 Terminological Expressions

6 Overview of the Collocation Detecting Techniques Surveyed 4 Selection of Collocations by Frequency 4 Selection of Collocation based on Mean and Variance of the distance between focal word and collocating word. 4 Hypothesis Testing 4 Mutual Information

7 Frequency (Justeson & Katz, 1995) 1. Selecting the most frequently occurring bigrams 2. Passing the results through a part-of- speech filter þ Simple method that works very well.

8 Mean and Variance (Smadja et al., 1993) 4 Frequency-based search works well for fixed phrases. However, many collocations consist of two words in more flexible relationships. 4 The method computes the mean and variance of the offset (signed distance) between the two words in the corpus. 4 If the offsets are randomly distributed (i.e., no collocation), then the variance/sample deviation will be high.

9 Hypothesis Testing I: Overview 4 High frequency and low variance can be accidental. We want to determine whether the co- occurrence is random or whether it occurs more often than chance. 4 This is a classical problem in Statistics called Hypothesis Testing. 4 We formulate a null hypothesis H 0 (no association beyond chance) and calculate the probability that a collocation would occur if H 0 were true, and then reject H 0 if p is too low and retain H 0 as possible, otherwise.

10 Hypothesis Testing II: The t test 4 The t test looks at the mean and variance of a sample of measurements, where the null hypothesis is that the sample is drawn from a distribution with mean . 4 The test looks at the difference between the observed and expected means, scaled by the variance of the data, and tells us how likely one is to get a sample of that mean and variance assuming that the sample is drawn from a normal distribution with mean . 4 To apply the t test to collocations, we think of the text corpus as a long sequence of N bigrams.

11 Hypothesis Testing II: Hypothesis testing of differences (Church & Hanks, We may also want to find words whose co- occurrence patterns best distinguish between two words. This application can be useful for Lexicography. 4 The t test is extended to the comparison of the means of two normal populations. 4 Here, the null hypothesis is that the average difference is 0.

12 Pearson’s Chi-Square test I: Method 4 Use of the t test has been criticized because it assumes that probabilities are approximately normally distributed (not true, generally). 4 The Chi-Square test does not make this assumption. 4 The essence of the test is to compare observed frequencies with frequencies expected for independence. If the difference between observed and expected frequencies is large, then we can reject the null hypothesis of independence.

13 Pearson’s Chi-Square test II: Applications 4 One of the early uses of the Chi square test in Statistical NLP was the identification of translation pairs in aligned corpora (Church & Gale, 1991). 4 A more recent application is to use Chi square as a metric for corpus similarity (Kilgariff and Rose, 1998) 4 Nevertheless, the Chi-Square test should not be used in small corpora.

14 Likelihood Ratios I: Within a single corpus (Dunning, 1993) 4 Likelihood ratios are more appropriate for sparse data than the Chi-Square test. In addition, they are easier to interpret than the Chi-Square statistic. 4 In applying the likelihood ratio test to collocation discovery, we examine the following two alternative explanations for the occurrence frequency of a bigram w1 w2: –The occurrence of w2 is independent of the previous occurrence of w1 –The occurrence of w2 is dependent of the previous occurrence of w1

15 Likelihood Ratios II: Between two or more corpora (Damerau, 1993) 4 Ratios of relative frequencies between two or more different corpora can be used to discover collocations that are characteristic of a corpus when compared to other corpora. 4 This approach is most useful for the discovery of subject-specific collocations.

16 Mutual Information 4 An Information-Theoretic measure for discovering collocations is pointwise mutual information (Church et al., 89, 91) 4 Pointwise Mutual Information is roughly a measure of how much one word tells us about the other. 4 Pointwise mutual information works particularly badly in sparse environments.