Using Corpora for Language Research
COGS 523, Lecture 8: Collocations
Bilge Say
Related Readings
Manning and Schutze (1999). Foundations of Statistical Natural Language Processing. Chapter 5, Collocations.
Optional: Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics: An International Handbook, article 58. Mouton de Gruyter, Berlin. (Extended manuscript and related materials available from the author's web site.)
Collocations
A collocation is an expression consisting of two or more words that corresponds to some conventional way of saying things.
Collocations are characterized by limited compositionality: they are not fully compositional, in that there is usually an element of meaning added to the combination.
ex. strong tea
Idioms are the most extreme examples of non-compositionality; ex. kick the bucket.
Most collocations exhibit milder forms of non-compositionality; ex. international best practice.
Collocations are important for a number of applications: natural language generation, computational lexicography, parsing, and corpus linguistic research; also sociolinguistics.
ex. strong tea, not powerful tea
Manning and Schutze: example corpus for the following analyses
New York Times (August to November 1990): 115 MB of text, 14 million words
Approaches to finding collocations:
Frequency
Mean and variance
Hypothesis testing
Likelihood ratios
Pointwise mutual information
Frequency
If two words occur together a lot, that is evidence that they have a special function that is not simply explained as the function that results from their combination.
Heuristic: pass the candidate phrases through a part-of-speech filter (see the sketch below).
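A minimal sketch of this heuristic, assuming the corpus is already tokenized and tagged as (word, tag) pairs; the simplified tag set (A = adjective, N = noun), the pattern list, and the toy input are illustrative rather than corpus data.

```python
# Frequency heuristic with a part-of-speech filter: count adjacent word pairs
# and keep only the frequent ones whose tag pattern looks like a phrase.
from collections import Counter

ALLOWED_PATTERNS = {("A", "N"), ("N", "N")}   # adjective-noun and noun-noun bigrams

def candidate_bigrams(tagged_tokens, min_count=2):
    """Return frequent adjacent word pairs whose POS pattern is allowed."""
    counts = Counter(
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:])
        if (t1, t2) in ALLOWED_PATTERNS
    )
    return [(pair, c) for pair, c in counts.most_common() if c >= min_count]

# Toy input; a real run would use a tagged corpus.
tagged = [("strong", "A"), ("tea", "N"), ("and", "C"),
          ("strong", "A"), ("tea", "N"), ("again", "Adv")]
print(candidate_bigrams(tagged))   # [(('strong', 'tea'), 2)]
```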
Most frequent bigrams in the corpus (excerpt):
C(w1 w2)   w1 w2
…          of the
58841      in the
26430      to the
21842      on the
21839      for the
13899      in a
13689      of a
8753       has been

Part-of-speech patterns for the filter:
Tag pattern   Example
A N           linear function
N N           regression coefficients
A A N         Gaussian random variable
A N N         cumulative distribution function
N A N         mean squared error
N N N         class probability function
N P N         degrees of freedom

Most frequent bigrams after applying the part-of-speech filter (excerpt):
C(w1 w2)   w1 w2
…          New York
7261       United States
5412       Los Angeles
3301       last year
3191       Saudi Arabia
2699       last week
2514       vice president
2378       Persian Gulf

(Manning and Schutze, 1999)
Collocates of strong and powerful:
w            C(strong, w)      w            C(powerful, w)
support      50                force        13
safety       22                computers    10
sales        21                position     8
opposition   19                men          8
showing      18                computers    8
sense        18                man          7
message      15                symbol       6
defense      14                military     6
(Manning and Schutze, 1999)
Mean and Variance
The frequency-based approach works well for fixed phrases, but many collocations consist of two words that stand in a more flexible relationship to one another:
she knocked on his door; they knocked at the door; 100 women knocked on Donaldson's door; a man knocked on the metal front door
The mean is simply the average offset. For the example sentences, the mean offset between knocked and door is 4.0.
Variance measures how much the individual offsets deviate from the mean; the sample standard deviation is the square root of the sample variance. For the example, the standard deviation of the offsets between knocked and door is 1.15.
We can use this information to discover collocations by looking for pairs with low deviation. A low deviation means that the two words usually occur at about the same distance. Zero deviation means that the two words always occur at exactly the same distance.
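A small sketch of the computation, using offsets (3, 3, 5, 5) consistent with the mean and deviation quoted above for knocked and door.

```python
# Mean and sample standard deviation of the offsets at which "door" occurs
# relative to "knocked" within a collocation window.
from statistics import mean, stdev   # stdev uses the n-1 (sample) denominator

offsets = [3, 3, 5, 5]
print(f"mean offset = {mean(offsets):.2f}")        # 4.00
print(f"sample deviation = {stdev(offsets):.2f}")  # 1.15
```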
Sample deviation, sample mean, and count of the offsets for selected word pairs (numeric values not shown): New York, previous games, minus points, hundreds dollars, editorial Atlanta, ring New, point hundredth, subscribers by, strong support, powerful organizations, Richard Nixon, Garrison said (Manning and Schutze, 1999)
Hypothesis testing
High frequency and low variance can be accidental: if the two constituent words of a frequent bigram like new companies are themselves regularly occurring words (as new and companies are), then we expect the two words to co-occur a lot just by chance.
What we really want to know is whether two words occur together more often than chance. Assessing whether or not something is a chance event is one of the classical problems of statistics.
How can we apply the methodology of hypothesis testing to the problem of finding collocations?
We first formulate a null hypothesis which states what should be true if two words do not form a collocation, i.e. independence: P(w1 w2) = P(w1) P(w2)
The t test
We need a statistical test that tells us how probable or improbable it is that a certain constellation of words will occur. A test that has been widely used for collocation discovery is the t test:
t = (x̄ − μ) / √(s² / N)
where x̄ is the sample mean, s² the sample variance, N the sample size, and μ the mean of the distribution under the null hypothesis.
Example: new companies
P(new) = 15828 / N and P(companies) = 4675 / N, where N is the number of tokens in the corpus.
Null hypothesis: P(new companies) = P(new) × P(companies)
Observed: C(new companies) = 8, so x̄ = 8 / N; for a Bernoulli variable with such a small probability, s² = p(1 − p) ≈ x̄.
t = (x̄ − μ) / √(s² / N) ≈ (x̄ − P(new) P(companies)) / √(x̄ / N) ≈ 1
t ≈ 1 is not larger than 2.576, the critical value for α = 0.005. We cannot reject the null hypothesis that new and companies occur independently and do not form a collocation.
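A worked sketch of the t value above; the exact corpus size is an assumption taken from the textbook example (the slides round it to 14 million words), and s² is approximated by x̄ as for a low-probability Bernoulli variable.

```python
# t test for "new companies" using C(new) = 15828, C(companies) = 4675,
# C(new companies) = 8. The corpus size N is assumed from the textbook example.
from math import sqrt

N = 14_307_668                       # corpus size in tokens (assumption from M&S ch. 5)
mu = (15828 / N) * (4675 / N)        # expected bigram probability under H0
x_bar = 8 / N                        # observed relative frequency of the bigram
s2 = x_bar * (1 - x_bar)             # sample variance, close to x_bar for small p
t = (x_bar - mu) / sqrt(s2 / N)
print(f"t = {t:.3f}")                # about 1.0, well below the 2.576 critical value
```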
Bigrams ranked by the t test, from high t to low t (values of t, C(w1), C(w2), and C(w1, w2) not shown): Ayatollah Ruhollah, Bette Midler, Agatha Christie, videocassette recorder, unsalted butter, first made, over many, into them, like people, time last (Manning and Schutze, 1999)
It turns out that most bigrams attested in a corpus co-occur significantly more often than chance would predict: language is very regular, so very few completely unpredictable events happen. The t test and other statistical tests are therefore most useful as a method for ranking candidate collocations.
Hypothesis testing of differences
The t test can also be used for a slightly different collocation discovery problem: finding words whose co-occurrence patterns best distinguish between two words, ex. words that best differentiate the meanings of strong and powerful. (A sketch of the simplified difference statistic follows the table below.)
Words whose co-occurrence patterns best distinguish powerful and strong, ranked by the difference t statistic (values of t, C(w), C(strong w), and C(powerful w) not shown): computers, computer, symbol, machines, Germany (more associated with powerful); support, enough, safety, sales, opposition (more associated with strong) (Manning and Schutze, 1999)
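A minimal sketch of the difference test: for two competing words v1 and v2, the t statistic reduces approximately to (C(v1, w) − C(v2, w)) / √(C(v1, w) + C(v2, w)) when the co-occurrence probabilities are small; the counts in the example call are illustrative, not the corpus figures behind the table.

```python
# Approximate t statistic for "w co-occurs more often with v1 than with v2".
from math import sqrt

def t_difference(count_v1_w, count_v2_w):
    """Difference test using the small-probability approximation s^2 ~ x_bar."""
    return (count_v1_w - count_v2_w) / sqrt(count_v1_w + count_v2_w)

# e.g. a word seen 50 times after v1 = "strong" and 13 times after v2 = "powerful"
print(f"{t_difference(50, 13):.2f}")   # positive: the word prefers v1
```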
Pearson's chi-square test
The t test assumes that probabilities are approximately normally distributed, which is not true in general. The essence of the X² test is to compare the observed frequencies in a contingency table with the frequencies expected under independence. If the difference between observed and expected frequencies is large, we can reject the null hypothesis of independence.
X² = Σᵢ,ⱼ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ

Contingency table for new companies:
                 w1 = new                      w1 ≠ new
w2 = companies   8 (new companies)             4667 (e.g. old companies)
w2 ≠ companies   15820 (e.g. new machines)     ≈14,287,000 (e.g. old machines)

Expected frequency for the top-left cell under independence: E₁₁ = ((8 + 4667) / N) × ((8 + 15820) / N) × N ≈ 5.2
X² ≈ 1.55, which is not larger than 3.841, the critical value for α = 0.05. We cannot reject the null hypothesis that new and companies occur independently and do not form a collocation. (Manning and Schutze, 1999)
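A hedged sketch of the X² computation on the table above; the bottom-right cell is the remainder of a corpus of roughly 14.3 million tokens and is an assumption, while the other cells follow from the counts given earlier.

```python
# Pearson's X^2: expected counts come from the row and column totals, then each
# cell contributes (O - E)^2 / E.

def chi_square(observed):
    """Pearson's X^2 for a 2D contingency table given as a list of rows."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    x2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count under independence
            x2 += (o - e) ** 2 / e
    return x2

# rows: w2 = companies / w2 != companies; columns: w1 = new / w1 != new
# (the last cell is an assumed remainder of the ~14.3M-token corpus)
table = [[8, 4667], [15820, 14_287_173]]
print(f"X^2 = {chi_square(table):.2f}")   # about 1.55, below the 3.84 critical value
```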
Likelihood ratios
Likelihood ratios are more appropriate for sparse data than the X² test, and a likelihood ratio is more interpretable than an X² value.
We consider two alternative explanations for the occurrence frequency of a bigram w1 w2:
Hypothesis 1: P(w2 | w1) = p = P(w2 | ¬w1)
Hypothesis 2: P(w2 | w1) = p1 ≠ p2 = P(w2 | ¬w1)
Hypothesis 1 is a formalization of independence; Hypothesis 2 is a formalization of dependence, which is good evidence for an interesting collocation.
Bigrams involving powerful ranked by −2 log λ (values of −2 log λ, C(w1), C(w2), and C(w1, w2) not shown): most powerful, politically powerful, powerful computers, powerful force, powerful symbol, powerful lobbies, economically powerful, powerful magnet, powerful cudgels (Manning and Schutze, 1999)
One advantage of likelihood ratios is that they have a clear intuitive interpretation. For example, the bigram powerful computers is e^(0.5 × 82.96) ≈ 1.3 × 10^18 times more likely under Hypothesis 2 (computers is more likely to follow powerful than its base rate of occurrence would suggest) than under Hypothesis 1 (independence).
If λ is a likelihood ratio of a particular form, then the quantity −2 log λ is asymptotically X² distributed, so we can use X² tables to test H1 against H2. E.g., the −2 log λ value for powerful cudgels is large enough to reject H1 for this bigram at a significance level of 0.005.
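A minimal sketch of the likelihood-ratio computation, following the binomial formulation of the two hypotheses above; the counts passed in the example call are placeholders, not corpus figures.

```python
# -2 log lambda for a bigram (w1, w2): under H1 a single probability p governs w2
# after both w1 and other words; under H2 there are two probabilities p1 and p2.
from math import log

def log_l(k, n, x):
    """Log binomial likelihood x^k (1 - x)^(n - k); the C(n, k) terms cancel in the ratio."""
    return k * log(x) + (n - k) * log(1 - x)

def minus_2_log_lambda(c1, c2, c12, n):
    """-2 log of L(H1)/L(H2); asymptotically X^2 distributed."""
    p = c2 / n                        # H1: P(w2 | w1) = P(w2 | not w1) = p
    p1 = c12 / c1                     # H2: P(w2 | w1) = p1
    p2 = (c2 - c12) / (n - c1)        # H2: P(w2 | not w1) = p2
    log_lambda = (log_l(c12, c1, p) + log_l(c2 - c12, n - c1, p)
                  - log_l(c12, c1, p1) - log_l(c2 - c12, n - c1, p2))
    return -2 * log_lambda

# Placeholder counts: C(w1) = 2000, C(w2) = 3500, C(w1 w2) = 10
print(f"{minus_2_log_lambda(c1=2000, c2=3500, c12=10, n=14_307_668):.2f}")
```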
Relative Frequency Ratios
Ratios of relative frequencies between two or more different corpora can be used to discover collocations that are characteristic of a corpus when compared to other corpora.
e.g. Karim Obeid occurs 68 times in the 1989 corpus but only 2 times in the 1990 corpus, so the relative frequency ratio is r = (2 / N1990) / (68 / N1989), where N1990 and N1989 are the sizes of the two corpora.
Relative frequency ratios are useful for finding subject-specific collocations; the proposed application is to compare a general text with a subject-specific text.
Relative frequency ratios for selected word pairs (ratio values not shown): Karim Obeid, East Berliners, Miss Manners, earthquake …, HUD officials, EAST GERMANS, Muslim cleric, John Le …, Prague Spring, Among individual (Manning and Schutze, 1999)
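A minimal sketch of the ratio computation; the corpus sizes used below are placeholders, not the actual New York Times figures.

```python
# Relative frequency ratio r: the relative frequency of a phrase in corpus A
# divided by its relative frequency in corpus B.

def relative_frequency_ratio(count_a, size_a, count_b, size_b):
    """Small r means the phrase is much more characteristic of corpus B than of A."""
    return (count_a / size_a) / (count_b / size_b)

# e.g. a phrase seen 2 times in the 1990 corpus and 68 times in the 1989 corpus
# (placeholder corpus sizes)
r = relative_frequency_ratio(2, 14_000_000, 68, 11_000_000)
print(f"r = {r:.4f}")
```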
Mutual Information
Symbol     Definition                              Current use                     Fano
I(x, y)    log ( p(x, y) / (p(x) p(y)) )           pointwise mutual information    mutual information
I(X; Y)    E [ log ( p(X, Y) / (p(X) p(Y)) ) ]     mutual information              average MI / expectation of MI
Pointwise mutual information for selected bigrams (values of I and the counts C(w1), C(w2), C(w1, w2) not shown): Schwartz eschews, fewest visits, FIND GARDEN, Indonesian pieces, Peds survived, marijuana growing, doubt whether, new converts, like offensive, must think (Manning and Schutze, 1999)
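A minimal sketch of pointwise mutual information as defined above, computed in base 2 so the score is in bits; the counts in the example call are placeholders.

```python
# Pointwise mutual information: I(w1, w2) = log2( P(w1 w2) / (P(w1) P(w2)) ).
from math import log2

def pmi(c1, c2, c12, n):
    """PMI (in bits) of the bigram w1 w2 in a corpus of n tokens."""
    return log2((c12 / n) / ((c1 / n) * (c2 / n)))

# e.g. two rare words that always occur together get a very high PMI score
print(f"{pmi(c1=20, c2=20, c12=20, n=14_307_668):.2f} bits")
```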
Next Week
Biber et al., chapter on Register and Discourse Variation.