Albert Gatt Corpora and Statistical Methods – Part 2

Preliminaries: Hypothesis testing and the binomial distribution

Permutations. Suppose we have the 5 words {the, dog, ate, a, bone}. How many permutations (possible orderings) are there of these words? the dog ate a bone, dog the ate a bone, … In general there are n! orderings of n distinct items, so there are 5! = 120 ways of permuting 5 words.

Binomial coefficient. Slight variation: how many different choices of three words are there out of these 5? This is known as an "n choose k" problem, in our case "5 choose 3". For our problem, this gives us 10 ways of choosing three items out of 5 (the short check below confirms this).
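A quick check of both counts in Python, using only the standard library (a minimal sketch, not part of the original slides):

```python
from math import factorial, comb

# Number of orderings (permutations) of 5 distinct words: 5! = 120
print(factorial(5))   # 120

# Number of ways to choose 3 words out of 5, ignoring order: "5 choose 3"
print(comb(5, 3))     # 10
```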

Bernoulli trials. A Bernoulli (or binomial) trial is like a coin flip. Features: 1. there are two possible outcomes (not necessarily with the same likelihood), e.g. success/failure or 1/0; 2. if the situation is repeated, the likelihoods of the two outcomes remain stable.

Sampling with/without replacement. Suppose we're interested in the probability of pulling out a function word from a corpus of 100 words, where we pull out words one by one without putting them back. Is this a Bernoulli trial? We have a notion of success/failure: w is either a function word ("success") or not ("failure"). But our chances aren't the same across trials: they change as we go, since we sample without replacement.

Cutting corners. If the sample (e.g. the corpus) is large enough, then we can assume a Bernoulli situation even if we sample without replacement. Suppose our corpus has 52 million words, success = pulling out a function word, and there are 13 million function words. First trial: p(success) = 13,000,000/52,000,000 = .25. Second trial (after one success): p(success) = 12,999,999/51,999,999 ≈ .25. On very large samples, the chances remain virtually stable even without replacement.

Binomial probabilities - I. Let π represent the probability of success on a Bernoulli trial (e.g. our simple word game on a large corpus). Then p(failure) = 1 − π. Problem: what are the chances of achieving success 3 times out of 5 trials? Assumption: each trial is independent of every other. (Is this assumption reasonable?)

Binomial probabilities - II. How many ways are there of getting success three times out of 5? Several: SSSFF, SFSFS, SFSSF, … To count the number of possible ways of getting k successes out of n trials, we use the binomial coefficient:
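The formula itself did not survive in the transcript; the standard binomial coefficient is:

\binom{n}{k} = \frac{n!}{k!\,(n-k)!}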

Binomial probabilities - III. "5 choose 3" gives 10. Given independence, each of these sequences is equally likely. What's the probability of a sequence? It's an AND problem (multiplication rule): P(SSSFF) = π·π·π·(1−π)·(1−π) = π^3 (1−π)^2; P(SFSFS) = π·(1−π)·π·(1−π)·π = π^3 (1−π)^2 (they all come out the same).

Binomial probabilities - IV. The binomial distribution states that, given n Bernoulli trials with probability π of success on each trial, the probability of getting exactly k successes is:

P(X = k) = \binom{n}{k} \pi^k (1-\pi)^{n-k}

where the binomial coefficient is the number of different ways of getting k successes, π^k (1−π)^(n−k) is the probability of each such sequence, and the whole expression is the probability of k successes out of n.
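A small Python sketch of this computation (an illustration, not from the slides), using the running example of 3 successes in 5 trials with the earlier π = 0.25:

```python
from math import comb

def binomial_pmf(k, n, pi):
    """Probability of exactly k successes in n Bernoulli trials with success probability pi."""
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

# e.g. 3 function words ("successes") in 5 draws, with pi = 0.25
print(binomial_pmf(3, 5, 0.25))  # approximately 0.0879
```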

Expected value and variance. Expected value of X over n trials: E(X) = nπ, where π is our probability of success. Variance of X over n trials: Var(X) = nπ(1 − π).

Using the t-test for collocation discovery

The logic of hypothesis testing. The typical scenario in hypothesis testing compares two hypotheses: 1. the research hypothesis; 2. a null hypothesis. The idea is to set up our experiment (study, etc.) in such a way that, if we show the null hypothesis to be false, then we can affirm our research hypothesis with a certain degree of confidence.

H0 for collocation studies. There is no real association between w1 and w2, i.e. the occurrence of the bigram <w1, w2> is no more likely than chance. More formally: H0: P(w1 & w2) = P(w1)P(w2), i.e. w1 and w2 are independent.

Some more on hypothesis testing. Our research hypothesis (H1): w1 and w2 are strong collocates, i.e. P(w1 & w2) > P(w1)P(w2). The null hypothesis H0: P(w1 & w2) = P(w1)P(w2). How do we know whether our results are sufficient to affirm H1? I.e. how big is our risk of wrongly rejecting H0?

The notion of significance. We generally fix a "level of confidence" in advance. In many disciplines, we're happy with being 95% confident that the result we obtain is correct, so we accept a 5% chance of error. Therefore, we state our results at p = 0.05: "the probability of wrongly rejecting H0 is 5% (0.05)".

Tests for significance. Many of the tests we use involve: 1. having a prior notion of what the mean/variance of a population is, according to H0; 2. computing the mean/variance on our sample of the population; 3. checking whether the sample mean/variance differs from the value predicted by H0, at 95% confidence.

The t-test: strategy. Obtain the mean (x̄) and variance (s²) for a sample. H0: the sample is drawn from a population with mean μ and variance σ². Estimate the t value: this compares the sample mean/variance to the expected (population) mean/variance under H0. Check if any difference found is significant enough to reject H0.

Computing t. Calculate the difference between the sample mean and the expected population mean, and scale the difference by the variance. Assumption: the population is normally distributed. If t is big enough, we reject H0. The magnitude of t required for our sample size N is simply looked up in a table. Tables tell us what the level of significance is (the p-value, i.e. the likelihood of making a Type I error, wrongly rejecting H0).
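The t statistic itself is not reproduced in the transcript; the standard one-sample form it describes is:

t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}

where x̄ and s² are the sample mean and variance, μ is the mean expected under H0, and N is the sample size.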

Example: new companies. We think of our corpus as a series of bigrams, and each sample we take is an indicator variable (Bernoulli trial): value = 1 if a bigram is new companies, value = 0 otherwise. Compute P(new) and P(companies) using standard MLE. H0: P(new companies) = P(new)P(companies).

Example continued. We have computed the likelihood of our bigram of interest under H0. Since this is a Bernoulli trial, this is also our expected mean. We then compute the actual sample probability of new companies, compute t, and check significance (a worked sketch follows below).
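A minimal Python sketch of this procedure. The corpus size and word counts below are assumptions for illustration only, not the figures from the original slides:

```python
from math import sqrt

# Assumed (illustrative) corpus counts
N = 14_000_000            # total bigrams in the corpus
c_new = 15_000            # occurrences of "new"
c_companies = 4_500       # occurrences of "companies"
c_bigram = 8              # occurrences of "new companies"

# MLE probabilities
p_new = c_new / N
p_companies = c_companies / N

# Under H0 (independence), the expected probability of the bigram
mu = p_new * p_companies

# Observed sample mean of the indicator variable (1 if the bigram is "new companies")
x_bar = c_bigram / N

# Variance of the Bernoulli indicator: x_bar * (1 - x_bar), roughly x_bar for small x_bar
s2 = x_bar * (1 - x_bar)

t = (x_bar - mu) / sqrt(s2 / N)
print(t)  # about 1.12 with these counts: below 1.645, so not significant at p = 0.05 (one-tailed)
```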

Uses of the t-test. Often used to rank candidate collocations, rather than to compute significance. Stop-word lists must be used, else nearly all bigrams come out significant: e.g. M&S report that 824 out of 831 bigrams pass the significance test. Reason: language is just not random; its regularities mean that if the corpus is large enough, all bigrams will occur together regularly and often enough to be significant. Kilgarriff (2005): any null hypothesis will be rejected on a large enough corpus.

Extending the t-test to compare samples. Variation on the original problem: which co-occurrence relations best distinguish between two words w1 and w1' that are near-synonyms, e.g. strong vs. powerful? Strategy: find all bigrams <w1, w2> and <w1', w2> (e.g. strong tea, strong support), and check, for each w2, whether it occurs significantly more often with w1 than with w1'. NB: this is a two-sample t-test.

Two-sample t-test: details. H0: for any w2, the probabilities of <w1, w2> and <w1', w2> are the same, i.e. μ (the expected difference) = 0. Strategy: extract samples of <w1, w2> and <w1', w2> and assume they are independent; compute the mean and SD for each sample; compute t; check for significance: is the magnitude of the difference large enough? The formula is given below.
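The two-sample t statistic, which the slide presumably displayed, has the standard form:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}

where x̄1, x̄2 are the two sample means, s1², s2² the sample variances, and n1, n2 the sample sizes.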

Simplifying under binomial assumptions. On large samples, the variance in the binomial distribution approaches the mean, i.e. s² = x̄(1 − x̄) ≈ x̄ when x̄ is small (and similarly for the other sample mean). Therefore, with both samples drawn from the same corpus of n bigrams, t ≈ (x̄1 − x̄2) / sqrt((x̄1 + x̄2)/n), which reduces to the raw co-occurrence counts: t ≈ (C(w1 w2) − C(w1' w2)) / sqrt(C(w1 w2) + C(w1' w2)).

Concrete example: strong vs. powerful (M&S, p. 167), NY Times corpus. [The slide shows a table of words occurring significantly more often with powerful than with strong, and words occurring significantly more often with strong than with powerful.]

Criticisms of the t-test. It assumes that the probabilities are normally distributed. This is probably not the case in linguistic data, where probabilities tend to be very large or very small. Alternative: the chi-squared test (χ²), which compares the differences between expected and observed frequencies (e.g. of bigrams).

The chi-square test

Example Imagine we’re interested in whether poor performance is a good collocation. H0: frequency of poor performance is no different from the expected frequency if each word occurs independently. Find frequencies of bigrams containing poor, performance and poor performance. compare actual to expected frequencies check if the value is high enough to reject H0

Example continued. OBSERVED FREQUENCIES:

                      w1 = poor                  w1 ≠ poor
w2 = performance      15 (poor performance)      1,230 (e.g. bad performance)
w2 ≠ performance      3,580 (e.g. poor people)   12,000 (all other bigrams)

Expected frequencies need to be computed for each cell; e.g. the expected value for cell (1,1), poor performance, is the row total times the column total, divided by the total number of bigrams.
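Filling in the counts from the table above (an arithmetic illustration, not part of the original transcript): the row total for w2 = performance is 15 + 1,230 = 1,245, the column total for w1 = poor is 15 + 3,580 = 3,595, and the grand total is 16,825, so

E_{11} = \frac{1245 \times 3595}{16825} \approx 266

compared with an observed count of 15.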

Computing the value. The chi-squared value is the sum of squared differences between observed and expected frequencies, scaled by the expected frequencies. The value is once again looked up in a table to check whether the degree of confidence (p-value) is acceptable. If so, we conclude that the dependency between w1 and w2 is significant.
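The formula, omitted in the transcript, is the usual Pearson statistic:

\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

A short Python sketch applying it to the 2x2 table above (a hedged illustration; library routines such as scipy.stats.chi2_contingency do the same job):

```python
# Observed 2x2 contingency table for (poor, performance), as in the table above
observed = [
    [15,    1230],    # w2 = performance:   w1 = poor, w1 != poor
    [3580, 12000],    # w2 != performance:  w1 = poor, w1 != poor
]

N = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / N
        chi2 += (observed[i][j] - expected) ** 2 / expected

print(chi2)  # compare against the critical value 3.84 (1 degree of freedom, p = 0.05)
```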

More applications of this statistic. Kilgarriff and Rose (1998) use chi-square as a measure of corpus similarity: draw up an n (rows) × 2 (columns) table, where the columns correspond to the two corpora and the rows to individual word types, and compare the difference in counts between the corpora. H0: the corpora are drawn from the same underlying linguistic population (e.g. register or variety). The corpora will be highly similar if the ratio of counts for each word is roughly constant. This uses lexical variation to compute corpus similarity (a sketch follows below).
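A minimal sketch of that idea; the word types and counts are invented for illustration, and the real method involves further choices (e.g. which types to include):

```python
from collections import Counter

def corpus_similarity_chi2(counts_a, counts_b, vocab):
    """Chi-square over an n x 2 table of word counts from two corpora.

    Larger values indicate greater dissimilarity between the corpora.
    """
    total_a = sum(counts_a[w] for w in vocab)
    total_b = sum(counts_b[w] for w in vocab)
    grand = total_a + total_b
    chi2 = 0.0
    for w in vocab:
        row_total = counts_a[w] + counts_b[w]
        for observed, col_total in ((counts_a[w], total_a), (counts_b[w], total_b)):
            expected = row_total * col_total / grand
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

# Invented toy counts
corpus_a = Counter({"the": 120, "of": 70, "said": 30, "molecule": 1})
corpus_b = Counter({"the": 110, "of": 65, "said": 5, "molecule": 40})
print(corpus_similarity_chi2(corpus_a, corpus_b, ["the", "of", "said", "molecule"]))
```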

Limitations of the t-test and chi-square. They are not easily interpretable: a large chi-square or t value suggests a large difference, but makes more sense as a comparative measure than in absolute terms. The t-test is problematic because of the normality assumption. Chi-square doesn't work very well for small frequencies (by convention, we don't calculate it if the expected value for any of the cells is less than 5), but n-grams will often be infrequent!

Likelihood ratios for collocation discovery

Rationale. A likelihood ratio is the ratio of two probabilities: it indicates how much more likely one hypothesis is compared to another. Notation: c1 = C(w1), c2 = C(w2), c12 = C(w1 w2). Hypotheses: H0: P(w2|w1) = p = P(w2|¬w1); H1: P(w2|w1) = p1, P(w2|¬w1) = p2, with p1 ≠ p2.

Computing the likelihood ratio. Under H0, P(w2|w1) = P(w2|¬w1) = p = c2/N; under H1, P(w2|w1) = p1 = c12/c1 and P(w2|¬w1) = p2 = (c2 − c12)/(N − c1). The probability that c12 of the c1 bigrams beginning with w1 are w1 w2 is the binomial b(c12; c1, p) under H0 (b(c12; c1, p1) under H1); the probability that c2 − c12 of the remaining N − c1 bigrams contain w2 is b(c2 − c12; N − c1, p) under H0 (b(c2 − c12; N − c1, p2) under H1).

The likelihood (odds) that a hypothesis H is correct is L(H).
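Under the binomial model on the previous slide, the two likelihoods (not reproduced in the transcript) work out as:

L(H_0) = b(c_{12}; c_1, p)\, b(c_2 - c_{12}; N - c_1, p)
L(H_1) = b(c_{12}; c_1, p_1)\, b(c_2 - c_{12}; N - c_1, p_2)

where b(k; n, x) is the binomial probability of k successes in n trials with success probability x.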

Computing the likelihood ratio. We usually compute the log of the ratio: log λ = log L(H0) − log L(H1). It is usually expressed as −2 log λ because, for very large samples, −2 log λ is roughly equivalent to a χ² value.

Interpreting the ratio. Suppose that the likelihood ratio for some bigram is x. This says: if we make the hypothesis that w2 is somehow dependent on w1, then we expect the bigram to occur x times more often than its actual base rate of occurrence would predict. The ratio is also better suited to sparse data: we can use −2 log λ as an approximate chi-square value even when expected frequencies are small (a small code sketch follows below).
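A compact Python sketch of the computation, following the binomial formulation above; the counts are invented for illustration:

```python
from math import log

def log_binom(k, n, p):
    """Log-probability term k*log(p) + (n-k)*log(1-p); the binomial coefficients cancel in the ratio."""
    eps = 1e-12                      # guard against log(0) when p is exactly 0 or 1
    p = min(max(p, eps), 1 - eps)
    return k * log(p) + (n - k) * log(1 - p)

def log_likelihood_ratio(c1, c2, c12, N):
    """-2 log lambda for the bigram <w1, w2>, given C(w1)=c1, C(w2)=c2, C(w1 w2)=c12, corpus size N."""
    p = c2 / N
    p1 = c12 / c1
    p2 = (c2 - c12) / (N - c1)
    log_L_H0 = log_binom(c12, c1, p) + log_binom(c2 - c12, N - c1, p)
    log_L_H1 = log_binom(c12, c1, p1) + log_binom(c2 - c12, N - c1, p2)
    return -2 * (log_L_H0 - log_L_H1)

# Invented counts: w1 occurs 2,000 times, w2 occurs 1,500 times, the bigram 40 times, in 1,000,000 bigrams
print(log_likelihood_ratio(2000, 1500, 40, 1_000_000))
```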

Concrete example: bigrams involving powerful (M&S, p. 174). Source: NY Times corpus (N = 14.3m). Note: sparse data can still have a high log-likelihood value! Interpreting −2 log λ as chi-squared allows us to reject H0 even for small samples (e.g. powerful cudgels).

Relative frequency ratios. An extension of the same logic: a likelihood ratio used to compare collocations across corpora. Let <w1, w2> be our bigram of interest, and let C1 and C2 be two corpora: p1 = P(<w1, w2>) in C1, p2 = P(<w1, w2>) in C2. Then r = p1/p2 gives an indication of the relative likelihood of <w1, w2> in C1 and C2.
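In code the ratio is straightforward; the counts and corpus sizes below are illustrative assumptions:

```python
# Relative frequency ratio of a bigram across two corpora
count_c1, size_c1 = 2, 10_000_000      # occurrences and corpus size (in bigrams) in C1
count_c2, size_c2 = 44, 9_000_000      # occurrences and corpus size in C2

p1 = count_c1 / size_c1
p2 = count_c2 / size_c2
r = p1 / p2
print(r)  # r << 1 means the bigram is far more characteristic of C2 than of C1
```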

Example application. Manning and Schütze (p. 176) compare C1 = NY Times texts from 1990 and C2 = NY Times texts from 1989. The bigram in question occurs 44 times in C2 but only 2 times in C1, so r = 0.03. The big difference is due to the 1989 papers dealing more with the fall of the Berlin Wall.

Summary. We've now considered two forms of hypothesis testing: the t-test and chi-square. We also looked at log-likelihood ratios as measures of relative probability under different hypotheses. Next, we begin to look at the problem of lexical acquisition.

References
Church, K. & Hanks, P. (1990). Word association norms, mutual information and lexicography. Computational Linguistics 16(1).
Kilgarriff, A. (2005). Language is never, ever, ever random. Corpus Linguistics and Linguistic Theory 1(2): 263.
Lapata, M., McDonald, S. & Keller, F. (1999). Determinants of Adjective-Noun plausibility. Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL-99).