1 Statistical NLP: Lecture 7 Collocations (Ch 5).


1 Statistical NLP: Lecture 7 Collocations (Ch 5)

2 Introduction Collocations are characterized by limited compositionality. There is a large overlap between the concepts of collocation and term, technical term, and terminological phrase. Collocations sometimes reflect interesting attitudes (in English) towards different types of substances: strong cigarettes, tea, coffee versus powerful drug (e.g., heroin). Historically there was little interest in collocation in the US but much in the UK (Firth's contextual view). Korean draws similar distinctions: 신발 (shoes) takes the verb 신다, while 양복 (a suit) takes 입다, two different verbs for "wear".

3 Definition (w.r.t. Computational and Statistical Literature) [A collocation is defined as] "a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components." [Choueka, 1988]

4 Other Definitions/Notions (w.r.t. Linguistic Literature) Collocations are not necessarily adjacent. Typical criteria for collocations: non-compositionality, non-substitutability, non-modifiability. Collocations cannot be translated word for word into other languages. Generalization to weaker cases (strong association of words, but not necessarily fixed occurrence).

5 Linguistic Subclasses of Collocations Light verbs: verbs with little semantic content Verb particle constructions or Phrasal Verbs Proper Nouns/Names Terminological Expressions

6 Overview of the Collocation Detecting Techniques Surveyed Selection of Collocations by Frequency Selection of Collocation based on Mean and Variance of the distance between focal word and collocating word. Hypothesis Testing Mutual Information

7 Frequency (Justeson &amp; Katz, 1995) 1. Select the most frequently occurring bigrams (table). 2. Pass the results through a part-of-speech filter that keeps tag patterns likely to be phrases: A N, N N, A A N, N A N, N N N, N P N (table). Table 5.4: strong vs. powerful. A simple method that works very well.
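The frequency-plus-filter idea above can be sketched in a few lines of Python. The tagged corpus, the simplified tag set (A, N, P, etc.), and the subset of patterns used are all illustrative assumptions, not the exact Justeson & Katz setup:

```python
from collections import Counter

# Hypothetical POS-tagged corpus: (word, tag) pairs with a simplified tag set
# (A = adjective, N = noun, P = preposition, D = determiner, V = verb).
tagged = [("strong", "A"), ("support", "N"), ("of", "P"), ("the", "D"),
          ("linear", "A"), ("regression", "N"), ("model", "N"),
          ("linear", "A"), ("regression", "N"), ("is", "V"), ("used", "V")]

# A subset of the tag patterns "likely to be phrases" from the slide.
PATTERNS = {("A", "N"), ("N", "N")}

def candidate_bigrams(tagged_tokens):
    """Count bigrams, keeping only those whose tag pattern passes the filter."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in PATTERNS:
            counts[(w1, w2)] += 1
    return counts

print(candidate_bigrams(tagged).most_common(2))
```

On this toy input, ("linear", "regression") passes the A N filter twice, while verb pairs such as ("is", "used") are filtered out entirely.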

8 Mean and Variance (I) (Smadja et al., 1993) Frequency-based search works well for fixed phrases, but many collocations consist of two words in more flexible relationships: {knock, hit, beat, rap} x {the door, on his door, at the door, on the metal front door}. The method computes the mean and variance of the offset (signed distance) between the two words in the corpus (Fig. 5.4): "a man knocked on Donaldson's door" (offset 5); "door before knocked" (-2); "door that she knocked" (-3). If the offsets are randomly distributed (i.e., no collocation), the variance/sample deviation will be high.

9 Mean and Variance (II) n = number of times the two words co-occur; di = the offset of the i-th co-occurrence. Sample mean: μ = (1/n) Σ di. Sample deviation: s = sqrt( Σ (di − μ)² / (n − 1) ).
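A minimal sketch of these offset statistics, assuming a toy tokenized corpus, a ±5-word collocation window, and at least two co-occurrences (the function name and window size are illustrative choices, not Smadja's exact implementation):

```python
from math import sqrt

def offset_stats(tokens, w1, w2, window=5):
    """Collect signed offsets of w2 relative to w1 within a +/- window,
    then return (sample mean, sample deviation) of those offsets.
    Assumes the pair co-occurs at least twice (n >= 2)."""
    offsets = []
    for i, tok in enumerate(tokens):
        if tok == w1:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] == w2:
                    offsets.append(j - i)  # signed distance w2 - w1
    n = len(offsets)
    mean = sum(offsets) / n
    var = sum((d - mean) ** 2 for d in offsets) / (n - 1)  # sample variance
    return mean, sqrt(var)

tokens = "she knocked on the door then he knocked at the door".split()
mean, s = offset_stats(tokens, "knocked", "door")
print(mean, s)
```

A tight, peaked offset distribution (small s) suggests a positional collocation; a flat one (large s) suggests the words merely co-occur.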

10 Example: position of "strong" w.r.t. "opposition": mean -1.15, standard deviation 0.67 (figure); only collocations within the window are considered. "strong" w.r.t. "support": strong leftist support. "strong" w.r.t. "for": Table 5.5. If the mean is well above 1.0 and the deviation is small, the pair is informative; if the deviation is large, it is not; a mean near 0 suggests a roughly normal (random) distribution of offsets. Smadja additionally used histograms to filter candidates, extracting terminology with 80% accuracy. knock-door is not terminology, but the knock-door association is still useful in natural language generation (for choosing among candidate words).

11 Hypothesis Testing: Overview High frequency and low variance can be accidental. We want to determine whether a co-occurrence is random or whether it occurs more often than chance would predict, e.g. new company. This is a classical problem in statistics called hypothesis testing. We formulate a null hypothesis H0 (no association, only chance), calculate the probability p that the collocation would occur if H0 were true, and reject H0 if p is too low; otherwise, we retain H0 as possible. Simplest form of H0: p(w1 w2) = p(w1) p(w2).

12 Hypothesis Testing: The t-test The t-test looks at the mean and variance of a sample of measurements, where the null hypothesis is that the sample is drawn from a distribution with mean μ. The test looks at the difference between the observed and expected means, scaled by the variance of the data, and tells us how likely one is to get a sample of that mean and variance assuming the sample is drawn from a normal distribution with mean μ. To apply the t-test to collocations, we think of the text corpus as a long sequence of N bigrams.

13 Hypothesis Testing: Formula t = (x̄ − μ) / sqrt(s² / N), where x̄ is the sample mean, s² the sample variance, N the sample size, and μ the mean of the distribution under H0. For a bigram w1 w2, μ = p(w1) p(w2), x̄ is the observed relative frequency of the bigram, and for small probabilities s² ≈ x̄ (Bernoulli trial).
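A minimal sketch of this t-score for a bigram, using the Bernoulli approximation s² ≈ x̄ for small probabilities. The unigram counts below come from the slides' new companies example, but the corpus size N here is a hypothetical round number, so the exact result is for illustration only:

```python
from math import sqrt

def t_score(c1, c2, c12, N):
    """t-score for bigram w1 w2 under H0: p(w1 w2) = p(w1) p(w2).
    c1, c2 = unigram counts, c12 = bigram count, N = number of bigrams.
    Uses s^2 ~ x_bar, valid when the bigram probability is small."""
    x_bar = c12 / N               # observed bigram probability
    mu = (c1 / N) * (c2 / N)      # expected probability under independence
    return (x_bar - mu) / sqrt(x_bar / N)

# Counts from the slides' example; N is a hypothetical corpus size.
print(round(t_score(c1=15828, c2=4675, c12=8, N=14_000_000), 3))
```

With these counts the score lands near 1, far below the 2.576 critical value used later, matching the slides' conclusion that new companies is not a collocation.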

14 Example: under H0 the mean height is 158 cm; a sample of 200 people has mean 169 and variance 2600. t = (169 − 158) / sqrt(2600 / 200) ≈ 3.05. At significance level α = 0.005 (the level commonly accepted in experimental science) the critical value is 2.576; since 3.05 > 2.576, we reject the null hypothesis with 99.5% confidence.
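The arithmetic in the height example can be checked directly from the numbers on the slide (H0 mean 158, sample mean 169, variance 2600, n = 200):

```python
from math import sqrt

# t = (sample mean - H0 mean) / sqrt(sample variance / sample size)
t = (169 - 158) / sqrt(2600 / 200)
print(round(t, 2))  # ≈ 3.05, above the 2.576 critical value at alpha = 0.005
```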

15 In the corpus, for new companies: p(new) = 15828 / N and p(companies) = 4675 / N, where N is the number of bigrams in the corpus. H0: p(new companies) = p(new) · p(companies) = 3.615 × 10⁻⁷. Treating each bigram as a Bernoulli trial with p = 3.615 × 10⁻⁷, the variance is p(1 − p) ≈ p, since p is very small. new companies appears 8 times: x̄ = 8 / N = 5.591 × 10⁻⁷. t = (5.591 − 3.615) × 10⁻⁷ / sqrt(5.591 × 10⁻⁷ / N) ≈ 1, which is below the critical value from the table, so we cannot reject the null hypothesis. Explained with Table 5.6.

16 Hypothesis testing of differences (Church &amp; Hanks, 1989) We may also want to find words whose co-occurrence patterns best distinguish between two words (e.g., strong vs. powerful). This application is useful in lexicography. The t-test is extended to comparing the means of two normal populations; here, the null hypothesis is that the average difference is 0.

17 Hypothesis testing of differences (II) t = (x̄₁ − x̄₂) / sqrt(s₁²/n₁ + s₂²/n₂). For counts of two candidate collocations v₁ w and v₂ w in the same corpus, this is commonly approximated as t ≈ (c(v₁ w) − c(v₂ w)) / sqrt(c(v₁ w) + c(v₂ w)).
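The count-based approximation of the difference t-test can be sketched as follows; the counts (strong tea vs. powerful tea) are hypothetical, chosen only to illustrate the lexicographic use case:

```python
from math import sqrt

def diff_t_score(c1w, c2w):
    """Approximate difference t-test for two candidate collocations
    v1 w and v2 w: t ~ (c(v1 w) - c(v2 w)) / sqrt(c(v1 w) + c(v2 w))."""
    return (c1w - c2w) / sqrt(c1w + c2w)

# Hypothetical counts: "strong tea" seen 20 times, "powerful tea" 2 times.
print(round(diff_t_score(20, 2), 2))
```

A large positive score indicates that w collocates much more strongly with v₁ than with v₂, which is exactly the contrast a lexicographer wants to surface.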

18 t-test for statistical significance of the difference between two systems

19 t-test for differences (continued) Pooled s² = (…) / (…) = …. To conclude that System 1 is better than System 2 at probability level α = 0.05, t must exceed the critical value t = 1.725 (from a statistics table). We cannot conclude the superiority of System 1 because of the large variance in scores.

20 Chi-Square test (I): Method The t-test has been criticized because it assumes that probabilities are approximately normally distributed, which is generally not true. The Chi-square test does not make this assumption. The essence of the test is to compare observed frequencies with the frequencies expected under independence. If the difference between observed and expected frequencies is large, we can reject the null hypothesis of independence.

21 Chi-Square test (II): Formula X² = Σᵢⱼ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ, summing over the cells of the contingency table of observed (O) and expected (E) frequencies. For a 2×2 bigram table this reduces to X² = N (O₁₁O₂₂ − O₁₂O₂₁)² / ((O₁₁+O₁₂)(O₁₁+O₂₁)(O₁₂+O₂₂)(O₂₁+O₂₂)).
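The 2×2 closed form can be implemented directly. The cell values below are illustrative, derived from the earlier counts c(new) = 15828, c(companies) = 4675, c(new companies) = 8 together with an assumed corpus size, so treat the exact result as a sketch rather than the book's table:

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Chi-square statistic for a 2x2 bigram contingency table.
    o11 = c(w1 w2), o12 = c(w1, not w2),
    o21 = c(not w1, w2), o22 = c(not w1, not w2)."""
    n = o11 + o12 + o21 + o22
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

# Illustrative table: 8 joint occurrences, marginals ~4675 and ~15828,
# and an assumed ~14.3M remaining bigrams.
print(round(chi_square_2x2(8, 4667, 15820, 14287173), 2))
```

The statistic comes out around 1.5, well below the 3.84 needed at α = 0.05 with one degree of freedom, so independence again cannot be rejected for new companies.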

22 Worked calculation: show the computation and explain it. For new companies the chi-square test again fails to reject the null hypothesis. The top 20 bigrams ranked by t-score and by X² are the same.

23 Chi-Square test (III): Applications One of the early uses of the chi-square test in statistical NLP was the identification of translation pairs in aligned corpora (Church &amp; Gale, 1991). A more recent application uses chi-square as a metric for corpus similarity (Kilgarriff &amp; Rose, 1998). Nevertheless, the chi-square test should not be used when counts are small, i.e., in small corpora.

24 Example: discussion of Table 5.10, also mentioning the cosine measure and related similarity metrics.

25 Likelihood Ratios I: Within a single corpus (Dunning, 1993) Likelihood ratios are more appropriate for sparse data than the chi-square test, and they are easier to interpret than the chi-square statistic. In applying the likelihood ratio test to collocation discovery, we examine two alternative explanations for the occurrence frequency of a bigram w1 w2: the occurrence of w2 is independent of the previous occurrence of w1, or the occurrence of w2 is dependent on the previous occurrence of w1.

26 Formula explanation: a binomial distribution is assumed.

27 Log likelihood log λ = log L(c₁₂, c₁, p) + log L(c₂ − c₁₂, N − c₁, p) − log L(c₁₂, c₁, p₁) − log L(c₂ − c₁₂, N − c₁, p₂), where L(k, n, x) = xᵏ(1 − x)ⁿ⁻ᵏ, p = c₂/N, p₁ = c₁₂/c₁, p₂ = (c₂ − c₁₂)/(N − c₁), c₁ = c(w1), c₂ = c(w2), c₁₂ = c(w1 w2).
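A sketch of Dunning's ratio under the binomial assumption, returning −2 log λ so the result is on a chi-square-like scale. The counts in the example are hypothetical, and the sketch assumes 0 &lt; c₁₂ &lt; c₂ so that no log(0) arises:

```python
from math import log

def log_L(k, n, x):
    """Log of the binomial likelihood kernel x^k (1 - x)^(n - k)."""
    return k * log(x) + (n - k) * log(1 - x)

def log_likelihood_ratio(c1, c2, c12, N):
    """Dunning's log likelihood ratio for bigram w1 w2, as -2 log lambda.
    Assumes 0 < c12 < c2 so all probabilities are strictly in (0, 1)."""
    p = c2 / N                    # H1: w2 independent of w1
    p1 = c12 / c1                 # H2: P(w2 | w1)
    p2 = (c2 - c12) / (N - c1)    # H2: P(w2 | not w1)
    log_lambda = (log_L(c12, c1, p) + log_L(c2 - c12, N - c1, p)
                  - log_L(c12, c1, p1) - log_L(c2 - c12, N - c1, p2))
    return -2 * log_lambda

# Hypothetical counts: a strongly associated pair in a large corpus.
print(round(log_likelihood_ratio(c1=42, c2=20, c12=19, N=14_300_000), 1))
```

For a pair this strongly associated the score is very large, while independent pairs score near 0, which is what makes the ratio readable as evidence strength.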

28 In practice: −2 log λ is asymptotically X²-distributed, comparing the full parameter space (p₁, p₂) against the null-hypothesis subspace (a subset of cases) where p₁ = p₂.

29 Likelihood Ratios II: Between two or more corpora (Damerau, 1993) Ratios of relative frequencies between two or more different corpora can be used to discover collocations that are characteristic of one corpus when compared to the others. The relative frequency ratio r is simple to compute: r₁ = c₁(w)/N₁, r₂ = c₂(w)/N₂, r = r₁/r₂ (Table 5.13). This approach is most useful for the discovery of subject-specific collocations.
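The ratio itself is a one-liner; the counts and corpus sizes below are hypothetical, chosen to show how a domain corpus surfaces a subject-specific phrase:

```python
def relative_frequency_ratio(c1w, N1, c2w, N2):
    """Damerau-style relative frequency ratio r = (c1(w)/N1) / (c2(w)/N2)."""
    return (c1w / N1) * (N2 / c2w)

# Hypothetical counts: a phrase seen 30 times in a 1M-word domain corpus
# but only 5 times in a 10M-word general corpus.
print(relative_frequency_ratio(30, 1_000_000, 5, 10_000_000))
```

A ratio far above 1 flags the phrase as characteristic of the first (domain) corpus.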

30 Mutual Information (I) An information-theoretic measure for discovering collocations is pointwise mutual information (Church et al., 1989, 1991). Pointwise mutual information is roughly a measure of how much one word tells us about the other. Pointwise mutual information does not work well with sparse data.

31 Mutual Information (II) I(w₁, w₂) = log₂ [ p(w₁ w₂) / (p(w₁) p(w₂)) ].
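Pointwise mutual information from raw counts is a direct transcription of the formula; the counts below are hypothetical, and the example is meant only to show the computation, not a real corpus value:

```python
from math import log2

def pmi(c1, c2, c12, N):
    """Pointwise mutual information of bigram w1 w2 from corpus counts:
    log2( p(w1 w2) / (p(w1) p(w2)) ) with p estimated as count / N."""
    return log2((c12 / N) / ((c1 / N) * (c2 / N)))

# Hypothetical counts: a rare but strongly associated pair.
print(round(pmi(c1=42, c2=20, c12=19, N=14_300_000), 2))
```

Note how large the score is for this rare pair; this sensitivity to low counts is exactly the sparse-data weakness the previous slide mentions.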

32 Explanation: why PMI does not work well, using the discussion on page 179; the question of whether bits are a meaningful unit here; the examples on pages 180 and 181 illustrate why applying PMI to this problem is problematic.

33 Summary of collocations: discussion on page 184, extensions on page 185, further explanation on page 186.