

Collocation
Presenter: 이도관

Contents
1. Introduction
2. Frequency
3. Mean & Variance
4. Hypothesis Testing
5. Mutual Information

Collocation Definition
A sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components.
Characteristics:
- Non-compositionality, e.g., white wine, white hair, white woman
- Non-substitutability, e.g., white wine vs. yellow wine
- Non-modifiability, e.g., as poor as church mice vs. as poor as a church mouse
1. Introduction

Frequency (1)
The simplest method for finding collocations: counting word frequencies.
Relying on raw frequency alone, the top bigrams are dominated by function-word pairs:

C(w1, w2)   w1   w2
80871       of   the
58841       in   the
26430       to   the
...

2. Frequency
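The counting step can be sketched in a few lines of Python (the corpus here is a made-up toy sentence, not the slide's data):

```python
from collections import Counter

def bigram_counts(tokens):
    """Count how often each adjacent word pair (w1, w2) occurs."""
    return Counter(zip(tokens, tokens[1:]))

# Toy corpus: even here, the top pairs are function-word pairs like (on, the).
tokens = "the dog sat on the mat and the cat sat on the mat".split()
top = bigram_counts(tokens).most_common(3)
```

This illustrates why raw frequency alone is a weak signal: the most frequent pairs are grammatical, not collocational.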

Frequency (2)
Using frequency together with tag patterns:

C(w1 w2)   w1       w2       Tag pattern
11487      New      York     A N
7261       United   States   A N
3301       last     year     A N
...

Tag patterns

Tag pattern   Example
A N           linear function
N N           regression coefficients
A A N         Gaussian random variable
A N N         cumulative distribution function
N A N         mean squared error
N N N         class probability function
N P N         degrees of freedom
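A minimal sketch of the tag-pattern filter; the part-of-speech tags in the input are assumed to be given (a real system would run a tagger first), and the example sentence is invented:

```python
# Allowed tag patterns from the table above (A = adjective, N = noun, P = preposition).
PATTERNS = {("A", "N"), ("N", "N"), ("A", "A", "N"), ("A", "N", "N"),
            ("N", "A", "N"), ("N", "N", "N"), ("N", "P", "N")}

def pattern_candidates(tagged, n):
    """Yield n-grams whose tag sequence matches one of the allowed patterns."""
    for i in range(len(tagged) - n + 1):
        gram = tagged[i:i + n]
        if tuple(tag for _, tag in gram) in PATTERNS:
            yield tuple(word for word, _ in gram)

# Hypothetical pre-tagged input.
tagged = [("a", "Det"), ("linear", "A"), ("function", "N"),
          ("with", "P"), ("three", "Num"), ("degrees", "N"),
          ("of", "P"), ("freedom", "N")]
cands = [g for n in (2, 3) for g in pattern_candidates(tagged, n)]
```

Candidate n-grams surviving the filter would then be ranked by frequency, as in the New York table above.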

Properties
Advantages:
- Simple, and gives relatively good results.
- Works especially well for fixed phrases.
Disadvantages:
- Can produce inaccurate results, e.g., a web search finds 'powerful tea' 17 times even though it is not a collocation.
- Hard to apply when the words do not form a fixed phrase, e.g., knock and door.

Mean & Variance
Finds collocations consisting of two words that stand in a more flexible relationship to one another:
- They knocked at the door
- A man knocked on the metal front door
Compute the mean distance and variance of the offset between the two words; a low deviation makes a good collocation candidate.
3. Mean & Variance

Tools
- Relative position: the mean is the average offset between the two words; the variance (or standard deviation) measures how much that offset varies.
- Collocation window: collocation is a local phenomenon, so only co-occurrences within a small window around each word are counted.
[Figure: collocation windows around 'knock ... door']

Example
The positions of strong with respect to for have standard deviation s = 2.15; such a high deviation means strong/for is a poor collocation candidate.
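The offset computation can be sketched as follows, using the two knock/door sentences from the earlier slide; the window size and per-sentence processing are assumptions of this sketch:

```python
from statistics import mean, stdev

def offsets(sentences, w1, w2, window=5):
    """Signed distances from each occurrence of w1 to each w2 within the window."""
    out = []
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok != w1:
                continue
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i and sent[j] == w2:
                    out.append(j - i)
    return out

sents = ["they knocked at the door".split(),
         "a man knocked on the metal front door".split()]
d = offsets(sents, "knocked", "door")   # [3, 5]
d_bar, s = mean(d), stdev(d)            # mean 4.0, deviation ~1.41
```

The low deviation (about 1.41) despite the differing offsets is exactly what marks knock/door as a candidate even though it is not a fixed phrase.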

Properties
Advantage: good for finding collocations that have
- a looser relationship between the words
- intervening material and variable relative position
Disadvantage: compositional phrases such as 'new company' can still be selected as collocation candidates.

Hypothesis Testing
Used to avoid selecting many word pairs that co-occur just by chance ('new company' is mere composition).
H0 (null hypothesis): there is no association between the words, i.e., p(w1 w2) = p(w1)p(w2).
Tests: t test, test of differences, chi-square test, likelihood ratios.
4. Hypothesis Test

t test
The t statistic tells us how likely it is to obtain a sample with the observed mean and variance under the null hypothesis:
t = (x_bar - mu) / sqrt(s^2 / N)
where x_bar is the sample mean, s^2 the sample variance, N the sample size, and mu the mean under H0.
The t test is also used in probabilistic parsing and word sense disambiguation.
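A sketch of the bigram t score, using the 'new companies' counts from Manning & Schütze (c(new) = 15828, c(companies) = 4675, c(new companies) = 8, N = 14307668); the Bernoulli approximation s^2 ≈ x_bar follows their derivation:

```python
import math

def t_score(c1, c2, c12, N):
    """t statistic for a bigram: x_bar is the observed bigram probability,
    mu the probability expected under independence, and s^2 is approximated
    by x_bar (Bernoulli trial with small p)."""
    x_bar = c12 / N
    mu = (c1 / N) * (c2 / N)
    return (x_bar - mu) / math.sqrt(x_bar / N)

# t ~= 1.0, well below the critical value 2.576 (alpha = 0.005),
# so H0 (independence) is not rejected for 'new companies'.
t = t_score(15828, 4675, 8, 14307668)
```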

t test example
t test applied to 10 bigrams, each with frequency 20. At significance level alpha = 0.005 (critical value 2.576), H0 can be rejected for only the top 2 candidates.

Hypothesis testing of differences
To find words whose co-occurrence patterns best distinguish between two words, e.g., strong vs. powerful.
t score with H0: the average difference is 0:
t = (x1_bar - x2_bar) / sqrt(s1^2/n1 + s2^2/n2)

Difference test example: powerful vs. strong
- strong: describes an intrinsic quality
- powerful: describes the power to move things

Chi-square test
Does not assume a normal distribution (the t test does).
Compares expected and observed frequencies: if the difference is large, we can reject H0 (independence).
Also used to identify translation pairs in aligned corpora.
Chi-square statistic: X^2 = sum over i,j of (O_ij - E_ij)^2 / E_ij

Chi-square example
'new companies': X^2 = 1.55, below the critical value 3.841 (alpha = 0.05, 1 df), so we cannot reject H0.
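For a 2x2 bigram contingency table, the chi-square statistic reduces to a shortcut formula (as in Manning & Schütze); a sketch reproducing the 'new companies' result with their corpus counts:

```python
def chi_square(c1, c2, c12, N):
    """Pearson chi-square for the 2x2 contingency table of a bigram."""
    o11 = c12                   # w1 followed by w2
    o12 = c1 - c12              # w1 followed by something else
    o21 = c2 - c12              # w2 preceded by something else
    o22 = N - c1 - c2 + c12     # neither word
    num = N * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

# 'new companies': chi2 ~= 1.55 < 3.841 (alpha = 0.05, 1 df) -> keep H0.
chi2 = chi_square(15828, 4675, 8, 14307668)
```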

Likelihood ratios
Work better for sparse data than the chi-square test, and are more interpretable.
Hypotheses:
- H1: p(w2|w1) = p = p(w2|~w1)   (independence)
- H2: p(w2|w1) = p1 != p2 = p(w2|~w1)   (dependence)
with p = c2/N, p1 = c12/c1, p2 = (c2 - c12)/(N - c1).
Log likelihood ratio (pp. 173):
log lambda = log L(c12, c1, p) + log L(c2 - c12, N - c1, p) - log L(c12, c1, p1) - log L(c2 - c12, N - c1, p2),
where L(k, n, x) = x^k (1 - x)^(n - k).

Likelihood ratios (2)
Table 5.12 (pp. 174): 'powerful computers' is 1.3e18 times more likely than its base rate of occurrence would suggest.
Relative frequency ratio:
- relative frequencies of a phrase in two or more different corpora
- useful for subject-specific collocations
- Table 5.13 (pp. 176): Karim Obeid (1990 vs. 1989)
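The log likelihood ratio above can be sketched directly in the same notation; the counts in the example below are made up (not Table 5.12's), chosen so that one pair is strongly associated and the other exactly independent:

```python
import math

def log_L(k, n, x):
    """Log-likelihood of k successes in n Bernoulli trials with parameter x.
    Assumes 0 < x < 1."""
    return k * math.log(x) + (n - k) * math.log(1.0 - x)

def neg2_log_lambda(c1, c2, c12, N):
    """-2 log lambda comparing H1 (p1 = p2 = p) against H2 (p1 != p2)."""
    p = c2 / N
    p1 = c12 / c1
    p2 = (c2 - c12) / (N - c1)
    log_lambda = (log_L(c12, c1, p) + log_L(c2 - c12, N - c1, p)
                  - log_L(c12, c1, p1) - log_L(c2 - c12, N - c1, p2))
    return -2.0 * log_lambda

# Strongly associated toy pair: c12 = 50 vs. an expectation of 1 under H1.
high = neg2_log_lambda(100, 100, 50, 10000)
# Exactly independent toy pair: c12 = 1 matches the expectation, score ~0.
zero = neg2_log_lambda(100, 100, 1, 10000)
```

The quantity -2 log lambda is asymptotically chi-square distributed, which is what makes the score interpretable as a significance test.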

Mutual Information
Tells us how much one word tells us about the other.
Example (Table 5.14, pp. 178): I(Ayatollah, Ruhollah) — if Ruhollah occurs at position i+1, our certainty that Ayatollah occurs at position i increases by I(Ayatollah, Ruhollah) bits.
5. Mutual Info.

Mutual Information (2)
A good measure of independence, but a bad measure of dependence:
- perfect dependence: I(w1, w2) = log2 (1 / P(w2)), which grows as the words become rarer
- perfect independence: I(w1, w2) = log2 1 = 0
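Pointwise mutual information and its sparse-data problem can be sketched directly; the counts below are invented, with both pairs perfectly dependent (c1 = c2 = c12) so that I = log2(N / c):

```python
import math

def pmi(c1, c2, c12, N):
    """I(w1, w2) = log2( P(w1 w2) / (P(w1) P(w2)) )."""
    return math.log2((c12 / N) / ((c1 / N) * (c2 / N)))

# The rare pair gets the higher score even though the evidence for it
# is much weaker -- the bias toward sparse data noted on the next slide.
rare     = pmi(10,   10,   10,   10000)   # log2(1000) ~= 9.97
frequent = pmi(1000, 1000, 1000, 10000)   # log2(10)   ~= 3.32
```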

Mutual Information (3)
Advantages:
- Gives a rough measure of how much information one word conveys about the other.
- Simple, while conveying a more precise notion of collocation.
Disadvantage:
- For sparse data with low frequencies, the results can come out wrong.