Collocation
Presenter: 이도관
Contents
1. Introduction
2. Frequency
3. Mean & Variance
4. Hypothesis Testing
5. Mutual Information
Collocation

Definition: a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meanings or connotations of its components.

Characteristics
- Non-compositionality: e.g., white wine, white hair, white woman
- Non-substitutability: e.g., white wine vs. yellow wine
- Non-modifiability: e.g., as poor as church mice vs. as poor as a church mouse

1. Introduction
Frequency (1)

The simplest method for finding collocations is counting word frequencies. Relying on frequency alone mostly surfaces pairs of function words:

C(w1, w2) | w1 | w2
80871     | of | the
58841     | in | the
26430     | to | the

2. Frequency
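The frequency method above can be sketched in a few lines. This is a minimal illustration with a toy token list (the sentence and counts are invented for the example, not from the lecture's corpus):

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs (w1, w2) in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))

tokens = "the dog sat on the mat near the dog".split()
counts = bigram_counts(tokens)
# On real corpora the most frequent pairs are mostly function-word
# combinations like 'of the', which is why raw frequency alone is a
# poor collocation measure.
print(counts.most_common(2))
```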
Frequency (2)

Using frequency together with tag patterns:

C(w1 w2) | w1     | w2     | Tag pattern
11487    | New    | York   | A N
7261     | United | States | A N
3301     | last   | year   | A N
...      | ...    | ...    | ...

2. Frequency
Tag patterns

Tag pattern | Example
A N         | linear function
N N         | regression coefficients
A A N       | Gaussian random variable
A N N       | cumulative distribution function
N A N       | mean squared error
N N N       | class probability function
N P N       | degrees of freedom

2. Frequency
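The tag-pattern filter above can be sketched as follows. The simplified A/N/P tags and the hand-tagged sentence are illustrative assumptions; a real system would run a POS tagger first:

```python
from collections import Counter

# Keep only bigrams whose POS tags match an allowed pattern such as
# A N (adjective-noun) or N N (noun-noun).
ALLOWED = {("A", "N"), ("N", "N")}

def pattern_bigrams(tagged):
    """tagged: list of (word, tag) pairs with simplified tags A/N/P/Det."""
    out = Counter()
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1, t2) in ALLOWED:
            out[(w1, w2)] += 1
    return out

# Hand-tagged toy input (tags assigned by hand, not by a real tagger).
tagged = [("the", "Det"), ("linear", "A"), ("function", "N"),
          ("of", "P"), ("regression", "N"), ("coefficients", "N")]
print(pattern_bigrams(tagged))
```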
Properties

Advantages
- Simple, yet gives reasonably good results.
- Works especially well for fixed phrases.

Disadvantages
- Results can be inaccurate; e.g., a web search finds 'powerful tea' 17 times even though it is not a natural collocation.
- Hard to apply when the phrase is not fixed, e.g., 'knock' and 'door'.

2. Frequency
Mean & Variance

Finds collocations consisting of two words that stand in a more flexible relationship to one another:
- They knocked at the door
- A man knocked on the metal front door

Compute the mean distance and variance between the two words; a low deviation makes the pair a good candidate for a collocation.

3. Mean & Variance
Tools
- Relative position of the two words (e.g., 'knock' and 'door') within a collocation window, since collocations are a local phenomenon.
- Mean: the average offset between the two words.
- Variance: how much the offset varies around that mean.

3. Mean & Variance
Example

[Figure: histogram of the position of 'strong' with respect to 'for', offsets -4 to 4; mean d = -1.12, standard deviation s = 2.15]

3. Mean & Variance
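The mean-and-variance method can be sketched as below: collect the signed offsets of one word relative to the other inside a collocation window, then look at their mean and spread. The sentence and window size are illustrative assumptions:

```python
from statistics import mean, pstdev

def offsets(tokens, w1, w2, window=5):
    """Signed distances pos(w2) - pos(w1) for every co-occurrence of the
    pair within the collocation window."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    return [j - i for i in pos1 for j in pos2 if 0 < abs(j - i) <= window]

tokens = "she knocked on the door then he knocked at the front door".split()
d = offsets(tokens, "knocked", "door")
# A low standard deviation means the two words tend to occur at a fixed
# distance, making the pair a good collocation candidate.
print(mean(d), pstdev(d))
```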
Properties

Advantages: good for finding collocations that have
- a looser relationship between the words
- intervening material and variable relative position

Disadvantage: compositional phrases like 'new company' can still be selected as collocation candidates.

3. Mean & Variance
Hypothesis Testing

Used to avoid selecting many word pairs that co-occur just by chance; 'new company' is mere composition. H0 (null hypothesis): there is no association between the words, i.e., p(w1 w2) = p(w1)p(w2). Tests: t test, test of differences, chi-square test, likelihood ratios.

4. Hypothesis Test
t test

The t statistic tells us how likely it is to observe a sample with a given mean and variance under the null hypothesis: t = (x̄ − μ) / sqrt(s² / N). Also used in probabilistic parsing and word sense disambiguation.

4. Hypothesis Test
t test example

t test applied to 10 bigrams (each with frequency 20). At significance level 0.005 the critical value is 2.576, so H0 can be rejected for the two candidates above that value.

4. Hypothesis Test
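As a worked example of the t test, the counts below are the ones Manning & Schütze report for 'new companies' in their New York Times corpus (if the lecture used different data, substitute its counts):

```python
import math

# Counts for 'new companies' as reported in Manning & Schuetze's
# New York Times example corpus.
N = 14307668                            # corpus size in tokens
c_new, c_companies, c_bigram = 15828, 4675, 8

mu = (c_new / N) * (c_companies / N)    # expected bigram prob. under H0
x_bar = c_bigram / N                    # observed bigram probability
# For a Bernoulli variable with small p, the sample variance s^2 ~= x_bar.
t = (x_bar - mu) / math.sqrt(x_bar / N)
print(t)  # ~1.0, well below the 0.005 critical value of 2.576
```

Since t is far below 2.576, independence cannot be rejected for this pair.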
Hypothesis test of differences

Used to find words whose co-occurrence patterns best distinguish between two words, e.g., 'strong' vs. 'powerful'. A t score is computed under H0: the average difference is 0 (μ1 = μ2).

4. Hypothesis Test
Difference test example

'powerful' vs. 'strong': 'strong' tends to describe an intrinsic quality, while 'powerful' tends to describe the power to move things.

4. Hypothesis Test
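For rare events the difference-test t score reduces to a simple form, (c1 − c2)/sqrt(c1 + c2), where c1 and c2 are the co-occurrence counts of the candidate word with each of the two contrasted words. A sketch with invented counts (the 10 vs. 0 figures are hypothetical, not from the lecture's tables):

```python
import math

def diff_t(c1, c2):
    """t score for the difference test: with small per-token probabilities
    the statistic reduces to (c1 - c2) / sqrt(c1 + c2), where c1 = C(w1 w)
    and c2 = C(w2 w) are co-occurrence counts in the same corpus."""
    return (c1 - c2) / math.sqrt(c1 + c2)

# Hypothetical counts of 'computers' after 'powerful' vs. after 'strong'.
print(diff_t(10, 0))
```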
Chi-square test

Unlike the t test, the chi-square test does not assume a normal distribution. It compares expected and observed frequencies; if the difference is large, H0 (independence) can be rejected. The chi-square statistic has also been used to identify translation pairs in aligned corpora.

4. Hypothesis Test
Chi-square example

'new companies': at significance level 0.005 the critical value is 3.841; χ² = 1.55, so H0 cannot be rejected.

4. Hypothesis Test
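The χ² = 1.55 figure can be reproduced with the closed-form 2×2 chi-square formula; the contingency-table cells below are derived from the Manning & Schütze counts for 'new' and 'companies':

```python
# 2x2 chi-square for 'new companies' (rows: w1 = new / not-new,
# columns: w2 = companies / not-companies), counts from the book's corpus.
O11, O12 = 8, 15820          # new companies   / new !companies
O21, O22 = 4667, 14287181    # !new companies  / !new !companies
N = O11 + O12 + O21 + O22

# Closed form for a 2x2 table:
# chi2 = N (O11 O22 - O12 O21)^2 / ((O11+O12)(O11+O21)(O12+O22)(O21+O22))
chi2 = N * (O11 * O22 - O12 * O21) ** 2 / (
    (O11 + O12) * (O11 + O21) * (O12 + O22) * (O21 + O22))
print(chi2)  # ~1.55, below the 0.005 critical value of 3.841
```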
Likelihood ratios

Work better for sparse data than the chi-square test, and are more interpretable. Hypotheses:
H1: p(w2|w1) = p = p(w2|¬w1)
H2: p(w2|w1) = p1 ≠ p2 = p(w2|¬w1)
with p = c2/N, p1 = c12/c1, p2 = (c2 − c12)/(N − c1). The likelihood ratio itself is given on p. 173.

4. Hypothesis Test
Likelihood ratios (2)

Table 5.12 (p. 174): 'powerful computers' is 1.3e18 times more likely than its base rate of occurrence would suggest.

Relative frequency ratio: compares relative frequencies between two or more different corpora; useful for finding subject-specific collocations. Table 5.13 (p. 176): 'Karim Obeid' (1990 vs. 1989): ratio 0.0241.

4. Hypothesis Test
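A sketch of Dunning's log-likelihood ratio as set up on pp. 172-173, using binomial log-likelihoods under H1 (one probability p) and H2 (separate p1, p2). The 'new companies' counts are the book's; the constant binomial coefficient is dropped because it cancels in the ratio:

```python
import math

def log_L(k, n, p):
    """Binomial log-likelihood of k successes in n trials, omitting the
    binomial coefficient, which cancels in the ratio."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

def log_likelihood_ratio(c1, c2, c12, N):
    """Dunning's -2 log(lambda) for the bigram w1 w2."""
    p = c2 / N                      # H1: same prob. after w1 and after !w1
    p1 = c12 / c1                   # H2: prob. of w2 after w1
    p2 = (c2 - c12) / (N - c1)      # H2: prob. of w2 after !w1
    log_lambda = (log_L(c12, c1, p) + log_L(c2 - c12, N - c1, p)
                  - log_L(c12, c1, p1) - log_L(c2 - c12, N - c1, p2))
    return -2 * log_lambda

# 'new companies' counts from the book's corpus; larger values mean
# stronger evidence against independence.
print(log_likelihood_ratio(c1=15828, c2=4675, c12=8, N=14307668))
```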
Mutual Information

Tells us how much one word tells us about the other. Example (Table 5.14, p. 178): I(Ayatollah, Ruhollah) = 18.38, i.e., our information about 'Ayatollah' occurring at position i increases by 18.38 bits if 'Ruhollah' occurs at position i+1.

5. Mutual Info.
Mutual Information (2)

Mutual information is a good measure of independence (it is 0 under perfect independence) but a bad measure of dependence: under perfect dependence its value still depends on how rare the words are, growing as they become rarer.

5. Mutual Info.
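Pointwise mutual information is a one-line formula; the counts below for 'Ayatollah Ruhollah' are the ones that reproduce the 18.38-bit value cited above (taken from the book's corpus):

```python
import math

def pmi(c1, c2, c12, N):
    """Pointwise mutual information in bits:
    I(w1, w2) = log2( P(w1 w2) / (P(w1) P(w2)) )."""
    return math.log2((c12 / N) / ((c1 / N) * (c2 / N)))

# 'Ruhollah' always follows 'Ayatollah' in this corpus (c12 == c2),
# so the PMI is high despite the low absolute counts.
print(pmi(c1=42, c2=20, c12=20, N=14307668))  # ~18.38 bits
```

Note how the score is driven by rarity: with perfect dependence it equals log2(N/c1), which is exactly why PMI overrates sparse pairs.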
Mutual Information (3)

Advantages
- Gives a rough measure of how much information one word conveys about another.
- Simple, while conveying the underlying notion more precisely.

Disadvantage
- For sparse data with low frequencies, the results can be unreliable.

5. Mutual Info.