Collocation
Presenter: 이도관
Contents
1. Introduction
2. Frequency
3. Mean & Variance
4. Hypothesis Testing
5. Mutual Information
Collocation

Definition: a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meanings or connotations of its components.

Characteristics
- Non-compositionality: e.g., white wine, white hair, white woman
- Non-substitutability: e.g., white wine vs. yellow wine
- Non-modifiability: e.g., as poor as church mice vs. as poor as a church mouse

1. Introduction
Frequency (1)

The simplest method for finding collocations is counting word frequencies. Relying on frequency alone mostly surfaces pairs of function words:

C(w1, w2) | w1 | w2
80871     | of | the
58841     | in | the
26430     | to | the

2. Frequency
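The frequency method above can be sketched in a few lines. This is a minimal illustration with a toy token list (the sentence and counts are invented for the example, not from the lecture's corpus):

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs (w1, w2) in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))

tokens = "the dog sat on the mat near the dog".split()
counts = bigram_counts(tokens)
# On real corpora the most frequent pairs are mostly function-word
# combinations like 'of the', which is why raw frequency alone is a
# poor collocation measure.
print(counts.most_common(2))
```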
Frequency (2)

Using frequency together with tag patterns:

C(w1 w2) | w1     | w2     | Tag pattern
11487    | New    | York   | A N
7261     | United | States | A N
3301     | last   | year   | A N
...      | ...    | ...    | ...

2. Frequency
Tag patterns

Tag pattern | Example
A N         | linear function
N N         | regression coefficients
A A N       | Gaussian random variable
A N N       | cumulative distribution function
N A N       | mean squared error
N N N       | class probability function
N P N       | degrees of freedom

2. Frequency
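The tag-pattern filter above can be sketched as follows. The simplified A/N/P tags and the hand-tagged sentence are illustrative assumptions; a real system would run a POS tagger first:

```python
from collections import Counter

# Keep only bigrams whose POS tags match an allowed pattern such as
# A N (adjective-noun) or N N (noun-noun).
ALLOWED = {("A", "N"), ("N", "N")}

def pattern_bigrams(tagged):
    """tagged: list of (word, tag) pairs with simplified tags A/N/P/Det."""
    out = Counter()
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1, t2) in ALLOWED:
            out[(w1, w2)] += 1
    return out

# Hand-tagged toy input (tags assigned by hand, not by a real tagger).
tagged = [("the", "Det"), ("linear", "A"), ("function", "N"),
          ("of", "P"), ("regression", "N"), ("coefficients", "N")]
print(pattern_bigrams(tagged))
```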
Properties

Advantages
- Simple, yet gives reasonably good results.
- Works especially well for fixed phrases.

Disadvantages
- Results can be inaccurate; e.g., a web search finds 'powerful tea' 17 times even though it is not a natural collocation.
- Hard to apply when the phrase is not fixed, e.g., 'knock' and 'door'.

2. Frequency
Mean & Variance

Finds collocations consisting of two words that stand in a more flexible relationship to one another:
- They knocked at the door
- A man knocked on the metal front door

Compute the mean distance and variance between the two words; a low deviation makes the pair a good candidate for a collocation.

3. Mean & Variance
Tools
- Relative position of the two words (e.g., 'knock' and 'door') within a collocation window, since collocations are a local phenomenon.
- Mean: the average offset between the two words.
- Variance: how much the offset varies around that mean.

3. Mean & Variance
Example

[Figure: histogram of the position of 'strong' with respect to 'for', offsets -4 to 4; mean d = -1.12, standard deviation s = 2.15]

3. Mean & Variance
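The mean-and-variance method can be sketched as below: collect the signed offsets of one word relative to the other inside a collocation window, then look at their mean and spread. The sentence and window size are illustrative assumptions:

```python
from statistics import mean, pstdev

def offsets(tokens, w1, w2, window=5):
    """Signed distances pos(w2) - pos(w1) for every co-occurrence of the
    pair within the collocation window."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    return [j - i for i in pos1 for j in pos2 if 0 < abs(j - i) <= window]

tokens = "she knocked on the door then he knocked at the front door".split()
d = offsets(tokens, "knocked", "door")
# A low standard deviation means the two words tend to occur at a fixed
# distance, making the pair a good collocation candidate.
print(mean(d), pstdev(d))
```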
Properties

Advantages: good for finding collocations that have
- a looser relationship between the words
- intervening material and variable relative position

Disadvantage: compositional phrases like 'new company' can still be selected as collocation candidates.

3. Mean & Variance
Hypothesis Testing

Used to avoid selecting many word pairs that co-occur just by chance; 'new company' is mere composition. H0 (null hypothesis): there is no association between the words, i.e., p(w1 w2) = p(w1)p(w2). Tests: t test, test of differences, chi-square test, likelihood ratios.

4. Hypothesis Test
t test

The t statistic tells us how likely it is to observe a sample with a given mean and variance under the null hypothesis: t = (x̄ − μ) / sqrt(s² / N). Also used in probabilistic parsing and word sense disambiguation.

4. Hypothesis Test
t test example

t test applied to 10 bigrams (each with frequency 20). At significance level 0.005 the critical value is 2.576, so H0 can be rejected for the two candidates above that value.

4. Hypothesis Test
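As a worked example of the t test, the counts below are the ones Manning & Schütze report for 'new companies' in their New York Times corpus (if the lecture used different data, substitute its counts):

```python
import math

# Counts for 'new companies' as reported in Manning & Schuetze's
# New York Times example corpus.
N = 14307668                            # corpus size in tokens
c_new, c_companies, c_bigram = 15828, 4675, 8

mu = (c_new / N) * (c_companies / N)    # expected bigram prob. under H0
x_bar = c_bigram / N                    # observed bigram probability
# For a Bernoulli variable with small p, the sample variance s^2 ~= x_bar.
t = (x_bar - mu) / math.sqrt(x_bar / N)
print(t)  # ~1.0, well below the 0.005 critical value of 2.576
```

Since t is far below 2.576, independence cannot be rejected for this pair.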
Hypothesis test of differences

Used to find words whose co-occurrence patterns best distinguish between two words, e.g., 'strong' vs. 'powerful'. A t score is computed under H0: the average difference is 0 (μ1 = μ2).

4. Hypothesis Test
Difference test example

'powerful' vs. 'strong': 'strong' tends to describe an intrinsic quality, while 'powerful' tends to describe the power to move things.

4. Hypothesis Test
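For rare events the difference-test t score reduces to a simple form, (c1 − c2)/sqrt(c1 + c2), where c1 and c2 are the co-occurrence counts of the candidate word with each of the two contrasted words. A sketch with invented counts (the 10 vs. 0 figures are hypothetical, not from the lecture's tables):

```python
import math

def diff_t(c1, c2):
    """t score for the difference test: with small per-token probabilities
    the statistic reduces to (c1 - c2) / sqrt(c1 + c2), where c1 = C(w1 w)
    and c2 = C(w2 w) are co-occurrence counts in the same corpus."""
    return (c1 - c2) / math.sqrt(c1 + c2)

# Hypothetical counts of 'computers' after 'powerful' vs. after 'strong'.
print(diff_t(10, 0))
```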
Chi-square test

Unlike the t test, the chi-square test does not assume a normal distribution. It compares expected and observed frequencies; if the difference is large, H0 (independence) can be rejected. The chi-square statistic has also been used to identify translation pairs in aligned corpora.

4. Hypothesis Test
Chi-square example

'new companies': at significance level 0.005 the critical value is 3.841; χ² = 1.55, so H0 cannot be rejected.

4. Hypothesis Test
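The χ² = 1.55 figure can be reproduced with the closed-form 2×2 chi-square formula; the contingency-table cells below are derived from the Manning & Schütze counts for 'new' and 'companies':

```python
# 2x2 chi-square for 'new companies' (rows: w1 = new / not-new,
# columns: w2 = companies / not-companies), counts from the book's corpus.
O11, O12 = 8, 15820          # new companies   / new !companies
O21, O22 = 4667, 14287181    # !new companies  / !new !companies
N = O11 + O12 + O21 + O22

# Closed form for a 2x2 table:
# chi2 = N (O11 O22 - O12 O21)^2 / ((O11+O12)(O11+O21)(O12+O22)(O21+O22))
chi2 = N * (O11 * O22 - O12 * O21) ** 2 / (
    (O11 + O12) * (O11 + O21) * (O12 + O22) * (O21 + O22))
print(chi2)  # ~1.55, below the 0.005 critical value of 3.841
```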
Likelihood ratios

Work better for sparse data than the chi-square test, and are more interpretable. Hypotheses:
H1: p(w2|w1) = p = p(w2|¬w1)
H2: p(w2|w1) = p1 ≠ p2 = p(w2|¬w1)
with p = c2/N, p1 = c12/c1, p2 = (c2 − c12)/(N − c1). The likelihood ratio itself is given on p. 173.

4. Hypothesis Test
Likelihood ratios (2)

Table 5.12 (p. 174): 'powerful computers' is 1.3e18 times more likely than its base rate of occurrence would suggest.

Relative frequency ratio: compares relative frequencies between two or more different corpora; useful for finding subject-specific collocations. Table 5.13 (p. 176): 'Karim Obeid' (1990 vs. 1989): ratio 0.0241.

4. Hypothesis Test
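A sketch of Dunning's log-likelihood ratio as set up on pp. 172-173, using binomial log-likelihoods under H1 (one probability p) and H2 (separate p1, p2). The 'new companies' counts are the book's; the constant binomial coefficient is dropped because it cancels in the ratio:

```python
import math

def log_L(k, n, p):
    """Binomial log-likelihood of k successes in n trials, omitting the
    binomial coefficient, which cancels in the ratio."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

def log_likelihood_ratio(c1, c2, c12, N):
    """Dunning's -2 log(lambda) for the bigram w1 w2."""
    p = c2 / N                      # H1: same prob. after w1 and after !w1
    p1 = c12 / c1                   # H2: prob. of w2 after w1
    p2 = (c2 - c12) / (N - c1)      # H2: prob. of w2 after !w1
    log_lambda = (log_L(c12, c1, p) + log_L(c2 - c12, N - c1, p)
                  - log_L(c12, c1, p1) - log_L(c2 - c12, N - c1, p2))
    return -2 * log_lambda

# 'new companies' counts from the book's corpus; larger values mean
# stronger evidence against independence.
print(log_likelihood_ratio(c1=15828, c2=4675, c12=8, N=14307668))
```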
Mutual Information

Tells us how much one word tells us about the other. Example (Table 5.14, p. 178): I(Ayatollah, Ruhollah) = 18.38, i.e., our information about 'Ayatollah' occurring at position i increases by 18.38 bits if 'Ruhollah' occurs at position i+1.

5. Mutual Info.
Mutual Information (2)

Mutual information is a good measure of independence (it is 0 under perfect independence) but a bad measure of dependence: under perfect dependence its value still depends on how rare the words are, growing as they become rarer.

5. Mutual Info.
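Pointwise mutual information is a one-line formula; the counts below for 'Ayatollah Ruhollah' are the ones that reproduce the 18.38-bit value cited above (taken from the book's corpus):

```python
import math

def pmi(c1, c2, c12, N):
    """Pointwise mutual information in bits:
    I(w1, w2) = log2( P(w1 w2) / (P(w1) P(w2)) )."""
    return math.log2((c12 / N) / ((c1 / N) * (c2 / N)))

# 'Ruhollah' always follows 'Ayatollah' in this corpus (c12 == c2),
# so the PMI is high despite the low absolute counts.
print(pmi(c1=42, c2=20, c12=20, N=14307668))  # ~18.38 bits
```

Note how the score is driven by rarity: with perfect dependence it equals log2(N/c1), which is exactly why PMI overrates sparse pairs.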
Mutual Information (3)

Advantages
- Gives a rough measure of how much information one word conveys about another.
- Simple, while conveying the underlying notion more precisely.

Disadvantage
- For sparse data with low frequencies, the results can be unreliable.

5. Mutual Info.