Statistics MP Oakes (1998) Statistics for corpus linguistics. Edinburgh University Press.

1 Statistics MP Oakes (1998) Statistics for corpus linguistics. Edinburgh University Press

2 Rock bottom basics
Central tendency
– Applies to any set of numerical scores (e.g. frequency counts of word types, lengths of sentences in a corpus)
– Mode: the most frequently obtained score. Easily affected by chance scores.
– Median: the middle score when the scores are arranged in order. Will be close to the mean if the data are evenly distributed.
– Mean: the average (sum of the scores divided by the number of scores), usually written x̄ in equations.
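As a minimal sketch of the three measures, the snippet below uses Python's standard `statistics` module. The sentence lengths are hypothetical values chosen for illustration, not data from Oakes (1998).

```python
import statistics

# Hypothetical sentence lengths (in words); illustrative only.
sentence_lengths = [4, 7, 7, 9, 12, 15, 31]

mode = statistics.mode(sentence_lengths)      # most frequently obtained score
median = statistics.median(sentence_lengths)  # middle score once the list is sorted
mean = statistics.mean(sentence_lengths)      # sum of scores / number of scores

print("mode:", mode)
print("median:", median)
print("mean:", round(mean, 2))
```

The single long sentence (31 words) pulls the mean above the median, which is the kind of uneven distribution where the two measures diverge.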

3 Rock bottom basics
Probability of an event a, usually written P(a)
– For a set of alternative events, the probabilities total 1
– Events are assumed to be independent. This can be counter-intuitive, but in a coin toss the chance of heads is always 1/2, whatever the preceding tosses were.
Probability of an event a given some other condition b, written P(a|b)
– Notice that P(a|b) does not depend on how likely b itself is, e.g. P(skelter|helter)
Not to be confused with the probability of two events co-occurring
– Written P(a,b)
– Which is not the same as the combined probability P(a) × P(b), unless the two events really are independent
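The difference between P(a|b), P(a,b) and P(a) × P(b) can be sketched by estimating each from counts. The ten-word "corpus" below is invented for illustration, and treating adjacent word pairs as the co-occurrence event is an assumption of this sketch.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
tokens = "the helter skelter was a helter skelter of a ride".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent word pairs

n_tokens = len(tokens)
n_bigrams = len(tokens) - 1

p_a = unigrams["skelter"] / n_tokens   # P(skelter)
p_b = unigrams["helter"] / n_tokens    # P(helter)

# Joint probability: how often the pair "helter skelter" occurs at all.
p_joint = bigrams[("helter", "skelter")] / n_bigrams

# Conditional probability: given "helter", how often is the next word "skelter"?
p_a_given_b = bigrams[("helter", "skelter")] / unigrams["helter"]

print("P(a|b)    =", p_a_given_b)
print("P(a,b)    =", round(p_joint, 3))
print("P(a)*P(b) =", round(p_a * p_b, 3))
```

In this toy data the conditional probability is 1 even though "helter" is infrequent, while the joint probability is far larger than the product P(a) × P(b), which is what independence would predict.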

4 Simple word counts
A simple frequency count on its own might not tell you anything
Need to compare it with something else
– Frequency counts of other similar things
– Or the frequency count that you might expect on average
Then need to see if the measured difference is significant

5 Statistical significance
Probably the most commonly used statistic in all social science is the t-test
It is understood that any result could be due to random chance
Statistical significance tells you how likely it is that chance alone would produce the result you get
Usually involves looking something up in a table, using
– Level of certainty (significance level)
– Number of variables or degrees of freedom

6 Correlation
Frequency counts might provide an ordered list
You might want to compare counts of two things to see if they are correlated, e.g. word length in English and number of characters in Chinese (Xu 1996)
Pearson's correlation coefficient (r)
There's also a formula for rank correlation

7 Xu (1996)
       X      Y     X²    Y²    XY
       1      2      1     4     2
       2      1      4     1     2
       2      2      4     4     4
       3      1      9     1     3
       3      2      9     4     6
       4      2     16     4     8
       6      2     36     4    12
       6      3     36     9    18
       7      1     49     1     7
       7      2     49     4    14
       8      2     64     4    16
       9      2     81     4    18
      10      2    100     4    20
      11      2    121     4    22
      11      3    121     9    33
TOTAL  N=15   29   700    61   185
The critical value for 15 pairs of observations at the 5% significance level is 0.441, so the result is not statistically significant (it is at the 10% level, though)
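The column totals in the table are the ingredients of the usual computational formula for Pearson's r, r = (NΣXY − ΣXΣY) / √((NΣX² − (ΣX)²)(NΣY² − (ΣY)²)). The sketch below applies that formula to the X and Y values from the table. The function name is illustrative, and reading X as English word length and Y as number of Chinese characters is an assumption based on the preceding slide.

```python
import math

# X and Y columns from the Xu (1996) table (15 pairs of observations).
xs = [1, 2, 2, 3, 3, 4, 6, 6, 7, 7, 8, 9, 10, 11, 11]
ys = [2, 1, 2, 1, 2, 2, 2, 3, 1, 2, 2, 2, 2, 2, 3]


def pearson_r(xs, ys):
    """Pearson's r from raw sums (computational formula)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(xs, ys)
print("r =", round(r, 3))
print("significant at 5% (critical value 0.441)?", r > 0.441)
```

With these values r comes out at about 0.39, below the 0.441 critical value quoted on the slide, which matches the conclusion that the correlation is not significant at the 5% level.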

8 Comparison with expected values
We might want to compare relative frequencies of a range of features
Chi-square test shows if frequency differences are significant:
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
where O is the observed value and E is the expected value

9 Yamamoto (1996)
Frequencies of types of 3rd-person reference in English and Japanese
                      Japanese  English  TOTALS    E(J)    E(E)  X²(J)  X²(E)
Ellipsis                   104        0     104   48.60   55.40  63.14  55.40
Central pronouns            73      314     387  180.86  206.14  64.32  56.43
Non-central pronouns        12       28      40   18.69   21.31   2.40   2.10
Names                      314      291     605  282.73  322.27   3.46   3.03
Common NPs                 205      174     379  177.12  201.88   4.39   3.85
TOTAL                      708      807    1515
Sum = 258.8, significant for (5-1)x(2-1) = 4 df at the 0.1% level
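The expected values and the chi-square sum can be recomputed from the observed counts alone, taking each expected value as row total × column total / grand total. The following is a sketch of that calculation with illustrative names. Recomputed this way the sum comes to roughly 258.5, slightly below the 258.8 quoted on the slide; the difference may reflect rounding in the source.

```python
# Observed (Japanese, English) frequencies from the Yamamoto (1996) table.
observed = {
    "Ellipsis": (104, 0),
    "Central pronouns": (73, 314),
    "Non-central pronouns": (12, 28),
    "Names": (314, 291),
    "Common NPs": (205, 174),
}


def chi_square(table):
    """Return (chi-square, degrees of freedom) for a rows x columns table."""
    rows = list(table.values())
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(col_totals)
    total = 0.0
    for row in rows:
        row_total = sum(row)
        for o, col_total in zip(row, col_totals):
            e = row_total * col_total / grand_total  # expected value E
            total += (o - e) ** 2 / e                # (O - E)^2 / E
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return total, df


chi2, df = chi_square(observed)
print("chi-square =", round(chi2, 2), "with", df, "degrees of freedom")
```

Either value is far above the critical value for 4 degrees of freedom at the 0.1% level (about 18.47 in standard tables), so the conclusion on the slide is unaffected.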

10 Co-occurrence
Is the distribution of two things correlated?
Contingency table
– e.g. sentences where two words co-occur or not
          W1    not W1
W2        a     b
not W2    c     d
Phi coefficient
Dice's coefficient
Several variants
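The slide's own formulas are not preserved in the transcript, so the sketch below uses the standard textbook definitions in terms of the contingency cells: phi = (ad − bc) / √((a+b)(c+d)(a+c)(b+d)) and Dice = 2a / (2a + b + c). The function names and the cell counts are hypothetical.

```python
import math


def phi_coefficient(a, b, c, d):
    """Phi coefficient for a 2x2 contingency table (standard definition)."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


def dice_coefficient(a, b, c):
    """Dice's coefficient: shared occurrences relative to all occurrences."""
    return 2 * a / (2 * a + b + c)


# Hypothetical counts over 1000 sentences:
# a = both words, b = W2 without W1, c = W1 without W2, d = neither word.
a, b, c, d = 20, 10, 5, 965

phi = phi_coefficient(a, b, c, d)
dice = dice_coefficient(a, b, c)
print("phi  =", round(phi, 3))
print("Dice =", round(dice, 3))
```

Dice's coefficient ignores cell d (sentences containing neither word), whereas phi uses all four cells, which is one reason the two scores can rank word pairs differently.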

11 Co-occurrence
Scores such as Dice's coefficient need to be turned into something like a t score, so that significance can be measured

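12 Co-occurrence: mutual information
Measures the relatedness of two variables
Compares joint and combined probabilities: $\mathrm{MI}(a,b) = \log_2 \frac{P(a,b)}{P(a)\,P(b)}$
– MI ≈ 0: chance association
– MI >> 0: strong association
– MI << 0: complementary distribution
In terms of the contingency matrix, with N = a + b + c + d: $\mathrm{MI} = \log_2 \frac{aN}{(a+b)(a+c)}$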
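As a sketch of how the score behaves, the function below computes pointwise mutual information from the four cells of a 2x2 contingency table, using the standard definition log2(P(a,b) / (P(a)P(b))) with probabilities estimated as relative frequencies. The function name and the counts are hypothetical.

```python
import math


def mutual_information(a, b, c, d):
    """Pointwise MI (in bits) from 2x2 contingency counts.

    a = both words occur, b = W2 without W1,
    c = W1 without W2, d = neither word occurs.
    """
    n = a + b + c + d
    p_joint = a / n      # P(W1, W2)
    p_w1 = (a + c) / n   # P(W1)
    p_w2 = (a + b) / n   # P(W2)
    return math.log2(p_joint / (p_w1 * p_w2))


# Hypothetical strongly associated pair: MI well above 0.
strong = mutual_information(20, 10, 5, 965)

# Hypothetical pair co-occurring exactly as often as chance predicts: MI = 0.
chance = mutual_information(10, 10, 10, 10)

print("strong association:", round(strong, 2))
print("chance association:", round(chance, 2))
```

The function raises an error when a = 0, since the logarithm of zero is undefined; pairs that never co-occur need separate handling in practice.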

13 Church & Hanks (1990)
Used MI to show word associations
– E.g. doctors + {dentists, nurses, treating, treat, examine, bills, hospitals}
– In contrast with doctors + {with, a, is}
– Identify phrasal verbs, e.g. set + {up, off, out, in} but not about
– Using a parser to separate N and V readings, most likely objects of the verb drink
– What you can do to a telephone (sit by, disconnect, answer, …)

15 Church et al. (1991)
strong vs powerful experiment
MI      word pair          MI     word pair
10.47   strong northerly   8.66   powerful legacy
9.76    strong showings    8.58   powerful tool
9.3     strong believer    8.35   powerful storms

