Published by Gwen Robinson. Modified over 9 years ago.
1
Simple Maths for Keywords
Adam Kilgarriff
Lexical Computing Ltd
2
Liverpool, July 2009. Kilgarriff: Simple Maths
“This word is twice as common here as there”
3
“This word is twice as common here as there”
What does it mean? For the word wubble:

                      Freq (f)   Corpus size   Per million
Focus corp (fc)       40         10m           4
Reference corp (rc)   50         25m           2

Ratio = 2: wubble is twice as common in fc as in rc
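The arithmetic behind the wubble figures can be sketched in a few lines of Python. The function name `per_million` is an illustrative assumption, not something taken from the talk or from any tool.

```python
def per_million(freq, corpus_size):
    """Normalise a raw frequency to occurrences per million words."""
    return freq * 1_000_000 / corpus_size

fc_pm = per_million(40, 10_000_000)   # focus corpus: 40 hits in 10m words
rc_pm = per_million(50, 25_000_000)   # reference corpus: 50 hits in 25m words
ratio = fc_pm / rc_pm                 # how many times more common in fc than in rc

print(fc_pm, rc_pm, ratio)
```

Normalising first matters because the two corpora differ in size: the raw count is higher in rc (50 against 40), yet the word is more common in fc.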
4
“This word is twice as common here as there”
Not just words:
  Grammatical constructions
  Suffixes
  …
Keyword list:
  Calculate ratio for all words
  Sort
  Keywords: at top of list
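The keyword-list recipe (calculate the ratio for every item, sort, read off the top) might look like the sketch below. The function name `keyword_list` and all the per-million figures except wubble's are made up for illustration, and items with a zero reference frequency are simply skipped here.

```python
def keyword_list(fc_pm, rc_pm):
    """Rank items by the ratio of focus to reference per-million frequency.

    fc_pm, rc_pm: dicts mapping item -> frequency per million.
    Items with a zero (or missing) reference frequency are left out.
    """
    ratios = {
        item: fc_pm[item] / rc_pm[item]
        for item in fc_pm
        if rc_pm.get(item, 0) > 0
    }
    return sorted(ratios.items(), key=lambda pair: pair[1], reverse=True)

# Illustrative figures only; an item can be a word, a suffix, a construction...
focus = {"wubble": 4.0, "the": 60000.0, "-ness": 300.0}
reference = {"wubble": 2.0, "the": 61000.0, "-ness": 200.0}

for item, ratio in keyword_list(focus, reference):
    print(f"{item:8} {ratio:.2f}")
```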
5
Good enough for keywords?
Almost, but:
1. Are corpora well matched?
2. Burstiness
3. You can’t divide by zero
4. High ratios more common for rare words
6
1. Are corpora well matched?
Proportionality
  If the fiction contains more American English and the newspapers more British English… genre is compromised by region
Usual problem
Issue in corpus design
Not here
7
2. Burstiness

Word          BNC freq   BNC files
mucosa        1031       9
theology      1032       230
unfortunate   1031       648

Discount frequency for bursty words
Gries, CL 2007, also CL journal
We use ARF (average reduced frequency)
Not here
8
3. You can’t divide by zero

            fc     rc   ratio
buggle      10     0    ?
stort       100    0    ?
nammikin    1000   0    ?

Standard solution: add one

            fc+1   rc+1   ratio
buggle      11     1      11
stort       101    1      101
nammikin    1001   1      1001

Problem solved
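A minimal sketch of the add-one fix, using the three made-up words from this slide. The function name `add_one_ratio` is an assumption for illustration.

```python
def add_one_ratio(fc, rc):
    """Ratio of focus to reference frequency after adding one to each."""
    return (fc + 1) / (rc + 1)

# All three words are absent from the reference corpus (rc = 0),
# so the plain ratio fc / rc would raise ZeroDivisionError.
for word, fc, rc in [("buggle", 10, 0), ("stort", 100, 0), ("nammikin", 1000, 0)]:
    print(word, add_one_ratio(fc, rc))
```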
9
4. High ratios more common for rarer words

        fc     rc    ratio   interesting?
spug    10     1     10      no
grod    1000   100   10      yes

Some researchers: grammar, grammar words
Some researchers: lexis, content words
No right answer
Slider?
10
Solution
Don’t just add 1, add n:

n=1
word        fc      rc      fc+n    rc+n    Ratio   Rank
obscurish   10      0       11      1       11.00   1
middling    200     100     201     101     1.99    2
common      12000   10000   12001   10001   1.20    3

n=100
word        fc      rc      fc+n    rc+n    Ratio   Rank
obscurish   10      0       110     100     1.10    3
middling    200     100     300     200     1.50    1
common      12000   10000   12100   10100   1.20    2
11
Solution

n=1000
word        fc      rc      fc+n    rc+n    Ratio   Rank
obscurish   10      0       1010    1000    1.01    3
middling    200     100     1200    1100    1.09    2
common      12000   10000   13000   11000   1.18    1

Summary
word        fc      rc      n=1   n=100   n=1000
obscurish   10      0       1st   3rd     3rd
middling    200     100     2nd   1st     2nd
common      12000   10000   3rd   2nd     1st
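The effect of n on the ranking can be checked with a short sketch. It uses the fc and rc values from the three ratio tables; the function name `simple_maths_ranking` is an assumption for illustration and is not the Sketch Engine implementation.

```python
def simple_maths_ranking(freqs, n):
    """Rank words by (fc + n) / (rc + n), highest ratio first.

    freqs: dict mapping word -> (fc, rc), frequencies as given on the slides.
    n: the constant added to both frequencies.
    """
    scored = {word: (fc + n) / (rc + n) for word, (fc, rc) in freqs.items()}
    return sorted(scored, key=scored.get, reverse=True)

freqs = {
    "obscurish": (10, 0),
    "middling": (200, 100),
    "common": (12000, 10000),
}

for n in (1, 100, 1000):
    print(n, simple_maths_ranking(freqs, n))
```

A small n lets the rare word win, a large n favours the common word, and a value in between brings the mid-frequency word to the top. In that sense n behaves like a slider between rare and common words.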
12
But what about
  Mutual information
  Log-likelihood
  Chi-square
  Fisher’s test
  …
Don’t they use cleverer maths?
13
Yes, but
Clever maths is for hypothesis testing
  Can you defeat the null hypothesis?
Language is not random, so … you always can
  Null hypothesis never true
  Hypothesis-testing not informative
  Clever maths irrelevant
Kilgarriff 2006, CLLT
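One way to see why the null hypothesis can always be defeated: for a fixed relative difference, the chi-square statistic grows with corpus size, so any real difference becomes "significant" once there is enough data. The sketch below is an illustration of that general point, not a reproduction of the argument in Kilgarriff 2006. The word frequencies (110 and 100 per million) are hypothetical, and the function name is an assumption.

```python
def chi_square_2x2(f1, n1, f2, n2):
    """Pearson chi-square (no continuity correction) for a word occurring
    f1 times in a corpus of n1 words and f2 times in a corpus of n2 words."""
    observed = [[f1, n1 - f1], [f2, n2 - f2]]
    total = n1 + n2
    row_totals = [n1, n2]
    col_totals = [f1 + f2, total - (f1 + f2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

CRITICAL_0_05 = 3.841  # chi-square critical value, 1 degree of freedom, p = 0.05

# Hypothetical word: 110 per million in one corpus, 100 per million in the other.
for millions in (1, 10, 100):
    size = millions * 1_000_000
    stat = chi_square_2x2(110 * millions, size, 100 * millions, size)
    print(millions, round(stat, 2), stat > CRITICAL_0_05)
```

The ratio between the two corpora is 1.1 at every size; only the verdict of the test changes, which is why the test says more about the amount of data than about the word.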
14
Moreover…
Clever maths:
  just one answer
    grammar words vs content words?
  does not help
  confuses and obscures
15
You should understand the maths you use
16
The Sketch Engine
Leading corpus query tool
Widely used by dictionary publishers, at universities
Large corpora available for many languages
Word sketches
Web service
Since last week: implements SimpleMaths
17
Example: BAWE
British Academic Written English
  Nesi and Thompson, completed last year
  Student essays
  Arts/Humanities, Social Sciences, Life Sciences, Physical Sciences
fc: ArtsHum, rc: SocSci
With n=10 and n=1000
20
Thank you
http://www.sketchengine.co.uk
21
Language is never ever ever random