Simple Maths for Keywords Adam Kilgarriff Lexical Computing Ltd.

Slides:



Advertisements
Similar presentations
Terminology-finding in the Sketch Engine Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vit Suchomel Lexical Computing Ltd., Brighton,
Advertisements

Finding multiwords of more than two words Adam Kilgarriff, Pavel Rychly, Vojtech Kovar, Vıt Baisa Lexical Computing Ltd; Masaryk Univ., Cz.
Feed Corpus : An Ever Growing Up to Date Corpus Akshay Minocha, Siva Reddy, Adam Kilgarriff Lexical Computing Ltd.
Using Corpus Tools in Discourse Analysis Discourse and Pragmatics Week 12.
Chi-Square Test A fundamental problem is genetics is determining whether the experimentally determined data fits the results expected from theory (i.e.
1 Corpora for all Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.
Measuring Distance between Language Varieties Adam Kilgarriff, Jan Pomikalek, Pavel Rychly, Vit Suchomel Supported by EU Project PRESEMT.
Augmenting online dictionary entries with corpus data for Search Engine Optimisation Holger Hvelplund, 1 Adam Kilgarriff, 2 Vincent Lannoy, 1 Patrick White.
Corpus Creation for Lexicography Adam Kilgarriff, Michael Rundell Lexicography MasterClass, UK Elaine Ui Dhonnchadha ITE (Linguistics Institute of Ireland)
Using Corpora for Teaching Chinese Dr. Adam Kilgarriff Lexical Computing Ltd Leeds University UK.
The Sketch Engine -What is The Sketch Engine? -What is a corpus? -Looking at the BASE and the BAWE corpora. -How can this help.
Labels: automation Adam Kilgarriff. Kivik 2013Kilgarriff / Labels: automation2 Which words are:  Most distinctive of business English? Keywords, already.
Making useful wordlists for ELT Topical vocabulary from the WWW Simon Smith & Scott Sommers Ming Chuan University, Taipei Adam Kilgarriff, Lexical Computing.
Talking about your homework News story? –What made you choose…? One of your words? –What made you choose…? (Give your vocabulary books to another student.
1 Corpora for the coming decade Adam Kilgarriff Lexical Computing Ltd.
Today Writing: using the comma –Writing task Corpus linguistics talk, Part 2 Re-organize groups –Group news discussion.
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
Feed Corpus : An Ever Growing Up to Date Corpus Akshay Minocha, Siva Reddy, Adam Kilgarriff Lexical Computing Ltd.
Research methods in corpus linguistics Xiaofei Lu.
What's on the Web? The Web as a Linguistic Corpus Adam Kilgarriff Lexical Computing Ltd University of Leeds.
Memory Strategy – Using Mental Images
Labels: automation Adam Kilgarriff. Auckland 2012Kilgarriff / Labels: automation2 Which words are:  Most distinctive of business English?  Most often.
Traditional method 2 means, σ’s unknown. Scientists studying the effect of diet on cognitive ability are comparing two groups of mice. The first group.
So much of everything Adam Kilgarriff Lexical Computing Ltd.
Using Corpora for Teaching Chinese Dr. Adam Kilgarriff Lexical Computing Ltd Leeds University UK.
McEnery, T., Xiao, R. and Y.Tono Corpus-based language studies. Routledge. Unit A 2. Representativeness, balance and sampling (pp13-21)
1 Corpora, Language Technology and Maltese Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd University of Sussex.
Terminology, translation, and PRESEMT; word frequency lists and KELLY 1 Adam Kilgarriff Lexical Computing Ltd SKEW-2, March 2011Kilgarriff: PRESEMT and.
GDEX: Automatically finding good dictionary examples in a corpus Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, Pavel Rychlý Lexical Computing.
1 Corpora, Dictionaries, and points in between in the age of the web Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of.
Researching language with computers Paul Thompson.
Author: Marybeth Moretti
1 Corpora, Language Technology and Maltese Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd University of Sussex.
Genre in a Frequency Dictionary Adam Kilgarriff & Carole Tiberius.
Why We Need Corpora and the Sketch Engine Adam Kilgarriff Lexical Computing Ltd, UK Universities of Leeds and Sussex.
Kivik 2013Kilgarriff: Web corpora1 Web Corpora Adam Kilgarriff.
1 Using Corpora in Language Research -also Introduction to the Sketch Engine (WS15) part 1 Adam Kilgarriff Lexical Computing Ltd Universities of Leeds.
Chi-Square Test A fundamental problem in genetics is determining whether the experimentally determined data fits the results expected from theory. How.
1 Evaluating word sketches and corpora Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.
Corpus Evaluation Adam Kilgarriff Lexical Computing Ltd Corpus evaluationPortsmouth Nov
Advanced Math Topics Finals Review: Chapters 12 & 13.
Using Corpora in Language Research Adam Kilgarriff Lexical Computing Ltd Universities of Leeds January 2013Adam Kilgarriff.
Malta, May 2010Kilgarriff: Corpora by Web Services1 Corpora by Web Services Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities.
Terminology-finding in the Sketch Engine Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vit Suchomel Lexical Computing Ltd., Brighton,
The Sketch Engine as Infrastructure for Large Scale Text Collections for Humanities Research Adam Kilgarriff Lexical Computing Ltd. & Univ of Leeds, UK.
How Can Corpora Help Me To Be Successful in CO150?
Comparing Counts.  A test of whether the distribution of counts in one categorical variable matches the distribution predicted by a model is called a.
Subcorpus configuration Adam Kilgarriff. Feb 2010Kilgarriff: IWSG: Subcorpora2 “you can’t get away from genre” Bonnie Weber, Keynote Lecture ICON (Indian.
Inferential Statistics Introduction. If both variables are categorical, build tables... Convention: Each value of the independent (causal) variable has.
Grammar is to Meaning as the Law if to Good Behaviour Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.
Cumulative Weighted GPA Cumulative Unweighted GPA.
Algorithms and Pseudocode
GDEX: Automatically finding good dictionary examples in a corpus Auckland 2012Kilgarriff: GDEX1.
Class Notes Unit 2 – Scientific Research Science:.
T-tests Chi-square Seminar 7. The previous week… We examined the z-test and one-sample t-test. Psychologists seldom use them, but they are useful to understand.
Exploring Variation in Lexis and Genre in the Sketch Engine Adam Kilgarriff Lexical Computing Ltd., UK Supported by EU Project PRESEMT.
Using Corpora in TEFL By Terri Yueh. WhyWhy Work With Corpora? Why  From Vocabulary to Corpus  Choosing a Corpus Choosing a Corpus  Examples of Word.
Solving Algebraic Equations. Equality 3 = = = 7 For what value of x is: x + 4 = 7 true?
GDEX: Automatically finding good dictionary examples in a corpus Kivik 2013Kilgarriff: GDEX1.
Use of Concordancers A corpus (plural corpora) – a large collection of texts, written or spoken, stored on a computer. A concordancer – a computer programme.
GDEX: Automatically finding good dictionary examples in a corpus.
Goodness-of-Fit A test of whether the distribution of counts in one categorical variable matches the distribution predicted by a model is called a goodness-of-fit.
Chi-Square Test A fundamental problem is genetics is determining whether the experimentally determined data fits the results expected from theory (i.e.
Corpora: a key part of a materials writer’s toolkit
Making useful wordlists for ELT
Evaluating word sketches and corpora
Corpus Linguistics I ENG 617
Division Properties of Real Numbers
Corpora, Language Technology and Maltese
Corpus Size and the Robustness of Measures of Corpus Distance
Presentation transcript:

Simple Maths for Keywords Adam Kilgarriff Lexical Computing Ltd

Liverpool, July 2009Kilgarriff: Simple Maths2 “This word is twice as common here as there”

Liverpool, July 2009Kilgarriff: Simple Maths3 “This word is twice as common here as there”  What does it mean? For word wubble Ratio=2: wubble is twice as common in fc as rc Freq (f)Corp SizePer million Focus corp (fc) 4010m4 Reference corp (rc) 5025m2

Liverpool, July 2009Kilgarriff: Simple Maths4 “This word is twice as common here as there”  Not just words Grammatical constructions Suffixes …  Keyword list Calculate ratio for all words Sort Keywords: at top of list

Liverpool, July 2009Kilgarriff: Simple Maths5 Good enough for keywords?  Almost, but 1.Are corpora well matched? 2.Burstiness 3.You can’t divide by zero 4.High ratios more common for rare words

Liverpool, July 2009Kilgarriff: Simple Maths6 1Are corpora well matched?  Proportionality If fiction contains more American, newspaper more British… genre compromised by region  Usual problem  Issue in corpus design  Not here

Liverpool, July 2009Kilgarriff: Simple Maths7 2Burstiness WordBNC freqBNC files mucosa10319 theology unfortunate Discount frequency for bursty words Gries, CL 2007, also CL journal We use ARF (average reduced frequency) Not here

Liverpool, July 2009Kilgarriff: Simple Maths8 3You can’t divide by zero  Standard solution: add one  Problem solved fc rcratio buggle100? stort1000? nammikin10000? fc rcratio buggle111 stort1011 nammikin10011

Liverpool, July 2009Kilgarriff: Simple Maths9 4High ratios more common for rarer words fc rc ratiointeresting? spug101 no grod yes some researchers: grammar, grammar words some researchers: lexis content words No right answer Slider?

Liverpool, July 2009Kilgarriff: Simple Maths10 Solution  Don’t just add 1, add n: n=1  n=100 word fc rc fc+n rc+nRatioRank obscurish middling common word fc rc fc+n rc+nRatioRank obscurish middling common

Liverpool, July 2009Kilgarriff: Simple Maths11 Solution  n=1000 Summary word fc rc fc+n rc+nRatioRank obscurish middling common word fc rc n=1 n=100n=1000 obscurish1001st2nd3rd middling nd1st2nd common rd 1st

Liverpool, July 2009Kilgarriff: Simple Maths12 But what about  Mutual information  Log-likelihood  Chi-square  Fisher’s test  …  Don’t they use cleverer maths?

Liverpool, July 2009Kilgarriff: Simple Maths13 Yes but  Clever maths is for hypothesis testing Can you defeat null hypothesis?  Language is not random, so  … you always can  Null hypothesis never true  Hypothesis-testing not informative  Clever maths irrelevant Kilgarriff 2006, CLLT

Liverpool, July 2009Kilgarriff: Simple Maths14 Moreover…  just one answer grammar words vs content words? does not help  confuses and obscures

Liverpool, July 2009Kilgarriff: Simple Maths15 you should understand the maths you use

Liverpool, July 2009Kilgarriff: Simple Maths16 The Sketch Engine  Leading corpus query tool  Widely used by dictionary publishers, at universities  Large corpora for many lgs available  Word sketches  Web service  Since last week: Implements SimpleMaths

Liverpool, July 2009Kilgarriff: Simple Maths17 Example  BAWE British Academic Written English  Nesi and Thompson, completed last year Student essays  Arts/Humanities, Social Sciences, Life Sciences, Physical Sciences fc: ArtsHum, rc: SocSci With n=10 and n=1000

Liverpool, July 2009Kilgarriff: Simple Maths18

Liverpool, July 2009Kilgarriff: Simple Maths19

Liverpool, July 2009Kilgarriff: Simple Maths20 Thank you

Liverpool, July 2009Kilgarriff: Simple Maths21 Language is never ever ever random

Liverpool, July 2009Kilgarriff: Simple Maths22 Language

Liverpool, July 2009Kilgarriff: Simple Maths23 is

Liverpool, July 2009Kilgarriff: Simple Maths24 never

Liverpool, July 2009Kilgarriff: Simple Maths25 ever

Liverpool, July 2009Kilgarriff: Simple Maths26 ever

Liverpool, July 2009Kilgarriff: Simple Maths27 random