Using Corpora in Language Research. Adam Kilgarriff, Lexical Computing Ltd, Universities of Leeds. January 2013
May 2011 Adam Kilgarriff What is language? In our heads. In texts and sound signals. Both
Methodology: Study language in our heads. Competence; Chomsky; “rationalist” (Descartes, Leibniz). Odd method for objective science. Practical problems: coverage, arbitrariness
Methodology: Study text. “Empiricist” (Locke, Hume). Physics: forces, matter. Chemistry: chemicals, bonds. Language: text, speech signals
It goes against the grain. What is important about a sentence? Its meaning. Corpus methodology: throw away individual sentence meaning; find patterns
Computer power. Corpora: bigger and bigger data sets. Language technology tools: lemmatizers, POS-taggers, parsers. Machine learning, pattern-finding. 20 years of rapid ascent
All the linguisticses: Theoretical, Socio, Psycho, Developmental, Law and, Computational, Contrastive, Applied... linguistics
Developmental: CHILDES, TalkBank. How children learn language. Parents record all interactions. Since 1980s. Prof. Brian MacWhinney, Carnegie-Mellon. Many languages. Largest chunk: English, 23m words
Language change: Brown family. Small but perfectly formed. 1m words: 500 x 2000-word samples, the same 15 text types. Supports comparison: American and British English; 1931, 1961, 1991, 2006
Language and gender: When you see a dentist... What is now normal? Recent study: “they” now the norm; “themself” now needed, despite what spellcheck says. BNC (most text from 1989): 0.2/million. EnTenTen (mostly 2009): 0.4/million
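The per-million figures on this slide are normalized frequencies, which is what makes corpora of very different sizes comparable. A minimal sketch of the arithmetic follows; the function name and the raw counts are hypothetical, chosen only so that the result matches the 0.2 and 0.4 per million quoted on the slide.

```python
def per_million(hits, corpus_size):
    """Normalized frequency: occurrences per million tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits * 1_000_000 / corpus_size

# Hypothetical raw counts, NOT the real corpus figures.
older = per_million(20, 100_000_000)        # 0.2 per million
newer = per_million(4_000, 10_000_000_000)  # 0.4 per million

# Relative change between the two corpora.
growth = newer / older
```

Under these assumed counts the form is twice as frequent in the later corpus, even though the two corpora differ in size by a factor of 100.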
Language and law: Trade marks. Hoover and similar: trademark or generic? Cases: sabatier, botox, kettle chips. Key evidence: do people tend to capitalize?
English nouns: % capitalized
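One way the "% capitalized" evidence could be computed is sketched below. This is an illustration under stated assumptions, not the method used in the talk: the function name and the toy sentences are hypothetical, the input is assumed to be already tokenized, sentence-initial tokens are skipped because their capital letter is uninformative, and no part-of-speech filtering is done, so verb uses are counted too.

```python
def percent_capitalized(sentences, word):
    """Percentage of non-sentence-initial occurrences of `word` that are capitalized.

    `sentences` is a list of token lists. Returns None if the word never
    occurs outside sentence-initial position.
    """
    target = word.lower()
    total = 0
    capitalized = 0
    for tokens in sentences:
        # Skip the first token: every sentence starts with a capital anyway.
        for token in tokens[1:]:
            if token.lower() == target:
                total += 1
                if token[0].isupper():
                    capitalized += 1
    if total == 0:
        return None
    return 100.0 * capitalized / total

# Toy data for illustration only.
sentences = [
    ["I", "bought", "a", "Hoover", "yesterday"],
    ["She", "will", "hoover", "the", "stairs"],
    ["Hoover", "makes", "vacuum", "cleaners"],
    ["The", "hoover", "broke"],
    ["My", "Hoover", "is", "old"],
]
share = percent_capitalized(sentences, "hoover")
```

A real study would run this over a large corpus and restrict the count to noun uses, for example with a POS-tagger.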
Syntax and semantics
DANTE: Detailed account of English lexis. Corpus-driven. From word sketches: lexicographers assign to senses. High precision. Available at
What data shall I use?
Think hard
Sometimes... Just-in-time corpus from the web. Use case: translator, French-to-English. Translation task: volcanoes. “In French I understand it OK, but I'm no vulcanologist, I don't know the English terminology.” BootCaT (Baroni and Bernardini)
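The BootCaT approach starts from a handful of seed terms, combines them into random tuples, sends each tuple to a search engine as a query, and builds the corpus from the pages returned. The sketch below covers only the first step, turning seeds into queries. It is not BootCaT's own code: the function name, its parameters, and the volcano seed words are all assumptions made for illustration.

```python
import itertools
import random

def seed_tuples(seeds, tuple_size=3, n_tuples=10, rng_seed=0):
    """Pick up to `n_tuples` distinct random combinations of seed terms."""
    # All possible combinations of the de-duplicated seeds.
    combos = list(itertools.combinations(sorted(set(seeds)), tuple_size))
    # Fixed seed so the selection is reproducible.
    rng = random.Random(rng_seed)
    return rng.sample(combos, min(n_tuples, len(combos)))

# Hypothetical English seed terms for the volcano translation task.
seeds = ["volcano", "lava", "magma", "eruption", "crater", "ash"]

# Each tuple becomes one search-engine query string.
queries = [" ".join(terms) for terms in seed_tuples(seeds)]
```

The remaining steps (querying a search engine, downloading and cleaning pages) depend on external services and are left out of the sketch.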
Corpora in Sketch Engine. Access-to-all: 60 languages; all major world languages; mostly large, web-crawled. Various other: CHILDES, Brown,... “My corpora”: BootCaT and other
Thank you