1
Using Corpora in Language Research
Adam Kilgarriff
Lexical Computing Ltd
Universities of Leeds
January 2013
2
May 2011 Adam Kilgarriff
What is language?
In our heads
In texts and sound signals
Both
6
Methodology
Study language in our heads
Competence
Chomsky
“rationalist” (Descartes, Leibniz)
Odd method for objective science
Practical problems: coverage, arbitrariness
8
Methodology
Study text
“empiricist” (Locke, Hume)
Physics: forces, matter
Chemistry: chemicals, bonds
Language: text, speech signals
9
It goes against the grain
What is important about a sentence?
  its meaning
Corpus methodology:
  Throw away individual sentence meaning
  Find patterns
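As one illustration of "find patterns", a minimal Python sketch is given here. It ignores what each sentence means and only counts which word pairs occur next to each other. The function name `bigram_counts`, the simple regex tokenizer, and the three example sentences are assumptions made for this sketch, not part of the talk.

```python
import re
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent word pairs across all sentences (hypothetical helper)."""
    counts = Counter()
    for sentence in sentences:
        # Crude tokenizer: lowercase runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Pair each token with the one that follows it.
        counts.update(zip(tokens, tokens[1:]))
    return counts


# Toy data, invented for illustration only.
toy_sentences = [
    "Strong tea, please.",
    "She likes strong tea.",
    "A strong wind blew.",
]
toy_counts = bigram_counts(toy_sentences)
```

On a real corpus the same counts, taken over millions of sentences, are the raw material for collocation statistics.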
10
Computer power
Corpora
  bigger and bigger data sets
Language technology tools
  lemmatizers, POS-taggers, parsers
Machine learning, pattern-finding
20 years of rapid ascent
11
All the linguisticses
Theoretical
Socio
Psycho
Developmental
Law and
Computational
Contrastive
Applied
...
linguistics
12
Developmental
CHILDES, TalkBank
How children learn language
Parents record all interactions
Since 1980s
Prof. Brian MacWhinney, Carnegie-Mellon
Many languages
Largest chunk: English, 23m words
20
Language change
Brown family
Small but perfectly formed
1m words
  500 x 2000-word samples
  the same 15 text types
Supports comparison
American and British English
1931, 1961, 1991, 2006
26
Language and gender
When you see a dentist...
What is now normal?
Recent study
  they now the norm
  themself now needed
    despite what spellcheck says
    BNC (most text from 1989): 0.2/million
    EnTenTen (mostly 2009): 0.4/million
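Per-million figures like these make corpora of very different sizes comparable. A minimal Python sketch of the normalization follows. The helper names `per_million` and `count_word` are assumptions for this sketch, and the count of 20 hits in 100 million tokens is a hypothetical input chosen only because it yields 0.2 per million. It is not a reported count from the BNC.

```python
import re


def per_million(hits: int, corpus_size: int) -> float:
    """Normalized frequency: occurrences per million tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits * 1_000_000 / corpus_size


def count_word(text: str, word: str) -> int:
    """Count case-insensitive whole-word matches of `word` in `text`."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return len(pattern.findall(text))


# Hypothetical input: 20 hits in a corpus of 100 million tokens.
example_rate = per_million(20, 100_000_000)

# Whole-word matching keeps "themselves" out of the count for "themself".
example_hits = count_word("Themself, themselves, themself.", "themself")
```

Doubling the rate between two corpora, as on the slide, says nothing about absolute counts: the larger corpus may hold many more occurrences at the same rate.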
27
Language and law
Trade marks
Hoover and similar
  trademark or generic
Cases
  sabatier, botox, kettle chips
Key evidence
  Do people tend to capitalize?
28
English nouns: % capitalized
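A capitalization percentage of this kind could be computed roughly as in the Python sketch below. This is not the method used in the talk. The function name `percent_capitalized` and the toy sentences are assumptions, and sentence-initial tokens are skipped on the assumption that their capital letter carries no evidence about trademark status.

```python
def percent_capitalized(sentences, word):
    """Percentage of non-sentence-initial occurrences of `word` that are capitalized.

    `sentences` is a list of token lists. Returns None if the word never
    occurs outside sentence-initial position.
    """
    target = word.lower()
    total = 0
    capitalized = 0
    for tokens in sentences:
        # Skip the first token: it is capitalized regardless of the word.
        for token in tokens[1:]:
            if token.lower() == target:
                total += 1
                if token[0].isupper():
                    capitalized += 1
    if total == 0:
        return None
    return 100.0 * capitalized / total


# Toy data, invented for illustration only.
toy_corpus = [
    ["I", "bought", "a", "Hoover", "yesterday"],
    ["Hoover", "sales", "rose"],          # sentence-initial, ignored
    ["She", "used", "the", "hoover", "twice"],
    ["The", "Hoover", "broke"],
]
hoover_share = percent_capitalized(toy_corpus, "hoover")
```

A word that is nearly always capitalized mid-sentence behaves like a name. One that is mostly lowercase behaves like a generic noun.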
29
Syntax and semantics
32
DANTE
Detailed account of English lexis
Corpus-driven
From word sketches
Lexicographers assign to senses
High precision
Available at http://webdante.com
33
What data shall I use?
34
Think hard
35
Sometimes...
Just-in-time corpus from the web
Use case: Translator, French-to-English
Translation task: volcanoes
  In French I understand it OK, but I'm no vulcanologist, I don't know the English terminology
BootCaT, Baroni and Bernardini
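BootCaT builds such a corpus from a handful of seed terms: it combines the seeds into small tuples, sends each tuple to a search engine as a query, and collects the pages that come back. The Python sketch below covers only the tuple-generation step and is not BootCaT's own code. The function name `seed_tuples`, its parameters, and the volcano seed terms are all assumptions made for illustration.

```python
import itertools
import random


def seed_tuples(seeds, tuple_size=3, n_queries=10, rng_seed=0):
    """Return up to `n_queries` distinct query strings built from seed terms.

    Each query joins `tuple_size` different seeds. A fixed `rng_seed`
    makes the random selection repeatable.
    """
    unique_seeds = sorted(set(seeds))
    combos = list(itertools.combinations(unique_seeds, tuple_size))
    rng = random.Random(rng_seed)
    chosen = rng.sample(combos, min(n_queries, len(combos)))
    return [" ".join(combo) for combo in chosen]


# Hypothetical seed terms for the volcano translation task.
volcano_seeds = ["volcano", "lava", "magma", "eruption", "crater"]
volcano_queries = seed_tuples(volcano_seeds, tuple_size=3, n_queries=4)
```

Using tuples rather than single terms steers the search engine toward pages that are about the topic, instead of pages that mention one term in passing.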
44
Corpora in Sketch Engine
Access-to-all
  60 languages
  All major world languages
  Mostly large, web-crawled
Various other
  CHILDES, Brown,...
“My corpora”
  BootCaT and other
45
Thank you
http://www.sketchengine.co.uk