Published by Franklin Hodge. Modified over 9 years ago.
1
Why We Need Corpora and the Sketch Engine
  Adam Kilgarriff
  Lexical Computing Ltd, UK
  Universities of Leeds and Sussex
2
Madrid, April 2010. Kilgarriff: Why corpora and how
Corpora show us the facts of the language
3
Exercise: planet
  Think about the word.
  What could you say about it if you were writing a dictionary entry?
  Write down three (or more) things.
4
The Sketch Engine: demo
  http://www.sketchengine.co.uk
5
Dictionaries
  How to decide what to say about the word?
6
Dictionaries
  How to decide what to say about the word?
  What the native speaker knows (introspection)
7
Dictionaries
  How to decide what to say about the word?
  What the native speaker knows (introspection)
  What other dictionaries say
8
Dictionaries
  How to decide what to say about the word?
  What the native speaker knows (introspection)
  What other dictionaries say
  Corpus
9
Four ages of corpus lexicography
10
Age 1: Pre-computer
  Oxford English Dictionary: 20 million index cards
11
Age 2: KWIC concordances
  From 1980
  Computerised
  Overhauled lexicography
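A KWIC (keyword in context) concordance lists every occurrence of a word with a few words of context on each side. The following is a minimal sketch, not the tool used by any particular dictionary project; the function names `kwic` and `format_kwic`, the whitespace tokenisation, and the sample sentence are all assumptions made for illustration.

```python
def kwic(tokens, node, width=4):
    """Return (left context, node, right context) for each hit of `node`."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


def format_kwic(hits, left_width=30):
    """Right-align the left context so the node words line up in a column."""
    return [f"{left:>{left_width}}  {tok}  {right}" for left, tok, right in hits]


# Illustrative sentence (made up), tokenised naively on whitespace.
sample = "We must save the forests and save money".split()
for line in format_kwic(kwic(sample, "save", width=2)):
    print(line)
```

Aligning the node word in a single column is what lets a lexicographer scan the lines quickly, as long as there are not too many of them.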
12
Age 2: limitations
  As corpora get bigger: too much data
  50 lines for a word: read all
  500 lines: could read all, but it takes a long time
  5000 lines: no
13
Age 3: Collocation statistics
  Problem: too much data. How to summarise?
  Solution: list of words occurring in the neighbourhood of the headword, with frequencies
  Sorted by salience
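The idea of a collocation list can be sketched in a few lines: count the words that occur in a window to the right of the node word, then sort them by a salience score. The slide does not name the statistic, so this sketch assumes pointwise mutual information as a stand-in; the function name `collocates`, the window size, and the toy corpus are likewise illustrative assumptions.

```python
import math
from collections import Counter


def collocates(tokens, node, window=3, min_hits=1):
    """List (collocate, co-occurrence count, score) sorted by score.

    The score is pointwise mutual information, used here only as an
    assumed example of a salience statistic.
    """
    n = len(tokens)
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Count every word in the window to the right of the node.
            for colloc in tokens[i + 1:i + 1 + window]:
                pair_freq[colloc] += 1
    rows = []
    for colloc, f_pair in pair_freq.items():
        if f_pair < min_hits:
            continue
        score = math.log2(f_pair * n / (word_freq[node] * word_freq[colloc]))
        rows.append((colloc, f_pair, score))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Toy corpus (made up); real lists need millions of words.
toy = "save money now and save money later but spend money today".split()
for colloc, hits, score in collocates(toy, "save", window=1):
    print(f"{colloc}\t{hits}\t{score:.2f}")
```

The `min_hits` parameter plays the role of a frequency threshold, which keeps very rare words from dominating the list.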
14
Collocation listing
  For collocates of save (>5 hits), to right of nodeword

  word        word
  forests     life
  $1.2        dollars
  lives       costs
  enormous    thousands
  annually    face
  jobs        estimated
  money       your
15
Age-3 collocation statistics: limitations
  Lists contain junk
  Unsorted for type: mixes together adverbs, subjects, objects, prepositions
  What we really want:
    noise-free lists
    one list for each grammatical relation
16
Age 4: The word sketch
  Large well-balanced corpus
  Parse to find subjects, objects, heads, modifiers etc.
  One list for each grammatical relation
  Statistics to sort each list, as before
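The step from one mixed collocate list to one list per grammatical relation can be sketched as a grouping operation. This is only an illustration: it assumes a parser has already produced (relation, headword, collocate) triples, the triples below are made up, the name `word_sketch` is hypothetical, and raw frequency stands in for whatever salience statistic a real system would use to sort each list.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword):
    """Group the collocates of `headword` into one sorted list per relation."""
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    # Raw frequency is used for sorting here, as a simplifying assumption.
    return {rel: counts.most_common() for rel, counts in by_relation.items()}


# Hypothetical parser output: (grammatical relation, headword, collocate).
triples = [
    ("object", "save", "money"),
    ("object", "save", "life"),
    ("object", "save", "money"),
    ("subject", "save", "government"),
    ("modifier", "save", "annually"),
    ("object", "spend", "money"),
]
for relation, items in word_sketch(triples, "save").items():
    print(relation, items)
```

Keeping the relations apart means objects, subjects and modifiers no longer compete for places in the same list.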
17
Macmillan English Dictionary for Advanced Learners
  Ed: Rundell, 2002, 2007
18
Demo part 2
19
Fruit task
  Choose a fruit
  Concordance: lemma, noun, lower case
  Frequency: node forms
  Write down:
    plural freq (pl)
    singular freq (sing)
  Compute proportion: pl/(pl+sing)
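The last step of the task is a single division. A small sketch, assuming the two frequencies have been read off the node-forms frequency list by hand; the function name `plural_proportion` and the counts in the usage line are placeholders, not real corpus figures.

```python
def plural_proportion(plural_freq, singular_freq):
    """Return pl / (pl + sing), the share of plural forms for a lemma."""
    total = plural_freq + singular_freq
    if total == 0:
        raise ValueError("the lemma has no occurrences")
    return plural_freq / total


# Placeholder counts, not taken from any corpus.
print(f"{plural_proportion(30, 70):.2f}")
```

Comparing the proportion across several fruits shows which nouns are used mostly in the plural and which mostly in the singular.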
20
What is a corpus?
  A collection of texts (as used for linguistic study)
  Which texts?
  How many?
21
Which texts?
  Written
  Spoken
22
Written
  Books
    Fiction
    Non-fiction
    Textbooks
  Newspapers
  Letters, unpublished
  Web pages
  Academic journals
  Student essays
  …
23
Spoken
  Must be transcribed, for text corpora
  Conversation
    Who? Region, class, age-group, situation…
  Lectures
  TV and radio
  Film transcripts
  Meetings, seminars
  …
24
Which texts?
  Different purposes, different text types
  Making dictionaries:
    Cover the whole language
    Some of everything
25
How much?
  Most words are rare
  Zipf’s Law
  To get enough data for most words, we need very big corpora
26
Zipf’s Law (r = rank, f = frequency)

  Word (pos)        r         f          r x f
  the (det)         1         6187267    6187267
  to (prep)         10        917579     9175790
  as (adv)          100       91583      9158300
  playing (vb)      1000      9738       9738000
  paint (vb)        2000      4539       9078000
  amateur (adj)     10,000    741        7410000
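The point of the table is that rank times frequency stays roughly constant while frequency itself falls by four orders of magnitude. A short sketch that recomputes the last column from the ranks and frequencies on the slide; the variable names are illustrative.

```python
# (word, part of speech, rank r, frequency f), as given on the slide.
rows = [
    ("the", "det", 1, 6187267),
    ("to", "prep", 10, 917579),
    ("as", "adv", 100, 91583),
    ("playing", "vb", 1000, 9738),
    ("paint", "vb", 2000, 4539),
    ("amateur", "adj", 10000, 741),
]

# Zipf's Law predicts that r * f is roughly the same for every word.
products = [rank * freq for _, _, rank, freq in rows]

for (word, pos, rank, freq), product in zip(rows, products):
    print(f"{word:<8} {pos:<5} {rank:>6} {freq:>8} {product:>9}")

# Compare the spread of the products with the spread of the frequencies.
print("largest / smallest r x f:", round(max(products) / min(products), 2))
print("largest / smallest f:    ", round(rows[0][3] / rows[-1][3]))
```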
27
Zipf’s Law
  the: 6%
  100 most frequent: 45%
  7500 most frequent: 90%
  all others: rare
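Coverage figures like these come from sorting the words of a corpus by frequency and summing the counts of the top k. A minimal sketch of that calculation on a toy token list; the function name `coverage` and the sample sentence are assumptions, and a real corpus is needed to reproduce the percentages on the slide.

```python
from collections import Counter


def coverage(tokens, k):
    """Share of all tokens accounted for by the k most frequent words."""
    if not tokens:
        raise ValueError("empty token list")
    counts = Counter(tokens)
    top = sum(freq for _, freq in counts.most_common(k))
    return top / len(tokens)


# Toy token list (made up), far too small to show realistic figures.
toy = "the cat and the dog and the bird".split()
for k in (1, 2, 5):
    print(k, coverage(toy, k))
```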
28
Zipf’s Law
29
Leading English Corpora: Size
  [Figure: size of corpora (in words), on a scale from 10^6 to 10^9, by decade from the 1960s to the 2000s. Corpora shown: Brown/LOB, COBUILD, BNC, OEC.]
30
Good news
  The web
31
Thank you
  http://www.sketchengine.co.uk