Why We Need Corpora and the Sketch Engine
  Adam Kilgarriff
  Lexical Computing Ltd, UK
  Universities of Leeds and Sussex
Madrid, April 2010. Kilgarriff: Why corpora and how
Corpora show us the facts of the language
Exercise: planet
  Think about the word.
  What could you say about it if you were writing a dictionary entry?
  Write down three (or more) things.
The Sketch Engine: demo
Dictionaries
  How to decide what to say about the word?
    What the native speaker knows (introspection)
    What other dictionaries say
    Corpus
Four ages of corpus lexicography
Age 1: Pre-computer
  Oxford English Dictionary: 20 million index cards
Age 2: KWIC Concordances
  From 1980
  Computerised
  Overhauled lexicography
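A KWIC (key word in context) concordance lists every occurrence of a word with a few words of context on either side. The following is only a minimal sketch of the idea, not the Sketch Engine's implementation; the function name `kwic`, the `width` parameter, and the sample sentence are all made up for illustration.

```python
def kwic(tokens, node, width=3):
    """Return (left context, keyword, right context) for each hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical sample text, whitespace-tokenised for simplicity.
text = "we must save money now so that we can save the forests later"
for left, kw, right in kwic(text.split(), "save"):
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20}  {kw}  {right}")
```

Aligning the keyword in one column is what makes patterns visible at a glance, and it is also why the approach stops scaling once a word has thousands of lines.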
Age 2: limitations
  As corpora get bigger: too much data
    50 lines for a word: read all
    500 lines: could read all, but it takes a long time
    5000 lines: no
Age 3: Collocation statistics
  Problem: too much data - how to summarise?
  Solution: list of words occurring in neighbourhood of headword, with frequencies
  Sorted by salience
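As a rough illustration of the Age 3 idea, the sketch below counts the words found in a window to the right of a headword and sorts them by a salience score. The slide does not name the salience statistic, so logDice (14 + log2(2·f_xy / (f_x + f_y))) is used here purely as a stand-in; the function name, window size, and sample tokens are likewise assumptions.

```python
import math
from collections import Counter


def collocates(tokens, node, window=3, min_hits=1):
    """List (collocate, co-occurrence frequency, salience), most salient first."""
    freq = Counter(tokens)   # overall frequency of every word
    co = Counter()           # frequency of each word to the right of the node
    for i, tok in enumerate(tokens):
        if tok == node:
            for c in tokens[i + 1:i + 1 + window]:
                co[c] += 1
    rows = []
    for c, f_xy in co.items():
        if f_xy >= min_hits:
            # logDice, chosen here as an assumed salience measure.
            score = 14 + math.log2(2 * f_xy / (freq[node] + freq[c]))
            rows.append((c, f_xy, round(score, 2)))
    return sorted(rows, key=lambda r: r[2], reverse=True)


# Hypothetical toy corpus.
sample = "save money and save lives to save the money".split()
for word, f, score in collocates(sample, "save", window=1):
    print(word, f, score)
```

Note that the list mixes words of every grammatical type, and a function word such as "the" can rank as high as a content word.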
Collocation listing
  For collocates of save (>5 hits), to right of nodeword
    word          word
    forests       life
    $1.2          dollars
    lives         costs
    enormous      thousands
    annually      face
    jobs          estimated
    money         your
Age-3 collocation statistics: limitations
  Lists contain junk
  Unsorted for type
    mixes together adverbs, subjects, objects, prepositions
  What we really want:
    noise-free lists
    one list for each grammatical relation
Age 4: The word sketch
  Large well-balanced corpus
  Parse to find subjects, objects, heads, modifiers etc
  One list for each grammatical relation
  Statistics to sort each list, as before
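The "one list for each grammatical relation" step can be sketched as follows. This assumes the parsing has already been done and has produced (relation, head, dependent) triples; the triples, relation names, and function name below are invented for illustration, and raw frequency stands in for the salience statistic used to sort each list.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword):
    """Group a headword's collocates into one sorted list per grammatical relation."""
    lists = defaultdict(Counter)
    for rel, head, dep in triples:
        if head == headword:
            lists[rel][dep] += 1
    # Sort each list by frequency (a simplification of "statistics to sort each list").
    return {rel: counts.most_common() for rel, counts in lists.items()}


# Hypothetical parser output.
triples = [
    ("object", "save", "money"),
    ("object", "save", "money"),
    ("object", "save", "life"),
    ("subject", "save", "government"),
    ("modifier", "save", "annually"),
    ("object", "spend", "money"),
]
for rel, items in word_sketch(triples, "save").items():
    print(rel, items)
```

Because each relation gets its own list, objects such as "money" no longer compete with adverbs or subjects for a place in a single ranking.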
Macmillan English Dictionary For Advanced Learners
  Ed: Rundell, 2002, 2007
Demo part 2
Fruit task
  Choose fruit
  Concordance
    Lemma, noun, lower case
  Frequency: node forms
  Write down
    Plural freq (pl)
    Singular freq (sing)
  Compute proportion: pl/(pl+sing)
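The final step of the fruit task is simple arithmetic. A small sketch, with made-up counts in place of the frequencies you would read off the concordance:

```python
def plural_proportion(pl, sing):
    """Share of plural forms among all occurrences of the lemma: pl / (pl + sing)."""
    total = pl + sing
    if total == 0:
        raise ValueError("the lemma has no occurrences")
    return pl / total


# Hypothetical frequencies, not taken from any real corpus.
print(plural_proportion(pl=30, sing=70))
```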
What is a corpus?
  A collection of texts (as used for linguistic study)
  Which texts?
  How many?
Which texts?
  Written
  Spoken
Written
  Books
  Fiction
  Non-fiction
  Textbooks
  Newspapers
  Letters, unpublished
  Web pages
  Academic journals
  Student essays
  …
Spoken
  Must be transcribed, for text corpora
  Conversation
    Who? Region, class, age-group, situation…
  Lectures
  TV and Radio
  Film transcripts
  Meetings, seminars
  …
Which texts?
  Different purposes, different text types
  Making dictionaries:
    Cover the whole language
    Some of everything
How much?
  Most words are rare
  Zipf’s Law
  To get enough data for most words, we need very big corpora
Zipf’s Law
  Word (pos)       r     f     r x f
  the (det)
  to (prep)
  as (adv)
  playing (vb)
  paint (vb)
  amateur (adj)    10,
Zipf’s Law
  the: 6%
  100 most frequent: 45%
  7500 most frequent: 90%
  all others: rare
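Both views of Zipf's Law on these slides, the rank (r) by frequency (f) table and the share of text covered by the most frequent words, can be computed from any tokenised corpus. The sketch below uses a tiny invented token list, so its numbers bear no relation to the percentages quoted above; the function names are assumptions.

```python
from collections import Counter


def rank_frequency(tokens):
    """Rows of (rank r, word, frequency f, r x f), most frequent word first."""
    counts = Counter(tokens).most_common()
    return [(r, w, f, r * f) for r, (w, f) in enumerate(counts, start=1)]


def coverage(tokens, top_n):
    """Fraction of all tokens accounted for by the top_n most frequent words."""
    counts = Counter(tokens).most_common(top_n)
    return sum(f for _, f in counts) / len(tokens)


# Hypothetical toy corpus.
toy = "the cat and the dog and the bird".split()
for row in rank_frequency(toy):
    print(row)
print(coverage(toy, 1), coverage(toy, 2))
```

On a real corpus, Zipf's Law predicts that r x f stays roughly constant down the table, which is why a handful of words cover most of the text while most words remain rare.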
Zipf’s Law
Leading English Corpora: Size
  [Chart: size of corpora (in words) by decade, 1960s to 2000s; corpora shown: Brown/LOB, COBUILD, BNC, OEC]
Good news
  The web
Thank you