Why We Need Corpora and the Sketch Engine. Adam Kilgarriff, Lexical Computing Ltd, UK, and Universities of Leeds and Sussex.


1 Why We Need Corpora and the Sketch Engine
Adam Kilgarriff, Lexical Computing Ltd, UK, and Universities of Leeds and Sussex

2 Madrid, April 2010. Kilgarriff: Why corpora and how
- Corpora show us the facts of the language

3 Exercise
- planet
- Think about the word
- What could you say about it if you were writing a dictionary entry?
- Write down three (or more) things

4 The Sketch Engine: demo
- http://www.sketchengine.co.uk

5 Dictionaries
- How to decide what to say about the word?

6 Dictionaries
- How to decide what to say about the word?
  - What the native speaker knows (introspection)

7 Dictionaries
- How to decide what to say about the word?
  - What the native speaker knows (introspection)
  - What other dictionaries say

8 Dictionaries
- How to decide what to say about the word?
  - What the native speaker knows (introspection)
  - What other dictionaries say
  - Corpus

9 Four ages of corpus lexicography

10 Age 1: Pre-computer
- Oxford English Dictionary: 20 million index cards

11 Age 2: KWIC Concordances
- From 1980
- Computerised
- Overhauled lexicography
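A KWIC (key word in context) concordance lists every occurrence of a word with a few words of context on each side. The following is a minimal sketch in Python, not the tool used in the talk; the function name `kwic`, the `width` parameter, and the toy sentence are all illustrative assumptions.

```python
def kwic(tokens, node, width=4):
    """Return one concordance line per occurrence of `node` in `tokens`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            # Up to `width` tokens of context on each side of the node word.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the node words line up.
            lines.append(f"{left:>30}  {tok}  {right}")
    return lines


# Toy text, invented for illustration only.
text = ("We must save the forests . A seat belt can save your life . "
        "They save money every year .")
tokens = text.split()

for line in kwic(tokens, "save"):
    print(line)
```

With a real corpus the same loop produces one line per hit, which is why the number of lines quickly becomes too large to read.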

12 Age 2: limitations
As corpora get bigger: too much data
- 50 lines for a word: read all
- 500 lines: could read all, but it takes a long time
- 5000 lines: no

13 Age 3: Collocation statistics
- Problem: too much data. How to summarise?
- Solution: list of words occurring in neighbourhood of headword, with frequencies
- Sorted by salience
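The idea of a frequency list sorted by salience can be sketched as follows. The slide does not say which salience statistic is meant, so this example assumes logDice, defined as 14 + log2(2·f_xy / (f_x + f_y)); the function name `collocates`, the window size, and the toy token list are illustrative assumptions as well.

```python
import math
from collections import Counter


def collocates(tokens, node, window=3, min_hits=1):
    """Rank words found up to `window` tokens to the right of `node`."""
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            for other in tokens[i + 1:i + 1 + window]:
                pair_freq[other] += 1

    scored = []
    for word, f_xy in pair_freq.items():
        if f_xy < min_hits:
            continue
        # Assumed salience measure: logDice = 14 + log2(2*f_xy / (f_x + f_y)).
        score = 14 + math.log2(2 * f_xy / (word_freq[node] + word_freq[word]))
        scored.append((word, f_xy, round(score, 2)))
    # Most salient collocates first.
    return sorted(scored, key=lambda row: row[2], reverse=True)


# Toy data, invented for illustration only.
toy_tokens = ("save money and save lives and save the forests "
              "and the money and the time").split()

for word, freq, score in collocates(toy_tokens, "save", window=1):
    print(word, freq, score)
```

Even in this tiny example the frequent word "the" ranks below "lives", although each follows "save" exactly once. A salience score is meant to have this effect.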

14 Collocation listing
For collocates of save (>5 hits), to right of nodeword

word        word
forests     life
$1.2        dollars
lives       costs
enormous    thousands
annually    face
jobs        estimated
money       your

15 Age-3 collocation statistics: limitations
Lists:
- contain junk
- are unsorted for type: they mix together adverbs, subjects, objects, prepositions
What we really want:
- noise-free lists
- one list for each grammatical relation

16 Age 4: The word sketch
- Large well-balanced corpus
- Parse to find subjects, objects, heads, modifiers etc
- One list for each grammatical relation
- Statistics to sort each list, as before
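The step from one mixed collocate list to one list per grammatical relation can be sketched like this. The example assumes a parser has already produced (relation, headword, collocate) triples; the triples, the relation labels, and the function name `word_sketch` are hypothetical, and raw frequency stands in for the salience statistic to keep the sketch short.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword):
    """Group the collocates of `headword` by grammatical relation."""
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    # One list per relation, most frequent collocate first.
    return {rel: counts.most_common() for rel, counts in by_relation.items()}


# Hypothetical parser output, invented for illustration only.
triples = [
    ("object", "save", "money"),
    ("object", "save", "life"),
    ("object", "save", "money"),
    ("subject", "save", "doctor"),
    ("modifier", "save", "annually"),
    ("object", "spend", "money"),
]

sketch = word_sketch(triples, "save")
for relation, items in sketch.items():
    print(relation, items)
```

Objects, subjects, and modifiers of "save" now appear in separate lists instead of being mixed together.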

17 Macmillan English Dictionary for Advanced Learners
Ed: Rundell, 2002, 2007

18 Demo part 2

19 Fruit task
- Choose fruit
- Concordance
  - Lemma, noun, lower case
- Frequency: node forms
- Write down
  - Plural freq (pl)
  - Singular freq (sing)
- Compute proportion: pl/(pl+sing)
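The last step of the task is a one-line calculation. A minimal sketch follows; the function name and the example counts are made up for illustration and are not corpus figures.

```python
def plural_proportion(pl, sing):
    """Share of plural forms among all occurrences: pl / (pl + sing)."""
    total = pl + sing
    if total == 0:
        raise ValueError("no occurrences of either form")
    return pl / total


# Hypothetical counts: 300 plural hits and 100 singular hits.
print(plural_proportion(300, 100))  # 0.75
```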

20 What is a corpus?
- A collection of texts (as used for linguistic study)
- Which texts?
- How many?

21 Which texts?
- Written
- Spoken

22 Written
- Books
  - Fiction
  - Non-fiction
  - Textbooks
- Newspapers
- Letters, unpublished
- Web pages
- Academic journals
- Student essays
- …

23 Spoken
Must be transcribed, for text corpora
- Conversation
  - Who? Region, class, age-group, situation…
- Lectures
- TV and Radio
- Film transcripts
- Meetings, seminars
- …

24 Which texts?
- Different purposes, different text types
- Making dictionaries:
  - Cover the whole language
  - Some of everything

25 How much?
- Most words are rare
- Zipf’s Law
- To get enough data for most words, we need very big corpora

26 Zipf’s Law

Word (pos)        r        f          r x f
the (det)         1        6187267    6187267
to (prep)         10       917579     9175790
as (adv)          100      91583      9158300
playing (vb)      1000     9738       9738000
paint (vb)        2000     4539       9078000
amateur (adj)     10,000   741        7410000
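Zipf's Law says that rank times frequency stays roughly constant across the vocabulary. The sketch below recomputes the r x f column from the ranks and frequencies in the table above; only the variable names are additions.

```python
# (word, part of speech, rank r, frequency f), copied from the table above.
rows = [
    ("the", "det", 1, 6187267),
    ("to", "prep", 10, 917579),
    ("as", "adv", 100, 91583),
    ("playing", "vb", 1000, 9738),
    ("paint", "vb", 2000, 4539),
    ("amateur", "adj", 10000, 741),
]

products = [rank * freq for _, _, rank, freq in rows]

for (word, pos, rank, freq), product in zip(rows, products):
    print(f"{word:<8} {pos:<5} r={rank:<6} f={freq:<8} r*f={product}")

# Ratio between the largest and smallest product in the table.
print(round(max(products) / min(products), 2))
```

The frequencies span four orders of magnitude, yet the products all stay within a factor of two of each other.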

27 Zipf’s Law
- the: 6%
- 100 most frequent: 45%
- 7500 most frequent: 90%
- all others: rare
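Coverage figures like these come from summing the frequencies of the top-ranked words and dividing by the corpus size. The following is a minimal sketch on a toy sentence; the function name `coverage` and the data are illustrative assumptions, and the toy percentages have nothing to do with the corpus figures on the slide.

```python
from collections import Counter


def coverage(counts, top_n):
    """Fraction of all tokens accounted for by the `top_n` most frequent words."""
    total = sum(counts.values())
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / total


# Toy data, invented for illustration only.
counts = Counter("the cat and the dog and the bird".split())

print(coverage(counts, 1))  # share of the single most frequent word
print(coverage(counts, 2))  # share of the two most frequent words
```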

28 Zipf’s Law

29 Leading English Corpora: Size
[Chart: size of corpora (in words), axis from 10^6 to 10^9, by decade from the 1960s to the 2000s. Corpora shown: Brown/LOB, COBUILD, BNC, OEC]

30 Good news
- The web

31 Thank you
http://www.sketchengine.co.uk

