Corpora, Language Technology and Maltese. Adam Kilgarriff (Lexical Computing Ltd; Lexicography MasterClass Ltd; University of Sussex)


1 Corpora, Language Technology and Maltese. Adam Kilgarriff, Lexical Computing Ltd, Lexicography MasterClass Ltd, University of Sussex

2 Malta, Nov 2006 Kilgarriff, Lexical Computing Slide: 2 How do you find out about a language? Native speakers; dictionaries and grammars; corpus.

3 Four ages of corpus research

4 Age 1: Pre-computer. Oxford English Dictionary: 20 million index cards.

5 Age 2: KWIC concordances. From 1980; computerised.

6 Age 2: KWIC concordance
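A KWIC (key word in context) concordance lists every occurrence of a word with a few words of context on either side. The slide does not show how one is built, so the following is only a minimal sketch: the function name `kwic`, the whitespace tokenisation and the sample sentence are all assumptions made for illustration.

```python
def kwic(tokens, keyword, width=4):
    """Return one (left context, keyword, right context) triple per hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            # Take up to `width` tokens on each side of the hit.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Toy text (invented); a real corpus would be tokenised properly.
text = ("the party won the election . we went to a party last night . "
        "a party of tourists arrived")

for left, kw, right in kwic(text.split(), "party", width=3):
    # Right-align the left context so the keyword forms a column.
    print(f"{left:>20}  {kw}  {right}")
```

Aligning the keyword in a single column is what lets a reader scan the contexts quickly and group them by sense.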

7 Age 2: KWIC concordances. From 1980; computerised. The COBUILD project was the innovator; the coloured-pens method.

8 The coloured-pens method: 1 political association; 2 social event; 3 group of people; 4 person in an agreement/dispute; 5 to be party to something ...

9 Age 2: limitations. As corpora get bigger, there is too much data. 50 lines for a word: read them all. 500 lines: could read them all, but it takes a long time. 5000 lines: no.

10 Age 3: Collocation statistics. Problem: too much data; how to summarise? Solution: a list of words occurring in the neighbourhood of the headword, with frequencies, sorted by salience.

11 Collocation listing. Right collocates of save (>5 hits):
word      freq    word       freq
forests   6       life       36
$1.2      6       dollars    8
lives     37      costs      7
enormous  6       thousands  6
annually  7       face       9
jobs      20      estimated  6
money     64      your       7
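A listing like this can be produced by counting the words that follow the headword within a small window, then ranking them by a salience score. The slides do not say which statistic was used, so this sketch assumes pointwise mutual information (PMI) as a stand-in. The function name `right_collocates`, the window size and the toy token list are hypothetical.

```python
import math
from collections import Counter


def right_collocates(tokens, headword, window=3, min_freq=1):
    """Rank words found up to `window` tokens to the right of `headword`.

    Returns (word, co-occurrence frequency, PMI) rows, highest PMI first.
    PMI is an assumed salience measure, not necessarily the one on the slide.
    """
    total = len(tokens)
    word_freq = Counter(tokens)
    co_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok == headword:
            co_freq.update(tokens[i + 1:i + 1 + window])

    rows = []
    for word, f_xy in co_freq.items():
        if f_xy < min_freq:
            continue  # frequency threshold, like ">5 hits" on the slide
        pmi = math.log2(f_xy * total / (word_freq[headword] * word_freq[word]))
        rows.append((word, f_xy, pmi))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Toy corpus (invented); real listings need millions of words.
tokens = "save money now save money later save lives today spend money".split()
for word, freq, score in right_collocates(tokens, "save", window=1):
    print(f"{word:10} {freq:3} {score:6.2f}")
```

Plain PMI tends to favour rare words, which is one reason a frequency threshold is usually applied alongside it.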

12 Age 4: The word sketch. A corpus-derived one-page summary of a word’s grammatical and collocational behaviour.

13 Age 4: The word sketch. Large, well-balanced corpus. Parse to find subjects, objects, heads, modifiers etc. One list for each grammatical relation. Statistics to sort each list.
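The recipe on this slide (parse, build one list per grammatical relation, sort each list with a statistic) can be sketched roughly as follows. This is an illustration under stated assumptions, not the Sketch Engine's implementation: the input is a hand-written list of hypothetical `(relation, headword, collocate)` triples standing in for parser output, and the sorting statistic is logDice, chosen here only as one plausible association score.

```python
import math
from collections import Counter, defaultdict


def word_sketch(triples, headword):
    """Group a headword's collocates by grammatical relation, sorted by score.

    `triples` is a list of (relation, headword, collocate) tuples, assumed to
    come from a parser. The score is logDice: 14 + log2(2*f_xy / (f_x + f_y)).
    """
    pair_freq = Counter(triples)
    head_freq = Counter((rel, head) for rel, head, _ in triples)
    coll_freq = Counter((rel, coll) for rel, _, coll in triples)

    sketch = defaultdict(list)
    for (rel, head, coll), f_xy in pair_freq.items():
        if head != headword:
            continue
        dice = 2 * f_xy / (head_freq[(rel, head)] + coll_freq[(rel, coll)])
        sketch[rel].append((coll, f_xy, round(14 + math.log2(dice), 2)))

    # One list per grammatical relation, most salient collocate first.
    for rel in sketch:
        sketch[rel].sort(key=lambda row: row[2], reverse=True)
    return dict(sketch)


# Invented triples; a real system would extract these from a parsed corpus.
triples = (
    [("object", "save", "money")] * 3
    + [("object", "save", "life")] * 2
    + [("object", "spend", "money")] * 4
    + [("subject", "save", "government")]
)
for relation, rows in word_sketch(triples, "save").items():
    print(relation, rows)
```

In this toy data "life" outranks "money" as an object of "save" despite being less frequent, because "money" also occurs often with "spend"; that is the kind of reordering a salience statistic is meant to produce.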

14 Macmillan English Dictionary for Advanced Learners. Ed: Rundell, 2002.

15 Developer: Pavel Rychly, Brno. Users: OUP, Chambers, CUP; universities, for teaching and research; ELT textbook authors. Demo: http://www.sketchengine.co.uk/ (self-registration for free account).

16 How to develop language technologies? Introspection; copy others; corpus.

17 Last two decades: corpora have moved centre-stage for spellcheckers, grammar checkers, automatic translation, question-answering, … Machine learning from big data.

18 The corpus. English: British National Corpus; 100M words, range of text types; led the world; late 1980s/early 1990s. Maltese: …

19 Corpus design: ideals. Very large (most words and phrases are rare). Cover all types of text. All texts labelled by text type, author. Available for anyone to use. Only real language.

20 Corpus development: pragmatics. What you can get: some sources easy, others hard. Often unlabelled (expensive to label). Copyright. Lots of junk (expensive to clean).

21 Sources: traditional (book publishers; newspapers, magazines); official; web.

22 Traditional. Write to them and ask. Wooing needed; appeal to national pride. Newspapers: large. Other sources: often small amounts.

23 Web. Lots of data (even for Maltese). Free access (but copyright). Some formal: laws, government pages. Some informal: chatroom, email.

24 Laws vs web crawl

25 Agenda. Data cleaning. More kinds of text. Spelling: standardisation. Morphology: inventing → invent (v, -ing). Grammar.
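The morphology item maps a word form to its base form plus a part of speech and an affix, as in inventing → invent (v, -ing). A toy version of that mapping is sketched below. The rule table `SUFFIX_RULES`, the function `analyse` and the small lexicon are hypothetical; a real analyser, for Maltese or for English, would need far richer rules than suffix stripping.

```python
# Hypothetical rules: (suffix, part of speech, affix label).
SUFFIX_RULES = [("ing", "v", "-ing"), ("ed", "v", "-ed"), ("s", "n", "-s")]


def analyse(word, lexicon):
    """Return (stem, pos, affix) analyses whose stem is a known lexicon entry."""
    analyses = []
    for suffix, pos, tag in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            # Checking the lexicon stops "thing" being analysed as "th" + -ing.
            if stem in lexicon:
                analyses.append((stem, pos, tag))
    return analyses


lexicon = {"invent", "walk", "corpus"}
print(analyse("inventing", lexicon))
print(analyse("thing", lexicon))
```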

26 In sum. Dictionaries and language technology need a corpus. Maltese: some components available; more work needed. A solid foundation for LT is within reach.

