Published by Caroline Long. Modified over 9 years ago.
Corpora, Language Technology and Maltese
Adam Kilgarriff
Lexical Computing Ltd, Lexicography MasterClass Ltd, University of Sussex
Malta, Nov 2006 Kilgarriff, Lexical Computing
Slide: 2 How do you find out about a language?
- Native speakers
- Dictionaries and grammars
- Corpus
Slide: 3 Four ages of corpus research
Slide: 4 Age 1: Pre-computer
- Oxford English Dictionary: 20 million index cards
Slide: 5 Age 2: KWIC Concordances
- From 1980
- Computerised
Slide: 6 Age 2: KWIC Concordance
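A KWIC (keyword in context) concordance lists every occurrence of a word with a fixed amount of context on either side, aligned so that the keyword forms a column. The following is a minimal sketch of the idea, not the software used in the talk; the function names `kwic` and `format_kwic`, the whitespace tokenisation, and the sample sentence are all assumptions made for illustration.

```python
def kwic(tokens, keyword, width=2):
    """Return (left context, keyword, right context) for each hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


def format_kwic(hits):
    """Right-align the left context so the keyword lines up in one column."""
    pad = max((len(left) for left, _, _ in hits), default=0)
    return [f"{left:>{pad}}  {node}  {right}" for left, node, right in hits]


# Hypothetical sample text; a real corpus would be tokenised properly.
text = "the party won the election and then held a party for the party members"
tokens = text.split()
hits = kwic(tokens, "party", width=2)
lines = format_kwic(hits)
for line in lines:
    print(line)
```

Sorting the hits by their right or left context before printing is the usual next step, since it groups similar usages together for the reader.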
Slide: 7 Age 2: KWIC Concordances
- From 1980
- Computerised
- COBUILD project was the innovator
- The coloured-pens method
Slide: 8 The coloured pens method
1 political association
2 social event
3 group of people
4 person in an agreement/dispute
5 to be party to something...
Slide: 9 Age 2: limitations
As corpora get bigger: too much data
- 50 lines for a word: read them all
- 500 lines: could read them all, but it takes a long time
- 5000 lines: no
Slide: 10 Age 3: Collocation statistics
- Problem: too much data; how to summarise?
- Solution: list of words occurring in the neighbourhood of the headword, with frequencies
- Sorted by salience
Slide: 11 Collocation listing
For right collocates of save (>5 hits)

word       freq    word        freq
forests       6    life          36
$1.2          6    dollars        8
lives        37    costs          7
enormous      6    thousands      6
annually      7    face           9
jobs         20    estimated      6
money        64    your           7
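A collocation listing of this kind can be sketched in a few lines: count the words that follow the headword within a small window, then rank them by a salience score. The slides do not say which salience statistic was used, so pointwise mutual information (PMI) stands in for it here as an assumption; the function name `right_collocates`, the window size, and the toy corpus are likewise hypothetical.

```python
import math
from collections import Counter


def right_collocates(tokens, headword, window=1, min_freq=1):
    """List (collocate, co-occurrence freq, PMI) for words right of headword."""
    word_freq = Counter(tokens)
    co_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok == headword:
            co_freq.update(tokens[i + 1:i + 1 + window])

    n = len(tokens)
    rows = []
    for word, f_xy in co_freq.items():
        if f_xy < min_freq:
            continue
        # PMI = log2( P(head, word) / (P(head) * P(word)) ),
        # estimated from raw counts (an assumed stand-in for "salience").
        pmi = math.log2(f_xy * n / (word_freq[headword] * word_freq[word]))
        rows.append((word, f_xy, pmi))

    # Most salient collocates first.
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Hypothetical toy corpus, far too small for real statistics.
tokens = "save money now save money later save the whales the cat the dog".split()
rows = right_collocates(tokens, "save")
for word, freq, score in rows:
    print(f"{word:<8}{freq:>4}{score:>8.2f}")
```

Ranking by a score rather than by raw frequency is what keeps very common words such as "the" from crowding out informative collocates such as "money".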
Slide: 12 Age 4: The word sketch
A corpus-derived one-page summary of a word’s grammatical and collocational behaviour
Slide: 13 Age 4: The word sketch
- Large well-balanced corpus
- Parse to find subjects, objects, heads, modifiers etc.
- One list for each grammatical relation
- Statistics to sort each list
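The last two steps above (one list per grammatical relation, each sorted by a statistic) can be sketched as follows. The sketch assumes the parsing step has already produced (relation, headword, collocate) triples; the triples, the function name `word_sketch`, and the choice of logDice as the sorting statistic are all assumptions, since the slides do not name the statistic.

```python
import math
from collections import Counter, defaultdict


def word_sketch(triples, headword):
    """Group a headword's collocates by relation and sort each list by logDice."""
    pair_freq = Counter(triples)
    head_freq = Counter((rel, head) for rel, head, _ in triples)
    coll_freq = Counter((rel, coll) for rel, _, coll in triples)

    sketch = defaultdict(list)
    for (rel, head, coll), f_xy in pair_freq.items():
        if head != headword:
            continue
        # logDice = 14 + log2( 2 * f_xy / (f_x + f_y) ), counted per relation.
        dice = 2 * f_xy / (head_freq[(rel, head)] + coll_freq[(rel, coll)])
        sketch[rel].append((coll, f_xy, 14 + math.log2(dice)))

    for rel in sketch:
        sketch[rel].sort(key=lambda row: row[2], reverse=True)
    return dict(sketch)


# Hypothetical parser output: (relation, headword, collocate).
triples = (
    [("object", "save", "money")] * 3
    + [("object", "save", "time")]
    + [("object", "spend", "time")] * 3
    + [("object", "spend", "money")]
    + [("subject", "save", "government")]
)
sketch = word_sketch(triples, "save")
for rel, rows in sketch.items():
    print(rel, [(coll, freq, round(score, 2)) for coll, freq, score in rows])
```

In a real system the triples would come from a parser or from pattern matching over a part-of-speech-tagged corpus, and the corpus would need to be large enough for the scores to be meaningful.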
Slide: 14 Macmillan English Dictionary for Advanced Learners
Ed: Rundell, 2002
Slide: 15
- Developer: Pavel Rychly, Brno
- Users:
  - OUP, Chambers, CUP
  - Universities, for teaching and research
  - ELT textbook authors
- Demo: http://www.sketchengine.co.uk/
- Self-registration for free account
Slide: 16 How to develop language technologies?
- Introspection
- Copy others
- Corpus
Slide: 17 Last two decades
Corpora have moved centre-stage for
- Spellcheckers
- Grammar checkers
- Automatic translation
- Question-answering
- …
Machine learning from big data
Slide: 18 The corpus
English
- British National Corpus
- 100M words, range of text types
- Led the world
- Late 1980s / early 1990s
Maltese
- …
Slide: 19 Corpus design: ideals
- Very large
  - Most words, phrases are rare
- Cover all types of text
- All texts labelled by text type, author
- Available for anyone to use
- Only real language
Slide: 20 Corpus development: pragmatics
- What you can get
  - Some sources easy, others hard
- Often unlabelled
  - Expensive to label
- Copyright
- Lots of junk
  - Expensive to clean
Slide: 21 Sources
- Traditional
  - Book publishers
  - Newspapers, magazines
  - Official
- Web
Slide: 22 Traditional
- Write to them and ask
- Wooing needed
- Appeal to national pride
- Newspapers: large
- Other sources: often small amounts
Slide: 23 Web
- Lots of data (even for Maltese)
- Free access (but copyright)
- Some formal
  - Laws, government pages
- Some informal
  - Chatroom, email
Slide: 24 Laws vs web crawl
Slide: 25 Agenda
- Data cleaning
- More kinds of text
- Spelling: standardisation
- Morphology
  - inventing → invent (v, -ing)
- Grammar
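The morphology item on the agenda maps a word form to a lemma, a word class and an affix, as in inventing → invent (v, -ing). A toy illustration of that mapping for the English example is sketched below. The lexicon, the suffix list and the function name `analyse` are hypothetical, and suffix stripping is used only because it fits the slide's English example; it is not offered as an adequate approach for Maltese.

```python
# Hypothetical mini-lexicon: lemma -> word class.
LEXICON = {"invent": "v", "save": "v"}

# Hypothetical verb suffixes to try, longest first.
SUFFIXES = ("ing", "ed", "s")


def analyse(word, lexicon=LEXICON):
    """Return possible (lemma, word class, affix) analyses of a word form."""
    analyses = []
    if word in lexicon:
        analyses.append((word, lexicon[word], ""))
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[:-len(suffix)]
            # Try the bare stem, then the stem with a restored final "e"
            # (sav + e -> save).
            for candidate in (stem, stem + "e"):
                if lexicon.get(candidate) == "v":
                    analyses.append((candidate, "v", "-" + suffix))
    return analyses


print(analyse("inventing"))
print(analyse("saving"))
```

Checking every candidate stem against a lexicon is what stops the analyser from stripping "-ing" from a word that is not a known verb form; unknown words simply return an empty list.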
Slide: 26 In sum
- Dictionaries, language technology need corpus
- Maltese
  - Some components available
  - More work needed
- Solid foundation for LT is within reach