Published by Kaley Collman. Modified over 10 years ago.
1
A cascade of corpora: The Cambridge Learner Corpus, English Profile, the Sketch Engine, HOO, DANTE and the Kelly Project
Adam Kilgarriff
Lexical Computing Ltd
http://www.sketchengine.co.uk
2
English Profile
From 2006
Cambridge Univ, Univ Press, ESOL (+ others)
Goal
– for each CEFR level, find characteristic lexis and grammar
CEFR: Common European Framework of Reference
– A1, A2: Beginner
– B1, B2: Intermediate
– C1, C2: Advanced
– Main resource: CLC
NTNU Nov 2011, Kilgarriff
3
Cambridge Learner Corpus (CLC)
Since 1993
Leading resource
CUP and Cambridge Assessment
– For better dictionaries, ELT courses, tests
– Material: all from exams (levels A1-C2)
45m words; 22m error-tagged
200,000 scripts, 138 L1s, 203 nationalities
4
Sketch Engine
Leading corpus tool
Word sketches
– One-page summaries of a word’s grammatical and collocational behaviour
In use at OUP, CUP, Collins, Macmillan, INL …
55 languages
– 175 corpora
– Since May including CHILDES: demo
– Since last year including CLC
5
Macmillan English Dictionary for Advanced Learners
Ed: Rundell, 2002
6
Error-coded corpus
Challenge
– Intuitive to search for x
  anywhere
  only where it is part of an error
  only where it is part of a correction
  where x can be a word, phrase, grammar pattern …
Requirement for CLC in Sketch Engine
7
Error-coded corpora in SkE
demo
8
HOO / HOO+
Helping Our Own
HOO: English-NNS NLP researchers
– Developer = user: motivation
– Shared task/competitive evaluation
  Organisers define task and prepare ‘gold standard’
  Teams participate by running their software over test data
Six teams (incl Tübingen), workshop end Sept
9
HOO+ (2012)
Probably
– English: learner data from CLC
– Other languages?
– Tasks
  Essay scoring
  Determiner, preposition errors
  ?
http://www.clt.mq.edu.au/research/projects/hoo/
10
DANTE
Highlights of English lexicography
11
DANTE
12
DANTE
13
DANTE
14
DANTE
http://webdante.com
15
The KELLY Project
EU Lifelong Learning Project
Word cards
– 9 languages: Arabic, Chinese, English, Greek, Italian, Norwegian, Polish, Russian, Swedish
– All 36 pairs
– Words the learner should know (at A1 … C2)
Partners
Stockholm Univ, Gothenburg Univ, Adam Mickiewicz Univ, ILSP Athens, CNR Pisa, Oslo Univ, Leeds Univ, Keewords A/S, Lexical Computing Ltd
16
Interesting question
How close to purely corpus-based can a pedagogic list be?
17
Method
Take a general corpus
Count
Review, add, delete using other lists and corpora
Translate (72 directed language pairs)
Words not in source list which occur in translations:
– Review source list
http://kelly.sketchengine.co.uk
18
Symmetrical pairs and cliques
Cliques:
– For x, y, z, … all pairs are symmetrical
– 9-language cliques (English members): hospital, library, music, sun, theory
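The pair and clique test can be sketched in a few lines. This is a minimal illustration, not the Kelly project's own code: the table layout, the function names and the three-language toy data are all assumptions made for the example, and a pair is taken to be symmetrical when each word appears among the other's translations.

```python
from itertools import combinations

# Hypothetical toy data, not taken from the Kelly lists:
# translations[(source_lang, target_lang)][source_word] -> set of target words
translations = {
    ("en", "sv"): {"hospital": {"sjukhus"}, "sun": {"sol"}},
    ("sv", "en"): {"sjukhus": {"hospital"}, "sol": {"sun"}},
    ("en", "it"): {"hospital": {"ospedale"}, "sun": {"sole"}},
    ("it", "en"): {"ospedale": {"hospital"}, "sole": {"sun"}},
    ("sv", "it"): {"sjukhus": {"ospedale"}, "sol": {"sole"}},
    ("it", "sv"): {"ospedale": {"sjukhus"}},  # "sole" -> "sol" is missing
}


def is_symmetrical(lang_a, word_a, lang_b, word_b, table=translations):
    """True if word_a translates to word_b and word_b translates back to word_a."""
    forward = table.get((lang_a, lang_b), {}).get(word_a, set())
    backward = table.get((lang_b, lang_a), {}).get(word_b, set())
    return word_b in forward and word_a in backward


def is_clique(members, table=translations):
    """members maps language -> word; a clique needs every pair to be symmetrical."""
    return all(
        is_symmetrical(la, members[la], lb, members[lb], table)
        for la, lb in combinations(sorted(members), 2)
    )


hospital = {"en": "hospital", "sv": "sjukhus", "it": "ospedale"}
sun = {"en": "sun", "sv": "sol", "it": "sole"}
print(is_clique(hospital), is_clique(sun))
```

With nine languages the same check runs over all 36 unordered pairs, that is, 72 directed lookups per candidate clique.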
19
Web corpora
Replaceable or replacable?
– http://googlefight.com
– http://looglefight.com
20
The web is
– Very very large
– Most languages
– Most language types
– Up-to-date
– Free
– Instant access
21
Web corpus types
Large, general corpora
Small, specialised corpora
– Specially for translators
22
Basic steps
Gather pages
– CSE hits
– Select and gather whole sites
– General crawl
Filter
De-duplicate
Linguistic processing
Load into corpus tool
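The filter and de-duplicate steps can be sketched as follows. This is only an illustration under stated assumptions: the function names and the `min_words` threshold are invented for the example, and only exact duplicates (after whitespace and case normalisation) are removed, whereas production web-corpus pipelines typically also detect near-duplicates.

```python
import hashlib
import re


def normalise(text):
    """Collapse whitespace and lowercase, so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def filter_and_deduplicate(pages, min_words=5):
    """Drop very short pages, then drop pages whose normalised text was already seen."""
    seen = set()
    kept = []
    for text in pages:
        norm = normalise(text)
        if len(norm.split()) < min_words:
            continue  # filter: too little running text (threshold is an assumption)
        digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # de-duplicate: exact copy of an earlier page
        seen.add(digest)
        kept.append(text)
    return kept


pages = [
    "The cat sat on the mat today.",
    "the  cat sat on the mat today.",   # duplicate after normalisation
    "Click here",                       # too short
    "A different page with enough words here.",
]
print(filter_and_deduplicate(pages))
```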
23
WaC family corpora
100m – 2b word corpora
2-month project each
All major world languages available in Sketch Engine
– Currently 42 languages
– Growing monthly
Pioneers: Marco Baroni, Serge Sharoff
Corpus Factory
Seeds:
– mid-frequency words from ‘core vocab’ lists and corpora
Google on seed words, then crawl
24
How good are they?
How to assess?
– Hard question, open research topic
Good coverage
– Newspapers: news, politics bias
– Web corpora: also cover personal, kitchen vocab
Web corpus / BNC / journalism corpus
– First two are close
25
Evaluating word sketches
11 years
– 1999-2011
Feedback
– Good but anecdotal
Formal evaluation
Method also lets us evaluate corpora
26
Goal
Collocations dictionary
– Model: Oxford Collocations Dictionary
– Publication-quality
Ask a lexicographer
– For 42 headwords
  For 20 best collocates per headword
– “should we include this collocation in a published dictionary?”
27
Sample of headwords
Nouns, verbs, adjectives; random
High (top 3000)
– N: space, solution, opinion, mass, corporation, leader
– V: serve, incorporate, mix, desire
– Adj: high, detailed, open, academic
Mid (3000-9999)
– N: cattle, repayment, fundraising, elder, biologist, sanitation
– V: grieve, classify, ascertain, implant
– Adj: adjacent, eldest, prolific, ill
Low (10,000-30,000)
– N: predicament, adulterer, bake, bombshell, candy, shellfish
– V: slap, outgrow, plow, traipse
– Adj: neoclassical, votive, adulterous, expandable
28
Precision and recall
a request for information
– Find me all the fat cats
29
High recall
Lots of responses
Maybe not all good
30
High precision
Fewer hits
Higher confidence
31
Precision and recall
We test precision
Recall is harder
– How do we find all the collocations that the system should have found?
Current work
– 200 collocates per headword
– Selected from
  All the corpora we have
  Various parameter settings
– Plus just-in-time evaluation for 'new' collocates
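The two measures can be written down directly. A minimal sketch with invented example data: precision needs only judgements on what the system returned, while recall needs the full set of good collocations, which is why it is the harder one to measure. The function names and the "cat" collocations are assumptions made for the example.

```python
def precision(returned, good):
    """Share of returned items that are good."""
    returned = list(returned)
    if not returned:
        return 0.0
    return sum(1 for item in returned if item in good) / len(returned)


def recall(returned, all_good):
    """Share of all good items that were returned; requires knowing all_good."""
    all_good = set(all_good)
    if not all_good:
        return 0.0
    return len(set(returned) & all_good) / len(all_good)


# Invented example: what a system returned for "cat", and a hypothetical gold set.
returned = ["fat cat", "black cat", "cat flap", "the cat"]
gold = {"fat cat", "black cat", "cat flap", "stray cat", "cat litter"}

print(precision(returned, gold))  # 3 of 4 returned are good
print(recall(returned, gold))     # 3 of 5 good ones were found
```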
32
Four languages, three families
Dutch
– ANW, 102m-word lexicographic corpus
English
– UKWaC, 1.5b web corpus
Japanese
– JpWaC, 400m web corpus
Slovene
– FidaPlus, 620m lexicographic corpus
33
User evaluation
Evaluate whole system
– Will it help with my task
  Eg preparing a collocations dictionary
Contrast: developer evaluation
– Can I make the system better?
  Evaluate each module separately
  Current work
34
Components
Corpus
NLP tools
– Segmenter, lemmatiser, POS-tagger
Sketch grammar
Statistics
35
Practicalities
Interface
– Good, Good-but
  Merge to good
– Maybe, Maybe-specialised, Bad
  Merge to bad
For each language
– Two/three linguists/lexicographers
– If they disagree
  Don't use for computing performance
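The scoring procedure described here can be sketched as below. The label spellings, the data layout and the function name are assumptions for illustration; the logic follows the slide: merge the five labels into good/bad, drop any collocation on which the judges disagree after merging, and report the share of good among the rest.

```python
# Five interface labels merged to two, as on the slide.
MERGE = {
    "good": "good",
    "good-but": "good",
    "maybe": "bad",
    "maybe-specialised": "bad",
    "bad": "bad",
}


def agreed_precision(judgements):
    """judgements maps collocation -> list of labels, one per judge.

    Returns the share of 'good' among collocations where all judges agree
    after merging, or None if no collocation has agreement.
    """
    agreed = {}
    for collocation, labels in judgements.items():
        merged = {MERGE[label.lower()] for label in labels}
        if len(merged) == 1:          # judges agree
            agreed[collocation] = merged.pop()
        # otherwise: disagreement, not used for computing performance
    if not agreed:
        return None
    return sum(1 for v in agreed.values() if v == "good") / len(agreed)


# Invented judgements from two hypothetical judges.
sample = {
    "detailed analysis": ["Good", "Good-but"],   # agree: good
    "detailed map": ["Good", "Good"],            # agree: good
    "detailed thing": ["Bad", "Maybe"],          # agree: bad
    "detailed brief": ["Good", "Maybe"],         # disagree: dropped
}
print(agreed_precision(sample))
```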
36
Results
Dutch     66%
English   71%
Japanese  87%
Slovene   71%
37
Two thirds of a collocations dictionary can be gathered automatically
38
Thank you
http://www.sketchengine.co.uk
40
Lexicography: finding facts about words
– collocations
– grammatical patterns
– idioms
– synonyms
– meanings
– translations
41
Four ages of corpus lexicography
42
Age 1: Pre computer
Oxford English Dictionary: 5 million index cards
43
Age 2: KWIC Concordances
From 1980
Computerised
Overhauled lexicography
44
Age 2: limitations
as corpora get bigger: too much data
– 50 lines for a word: read all
– 500 lines: could read all, but slow
– 5000 lines: no
45
Age 3: Collocation statistics
Problem: too much data - how to summarise?
Solution: list of words occurring in neighbourhood of headword, with frequencies
Sorted by salience
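A window-based collocate list can be sketched as follows. The slide does not name the salience statistic; logDice, defined as 14 + log2(2·f(x,y) / (f(x) + f(y))), is used here as one common choice and is an assumption, as are the window size, the function name and the toy sentence.

```python
import math
from collections import Counter


def collocates(tokens, headword, window=3):
    """Return (word, co-occurrence frequency, logDice) for words near headword,
    sorted by logDice, highest first."""
    freq = Counter(tokens)
    cooc = Counter()
    for i, token in enumerate(tokens):
        if token != headword:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[tokens[j]] += 1
    scored = []
    for word, f_xy in cooc.items():
        # logDice (assumed salience measure): 14 + log2(2*f_xy / (f_x + f_y))
        score = 14 + math.log2(2 * f_xy / (freq[headword] + freq[word]))
        scored.append((word, f_xy, score))
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored


tokens = "strong tea and strong coffee and weak tea".split()
for word, f_xy, score in collocates(tokens, "tea", window=1):
    print(word, f_xy, round(score, 2))
```

Frequent words such as "and" co-occur with everything, so a salience score that penalises high overall frequency pushes them down the list; raw co-occurrence counts alone would not.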
46
Age-3 collocation statistics: limitations
Lists contain junk
unsorted for type
– mixes together adverbs, subjects, objects, prepositions
What we really want:
– noise-free lists
– one list for each grammatical relation
47
Age 4: The word sketch
Large well-balanced corpus
Parse to find
– subjects, objects, heads, modifiers etc
One list for each grammatical relation
Statistics to sort each list, as before
48
Working practice
Lexicographers mainly used sketches not concordances
– missed less, more consistent
– Faster
49
Euralex 2002
50
Euralex 2002
Can I have them for my language please
51
The Sketch Engine
Input:
– any corpus, any language
  Lemmatised, part-of-speech tagged
– specification of grammatical relations
Word sketches integrated with Corpus query system
– Supports complex searching, sorting etc
Credit: Pavel Rychly, Masaryk Univ
52
Customers
Dictionary publishers
– Oxford University Press
– Cambridge University Press
– Collins
– National dictionary projects in Czech Republic, Estonia, Ireland, Netherlands, Slovakia, Slovenia
Universities
– Teaching and research
– Languages, linguistics, language technology
– UK, Germany, US, Greece, Taiwan, Japan, China, …
Other
– Language teaching, textbook writing
– Information management, web search
53
Demo
– http://sketchengine.co.uk
– Free trial
54
What is there on the web?
Web1T
– Present from Google
– All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
  1,000,000,000,000
Compare with BNC
– Take top 50,000 items of each
– 105 Web1T words not in BNC top 50k
– 50 words with highest Web1T:BNC ratio
– 50 words with lowest ratio
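The comparison can be sketched on toy frequency lists. The slide does not say how the ratio was computed; this sketch assumes relative frequencies (count divided by the list total) for words present in both top-n lists, and the function names and counts are invented for the example.

```python
def top_n(freqs, n):
    """The n most frequent words in a {word: count} mapping."""
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]


def compare(web, bnc, n, k):
    """Return (words only in web top-n, k highest-ratio words, k lowest-ratio words).

    Ratios use relative frequencies, an assumption; k should be at least 1.
    """
    web_top, bnc_top = set(top_n(web, n)), set(top_n(bnc, n))
    only_web = sorted(web_top - bnc_top)
    shared = web_top & bnc_top
    web_total, bnc_total = sum(web.values()), sum(bnc.values())
    ratio = {w: (web[w] / web_total) / (bnc[w] / bnc_total) for w in shared}
    ranked = sorted(shared, key=ratio.get, reverse=True)
    return only_web, ranked[:k], ranked[-k:]


# Invented counts, for illustration only.
web = {"the": 100, "url": 50, "browser": 30, "poker": 15, "she": 5}
bnc = {"the": 100, "she": 40, "walked": 20, "browser": 2, "url": 1}

only_web, web_high, web_low = compare(web, bnc, n=4, k=1)
print(only_web, web_high, web_low)
```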
55
Web-high (155 terms)
61 web and computing
– config browser spyware url www forum
38 porn
22 US English
18 business/products common on web
– poker viagra lingerie ringtone dvd casino rental collectible tiffany
– NB: BNC is old
4 legal
– trademarks pursuant accordance herein
56
Web-low
Exclude British English, transcription/tokenisation anomalies
– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him
57
Observations
Pronouns and past tense verbs
– Fiction
Masc vs fem
Yesterday
– Probably daily newspapers
Constancy of ratios:
– He/him/himself
– She/her/herself