Published by Noreen Marsh. Modified over 9 years ago.
1
Using Corpora in Linguistics and Lexicography
Adam Kilgarriff
Lexical Computing Ltd; Universities of Leeds and Sussex, UK
2
IDS Mannheim 2010, Kilgarriff
Outline
Precision and recall
History of corpus lexicography
The Sketch Engine
–Demo
Web corpora
Corpus and dictionary
3
Find me all the fat cats
a request for information
4
High recall
Lots of responses
Maybe not all good
5
High precision
Fewer hits
Higher confidence
6
Information-seeking
            Recall   Precision
Computers   good     bad
People      bad      good
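The two measures can be made concrete with a small sketch. The function name and the toy "fat cats" data are illustrative assumptions, not part of the talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved items judged against the relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: what share of the responses is good?
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what share of the good items was found?
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: four fat cats exist in the collection.
fat_cats = {"cat1", "cat2", "cat3", "cat4"}
broad_search = ["cat1", "cat2", "cat3", "cat4", "dog1", "dog2", "dog3", "dog4"]
narrow_search = ["cat1", "cat2"]

print(precision_recall(broad_search, fat_cats))   # high recall, lower precision
print(precision_recall(narrow_search, fat_cats))  # high precision, lower recall
```

The broad search behaves like the "computer" row of the table (finds everything, includes junk); the narrow search behaves like the "people" row.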
7
Lexicography: finding facts about words
collocations
grammatical patterns
idioms
synonyms
meanings
translations
8
Four ages of corpus lexicography
9
Age 1: Pre-computer
Oxford English Dictionary: 5 million index cards
10
Age 2: KWIC concordances
From 1980
Computerised
Overhauled lexicography
11
Age 2: limitations
As corpora get bigger: too much data
50 lines for a word: read all
500 lines: could read all, but it takes a long time
5000 lines: no
12
Age 3: Collocation statistics
Problem: too much data - how to summarise?
Solution: list of words occurring in neighbourhood of headword, with frequencies
Sorted by salience
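The slide does not name the salience statistic. As a sketch only, the example below assumes pointwise mutual information (one common association measure) and uses invented counts; the function names are hypothetical.

```python
import math


def pmi(pair_freq, node_freq, collocate_freq, corpus_size):
    """Pointwise mutual information (log base 2) of a node word and a collocate."""
    return math.log2(pair_freq * corpus_size / (node_freq * collocate_freq))


def rank_by_salience(node_freq, corpus_size, candidates):
    """candidates maps collocate -> (pair_freq, collocate_freq); returns highest score first."""
    scored = [
        (word, pmi(pair_freq, node_freq, word_freq, corpus_size))
        for word, (pair_freq, word_freq) in candidates.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Invented counts: "the" co-occurs often but is frequent everywhere,
# "forests" co-occurs less often but is far more strongly associated.
ranking = rank_by_salience(
    node_freq=1000,
    corpus_size=1_000_000,
    candidates={"the": (500, 60_000), "forests": (20, 200)},
)
print(ranking)
```

Sorting by such a score rather than by raw frequency is what pushes function words down the list.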
13
Age 3 collocation statistics: limitations
Lists contain junk
Unsorted for type: mixes together adverbs, subjects, objects, prepositions
What we really want:
–noise-free lists
–one list for each grammatical relation
14
Collocation listing
For collocates of save (>5 hits), window 1-5 words to right of nodeword
word
forests     life
$1.2        dollars
lives       costs
enormous    thousands
annually    face
jobs        estimated
money       your
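A listing of this kind can be produced by counting the tokens that follow each occurrence of the node word. This is a minimal sketch assuming a pre-tokenised corpus; the function name and parameters are illustrative.

```python
from collections import Counter


def right_window_collocates(tokens, node, window=5, threshold=5):
    """Count words in the `window` tokens to the right of each `node` occurrence.

    Only collocates with more than `threshold` hits are kept,
    mirroring the ">5 hits" cut-off in the listing.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            counts.update(tokens[i + 1 : i + 1 + window])
    return {word: n for word, n in counts.items() if n > threshold}


# Toy corpus: a real run would use millions of tokens.
tokens = "we must save money and save lives and save money".split()
print(right_window_collocates(tokens, "save", window=5, threshold=1))
```

Because the window ignores grammar, the output mixes objects ("money", "lives") with whatever else happens to fall nearby ("and"), which is the limitation the preceding slide points out.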
15
Age 4: The word sketch
Large well-balanced corpus
Parse to find
–subjects, objects, heads, modifiers etc
One list for each grammatical relation
Statistics to sort each list, as before
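The core idea, one collocate list per grammatical relation, can be sketched as follows. The (relation, head, collocate) triples are assumed to come from a parser, and raw frequency stands in for the sorting statistic; neither detail is taken from the Sketch Engine itself.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword, top_n=3):
    """Group collocates of `headword` by grammatical relation.

    `triples` is an iterable of (relation, head, collocate) tuples.
    Each list is sorted by raw frequency here; a real system would
    sort by a salience statistic instead.
    """
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    return {rel: counts.most_common(top_n) for rel, counts in by_relation.items()}


# Hypothetical parser output.
triples = [
    ("object", "save", "money"),
    ("object", "save", "money"),
    ("object", "save", "life"),
    ("subject", "save", "government"),
    ("object", "spend", "money"),
]
print(word_sketch(triples, "save"))
```

Unlike a window-based listing, objects and subjects never end up in the same list.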
16
Macmillan English Dictionary for Advanced Learners
Ed: Rundell, 2002
17
Working practice
Lexicographers mainly used sketches, not concordances
–Missed less, more consistent
–Faster
18
Euralex 2002
19
Euralex 2002
“Can I have them for my language, please?”
20
The Sketch Engine
Input:
–any corpus, any language (lemmatised, part-of-speech tagged)
–specification of grammatical relations
Word sketches integrated with corpus query system
–Supports complex searching, sorting etc
Credit: Pavel Rychly, Masaryk Univ
21
Customers
Dictionary publishers
–Oxford University Press
–Cambridge University Press
–Collins
–Macmillan
–FrameNet Project (Berkeley, US)
–National dictionary projects in Czech Republic, Estonia, Ireland, Netherlands, Slovakia, Slovenia
Universities
–Teaching and research
–Languages, linguistics, language technology
–UK, Germany, US, Greece, Taiwan, Japan, China, Slovenia,…
Other
–Language teaching, textbook writing
–Information management, web search companies
–Automatic translation
22
Web corpora
Replaceable or replacable?
–http://googlefight.com
–http://looglefight.com
23
The web is
–Very very large
–Most languages
–Most language types
–Up-to-date
–Free
–Instant access
24
Web corpus types
Large, general corpora
Small, specialised corpora
–Specially for translators
25
Basic steps
Gather pages
–CSE hits
–Select and gather whole sites
–General crawl
Filter
De-duplicate
Linguistic processing
Load into corpus tool
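The filter and de-duplicate steps can be sketched in a few lines. This is an illustration under simple assumptions (a minimum word count as the filter, exact matching after whitespace and case normalisation as the duplicate test); the talk does not specify the actual criteria, and production pipelines typically also detect near-duplicates.

```python
import hashlib


def normalise(text):
    """Lower-case and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def filter_and_deduplicate(pages, min_words=5):
    """Drop very short pages, then drop pages already seen (exact match after normalising)."""
    seen = set()
    kept = []
    for page in pages:
        norm = normalise(page)
        if len(norm.split()) < min_words:
            continue  # filter: too little running text
        digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # de-duplicate: same content gathered twice
        seen.add(digest)
        kept.append(page)
    return kept


pages = [
    "Short page",
    "The cat sat on the mat today",
    "the cat  sat on the MAT today",
    "A different page with enough words here",
]
print(filter_and_deduplicate(pages))
```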
26
WaC family corpora
100m – 2b word corpora
2-month project each
All major world languages available in Sketch Engine
–Currently 30 languages
–Growing monthly
Pioneers: Marco Baroni, Serge Sharoff
Corpus Factory
Seeds:
–mid-frequency words from ‘core vocab’ lists and corpora
Google on seed words, then crawl
27
Corpora
Arabic     174     Hindi       31      Russian      188
Chinese    456     Indonesian  102     Slovak       536
Czech      800     Irish       34      Slovene      738
Dutch      128     Italian     1910    Spanish      117
English    5508    Japanese    409     Swedish      114
French     126     Norwegian   95      Telugu       5
German     1627    Persian     6       Thai         108
Greek      149     Portuguese  66      Vietnamese   174
Estonian   11      Romanian    53      Welsh        63
Korean     77      Polish      156     Malay        230
28
How good are they?
How to assess?
–Hard question, open research topic
Good coverage
–Newspapers: news, politics bias
–Web corpora: also cover personal, kitchen vocab
Web corpus / BNC / journalism corpus
–First two are close
29
Evaluating word sketches
11 years
–1999-2010
Feedback
–Good but anecdotal
Formal evaluation
Method also lets us evaluate corpora
30
Goal
Collocations dictionary
–Model: Oxford Collocations Dictionary
–Publication-quality
Ask a lexicographer
–For 42 headwords
–For 20 best collocates per headword
–“Should we include this collocation in a published dictionary?”
31
Sample of headwords
Nouns, verbs, adjectives; random
High (top 3000)
–N: space, solution, opinion, mass, corporation, leader
–V: serve, incorporate, mix, desire
–Adj: high, detailed, open, academic
Mid (3000-9999)
–N: cattle, repayment, fundraising, elder, biologist, sanitation
–V: grieve, classify, ascertain, implant
–Adj: adjacent, eldest, prolific, ill
Low (10,000-30,000)
–N: predicament, adulterer, bake, bombshell, candy, shellfish
–V: slap, outgrow, plow, traipse
–Adj: neoclassical, votive, adulterous, expandable
32
Precision and recall
We test precision
Recall is harder
–How do we find all the collocations that the system should have found?
Current work
–200 collocates per headword
–Selected from: all the corpora we have; various parameter settings
–Plus just-in-time evaluation for 'new' collocates
33
Four languages, three families
Dutch
–ANW, 102m-word lexicographic corpus
English
–UKWaC, 1.5b web corpus
Japanese
–JpWaC, 400m web corpus
Slovene
–FidaPlus, 620m lexicographic corpus
34
User evaluation
Evaluate whole system
–Will it help with my task?
–E.g. preparing a collocations dictionary
Contrast: developer evaluation
–Can I make the system better?
–Evaluate each module separately
–Current work
35
Components
Corpus
NLP tools
–Segmenter, lemmatiser, POS-tagger
Sketch grammar
Statistics
36
Practicalities
Interface
–Good, Good-but: merge to good
–Maybe, Maybe-specialised, Bad: merge to bad
For each language
–Two/three linguists/lexicographers
–If they disagree: don't use for computing performance
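The scoring procedure described here (merge five labels into two, discard items the judges disagree on, compute precision over the rest) can be sketched as below. The data structure and function names are assumptions for illustration, not the evaluation code actually used.

```python
GOOD_LABELS = {"good", "good-but"}
BAD_LABELS = {"maybe", "maybe-specialised", "bad"}


def merge_label(label):
    """Collapse the five interface labels into 'good' or 'bad'."""
    key = label.strip().lower()
    if key in GOOD_LABELS:
        return "good"
    if key in BAD_LABELS:
        return "bad"
    raise ValueError(f"unknown label: {label!r}")


def precision_from_judgements(judgements):
    """judgements maps each collocation to the list of labels given by the judges.

    Items on which the judges disagree (after merging) are left out of the score.
    """
    agreed = 0
    good = 0
    for labels in judgements.values():
        merged = {merge_label(label) for label in labels}
        if len(merged) != 1:
            continue  # disagreement: not used for computing performance
        agreed += 1
        if merged == {"good"}:
            good += 1
    return good / agreed if agreed else None


# Hypothetical judgements from two judges.
judgements = {
    "save money": ["Good", "Good-but"],
    "save the": ["Bad", "Maybe"],
    "save face": ["Good", "Bad"],  # disagreement, ignored
    "save lives": ["Good", "Good"],
}
print(precision_from_judgements(judgements))
```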
37
Results
Dutch      66%
English    71%
Japanese   87%
Slovene    71%
38
Two thirds of a collocations dictionary can be gathered automatically
39
Small specialised corpora
Terminologists
Translators needing target-language domain-specific vocab
Specialist dictionaries
–Don’t exist
–Expensive/inaccessible
–Out of date
Instant small web corpora
–BootCaT: Baroni and Bernardini 2004
–WebBootCaT demo
40
Cyborgs
A creature that is partly human and partly machine
–Macmillan English Dictionary
45
Cyborgs and the Information Society
The dictionary-making agent is part human (for precision), part computer (for recall).
46
Treat your computer with respect.
You and it can do great things together.
47
Thank you
http://www.sketchengine.co.uk
48
Corpus and dictionary
Established model:
–Lexicographers use corpora, users use dictionaries
But
–Users like collocations, examples
–But are not corpus linguists
Explore the space between corpus and dictionary
49
Collocationality
Which words are most ‘collocational’?
Dictionary publishers
–Where to put ‘collocation boxes’
Language learners
50
Calculation of entropy for advantage (object relation)
Verb       Freq    MLE Prob x log = entropy
Take       2084    -.469
Gain       131     -.169
Offer      117     -.157
See        110     -.150
Enjoy      67      -.104
…          …       …
Clarify    1       -0.0031
…          …       …
Total      3730    -3.909
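Each row of the table appears to be a maximum-likelihood probability (frequency divided by the total of 3730) multiplied by its logarithm. The sketch below assumes log base 2, which reproduces the figure shown for take; the function names are illustrative.

```python
import math


def entropy_term(freq, total):
    """p * log2(p), where p = freq / total is the maximum-likelihood estimate."""
    p = freq / total
    return p * math.log2(p)


def sum_of_terms(freqs):
    """Sum of p * log2(p) over a full frequency list.

    The table reports this (negative) sum directly; entropy in the
    usual sign convention is its negation.
    """
    total = sum(freqs)
    return sum(entropy_term(f, total) for f in freqs)


# Rows from the table: verb with advantage as object, total 3730.
print(round(entropy_term(2084, 3730), 3))  # take
print(round(entropy_term(117, 3730), 3))   # offer
```

Reproducing the table's total would need the full list of verbs, which the slide elides.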
52
place (17881), attention (8476), door (8426), care (4884), step (4277), advantage (3730), rise (3334), attempt (2825), impression (2596), notice (2462), chapter (2318), mistake (2205), breath (2140), hold (1949), birth (1016), living (953), indication (812), tribute (720), debut (714), button (661), eyebrow (649), anniversary (637), mention (615), glimpse (531), suicide (486), toll (472), refuge (470), spokesman (453), sigh (436), birthday (429), wicket (412), appendix (410), pardon (399), precaution (396), temptation (374), goodbye (372), fuss (366), resemblance (350), goodness (288), precedence (285), havoc (270), tennis (266), comeback (260), farewell (228), prominence (228), go-ahead (202), sip (198),
53
What is there on the web?
Web1T
–Present from Google
–All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12 = 1,000,000,000,000) words of English
Compare with BNC
–Take top 50,000 items of each
–105 Web1T words not in BNC top 50k
–50 words with highest Web1T:BNC ratio
–50 words with lowest ratio
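The ratio comparison can be sketched as below. The slide does not say how the ratio was normalised; this example assumes relative frequencies (count divided by corpus size) and uses invented counts, so the function and its parameters are hypothetical.

```python
def rank_by_ratio(web_freq, bnc_freq, web_size, bnc_size):
    """Rank words found in both lists by the ratio of their relative frequencies.

    Returns (word, ratio) pairs, highest web:BNC ratio first.
    """
    shared = web_freq.keys() & bnc_freq.keys()
    ratios = {
        word: (web_freq[word] / web_size) / (bnc_freq[word] / bnc_size)
        for word in shared
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Invented counts for illustration only.
web = {"browser": 400, "she": 100, "the": 1000}
bnc = {"browser": 1, "she": 50, "the": 100, "council": 5}
ranked = rank_by_ratio(web, bnc, web_size=10_000, bnc_size=1_000)

print(ranked[:1])   # most web-typical word
print(ranked[-1:])  # least web-typical word
```

Taking the first and last 50 entries of such a ranking would give the two word sets described above.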
54
Web-high (155 terms)
61 web and computing
–config browser spyware url www forum
38 porn
22 US English
18 business/products common on web
–poker viagra lingerie ringtone dvd casino rental collectible tiffany
–NB: BNC is old
4 legal
–trademarks pursuant accordance herein
55
Web-low
Exclude British English, transcription/tokenisation anomalies
–herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him
56
Observations
Pronouns and past tense verbs
–Fiction
Masc vs fem
Yesterday
–Probably daily newspapers
Constancy of ratios:
–He/him/himself
–She/her/herself