Published by George Ezra Young. Modified over 9 years ago.
1
Corpora by Web Services
Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass Ltd
Universities of Leeds and Sussex
2
Leeds, April 2010. Kilgarriff: Corpora by Web Services
Starting a PhD in NLP
Then
  Prolog
  Type in a few grammar rules
  Lexical entries
  Example sentences
  We’re off!
3
Now
  Corpus
    Which?
    Budget/schedule: how much can we afford?
    Hard disk space
  Access software
    Build: big job, making it fast is hard
    or
    Research, acquire, install, maintain …
4
Research question
  Morphology, syntax, discourse structure, semantics, anaphor
First six months at least
  Acquiring data, software
  Complications
6
If you’re not super-geeky
  Did I do it properly?
  Dumbing down
    Let’s choose an easier question
  Looking over shoulder
7
Disappointment
8
Making it easy
  Like picking up a hire car
9
Corpora by web services
  Possible?
  Already available
10
Sketch Engine
  Corpus querying
    Fast
    Handles large corpora
    In use for lexicography at OUP, CUP, Macmillan, Collins, Le Robert
  Word sketches
    Data-driven summary of a word’s grammatical and collocational behaviour
12
Corpora
  Arabic     174    Hindi       31    Russian     188
  Chinese    456    Indonesian 102    Slovak      536
  Czech      800    Irish       34    Slovene     738
  Dutch      128    Italian   1910    Spanish     117
  English   5508    Japanese   409    Swedish     114
  French     126    Norwegian   95    Telugu        5
  German    1627    Persian      6    Thai        108
  Greek      149    Portuguese  66    Vietnamese  174
                    Romanian    53    Welsh        63
13
Big, high-quality corpora
Big
  Performance
    Banko and Brill 2004
    There’s no data like more data
  Ample data for rare phenomena
  Big subcorpora
    5b Medical: 30m
14
Quality
  Bad data
    Spam
    Navigation bars
    Duplicates
    Lists
    Bungled formatting
    Wrong language
    …
  Less discussed
    Maybe a footnote
    I wonder why
    Quick fixes and run
15
The Google/Yahoo/Bing option
  Appeal
    No setup costs
    Start googling today
16
Very interesting work
  Keller and Lapata
    Validity of search engine (SE) counts vs BNC counts vs psycholinguistic validity of collocations
    36 queries per collocation
      “fulfil obligation” “fulfil ? obligation” “fulfilling obligations” ...
  Nakov, Nakov and Hearst
  Great interest in query syntax
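A family of search-engine queries for one collocation, like the "fulfil obligation" examples on this slide, can be enumerated mechanically. The sketch below is only an illustration: the function name `query_variants`, the inflection lists and the single `?` wildcard slot are assumptions, and it does not reproduce the scheme that yields 36 queries per collocation, which the slide does not spell out.

```python
from itertools import product


def query_variants(verb_forms, noun_forms, gaps=("", "?")):
    """Build quoted phrase queries for every verb form, gap option and noun form.

    An empty gap means the two words are adjacent; "?" stands for a
    one-word wildcard slot (an assumed query-syntax feature).
    """
    queries = []
    for verb, gap, noun in product(verb_forms, gaps, noun_forms):
        words = [verb, gap, noun] if gap else [verb, noun]
        queries.append('"' + " ".join(words) + '"')
    return queries


# Hypothetical inflection lists for the slide's example collocation.
variants = query_variants(
    ["fulfil", "fulfils", "fulfilled", "fulfilling"],
    ["obligation", "obligations"],
)
for query in variants:
    print(query)
```

Each query would then be sent to the search engine and the hit counts summed or compared, which is where the limits on the following slides start to matter.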
17
but
  Limited hits-per-query
  Limited hits-per-day
  Sort order
    Not documented
    'unsorted' not possible
  Snippets too short for research
  No (documented) morphology
  Limited query syntax
18
and
  At mercy of commercial company
  Might change at any time
  Not replicable
19
So
  Appeal
    No setup costs
  Serious research
    Many difficult practical issues
    Not a tool designed for linguists
  Conclusion
    If only SE indexes are big enough: yes
    Else: no
20
Strategy
  More languages
    Corpus Factory, as Sharoff
  Bigger
    Big Web Corpus (BiWeC)
    Currently 5.5b fully processed
    Target 20b
  Better
21
New Model Corpus
  BNC is past its sell-by
    Early 1990s
    Pre web
    Still dominant model
  New model needed
22
Model
  Small: model train
  Design: software model
NMC
  1:100 for BiWeC-scale
    100m
  Update of BNC as design model
    Data from web but
    Text type available
23
Open-source/collaboration
  We distribute
  You annotate
    POS-tags, parses, anaphor, discourse moves, semantics, multiwords, entity-types...
    Domain, register, region...
  Send us annotations
  We integrate
    And give access in SkE
24
Divide and rule
  Bigger (BiWeC)
  Better (NMC)
    Take best annotations
      Accuracy
      Speed
      Usefulness
  Good collaboration from NMC, apply to BiWeC
25
TEDDCLOG
  Taiwan English Data-Driven CLOze Generation
  with Simon Smith and colleagues, Taipei
  API case study
26
Cloze
  'fill-the-gap'
    Several metals _____ violently with cold water
    A: behave B: react C: realise D: respond
  Popular with students, teachers, testers
  Unpopular with theorists :-(
27
One objection
  Test item writers make them up
    Not naturally-occurring language
    The Sinclair-Johns critique
  Also: expensive
  TEDDCLOG
    Uses corpus sentences and distractors
28
Key: react
  Thesaurus module: behave, interact, respond
  Diffs module:
    metals behave x
    metals respond x
    metals realise x
    metals react √
  Distractors: behave, realise, respond
  Concordance module: Several metals react violently with cold water.
  Text processing module:
    Several metals ___ violently with cold water.
    (a) behave (b) react (c) realise (d) respond
29
API calls
  Find distractors
    Thesaurus
  Find key-only collocate
    Sketch diffs
    Needs optimising
  Find carrier sentence
    Concordance with GDEX module
      Good Dictionary Example Finder
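The three API calls can be read as a small pipeline: thesaurus for candidate distractors, sketch diffs for a collocate that goes with the key but not with the distractors, and a concordance lookup for the carrier sentence. The sketch below is a minimal illustration under loud assumptions: the three service functions are local stand-ins backed by invented toy data, not the Sketch Engine API, and every name in it is hypothetical.

```python
# Toy data standing in for the corpus services (all invented for illustration).
THESAURUS = {"react": ["behave", "interact", "realise", "respond"]}
# (collocate, verb) pairs treated as attested in the toy "corpus".
ATTESTED = {("metals", "react"), ("metals", "interact")}
CONCORDANCE = {
    ("metals", "react"): ["Several metals react violently with cold water."],
}


def find_distractors(key):
    """Stand-in for the thesaurus call: words similar to the key."""
    return THESAURUS.get(key, [])


def find_key_only_collocate(key, candidates):
    """Stand-in for sketch diffs: pick a collocate attested with the key and
    keep only the candidates that are NOT attested with that collocate."""
    for collocate, verb in sorted(ATTESTED):
        if verb == key:
            survivors = [c for c in candidates if (collocate, c) not in ATTESTED]
            return collocate, survivors
    return None, []


def find_carrier_sentence(key, collocate):
    """Stand-in for the concordance/GDEX call: first stored example, if any."""
    sentences = CONCORDANCE.get((collocate, key), [])
    return sentences[0] if sentences else None


def make_cloze(key, n_distractors=3):
    """Chain the three steps into one cloze item, or return None on failure."""
    candidates = find_distractors(key)
    collocate, survivors = find_key_only_collocate(key, candidates)
    if collocate is None or len(survivors) < n_distractors:
        return None
    sentence = find_carrier_sentence(key, collocate)
    if sentence is None:
        return None
    # Naive gap creation: blanks the first occurrence of the key string.
    stem = sentence.replace(key, "_____", 1)
    options = sorted(survivors[:n_distractors] + [key])
    return {"stem": stem, "options": options, "answer": key}


item = make_cloze("react")
print(item["stem"])
print(item["options"])
```

In this toy run "interact" is dropped because "metals interact" counts as attested, which mirrors the reason for the diffs step: a distractor that also fits the gap would make the item ambiguous.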
30
Current status
  TEDDCLOG
    Next phase: producing decent results
  Corpora by Web Services
    Upping server capacity
    Looking for users (currently with UKWaC)
  New Model Corpus
    Nervous over copyright but
    Available in SkE, for download
31
Another announcement: DANTE
  Lexical database for English
    Detailed Accurate Extensive of English
  Highly corpus-driven
  3 yr project
  18 expert lexicographers
  Led by Sue Atkins
    BNC, FrameNet, Euralex, COBUILD...
  English side, New English-Irish dictionary
  Available for NLP research imminently