
Corpora by Web Services
Adam Kilgarriff
Lexical Computing Ltd; Lexicography MasterClass Ltd; Universities of Leeds and Sussex
Malta, May 2010


1 Malta, May 2010 Kilgarriff: Corpora by Web Services
Corpora by Web Services
Adam Kilgarriff
Lexical Computing Ltd; Lexicography MasterClass Ltd; Universities of Leeds and Sussex

2 Starting a PhD in NLP
• Then
  - Prolog
  - Type in a few
    • grammar rules
    • lexical entries
    • example sentences
  - We’re off!

3 Now
• Corpus
  - Which?
  - Budget/schedule
  - How much can we afford?
  - Hard disk space
• Access software
  - Build
    • Big job, making it fast is hard
  - or: research, acquire, install, maintain …

4
• Research question
  - Morphology, syntax, discourse structure, semantics, anaphora
• First six months at least
  - Acquiring data, software
  - Complications

5 [No text on this slide]

6 If you’re not super-geeky
• Did I do it properly?
• Dumbing down
  - Let’s choose an easier question
• Looking over shoulder

7 Disappointment

8 Making it easy
• Like picking up a hire car

9 Corpora by web services
• Possible?
• Already available

10 Sketch Engine
• Corpus querying
• Fast
• Handles large corpora
• In use for lexicography at OUP, CUP, Macmillan, Collins, Le Robert
• Word sketches
  - Data-driven summary of a word’s grammatical and collocational behaviour

11 [No text on this slide]

12 Corpora

Language    Size    Language     Size    Language     Size
Arabic       174    Hindi          31    Russian       188
Chinese      456    Indonesian    102    Slovak        536
Czech        800    Irish          34    Slovene       738
Dutch        128    Italian      1910    Spanish       117
English     5508    Japanese      409    Swedish       114
French       126    Norwegian      95    Telugu          5
German      1627    Persian         6    Thai          108
Greek        149    Portuguese     66    Vietnamese    174
                    Romanian       53    Welsh          63

13 Big, high-quality corpora
• Big
  - Performance
    • Banko and Brill 2004
    • There’s no data like more data
  - Ample data for rare phenomena
  - Big subcorpora
    • 5b
    • Medical: 30m

14 Quality
• Bad data
  - Spam
  - Navigation bars
  - Duplicates
  - Lists
  - Bungled formatting
  - Wrong language
  - …
• Less discussed
  - Maybe a footnote
• Quick fixes and run
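Two of the bad-data types listed on this slide, duplicates and navigation bars, can be illustrated with a minimal sketch. This is only a rough illustration, not the cleaning used for any Sketch Engine corpus: the function names, the whitespace/case normalisation, and the `min_words` threshold (a crude proxy for navigation bars and list fragments) are all assumptions.

```python
import hashlib
import re


def normalise(text):
    """Lower-case and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_paragraphs(paragraphs, min_words=5):
    """Drop exact duplicates and very short lines.

    `min_words` is an assumed threshold: short lines are treated as
    likely navigation bars or list fragments and are discarded.
    """
    seen = set()
    kept = []
    for para in paragraphs:
        norm = normalise(para)
        if len(norm.split()) < min_words:
            continue  # too short: probably navigation or a list item
        digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # already kept an identical paragraph
        seen.add(digest)
        kept.append(para)
    return kept


if __name__ == "__main__":
    sample = [
        "Home About Contact",
        "Several metals react violently with cold water.",
        "Several  metals react violently with cold water.",
        "Corpora can now be queried over a web service.",
    ]
    for para in clean_paragraphs(sample):
        print(para)
```

Spam and wrong-language pages need different techniques and are not covered by this sketch.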

15 The Google/Yahoo/Bing option
• Appeal
  - No setup costs
  - Start googling today

16 but
• Limited hits-per-query
• Limited hits-per-day
• Sort order 'unsorted' not possible
• Snippets too short for research
• No (documented) morphology
• Limited query syntax

17 and
• At mercy of commercial company
• Might change at any time
• Not replicable

18 So
• Appeal
  - No setup costs
• Serious research
  - Many difficult practical issues
  - Not a tool designed for linguists
• Conclusion
  - If only SE indexes are big enough
    • Yes
  - Else no

19 Strategy
• More languages
  - Corpus Factory, as Sharoff
• Bigger and better (English)
  - Big Web Corpus (BiWeC)
    • 5.5b fully processed
  - Rich markup
• New Model Corpus
• Collaboration model

20 TEDDCLOG
Taiwan English Data-Driven CLOze Generation
with Simon Smith and colleagues, Taipei
• API case study

21 Cloze
• 'fill-the-gap'
  - Several metals _____ violently with cold water
    • A: behave
    • B: react
    • C: realise
    • D: respond
• Popular with students, teachers, testers
  - Unpopular with theorists :-(
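The cloze item on this slide can be built mechanically from a sentence, a key, and a list of distractors. The sketch below is only an illustration, not TEDDCLOG's code: `make_cloze` and `format_item` are hypothetical names, and sorting the options alphabetically is an assumption that happens to reproduce the A to D order shown on the slide.

```python
import re
from string import ascii_uppercase


def make_cloze(sentence, key, distractors):
    """Blank out the first whole-word occurrence of `key`.

    Returns (stem, options, answer_letter). Options are sorted
    alphabetically (an assumed convention for this sketch).
    """
    pattern = re.compile(r"\b" + re.escape(key) + r"\b")
    if not pattern.search(sentence):
        raise ValueError(f"key {key!r} not found in sentence")
    stem = pattern.sub("_____", sentence, count=1)
    options = sorted(list(distractors) + [key])
    answer_letter = ascii_uppercase[options.index(key)]
    return stem, options, answer_letter


def format_item(stem, options):
    """Render the stem followed by lettered options, one per line."""
    lines = [stem]
    for letter, option in zip(ascii_uppercase, options):
        lines.append(f"  {letter}: {option}")
    return "\n".join(lines)


if __name__ == "__main__":
    stem, options, answer = make_cloze(
        "Several metals react violently with cold water.",
        "react",
        ["behave", "realise", "respond"],
    )
    print(format_item(stem, options))
    print("Answer:", answer)
```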

22 One objection
• Test item writers make them up
• Not naturally-occurring language
  - The Sinclair-Johns critique
  - Also: expensive
• TEDDCLOG
  - Uses corpus sentences and distractors

23 [Diagram: TEDDCLOG modules, worked through for the key "react"]
• Thesaurus module: react → behave, interact, respond
• Diffs module: metals behave x, metals respond x, metals realise x, metals react √
• Distractors: behave, realise, respond
• Concordance module: Several metals react violently with cold water.
• Text processing module: Several metals ___ violently with cold water. (a) behave (b) react (c) realise (d) respond

24 API calls
• Find distractors
  - Thesaurus
• Find key-only collocate
  - Sketch diffs
    • Needs optimising
• Find carrier sentence
  - Concordance with GDEX module
    • Good Dictionary Example Finder
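The three steps on this slide can be sketched end to end with in-memory stand-ins. Nothing below calls the Sketch Engine API: `THESAURUS`, `COLLOCATES`, and `SENTENCES` are toy data invented for illustration (seeded with the "react" example from the slides), the function names are hypothetical, and taking the first matching sentence is a placeholder for GDEX ranking.

```python
import re

# Toy stand-ins for the thesaurus, sketch-diff and concordance services.
# All entries are illustrative assumptions, not real corpus output.
THESAURUS = {"react": ["behave", "realise", "respond"]}
COLLOCATES = {
    "react": ["metals", "violently", "badly"],
    "behave": ["badly", "children"],
    "realise": ["potential"],
    "respond": ["badly", "quickly"],
}
SENTENCES = [
    "People react badly to criticism.",
    "Several metals react violently with cold water.",
]


def find_distractors(key, n=3):
    """Step 1: take up to n thesaurus neighbours of the key."""
    return THESAURUS.get(key, [])[:n]


def find_key_only_collocate(key, distractors):
    """Step 2: first collocate of the key that no distractor shares."""
    for collocate in COLLOCATES.get(key, []):
        if all(collocate not in COLLOCATES.get(d, []) for d in distractors):
            return collocate
    return None


def find_carrier_sentence(key, collocate):
    """Step 3: first sentence containing both key and collocate.

    A real system would rank candidates (the slide names GDEX);
    this sketch simply returns the first match.
    """
    for sentence in SENTENCES:
        tokens = re.findall(r"\w+", sentence.lower())
        if key in tokens and collocate in tokens:
            return sentence
    return None


if __name__ == "__main__":
    key = "react"
    distractors = find_distractors(key)
    collocate = find_key_only_collocate(key, distractors)
    print(distractors, collocate, find_carrier_sentence(key, collocate))
```

Choosing a collocate that only the key takes is what makes the distractors wrong in the carrier sentence: "metals" goes with "react" but not with "behave", "realise", or "respond".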

25 Current status
• TEDDCLOG
  - Next phase: producing decent results
• Corpora by Web Services
  - Increasing server capacity
  - Looking for users

26 Not just like picking up a hire car

27 Not just like picking up a hire car, more like picking up a Ferrari

28 Another announcement: DANTE
• Lexical database for English
  - Detailed
  - Accurate
  - Extensive
  - Highly corpus-driven
  - 3 yr project
  - 18 expert lexicographers
  - Led by Sue Atkins
    • BNC, FrameNet, Euralex, COBUILD...
• English side, New English-Irish dictionary
• Available for NLP research imminently

