Corpora by Web Services. Adam Kilgarriff, Lexical Computing Ltd, Lexicography MasterClass Ltd, Universities of Leeds and Sussex. Malta, May 2010.

Presentation transcript:

Slide 1: Corpora by Web Services
Adam Kilgarriff
Lexical Computing Ltd; Lexicography MasterClass Ltd; Universities of Leeds and Sussex
Malta, May 2010

Slide 2: Starting a PhD in NLP
- Then
  - Prolog
  - Type in a few
    - grammar rules
    - lexical entries
    - example sentences
  - We're off!

Slide 3: Now
- Corpus
  - Which?
  - Budget/schedule
  - How much can we afford?
  - Hard disk space
- Access software
  - Build
    - Big job; making it fast is hard
  - or: research, acquire, install, maintain …

Slide 4:
- Research question
  - Morphology, syntax, discourse structure, semantics, anaphora
- First six months at least
  - Acquiring data, software
  - Complications

Slide 6: If you're not super-geeky
- Did I do it properly?
- Dumbing down
  - Let's choose an easier question
- Looking over shoulder

Slide 7: Disappointment

Slide 8: Making it easy
- Like picking up a hire car

Slide 9: Corpora by web services
- Possible?
- Already available

Slide 10: Sketch Engine
- Corpus querying
- Fast
- Handles large corpora
- In use for lexicography at OUP, CUP, Macmillan, Collins, Le Robert
- Word sketches
  - Data-driven summary of a word's grammatical and collocational behaviour

Slide 12: Corpora
Language      Size
Arabic         174
Chinese        456
Czech          800
Dutch          128
English       5508
French         126
German        1627
Greek          149
Hindi           31
Indonesian     102
Irish           34
Italian       1910
Japanese       409
Norwegian       95
Persian          6
Portuguese      66
Romanian        53
Russian        188
Slovak         536
Slovene        738
Spanish        117
Swedish        114
Telugu           5
Thai           108
Vietnamese     174
Welsh           63

Slide 13: Big, high-quality corpora
- Big
  - Performance
    - Banko and Brill 2004
    - There's no data like more data
  - Ample data for rare phenomena
  - Big subcorpora
    - 5b
    - Medical: 30m

Slide 14: Quality
- Bad data
  - Spam
  - Navigation bars
  - Duplicates
  - Lists
  - Bungled formatting
  - Wrong language
  - …
- Less discussed
  - Maybe a footnote
- Quick fixes and run

Slide 15: The Google/Yahoo/Bing option
- Appeal
  - No setup costs
  - Start googling today

Slide 16: but
- Limited hits per query
- Limited hits per day
- Sort order: 'unsorted' not possible
- Snippets too short for research
- No (documented) morphology
- Limited query syntax

Slide 17: and
- At the mercy of a commercial company
- Might change at any time
- Not replicable

Slide 18: So
- Appeal
  - No setup costs
- Serious research
  - Many difficult practical issues
  - Not a tool designed for linguists
- Conclusion
  - If only SE indexes are big enough
    - Yes
  - Else no

Slide 19: Strategy
- More languages
  - Corpus Factory, as Sharoff
- Bigger and better (English)
  - Big Web Corpus (BiWeC)
  - 5.5b fully processed
  - Rich markup
- New Model Corpus
- Collaboration model

Slide 20: TEDDCLOG
Taiwan English Data-Driven CLOze Generation
with Simon Smith and colleagues, Taipei
- API case study

Slide 21: Cloze
- 'Fill-the-gap'
  Several metals _____ violently with cold water
  - A: behave
  - B: react
  - C: realise
  - D: respond
- Popular with students, teachers, testers
  - Unpopular with theorists :-(

Slide 22: One objection
- Test item writers make them up
  - Not naturally occurring language
  - The Sinclair-Johns critique
  - Also: expensive
- TEDDCLOG
  - Uses corpus sentences and distractors

Slide 23 (diagram):
- Key: react
- Thesaurus module: behave, interact, respond
- Diffs module: metals behave ✗; metals respond ✗; metals realise ✗; metals react ✓
- Distractors: behave, realise, respond
- Concordance module: Several metals react violently with cold water.
- Text processing module: Several metals ___ violently with cold water. (a) behave (b) react (c) realise (d) respond
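
The text processing step on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `make_cloze_item`, the gap string and the alphabetical ordering of the options are hypothetical choices made here, not TEDDCLOG's actual code.

```python
import re


def make_cloze_item(sentence, key, distractors, gap="_____"):
    """Blank out the key in a carrier sentence and list the answer options.

    Hypothetical helper for illustration; not part of TEDDCLOG or Sketch Engine.
    """
    pattern = re.compile(r"\b" + re.escape(key) + r"\b")
    if not pattern.search(sentence):
        raise ValueError(f"key {key!r} not found in sentence")
    # Replace only the first occurrence of the key with the gap.
    stem = pattern.sub(gap, sentence, count=1)
    # Assumption: options are shown in alphabetical order, key included.
    options = sorted(set(distractors) | {key})
    labels = "abcdefghijklmnopqrstuvwxyz"
    lines = [stem] + [f"({labels[i]}) {opt}" for i, opt in enumerate(options)]
    answer = labels[options.index(key)]
    return "\n".join(lines), answer


item, answer = make_cloze_item(
    "Several metals react violently with cold water.",
    "react",
    ["behave", "realise", "respond"],
)
print(item)
print("answer:", answer)
```

Sorting the options keeps the position of the key from depending on the order in which the distractors were found; a real test generator might shuffle instead.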

Slide 24: API calls
- Find distractors
  - Thesaurus
- Find key-only collocate
  - Sketch diffs
  - Needs optimising
- Find carrier sentence
  - Concordance with GDEX module
  - Good Dictionary Example Finder
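
The three calls can be chained roughly as follows. This is a sketch under stated assumptions, not the Sketch Engine API: the dictionaries `THESAURUS`, `COLLOCATES` and `CONCORDANCE` are invented toy data standing in for the thesaurus, sketch-diff and GDEX-ranked concordance services, and every function name is hypothetical. "Key-only collocate" is read here as a collocate attested with the key but not with the chosen distractors.

```python
# Toy stand-ins for the three services; all names and data are hypothetical.
THESAURUS = {"react": ["behave", "interact", "realise", "respond"]}

# Collocates attested with each verb (stand-in for sketch-diff output).
COLLOCATES = {
    "react": {"metals", "people"},
    "behave": {"people", "children"},
    "interact": {"metals", "people"},
    "realise": {"people"},
    "respond": {"people", "patients"},
}

# (sentence, score) pairs; the score stands in for a GDEX-style ranking.
CONCORDANCE = [
    ("Several metals react violently with cold water.", 0.9),
    ("metals react", 0.2),
    ("People react differently.", 0.7),
]


def thesaurus(key):
    """Call 1: candidate distractors, i.e. words similar to the key."""
    return THESAURUS.get(key, [])


def key_only_collocate(key, candidates, needed=3):
    """Call 2: a collocate of the key that enough candidates do not share."""
    for collocate in sorted(COLLOCATES.get(key, ())):
        safe = [c for c in candidates if collocate not in COLLOCATES.get(c, set())]
        if len(safe) >= needed:
            return collocate, safe[:needed]
    return None, []


def carrier_sentence(key, collocate):
    """Call 3: the best-scoring line containing both key and collocate."""
    hits = [
        (score, sentence)
        for sentence, score in CONCORDANCE
        if key in sentence.split() and collocate in sentence.lower().split()
    ]
    return max(hits)[1] if hits else None


def plan_cloze(key):
    """Chain the three calls; returns None if no usable collocate is found."""
    collocate, distractors = key_only_collocate(key, thesaurus(key))
    if collocate is None:
        return None
    return {
        "key": key,
        "collocate": collocate,
        "distractors": distractors,
        "sentence": carrier_sentence(key, collocate),
    }


print(plan_cloze("react"))
```

With this toy data, "interact" is rejected as a distractor because "metals interact" is attested, which is the kind of check the sketch-diff step is meant to perform.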

Slide 25: Current status
- TEDDCLOG
  - Next phase: producing decent results
- Corpora by Web Services
  - Increasing server capacity
  - Looking for users

Slide 26: Not just like picking up a hire car

Slide 27: Not just like picking up a hire car
more like picking up a Ferrari

Slide 28: Another announcement: DANTE
- Lexical database for English
  - Detailed, Accurate, Extensive (of English)
  - Highly corpus-driven
  - 3-year project
  - 18 expert lexicographers
  - Led by Sue Atkins
- BNC, FrameNet, Euralex, COBUILD...
- English side, New English-Irish dictionary
- Available for NLP research imminently