1 Corpora for the coming decade Adam Kilgarriff. Dublin June 2009 Kilgarriff: Corpora for the coming decade2 How should they be different?  Bigger 

Slides:



Advertisements
Similar presentations
The Cambridge Learner Corpus, English Profile, the Sketch Engine and the Kelly Project Adam Kilgarriff Lexical Computing Ltd
Advertisements

Feed Corpus : An Ever Growing Up to Date Corpus Akshay Minocha, Siva Reddy, Adam Kilgarriff Lexical Computing Ltd.
WebBootCaT usage Adam Kilgarriff Lexical Computing Ltd.
XML Services and Needs in NOAA’s National Weather Service Ron Jones NOAA’s National Weather Service Office of the CIO.
Jing-Shin Chang National Chi Nan University, IJCNLP-2013, Nagoya 2013/10/15 ACLCLP – Activities ( ) & Text Corpora.
CL Research ACL Pattern Dictionary of English Prepositions (PDEP) Ken Litkowski CL Research 9208 Gue Road Damascus,
1 Corpora for all Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.
Measuring Distance between Language Varieties Adam Kilgarriff, Jan Pomikalek, Pavel Rychly, Vit Suchomel Supported by EU Project PRESEMT.
Improving Machine Translation Quality via Hybrid Systems and Refined Evaluation Methods Andreas Eisele DFKI GmbH and Saarland University Helsinki, November.
Linking Dictionary and Corpus Adam Kilgarriff Lexicography MasterClass Ltd Lexical Computing Ltd University of Sussex UK.
Between Corpus and Dictionary Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds, Sussex.
Corpus Creation for Lexicography Adam Kilgarriff, Michael Rundell Lexicography MasterClass, UK Elaine Ui Dhonnchadha ITE (Linguistics Institute of Ireland)
IAEA International Atomic Energy Agency INIS Collection Search: Introduction and main features INIS Training Seminar 7-11 October 2013, Vienna Domenico.
Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.
Making useful wordlists for ELT Topical vocabulary from the WWW Simon Smith & Scott Sommers Ming Chuan University, Taipei Adam Kilgarriff, Lexical Computing.
Measuring Monolinguality Chris Biemann NLP Department, University of Leipzig LREC-06 Workshop on Quality Assurance and Quality Measurement for Language.
Talking about your homework News story? –What made you choose…? One of your words? –What made you choose…? (Give your vocabulary books to another student.
1 Corpora for the coming decade Adam Kilgarriff Lexical Computing Ltd.
Detecting Near Duplicates for Web Crawling Authors : Gurmeet Singh Mank Arvind Jain Anish Das Sarma Presented by Chintan Udeshi 6/28/ Udeshi-CS572.
Feed Corpus : An Ever Growing Up to Date Corpus Akshay Minocha, Siva Reddy, Adam Kilgarriff Lexical Computing Ltd.
What's on the Web? The Web as a Linguistic Corpus Adam Kilgarriff Lexical Computing Ltd University of Leeds.
 Simplify Your Life. Use Google Docs. ELIB 570 Final Presentation: Web 2.0 Tool.
1 Evaluating word sketches Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.
McEnery, T., Xiao, R. and Y.Tono Corpus-based language studies. Routledge. Unit A 2. Representativeness, balance and sampling (pp13-21)
First International Sketch Grammar Workshop Ljubljana 3-4 February 2010.
1 Corpora, Language Technology and Maltese Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd University of Sussex.
RuleML-2007, Orlando, Florida1 Towards Knowledge Extraction from Weblogs and Rule-based Semantic Querying Xi Bai, Jigui Sun, Haiyan Che, Jin.
Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd.
Word senses Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds, Sussex.
GDEX: Automatically finding good dictionary examples in a corpus Adam Kilgarriff, Miloš Husák, Katy McAdam, Michael Rundell, Pavel Rychlý Lexical Computing.
1 Corpora, Dictionaries, and points in between in the age of the web Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of.
Auckland 2012Kilgarriff: Web Corpora1 Web Corpora Adam Kilgarriff.
Visualization, analysis and mining of geo- spatial information in educational data sets using web-based tools Aniruddha Desai |Winter 2013 Presentation.
Experiments on Building Language Resources for Multi-Modal Dialogue Systems Goals identification of a methodology for adapting linguistic resources for.
1 Corpora, Language Technology and Maltese Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd University of Sussex.
1 Googleology is bad science Adam Kilgarriff Lexical Computing Ltd Universities of Sussex, Leeds.
Why We Need Corpora and the Sketch Engine Adam Kilgarriff Lexical Computing Ltd, UK Universities of Leeds and Sussex.
Corpora by Web Services Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.
The BNC Design Model Adam Kilgarriff, Sue Atkins, Michael Rundell The Lexicography MasterClass
Methods for the Automatic Construction of Topic Maps Eric Freese, Senior Consultant ISOGEN International.
A Language Independent Method for Question Classification COLING 2004.
Comparable Corpora BootCaT (CCBC) (or: In Praise of BootCaT) Adam Kilgarriff, Jan Pomikalek, Avinesh PVS Lexical Computing Ltd. Work Supported by EU FP7.
1 Using Corpora in Language Research -also Introduction to the Sketch Engine (WS15) part 1 Adam Kilgarriff Lexical Computing Ltd Universities of Leeds.
1 Evaluating word sketches and corpora Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex.
Corpus Evaluation Adam Kilgarriff Lexical Computing Ltd Corpus evaluationPortsmouth Nov
Using Corpora in Language Research Adam Kilgarriff Lexical Computing Ltd Universities of Leeds January 2013Adam Kilgarriff.
Malta, May 2010Kilgarriff: Corpora by Web Services1 Corpora by Web Services Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities.
Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling Ferhan Ture and Jimmy Lin University of Maryland,
Terminology-finding in the Sketch Engine Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář, Pavel Rychlý, Vit Suchomel Lexical Computing Ltd., Brighton,
CL 2005, Birmingham Web as Corpus Workshop Intro: Adam Kilgarriff 1 Web as Corpus Workshop Co-chairs: Marco Baroni Adam Kilgarriff Sebastian Hoffman.
Huda Sarfraz Center for Research in Urdu Language Processing, National University of Computer and Emerging Sciences cases of local language content development.
The Sketch Engine as Infrastructure for Large Scale Text Collections for Humanities Research Adam Kilgarriff Lexical Computing Ltd. & Univ of Leeds, UK.
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Gurleen Ahluwalia Lecturer in Communication Skills BBSBEC, Fatehgarh Sahib Punjab.
Learning Usage of English KWICly with WebLEAP/DSR Takashi Yamanoue Kagoshima University, Japan Toshiro Minami Kyushu Institute of Information Sciences.
Semantic web Bootstrapping & Annotation Hassan Sayyadi Semantic web research laboratory Computer department Sharif university of.
A Semantic Knowledge Base for the UK Government Web Archive Tom Storrar & Claire Newing Applying records management processes principles to the open government.
THE SEMANTIC WEB By Conrad Williams. Contents  What is the Semantic Web?  Technologies  XML  RDF  OWL  Implementations  Social Networking  Scholarly.
GDEX: Automatically finding good dictionary examples in a corpus Auckland 2012Kilgarriff: GDEX1.
WEB MONITORING E6125 Web enHanced Information Management Presentation on Design of Web Monitoring applications. By Satyajeet Shaligram Columbia University.
AQUAINT Mid-Year PI Meeting – June 2002 Integrating Robust Semantics, Event Detection, Information Fusion, and Summarization for Multimedia Question Answering.
PAIR project progress report Yi-Ting Chou Shui-Lung Chuang Xuanhui Wang.
GDEX: Automatically finding good dictionary examples in a corpus Kivik 2013Kilgarriff: GDEX1.
GDEX: Automatically finding good dictionary examples in a corpus.
Measuring Monolinguality
Making useful wordlists for ELT
Evaluating word sketches and corpora
Writing Analytics Clayton Clemens Vive Kumar.
CS224N Section 3: Corpora, etc.
Corpora, Language Technology and Maltese
Presentation transcript:

1 Corpora for the coming decade Adam Kilgarriff

Dublin June 2009 Kilgarriff: Corpora for the coming decade2 How should they be different?  Bigger  Better  Communal

Dublin June 2009 Kilgarriff: Corpora for the coming decade3 Bigger  Motivation Ample data for rare phenomena Big subcorpora For language modelling  More like Google-scale but without Google disadvantages  See Googleology is Bad Science, CL 2007

Dublin June 2009 Kilgarriff: Corpora for the coming decade4 Better  Less noise  Fewer duplicates  Richer markup At word, sentence level At document level (text type, subcorpora) ‏

Dublin June 2009 Kilgarriff: Corpora for the coming decade5 Divide and rule  Bigger (+ cleaning + deduplication) ‏ Big Web Corpus (BiWeC) ‏  Currently 5.5b fully processed  Target 20b words  Jan Pomikalek, Pavel Rychly  Better New Model Corpus

Dublin June 2009 Kilgarriff: Corpora for the coming decade6 New Model Corpus  model 1.small version: model train 2.design: data model  New Model Corpus 1:100 scale model To replace BNC as design model

Dublin June 2009 Kilgarriff: Corpora for the coming decade7 BNC design model  Most often used Eg for other languages  pre-web f(blog)=0  Corpora now bigger, far quicker, far cheaper, different issues  BNC design model past its sell-by Kilgarriff Atkins Rundell, Corpus Lg 2007

Dublin June 2009 Kilgarriff: Corpora for the coming decade8 New model  Data  Markup

Dublin June 2009 Kilgarriff: Corpora for the coming decade9 Data  From the web  100m words  Small sample size Copyright ??Creative Commons Licence

Dublin June 2009 Kilgarriff: Corpora for the coming decade10 Composition  General crawl50  Targeted Fiction 7 Blog 7 Newspaper (RSS feed) 7 Speech10  Film transcripts, chatshow Domain-specific19  Business, medical, law

Dublin June 2009 Kilgarriff: Corpora for the coming decade11 Markup  Collaborative We distribute data Anyone applies their tools  Pos-tagger, parser, co-ref resolution, domain classifier, WSD, semantic classifier, time phrases, named entities... We integrate, display in Sketch Engine Research potential from multiple markup

Dublin June 2009 Kilgarriff: Corpora for the coming decade12 Two strands  Apply methods with good accuracy (and fast) to BiWeC

Dublin June 2009 Kilgarriff: Corpora for the coming decade13 Two strands  Apply methods with good accuracy (and fast) to BiWeC  Bigger  Better

Dublin June 2009 Kilgarriff: Corpora for the coming decade Communal  Collective effort to produce Markup, see above  Access Free and open Integrate into applications

Dublin June 2009 Kilgarriff: Corpora for the coming decade15 NLP by web services?  Big corpora big to hold, hard to access fast  Sketch Engine: corpus specialist  Web API FrameNet TEDDCLOG: Taiwan English Data Driven Cloze (test sentence) Generation  All welcome

Dublin June 2009 Kilgarriff: Corpora for the coming decade16 Practicalities  Free trial accounts  Collaborators, innovative users free longer-term accounts Wikinomics, Tapscott and Williams  API Details under 'help' on SkE home page  New Model Corpus Available by end 2009: watch Corpora

Dublin June 2009 Kilgarriff: Corpora for the coming decade  Thank you 