First International Sketch Grammar Workshop Ljubljana 3-4 February 2010.



Slide 2 (Feb 2010, Kilgarriff: IWSG, Ljubljana): Workshop goals
- (as I see them)
- Share grammar-writing experience
- Feedback to LCL
- LCL tells you
  - Other possibilities
  - What is in the pipeline

Slide 3: Apologies
- Masha Kholkova, Carole Tiberius

Slide 4: LCL projects and plans
- Corpora
  - Corpus Factory
  - English: bigger and better
- Corpus NLP with remote corpora
  - Web-API use of SkE
- Far horizons
  - From text towards meaning
- Tomorrow
  - SkE Interface, extra functionality
  - Formalism (Pavel)

Slide 5: Corpus Factory
- Goal
  - All medium-large world languages
- All EU languages
- About m word web corpus
- Hyderabad team

Slide 6
- Done
  - Dutch, Thai, Vietnamese, Hindi
  - but Indians mainly use English on web
- Earlier projects
  - Greek, Japanese
- Next
  - Swedish, Norwegian, Korean
- Collab: WaCKY
  - Bologna (Marco Baroni): German, Italian
  - Leeds (Serge Sharoff): Arabic, Chinese, Polish, Russian, French, Spanish …

Slide 7: BootCaT method
- Wikipedia for the language
  - Word frequency list
- Mid-frequency words: seeds
- Highest-frequency words: use for filtering
- Queries of n words to search engine
- Clean, dedupe, filter
- Tokenise, POS-tag, lemmatise
- Load in SkE
  - Word Sketches
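
The seed-selection and query steps on the BootCaT slide can be sketched roughly as follows. This is an illustration only, not the Corpus Factory code. The function names and the band sizes (`top_k`, `seed_band`) are assumptions. So is the reading of "use for filtering" as "keep a page only if it contains enough of the highest-frequency words".

```python
import random
from collections import Counter


def build_queries(tokens, n_words=3, n_queries=5, top_k=2, seed_band=4, rng_seed=0):
    """Split a frequency list into filter words and seeds, then draw n-word queries."""
    # Word frequency list, most frequent first (e.g. counted over Wikipedia text).
    ranked = [word for word, _ in Counter(tokens).most_common()]
    filter_words = ranked[:top_k]                # highest-frequency words
    seeds = ranked[top_k:top_k + seed_band]      # mid-frequency words
    rng = random.Random(rng_seed)                # fixed seed for repeatable queries
    # Each query is n_words distinct seeds, to be sent to a search engine.
    queries = [tuple(rng.sample(seeds, n_words)) for _ in range(n_queries)]
    return filter_words, seeds, queries


def passes_filter(page_text, filter_words, min_hits=2):
    """Assumed filter: keep a page only if it has enough high-frequency words."""
    wanted = set(filter_words)
    hits = sum(1 for word in page_text.lower().split() if word in wanted)
    return hits >= min_hits


# Toy frequency data standing in for a real Wikipedia dump.
tokens = (["the"] * 10 + ["of"] * 9 + ["corpus"] * 5 + ["word"] * 4
          + ["sketch"] * 3 + ["grammar"] * 2 + ["rare"])
filter_words, seeds, queries = build_queries(tokens)
```

In a real pipeline the band boundaries would be chosen per language, and the downloaded pages would still go through the cleaning, deduplication and tagging steps listed on the slide.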

Slide 8: English
- Bigger
- Better

Slide 9: Bigger
- Motivation
  - Ample data for rare phenomena
  - Big subcorpora
  - For language modelling
- More like Google-scale but without Google disadvantages
  - See Googleology is Bad Science, CL 2007

Slide 10: Better
- Less noise
- Fewer duplicates
- Richer markup
  - At word, sentence level
  - At document level (text type, subcorpora)

Slide 11: Divide and rule
- Bigger (+ cleaning + deduplication)
  - Big Web Corpus (BiWeC)
    - Currently 5.5b fully processed
    - Target 20b words
    - Jan Pomikalek, Pavel Rychly
- Better
  - New Model Corpus

Slide 12: New Model Corpus
- model
  1. small version: model train
  2. design: data model
- New Model Corpus
  - 1:100 scale model
  - To replace BNC as design model

Slide 13: BNC design model
- Most often used
  - E.g. for other languages
- pre-web
  - f(blog)=0
- Corpora now bigger, far quicker, far cheaper, different issues
- BNC design model past its sell-by
  - Kilgarriff, Atkins, Rundell, Corpus Linguistics 2007

Slide 14: New model
- Data
- Markup

Slide 15: Data
- From the web
- 100m words
- Small sample size
  - Copyright ??
  - Creative Commons Licence

Slide 16: Composition
- General crawl: 50
- Targeted
  - Fiction: 7
  - Blog: 7
  - Newspaper (RSS feed): 7
  - Speech: 10
    - Film transcripts, chatshow
  - Domain-specific: 19
    - Business, medical, law
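
As a quick check on the composition figures, the sketch below treats them as percentage shares of the 100m-word target given on the Data slide. That reading is an assumption, since the slide does not state the unit. The shares sum to 100, so under this reading each point corresponds to one million words. The variable and function names are illustrative.

```python
# Shares as listed on the Composition slide (assumed to be percentages).
COMPOSITION = {
    "General crawl": 50,
    "Fiction": 7,
    "Blog": 7,
    "Newspaper (RSS feed)": 7,
    "Speech": 10,
    "Domain-specific": 19,
}
TOTAL_WORDS = 100_000_000  # 100m words, from the Data slide


def word_targets(shares, total_words):
    """Convert percentage shares into word-count targets per component."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100")
    return {name: total_words * pct // 100 for name, pct in shares.items()}


targets = word_targets(COMPOSITION, TOTAL_WORDS)
```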

Slide 17: Markup
- Collaborative
  - We distribute data
  - Anyone applies their tools
    - POS-tagger, parser, co-ref resolution, domain classifier, WSD, semantic classifier, time phrases, named entities...
  - We integrate, display in Sketch Engine
  - Research potential from multiple markup

Slide 18: Recombine the two strands
- Apply methods with good accuracy (and fast) to BiWeC
- Result will be
  - Bigger
  - Better

Slide 19: Corpus NLP with Remote Corpora / NLP by web services?
- Big corpora big to hold, hard to access fast
- Sketch Engine: corpus specialist
- Web API
  - FrameNet
  - TEDDCLOG: Taiwan English Data Driven Cloze (test sentence) Generation
- All welcome

Slide 20: Practicalities
- Free trial accounts
- Collaborators, innovative users
  - free longer-term accounts
  - Wikinomics, Tapscott and Williams
- API
  - Details under 'help' on SkE home page
- New Model Corpus
  - Available soon: watch Corpora

Slide 21: Far horizons

Slide 22: The long journey from text towards meaning
[Diagram labels: Raw text; Pure meaning; Rationalists; Empiricists; lemmatizer; POS-tagger; parser; thesaurus; thematic relations/frame elements]

Slide 23: Next steps
- Semantic tagging
  - Extra positional attribute
  - Use in Sketch Grammar patterns
  - English: Lancaster system
  - Russian: ABBYY system
- Learn
  - Hanks: Corpus Pattern Analysis
  - Melcuk Lexical Functions
  - Frame semantics: frames
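
To make "extra positional attribute" concrete, the sketch below adds a semantic tag as a fourth column next to word, POS tag and lemma. It writes the tokens one per line with tab-separated attributes. The tag labels, the toy lexicon and the function names are all hypothetical. A real setup would take its tags from a semantic tagger such as the systems named on the slide.

```python
# Hypothetical lemma -> semantic tag lexicon, for illustration only.
SEM_LEXICON = {"bread": "FOOD", "eat": "FOOD", "bank": "MONEY"}


def add_semantic_attribute(tokens, lexicon, default="-"):
    """Append a semantic tag as a fourth attribute to (word, POS tag, lemma) tokens."""
    return [(word, pos, lemma, lexicon.get(lemma, default)) for word, pos, lemma in tokens]


def to_vertical(tokens):
    """Render one token per line, attributes separated by tabs."""
    return "\n".join("\t".join(token) for token in tokens)


tagged = add_semantic_attribute(
    [("She", "PRP", "she"), ("eats", "VBZ", "eat"), ("bread", "NN", "bread")],
    SEM_LEXICON,
)
vertical = to_vertical(tagged)
```

With such an attribute in place, a grammar pattern could constrain a token by meaning class as well as by part of speech, for instance with a condition along the lines of `[semtag="FOOD"]`, where `semtag` is again a made-up attribute name.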

Slide 24: ... and WSD
- Semi-Automatic Dictionary Drafting
- SADD
  - Builds on WASPS
  - Shares CPA technology
- Senses as clusters of instances
  - 'one sense per collocate'
- Shortcut: clusters of collocates
- In pictures
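
The "shortcut" on this slide can be illustrated with a small sketch. If each collocate signals one sense, then clustering the collocates also clusters the instances that contain them. The clusters, example sentences and function name below are invented for illustration and do not show how SADD or WASPS work internally.

```python
from collections import defaultdict


def assign_senses(instances, collocate_clusters):
    """Group corpus instances by the sense cluster their collocate belongs to."""
    # Invert the clusters: each collocate points to exactly one sense.
    collocate_to_sense = {
        collocate: sense
        for sense, collocates in collocate_clusters.items()
        for collocate in collocates
    }
    senses = defaultdict(list)
    for sentence, collocate in instances:
        senses[collocate_to_sense.get(collocate, "unassigned")].append(sentence)
    return dict(senses)


# Hypothetical collocate clusters for the headword "bank".
clusters = {"finance": {"loan", "money"}, "river": {"muddy", "river"}}
instances = [
    ("The bank approved the loan.", "loan"),
    ("We sat on the muddy bank.", "muddy"),
    ("She put money in the bank.", "money"),
    ("The bank holiday was sunny.", "holiday"),
]
by_sense = assign_senses(instances, clusters)
```

Instances whose collocate falls in no cluster are left as "unassigned" rather than forced into a sense.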

Slide 25: Clustered word sketch

Slides 26-29: [images only; no text in the transcript]

Slide 30: LCL projects and plans
- Corpora
  - Many languages
  - English: bigger and better
- Corpus NLP with remote corpora
  - Web-API use of SkE
- Far horizons
  - From text towards meaning
- Tomorrow
  - SkE Interface, extra functionality
    - Subcorpora / text types
  - Formalism (Pavel)