Tomaž Erjavec1, Adam Kilgarriff2, Irena Srdanović Erjavec3


A large public-access Japanese corpus and its query tool - JapWaC and Sketch Engine -
Tomaž Erjavec (1), Adam Kilgarriff (2), Irena Srdanović Erjavec (3)
(1) Jožef Stefan Institute, Slovenia
(2) Lexical Computing Ltd. and University of Leeds, UK
(3) Tokyo Institute of Technology, Japan

Overview
- The case for corpora
- The case for web corpora
- How JapWaC was created
- Sketch Engine (SkE)
- Demo-ing JapWaC & SkE
- Future work
- Access to JapWaC & SkE

Corpora
- A sample of a language
- Useful for studying the language
- Language is diverse
  - Big samples needed, to catch everything
  - Good tools needed, for large amounts of data
- Last 15 years
  - Big samples are easier to gather
  - Tools are better
  - Rapid growth in corpus methods

Web corpora
- The Web is huge, free and easily accessible
- Linguists and non-linguists use it for language checking and research
- Skewed? Keller and Lapata (2003): web results match human judgements well, often better than cleaner but smaller corpora; the large amount of data available outweighs the problems of using the Web as a corpus (it is noisy and unbalanced)
- The Web's importance as a resource is growing. David Crystal, "Language and the Internet" (2006): "new linguistic medium that we cannot ignore"
- Web-corpus expertise is growing (WaCky etc.)
- Preliminary evidence suggests that the 'balance' of our German corpus compares favourably with that of a newswire corpus (though of course any such claim raises a number of open research questions about corpus comparability).

Steps to compile web corpora (Sharoff, Baroni)
1. Get a URL list for the required language
   - Take the ~500 most frequent word forms: not function words; for general-purpose corpora, words that do not belong to a specific domain
   - 5000-6000 queries, 4 words each, top 10 URLs per query
2. Download the HTML pages
3. Normalize encoding (to UTF-8)
4. HTML clean-up
   - Boilerplate removal: HTML tags, Java code, navigation frames, ...
5. Extract metadata (URL, title, date, ...)
6. Linguistic annotation
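Step 1 can be illustrated with a minimal sketch. It assumes the seed list is already available as a Python list and only builds the query strings; sending them to a search engine and collecting the top 10 URLs is left out. The function name build_queries and its parameters are hypothetical and are not taken from BootCaT or from the authors' scripts.

```python
import math
import random


def build_queries(seed_words, n_queries, words_per_query=4, rng=None):
    """Return n_queries distinct queries, each made of distinct seed words.

    Hypothetical helper for illustration only: it mirrors the
    "5000-6000 queries, 4 words" step, not any particular tool.
    """
    rng = rng or random.Random()
    words = sorted(set(seed_words))  # drop duplicate seeds
    # Refuse requests that cannot be satisfied, to avoid looping forever.
    if n_queries > math.comb(len(words), words_per_query):
        raise ValueError("not enough seed words for that many distinct queries")
    queries = set()
    while len(queries) < n_queries:
        # Sort each combination so that word order does not create duplicates.
        queries.add(tuple(sorted(rng.sample(words, words_per_query))))
    return [" ".join(q) for q in sorted(queries)]
```

With about 500 seed words the number of possible 4-word combinations is far larger than 6000, so the loop ends quickly in practice.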

Steps for JapWaC
- URL list of pages in Japanese provided by Serge Sharoff
  - Word, lemma and PoS frequency lists for Japanese, see http://corpus.leeds.ac.uk/list.html
- Files downloaded and cleaned with BootCaT by Marco Baroni and others from the WaCky project, see http://wacky.sslmit.unibo.it/
- Segmented, tokenised and tagged with Chasen
- Chasen tags translated to English
- Converted to Sketch Engine format and loaded

Example file
The file size is 7038669 kB, showing first 1 kB.

<doc id="http://www.0start-hp.com/voice/index.php">
<s>
月々 月々 N.Adv
2 2 N.Num
6 6 N.Num
3 3 N.Num
円 円 N.Suff.msr
で だ Aux
、 、 Sym.c
あなた あなた N.Pron.g
も も P.bind
ブログデビュー ブログデビュー Unknown
し する V.free
て て P.Conj
み みる V.bnd
ませ ます Aux
ん ん Aux
か か P.advcoordfin
? ? Sym.g
</s>
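The example above has one token per line with three columns (word form, lemma, tag), grouped by <s> and <doc> tags. A minimal reader for such a file might look like the sketch below. It assumes the columns are separated by whitespace, as they appear in the transcript, and that structural tags sit on their own lines; the name parse_vertical is hypothetical and not part of the Sketch Engine.

```python
import re

# Matches the document-opening tag shown in the example file.
DOC_RE = re.compile(r'<doc id="([^"]*)">')


def parse_vertical(lines):
    """Yield (doc_id, sentence) pairs.

    Each sentence is a list of (word, lemma, tag) tuples. Assumes
    whitespace-separated columns and one structural tag per line.
    """
    doc_id = None
    sentence = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = DOC_RE.fullmatch(line)
        if match:
            doc_id = match.group(1)
        elif line == "<s>":
            sentence = []
        elif line == "</s>":
            if sentence is not None:
                yield doc_id, sentence
            sentence = None
        elif line.startswith("<") and line.endswith(">"):
            continue  # any other structural tag, e.g. </doc>
        elif sentence is not None:
            fields = line.split()
            if len(fields) == 3:  # skip malformed token lines
                sentence.append(tuple(fields))
```

Because it is a generator, the reader can be run over the open file object directly, which matters for a file of several gigabytes.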

Basic corpus statistics
- 49,554 URLs (i.e. HTML files)
- 16,072 sites (2 domains)
- 12,759,201 sentences (Chasen)
- 409,384,411 tokens (Chasen)
- 7.3 GB file size

Tokens per file:
  Average  8,263
  Median   5,001
  Min      3
  Max      170,693

URL statistics (top ranking domains, sites and keywords)

Chasen POS statistics

The Sketch Engine
- Leading corpus query system
- Any corpus, any language
- Web-based: no software to install
- Concordance
- Word sketches: "one-page, corpus based account of a word's grammatical and collocational behaviour"
- Thesaurus
- Word Sketch Difference

Use of Sketch Engine
- Macmillan English Dictionary for Advanced Learners (ed. Rundell, 2002)
- Lexicography
- Language learning
- Linguistic research

Note on the "salience" statistic: Tomaž is right. "Salience" is just a name we are using (rather presumptuously) for the statistic we use, which is one of the same family as all the others. We reckon it is the best suited of the family, which is why we use it; see also the extensive comparisons in James Curran's thesis. The statistic we currently use is based on the Dice coefficient, which is (basically) freq(colloc) / (freq(word1) + freq(word2)). We then do some scaling so that the numbers "look nice". We should really list it as "logDice" rather than "salience": salience is what you want to measure, but it is presumptuous to imply we have found a measure that succeeds in measuring it. The statistics are not properly documented yet but will be soon.
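The statistic can be sketched as follows. Two points here are assumptions rather than something stated in the slides: the standard Dice coefficient carries a factor of 2 in the numerator (the note gives the formula only "basically"), and the scaling is taken to be the logDice definition later published by Rychlý (2008), 14 + log2(Dice). The function names are illustrative.

```python
import math


def dice(f_xy, f_x, f_y):
    """Standard Dice coefficient: 2 * f(x, y) / (f(x) + f(y))."""
    return 2 * f_xy / (f_x + f_y)


def log_dice(f_xy, f_x, f_y):
    """logDice as assumed here: 14 + log2(Dice).

    The slides only mention "some scaling"; the constant 14 follows
    the published logDice definition, not the slides themselves.
    """
    if f_xy <= 0:
        raise ValueError("the collocation must occur at least once")
    return 14 + math.log2(dice(f_xy, f_x, f_y))
```

Under this definition the score has a theoretical maximum of 14, reached when the two words occur only together, and each halving of the Dice coefficient lowers the score by one point.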

Creating SkE for Japanese
- Load JapWaC into SkE
- Write gram relations for Japanese Chasen POS (as used for jaSlo)
- Compile word sketches
- Recompute scores in WS
- Compile thesaurus

Word Sketch examples
- WS for 女の子 (noun)
- WS for 冷たい (adjective)
- WS for 書く (verb)

Thesaurus, WS Diff example (1) WS Diff for 女の子 and 男の子

Thesaurus, WS Diff example (2) WS Diff for 寒い and 冷たい

「温泉」 example WS for 温泉

Future work
- More metadata in the corpus: date, title, author; text typology
- More data cleaning
- Japanese corpus for HLT research: avoid copyright problems by sampling only 10 consecutive sentences from many URLs to obtain a 100M corpus, to be made available for download under a Creative Commons licence
- For native speakers' and learners' use: original Chasen tags, Chasen kana; Ruby romaji, furigana in examples
- Connecting to jaSlo, Natsume system
- More advanced relations (MWU etc.), Cabocha?
- Load other corpora into SkE (Kotonoha, AB)
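The sampling idea for the downloadable corpus can be sketched as below. This is only an illustration of "10 consecutive sentences" per page: the function name, the random choice of start position and the handling of pages shorter than the window (kept whole here) are all assumptions, not the authors' procedure.

```python
import random


def sample_consecutive(sentences, n=10, rng=None):
    """Return one run of n consecutive sentences from a document.

    Assumption for illustration: documents with n or fewer sentences
    are returned whole; the start position is chosen at random.
    """
    rng = rng or random.Random()
    if len(sentences) <= n:
        return list(sentences)
    start = rng.randrange(len(sentences) - n + 1)
    return list(sentences[start:start + n])
```

Applying this once per URL and concatenating the results until the target size is reached would give a corpus in which no source page is reproduced in full.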

Access to JapWaC & SkE
- http://www.sketchengine.co.uk
- Free 30-day trial
- Self-registration
- Japanese, Chinese, English, French, German, Italian, Spanish, Portuguese, Slovene
- Also gives access to WebBootCaT: "instant web corpora"

Thank you for your attention!