A large public-access Japanese corpus and its query tool - JapWaC and Sketch Engine -
Tomaž Erjavec (1), Adam Kilgarriff (2), Irena Srdanović Erjavec (3)
(1) Jožef Stefan Institute, Slovenia
(2) Lexical Computing Ltd. and University of Leeds, UK
(3) Tokyo Institute of Technology, Japan
Overview
- The case for corpora
- The case for web corpora
- How JapWaC was created
- Sketch Engine (SkE)
- Demo-ing JapWaC & SkE
- Future work
- Access to JapWaC & SkE
Corpora
- A sample of a language
- Useful for studying the language
- Language is diverse
  - Big samples needed, to catch everything
  - Good tools needed, for large amounts of data
- Last 15 years
  - Big samples are easier to gather
  - Tools are better
  - Rapid growth in corpus methods
Web corpora
- Web is huge, free, easily accessible
- Linguists and non-linguists use it for language checking and research
- Skewed?
  - Keller and Lapata 03: web results match human judgements well, often better than cleaner but smaller corpora
  - the large amount of data outweighs the "noise" problem (the web as a corpus is noisy and unbalanced)
- Web importance as a resource is growing
  - David Crystal, "Language and the Internet" 06: "new linguistic medium that we cannot ignore"
- Web-corpus expertise is growing (WaCky etc.)
  - Preliminary evidence suggests the 'balance' of our German corpus compares favourably with that of a newswire corpus (though of course any such claim begs a number of open research questions about corpus comparability)
Steps to compile web corpora (Sharoff, Baroni)
1. Get URL list for required language
   - ~500 most frequent word forms
   - not function words; for general-purpose corpora, words that do not belong to a specific domain
   - 5000-6000 queries, 4 words each, top 10 URLs per query
2. Download HTML pages
3. Normalize encoding (to UTF-8)
4. HTML clean-up
   - boilerplate removal: HTML tags, Java code, navigation frames, ...
5. Extract meta-data (URL, title, date, ...)
6. Linguistic annotation
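The query-building part of step 1 can be sketched as follows. This is a minimal illustration only, not the actual BootCaT or Sharoff tooling: the function name `build_queries`, its parameters, and the placeholder seed words are all assumptions made for the example. It draws random 4-word combinations from a seed list of frequent, non-function words; each query would then be sent to a search engine and the top 10 URLs kept.

```python
import math
import random


def build_queries(seed_words, n_queries, words_per_query=4, rng=None):
    """Return n_queries distinct queries, each made of distinct seed words.

    Hypothetical helper: a sketch of random seed-word combination,
    not the implementation used to build JapWaC.
    """
    rng = rng or random.Random()
    words = sorted(set(seed_words))  # drop duplicate seeds
    if len(words) < words_per_query:
        raise ValueError("need at least words_per_query distinct seed words")
    if n_queries > math.comb(len(words), words_per_query):
        raise ValueError("not enough distinct combinations for n_queries")

    queries = set()
    while len(queries) < n_queries:
        # Sort each combination so the same word set is never counted twice.
        combo = tuple(sorted(rng.sample(words, words_per_query)))
        queries.add(combo)
    return [" ".join(q) for q in sorted(queries)]


# Placeholder seeds; a real run would use ~500 mid-frequency word forms.
seeds = [f"word{i}" for i in range(20)]
queries = build_queries(seeds, n_queries=50, rng=random.Random(0))
```

With about 500 seed words there are far more 4-word combinations than the 5000-6000 queries needed, so the loop terminates quickly in practice.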
Steps for JapWaC
- URL list of pages in Japanese provided by Serge Sharoff
  - word, lemma and PoS frequency lists for Japanese, cf. http://corpus.leeds.ac.uk/list.html
- Files downloaded and cleaned with BootCat by Marco Baroni and others from the WaCky project, cf. http://wacky.sslmit.unibo.it/
- Segmented, tokenised, tagged with Chasen
- Translated Chasen tags to English
- Converted to Sketch Engine format and loaded
Example file
The file size is 7038669 kB, showing first 1 kB.
<doc id="http://www.0start-hp.com/voice/index.php">
<s>
月々	月々	N.Adv
2	2	N.Num
6	6	N.Num
3	3	N.Num
円	円	N.Suff.msr
で	だ	Aux
、	、	Sym.c
あなた	あなた	N.Pron.g
も	も	P.bind
ブログデビュー	ブログデビュー	Unknown
し	する	V.free
て	て	P.Conj
み	みる	V.bnd
ませ	ます	Aux
ん	ん	Aux
か	か	P.advcoordfin
?	?	Sym.g
</s>
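A file in this layout can be read with a few lines of code. The sketch below assumes what the example suggests: one token per line with three whitespace-separated columns (word form, lemma, translated Chasen tag), sentences wrapped in <s> ... </s>, and documents opened by a <doc id="..."> line. The function name `parse_vertical` is a hypothetical helper, not part of Sketch Engine.

```python
import re


def parse_vertical(lines):
    """Yield (doc_id, sentence) pairs from vertical-format lines.

    Each sentence is a list of (word, lemma, tag) tuples.
    Assumes three whitespace-separated columns per token line.
    """
    doc_id = None
    sentence = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("<doc"):
            match = re.search(r'id="([^"]*)"', line)
            doc_id = match.group(1) if match else None
        elif line.startswith("</doc"):
            doc_id = None
        elif line == "<s>":
            sentence = []
        elif line == "</s>":
            if sentence is not None:
                yield doc_id, sentence
            sentence = None
        elif sentence is not None:
            parts = line.split()
            if len(parts) == 3:  # skip malformed token lines
                sentence.append(tuple(parts))


sample = """<doc id="http://www.0start-hp.com/voice/index.php">
<s>
月々 月々 N.Adv
円 円 N.Suff.msr
で だ Aux
し する V.free
</s>
"""
parsed = list(parse_vertical(sample.splitlines()))
```

Because the function is a generator, it can stream over a multi-gigabyte file line by line without loading it into memory.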
Basic corpus statistics
- 49,554 URLs (i.e. HTML files)
- 16,072 sites (2 domains)
- 12,759,201 sentences (Chasen)
- 409,384,411 tokens (Chasen)
- 7.3 GB filesize
- tokens/file:
  - Average: 8,263
  - Median: 5,001
  - Min: 3
  - Max: 170,693
URL statistics (top ranking domains, sites and keywords)
Chasen POS statistics
The Sketch Engine
- Leading corpus query system
- Any corpus, any language
- Web-based
  - No software to install
- Concordance
- Word sketches
  - "one-page, corpus based account of a word's grammatical and collocational behaviour"
- Thesaurus
- Word Sketch Difference
Use of Sketch Engine
- Macmillan English Dictionary For Advanced Learners (Ed: Rundell, 2002)
- Lexicography
- Language learning
- Linguistic research
Note on the "salience" statistic
- 'Salience' is just a name we are using (rather presumptuously) for the statistic we use, which is one of the same family as all the others. We reckon it is the best suited of the family, which is why we use it; see also the extensive comparisons in James Curran's thesis.
- The statistic we currently use is based on the Dice coefficient. This is (basically) freq(colloc) / (freq(word1) + freq(word2)). We then do some scaling so the numbers "look nice".
- We should really list it as "logDice" rather than "salience": salience is what you want to measure, but it is presumptuous to imply we have found a measure that succeeds in measuring it.
- The statistics are not properly documented yet but will be soon.
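As a worked illustration of the statistic described in this note, here is a minimal sketch. The standard Dice coefficient carries a factor of 2 in the numerator, which the note's "basically" leaves out. The note does not say what the "scaling" is; the version below uses the commonly published logDice form (base-2 logarithm plus a constant of 14), which is an assumption as far as this talk is concerned. The function names and the frequencies in the example are hypothetical.

```python
import math


def dice(f_xy, f_x, f_y):
    """Standard Dice coefficient for a collocation.

    f_xy: frequency of the two words occurring together
    f_x, f_y: frequencies of each word on its own
    """
    return 2 * f_xy / (f_x + f_y)


def log_dice(f_xy, f_x, f_y):
    """Scaled Dice score: 14 + log2(Dice).

    The constant 14 and the base-2 log are assumed from the published
    logDice definition; the slides only mention "some scaling".
    """
    if f_xy <= 0:
        raise ValueError("the collocation must occur at least once")
    return 14 + math.log2(dice(f_xy, f_x, f_y))


# Hypothetical counts: the pair co-occurs 50 times,
# the individual words occur 1,000 and 600 times.
score = log_dice(50, 1000, 600)
```

Under this scaling the theoretical maximum is 14 (the two words only ever occur together), and each point lower means the collocation is half as strong, which makes scores comparable across corpora of different sizes.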
Creating SkE for Japanese
- Load JapWaC into SkE
- Write gram relations for Japanese
  - Chasen POS (as used for jaSlo)
- Compile word sketches
- Recompute scores in WS
- Compile thesaurus
Word Sketch examples
- WS for 女の子 (noun)
- WS for 冷たい (adjective)
- WS for 書く (verb)
Thesaurus, WS Diff example (1)
- WS Diff for 女の子 and 男の子
Thesaurus, WS Diff example (2)
- WS Diff for 寒い and 冷たい
「温泉」 example
- WS for 温泉
Future work
- More metadata in the corpus: date, title, author; text typology
- More data cleaning
- Japanese corpus for HLT research: avoid copyright problems by sampling only 10 consecutive sentences from many URLs to obtain a 100M corpus; make available for download with a Creative Commons licence
- For native speakers' and learners' use:
  - original Chasen tags, Chasen kana
  - Ruby romaji, furigana in examples
- Connecting to jaSlo, Natsume system
- More advanced relations (MWU etc.), Cabocha?
- Load other corpora into SkE (Kotonoha, AB)
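The planned sampling of 10 consecutive sentences per page could look roughly like this. It is only a sketch of the idea, not an existing tool: the function name `sample_window` and the choice of a uniformly random start position are assumptions.

```python
import random


def sample_window(sentences, window=10, rng=None):
    """Return `window` consecutive sentences from one document.

    Documents shorter than the window are returned whole.
    The start position is chosen uniformly at random (an assumption;
    the slides do not say how the window would be placed).
    """
    rng = rng or random.Random()
    if len(sentences) <= window:
        return list(sentences)
    start = rng.randrange(len(sentences) - window + 1)
    return list(sentences[start:start + window])


# Toy document of 100 numbered "sentences".
document = list(range(100))
excerpt = sample_window(document, rng=random.Random(1))
```

Applying this to every downloaded page and concatenating the excerpts until the target size is reached would give the kind of sampled corpus described above.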
Access to JapWaC & SkE
- http://www.sketchengine.co.uk
- Free 30-day trial
- Self-registration
- Japanese, Chinese, English, French, German, Italian, Spanish, Portuguese, Slovene
- Also gives access to WebBootCaT: "instant web corpora"
Thank you for your attention!