arTenTen A new, vast corpus for Arabic


1 arTenTen A new, vast corpus for Arabic
Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Noam Ordan, Ryan Roth, Vit Suchomel (MIT / Columbia / Lexical Computing Ltd. / Univ Saarlandes / Masaryk Univ Cz)

2 We all want corpora to be
Bigger; Better; More text types; Richer metadata; Cleaner; Better linguistic processing

3 Arabic
Since 2003: Arabic Gigaword. Good on most fronts except variety; newswire only.
Leeds: 2005 Arabic web corpus (oldish).
Others: mostly small, or not available, or newswire.

4 arTenTen
TenTen family: see paper in main conference.
Web crawled: Spiderling (Pomikalek and Suchomel, WAC 2012).
Cleaning and deduplication: jusText, Onion (Pomikalek).

5 Size, varieties/dialects
Size: 5.8 billion space-separated tokens.
Fully processed: 200M words. Tokenised, lemmatised, and POS-tagged by MADA (Columbia U).
Sketch grammar: new work (Belinkov).
Varieties/dialects: we don't know yet.

6 Availability
In Sketch Engine; demo

7 Encoding
'Vertical' format: the Sketch Engine input format.
One word per line, tab-separated columns (twenty-nine of them).
Structural markup: XML.

8 For each word
word (as written, in Arabic), trans, diac, lemma, lemma_ar, non_voc_lemma, non_voc_lemma_ar, stem, tag, bw, pref3, pref3tag, pref2, pref2tag, pref1, pref1tag, pref0, pref0tag, person, aspect, vox, modus, gender, number, state, case, enclitic, gloss, source
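To make the encoding concrete, here is a minimal Python sketch of reading such a 'vertical' file. It rests on assumptions that the slides do not confirm: that the columns appear in the order listed on slide 8, and that every line beginning with "<" is structural XML markup rather than a word. The function name parse_vertical and the sample rows are hypothetical; the sample values are dummy placeholders, not real corpus data.

```python
from typing import Dict, Iterable, Iterator, Tuple, Union

# Column names as listed on slide 8; the order is assumed to match the file.
COLUMNS = [
    "word", "trans", "diac", "lemma", "lemma_ar", "non_voc_lemma",
    "non_voc_lemma_ar", "stem", "tag", "bw", "pref3", "pref3tag",
    "pref2", "pref2tag", "pref1", "pref1tag", "pref0", "pref0tag",
    "person", "aspect", "vox", "modus", "gender", "number", "state",
    "case", "enclitic", "gloss", "source",
]

Item = Tuple[str, Union[str, Dict[str, str]]]


def parse_vertical(lines: Iterable[str]) -> Iterator[Item]:
    """Yield ("tag", raw_line) for markup lines and ("token", fields) for words.

    Hypothetical helper: treats any line starting with "<" as structural
    markup, which is an assumption about the file, not a documented rule.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("<"):
            yield ("tag", line)
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"expected {len(COLUMNS)} columns, got {len(fields)}"
            )
        yield ("token", dict(zip(COLUMNS, fields)))


# Dummy input: one document, one sentence, one word row of placeholder values.
dummy_row = "\t".join(f"{name}_1" for name in COLUMNS)
sample = ["<doc>", "<s>", dummy_row, "</s>", "</doc>"]
parsed = list(parse_vertical(sample))

for kind, value in parsed:
    if kind == "token":
        print(value["word"], value["lemma"], value["tag"])
```

Keeping markup lines and word lines as separately labelled items lets a caller rebuild sentence and document boundaries without guessing which XML elements the corpus actually uses.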

