arTenTen A new, vast corpus for Arabic

Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Noam Ordan, Ryan Roth, Vit Suchomel
MIT / Columbia / Lexical Computing Ltd. / Univ Saarlandes / Masaryk Univ Cz

We all want corpora to be
Bigger
Better
  More text types
  Richer metadata
  Cleaner
  Better linguistic processing

Arabic
Since 2003: Arabic Gigaword
  Good on most fronts except variety
  Newswire only
Leeds 2005 Arabic web corpus (oldish)
Others
  Mostly small, not available, or newswire

arTenTen
TenTen family
  See paper in main conference
Web crawled
  SpiderLing (Pomikalek and Suchomel, WAC 2012)
Cleaning and deduplication
  jusText, Onion (Pomikalek)
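The deduplication step can be illustrated with a minimal sketch. This is not Onion's actual implementation: it only shows the general idea of dropping a paragraph when most of its word n-grams have already been seen. The function names, the n-gram length `n=5`, and the `threshold=0.5` cutoff are all assumptions chosen for illustration.

```python
def shingles(tokens, n=5):
    """Return the set of word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=5, threshold=0.5):
    """Keep paragraphs whose share of already-seen n-grams is at most `threshold`.

    Hypothetical helper: paragraphs shorter than n tokens yield no n-grams
    and are always kept.
    """
    seen = set()   # every n-gram from paragraphs kept so far
    kept = []
    for text in paragraphs:
        grams = shingles(text.split(), n)
        if grams:
            dup_ratio = len(grams & seen) / len(grams)
            if dup_ratio > threshold:
                continue  # mostly duplicate content: drop it
        kept.append(text)
        seen |= grams
    return kept


paragraphs = ["a b c d e f g", "a b c d e f g", "x y z"]
kept = deduplicate(paragraphs)
```

Splitting on whitespace matches the slide's "space-separated tokens" notion of a token; a real pipeline would run on tokenised text.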

Size
5.8 billion space-separated tokens
Fully processed: 200M words
  Tokenise, lemmatise, POS-tag by MADA (Columbia U)
  Sketch grammar: new work (Belinkov)
Varieties/dialects
  We don’t know yet

Availability
In Sketch Engine demo

Encoding
‘Vertical’ format
  Sketch Engine input format
  One word per line, tab-separated columns
  Twenty-nine columns
Structural markup: XML

For each word
word (as written, in Arabic), trans, diac, lemma, lemma_ar, non_voc_lemma, non_voc_lemma_ar, stem, tag, bw, pref3, pref3tag, pref2, pref2tag, pref1, pref1tag, pref0, pref0tag, person, aspect, vox, modus, gender, number, state, case, enclitic, gloss, source
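A minimal sketch of reading this format, assuming the columns appear in the file in the order listed above and that any line of the form `<...>` is structural XML markup rather than a token. The function name `parse_vertical` and the sample data (placeholder values, not real corpus content) are hypothetical.

```python
import io

# The 29 per-word attributes, in the order given on the slide
# (assumed here to be the column order in the file).
COLUMNS = [
    "word", "trans", "diac", "lemma", "lemma_ar",
    "non_voc_lemma", "non_voc_lemma_ar", "stem", "tag", "bw",
    "pref3", "pref3tag", "pref2", "pref2tag",
    "pref1", "pref1tag", "pref0", "pref0tag",
    "person", "aspect", "vox", "modus", "gender", "number",
    "state", "case", "enclitic", "gloss", "source",
]


def parse_vertical(lines):
    """Yield ("struct", markup) for XML lines and ("token", dict) for word lines."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        # Assumption: structural markup occupies a whole line, e.g. <doc ...> or </s>.
        if line.startswith("<") and line.endswith(">"):
            yield ("struct", line)
            continue
        values = line.split("\t")
        # Pad short rows so every token has all 29 attributes.
        values += [""] * (len(COLUMNS) - len(values))
        yield ("token", dict(zip(COLUMNS, values)))


# Placeholder sample: one token with dummy values in all 29 columns.
sample = (
    '<doc id="example">\n<s>\n'
    + "\t".join(["w1"] + ["_"] * 28)
    + "\n</s>\n</doc>\n"
)
events = list(parse_vertical(io.StringIO(sample)))
```

Yielding events rather than building a tree keeps memory use flat, which matters for a corpus of this size.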
