1
arTenTen: A new, vast corpus for Arabic
Yonatan Belinkov, Nizar Habash, Adam Kilgarriff, Noam Ordan, Ryan Roth, Vit Suchomel
MIT / Columbia / Lexical Computing Ltd. / Univ. Saarlandes / Masaryk Univ., CZ
2
We all want corpora to be
  Bigger
  Better
  More text types
  Richer metadata
  Cleaner
  Better linguistic processing
3
Arabic
  Since 2003: Arabic Gigaword
    Good on most fronts except variety
    Newswire only
  Leeds
    2005 Arabic web corpus (oldish)
  Others
    Mostly small, not available, or newswire
4
arTenTen
  TenTen family
    See paper in main conference
  Web crawled
    SpiderLing (Pomikalek and Suchomel, WAC 2012)
  Cleaning and deduplication
    jusText, Onion (Pomikalek)
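To make the deduplication step concrete, here is a minimal sketch of n-gram-based duplicate removal at paragraph level. It is an illustration only, not the actual implementation of Onion: the function names, the n-gram length and the threshold are all assumptions chosen for the example.

```python
def ngrams(tokens, n):
    """Return the set of token n-grams in a paragraph."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def deduplicate(paragraphs, n=3, threshold=0.5):
    """Keep a paragraph unless most of its n-grams were already seen.

    n and threshold are illustrative values, not settings from the talk.
    """
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = ngrams(paragraph.split(), n)
        if grams:
            duplicated = len(grams & seen) / len(grams)
            if duplicated > threshold:
                continue  # mostly already-seen material: drop it
        kept.append(paragraph)
        seen |= grams  # remember n-grams of kept paragraphs only
    return kept


paragraphs = ["a b c d e f", "a b c d e f", "x y z w"]
kept = deduplicate(paragraphs)
print(kept)
```

Splitting on whitespace mirrors the "space-separated tokens" view of the raw text; a real pipeline would run this over the cleaned crawl output rather than toy strings.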
5
Size
  5.8 billion space-separated tokens
  Fully processed: 200M words
    Tokenised, lemmatised and POS-tagged with MADA (Columbia University)
    Sketch grammar: new work (Belinkov)
Varieties/dialects
  We don’t know yet
6
Availability
  In Sketch Engine demo
7
Encoding
  ‘Vertical’ format
    Sketch Engine input format
  One word per line, tab-separated columns
    Twenty-nine columns
  Structural markup: XML
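As a rough illustration of this layout, the sketch below reads a small vertical-format sample into sentences. The sample is invented: the `<doc>` and `<s>` structure names, the three columns shown and all values are assumptions for the example, not taken from the corpus.

```python
import re

# Hypothetical sample: structural tags on their own lines, one token per line,
# columns separated by tabs. Values are placeholders, not real corpus output.
SAMPLE = (
    '<doc id="example">\n'
    "<s>\n"
    "w1\tlemma1\ttagA\n"
    "w2\tlemma2\ttagB\n"
    "</s>\n"
    "<s>\n"
    "w3\tlemma3\ttagC\n"
    "</s>\n"
    "</doc>\n"
)

# A line consisting of exactly one XML-style tag is treated as structural markup.
STRUCT = re.compile(r"^<(/?)([A-Za-z][\w.-]*)[^>]*>$")


def read_vertical(text):
    """Return a list of sentences; each sentence is a list of column lists."""
    sentences = []
    current = []
    for line in text.splitlines():
        if not line.strip():
            continue
        match = STRUCT.match(line)
        if match:
            closing, name = match.group(1), match.group(2)
            # Assumption: <s> ... </s> delimits a sentence.
            if name == "s" and closing and current:
                sentences.append(current)
                current = []
            continue
        current.append(line.split("\t"))
    if current:
        sentences.append(current)
    return sentences


sentences = read_vertical(SAMPLE)
print(len(sentences), sentences[0][1])
```

Keeping structural markup on separate lines means a reader can tell tags from tokens line by line, without a full XML parser.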
8
For each word
  word (as written, in Arabic)
  trans, diac, lemma, lemma_ar, non_voc_lemma, non_voc_lemma_ar, stem, tag, bw
  pref3, pref3tag, pref2, pref2tag, pref1, pref1tag, pref0, pref0tag
  person, aspect, vox, modus, gender, number, state, case, enclitic
  gloss, source
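One way to work with these attributes is to map each tab-separated token line onto named fields. The sketch below assumes that the columns appear in the file in the same order as on the slide, which the slide does not state; `parse_token_line` is a hypothetical helper and the sample line holds placeholder values only.

```python
# Attribute names as listed on the slide; the file order is an assumption.
COLUMNS = [
    "word", "trans", "diac", "lemma", "lemma_ar", "non_voc_lemma",
    "non_voc_lemma_ar", "stem", "tag", "bw",
    "pref3", "pref3tag", "pref2", "pref2tag",
    "pref1", "pref1tag", "pref0", "pref0tag",
    "person", "aspect", "vox", "modus", "gender", "number",
    "state", "case", "enclitic", "gloss", "source",
]


def parse_token_line(line):
    """Split one vertical-format token line into a dict keyed by attribute."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(
            f"expected {len(COLUMNS)} columns, got {len(values)}"
        )
    return dict(zip(COLUMNS, values))


# Placeholder line: the value in column i is simply "v<i>".
placeholder = "\t".join(f"v{i}" for i in range(len(COLUMNS)))
token = parse_token_line(placeholder)
print(token["word"], token["lemma"], token["source"])
```

Checking the column count up front catches lines that were truncated or that contain a stray tab before they silently shift every later attribute.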