Published by Σίβύλλα Ζωγράφος. Modified over 6 years ago.
1
Corpora of social media in minority Uralic languages
Timofey Arkhangelskiy Universität Hamburg / Alexander von Humboldt Foundation
2
Middle-sized Uralic languages
Udmurt, Komi-Zyrian, Komi-Permyak (Permic); Erzya, Moksha (Mordvinic); Hill and Meadow Mari
All still spoken by relatively many people, but endangered
Some digital presence, but not much; the largest part comes from 1-2 online newspapers for each language
What about social media?
3
Uralic social media
My project: download all (well, almost all) open social media texts in these languages and make corpora out of them
Udmurt, Erzya and Moksha are online, the rest will appear in 2019
All corpora have user metadata (anonymized/aggregated), morphological annotation and annotation of some lexical categories (Russian borrowings, animacy, etc.)
4
Uralic social media
Long pipeline with lots of manual labor
Semi-manual search for URLs
Download through API, limited crawling
Automatic language detection
Anonymization, filtering / deduplication
Morphological annotation
Everything is published online through the Tsakorpus corpus platform
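The slides do not say how the middle steps of this pipeline are implemented. The following is only a minimal sketch of what language detection, deduplication and anonymization could look like for Udmurt posts. Everything in it is an assumption made for illustration: the function names, the `user_id` and `text` fields, the salt, and above all the detection heuristic, which simply counts the letters that Udmurt orthography has and Russian lacks. A real detector would need to be far more robust.

```python
import hashlib
import re
import unicodedata

# Letters of the Udmurt alphabet that Russian does not use:
# U+04DD, U+04DF, U+04E5, U+04E7, U+04F5 (precomposed forms).
# Counting them is a crude, hypothetical signal, not the project's method.
UDMURT_LETTERS = set("\u04dd\u04df\u04e5\u04e7\u04f5")


def looks_udmurt(text, min_hits=2):
    """Return True if the text contains at least min_hits Udmurt-specific letters."""
    folded = unicodedata.normalize("NFC", text).lower()
    hits = sum(1 for ch in folded if ch in UDMURT_LETTERS)
    return hits >= min_hits


def normalize(text):
    """Apply NFC, lowercase and collapse whitespace so trivial variants compare equal."""
    folded = unicodedata.normalize("NFC", text).lower()
    return re.sub(r"\s+", " ", folded).strip()


def pseudonymize(user_id, salt):
    """Replace a user identifier with a salted hash prefix (hypothetical scheme)."""
    digest = hashlib.sha256((salt + str(user_id)).encode("utf-8")).hexdigest()
    return "user_" + digest[:12]


def process(posts, salt="change-me"):
    """Keep posts that look Udmurt, drop exact duplicates, hide author identifiers."""
    seen = set()
    result = []
    for post in posts:
        text = post["text"]
        if not looks_udmurt(text):
            continue  # language filter
        key = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if key in seen:
            continue  # duplicate after normalization
        seen.add(key)
        result.append({"author": pseudonymize(post["user_id"], salt), "text": text})
    return result
```

Hashing the normalized text catches only exact reposts; near-duplicates (quotes with added comments, for example) would need a fuzzier comparison, and a salted hash is pseudonymization rather than full anonymization.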
5
Some observations
People write in these languages, but not much (less than 3M words per language over an 11-year span; less than 100 active users; less than sporadic users)
Different trends in different languages
Situation in Udmurt and Komi-Zyrian seems healthy
In Erzya, not quite so
Almost no texts in Moksha (14 thousand words)
6
Thank you for your attention!