Spoken Meadow Mari Corpus: Data, Design, and Aims

Anna Volkova, Mikhail Voronov

2 Uralic Languages

3 Meadow Mari

4 Meadow Mari
Uralic family, Finno-Ugric branch
ISO code [mhr]
Spoken by approximately 388,000 (Ethnologue 2013); 525,500 speakers in 1993 (United Bible Societies); 451,000 speakers in the 2002 census
Ethnic Mari population (including Hill Mari): 548,000 (2010 census); 604,300 in the 2002 census
Dictionaries
Grammars
Corpus of written texts: work in progress (Tromsø, The Arctic University of Norway)
Morphological analyzer (Jeremy Bradley)

5 Aims
An algorithm for creating spoken corpora of minority languages
A spoken Meadow Mari corpus which may be used as:
  A research tool for linguists:
    Typological observations
    Contact phenomena
    Dialectal variation
  A study resource for people learning Meadow Mari:
    Intonation
    Language material for creating exercises

6 Applications for Linguists
Checking grammatical hypotheses (e.g. discontinuous past)
Assessing the distribution of grammatical phenomena in discourse
Research on:
  Language interference
  Borrowing
  Intonation
  Etc.

7 Our Data
Gathered by two MSU expeditions (2000 and 2001) to the village of Staryj Torjal
79 audio files, 322 MB (WAV)
365 text files
Around 59 unique
Format: MS-Word and RTF files with Meadow Mari sentences, glosses (not always) and translation
Original sentences are written without the use of Unicode
Misha wrote a Python script that fixes the original sentences and adds tags to the lines, facilitating import into FieldWorks

8 Original Text
… …
1. jEMgEr joMgal-eS, lekcij tUMal-eS
звонок звенеть+PRS-3SG лекция начинаться+PRS-3SG
mal-en kod-En-am, Ende mo-m ESt-em ?
спать-CONV оставаться-PRT-1SG теперь что-ACC делать+PRS-1S
Звенит звонок, начинается лекция. Я проспал, что теперь буду делать?
2. oj Cot Orkan-em lekcij-ES koSt-aS
ой очень лениться+PRS-1SG лекция-LAT ходить-INF
no vet Skat pal-em — kUleS tunem-aS
но ведь сам знать+PRS-1SG нужно учиться-INF
Я очень ленюсь ходить на лекции, но ведь сам знаю – надо учиться.

9 Fixed
…
\Tx 1. jəŋgər joŋgaleš, lekcij tüŋaleš
\gk звонок звенеть.PRS-3SG лекция начинаться.PRS-3SG
\Tx malen kodənam, ənde mom əštem ?
\gk спать-CONV оставаться-PRT-1SG теперь что-ACC делать.PRS-1SG
\Tr Звенит звонок, начинается лекция. Я проспал, что теперь буду делать?
\Tx 2. oj č'ot örkanem lekcijəš koštaš
\gk ой очень лениться.PRS-1SG лекция-LAT ходить-INF
\Tx no vet škat palem — küleš tunemaš
\gk но ведь сам знать.PRS-1SG нужно учиться-INF
\Tr Я очень ленюсь ходить на лекции, но ведь сам знаю – надо учиться.
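The conversion from the original lines to the fixed lines can be illustrated with a minimal Python sketch. This is not the authors' script. The character mapping (E, M, S, U, O, C) is inferred only from the two example sentences on these slides and is almost certainly incomplete, and the function names are hypothetical. The sketch assumes the caller already knows whether a line is a text line, a gloss line, or a translation line.

```python
# Assumption: this mapping is inferred from the example sentences only;
# the real transliteration scheme may contain more characters.
ASCII_TO_UNICODE = {
    "E": "ə",
    "M": "ŋ",
    "S": "š",
    "U": "ü",
    "O": "ö",
    "C": "č'",
}


def fix_text_line(line: str) -> str:
    """Drop morpheme hyphens, convert ASCII stand-ins, add the \\Tx tag."""
    without_hyphens = line.replace("-", "")
    converted = "".join(ASCII_TO_UNICODE.get(ch, ch) for ch in without_hyphens)
    return "\\Tx " + converted


def fix_gloss_line(line: str) -> str:
    """Replace '+' with '.' in glosses and add the \\gk tag."""
    return "\\gk " + line.replace("+", ".")


def fix_translation_line(line: str) -> str:
    """Add the \\Tr tag to a free translation line."""
    return "\\Tr " + line


if __name__ == "__main__":
    print(fix_text_line("1. jEMgEr joMgal-eS, lekcij tUMal-eS"))
    print(fix_gloss_line("звонок звенеть+PRS-3SG лекция начинаться+PRS-3SG"))
    print(fix_translation_line("Звенит звонок, начинается лекция."))
```

A per-character lookup is enough here because every stand-in in the examples is a single uppercase letter. A real script would also need to decide the line type automatically and handle uppercase letters that are not stand-ins.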

10 Work in Progress
Sort out the files and find the text-audio correspondences
Transcribe the audio files that haven't been transcribed
Gloss the texts that have been transcribed (FieldWorks)
Import glossed texts to ELAN
Align the audio and text
Annotate:
  Contact phenomena
  Information structure
  Referential properties of DPs
Add other dialects
  Dialectal variation
Many phenomena in Meadow Mari are discourse-oriented

11 Results for Now and Plans
Done:
  25 texts glossed in FieldWorks (3533 words)
  One text imported into ELAN as a try-out
To Do:
  Sort out the remaining texts
  Gloss them
  Align text and audio in ELAN


13 Questions and Doubts
From ELAN to an online corpus with full search functionality?
FieldWorks usage?

