Uralic multimedia corpora: ISO/TEI corpus data in the project INEL
Timofey Arkhangelskiy, Universität Hamburg / Alexander von Humboldt Foundation, timarkh@gmail.com
Anne Ferger, Universität Hamburg, anne.ferger@uni-hamburg.de
Hanna Hedeland, hanna.hedeland@uni-hamburg.de
INEL
Long-term documentation project at Hamburg; corpora of Selkup, Kamas and Dolgan are currently being prepared
Spoken corpora (+ archival transcriptions)
All annotated data are stored and edited in EXMARaLDA (time-aligned XML format + GUI)
Our goals are (a) long-term preservation of the data and (b) easy access to the corpora through an online user interface
EXMARaLDA > ISO/TEI > Tsakorpus
We transform EXMARaLDA data into XML based on the ISO/TEI standard (good for long-term preservation)
We use the Tsakorpus corpus platform for online access
ISO/TEI files are converted to Tsakorpus JSON
The pipeline is applicable to other spoken corpora hosted at the Hamburg Center for Language Corpora
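The last conversion step (ISO/TEI to JSON) could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's actual converter: the sample document is a heavily simplified, hand-written fragment in the spirit of ISO/TEI (a timeline of `when` points and `annotationBlock` elements containing `u` and `w`), and the output keys (`speaker`, `start`, `end`, `text`, `words`, `wf`) are hypothetical placeholders rather than the real Tsakorpus JSON schema. The function name `tei_to_sentences` is likewise made up for this example.

```python
import json
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Assumption: a simplified, hand-written sample, not real INEL data.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <timeline unit="s">
      <when xml:id="T0" interval="0.0"/>
      <when xml:id="T1" interval="1.5"/>
    </timeline>
    <body>
      <annotationBlock who="#SPK0" start="#T0" end="#T1">
        <u><seg><w>first</w> <w>utterance</w></seg></u>
      </annotationBlock>
    </body>
  </text>
</TEI>"""


def tei_to_sentences(tei_xml: str) -> dict:
    """Turn time-aligned utterances into a flat, JSON-serializable dict.

    The output layout is a hypothetical simplification, not the
    Tsakorpus JSON format itself.
    """
    root = ET.fromstring(tei_xml)

    # Map each timeline point ID to its offset in seconds.
    offsets = {
        when.get(XML_ID): float(when.get("interval", "0"))
        for when in root.iterfind(".//tei:when", TEI_NS)
    }

    sentences = []
    for block in root.iterfind(".//tei:annotationBlock", TEI_NS):
        words = [w.text or "" for w in block.iterfind(".//tei:w", TEI_NS)]
        sentences.append({
            # References look like "#SPK0"; drop the leading "#".
            "speaker": block.get("who", "").lstrip("#"),
            "start": offsets.get(block.get("start", "").lstrip("#")),
            "end": offsets.get(block.get("end", "").lstrip("#")),
            "text": " ".join(words),
            "words": [{"wf": word} for word in words],
        })
    return {"sentences": sentences}


if __name__ == "__main__":
    print(json.dumps(tei_to_sentences(SAMPLE_TEI), indent=2, ensure_ascii=False))
```

Resolving the timeline references up front keeps each utterance self-contained in the output, which is convenient when the JSON is later indexed sentence by sentence.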
Disclaimers
(from all of us) INEL-internal data handling (tools, glossing strategies, choice of EXMARaLDA etc.) is outside the scope of our presentation
(from me personally) I am only responsible for the ISO/TEI > Tsakorpus conversion and do not participate in INEL
Thank you for your attention!