Transcripts are stored in a relational database Transcripts are divided up to their smallest constituent (words), while the context is preserved, in a.


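1 DynaSAND: technology
Transcripts are stored in a relational database.
Transcripts are divided up into their smallest constituents (words), while the context is preserved, in a structure basically like this:
[Diagram] Tables shown: interview, speaker, sentence, word.
[Diagram] Key fields shown: interview_id, speaker_id, sentence_id, word_id; the fields interview_id, speaker_id and sentence_id appear a second time, apparently as references to the level above.
[Diagram] Other fields shown: locality, start time, end time.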
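
The structure on this slide can be sketched as a small SQLite schema. This is only an illustration under stated assumptions: the slide does not say which table holds locality, start time or end time, so their placement below is a guess, and the column name `word` and the helper names `make_db` and `word_context` are hypothetical.

```python
import sqlite3

# Assumed layout: each level refers to the level above it.
# Placement of locality / start_time / end_time is a guess, not from the slide.
SCHEMA = """
CREATE TABLE interview (
    interview_id INTEGER PRIMARY KEY,
    locality     TEXT
);
CREATE TABLE speaker (
    speaker_id   INTEGER PRIMARY KEY,
    interview_id INTEGER NOT NULL REFERENCES interview(interview_id)
);
CREATE TABLE sentence (
    sentence_id  INTEGER PRIMARY KEY,
    speaker_id   INTEGER NOT NULL REFERENCES speaker(speaker_id),
    start_time   REAL,
    end_time     REAL
);
CREATE TABLE word (
    word_id      INTEGER PRIMARY KEY,
    sentence_id  INTEGER NOT NULL REFERENCES sentence(sentence_id),
    word         TEXT NOT NULL
);
"""


def make_db():
    """Create an empty in-memory database with the sketched schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def word_context(conn, word_id):
    """Return (word, sentence_id, speaker_id, interview_id, locality) for one word,
    or None if the word does not exist."""
    return conn.execute(
        """
        SELECT w.word, s.sentence_id, sp.speaker_id, i.interview_id, i.locality
        FROM word w
        JOIN sentence s   ON s.sentence_id = w.sentence_id
        JOIN speaker sp   ON sp.speaker_id = s.speaker_id
        JOIN interview i  ON i.interview_id = sp.interview_id
        WHERE w.word_id = ?
        """,
        (word_id,),
    ).fetchone()
```

The point of the chain of joins is that a single word_id is enough to recover the full context (sentence, speaker, interview) of that word.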

2 This means that individual words can be addressed, e.g. for POS tagging.
The POS tags are themselves stored as separate categories, attributes and values, not as opaque strings:
[Diagram] Tables shown: word, category, attributes.
[Diagram] Fields shown: word_id, category_id, attribute_id, value_id.
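
Storing tags as categories, attributes and values means a query can filter on one feature at a time instead of matching tag strings. A minimal sketch, assuming each of the three tables is linked through word_id; the exact column layout and the helper names `make_pos_db` and `words_with` are hypothetical.

```python
import sqlite3

# Assumed layout: one category row per word, any number of
# (attribute, value) rows per word. Not confirmed by the slide.
POS_SCHEMA = """
CREATE TABLE word (
    word_id INTEGER PRIMARY KEY,
    word    TEXT NOT NULL
);
CREATE TABLE category (
    word_id     INTEGER NOT NULL REFERENCES word(word_id),
    category_id INTEGER NOT NULL
);
CREATE TABLE attributes (
    word_id      INTEGER NOT NULL REFERENCES word(word_id),
    attribute_id INTEGER NOT NULL,
    value_id     INTEGER NOT NULL
);
"""


def make_pos_db():
    """Create an empty in-memory database with the sketched POS tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(POS_SCHEMA)
    return conn


def words_with(conn, category_id, attribute_id, value_id):
    """Return all words of a given category that carry a given attribute value."""
    rows = conn.execute(
        """
        SELECT w.word
        FROM word w
        JOIN category c   ON c.word_id = w.word_id
        JOIN attributes a ON a.word_id = w.word_id
        WHERE c.category_id = ?
          AND a.attribute_id = ?
          AND a.value_id = ?
        ORDER BY w.word_id
        """,
        (category_id, attribute_id, value_id),
    ).fetchall()
    return [row[0] for row in rows]
```

With an opaque string tag, the same question would need pattern matching on the tag text; here it is an ordinary join on integer ids.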

3 Generating other formats
The fact that the data is stored in its smallest constituent parts makes it relatively easy to generate other formats.
Example: we realize that a binary format like a relational database is not appropriate for long-term archival, so we made the SAND transcriptions available as TEI XML by creating a template and filling it with data from the database using a script.
Another example: the IMDI metadata for another corpus (the Goeman-Taeldeman-Van Reenen Project, or GTRP corpus) were created in the same way.
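
The template-filling approach can be sketched as follows. This is not the actual SAND export script: the template, the `#spk` identifier scheme and the function name `utterance_to_tei` are assumptions made for illustration; only the TEI element names `u` (utterance) and `w` (word) are taken from TEI itself.

```python
from string import Template
from xml.sax.saxutils import escape, quoteattr

# Hypothetical template for one utterance; a real export would wrap this
# in a full TEI document with a header.
UTTERANCE = Template("<u who=$who>$words</u>")


def utterance_to_tei(speaker_id, words):
    """Fill the template with one speaker's words, escaping XML special characters."""
    word_elements = "".join(f"<w>{escape(word)}</w>" for word in words)
    return UTTERANCE.substitute(
        who=quoteattr(f"#spk{speaker_id}"),  # quoteattr adds the surrounding quotes
        words=word_elements,
    )
```

Because the words arrive as individual database rows, wrapping each one in its own element is a simple loop; no parsing of running text is needed.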

4 Generating metadata for CLARIN
Previous experience with SAND and GTRP indicates that generating XML metadata for CLARIN from our databases should be doable.
The TEI and IMDI for SAND and GTRP were created once and are static; we plan to make the process more dynamic for CLARIN metadata by creating the XML on the fly (and implementing a caching mechanism for performance reasons) so that the metadata is always up to date.
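
One way to combine on-the-fly generation with caching is to key each cached document on a last-modified marker, so that a change in the database invalidates the cached XML. The slide does not describe the planned mechanism; the class below is a sketch of that one possible design, and all names in it are hypothetical.

```python
class MetadataCache:
    """Cache rendered XML per resource, regenerating it when the data changes."""

    def __init__(self):
        # resource_id -> (last_modified marker, rendered XML)
        self._entries = {}

    def get(self, resource_id, last_modified, render):
        """Return cached XML if still current, otherwise call render(resource_id)."""
        cached = self._entries.get(resource_id)
        if cached is not None and cached[0] == last_modified:
            return cached[1]
        xml = render(resource_id)
        self._entries[resource_id] = (last_modified, xml)
        return xml
```

The caller supplies `last_modified` (for example a timestamp or revision counter read from the database) and a `render` function that builds the XML; the expensive rendering then runs only when the marker has changed.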

5 Edisyn (European Dialect Syntax)
One of the goals of Edisyn is the development of a search engine which uses one tag set to search different corpora, including the SAND, concurrently.
The central tag set is being developed by Franca Wesseling; we plan to make it compatible with ISOcat.
The search engine translates these tags to the native tag sets of the corpora.
Ideal case: corpora are hosted by their own organizations and accessible via a web service.
In practice: the Meertens has local copies of the corpora.
Participating corpora: SAND, CORDIAL-SIN (Portuguese), ASIS (Italian), EMK (Estonian); more to come.
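
The translation step can be pictured as a lookup from one central tag to the native tag of each corpus. The corpus names below come from the slide, but every tag string is invented for illustration and does not reflect the real Edisyn tag set or any corpus's native tags.

```python
# Hypothetical mapping: corpus name -> {central tag -> native tag}.
# All tag values are placeholders, not real tags.
NATIVE_TAGS = {
    "SAND": {"central:complementizer": "sand:COMP"},
    "CORDIAL-SIN": {"central:complementizer": "cordial:C"},
    "ASIS": {"central:complementizer": "asis:compl"},
    "EMK": {},  # no mapping yet for this tag in this sketch
}


def translate_query(central_tag, native_tags=NATIVE_TAGS):
    """Return {corpus: native tag} for every corpus that can express the central tag."""
    return {
        corpus: mapping[central_tag]
        for corpus, mapping in native_tags.items()
        if central_tag in mapping
    }
```

A search engine built this way would send each corpus its own native tag and simply skip corpora that have no equivalent for the requested central tag.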

6 Other Meertens language resources
PLAND (Plant Names in Dutch Dialects)
NVD (Dutch Database of First Names)
NFD (Dutch Database of Family Names)
Corpus of free dialect speech (sound recordings)
Dutch Database of Toponyms (in development)
Dutch Song Database
Dutch Folktale Database

7 Other Meertens language resources
Apart from part of the sound recordings, all these are web-based and built on the same database technology.
We plan to make CLARIN metadata available for these resources in a stepwise manner: first metadata on the corpus level, later also metadata on the record level.
The technologies involved (OAI-PMH) are new to us, so we want to do this in close cooperation with a “harvesting” institution to make sure that our output is correct.
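
For orientation, a harvester talks to an OAI-PMH repository through plain HTTP requests with a `verb` parameter. The sketch below only builds such request URLs; the base URL is a placeholder, not a Meertens endpoint, and the helper name `oai_request` is hypothetical. `Identify` and `ListRecords` are standard OAI-PMH verbs, and `oai_dc` is the metadata prefix every repository is required to support.

```python
from urllib.parse import urlencode


def oai_request(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    params = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint, for illustration only.
BASE_URL = "https://example.org/oai"

identify_url = oai_request(BASE_URL, "Identify")
records_url = oai_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
```

A harvesting institution would issue requests like these and validate the XML responses, which is where close cooperation helps catch mistakes early.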

8 Further in the future
The Meertens Institute wants to be part of CLARIN, and in the future we also hope to contribute to the development of tools to work with language resources.

9 Thank you for your attention!

