Depositors’ usage of IMDI metadata
Daan Broeder & Alex Klassmann
Max Planck Institute for Psycholinguistics
DELAMAN meeting, London 2006
IMDI metadata
Forms with ~150 possible descriptors
–Describe bundles of related resources
–Extensive set compared with DC/OLAC
–But only the “name” descriptor is compulsory
Archive holds
–~40000 IMDI sessions or resource bundles (some non-local but available in our DB)
–Describing ~ resources
IMDI Metadata
The descriptors are hierarchically ordered entries, which concern
–the event (recording location, date, etc.),
–the project,
–the languages involved,
–the participants,
–the type and nature of speech,
–technical information about the resources,
–access rights.
Values of descriptors can be closed or open vocabularies or free text.
Users can add prose descriptions at each of these levels, plus project/user-defined keys.
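The difference between the three value types can be illustrated with a small sketch. It assumes that a closed vocabulary rejects values outside its list, an open vocabulary suggests values but accepts new ones, and free text accepts anything. The function name `check_value`, the result labels, and the sample vocabularies are hypothetical and not taken from the IMDI tools.

```python
def check_value(value, kind, vocabulary=frozenset()):
    """Classify a value as 'ok', 'extension' or 'invalid' for a descriptor kind."""
    if kind not in ("closed", "open", "free"):
        raise ValueError(f"unknown descriptor kind: {kind!r}")
    if kind == "free":
        # Free text: any value is acceptable.
        return "ok"
    if value in vocabulary:
        return "ok"
    # Open vocabularies accept new values; closed vocabularies do not.
    return "extension" if kind == "open" else "invalid"


# Hypothetical vocabularies, for illustration only.
SEX_VALUES = frozenset({"Male", "Female", "Unknown"})
GENRE_VALUES = frozenset({"Narrative", "Conversation"})

results = [
    check_value("Male", "closed", SEX_VALUES),
    check_value("M", "closed", SEX_VALUES),
    check_value("Song", "open", GENRE_VALUES),
    check_value("Recorded at home", "free"),
]
```

Under these assumptions, only the second value would be flagged, because it falls outside a closed vocabulary.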
Metadata Use
Documentation of the resources
Retrieval and reuse: the archive offers tools for
–Browsing the archive’s corpora
–Structured metadata search: high precision, low recall
–Unstructured Google-like metadata search: high recall, low precision
Large set -> not all elements are always relevant
–Sparsely populated metadata space
–Search tool shows frequency counts for metadata values, which avoids fruitless searches.
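The trade-off between the two search styles, and the use of frequency counts, can be sketched in a few lines. This is a minimal illustration over hypothetical in-memory records (plain dictionaries), not the archive's actual search tools; the function names and sample sessions are assumptions.

```python
from collections import Counter


def value_counts(sessions, descriptor):
    """Count how often each non-empty value of one descriptor occurs."""
    return Counter(s[descriptor] for s in sessions if s.get(descriptor))


def structured_search(sessions, criteria):
    """Exact match on named descriptors: precise, but misses sparse records."""
    return [s for s in sessions
            if all(s.get(key) == value for key, value in criteria.items())]


def free_text_search(sessions, term):
    """Substring match over all values: finds more, including loose hits."""
    term = term.lower()
    return [s for s in sessions
            if any(term in str(value).lower() for value in s.values())]


# Hypothetical sample sessions, not taken from the archive.
sessions = [
    {"Name": "s1", "Genre": "narrative", "Description": "A story told at home"},
    {"Name": "s2", "Genre": "conversation", "Description": "Talk about a narrative"},
    {"Name": "s3", "Description": "Narrative, genre not entered"},
]

counts = value_counts(sessions, "Genre")
exact = structured_search(sessions, {"Genre": "narrative"})
loose = free_text_search(sessions, "narrative")
```

Here the structured query returns only the one session with Genre filled in as "narrative", while the free-text query also returns the sessions that merely mention the word. Showing `counts` before the user searches indicates which values would return anything at all.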
Depositor Guidance
In general, depositors are urged to be as complete as possible for documentation purposes
Some projects have an obligatory set of descriptors to fill in (CGN, DBD, …)
Provide training to get familiar with the set and tools
Provide documentation
Support by student assistants and corpus managers
Observations I
Researchers often do not fill in all the relevant data at their disposal.
Some tendency to avoid this time-consuming work, which is oriented to reuse by others.
The sheer size of the set may discourage people from starting to fill in data at all.
Training helps.
Best results in projects that decided beforehand which descriptors needed to be filled in.
Of course there are also very committed individuals!
Corpus managers/student assistants may clean things up
–but of limited use, since only the researcher has the specific knowledge
–can serve as intermediaries.
Observations II
Only the part of the archive where metadata was specified manually is considered (e.g. CGN was excluded, as were sessions outside the MPI)
Statistics are based on the ~25000 remaining sessions
The data give an impression of how often fields are actually filled in (i.e. not empty and not the default “unknown” or “unspecified”).
Cannot exclude “repairs” where obvious omissions were filled in by corpus management
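The "filled in" criterion can be made concrete with a small sketch. It assumes each session is available as a flat dictionary of descriptor names and string values, and that placeholder values are compared case-insensitively; the function names and the sample data are hypothetical, and this is not the script used for the statistics.

```python
from collections import Counter

# Values treated as "not filled in": empty strings and the default placeholders.
PLACEHOLDERS = {"", "unknown", "unspecified"}


def is_filled(value):
    """Return True if a descriptor value carries real information."""
    if value is None:
        return False
    return value.strip().lower() not in PLACEHOLDERS


def fill_rates(sessions):
    """Percentage of sessions in which each descriptor is filled in."""
    if not sessions:
        return {}
    filled = Counter()
    for session in sessions:
        for name, value in session.items():
            if is_filled(value):
                filled[name] += 1
    all_names = {name for session in sessions for name in session}
    return {name: 100.0 * filled[name] / len(sessions)
            for name in sorted(all_names)}


# Hypothetical sample data, not taken from the archive.
sessions = [
    {"Name": "s1", "Genre": "Unspecified", "Country": "Netherlands"},
    {"Name": "s2", "Genre": "narrative", "Country": ""},
    {"Name": "s3", "Genre": "Unknown"},
    {"Name": "s4", "Genre": "conversation", "Country": "Unknown"},
]

rates = fill_rates(sessions)
```

A descriptor that is absent from a session counts as not filled in, so the rate is always taken over all sessions in the sample.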
Descriptor name   total   fl-12000   acqui
Country
Address
Region 71011
Description
Key
Project.Name
Content.Description
Genre
SubGenre
Task
Modalities
Subject 362
Interactivity
PlanningType
Involvement
SocialContext 6109
EventStructure 799
Channel 81011
Content.Language.Description
Content.Language.Id
Content.Language.Name 919094
Actor.Language.Description
Actor.Language.Id
Actor.Language.Name
Actor.Role
Actor.Name
Actor.FullName
Actor.Code
Actor.FamilySocialRole
Actor.EthnicGroup
Actor.BirthDate 588
Actor.Age
Actor.Sex
Actor.Education
Actor.Description
Actor.Key
MediaFile.Type
MediaFile.Format
MediaFile.Quality 18831
WrittenResource.Type
WrittenResource.SubType
WrittenResource.Format
WrittenResource.ContentEncoding 370
WrittenResource.CharacterEncoding 3120
WrittenResource.LanguageId 411
Conclusions
As can be seen, the sets are far from complete.
But every field of the schema has been used in some sessions, so no field seems to be obsolete.
People find use for the description fields that are available at different levels (~50%)
The user/project-defined keys are also used (~50%) -> the IMDI set is not big enough
Some keys are not much used
–Remove?
–But where to put this information then, if it is available?